ActiveState Code

Recipe 475187: regular expression point of failure


Debug non-matching regular expressions by finding the longest leading part of the pattern that still matches, and how much of the text it matches.

Python
import re

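def sub_re(pattern):
    # Yield (offset, compiled regex) for each pattern prefix, longest first.
    # Start at len(pattern): pattern[:len(pattern)+1] is just the whole
    # pattern again and would report an offset one past its end.
    for offset in range(len(pattern), 0, -1):
        try:
            re_obj = re.compile(pattern[:offset])
        except re.error:  # this prefix is not a valid regex on its own
            continue
        yield offset, re_obj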

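def partial_pattern_match(pattern, text):
    # Return (pattern offset, text offset) for the longest pattern prefix
    # that matches at the start of text, or (0, 0) if no prefix matches.
    good_pattern_offset = 0
    good_text_offset = 0
    for re_offset, re_obj in sub_re(pattern):
        match = re_obj.match(text)
        if match:
            # Prefixes arrive longest first, so the first hit is the answer.
            good_pattern_offset = re_offset
            good_text_offset = match.end()
            return good_pattern_offset, good_text_offset
    return good_pattern_offset, good_text_offset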

if __name__ == "__main__":
    # First example: the text has no "d", so the match stops after "[bc]+".
    pattern = r"a+[bc]+d+e+"
    text = "aaaaabbbbe"
    pattern_offset, text_offset = partial_pattern_match(pattern, text)
    print("pattern, pattern_offset", pattern, repr(pattern_offset))
    print("good pattern", pattern[:pattern_offset])
    print("text:")
    print(text)
    print(' ' * text_offset + '^')

    # Second example: "z*" may match nothing, so the whole pattern matches.
    pattern = r"a+[bc]+z*e+f"
    text = "aaaaabbbbef"
    pattern_offset, text_offset = partial_pattern_match(pattern, text)
    print("pattern, pattern_offset", pattern, repr(pattern_offset))
    print("good pattern", pattern[:pattern_offset])
    print("text:")
    print(text)
    print(' ' * text_offset + '^')

Discussion

Regular expressions are useful, although their use should be postponed until deemed necessary. If you do need them (or use them anyway :) and they work, everything is fine. If they do not match, though, it is often very hard to understand what went wrong.

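This recipe returns two offsets: how far into the pattern the longest prefix that still compiles and matches at the start of the text extends, and how far into the text that partial match reaches. The mismatch lies just past those two points.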
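The idea can be condensed into a short Python 3 sketch. It is an illustration, not the recipe's own code: the names longest_matching_prefix and describe_failure are made up for this example, and it assumes, as the recipe does, that matching is anchored at the start of the text with re.match.

```python
import re

def longest_matching_prefix(pattern, text):
    """Return (pattern_offset, text_offset) for the longest prefix of
    pattern that compiles and matches at the start of text, or (0, 0)
    if no prefix matches. Hypothetical helper for illustration."""
    for offset in range(len(pattern), 0, -1):
        try:
            match = re.compile(pattern[:offset]).match(text)
        except re.error:
            # Prefixes such as "a+[" are not valid regexes; skip them.
            continue
        if match:
            return offset, match.end()
    return 0, 0

def describe_failure(pattern, text):
    """Render the pattern and the text, each with a caret marking
    where the partial match stops."""
    p_off, t_off = longest_matching_prefix(pattern, text)
    return "\n".join([pattern, " " * p_off + "^", text, " " * t_off + "^"])

if __name__ == "__main__":
    print(describe_failure(r"a+[bc]+d+e+", "aaaaabbbbe"))
```

For the pattern a+[bc]+d+e+ and the text aaaaabbbbe, this places the caret under the d of the pattern (offset 7) and under the final e of the text (offset 9), which is where the two stop agreeing. Optional parts such as z* can match nothing, so the reported pattern offset may lie after them.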

Comments

  1. At 6:16 p.m. on 27 Mar 2006, Mike Krell said:

    Code is broken. This algorithm doesn't work for various quantifiers such as the Kleene closure (*).

    For example, it returns the wrong point of failure for

    pattern = r"a+[bc]+z*e+f"

    text = "aaaaabbbbe"

  2. At 8:32 a.m. on 28 Mar 2006, Mario Pernici said:

    Bug fix. To avoid this bug, one can pop characters off the end of the pattern:

    import re
    
    def sub_re(pattern):
        for offset in range(len(pattern)+1,0,-1):
            try:
                re_obj = re.compile(pattern[:offset])
            except re.error: # syntax error in re part
                continue
            yield offset, re_obj
    
    def partial_pattern_match(pattern, text):
        good_pattern_offset = 0
        good_text_offset = 0
        for re_offset, re_obj in sub_re(pattern):
            match = re_obj.match(text)
            if match:
                good_pattern_offset = re_offset
                good_text_offset = match.end()
                return good_pattern_offset, good_text_offset
        return good_pattern_offset, good_text_offset
    
    if __name__ == "__main__":
        pattern = r"a+[bc]+d+e+"
        text = "aaaaabbbbe"
        pattern_offset, text_offset = partial_pattern_match(pattern, text)
        print "pattern, pattern_offset", pattern, repr(pattern_offset)
        print "good pattern", pattern[:pattern_offset]
        print "text:"
        print text
        print ' ' * text_offset + '^'
    
        pattern = r"a+[bc]+z*e+f"
        text = "aaaaabbbbef"
        pattern_offset, text_offset = partial_pattern_match(pattern, text)
        print "pattern, pattern_offset", pattern, repr(pattern_offset)
        print "good pattern", pattern[:pattern_offset]
        print "text:"
        print text
        print ' ' * text_offset + '^'
    
  3. At 10:01 a.m. on 31 Mar 2006, Christos Georgiou (the author) said:

    Thank you for pointing this out. In my tests (actually, parsing Postfix log files :) I had no failures with the function as given. Thank you again, and thank you too, Mario. I used Mario's version as suggested.
