Debug non-matching regular expressions by finding the longest leading part of the pattern that still matches, and how far into the text that match reaches.

Python, 39 lines
import re

def sub_re(pattern):
    for offset in range(len(pattern), 0, -1):  # longest prefix first
        try:
            re_obj = re.compile(pattern[:offset])
        except re.error:  # this prefix is not a valid regex; try a shorter one
            continue
        yield offset, re_obj

def partial_pattern_match(pattern, text):
    good_pattern_offset = 0
    good_text_offset = 0
    for re_offset, re_obj in sub_re(pattern):
        match = re_obj.match(text)
        if match:  # longest prefix that matches at the start of the text
            good_pattern_offset = re_offset
            good_text_offset = match.end()
            return good_pattern_offset, good_text_offset
    return good_pattern_offset, good_text_offset

if __name__ == "__main__":
    pattern = r"a+[bc]+d+e+"
    text = "aaaaabbbbe"
    pattern_offset, text_offset = partial_pattern_match(pattern, text)
    print("pattern, pattern_offset", pattern, repr(pattern_offset))
    print("good pattern", pattern[:pattern_offset])
    print("text:")
    print(text)
    print(' ' * text_offset + '^')

    pattern = r"a+[bc]+z*e+f"
    text = "aaaaabbbbef"
    pattern_offset, text_offset = partial_pattern_match(pattern, text)
    print("pattern, pattern_offset", pattern, repr(pattern_offset))
    print("good pattern", pattern[:pattern_offset])
    print("text:")
    print(text)
    print(' ' * text_offset + '^')

Regular expressions are useful, although their use should be postponed until deemed necessary. If you do need them (or use them anyway :) and they work, everything is fine. If they do not match, though, it is often very hard to understand what went wrong.

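This recipe returns the offsets into the pattern and the text where the mismatch occurs: the length of the longest pattern prefix that still matches, and the position in the text where that partial match ends.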
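As a rough sketch of how the two offsets might be used, the snippet below restates the longest-prefix search in compact form and wraps it in a small report helper. The name `explain_mismatch` and the report layout are assumptions made for illustration and are not part of the recipe.

```python
import re

def partial_pattern_match(pattern, text):
    """Return (pattern_offset, text_offset) for the longest matching prefix."""
    for offset in range(len(pattern), 0, -1):
        try:
            re_obj = re.compile(pattern[:offset])
        except re.error:
            continue  # this prefix is not a valid regex on its own
        match = re_obj.match(text)
        if match:
            return offset, match.end()
    return 0, 0  # not even the shortest prefix matched

def explain_mismatch(pattern, text):
    """Hypothetical helper: render the offsets as a short report."""
    pattern_offset, text_offset = partial_pattern_match(pattern, text)
    return "\n".join([
        "matched pattern part:   " + pattern[:pattern_offset],
        "unmatched pattern part: " + pattern[pattern_offset:],
        text,
        " " * text_offset + "^",  # caret marks where matching stopped
    ])

print(explain_mismatch(r"a+[bc]+d+e+", "aaaaabbbbe"))
```

For this input the search returns (7, 9): the prefix a+[bc]+ matches the first nine characters, and the caret lands under the final "e", where the pattern expected a "d".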

3 comments

Mike Krell 15 years, 8 months ago

Code is broken. This algorithm doesn't work for various quantifiers such as the Kleene closure (*).

For example, it returns the wrong point of failure for

pattern = r"a+[bc]+z*e+f"

text = "aaaaabbbbe"
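To make the quantifier problem concrete, here is a minimal sketch. It assumes, purely for illustration, a search that grows the pattern from the front and stops at the first prefix that fails to match; `forward_match` is hypothetical and is not claimed to be the code this comment refers to. It is compared with a longest-prefix-first search, named `backward_match` here, like the one in the recipe listing.

```python
import re

def forward_match(pattern, text):
    """Hypothetical front-to-back search: stop at the first failing prefix."""
    good = (0, 0)
    for offset in range(1, len(pattern) + 1):
        try:
            re_obj = re.compile(pattern[:offset])
        except re.error:
            continue  # an incomplete prefix such as "a+[" cannot be compiled
        match = re_obj.match(text)
        if not match:
            break  # "...z" fails here, although "...z*" would have matched
        good = (offset, match.end())
    return good

def backward_match(pattern, text):
    """Longest-prefix-first search: the first hit is the longest match."""
    for offset in range(len(pattern), 0, -1):
        try:
            re_obj = re.compile(pattern[:offset])
        except re.error:
            continue
        match = re_obj.match(text)
        if match:
            return offset, match.end()
    return 0, 0

pattern = r"a+[bc]+z*e+f"
text = "aaaaabbbbe"
print(forward_match(pattern, text))   # (7, 9): stops in front of "z*"
print(backward_match(pattern, text))  # (11, 10): only the trailing "f" is unmatched
```

The front-to-back variant gives up at the prefix ending in "z", because that prefix demands a literal "z" before the following "*" has been seen. Starting from the longest prefix avoids evaluating a quantified atom without its quantifier.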

Mario Pernici 15 years, 8 months ago

Bug fix. To avoid this bug, one can pop characters off the end of the pattern, trying the longest prefix first:

import re

def sub_re(pattern):
    for offset in range(len(pattern), 0, -1):  # longest prefix first
        try:
            re_obj = re.compile(pattern[:offset])
        except re.error:  # this prefix is not a valid regex; try a shorter one
            continue
        yield offset, re_obj

def partial_pattern_match(pattern, text):
    good_pattern_offset = 0
    good_text_offset = 0
    for re_offset, re_obj in sub_re(pattern):
        match = re_obj.match(text)
        if match:  # longest prefix that matches at the start of the text
            good_pattern_offset = re_offset
            good_text_offset = match.end()
            return good_pattern_offset, good_text_offset
    return good_pattern_offset, good_text_offset

if __name__ == "__main__":
    pattern = r"a+[bc]+d+e+"
    text = "aaaaabbbbe"
    pattern_offset, text_offset = partial_pattern_match(pattern, text)
    print("pattern, pattern_offset", pattern, repr(pattern_offset))
    print("good pattern", pattern[:pattern_offset])
    print("text:")
    print(text)
    print(' ' * text_offset + '^')

    pattern = r"a+[bc]+z*e+f"
    text = "aaaaabbbbef"
    pattern_offset, text_offset = partial_pattern_match(pattern, text)
    print("pattern, pattern_offset", pattern, repr(pattern_offset))
    print("good pattern", pattern[:pattern_offset])
    print("text:")
    print(text)
    print(' ' * text_offset + '^')

Christos Georgiou (author) 15 years, 8 months ago

Thank you for pointing this out. In my tests (actually, parsing Postfix log files :) I had no failures with the function as given. Thank you again, and thank you too, Mario. I used Mario's version as suggested.

Created by Christos Georgiou on Mon, 27 Mar 2006 (PSF)