Debug non-matching regular expressions by finding the longest part of the pattern, and the longest part of the text, that do match.
import re

def sub_re(pattern):
    # Yield (offset, compiled regex) for every prefix of the pattern that
    # compiles on its own, longest prefix first.
    for offset in range(len(pattern), 0, -1):
        try:
            re_obj = re.compile(pattern[:offset])
        except re.error:  # syntax error in re part
            continue
        yield offset, re_obj

def partial_pattern_match(pattern, text):
    # Return the offsets of the first matching prefix; (0, 0) if none matches.
    good_pattern_offset = 0
    good_text_offset = 0
    for re_offset, re_obj in sub_re(pattern):
        match = re_obj.match(text)
        if match:
            good_pattern_offset = re_offset
            good_text_offset = match.end()
            return good_pattern_offset, good_text_offset
    return good_pattern_offset, good_text_offset

if __name__ == "__main__":
    pattern = r"a+[bc]+d+e+"
    text = "aaaaabbbbe"
    pattern_offset, text_offset = partial_pattern_match(pattern, text)
    print("pattern, pattern_offset %s %r" % (pattern, pattern_offset))
    print("good pattern %s" % pattern[:pattern_offset])
    print("text:")
    print(text)
    print(' ' * text_offset + '^')

    pattern = r"a+[bc]+z*e+f"
    text = "aaaaabbbbef"
    pattern_offset, text_offset = partial_pattern_match(pattern, text)
    print("pattern, pattern_offset %s %r" % (pattern, pattern_offset))
    print("good pattern %s" % pattern[:pattern_offset])
    print("text:")
    print(text)
    print(' ' * text_offset + '^')
Regular expressions are useful, although it is best to hold off on them until they are really needed. If you do need them (or use them anyway :) and they work, everything is fine. If they do not match, though, it is often very hard to understand what went wrong.
This recipe returns the offsets into the pattern and text where a mismatch occurs.
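To make the two offsets easier to read, here is a minimal Python 3 sketch of the same idea that marks both of them with a caret. The helper name show_mismatch is hypothetical and not part of the recipe. partial_pattern_match is restated in compact form so the block runs on its own, and it assumes that trying ever shorter pattern prefixes is an acceptable way to find the failure point.

```python
import re

def partial_pattern_match(pattern, text):
    """Return (pattern_offset, text_offset) for the longest pattern prefix
    that compiles and matches at the start of text, or (0, 0) if none does."""
    for offset in range(len(pattern), 0, -1):
        try:
            match = re.match(pattern[:offset], text)
        except re.error:
            continue  # this prefix is not a valid regex on its own
        if match:
            return offset, match.end()
    return 0, 0

def show_mismatch(pattern, text):
    """Hypothetical helper: put a caret under each reported offset."""
    pattern_offset, text_offset = partial_pattern_match(pattern, text)
    return "\n".join([
        pattern,
        " " * pattern_offset + "^",
        text,
        " " * text_offset + "^",
    ])

print(show_mismatch(r"a+[bc]+d+e+", "aaaaabbbbe"))
```

For this input the offsets are (7, 9): the first caret lands under the d of d+ in the pattern, and the second under the final e of the text.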
The code is broken. This algorithm doesn't work for various quantifiers, such as the Kleene closure (*).
For example, it returns the wrong point of failure for
pattern = r"a+[bc]+z*e+f"
text = "aaaaabbbbe"
Bug fix. To avoid this bug, one can pop characters off the end of the pattern.
Thank you for pointing this out. In my tests (actually, parsing Postfix log files :) I had no failures with the function as given. Thank you again, and thank you too, Mario. I used Mario's version as suggested.