A short engine for testing strings against a regex. It understands the three common quantifiers ?, + and * (non-greedy), working on characters, ., [...], [^...], \s, \S, parenthesized sub-patterns and group references \N. It accepts unicode objects and fixed-width encoded strings (but there can be problems when trailing bytes of multi-byte UTF-8 letters are compared). Up to 10 groups are captured ( (?:...) is implemented as well); they can be used for back-referencing and in xreplace(). After a search the captured groups are accessible in the global list xGroups. | is supported, but only inside groups and only with nested=True; with nested=False, '(' and ')' are treated as ordinary characters.
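A minimal usage sketch (the functions are defined in the listing below; the results in the comments are what the code is meant to produce, given here only for illustration):

print(xsearch('price: 42 EUR', '[0-9]+'))                   # -> (7, 9), the span of '42'
print(xsearch('hello world', r'(\S+) (\S+)', nested=True))  # -> (0, 11)
print(xGroups)   # -> ['hello', 'world']; note that \0 refers to the first captured group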
This is not about Python or for Python; within Python it has little use besides re. But considering that re needs about 6,000 lines, you might agree with the author that these 176 lines are powerful. That was the reason to publish it as a recipe: as a kind of (fairly complete) minimal example of a regex tester, and as an example of corresponding recursive structures in data (TokenListCache) and in code.
Working on this improved the author's understanding of regular expressions, especially of their possible "greed". "Greedy" quantifiers are a concept that has to be explained separately and comes unexpectedly: whoever scans a text for '<.*>' is looking for SGML tags, not for the whole text. Even with the star's "greediness", the code has to take care that '.*' does not eat the whole text and end up finding no match for '<.*>' at all. So the standard syntax with greedy quantifiers can hardly be simpler to implement than this one, whose mere three lines 101, 111 and 121 prevent any greed. Perhaps it is faster; otherwise it is difficult to understand why the concept of "greed" exists at all.
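For illustration, with the functions from the listing below (the comparison with re assumes Python's usual greedy semantics):

s = '<p>some text</p>'
print(xsearch(s, '<.*>'))   # -> (0, 3): the non-greedy star stops at the first '>', i.e. '<p>'
# a greedy engine such as Python's re would match the whole of s here:
# re.search('<.*>', s).group() == '<p>some text</p>'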
This engine might be useful now and then, in circumstances where nothing else is available. Its brevity eases translation to other languages, and it can work with arbitrary characters for STAR or PERHAPS (for example).
ANYCHAR = '.'; ESCAPE = '\\'
REGIONA = '['; REGIONO = ']'; RANGE = '-'; COMPLEMENT = '^'
GROUPA = '(' ; GROUPO = ')'; ALTERNATIVE = '|'
PERHAPS = '?' ; STAR = '*'; JUST_ONE_and_STAR = '+'
EXTENSION = '(?'; SKIPSTORE = '?:'
def match_region(ch, pattern):
    """ region: a list of comparison chars and ranges in [] or [^]"""
    if pattern[1]==COMPLEMENT: booly=False; booln=True; i=2
    else: booly=True; booln=False; i=1
    while i < len(pattern):
        if pattern[i]==ESCAPE:
            if pattern[i+1]==ch: return booly
            else: i+=2
        elif i<len(pattern)-2 and pattern[i+1]==RANGE:
            if pattern[i]<=ch<=pattern[i+2]: return booly
            else: i+=2
        elif pattern[i]==ch: return booly
        else: i+=1
    return booln
def match_simple_token(sarg, i, token):
    global MATCHLEN; c0=token[0]; ch=sarg[i]
    if c0==ESCAPE:
        c1=token[1]
        if c1=='s' and ch in (' ','\t','\r','\n','\f','\v'): MATCHLEN=1; return True
        elif c1=='S' and ch not in (' ','\t','\r','\n','\f','\v'): MATCHLEN=1; return True
        elif '0'<=c1<='9':
            captured=xGroups[int(c1)]; lg=len(captured)
            if sarg[i:i+lg]==captured: MATCHLEN=lg; return True
        elif ch==c1: MATCHLEN=1; return True
    elif c0==REGIONA and match_region(ch,token): MATCHLEN=1; return True
    elif c0==ANYCHAR or c0==ch: MATCHLEN=1; return True
    return False
def strip_groupattris(s):
    if any([s.startswith(EXTENSION+c) for c in (':','=','!')]): s=s[3:]
    else: s=s[1:]
    if not s.endswith(')'): return s[:-2]
    return s[:-1]
TokenListCache = {}; xGroups = []
def parse_pattern(pattern, nested):
    """
    tokens are:
    1. patterns included in brackets (parsing is recursive)
    2. regions
    3. \\ together with the escaped character
    4. periods
    5. simple characters
    All paired to 2-tuples with their trailing quantifiers or None
    """
    if pattern in TokenListCache.keys(): return TokenListCache[pattern]
    tokens=[]; i=0; pL = len(pattern)
    while i < pL:
        c = pattern[i]
        if c==REGIONA:
            k = pattern.find(REGIONO, i)
            if k==-1: raise ValueError('Unmatched '+REGIONA+' in '+pattern)
            while pattern[k-1] == ESCAPE:
                k = pattern.find(REGIONO, k+1)
                if k==-1: raise ValueError('Unmatched '+REGIONA+' in '+pattern)
            tokens.append(pattern[i:k+1]); i=k+1
        elif c == ANYCHAR: tokens.append(ANYCHAR); i+=1
        elif c == ESCAPE:
            if i<pL-1: tokens.append(pattern[i:i+2]); i+=2
            else: raise ValueError('Trailing '+ESCAPE)
        elif nested and c==GROUPA:
            resu = GROUPA; k=i+1; lv=1
            while lv > 0:
                cc = pattern[k]
                if cc == ESCAPE: resu+=cc; resu+=pattern[k+1]; k+=2; continue
                if cc == GROUPA: lv+=1
                elif cc == GROUPO: lv-=1
                resu+=cc; k+=1
            tokens.append(resu); i=k; kore=strip_groupattris(resu)
            if resu not in TokenListCache.keys():
                TokenListCache[resu] = []
                # groups are parsed to lists of token lists, each an alternative from '|'
                if kore[0] != GROUPA:
                    for s in kore.split(ALTERNATIVE):
                        TokenListCache[resu].append(parse_pattern(s, True))
                else:
                    TokenListCache[resu].append(parse_pattern(kore, True))
        else: tokens.append(c); i+=1
        if i<pL:
            if pattern[i]==PERHAPS: tokens[-1]=(tokens[-1],PERHAPS); i+=1
            elif pattern[i]==STAR: tokens[-1]=(tokens[-1],STAR); i+=1
            elif pattern[i]==JUST_ONE_and_STAR:
                tokens.append((tokens[-1],STAR)); tokens[-2]=(tokens[-2],None); i+=1
            else: tokens[-1] = (tokens[-1],None)
        else: tokens[-1] = (tokens[-1],None)
    TokenListCache[pattern]=tokens; return tokens
def try_index(sarg, tokens, ns):
    tkL=len(tokens)-1; L=len(sarg); global MATCHEND; compix=MATCHEND
    for tix in range(tkL+1):
        if compix==L: return any([pair[1] for pair in tokens[tix:]])
        ctk, qua = tokens[tix]
        if qua and tix<tkL and try_index(sarg, tokens[tix+1:], ns): return True
        if ns and ctk[0] == GROUPA:
            if any([try_index(sarg, t, True) for t in TokenListCache[ctk]]):
                if ctk[1:3] != SKIPSTORE: xGroups.append(sarg[compix:MATCHEND])
                if ctk.startswith(EXTENSION+'='): continue
                elif ctk.startswith(EXTENSION+'!'): return False
                compix=MATCHEND
                if qua==STAR:
                    T = TokenListCache[ctk]
                    while compix<L:
                        if tix<tkL and try_index(sarg, tokens[tix+1:], ns): return True
                        if not any([try_index(sarg, t, ns) for t in T]): break
                        compix=MATCHEND
            else:
                if ctk.startswith(EXTENSION+'!'): continue
                if tix<tkL or not qua: return False
        elif match_simple_token(sarg, compix, ctk):
            compix+=MATCHLEN; MATCHEND=compix
            if qua==STAR:
                while compix<L:
                    if tix<tkL and try_index(sarg, tokens[tix+1:], ns): return True
                    if not match_simple_token(sarg, compix, ctk): break
                    compix+=MATCHLEN; MATCHEND=compix
        elif tix<tkL or not qua: return False
    return True
def xsearch(sarg, pattern, nested=False, start=0):
    tokens = parse_pattern(pattern, nested); L=len(sarg); global MATCHEND
    if nested: global xGroups; xGroups=[]
    while start<L:
        MATCHEND=start
        if try_index(sarg, tokens, nested): return (start, MATCHEND)
        start+=1
    return ()
def xfinditer(sarg, pattern, nested=False, start=0):
    tokens = parse_pattern(pattern, nested); n=0; L=len(sarg); global MATCHEND
    if nested: global xGroups; xGroups=[]
    while start<L:
        if n: start=n
        MATCHEND=start
        if try_index(sarg, tokens, nested): n=MATCHEND; yield (start, MATCHEND)
        else: n=0; start+=1
    return
def xmatch(sarg, pattern, nested=False):
    """ checks whether sarg as a whole matches the pattern """
    tokens = parse_pattern(pattern, nested); global MATCHEND; MATCHEND=0
    if nested: global xGroups; xGroups=[]
    if try_index(sarg, tokens, nested) and MATCHEND==len(sarg): return True
    return False
def xsplit(sarg, pattern, nested=False):
    resu = []; xpair = xsearch(sarg, pattern, nested=nested); residue=0
    while xpair:
        resu.append(sarg[residue:xpair[0]])
        residue = xpair[1]
        xpair = xsearch(sarg, pattern, nested, xpair[1])
    return resu+[sarg[residue:]]
def xreplace(sarg, pattern, subst, nested=False):
    to_replace=[]; s=sarg; xpair=xsearch(sarg, pattern, nested)
    while xpair:
        to_replace.append(xpair); xpair=xsearch(sarg, pattern, nested, xpair[1])
    if nested:
        for i in range(len(xGroups)): subst=subst.replace(ESCAPE+str(i), xGroups[i])
    for xpair in reversed(to_replace): s = s[:xpair[0]]+subst+s[xpair[1]:]
    return s
def xfuncreplace(sarg, pattern, f, nested=False):
    to_replace=[]; s=sarg; xpair=xsearch(sarg, pattern, nested)
    while xpair:
        to_replace.append(xpair); xpair=xsearch(sarg, pattern, nested, xpair[1])
    for xpair in reversed(to_replace):
        s = s[:xpair[0]]+f(s[xpair[0]:xpair[1]])+s[xpair[1]:]
    return s
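A hypothetical quick test of the helper functions (test strings and expected values are illustrative additions, not part of the original recipe):

if __name__ == '__main__':
    print(xmatch('2023', '[0-9]+'))                     # True: the whole string matches
    print(xreplace('2023-01-15', '[0-9]+', '#'))        # '#-#-#'
    print(xsearch('abcabc', r'(abc)\0', nested=True))   # (0, 6): \0 back-references the first captured group
    print(list(xfinditer('a1b22c', '[0-9]+')))          # [(1, 2), (3, 5)]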
The first three notes below have lost any meaning by now.
Revision 38 changed the return values to booleans, allowing the use of any() and reducing the indentation levels in try_index(). More than 30 lines could be saved that way. This is the final revision, or at least close to it.
Rev. 39 doesn't change anything about that; it only improved a variable's name, on the occasion of submitting the rewritten introduction. Rev. 40 fixed a bug and a flaw.
Avoiding the code duplication in try_sindex and try_nindex looks expensive. However, match_region should be:
And line 76 should be replaced by:
WARNING :(
Currently (rev. 12) the engine fails with \s or \n in patterns (or are some of the strings reached here "incorrect"?). Unfortunately not by crashing, but by skipping matches. This is a bit difficult to debug; I am still working on it.
With revisions coming this often, it is time to migrate to googlecode, I think :)
Admittedly very many. I hadn't foreseen how easy it is to extend the functionality of this engine: only one line each for positive (118) and negative (119) look-ahead testing, for example. That is why I still like it in spite of its low speed. And with respect to brevity I tried everything to come close to perfection. The introduction of pairs as entries of the token lists, with None as the second tuple item for non-quantified tokens, saved some lines in try_index() and made several other lines shorter ("if qua:" now instead of the former "if qua in (PERHAPS, STAR):").
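For example, a small sketch of the two look-ahead forms (only found / not found is shown, because the reported span can extend over the looked-ahead characters):

print(bool(xsearch('foobar', 'foo(?=bar)', nested=True)))   # True
print(bool(xsearch('foobaz', 'foo(?=bar)', nested=True)))   # False
print(bool(xsearch('foobaz', 'foo(?!bar)', nested=True)))   # True
print(bool(xsearch('foobar', 'foo(?!bar)', nested=True)))   # False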
The first revisions didn't even have groups at all.
Furthermore, this is the end. I will not produce more revisions unless I find bugs or a radically different approach providing much more speed. Since these 34 revisions are here now and are watched by so many, migrating away at this point wouldn't make sense.