Welcome, guest | Sign In | My Account | Store | Cart

A short engine for testing against a regex, understanding the 3 common quantifiers ?,+,* (non-greedy) working on characters, ., [...], [^...], \s, \S, bracketed patterns and group designators \N. Accepts unicode objects and fixed-width encoded strings (but problems with eventual comparisons of trailing bytes in multi-byte utf-letters). Captures up to 10 groups ( (?:...) implemented), which can be used for back referencing and in xreplace(). Captured groups are accessible after the search in the global list xGroups. | is supported, but only in groups and needing nested=True. nested=False is making '(' and ')' common letters.

This is not about Python or for Python, there it has little use beside re. But regarding that re needs about 6,000 lines you might agree with the author, that these 176 lines are powerful. This was the reason to publish it as a recipe - as a kind of (fairly complete) minimal example of a regex tester and as an example for corresponding recursive structures in data (TokenListCache) and code.

Working on this improved the author's understanding of regular expressions - especially of their eventual "greed". "Greedy" quantifiers are a concept, which has to be explained seperately and is coming unexpected: Whoever is scanning a text for '<.*>', s/he will search SGML tags, not the whole text. Even with the star's "greediness" the code has to take care, that '.*' doesn't eat the whole text finding no match for '<.*>' at all. Thus the standard syntax with greedy quantifiers cannot be simpler to implement than this with its mere 3 lines 101, 111 and 121 preventing any greed. Perhaps it is faster - otherwise it is difficult to understand, why the concept "greed" is existing at all.

This engine might be useful here and then under circumstances with nothing else available. Its brevity eases translation to other languages and it can work with arbitrary characters for STAR or PERHAPS (for example).

Python, 176 lines
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
ANYCHAR = '.'; ESCAPE = '\\'
REGIONA = '['; REGIONO = ']'; RANGE = '-'; COMPLEMENT = '^'
GROUPA = '(' ; GROUPO = ')'; ALTERNATIVE = '|'
PERHAPS = '?' ; STAR = '*'; JUST_ONE_and_STAR = '+'
EXTENSION = '(?'; SKIPSTORE = '?:'

def match_region(ch, pattern):
    """ region: a list of comparation chars and ranges in [] or [^]"""
    if pattern[1]==COMPLEMENT: booly=False; booln=True; i=2
    else: booly=True; booln=False; i=1
    while i < len(pattern):
        if pattern[i]==ESCAPE:
            if pattern[i+1]==ch: return booly
            else: i+=2
        elif i<len(pattern)-2 and pattern[i+1]==RANGE:
            if pattern[i]<=ch<=pattern[i+2]: return booly
            else: i+=2
        elif pattern[i]==ch: return booly
        else: i+=1
    return booln

def match_simple_token(sarg, i, token):
    global MATCHLEN; c0=token[0]; ch=sarg[i]
    if c0==ESCAPE:
        c1=token[1]
        if c1=='s' and ch in (' ','\t','\r','\n','\f','\v'): MATCHLEN=1; return True
        elif c1=='S' and ch not in (' ','\t','\r','\n','\f','\v'): MATCHLEN=1; return True
        elif '0'<=c1<='9':
            captured=xGroups[int(c1)]; lg=len(captured)
            if sarg[i:i+lg]==captured: MATCHLEN=lg; return True        
        elif ch==c1: MATCHLEN=1; return True
    elif c0==REGIONA and match_region(ch,token): MATCHLEN=1; return True
    elif c0==ANYCHAR or c0==ch: MATCHLEN=1; return True
    return False

def strip_groupattris(s):
    if any([s.startswith(EXTENSION+c) for c in (':','=','!')]): s=s[3:]
    else: s=s[1:]
    if not s.endswith(')'): return s[:-2]
    return s[:-1]

TokenListCache = {}; xGroups = []

def parse_pattern(pattern, nested):
    """ 
    tokens are: 
    1. patterns included in brackets (parsing is recursive)
    2. regions
    3. \\ together with the escaped character
    4. periods
    5. simple characters
    All paired to 2-tuples with their trailing quantifiers or None
    """
    if pattern in TokenListCache.keys(): return TokenListCache[pattern]
    tokens=[]; i=0; pL = len(pattern)
    while i < pL:
        c = pattern[i]
        if c==REGIONA:
            k = pattern.find(REGIONO, i)
            if k==-1: raise ValueError('Unmatched '+REGIONA+' in '+pattern)
            while pattern[k-1] == ESCAPE:
                k = pattern.find(REGIONO, k+1)
                if k==-1: raise ValueError('Unmatched '+REGIONA+' in '+pattern)
            tokens.append(pattern[i:k+1]); i=k+1
        elif c == ANYCHAR: tokens.append(ANYCHAR); i+=1
        elif c == ESCAPE:
            if i<pL-1: tokens.append(pattern[i:i+2]); i+=2
            else: raise ValueError('Trailing '+ESCAPE)
        elif nested and c==GROUPA:
            resu = GROUPA; k=i+1; lv=1
            while lv > 0:
                cc = pattern[k]
                if cc == ESCAPE: resu+=cc; resu+=pattern[k+1]; k+=2; continue
                if cc == GROUPA: lv+=1 
                elif cc == GROUPO: lv-=1
                resu+=cc; k+=1
            tokens.append(resu); i=k; kore=strip_groupattris(resu)
            if resu not in TokenListCache.keys():
                TokenListCache[resu] = []   
                # groups are parsed to lists of token lists, each an alternative from '|'
                if kore[0] != GROUPA:
                    for s in kore.split(ALTERNATIVE):
                        TokenListCache[resu].append(parse_pattern(s, True))
                else:
                    TokenListCache[resu].append(parse_pattern(kore, True))
        else: tokens.append(c); i+=1
        if i<pL:
            if pattern[i]==PERHAPS: tokens[-1]=(tokens[-1],PERHAPS); i+=1
            elif pattern[i]==STAR: tokens[-1]=(tokens[-1],STAR); i+=1
            elif pattern[i]==JUST_ONE_and_STAR: 
                tokens.append((tokens[-1],STAR)); tokens[-2]=(tokens[-2],None); i+=1
            else: tokens[-1] = (tokens[-1],None)
        else: tokens[-1] = (tokens[-1],None) 
    TokenListCache[pattern]=tokens; return tokens

def try_index(sarg, tokens, ns):
    tkL=len(tokens)-1; L=len(sarg); global MATCHEND; compix=MATCHEND
    for tix in range(tkL+1):
        if compix==L: return any([pair[1] for pair in tokens[tix:]])
        ctk, qua = tokens[tix]   
        if qua and tix<tkL and try_index(sarg, tokens[tix+1:], ns): return True
        if ns and ctk[0] == GROUPA:
            if any([try_index(sarg, t, True) for t in TokenListCache[ctk]]):
                if ctk[1:3] != SKIPSTORE: xGroups.append(sarg[compix:MATCHEND])
                if ctk.startswith(EXTENSION+'='): continue
                elif ctk.startswith(EXTENSION+'!'): return False
                compix=MATCHEND
                if qua==STAR:
                    T = TokenListCache[ctk]
                    while compix<L:
                        if tix<tkL and try_index(sarg, tokens[tix+1:], ns): return True
                        if not any([try_index(sarg, t, ns) for t in T]): break
                        compix=MATCHEND
            else:
                if ctk.startswith(EXTENSION+'!'): continue
                if tix<tkL or not qua: return False
        elif match_simple_token(sarg, compix, ctk):
            compix+=MATCHLEN; MATCHEND=compix
            if qua==STAR:
                while compix<L:
                    if tix<tkL and try_index(sarg, tokens[tix+1:], ns): return True
                    if not match_simple_token(sarg, compix, ctk): break
                    compix+=MATCHLEN; MATCHEND=compix
        elif tix<tkL or not qua: return False
    return True

def xsearch(sarg, pattern, nested=False, start=0):
    tokens = parse_pattern(pattern, nested); L=len(sarg); global MATCHEND
    if nested: global xGroups; xGroups=[]
    while start<L:
        MATCHEND=start
        if try_index(sarg, tokens, nested): return (start, MATCHEND)
        start+=1
    return ()
    
def xfinditer(sarg, pattern, nested=False, start=0):
    tokens = parse_pattern(pattern, nested); n=0; L=len(sarg); global MATCHEND
    if nested: global xGroups; xGroups=[]
    while start<L:
        if n: start=n
        MATCHEND=start
        if try_index(sarg, tokens, nested): n=MATCHEND; yield (start, MATCHEND)
        else: n=0; start+=1
    raise StopIteration()

def xmatch(sarg, pattern, nested=False):
    """ checks, whether sarg as the whole matches the pattern """
    tokens = parse_pattern(pattern, nested); global MATCHEND; MATCHEND=0
    if nested: global xGroups; xGroups=[]
    if try_index(sarg, tokens, nested) and MATCHEND==len(sarg): return True
    return False

def xsplit(sarg, pattern, nested=False):
    resu = []; xpair = xsearch(sarg, pattern, nested=nested); residue=0
    while xpair:
        resu.append(sarg[:xpair[0]])
        residue = xpair[1]
        xpair = xsearch(sarg, pattern, nested, xpair[1])
    return resu+sarg[residue:]
    
def xreplace(sarg, pattern, subst, nested=False):
    to_replace=[]; s=sarg; xpair=xsearch(sarg, pattern, nested)
    while xpair:
        to_replace.append(xpair); xpair=xsearch(sarg, pattern, nested, xpair[1])
    if nested:
        for i in range(len(xGroups)): subst=subst.replace(ESCAPE+str(i), xGroups[i])
    for xpair in reversed(to_replace): s = s[:xpair[0]]+subst+s[xpair[1]:]   
    return s

def xfuncreplace(sarg, pattern, f, nested=False):
    to_replace=[]; s=sarg; xpair=xsearch(sarg, pattern, nested)
    while xpair:
        to_replace.append(xpair); xpair=xsearch(sarg, pattern, nested, xpair[1])
    for xpair in reversed(to_replace):
        s = s[:xpair[0]]+f(s[xpair[0]:xpair[1]])+s[xpair[1]:]   
    return s

The three first notes below have lost any meaning now.

Revision 38 changed the return values to booleans providing use of any() and reduction of indentation levels in try_index(). More than 30 lines could be spared so. This is the final revision or at least near to it.

And rev.39 doesn't change anything with this - it only improved a variable's name. On the occasion of sending the rewritten introduction. Rev.40 fixed a bug and a flaw.

5 comments

Joost Behrends (author) 11 years, 6 months ago  # | flag

Avoiding code duplication in try_sindex and try_nindex looks expensive - however match_region should be:

def match_region(ch, pattern):
    """ region: a list of comparation chs and ranges in [] or [^]"""
    booly = True; booln = False; i=1
    if pattern[1]=='^': booly = False; booln = True; i=2
    while i < len(pattern):
        if pattern[i] == '\\' and pattern[i+1] in (']', '\\','-'):
            if ch== pattern[i+1]: return booly
            else: i+=2
        elif i<len(pattern)-2 and pattern[i+1]=='-':
            if pattern[i]<=ch<=pattern[i+2]: return booly
            else: i+= 2
        elif pattern[i]==ch: return booly
        else: i+=1
    return booln
Joost Behrends (author) 11 years, 6 months ago  # | flag

And line 76 should be replaced by:

            if not resu in TokenListCache.keys():
                TokenListCache[resu]=parse_pattern(resu[1:-1], nested=True)
Joost Behrends (author) 11 years, 5 months ago  # | flag

WARNING :(

Currently (rev.12) the engine fails with \s or \n in patterns (or are some reached strings "incorrect" here ?). Unfortunately not by crashing, but by skipping matches. This is a bit difficult to debug - i am still working on it.

Denis Barmenkov 11 years, 5 months ago  # | flag

With so often revisions it is time to migrate to googlecode, I mean :)

Joost Behrends (author) 11 years, 5 months ago  # | flag

Admittedly very many. I hadn't foreseen, how easy it is to extend the functionality of this engine. Only one line each for positive (118) and negative (119) look-ahead testing for example. That is, why i still like it in spite of its low speed. And in respect to hrevity i tried all to come near to perfection. So the introduction of pairs as entries of the token lists with None as the second tuple item in case of non-quantified tokens spared some lines in try_index() and made several other lines shorter ("if qua:" now instead of what former were about "if qua in (PERHAPS, STAR):").

The first revisions didn't even have groups at all.

Furthermore here is the end. I will not produce more revisions as long as i find no bugs or a radically changed approach providing much more speed. Since these 34 revisions are here now and watched by so many migration from this point wouldn't make sense.

Created by Joost Behrends on Tue, 1 Jun 2010 (MIT)
Python recipes (4591)
Joost Behrends's recipes (1)

Required Modules

  • (none specified)

Other Information and Tasks