
Very often we need to look for the occurrence of words in a string or group of words, and we rely on regular expressions for such operations. A common requirement is to check whether certain words occur in a paragraph or string. We can combine these occurrences with the boolean operators AND, OR and NOT, which lets us search for words using boolean logic.

This class is created to do exactly that. It wraps up a complex boolean word expression, creating an internal regular expression, and provides methods allowing you to perform matches and searches on it.
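
As a rough illustration of the general idea, a boolean word query can be mapped onto a regular expression with lookahead assertions. This is a sketch of an alternative technique, not the implementation used by this recipe; the helper names `all_of`, `any_of` and `none_of` are hypothetical and exist only for this example.

```python
import re

def all_of(*words):
    """AND: every word must occur somewhere in the text, in any order."""
    return ''.join('(?=.*%s)' % re.escape(w) for w in words)

def any_of(*words):
    """OR: at least one of the words must occur somewhere in the text."""
    return '(?=.*(?:%s))' % '|'.join(re.escape(w) for w in words)

def none_of(*words):
    """NOT: none of the words may occur anywhere in the text."""
    return ''.join('(?!.*%s)' % re.escape(w) for w in words)

# "Guido AND Python AND NOT Perl"
pattern = re.compile(all_of('Guido', 'Python') + none_of('Perl'), re.DOTALL)

print(bool(pattern.match('Guido created Python')))            # True
print(bool(pattern.match('Python was created by Guido')))     # True
print(bool(pattern.match('Guido created Python, not Perl')))  # False
```

Because each lookahead is evaluated from the start of the string, the words may appear in any order, which is the main design difference from simply concatenating `.*word.*` fragments.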

Python, 219 lines
#!/usr/bin/python

import re

class PyBoolReException(Exception):

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return str(self.value)
    
    
class PyBoolRe:
    """ A class to perform boolean word matches in
    a string or paragraph. This class allows you to
    perform complex matches in a string or group of
    words by creating simple boolean expressions,
    grouped by parentheses to create complex match
    expressions.

    Author: Anand B Pillai, http://tinyurl.com/yq3y
    Copyright: None
    LICENSE: GPL
    Version: 0.2
    
    Usage:

    1. Create regular expressions using the boolean
       keywords '|' and '&', standing for 'OR' and
       'AND' respectively. A leading '!' on a word
       stands for 'NOT'.
    2. Use parentheses to group the boolean expressions
       to create complex match expressions.
    3. Caveats:

       1. Fails for expressions with redundant parens such
       as ((A | B)) etc.
       

    Example:
    
    p = PyBoolRe('Guido & Python')
    s = 'Guido created Python'
    mobject = p.match(s)
    
    # Work with 'mobject' like you normally work with
    # regular expression match objects
      
    """
    
    def __init__(self, boolstr):
        # Require whitespace  before words?
        self.__needspace = True
        # whitespace re
        self._wspre = re.compile(r'^\s*$')
        # create regexp string
        self.__rexplist = []
        oparct = boolstr.count('(')
        clparct = boolstr.count(')')
        if oparct != clparct:
            raise PyBoolReException('Mismatched parentheses!')

        self.__parse(boolstr)
        # if NOT is one of the members, reverse
        # the list
        # print self.__rexplist
        if '!' in self.__rexplist:
            self.__rexplist.reverse()

        s = self.__makerexp(self.__rexplist)
        # print s
        self.__rexp = re.compile(s)

    def match(self, data):
        """ Match the boolean expression, behaviour
        is same as the 'match' method of re """
        
        return self.__rexp.match(data)

    def search(self, data):
        """ Search the boolean expression, behaviour
        is same as the 'search' method of re """

        return self.__rexp.search(data)

    def __parse(self, s):
        """ Parse the boolean regular expression string
        and create the regexp list """

        # The string is a set of nested parentheses with
        # any characters in between the parens.

        scopy = s[:]
        oparmatch, clparmatch = False, False

        # Find the last (innermost) opening parenthesis
        index = scopy.rfind('(')

        l = []
        if index != -1:
            oparmatch = True
            index2 = scopy.find(')', index)
            if index2 != -1:
                clparmatch = True
                newstr = scopy[index+1:index2]
                # if the string is only of whitespace chars, skip it
                if not self._wspre.match(newstr):
                    self.__rexplist.append(newstr)
                replacestr = '(' + newstr + ')'
                scopy = scopy.replace(replacestr, '')
                    
                self.__parse(scopy)
                
        if not clparmatch and not oparmatch:
            if scopy: self.__rexplist.append(scopy)

    def is_inbetween(self, l, elem):
        """ Find out if an element is in between
        in a list """

        index = l.index(elem)
        if index == -1:
            return False

        if index>2:
            if index in range(1, len(l) -1):
                return True
            else:
                return False
        else:
            return True

    def __makenotexpr(self, s):
        """ Make a NOT expression """

        if s.find('!') == 0:
            return ''.join(('(?!', s[1:], ')'))
        else:
            return s
                          
    def __makerexp(self, rexplist):
        """ Make the regular expression string for
        the boolean match from the nested list """

        
        is_list = True

        if type(rexplist) is str:
            is_list = False
            elem = rexplist
        elif type(rexplist) is list:
            elem = rexplist[0]

        if type(elem) is list:
            elem = elem[0]
            
        eor = False
        if not is_list or len(rexplist) == 1:
            eor = True

        word_str = '.*'
        
        s=''
        # Implementing NOT
        if elem == '!':
            return ''.join(('(?!', self.__makerexp(rexplist[1:]), ')'))
        # Implementing OR
        elif elem.find(' | ') != -1:
            listofors = elem.split(' | ')

            for o in listofors:
                index = listofors.index(o)
                in_bet = self.is_inbetween(listofors, o)

                if o:
                    o = self.__makenotexpr(o)
                    if in_bet:
                        s = ''.join((s, '|', word_str, o, '.*'))
                    else:
                        s = ''.join((s, word_str, o, '.*'))

        # Implementing AND
        elif elem.find(' & ') != -1:
            listofands = elem.split(' & ')
            
            for a in listofands:
                index = listofands.index(a)
                in_bet = self.is_inbetween(listofands, a)                

                if a:
                    a = self.__makenotexpr(a)                   
                    s = ''.join((s, word_str, a, '.*'))

        else:
            if elem:
                elem = self.__makenotexpr(elem)             
                s = ''.join((elem, '.*'))

        if eor:
            return s
        else:
            return ''.join((s, self.__makerexp(rexplist[1:])))
            
                    
if __name__=="__main__":
    p = PyBoolRe('(!Guido)')
    
    s1 = 'Guido invented Python and Larry invented Perl'
    s2 = 'Larry invented Perl, not Python'
    
    if p.match(s1):
        print('Match found for first string')
    else:
        print('No match found for first string')

    if p.match(s2):
        print('Match found for second string')
    else:
        print('No match found for second string')
        
        

        

The first version had errors in parsing of the expression, so I re-wrote the parse() method in this version.

Implemented a basic NOT operator handling algorithm (19/12/03). The script no longer parses whitespace.

-Anand
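
To illustrate the NOT handling noted above: reading `__makenotexpr` and `__makerexp`, the test expression '(!Guido)' appears to compile to the pattern `(?!Guido).*`. That pattern is an inference from the listing, not something stated in the recipe, and the string `s3` is added here only for illustration. The sketch below uses the `re` module directly to show what such a negative lookahead does with `match`.

```python
import re

# Assumed pattern for PyBoolRe('(!Guido)'), inferred from the listing.
not_guido = re.compile(r'(?!Guido).*')

s1 = 'Guido invented Python and Larry invented Perl'
s2 = 'Larry invented Perl, not Python'
s3 = 'Python was invented by Guido'  # hypothetical extra test string

print(bool(not_guido.match(s1)))  # False: the string starts with 'Guido'
print(bool(not_guido.match(s2)))  # True
print(bool(not_guido.match(s3)))  # True: 'Guido' is present, but not at the start

# Putting '.*' inside the lookahead rejects the word anywhere in the string.
not_guido_anywhere = re.compile(r'(?!.*Guido).*')
print(bool(not_guido_anywhere.match(s3)))  # False
```

In other words, a bare `(?!word)` only tests the position where matching starts, so a NOT that should cover the whole string needs the `.*` inside the lookahead.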

1 comment

Owen Richter 17 years, 10 months ago

One Error and One more caveat.

Error:

I believe that in the following code:

def is_inbetween(self, l, elem):
    """ Find out if an element is in between
    in a list """

    index = l.index(elem)
    if index == -1:
        return False

the line:

    if index == -1:

should read:

    if index == 0:

The way it currently reads, any | will always "or" with Null, and will thus always be true. (i.e. Guido | Larry is really: Null | Guido | Larry).
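
The effect described here can be reproduced with the `re` module alone. The two patterns below are what the listing appears to generate for 'Guido | Larry' before and after the suggested change; both are inferred from reading the code, so treat them as assumptions rather than confirmed output.

```python
import re

# Pattern the unpatched code appears to build for 'Guido | Larry'
# (inferred from the listing): note the empty first alternative.
buggy = re.compile(r'|.*Guido.*|.*Larry.*')

# Pattern after the suggested fix: no leading '|'.
fixed = re.compile(r'.*Guido.*|.*Larry.*')

text = 'Perl is a language'

print(buggy.match(text) is not None)                # True: the empty alternative matches
print(fixed.match(text) is not None)                # False
print(fixed.match('Larry wrote Perl') is not None)  # True
```

An empty alternative matches the empty string at any position, which is why the leading '|' makes every OR expression succeed.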

Caveat:

There is also another caveat. In a boolean statement such as (Larry

Created by Anand on Thu, 11 Dec 2003 (PSF)