
This recipe shows how to use the Python standard re module to perform single-pass multiple string substitution using a dictionary.

Python, 77 lines
# Tested with Python 2.1
from __future__ import nested_scopes

import re 

#
# The simplest, lambda-based implementation
#

def multiple_replace(dict, text): 

  """ Replace in 'text' all occurrences of any key in the given
  dictionary by its corresponding value.  Returns the new string."""

  # Create a regular expression  from the dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

  # For each match, look-up corresponding value in dictionary
  return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) 

#
# You may combine both the dictionary and the search-and-replace
# into a single object using a 'callable' dictionary wrapper
# which can be directly used as a callback object.
# 

# In Python 2.2+ you may extend the 'dictionary' built-in class instead
from UserDict import UserDict 
class Xlator(UserDict):

  """ An all-in-one multiple string substitution class """ 

  def _make_regex(self): 

    """ Build a regular expression object based on the keys of
    the current dictionary """

    return re.compile("(%s)" % "|".join(map(re.escape, self.keys()))) 

  def __call__(self, mo): 
    
    """ This handler will be invoked for each regex match """

    # Count substitutions
    self.count += 1

    # Look up the matched string in the dictionary
    return self[mo.string[mo.start():mo.end()]]

  def xlat(self, text): 

    """ Translate text, returns the modified text. """ 

    # Reset substitution counter
    self.count = 0 

    # Process text
    return self._make_regex().sub(self, text)

#
# Test
# 

if __name__ == "__main__": 

  text = "Larry Wall is the creator of Perl"

  dict = {
    "Larry Wall" : "Guido van Rossum",
    "creator" : "Benevolent Dictator for Life",
    "Perl" : "Python",
  } 

  print multiple_replace(dict, text) 

  xlat = Xlator(dict)
  print xlat.xlat(text)
  print "Changed %d thing(s)" % xlat.count 

Let's say you have a dictionary-based 1-to-1 mapping between strings. The keys are the set of strings (or regular expression patterns) you want to replace and their corresponding values are the strings they should be replaced by. You could actually do that by calling re.sub for each (key, value) pair in the dictionary, thus processing and creating a new copy of the whole text several times. It would be a lot better to do all the changes in a single pass, so we would need to process and create a copy of the text only once. Fortunately, re.sub's callback facility enables us to do so!

First we have to build a regular expression from the set of keys we want to match. Such a regular expression takes the form "(a1|a2|...|an)" and can be easily generated with a cute one-liner (see source).

Then, instead of feeding re.sub a replacement string, we'll invoke it with a callable object. This object will be called for each match with a re.MatchObject as its only argument, and re.sub expects it to return the replacement string. The callback routine just has to look up the matched text in the dictionary and return the corresponding value.

I propose two implementations. The first is a straightforward lambda-based one; the other uses a callable dictionary-like object. The second option is better if you want to perform additional processing on each match, such as counting the substitutions or implementing any kind of complex, context-related behaviour.
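On modern Python (3.x), the same lambda-based technique can be written as follows, a minimal sketch that uses mo.group(0) instead of slicing mo.string:

```python
import re

def multiple_replace(replacements, text):
    """Replace in 'text' all occurrences of any key in 'replacements'
    with its corresponding value, in a single pass."""
    # Build one alternation pattern from all the dictionary keys.
    regex = re.compile("|".join(map(re.escape, replacements)))
    # For each match, look up the matched text in the dictionary.
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

text = "Larry Wall is the creator of Perl"
replacements = {
    "Larry Wall": "Guido van Rossum",
    "creator": "Benevolent Dictator for Life",
    "Perl": "Python",
}
print(multiple_replace(replacements, text))
# Guido van Rossum is the Benevolent Dictator for Life of Python
```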

15 comments

Alan Eldridge 12 years, 11 months ago  # | flag

Nice algorithm; needs one fix and one optimization. If dict has no entries, then the regex matches everything. Need to have an explicit check so as not to call re.sub() in that case.

The dict-like object is IMO the way to go. Now, if we have one of these, then we may (probably?) be using it more than once. So, we ought to keep the compiled regex around. There's no harm in keeping the text regex around either: it's nice for debugging, and we can get a further optimization out of it, too.

Here's what I came up with based on your original code:

# Original algorithm by Xavier Defrang.
# http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/81330
# This implementation by alane@sourceforge.net.

import re
import UserDict

class textsub(UserDict.UserDict):
    def __init__(self, dict = None):
        self.re = None
        self.regex = None
        UserDict.UserDict.__init__(self, dict)

    def compile(self):
        if len(self.data) > 0:
            tmp = "(%s)" % "|".join(map(re.escape,
                                        self.data.keys()))
            if self.re != tmp:
                self.re = tmp
                self.regex = re.compile(self.re)

    def __call__(self, match):
        return self.data[match.string[match.start():match.end()]]

    def sub(self, s):
        if len(self.data) == 0:
            return s
        return self.regex.sub(self, s)
Paul Hardwick 11 years, 11 months ago  # | flag

Performance tweaking..

I like this re-based function a lot and was interested to compare it to a similar string replace function I already use -- the results are interesting:

import time
vars = {'%_fred':'hello','%_dick':'world','%_freddy':'sunny'}
line = "%_fred beautiful %_dick"

def string_replace(dict,text):
    keys = dict.keys()
    keys.sort()
    for n in keys:
        if '%' not in text: break
        text = text.replace(n,dict[n])
    return text

start = time.time()
for n in range(1000):
    multiple_replace(vars,line)
print 1000/(time.time()-start)

start = time.time()
for n in range(1000):
    string_replace(vars,line)
print 1000/(time.time()-start)

On my PIII 1.1GHz machine:
multiple_replace 3831.41707853 loops/sec
string_replace 50000.0476838 loops/sec
However.. if the re.compile statement is moved out of the function
new_multiple_replace 10000.0095368 loops/sec

If you can arrange to compile outside of the routine then the re
version is best otherwise the longer string based function has the performance.
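On modern Python the same comparison is easier to reproduce with the timeit module; a sketch, assuming the pattern is compiled once outside the timed loop (absolute numbers will of course differ per machine):

```python
import re
import timeit

vars = {'%_fred': 'hello', '%_dick': 'world', '%_freddy': 'sunny'}
line = "%_fred beautiful %_dick"

# Compile the pattern once, outside the timed loop.
regex = re.compile("|".join(map(re.escape, vars)))

def compiled_replace(text):
    return regex.sub(lambda mo: vars[mo.group(0)], text)

# Loops per second for 1000 substitutions; numbers vary by machine.
elapsed = timeit.timeit(lambda: compiled_replace(line), number=1000)
print(1000 / elapsed)
```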
Martin Miller 11 years, 8 months ago  # | flag

Re: Performance tweaking. Actually, the

keys.sort()

shown in the string_replace() function doesn't do anything useful and could be omitted. For the small number of keys, three, shown in the example, doing this doesn't speed up the routine very much.

However, it can be useful to sort the keys by the reverse of their string lengths, like this instead:

keys.sort(lambda a,b: len(b)-len(a))

because doing this will try matches with the longest patterns first. This can be significant if one or more of the keys are substrings of others -- as is the case with '%_fred' and '%_freddy' in the example.

If the keys are not sorted by length (either the default way, as originally shown, or not at all), the following string:

"%_freddy beautiful %_dick"

will become:

"hellody beautiful world"

instead of:

"sunny beautiful world"

The down side of the reverse length sort is that it does slow the routine down (about 50% on my machine).

All versions above might also benefit from placing a "str(...)" call around the dictionary lookups. This allows the replacement values to be numbers and other objects, as well as strings. I find this useful when the replacement values are calculated.
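Both suggestions (longest-first key ordering and str() around the lookup) carry over to the regex-based recipe, since re alternation tries branches left to right; a sketch in modern Python:

```python
import re

def multiple_replace(replacements, text):
    # Longest keys first, so '%_freddy' wins over its substring '%_fred'.
    keys = sorted(replacements, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    # str() lets replacement values be numbers or other objects.
    return regex.sub(lambda mo: str(replacements[mo.group(0)]), text)

vars = {'%_fred': 'hello', '%_dick': 'world', '%_freddy': 'sunny'}
print(multiple_replace(vars, "%_freddy beautiful %_dick"))
# sunny beautiful world
print(multiple_replace({"count": 3}, "count items"))
# 3 items
```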

Martin Miller 11 years, 8 months ago  # | flag

Re: Performance tweaking. Interestingly, removing the line:

if '%' not in text: break

from the string_replace() function speeds it up fairly significantly, with no detrimental effects. Another advantage to leaving the line out is that it makes the routine more general, since keys no longer all have to start with a "%" character.

Paul Hardwick 11 years, 8 months ago  # | flag

Further tweaking... You're right of course - the sort needs to be reversed and done outside of the function if possible :D

I did some more checking... With modified code, using the same small sample string and increasing the size of the dictionary:

keys    string (loops/sec)    regex (loops/sec)
20      20k                   10k
40      11k                   10k
200     2k                    10k
2000    250                   12k

As the dictionary grew beyond 40 items, the performance of the regex routine was unaffected, whereas my string replace function started to perform badly - we live and learn!

bobihope 10 years, 7 months ago  # | flag

Thanx! I used it in a script for xchat to display rhythmbox's output (http://www.schitterendedingen.com/scripts/rhy.py).. now users can easily change extensions.

andrew mcdonald 9 years, 12 months ago  # | flag

Great compact class!! To anchor matches at a word boundary before each key, you just need to change one line:

def compile(self):
        if len(self.data) > 0:
            tmp = r"(\b%s)" % "|" r"\b".join( map( re.escape,
                                        self.data.keys() ) )
            if self.re != tmp:
                self.re = tmp
                self.regex = re.compile(self.re)

For those who don't know: r"\b" matches a word boundary (a zero-width position between a word character and a non-word character), not a space itself.

So the regexp r"\bPYTHON\b" will match in "I love PYTHON"

but not in "PYTHON_version".

What I just realized is that the above change to compile doesn't account for the word boundary after the string!

It is too late to fool around with Python ;>
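For completeness, putting r"\b" on both sides of the alternation handles the boundary after the key as well; a small sketch in modern Python:

```python
import re

words = {"PYTHON": "Python"}
# \b on *both* sides anchors each key at word boundaries.
regex = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, words)))

result = regex.sub(lambda mo: words[mo.group(0)],
                   "I love PYTHON but not PYTHON_version")
print(result)
# I love Python but not PYTHON_version
```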

Emanuel Rumpf 9 years, 8 months ago  # | flag

Maybe just use this.

import re
def strTr( text, dic ):
    """ Replaces keys of dic with values of dic in text. 2005-02 by Emanuel Rumpf """
    pat = "(%s)" % "|".join( map(re.escape, dic.keys())  )
    return re.sub( pat, lambda m:dic[m.group()], text )
Brian Jones 8 years, 9 months ago  # | flag

Add word boundaries after each string too. This is a simple change to the code to include word boundaries before and after each of the keys - change:

tmp = r"(\b%s)" % "|" r"\b".join( map( re.escape,

to:

tmp = r"(\b%s\b)" % r"\b" "|" r"\b".join( map( re.escape,

David Bailey 6 years, 1 month ago  # | flag

To resolve the longest to shortest key issue in the original function multiple_replace above, I wrote this bit to replace the one-liner:

regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

With:

keys = dict.keys()
# Sort keys by length, longest first
keys.sort(lambda a,b: len(b)-len(a))
expression = []
for item in keys:
  expression.append(re.escape(item))
regex = re.compile("(%s)" % "|".join(expression))

It's not nearly so brief, and I'm sure others could find a way to condense or make clearer, but now keys which are subsets of longer keys won't get in the way.

David Bailey 5 years, 8 months ago  # | flag

This method breaks in Python 3 (cmp-style sort functions are no longer supported):

Try this instead:

keys = sorted(dictobj.keys(), key=lambda a: len(a))
expression = []
for item in keys:
  expression.append(re.escape(item))
regex = re.compile("(%s)" % "|".join(expression))
David Bailey 5 years, 8 months ago  # | flag

or even better:

keys = sorted(dict.keys(), key=len)
David Bailey 5 years, 8 months ago  # | flag

One other thing, don't forget to reverse the sort, so that the longest items are first.

keys = sorted(dict.keys(), key=len)
keys.reverse()
Patrick Dobbs 3 years, 2 months ago  # | flag

sorted takes a keyword argument:

keys = sorted(dict.keys(), key=len, reverse=True)
