Welcome, guest | Sign In | My Account | Store | Cart

This recipe shows how to use the Python standard re module to perform single-pass multiple string substitution using a dictionary.

Python, 77 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
# Tested with Python 2.1
from __future__ import nested_scopes

import re 

#
# The simplest, lambda-based implementation
#

def multiple_replace(dict, text): 

  """ Replace in 'text' all occurences of any key in the given
  dictionary by its corresponding value.  Returns the new tring.""" 

  # Create a regular expression  from the dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

  # For each match, look-up corresponding value in dictionary
  return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) 

#
# You may combine both the dictionnary and search-and-replace
# into a single object using a 'callable' dictionary wrapper
# which can be directly used as a callback object.
# 

# In Python 2.2+ you may extend the 'dictionary' built-in class instead
from UserDict import UserDict 
class Xlator(UserDict):

  """ An all-in-one multiple string substitution class """ 

  def _make_regex(self): 

    """ Build a regular expression object based on the keys of
    the current dictionary """

    return re.compile("(%s)" % "|".join(map(re.escape, self.keys()))) 

  def __call__(self, mo): 
    
    """ This handler will be invoked for each regex match """

    # Count substitutions
    self.count += 1 # Look-up string

    return self[mo.string[mo.start():mo.end()]]

  def xlat(self, text): 

    """ Translate text, returns the modified text. """ 

    # Reset substitution counter
    self.count = 0 

    # Process text
    return self._make_regex().sub(self, text)

#
# Test
# 

if __name__ == "__main__": 

  text = "Larry Wall is the creator of Perl"

  dict = {
    "Larry Wall" : "Guido van Rossum",
    "creator" : "Benevolent Dictator for Life",
    "Perl" : "Python",
  } 

  print multiple_replace(dict, text) 

  xlat = Xlator(dict)
  print xlat.xlat(text)
  print "Changed %d thing(s)" % xlat.count 

Let's say you have a dictionary-based 1-to-1 mapping between strings. The keys are the set of strings (or regular expression patterns) you want to replace and their corresponding values are the strings they should be replaced by. You could actually do that by calling re.sub for each (key, value) pair in the dictionary, thus processing and creating a new copy of the whole text several times. It would be a lot better to do all the changes in a single pass, so we would need to process and create a copy of the text only once. Fortunately, re.sub's callback facility enables us to do so!

First we have to build a regular expression from our set of keys we want to match. Such a regular expression is a form of "(a1|a2|...|an)" pattern and can be easily generated using a cute oneliner (see source).

Then, instead of feeding re.sub with a replacement string, we'll invoke it with a callable object. This object will be called for any match with a re.MatchObject as its only argument. The re.sub method actually expects this callable object to return the replacement string. This callback routine has just to look-up the matched text in the dictionary and return the corresponding value.

I propose two implementations. The first one is a straighforward lambda-based and the other uses a callable dictionary-like object. This second option is better if you want to perform additional processing on each match such as counting the substitutions or implement any kind of context related complex behaviour.

16 comments

Alan Eldridge 22 years, 4 months ago  # | flag

Nice algorithm; needs one fix and one optimization. If dict has no entries, then the regex matches everything. Need to have an explicit check so as not to call re.sub() in that case.

The dict-like object is IMO the way to go. Now, if we have one of these, then we may (probably?) be using it more than once. So, we ought to keep the compiled regex around. There's no harm in keeping the text regex around either: it's nice for debugging, and we can get a further optimization out of it, too.

Here's what I came up with based on your original code:

# Original algorithm by Xavier Defrang.
# http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/81330
# This implementation by alane@sourceforge.net.

import re
import UserDict

class textsub(UserDict.UserDict):
    def __init__(self, dict = None):
        self.re = None
        self.regex = None
        UserDict.UserDict.__init__(self, dict)

    def compile(self):
        if len(self.data) > 0:
            tmp = "(%s)" % "|".join(map(re.escape,
                                        self.data.keys()))
            if self.re != tmp:
                self.re = tmp
                self.regex = re.compile(self.re)

    def __call__(self, match):
        return self.data[match.string[match.start():match.end()]]

    def sub(self, s):
        if len(self.data) == 0:
            return s
        return self.regex.sub(self, s)
Paul Hardwick 21 years, 4 months ago  # | flag

Performance tweaking..

I like this re based function a lot and was interested to compare it to a similar string replace function
I already use -- the results are interesting:

import time
vars = {'%_fred':'hello','%_dick':'world','%_freddy':'sunny'}
line = "%_fred beautiful %_dick"

def string_replace(dict,text):
    keys = dict.keys()
    keys.sort()
    for n in keys:
        if '%' not in text: break
        text = text.replace(n,dict[n])
    return text

start = time.time()
for n in range(1000):
    multiple_replace(vars,line)
print 1000/(time.time()-start)

start = time.time()
for n in range(1000):
    string_replace(vars,line)
print 1000/(time.time()-start)

On my PIII 1.1GHz machine:
multiple_replace 3831.41707853 loops/sec
string_replace 50000.0476838 loops/sec
However.. if the re.compile statement is moved out of the function
new_multiple_replace 10000.0095368 loops/sec

If you can arrange to compile outside of the routine then the re
version is best otherwise the longer string based function has the performance.
Martin Miller 21 years, 1 month ago  # | flag

Re: Performance tweaking. Actually, the

keys.sort()

shown in the string_replace() function doesn't do any thing useful and could be omitted. For the small number of keys, three, shown in the example, doing this doesn't speed up the routine very much.

However, it can be useful to sort the keys by the reverse of their string lengths, like this instead:

keys.sort(lambda a,b: len(b)-len(a))

because doing this will try matches with the longest patterns first. This can be signficant if one or more of the keys are substrings of others -- as the case with '%_fred' and '%_freddy' in the example.

If the keys are not sorted by length (either the default way, as originally shown, or not at all), the following string:

"%_freddy beautiful %_dick"

will become:

"hellody beautiful world"

instead of:

"sunny beautiful world"

The down side of the reverse length sort is that it does slow the routine down (about 50% on my machine).

All versions above might also benefit by placing a "str(...)" call around the dictionary lookups. This allow the replacement values to be numbers and other object, as well as strings. I find this useful when the replacement values are calculated.

Martin Miller 21 years, 1 month ago  # | flag

Re: Performance tweaking. Interestingly, removing the line:

if '%' not in text: break

from the string_replace() function speeds it up fairly significantly, with no detrimental effects. Another advantage to leaving the line out is that it makes the routine more general, since keys no longer all have to start with a "%" character.

Paul Hardwick 21 years ago  # | flag

Futher Tweaking... You're right of course - the sort needs to be reversed and done outside of the function if possible :D

I did some more checking... With modifed code, using the same small sample string and increasing the size of the dictionary:

keys,string,regex

20,20k,10k

40,11k,10k

200,2k,10k

2000,250,12k

As the dictionary grew > 40 items, the perfomance of the regex routine was unaffected whereas my string replace function started to perfom badly - we live and learn!

bobihope 19 years, 12 months ago  # | flag

Thanx! I used it in a script for xchat to display rhythmbox's output (http://www.schitterendedingen.com/scripts/rhy.py).. now users can easily change extensions I used it in a script for xchat to display rhythmbox's output (http://www.schitterendedingen.com/scripts/rhy.py).. now users can easily change extensions

andrew mcdonald 19 years, 4 months ago  # | flag

great compact class !! To ignore breaking spaces before the key. just need to change one line:

def compile(self):
        if len(self.data) > 0:
            tmp = r"(\b%s)" % "|" r"\b".join( map( re.escape,
                                        self.data.keys() ) )
            if self.re != tmp:
                self.re = tmp
                self.regex = re.compile(self.re)

for those who don't know:

to match breaking spaces you need to use r"\b"

so reg exp. r"\bPYTHON\b" will be matched in "I love PYTHON"

but not "PYTHON_version"

what i just realized is that the above change in compile doesn't account for

breaking spaces after the string!

it is too late to fool around with python ;>

Emanuel Rumpf 19 years, 1 month ago  # | flag

Maybe just use this.

import re
def strTr( text, dic ):
    """ Replaces keys of dic with values of dic in text. 2005-02 by Emanuel Rumpf """
    pat = "(%s)" % "|".join( map(re.escape, dic.keys())  )
    return re.sub( pat, lambda m:dic[m.group()], text )
Brian Jones 18 years, 2 months ago  # | flag

add breaking spaces after each string. This is a simple change to the code to include breaking spaces before and after each of the keys:

tmp = r"(\b%s)" % "|" r"\b".join( map( re.escape, change to: tmp = r"(\b%s\b)" % r"\b" "|" r"\b".join( map( re.escape,

David Bailey 15 years, 6 months ago  # | flag

To resolve the longest to shortest key issue in the original function multiple_replace above, I wrote this bit to replace the one-liner:

regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

With:

keys = dict.keys()
# Sort keys by length, longest first
keys.sort(lambda a,b: len(b)-len(a))
expression = []
for item in keys:
  expression.append(re.escape(item))
regex = re.compile("(%s)" % "|".join(expression))

It's not nearly so brief, and I'm sure others could find a way to condense or make clearer, but now keys which are subsets of longer keys won't get in the way.

David Bailey 15 years, 1 month ago  # | flag

This method was broken in python 2.6 and later:

Try this instead:

keys = sorted(dictobj.keys(), key=lambda a: len(a))
expression = []
for item in keys:
  expression.append(re.escape(item))
regex = re.compile("(%s)" % "|".join(expression))
David Bailey 15 years, 1 month ago  # | flag

or even better:

keys = sorted(dict.keys(), key=len)
David Bailey 15 years, 1 month ago  # | flag

Or even better:

keys = sorted(dict.keys(), key=len)
David Bailey 15 years, 1 month ago  # | flag

One other thing, don't forget to reverse the sort, so that the longest items are first.

keys = sorted(dict.keys(), key=len)
keys.reverse()
Patrick Dobbs 12 years, 7 months ago  # | flag

sorted takes a keyword argument:

keys = sorted(dict.keys(), key=len, reverse=True)

Jeff Hykin 7 years, 2 months ago  # | flag

I noticed this only worked for strings (not regex groups and repetition etc ) so I wrote some code for that.

I fixed the sorting (thanks Martin Miller) and empty dict input (thanks Alan Eldridge). There are some problems with lookahead/lookbehind/beginning/end regex. I'm not an advanced programmer, and this isn't optimizated at all.

import regex as re


def SimultaneouslyReplaceEachIn__DictOn__String ( dict_of_things_im_looking_for , text_thats_getting_changed ):
  # if the dict is empty, don't change anything
  if len(dict_of_things_im_looking_for) == 0:
    return text_thats_getting_changed

  list_of_all_of_the_things_to_find = dict_of_things_im_looking_for.keys()
  all_of_those_things_in_order = sorted(list_of_all_of_the_things_to_find, key=lambda x: len(x))
  all_of_them_in_a_regex_string = "("  + "|".join(all_of_those_things_in_order) + ")" 

  def Is__RegexIn__String(regexstring,string): 
    output = re.search(regexstring,string) != None 
    return output

  def the_corrisponding_replacement(regex_found):
    string_that_was_found = regex_found.group()
    for each_item in all_of_those_things_in_order:
      # the below line could cause problems for lookahead / lookbehind / beginning / end regex 
      if Is__RegexIn__String( each_item , string_that_was_found ):
        # if an item was found, then return it's corrisponding replace value
        return dict_of_things_im_looking_for[ each_item ] 

    print "probably an error with Simultanious Replacement"
    print "may be due lookahead/lookbehind/beginning/end regex"
    return string_that_was_found

  return re.sub( all_of_them_in_a_regex_string , the_corrisponding_replacement , text_thats_getting_changed )


########################
# Test 

source_copy = " AB = ab  ,  BA = ba "

replacements_dict = {
r'b'           : "(the letter b)" ,
r'(a|e|i|o|u)' : "(a basic vowel)" ,
}


#normally r'(a|e|i|o|u)' 
#would replace the e's in "(the letter b)" 
#And vise versa for b in "(a basic vowel)"

#this gets around this issue 

print source_copy 
print SimultaneouslyReplaceEachIn__DictOn__String( replacements_dict , source_copy )

Here is the same function, condensed:

def SimulSub ( dict_ , string_ ):
  if len(dict_) == 0:
    return string_
  def repl_(regex_):
    match_ = regex_.group()
    for x in (sorted(dict_.keys(), key=lambda x: len(x))):
      # the below line could cause problems for lookahead / lookbehind / beginning / end regex 
      if (re.search( x , match_ ) != None): return dict_[ x ] 
    print "error with SimulSub"
    print "check lookahead/lookbehind/beginning/end regex"
    return match_
  return re.sub( "(" + "|".join(sorted(dict_.keys(), key=lambda x: len(x))) + ")"  , repl_ , string_ )