This recipe shows how to use the Python standard re module to perform single-pass multiple string substitution using a dictionary.
# Tested with Python 2.1
from __future__ import nested_scopes
import re

#
# The simplest, lambda-based implementation
#
def multiple_replace(dict, text):
    """ Replace in 'text' all occurrences of any key in the given
    dictionary by its corresponding value.  Returns the new string. """
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

#
# You may combine both the dictionary and the search-and-replace logic
# into a single object using a 'callable' dictionary wrapper
# which can be used directly as a callback object.
#
# In Python 2.2+ you may extend the 'dictionary' built-in class instead
from UserDict import UserDict

class Xlator(UserDict):
    """ An all-in-one multiple string substitution class """
    def _make_regex(self):
        """ Build a regular expression object based on the keys of
        the current dictionary """
        return re.compile("(%s)" % "|".join(map(re.escape, self.keys())))

    def __call__(self, mo):
        """ This handler will be invoked for each regex match """
        # Count substitutions
        self.count += 1
        # Look up the matched string in the dictionary
        return self[mo.string[mo.start():mo.end()]]

    def xlat(self, text):
        """ Translate text, returns the modified text. """
        # Reset substitution counter
        self.count = 0
        # Process text
        return self._make_regex().sub(self, text)

#
# Test
#
if __name__ == "__main__":
    text = "Larry Wall is the creator of Perl"
    dict = {
        "Larry Wall" : "Guido van Rossum",
        "creator" : "Benevolent Dictator for Life",
        "Perl" : "Python",
    }

    print multiple_replace(dict, text)

    xlat = Xlator(dict)
    print xlat.xlat(text)
    print "Changed %d thing(s)" % xlat.count
Let's say you have a dictionary-based 1-to-1 mapping between strings. The keys are the set of strings (or regular expression patterns) you want to replace and their corresponding values are the strings they should be replaced by. You could actually do that by calling re.sub for each (key, value) pair in the dictionary, thus processing and creating a new copy of the whole text several times. It would be a lot better to do all the changes in a single pass, so we would need to process and create a copy of the text only once. Fortunately, re.sub's callback facility enables us to do so!
First we have to build a regular expression from the set of keys we want to match. Such a regular expression has the form "(a1|a2|...|an)" and can easily be generated with a cute one-liner (see source).
Then, instead of feeding re.sub a replacement string, we invoke it with a callable object. This object is called for each match, with a MatchObject as its only argument, and re.sub expects it to return the replacement string. The callback therefore just has to look up the matched text in the dictionary and return the corresponding value.
I propose two implementations. The first is a straightforward lambda-based one; the other uses a callable dictionary-like object. The second option is better if you want to perform additional processing on each match, such as counting the substitutions or implementing any kind of complex, context-related behaviour.
Nice algorithm; needs one fix and one optimization. If dict has no entries, then the regex matches everything. Need to have an explicit check so as not to call re.sub() in that case.
The dict-like object is IMO the way to go. Now, if we have one of these, then we may (probably?) be using it more than once. So, we ought to keep the compiled regex around. There's no harm in keeping the text regex around either: it's nice for debugging, and we can get a further optimization out of it, too.
Here's what I came up with based on your original code:
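The code that followed this comment isn't reproduced here; a sketch of the idea (cache the pattern text and the compiled regex when the object is built, and skip re.sub() entirely for an empty mapping; the name CachedXlator is made up for illustration) might look like this:

from UserDict import UserDict
import re

class CachedXlator(UserDict):
    """ Illustrative sketch: build the regex once, keep its pattern text
        around for debugging, and guard against an empty mapping. """
    def __init__(self, dict=None):
        UserDict.__init__(self, dict)
        self.count = 0
        # Keep the pattern text (nice for debugging) ...
        self.pattern = "(%s)" % "|".join(map(re.escape, self.keys()))
        # ... and compile it once instead of on every xlat() call
        self.regex = re.compile(self.pattern)
    def __call__(self, mo):
        self.count += 1
        return self[mo.group(0)]
    def xlat(self, text):
        self.count = 0
        if not self.keys():
            # An empty mapping would give the pattern "()", which matches
            # the empty string everywhere, so don't call sub() at all
            return text
        return self.regex.sub(self, text)

One trade-off of caching: if the mapping is changed after construction, the stored regex has to be rebuilt by hand.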
Performance tweaking..
Re: Performance tweaking. Actually, the sorting shown in the string_replace() function doesn't do anything useful and could be omitted. For the small number of keys, three, shown in the example, doing this doesn't speed up the routine very much.
However, it can be useful to sort the keys by the reverse of their string lengths, like this instead:
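The exact line from this comment is not shown above; one way to express that sort (assuming Python 2.4+ for sorted() and its key argument) would be:

# Longest keys first, so a long key is tried before any key that is its prefix
keys = sorted(dict.keys(), key=len, reverse=True)
regex = re.compile("(%s)" % "|".join(map(re.escape, keys)))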
because doing this will try matches with the longest patterns first. This can be significant if one or more of the keys are substrings of others -- as is the case with '%_fred' and '%_freddy' in the example.
If the keys are not sorted by length (either the default way, as originally shown, or not at all), a string containing the longer key ends up with only the shorter key's text replaced, leaving the tail of the longer key behind, instead of the whole key being replaced; the snippet below illustrates the difference.
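The original example strings did not survive on this page; the following hypothetical keys show the effect being described:

import re

d = {"%_fred": "FRED", "%_freddy": "FREDDY"}

# Shorter key first in the alternation: only the prefix gets replaced
rx = re.compile("(%s)" % "|".join(map(re.escape, ["%_fred", "%_freddy"])))
print rx.sub(lambda mo: d[mo.group(0)], "hello %_freddy")   # hello FREDdy

# Longest key first: the whole key is replaced, as intended
rx = re.compile("(%s)" % "|".join(map(re.escape, ["%_freddy", "%_fred"])))
print rx.sub(lambda mo: d[mo.group(0)], "hello %_freddy")   # hello FREDDY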
The down side of the reverse length sort is that it does slow the routine down (about 50% on my machine).
All versions above might also benefit from placing a "str(...)" call around the dictionary lookups. This allows the replacement values to be numbers and other objects, as well as strings. I find this useful when the replacement values are calculated.
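A minimal sketch of that tweak applied to the lambda-based version (the function name multiple_replace_any is made up):

import re

def multiple_replace_any(dict, text):
    """ Like multiple_replace, but str()-wraps the looked-up value so the
        dictionary may map keys to numbers or other computed objects. """
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))
    return regex.sub(lambda mo: str(dict[mo.group(0)]), text)

print multiple_replace_any({"answer": 42, "pi": 3.14159}, "answer is not pi")
# -> "42 is not 3.14159"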
Re: Performance tweaking. Interestingly, removing the line:
from the string_replace() function speeds it up fairly significantly, with no detrimental effects. Another advantage to leaving the line out is that it makes the routine more general, since keys no longer all have to start with a "%" character.
Further Tweaking... You're right of course - the sort needs to be reversed and done outside of the function if possible :D
I did some more checking... With the modified code, using the same small sample string and increasing the size of the dictionary:
keys    string    regex
  20     20k       10k
  40     11k       10k
 200      2k       10k
2000     250       12k
As the dictionary grew beyond 40 items, the performance of the regex routine was unaffected, whereas my string replace function started to perform badly - we live and learn!
Thanx! I used it in a script for xchat to display rhythmbox's output (http://www.schitterendedingen.com/scripts/rhy.py).. now users can easily change extensions
Great compact class!! To make a key match only when it starts at a word boundary, you just need to change the regex-building line (see below).
For those who don't know: to anchor a match at a word boundary you use r"\b",
so the regex r"\bPYTHON\b" will match in "I love PYTHON"
but not in "PYTHON_version".
What I just realized is that the above change to the compile doesn't account for
a word boundary after the string!
It is too late to fool around with Python ;>
Maybe just use this:
add a word boundary after each string as well. This is a simple change to the code to include word boundaries before and after each of the keys:
tmp = r"(\b%s)" % "|" r"\b".join(map(re.escape, self.keys()))

change to:

tmp = r"(\b%s\b)" % r"\b" "|" r"\b".join(map(re.escape, self.keys()))
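Putting the whole suggestion together as a standalone sketch (the function name is invented, and it assumes every key begins and ends with a word character, since r"\b" behaves unexpectedly otherwise):

import re

def multiple_replace_words(dict, text):
    """ One-pass replacement that only matches keys appearing as whole
        words, i.e. with a word boundary on both sides. """
    tmp = r"(\b%s\b)" % r"\b" "|" r"\b".join(map(re.escape, dict.keys()))
    return re.compile(tmp).sub(lambda mo: dict[mo.group(0)], text)

print multiple_replace_words({"PYTHON": "Python"},
                             "I love PYTHON but not PYTHON_version")
# -> "I love Python but not PYTHON_version"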
To resolve the longest-to-shortest key issue in the original multiple_replace function above, I wrote a bit of code that replaces the regex-building one-liner with something that tries the longer keys first (see the sketch after this comment).
It's not nearly so brief, and I'm sure others could find a way to condense or make clearer, but now keys which are subsets of longer keys won't get in the way.
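The commenter's actual code is not reproduced above; assuming it simply sorted the keys longest-first before building the pattern, a not-so-brief version might read:

import re

def multiple_replace(dict, text):
    """ Same idea as before, but longer keys are tried before their prefixes. """
    keys = dict.keys()
    # Python 2 cmp-style sort: longest key first
    keys.sort(lambda a, b: cmp(len(b), len(a)))
    escaped = [re.escape(k) for k in keys]
    regex = re.compile("(%s)" % "|".join(escaped))
    return regex.sub(lambda mo: dict[mo.group(0)], text)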
This method was broken in python 2.6 and later:
Try this instead:
or even better:
Or even better:
One other thing, don't forget to reverse the sort, so that the longest items are first.
sorted takes a keyword argument:
keys = sorted(dict.keys(), key=len, reverse=True)
I noticed this only worked for strings (not regex groups, repetition, etc.) so I wrote some code for that.
I fixed the sorting (thanks Martin Miller) and the empty dict input (thanks Alan Eldridge). There are some problems with lookahead/lookbehind/beginning/end regexes. I'm not an advanced programmer, and this isn't optimized at all.
Here is the same function, condensed:
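The code that followed this comment is not shown; as a rough illustration of the idea only (keys treated as regular-expression patterns rather than literal strings, with longest-first sorting and an empty-dict guard), a sketch could look like this, with names and details that are mine, not the commenter's:

import re

def multiple_replace_patterns(replacements, text):
    """ One-pass substitution where the keys of 'replacements' are regex
        patterns (groups, repetition, ...) instead of literal strings. """
    if not replacements:
        return text                      # empty mapping: nothing to do
    # Longer patterns first, so overlapping alternatives behave predictably
    patterns = sorted(replacements.keys(), key=len, reverse=True)
    # Wrap each pattern in its own group and remember its group number;
    # user patterns may contain groups of their own, which shifts the numbering
    group_values = {}
    parts, next_group = [], 1
    for pat in patterns:
        group_values[next_group] = replacements[pat]
        parts.append("(%s)" % pat)
        next_group += re.compile(pat).groups + 1
    combined = re.compile("|".join(parts))
    def lookup(mo):
        # Find the wrapping group that actually matched and return its value
        for number, value in group_values.items():
            if mo.group(number) is not None:
                return value
        return mo.group(0)               # defensive fallback
    return combined.sub(lookup, text)

print multiple_replace_patterns({r"\d{4}-\d{2}-\d{2}": "<DATE>", r"\d+": "<NUM>"},
                                "born 1999-12-31, age 25")
# -> "born <DATE>, age <NUM>"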