Was doing some work with strings and threw this together. This will calculate the Hamming distance (or number of differences) between two strings of the same length.
def hamdist(str1, str2):
    """Count the # of differences between equal length strings str1 and str2"""
    diffs = 0
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
    return diffs
This is mostly useful for quick and dirty bioinformatics sequence analysis (e.g., when you are panning for motifs in a set of sequences and want to allow your hits some degeneracy).
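As a rough illustration of the motif-panning use case, here is a minimal sketch that slides a window along a sequence and keeps every window within a given number of mismatches of the motif. The helper name find_motif_hits, its parameters, and the example sequence are made up for illustration; the recipe itself only provides the distance function. Written for Python 3.

```python
def hamdist(str1, str2):
    """Count the positions at which two equal-length strings differ."""
    if len(str1) != len(str2):
        raise ValueError("strings must have the same length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(str1, str2))


def find_motif_hits(sequence, motif, max_mismatches):
    """Return (offset, window, distance) for each window of `sequence`
    that differs from `motif` in at most `max_mismatches` positions.

    Hypothetical helper, not part of the original recipe.
    """
    width = len(motif)
    hits = []
    # Slide a motif-sized window over every possible start offset.
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        distance = hamdist(window, motif)
        if distance <= max_mismatches:
            hits.append((offset, window, distance))
    return hits


# One exact hit and one hit with a single mismatch:
hits = find_motif_hits("GGTATAAGGTACAAGG", "TATAA", max_mismatches=1)
# hits == [(2, "TATAA", 0), (9, "TACAA", 1)]
```

This does one full comparison per window, so it is only meant for the quick and dirty case described above.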
Faster (on Py2.5) and uses less memory
from itertools import izip

def hamming1(str1, str2):
    """hamming1(str1, str2): Hamming distance. Count the number of differences
    between equal length strings str1 and str2."""
    # Do not use Psyco
    assert len(str1) == len(str2)
    return sum(c1 != c2 for c1, c2 in izip(str1, str2))
Alternative using imap. I've not timed it, but this feels like it should be faster still.
Like bearophile's it does not build an intermediate list. This version uses one generator instead of two and cheats a bit by using the underlying string comparison directly rather than the Python expression.
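The code for this comment is not shown here. Going only by the description (a single lazy mapping, calling the underlying string comparison directly), it may have looked roughly like the sketch below. This is a reconstruction under that assumption, not the commenter's actual code. It is written for Python 3, where the built-in map is lazy in the way itertools.imap was in Python 2; the name hamming_imap is hypothetical.

```python
def hamming_imap(str1, str2):
    """Hamming distance via map() and the unbound str.__ne__ method."""
    assert len(str1) == len(str2)
    # map() walks both strings in parallel, so no zip() and no generator
    # expression are needed; each call yields True (counted as 1) or False.
    return sum(map(str.__ne__, str1, str2))


distance = hamming_imap("karolin", "kathrin")  # 3 positions differ
```

Because it calls str.__ne__ directly, this version only accepts str arguments.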
With Python 2.5 it seems my version is a bit faster (and it works with unicode too).
str.__ne__ is the slow part. Surprisingly, the slow part was the str.__ne__ call. Try this instead:
I found it was about 5% faster than using != in bearophile's example. Here's my test code.
And the output under a pre-release version of Python 2.5 (the order is original, bearophile, dalke; I've cleaned up the output):
Very good, testing is the best way to find the truth in science too :-) Note that this has the same speed because imap takes the address of the operator.ne object anyway:
return sum(imap(operator.ne, str1, str2))
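For context, here is how that one-liner might sit in a complete function, next to the generator-expression version and a small timing helper so the two can be compared on your own machine. This is a sketch for Python 3 (built-in map and zip in place of imap and izip); the names hamming_ne, hamming_genexp and compare_speed are made up for illustration, and no particular timing result is implied.

```python
import operator
import timeit


def hamming_ne(str1, str2):
    """Hamming distance using operator.ne through map()."""
    assert len(str1) == len(str2)
    return sum(map(operator.ne, str1, str2))


def hamming_genexp(str1, str2):
    """Hamming distance using != inside a generator expression."""
    assert len(str1) == len(str2)
    return sum(c1 != c2 for c1, c2 in zip(str1, str2))


def compare_speed(length=1000, repeats=200):
    """Return {function name: seconds} for `repeats` calls of each variant
    on two strings of `length` characters that differ at every position."""
    a = "A" * length
    b = "C" * length
    return {
        func.__name__: timeit.timeit(lambda: func(a, b), number=repeats)
        for func in (hamming_ne, hamming_genexp)
    }
```

Calling compare_speed() returns the raw timings; how the two rank will depend on the interpreter version.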
Another solution using numarray. In some cases (e.g., pure biological sequences with no need for unicode support) it may be better to use a numeric array rather than a Python string as the computer representation. I found that using a numarray array was faster than my previous example using stock Python.
The result was timing numbers like
That is, with Numeric most of the nontrivial overhead is in function call setup, but with strings over about 100 characters it's much faster.
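The numeric-array code itself is not shown here. As a rough sketch of the same idea, the version below uses NumPy, which later replaced both Numeric and numarray. It is a swapped-in library, not the commenter's code, and it assumes pure ASCII sequences. The name hamming_numpy is hypothetical.

```python
import numpy as np


def hamming_numpy(seq1, seq2):
    """Hamming distance of two ASCII strings via byte arrays."""
    # View each sequence as an array of byte values (one per character).
    a = np.frombuffer(seq1.encode("ascii"), dtype=np.uint8)
    b = np.frombuffer(seq2.encode("ascii"), dtype=np.uint8)
    if a.shape != b.shape:
        raise ValueError("sequences must have the same length")
    # a != b is an element-wise boolean array; count its True entries.
    return int(np.count_nonzero(a != b))


distance = hamming_numpy("GATTACA", "GACTATA")  # 2 positions differ
```

If the sequences are stored as arrays from the start, the encode step disappears and only the comparison and the count remain, which fits the point above about fixed call overhead mattering most for short strings.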