Welcome, guest | Sign In | My Account | Store | Cart
1

Was doing some work with strings and threw this together. This will calculate the Hamming distance (or number of differences) between two strings of the same length.

Python, 8 lines
 ```1 2 3 4 5 6 7 8``` ```def hamdist(str1, str2): """Count the # of differences between equal length strings str1 and str2""" diffs = 0 for ch1, ch2 in zip(str1, str2): if ch1 != ch2: diffs += 1 return diffs ```

This would be useful more for quick and dirty bioinformatics sequence analysis. (eg. panning for motifs in a set of sequences and you want to allow your hits to have some degeneracy)

bearophile - 9 years, 7 months ago

Faster (on Py2.5) and uses less memory

from itertools import izip

def hamming1(str1, str2):

____"""hamming1(str1, str2): Hamming distance. Count the number of differences

____between equal length strings str1 and str2."""

____# Do not use Psyco

____assert len(str1) == len(str2)

____return sum(c1 != c2 for c1, c2 in izip(str1, str2))

Andrew Dalke 9 years, 7 months ago

alternative using imap. I've not timed it but this feels like it should be yet faster

``````import itertools

def hamming1(str1, str2):
return sum(itertools.imap(str.__ne__, str1, str2))
``````

Like bearophile's it does not build an intermediate list. This version uses one generator instead of two and cheats a bit by using the underlying string comparison directly rather than the Python expression.

bearophile - 9 years, 7 months ago

With Python 2.5 it seems my version is a bit faster (and it works with unicode too).

Andrew Dalke 9 years, 7 months ago

str.__ne__ is the slow part. The slow part was, surprisingly, the str.__ne__ call. Try this instead

``````def hamming2(str1, str2):
assert len(str1) == len(str2)
#ne = str.__ne__  ## this is surprisingly slow
ne = operator.ne
return sum(imap(ne, str1, str2))
``````

I found it was about 5% faster than using != in bearophile's example. Here's my test code.

``````import time
from itertools import izip, imap
import operator

def hamdist(str1, str2):
"""Count the # of differences between equal length strings str1 and str2"""

diffs = 0
for ch1, ch2 in zip(str1, str2):
if ch1 != ch2:
diffs += 1
return diffs

def bearophile(str1, str2):
"""hamming1(str1, str2): Hamming distance. Count the number of differences
between equal length strings str1 and str2."""
# Do not use Psyco
assert len(str1) == len(str2)
return sum(c1 != c2 for c1, c2 in izip(str1, str2))

def dalke(str1, str2):
assert len(str1) == len(str2)
#ne = str.__ne__  ## this is surprisingly slow
ne = operator.ne
return sum(imap(ne, str1, str2))

def go(i):
str1 = "A" * i
str2 = str1[:i//2] + "B" + str1[i//2+1:]

print i,
for func in (hamdist, bearophile, dalke):
t1 = time.time()
func(str1, str2)
func(str1, str2)
func(str1, str2)
func(str1, str2)
func(str1, str2)
func(str1, str2)
func(str1, str2)
func(str1, str2)
func(str1, str2)
func(str1, str2)
t2 = time.time()
print t2-t1,
print

go(10)
go(100)
go(1000)
go(10000)
go(100000)
``````

and the output under a pre-release version of Python 2.5 (order is original, bearophile, dalke, and I've cleaned up the output)

``````        10 0.00025 0.00028 0.00044
100 0.0013   0.0016   0.0013
1000 0.024    0.018      0.013
10000 0.22    0.145        0.132
100000 3.4    1.47           1.34
``````
bearophile - 9 years, 7 months ago

Very good, testing is the best way to find the truth in science too :-) Note that this has the same speed because imap takes the address of the operator.ne object anyway:

return sum(imap(operator.ne, str1, str2))

Andrew Dalke 9 years, 7 months ago

another solution using numarray. In some cases (eg, pure biological sequences with no need for unicode support) it may be better to use a numeric array rather than a Python string as the computer representation. I found that

``````def numeric_hamming4(num1, num2):
assert len(num1) == len(num2)
return numarray.sum(num1 != num2)
``````

using a numarray array was faster than my previous example using stock Python.

``````from itertools import imap
import operator
import numarray

def hamming3(str1, str2):
assert len(str1) == len(str2)
return sum(imap(operator.ne, str1, str2))

def numeric_hamming4(num1, num2):
assert len(num1) == len(num2)
return numarray.sum(num1 != num2)

# See http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/392115
def makeSigDigs(num, digits, debug=False):
...  ## not included here; only used to make pretty output
return sig_string

def go(i):
str1 = "A" * i
str2 = str1[:i//2] + "B" + str1[i//2+1:]

print i,

t1 = time.time()
assert hamming3(str1, str2) == 1
hamming3(str1, str2)
hamming3(str1, str2)
hamming3(str1, str2)
hamming3(str1, str2)
hamming3(str1, str2)
hamming3(str1, str2)
hamming3(str1, str2)
hamming3(str1, str2)
hamming3(str1, str2)
t2 = time.time()
print makeSigDigs(t2-t1, 2),

num1 = numarray.array(str1)
num2 = numarray.array(str2)
t1 = time.time()
assert numeric_hamming4(num1, num2) == 1
numeric_hamming4(num1, num2)
numeric_hamming4(num1, num2)
numeric_hamming4(num1, num2)
numeric_hamming4(num1, num2)
numeric_hamming4(num1, num2)
numeric_hamming4(num1, num2)
numeric_hamming4(num1, num2)
numeric_hamming4(num1, num2)
numeric_hamming4(num1, num2)
t2 = time.time()
print makeSigDigs(t2-t1, 2)

go(10)
go(100)
go(1000)
go(10000)
go(100000)
go(1000000)
``````

The result was timing numbers like

``````      10  0.00025 0.0024
100  0.0013  0.0011
1000  0.038   0.0016
10000  0.17    0.0024
100000  1.8     0.039
1000000 21.     0.28
``````

That is, with Numeric most of the nontrivial overhead is in function call setup. but with strings over about 100 characters it's much faster.

 Created by Michael Mayhew on Mon, 11 Dec 2006 (PSF)

### Tags

▶ Show machine tags (4)

### Required Modules

• (none specified)