Hamming distance « Python recipes

Was doing some work with strings and threw this together. This will calculate the Hamming distance (or number of differences) between two strings of the same length.

      def hamdist(str1, str2):
   """Count the # of differences between equal length strings str1 and str2"""
        
        diffs = 0
        for ch1, ch2 in zip(str1, str2):
                if ch1 != ch2:
                        diffs += 1
        return diffs

      

This would be useful more for quick and dirty bioinformatics sequence analysis. (eg. panning for motifs in a set of sequences and you want to allow your hits to have some degeneracy)

Tags: text

6 comments

bearophile - 17 years, 4 months ago # | flag

Faster (on Py2.5) and uses less memory

from itertools import izip

def hamming1(str1, str2):

____"""hamming1(str1, str2): Hamming distance. Count the number of differences

____between equal length strings str1 and str2."""

____# Do not use Psyco

____assert len(str1) == len(str2)

____return sum(c1 != c2 for c1, c2 in izip(str1, str2))

Andrew Dalke 17 years, 4 months ago # | flag

alternative using imap. I've not timed it but this feels like it should be yet faster

import itertools

def hamming1(str1, str2):
  return sum(itertools.imap(str.__ne__, str1, str2))

Like bearophile's it does not build an intermediate list. This version uses one generator instead of two and cheats a bit by using the underlying string comparison directly rather than the Python expression.

bearophile - 17 years, 4 months ago # | flag

With Python 2.5 it seems my version is a bit faster (and it works with unicode too).

Andrew Dalke 17 years, 4 months ago # | flag

str.__ne__ is the slow part. The slow part was, surprisingly, the str.__ne__ call. Try this instead

def hamming2(str1, str2):
    assert len(str1) == len(str2)
    #ne = str.__ne__  ## this is surprisingly slow
    ne = operator.ne
    return sum(imap(ne, str1, str2))

I found it was about 5% faster than using != in bearophile's example. Here's my test code.

import time
from itertools import izip, imap
import operator

def hamdist(str1, str2):
   """Count the # of differences between equal length strings str1 and str2"""

   diffs = 0
   for ch1, ch2 in zip(str1, str2):
       if ch1 != ch2:
           diffs += 1
   return diffs


def bearophile(str1, str2):
    """hamming1(str1, str2): Hamming distance. Count the number of differences
    between equal length strings str1 and str2."""
    # Do not use Psyco
    assert len(str1) == len(str2)
    return sum(c1 != c2 for c1, c2 in izip(str1, str2))

def dalke(str1, str2):
    assert len(str1) == len(str2)
    #ne = str.__ne__  ## this is surprisingly slow
    ne = operator.ne
    return sum(imap(ne, str1, str2))

def go(i):
    str1 = "A" * i
    str2 = str1[:i//2] + "B" + str1[i//2+1:]

    print i,
    for func in (hamdist, bearophile, dalke):
        t1 = time.time()
        func(str1, str2)
        func(str1, str2)
        func(str1, str2)
        func(str1, str2)
        func(str1, str2)
        func(str1, str2)
        func(str1, str2)
        func(str1, str2)
        func(str1, str2)
        func(str1, str2)
        t2 = time.time()
        print t2-t1,
    print


go(10)
go(100)
go(1000)
go(10000)
go(100000)

and the output under a pre-release version of Python 2.5 (order is original, bearophile, dalke, and I've cleaned up the output)

        10 0.00025 0.00028 0.00044
      100 0.0013   0.0016   0.0013
    1000 0.024    0.018      0.013
  10000 0.22    0.145        0.132
100000 3.4    1.47           1.34

bearophile - 17 years, 4 months ago # | flag

Very good, testing is the best way to find the truth in science too :-) Note that this has the same speed because imap takes the address of the operator.ne object anyway:

return sum(imap(operator.ne, str1, str2))

Andrew Dalke 17 years, 4 months ago # | flag

another solution using numarray. In some cases (eg, pure biological sequences with no need for unicode support) it may be better to use a numeric array rather than a Python string as the computer representation. I found that

def numeric_hamming4(num1, num2):
    assert len(num1) == len(num2)
    return numarray.sum(num1 != num2)

using a numarray array was faster than my previous example using stock Python.

from itertools import imap
import operator
import numarray

def hamming3(str1, str2):
    assert len(str1) == len(str2)
    return sum(imap(operator.ne, str1, str2))

def numeric_hamming4(num1, num2):
    assert len(num1) == len(num2)
    return numarray.sum(num1 != num2)

# See http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/392115
def makeSigDigs(num, digits, debug=False):
    ...  ## not included here; only used to make pretty output
    return sig_string

def go(i):
    str1 = "A" * i
    str2 = str1[:i//2] + "B" + str1[i//2+1:]

    print i,

    t1 = time.time()
    assert hamming3(str1, str2) == 1
    hamming3(str1, str2)
    hamming3(str1, str2)
    hamming3(str1, str2)
    hamming3(str1, str2)
    hamming3(str1, str2)
    hamming3(str1, str2)
    hamming3(str1, str2)
    hamming3(str1, str2)
    hamming3(str1, str2)
    t2 = time.time()
    print makeSigDigs(t2-t1, 2),

    num1 = numarray.array(str1)
    num2 = numarray.array(str2)
    t1 = time.time()
    assert numeric_hamming4(num1, num2) == 1
    numeric_hamming4(num1, num2)
    numeric_hamming4(num1, num2)
    numeric_hamming4(num1, num2)
    numeric_hamming4(num1, num2)
    numeric_hamming4(num1, num2)
    numeric_hamming4(num1, num2)
    numeric_hamming4(num1, num2)
    numeric_hamming4(num1, num2)
    numeric_hamming4(num1, num2)
    t2 = time.time()
    print makeSigDigs(t2-t1, 2)

go(10)
go(100)
go(1000)
go(10000)
go(100000)
go(1000000)

The result was timing numbers like

      10  0.00025 0.0024
     100  0.0013  0.0011
    1000  0.038   0.0016
   10000  0.17    0.0024
 100000  1.8     0.039
1000000 21.     0.28

That is, with Numeric most of the nontrivial overhead is in function call setup. but with strings over about 100 characters it's much faster.

◄	Python recipes (4591)	►
◄	Michael Mayhew's recipes (1)	►

Hamming distance (Python recipe) by Michael Mayhew
ActiveState Code (http://code.activestate.com/recipes/499304/)

6 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

Hamming distance (Python recipe) by Michael Mayhew ActiveState Code (http://code.activestate.com/recipes/499304/)

6 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

Hamming distance (Python recipe) by Michael Mayhew
ActiveState Code (http://code.activestate.com/recipes/499304/)