Function to generate soundex code for any string (usually a name). Conforms to Knuth's algorithm and the common Perl implementation.

Python, 28 lines
def soundex(name, len=4):
    """ soundex module conforming to Knuth's algorithm
        implementation 2000-12-24 by Gregory Jorgensen
        public domain
    """

    # digits holds the soundex values for the alphabet
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''

    # translate alpha chars in name to soundex digits
    for c in name.upper():
        if 'A' <= c <= 'Z':   # ASCII letters only; digits has just 26 entries
            if not fc: fc = c   # remember first letter
            d = digits[ord(c)-ord('A')]
            # duplicate consecutive soundex digits are skipped
            if not sndx or (d != sndx[-1]):
                sndx += d

    # replace first digit with first alpha character
    sndx = fc + sndx[1:]

    # remove all 0s from the soundex code
    sndx = sndx.replace('0','')

    # return soundex code padded to len characters
    return (sndx + (len * '0'))[:len]
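
To make the four steps of the function easier to follow, here is a minimal sketch that returns every intermediate string instead of only the final code. The helper name soundex_stages and its length parameter are hypothetical and are not part of the recipe; the digits table is copied from the function above, and non-ASCII letters are assumed to be skipped.

```python
def soundex_stages(name, length=4):
    """Hypothetical helper: return the intermediate strings of the recipe.

    Returns a tuple (raw, lettered, stripped, padded).
    """
    # same table as the recipe: one soundex digit per letter A..Z
    digits = '01230120022455012623010202'
    raw = ''
    first = ''

    # step 1: translate letters to digits, skipping consecutive duplicates
    for c in name.upper():
        if 'A' <= c <= 'Z':          # assumption: ignore non-ASCII letters
            if not first:
                first = c
            d = digits[ord(c) - ord('A')]
            if not raw or d != raw[-1]:
                raw += d

    # step 2: replace the first digit with the first letter
    lettered = first + raw[1:]
    # step 3: drop the vowel/ignored-letter placeholders
    stripped = lettered.replace('0', '')
    # step 4: pad with zeros and truncate to the requested length
    padded = (stripped + length * '0')[:length]
    return raw, lettered, stripped, padded


for sample in ('Robert', 'Ashcraft', 'Lloyd'):
    print(sample, soundex_stages(sample))
# Robert   ('601063', 'R01063', 'R163', 'R163')
# Ashcraft ('02026013', 'A2026013', 'A22613', 'A226')
# Lloyd    ('403', 'L03', 'L3', 'L300')
```

Note that 'Ashcraft' comes out as A226 here because 'H' is encoded as 0 and therefore separates the 'S' and the 'C'. Soundex variants that ignore 'H' and 'W' entirely would give A261 instead.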

1 comment

Scott David Daniels 22 years, 9 months ago

Warning: This is designed for English names.

This algorithm (by Odell and Russell, as reported in Knuth) is designed for English-language surnames. If you have a significant number of non-English surnames, you might do well to alter the values in digits to improve your matches. For example, to accommodate a large number of Spanish surnames, you should count 'J' and 'L' ('L' because of the way 'll' is used) as vowels, setting their positions in digits to '0'.

The basic assumptions of Soundex are that the consonants are more important than the vowels, and that the consonants are grouped into "confusable" groups. Coming up with a set of confusables for a language is not horribly tough, but remember: each group should contain all letters that are confusable with any of those in the group. A slightly better code for both English and Spanish names has digits = '01230120002055012623010202'.
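
One way to experiment with confusable groups is to build the digits string from the groups rather than editing it by hand. The sketch below is only an illustration: digits_from_groups, soundex_with_table, and the two group dictionaries are hypothetical names, and the misspellings 'Castiyo' and 'Mehia' are made-up test inputs, not real data.

```python
def digits_from_groups(groups):
    """Build a 26-character digits table from {digit: letters}.

    Letters not listed in any group are treated as vowels ('0').
    """
    table = ['0'] * 26
    for digit, letters in groups.items():
        for ch in letters.upper():
            table[ord(ch) - ord('A')] = digit
    return ''.join(table)


def soundex_with_table(name, digits, length=4):
    """Same steps as the recipe, with the digits table passed in."""
    sndx = ''
    fc = ''
    for c in name.upper():
        if 'A' <= c <= 'Z':          # assumption: ignore non-ASCII letters
            if not fc:
                fc = c
            d = digits[ord(c) - ord('A')]
            if not sndx or d != sndx[-1]:
                sndx += d
    sndx = (fc + sndx[1:]).replace('0', '')
    return (sndx + length * '0')[:length]


# the groups behind the recipe's original table
ENGLISH_GROUPS = {'1': 'BFPV', '2': 'CGJKQSXZ', '3': 'DT',
                  '4': 'L', '5': 'MN', '6': 'R'}
# 'J' and 'L' dropped from their groups, so they count as vowels
SPANISH_GROUPS = {'1': 'BFPV', '2': 'CGKQSXZ', '3': 'DT',
                  '5': 'MN', '6': 'R'}

english = digits_from_groups(ENGLISH_GROUPS)   # '01230120022455012623010202'
spanish = digits_from_groups(SPANISH_GROUPS)   # '01230120002055012623010202'

for a, b in (('Castillo', 'Castiyo'), ('Mejia', 'Mehia')):
    print(a, b,
          soundex_with_table(a, english), soundex_with_table(b, english),
          soundex_with_table(a, spanish), soundex_with_table(b, spanish))
# Castillo Castiyo C234 C230 C230 C230
# Mejia Mehia M200 M000 M000 M000
```

With the original table each pair gets two different codes; with 'J' and 'L' treated as vowels both spellings collapse to the same code, which is the kind of improved matching described above.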

Created by Greg Jorgensen on Tue, 6 Mar 2001 (PSF)