Function to generate soundex code for any string (usually a name). Conforms to Knuth's algorithm and the common Perl implementation.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | def soundex(name, len=4):
""" soundex module conforming to Knuth's algorithm
implementation 2000-12-24 by Gregory Jorgensen
public domain
"""
# digits holds the soundex values for the alphabet
digits = '01230120022455012623010202'
sndx = ''
fc = ''
# translate alpha chars in name to soundex digits
for c in name.upper():
if c.isalpha():
if not fc: fc = c # remember first letter
d = digits[ord(c)-ord('A')]
# duplicate consecutive soundex digits are skipped
if not sndx or (d != sndx[-1]):
sndx += d
# replace first digit with first alpha character
sndx = fc + sndx[1:]
# remove all 0s from the soundex code
sndx = sndx.replace('0','')
# return soundex code padded to len characters
return (sndx + (len * '0'))[:len]
|
Tags: algorithms
Warning: This is designed for English names. Warning: This algorithm (by Odell and Russell, as reported in Knuth) is designed for English language surnames. If you have a significant number of non-English surnames, you might do well to alter the values in digits to improve your matches. For example, to accomodate a large number of Spanish surname data, you should count 'J' and 'L' ('L' because of the way 'll' is used) as vowels, setting their position in digit to '0'.
The basic assumptions of Soundex are that the consonants are more important than the vowels, and that the consonants are grouped into "confusable" groups. Coming up with a set of confusables for a language is not horribly tough, but remember: each group should contain all letters that are confusable with any of those in the group. a slightly better code for both English and Spanish names has digits = '01230120002055012623010202'.