calculates the Levenshtein-Distance of two strings.
see: http://en.wikipedia.org/wiki/Levenshtein_distance (english) and: http://de.wikipedia.org/wiki/Levenshtein-Distanz (deutsch)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | ''' Calculates the Levenshtein distance of 2 strings'''
def printMatrix(m):
print ' '
for line in m:
spTupel = ()
breite = len(line)
for column in line:
spTupel = spTupel + (column, )
print "%3i"*breite % spTupel
s1 = raw_input('first word: ')
s2 = raw_input('second word: ')
def levenshtein(s1, s2):
l1 = len(s1)
l2 = len(s2)
matrix = [range(l1 + 1)] * (l2 + 1)
for zz in range(l2 + 1):
matrix[zz] = range(zz,zz + l1 + 1)
for zz in range(0,l2):
for sz in range(0,l1):
if s1[sz] == s2[zz]:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
else:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
print "That's the Levenshtein-Matrix:"
printMatrix(matrix)
return matrix[l2][l1]
distance = levenshtein(s1, s2)
print 'The Levenshtein-Distance of ',s1, ' and ', s2, ' is ', distance
|
Tags: text
Wikipedia implementations:
English
http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Python
Russian
http://ru.wikibooks.org/wiki/%D0%A0%D0%B0%D1%81%D1%81%D1%82%D0%BE%D1%8F%D0%BD%D0%B8%D0%B5_%D0%9B%D0%B5%D0%B2%D0%B5%D0%BD%D1%88%D1%82%D0%B5%D0%B9%D0%BD%D0%B0#.D0.90.D0.BB.D0.B3.D0.BE.D1.80.D0.B8.D1.82.D0.BC_.D0.9B.D0.B5.D0.B2.D0.B5.D0.BD.D1.88.D1.82.D0.B5.D0.B9.D0.BD.D0.B0_.D0.BD.D0.B0_.D1.8F.D0.B7.D1.8B.D0.BA.D0.B5_Python
Ok This code is fantastic and counts the differences in my strings. The problem is that I can not find a way to print "what" it is comparing.
To be more clear. I am trying to compare two strings for phoneme error rates. What the matrix doesn't tell me, is WHICH fones ahve been substituted, delted or inserted for Which phonemes from the other string.
For example if I have
Expected Sentence- I have a baseball Obtained Sentence- I have the best ball Expected Phonemes- ei h a v ae b e s b a l Obtained Phonemes- ei h a v th i b e s t b a l
Ho can I get the LD to return the exact phonemes which were confused and whether they were substituted for another phoneme (and which one), deleted, or inserted?