Many written languages that use Latin alphabets employ diacritical marks. Even today, it is still common to need to strip them: for file naming, easy-to-read URIs, indexing schemes, and so on.
A simple approach has always been to filter out any "decorated" characters; unfortunately, this discards the base, undecorated letters along with the marks. Thanks to Unicode support in Python, it is now straightforward to perform a proper transliteration that keeps the base letters.
(This recipe was completely rewritten based on a comment by Mathieu Clabaut: many thanks to him!)
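To make the difference between the two approaches concrete, here is a minimal Python 3 sketch (not part of the original recipe) that applies both to the recipe's own test string. The variable names `filtered` and `transliterated` are illustrative only.

```python
import unicodedata

# Same sample as the recipe: "À quelle école va-tu?" written with escapes.
text = '\xc0 quelle \xe9cole va-tu?'

# Naive filtering: every non-ASCII character is dropped outright,
# so the base letters "A" and "e" are lost together with their accents.
filtered = text.encode('ascii', 'ignore').decode('ascii')

# Decomposing first (NFKD) turns each accented letter into a base letter
# followed by a combining mark; only the marks are then dropped.
decomposed = unicodedata.normalize('NFKD', text)
transliterated = decomposed.encode('ascii', 'ignore').decode('ascii')

print(repr(filtered))        # ' quelle cole va-tu?'
print(repr(transliterated))  # 'A quelle ecole va-tu?'
```

The only difference between the two results is the normalization step applied before the ASCII encoding.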
'''
Remove diacritical marks from strings containing characters from any
Latin alphabet.
Tested on both Python 2.x and Python 3.x.
'''
import unicodedata


def remove_diacritic(text):
    '''
    Accept a unicode string and return a byte string (str in Python 2,
    bytes in Python 3) without any diacritical marks.
    '''
    # NFKD splits each accented character into its base letter plus
    # combining marks; encoding to ASCII with 'ignore' then drops the marks.
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore')


if __name__ == '__main__':
    import sys
    text = '\xc0 quelle \xe9cole va-tu?'
    if sys.hexversion >= 0x3000000:
        # On Python >= 3.0.0: text is already unicode; decode the bytes result
        output = remove_diacritic(text).decode()
    else:
        # On Python < 3.0.0: text is a byte string; decode it as ISO-8859-1 first
        output = remove_diacritic(unicode(text, 'ISO-8859-1'))
    print(text)
    print(output)
    assert output == 'A quelle ecole va-tu?'
import unicodedata
self.output = unicodedata.normalize('NFKD', self.input).encode('ASCII', 'ignore')
should do the trick.
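The recipe returns a byte string and, because of the ASCII encoding step, drops every character that has no ASCII equivalent. Where a text string is wanted and non-Latin characters should survive, one possible variant is to filter on the combining class instead of encoding to ASCII. This is a hedged Python 3 sketch of a different technique, not the recipe's method; the function name `strip_marks` is hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Return text (a str) with combining marks removed after NFKD."""
    decomposed = unicodedata.normalize('NFKD', text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for marks such as U+0301 (acute accent).
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


result = strip_marks('\xc0 quelle \xe9cole va-tu?')
print(result)  # A quelle ecole va-tu?
```

Unlike the ASCII-encoding approach, this keeps any character that is not a combining mark, so whether it fits depends on whether the output must be pure ASCII (as for some file names or URIs) or merely free of diacritics.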
See also discussion on this recipe: http://code.activestate.com/recipes/251871/