Many written languages that use a Latin alphabet employ diacritical marks. Even today, it is still common to encounter situations where stripping them is desirable: file naming, creation of easy-to-read URIs, indexing schemes, etc.
An easy way has always been to simply filter out any "decorated" characters; unfortunately, this discards the base, undecorated letters along with their marks. Thanks to Unicode support in Python, it is now straightforward to perform a proper transliteration: decompose each character into its base letter plus combining marks, then drop the marks.
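To make the difference concrete, here is a minimal Python 3 sketch contrasting naive filtering with Unicode decomposition. The variable names are illustrative only and are not part of the recipe.

```python
import unicodedata

text = "\u00e9cole"  # 'école', written with an escape to avoid source-encoding issues

# Naive filtering: dropping every non-ASCII character loses the base letter too.
naive = "".join(ch for ch in text if ord(ch) < 128)

# NFKD decomposition splits U+00E9 into 'e' followed by U+0301
# (combining acute accent), so only the mark is non-ASCII afterwards.
decomposed = unicodedata.normalize("NFKD", text)
stripped = "".join(ch for ch in decomposed if ord(ch) < 128)

print(naive)            # cole
print(len(decomposed))  # 6
print(stripped)         # ecole
```

After decomposition the base letter survives the ASCII filter, which is the whole point of normalizing first.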
(This recipe was completely rewritten based on a comment by Mathieu Clabaut: many thanks to him!)
'''
Remove diacritical marks from strings containing characters from any
Latin alphabet.

Tested on both Python 2.x and Python 3.x
'''

import unicodedata


def remove_diacritic(text):
    '''
    Accept a unicode string, and return a normal string (bytes in
    Python 3) without any diacritical marks.
    '''
    # NFKD splits each accented character into a base letter plus
    # combining marks; encoding to ASCII with 'ignore' drops the marks.
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore')


if __name__ == '__main__':
    import sys

    # \xc0 is 'A' with grave accent, \xe9 is 'e' with acute accent
    sample = '\xc0 quelle \xe9cole va-tu?'
    if sys.hexversion >= 0x3000000:
        # On Python >= 3.0.0: sample is already unicode; decode the bytes result
        output = remove_diacritic(sample).decode()
    else:
        # On Python < 3.0.0: sample is a byte string; convert it to unicode first
        output = remove_diacritic(unicode(sample, 'ISO-8859-1'))
    print(sample)
    print(output)
    assert output == 'A quelle ecole va-tu?'
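One limitation is worth illustrating: encoding to ASCII with 'ignore' silently drops every character that has no ASCII decomposition, such as 'ß' or 'ø'. The sketch below is a Python 3 only variant, not the recipe's own method. It removes just the combining marks by testing `unicodedata.combining`, and returns a `str`. The function names `strip_combining_marks` and `remove_diacritic_ascii` are hypothetical, chosen for this illustration.

```python
import unicodedata


def strip_combining_marks(text):
    """Return text with combining marks removed; other characters are kept."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_diacritic_ascii(text):
    """Same approach as the recipe, but decoded back to str (Python 3)."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


# 'Straße in København': neither U+00DF nor U+00F8 has a decomposition.
sample = "Stra\u00dfe in K\u00f8benhavn"
print(remove_diacritic_ascii(sample))  # Strae in Kbenhavn
print(strip_combining_marks(sample))   # unchanged: no combining marks to strip

# Accented letters are handled identically by both functions.
print(strip_combining_marks("\u00c0 quelle \u00e9cole"))  # A quelle ecole
```

Which behavior is preferable depends on the use case: the ASCII approach guarantees pure ASCII output, which suits file names and URIs, while filtering on combining marks preserves letters that merely lack a decomposition.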