
Many written languages that use Latin alphabets employ diacritical marks. Even today, it is still fairly common to run into situations where it is desirable to get rid of them: file naming, creation of easy-to-read URIs, indexing schemes, etc.

The easy way has always been to filter out any "decorated" characters; unfortunately, this also discards the base, undecorated glyphs. Thanks to Python's Unicode support, it is now straightforward to perform such a transliteration: decompose each character with unicodedata.normalize, then drop the diacritical marks.
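To make the difference concrete, here is a small sketch contrasting plain filtering with decomposition followed by filtering. The variable names (text, naive, decomposed, stripped) are illustrative only and are not part of the recipe.

```python
import unicodedata

text = u'\xc0 quelle \xe9cole'  # "À quelle école"

# Plain filtering: every non-ASCII character is dropped, base letter included.
naive = text.encode('ascii', 'ignore').decode('ascii')

# NFKD decomposition splits each accented letter into its base letter
# followed by a separate combining mark (e.g. U+00E9 -> 'e' + U+0301).
decomposed = unicodedata.normalize('NFKD', text)

# After decomposition, the same filtering only drops the combining marks.
stripped = decomposed.encode('ascii', 'ignore').decode('ascii')

print(repr(naive))     # the accented letters are gone entirely
print(repr(stripped))  # the base letters survive
```

The decomposed string is two characters longer than the original, one extra combining mark per accented letter, which is why the base letters remain once the non-ASCII characters are discarded.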

(This recipe was completely rewritten based on a comment by Mathieu Clabaut: many thanks to him!)

Python, 30 lines
'''
Remove diacritical marks from strings containing characters from any
Latin alphabet.

Tested on both Python 2.x and Python 3.x
'''
import unicodedata

def remove_diacritic(text):
    '''
    Accept a unicode string and return an ASCII byte string (str in
    Python 2, bytes in Python 3) without any diacritical marks.
    '''
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore')

if __name__ == '__main__':
    import sys

    sample = '\xc0 quelle \xe9cole vas-tu?'

    if sys.version_info[0] >= 3:
        # Python 3: sample is already text; decode the ASCII bytes result.
        output = remove_diacritic(sample).decode('ascii')
    else:
        # Python 2: sample is a byte string; decode it as Latin-1 first.
        output = remove_diacritic(unicode(sample, 'ISO-8859-1'))

    print(sample)
    print(output)
    assert output == 'A quelle ecole vas-tu?'
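One caveat worth keeping in mind: letters that have no Unicode decomposition, such as 'ø', are not split into a base letter plus a mark, so encoding to ASCII with 'ignore' drops them silently. The sketch below is a possible variant, not the recipe's own method: it filters on unicodedata.combining instead of encoding to ASCII, so it returns text and leaves such letters untouched. The name strip_marks is hypothetical.

```python
import unicodedata

def strip_marks(text):
    '''
    Hypothetical variant: return a unicode string with combining marks
    removed, keeping every character that has no decomposition.
    '''
    decomposed = unicodedata.normalize('NFKD', text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))

name = u'S\xf8ren'  # "Søren"

# The ASCII-encoding approach loses the undecomposable letter ...
ascii_only = unicodedata.normalize('NFKD', name).encode('ASCII', 'ignore')

# ... while filtering on combining marks keeps it as-is.
kept = strip_marks(name)
```

Which behaviour is preferable depends on the use case: for file names or URIs restricted to ASCII, the byte-string result of the recipe is what is needed; for indexing or display, keeping the remaining non-ASCII letters may be the better trade-off.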

2 comments

Mathieu Clabaut 15 years, 1 month ago

import unicodedata

self.output = unicodedata.normalize('NFKD', self.input).encode('ASCII', 'ignore')

should do the trick.

Trent Mick 15 years, 1 month ago

See also discussion on this recipe: http://code.activestate.com/recipes/251871/