ActiveState Code

Recipe 576648: Remove diacritical marks (including accents) from strings using Latin alphabets


Many written languages that use Latin alphabets employ diacritical marks. Even today, it is still common to run into situations where removing them is desirable: file naming, creation of easy-to-read URIs, indexing schemes, etc.

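An easy way has always been to simply filter out any "decorated characters"; unfortunately, this discards the base, undecorated glyphs along with the marks. Thanks to Unicode support in Python (the unicodedata module), it is now straightforward to perform a proper transliteration: decompose each character into its base letter plus combining marks, then drop the marks.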
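
To make the difference concrete, here is a small sketch contrasting the two approaches on a sample phrase. The variable names (`text`, `filtered`, `stripped`) are purely illustrative and are not part of the recipe; the only library call used is `unicodedata.normalize`.

```python
import unicodedata

# "À quelle école", with the accented letters written as escapes
text = u'\xc0 quelle \xe9cole'

# Naive filtering: drop every non-ASCII character outright.
# The accented letters disappear entirely, base glyph included.
filtered = u''.join(ch for ch in text if ord(ch) < 128)

# Decompose first: NFKD turns each accented letter into its base letter
# followed by a combining mark, so dropping the non-ASCII characters
# afterwards removes only the marks.
decomposed = unicodedata.normalize('NFKD', text)
stripped = u''.join(ch for ch in decomposed if ord(ch) < 128)

print(filtered)
print(stripped)
```

The first line printed has lost both accented letters (" quelle cole"), while the second keeps their base forms ("A quelle ecole").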

(This recipe was completely rewritten based on a comment by Mathieu Clabaut: many thanks to him!)

Python
'''
Remove diacritical marks from strings containing characters from any
Latin alphabet.

Tested on both Python 2.x and Python 3.x
'''
import unicodedata

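def remove_diacritic(text):
    '''
    Accept a unicode string and return an ASCII byte string (str in
    Python 2, bytes in Python 3) without any diacritical marks.
    '''
    # NFKD splits each accented character into its base letter followed by
    # combining marks; encoding to ASCII with 'ignore' then drops the marks.
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore')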

if __name__ == '__main__':
    import sys
    
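    # "À quelle école vas-tu?" with the accented letters written as escapes
    sample = '\xc0 quelle \xe9cole vas-tu?'

    if sys.hexversion >= 0x3000000:
        # Python 3: sample is already a unicode string; decode the ASCII bytes
        output = remove_diacritic(sample).decode('ascii')
    else:
        # Python 2: sample is a byte string, so decode it as ISO-8859-1 first
        output = remove_diacritic(unicode(sample, 'ISO-8859-1'))

    print(sample)
    print(output)
    assert output == 'A quelle ecole vas-tu?'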
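
One thing to keep in mind: letters that have no Unicode decomposition (for example "ø", "ß" or "æ") are not split into a base letter plus a mark, so encoding to ASCII with 'ignore' drops them entirely. The following is a minimal sketch of a variant, for Python 3 only, that removes just the combining marks and returns text rather than bytes. The helper name `strip_combining_marks` is hypothetical and not part of the recipe; it relies on `unicodedata.combining`, which returns a non-zero class for combining characters.

```python
import unicodedata

def strip_combining_marks(text):
    """Hypothetical helper: return text (str) with combining marks removed."""
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep every character whose combining class is 0, i.e. everything
    # that is not a combining mark.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

angstrom = '\xc5ngstr\xf6m'   # "Ångström": both accented letters decompose
soster = 's\xf8ster'          # "søster": "ø" has no decomposition

print(strip_combining_marks(angstrom))
print(strip_combining_marks(soster))

# The recipe's approach, for comparison: "ø" is silently dropped.
print(unicodedata.normalize('NFKD', soster).encode('ASCII', 'ignore'))
```

With this variant "Ångström" becomes "Angstrom" and "søster" is left unchanged, whereas the ASCII-encoding approach turns "søster" into "sster". Which behaviour is preferable depends on whether the result must be pure ASCII; mapping such letters to ASCII substitutes would need an explicit table, which is outside the scope of this recipe.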

Comments

  1. At 12:15 a.m. on 11 Feb 2009, Mathieu Clabaut said:

    import unicodedata

    self.output = unicodedata.normalize('NFKD', self.input).encode('ASCII', 'ignore')

    should do the trick.

  2. At 10:12 a.m. on 11 Feb 2009, Trent Mick said:

    See also discussion on this recipe: http://code.activestate.com/recipes/251871/
