
Remove diacritical marks (including accents) from strings using Latin alphabets (Python recipe) by Sylvain Fourmanoit
ActiveState Code (http://code.activestate.com/recipes/576648/)

Many written languages that use Latin alphabets employ diacritical marks. Even today, it is still pretty common to run into situations where it would be desirable to get rid of them: file naming, creation of easy-to-read URIs, indexing schemes, etc.

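An easy way has always been to simply filter out any "decorated characters"; unfortunately, this drops each such character entirely instead of preserving its base, undecorated glyph. But thanks to Unicode support in Python, it is now straightforward to strip only the marks: normalizing a string to a decomposed form splits each decorated character into a base letter followed by its combining marks, and the marks can then be discarded.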
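
To make the mechanism concrete, here is a minimal sketch (Python 3 only) of what decomposition does to a single accented character. The helper name `decompose` is hypothetical and is not part of the recipe; it uses only `unicodedata.normalize` and `unicodedata.name` from the standard library.

```python
import unicodedata

def decompose(ch):
    """Return the Unicode names of the code points that NFKD yields for `ch`.

    `decompose` is an illustrative helper, not part of the original recipe.
    """
    return [unicodedata.name(c) for c in unicodedata.normalize('NFKD', ch)]

# U+00E9 (e with acute accent) splits into a base letter plus a combining mark.
print(decompose('\u00e9'))
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
```

The combining mark is not an ASCII character, which is why encoding the decomposed string to ASCII with the 'ignore' error handler leaves only the base letter.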

(This recipe was completely rewritten based on a comment by Mathieu Clabaut: many thanks to him!)

'''
Remove diacritical marks from strings containing characters from any
Latin alphabet.

Tested on both Python 2.x and Python 3.x
'''
import unicodedata

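def remove_diacritic(text):
    '''
    Accept a unicode string, and return a normal string (bytes in Python 3)
    without any diacritical marks.
    '''
    # NFKD splits each decorated character into a base character followed by
    # combining marks; encoding to ASCII with 'ignore' then drops the marks
    # (and any other character that has no ASCII equivalent).
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore')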

if __name__ == '__main__':
    import sys
    
    input = '\xc0 quelle \xe9cole va-tu?'

    if sys.hexversion >= 0x3000000:
        # On Python >= 3.0.0
        output = remove_diacritic(input).decode()
    else:
        # On Python < 3.0.0
        output = remove_diacritic(unicode(input, 'ISO-8859-1'))

    print(input)
    print(output)
    assert(output == 'A quelle ecole va-tu?')
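
One caveat of the ASCII round trip: characters that have no Unicode decomposition, such as 'ß' or 'ø', are dropped entirely rather than transliterated. The following is a minimal sketch of an alternative, assuming Python 3 only, that removes just the combining marks and returns a str. The names `strip_marks` and `to_ascii` are hypothetical and not part of the recipe.

```python
import unicodedata

def strip_marks(text):
    """Drop combining marks only; every other character is kept as-is.

    Hypothetical variant of the recipe, Python 3 only, returns str.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def to_ascii(text):
    """The recipe's approach, decoded back to str for comparison."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(strip_marks('\xc0 quelle \xe9cole'))  # A quelle ecole
print(strip_marks('Stra\xdfe'))             # 'ß' has no decomposition, so it is kept
print(to_ascii('Stra\xdfe'))                # Strae: 'ß' is silently dropped
```

Which behavior is preferable depends on the use case: the ASCII version guarantees pure ASCII output (useful for file names or URIs), while the combining-mark version never loses a letter but may leave non-ASCII characters in the result.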

      

Tags: text, text_processing

2 comments

Mathieu Clabaut 15 years, 2 months ago

import unicodedata

self.output = unicodedata.normalize('NFKD', self.input).encode('ASCII','ignore') should do the trick.

Trent Mick 15 years, 2 months ago

See also discussion on this recipe: http://code.activestate.com/recipes/251871/

Created by Sylvain Fourmanoit on Tue, 10 Feb 2009 (MIT)


Licensed under the MIT License
Revision 7 (updated 15 years ago)
