A function that implements sort keys for the German language according to DIN 5007.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
def din5007(input):
    """ This function implements sort keys for the German language according to
    DIN 5007."""

    # key1: compare words lowercase and replace umlauts according to DIN 5007
    key1 = input.lower()
    key1 = key1.replace(u"ä", u"a")
    key1 = key1.replace(u"ö", u"o")
    key1 = key1.replace(u"ü", u"u")
    key1 = key1.replace(u"ß", u"ss")

    # key2: sort the lowercase word before the uppercase word and sort
    # the word with umlaut after the word without umlaut
    key2 = input.swapcase()

    # in case two words are the same according to key1, sort the words
    # according to key2.
    return (key1, key2)

words=[u"All", u"Tränen", u"Zauber", u"aber", u"tränen", u"zum", u"Ärger", u"ärgerlich"]
print sorted(words, key=din5007)
When I tried to sort words containing German umlauts in Python, I noticed
that the sorted() function puts them in the wrong order
(Note: The correct order is already given by the list words!):
>>> words=[u"aber",u"All",u"Ärger",u"ärgerlich",u"tränen",u"Tränen",u"Zauber",u"zum"]
>>> print sorted(words)
[u'All', u'Tr\xe4nen', u'Zauber', u'aber', u'tr\xe4nen', u'zum', u'\xc4rger', u'\xe4rgerlich']
The umlauts are sorted to the end of the list, which is wrong according to
DIN 5007, and lowercase words are sorted after uppercase words, which is also
wrong (have a look into your DUDEN, if you don't believe it ;-) ).
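The reason for this order is that plain string comparison goes by code point. The following is a minimal sketch, assuming Python 3 (the session above is Python 2), that prints the code points involved and reproduces the default order:

```python
# Python 3 sketch: sorted() without a key compares strings code point by
# code point, so all ASCII uppercase letters come before all ASCII
# lowercase letters, and the umlauts come after both.
for ch in ["A", "Z", "a", "z", "Ä", "ä"]:
    print(ch, ord(ch))

words = ["aber", "All", "Ärger", "ärgerlich", "tränen", "Tränen", "Zauber", "zum"]
default_order = sorted(words)
print(default_order)
```

Since "Z" has a smaller code point than "a", and "z" a smaller one than "Ä", the uppercase words come first and the umlaut words end up last.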
First I tried to solve this using functions of the locale module, like this:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> print sorted(words, key=locale.strxfrm)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 0: ordinal not in range(128)
So 'locale.strxfrm' doesn't seem to support Unicode. I tried to fix this by changing the default encoding:
>>> import sys
>>> sys.setdefaultencoding("utf_8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
Another annoying problem in Python, but one can work around it by reloading sys first:
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding("utf_8")
>>> print sorted(words, key=locale.strxfrm)
[u'All', u'Tr\xe4nen', u'Zauber', u'aber', u'tr\xe4nen', u'zum', u'\xc4rger', u'\xe4rgerlich']
So I had no success! But this was only the case when I used Python on
Mac OS X 10.5. Python installed in a German Windows environment yielded the
right result. So depending on your OS and your Python environment, you get
different results when sorting German words with locale.
This is annoying!
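For comparison, here is a hedged sketch of the same locale-based attempt in Python 3, where locale.strxfrm accepts str directly. The helper name sort_with_locale is made up for this illustration, and the locale name "de_DE.UTF-8" is taken from the session above; it may be missing or spelled differently on other systems, which is exactly the portability problem described here.

```python
import locale

def sort_with_locale(words, locale_name="de_DE.UTF-8"):
    """Sort words with the platform's collation rules (hypothetical helper).

    Returns (sorted_words, locale_was_available). The resulting order
    depends on the OS and its installed locale data.
    """
    # Remember the current collation setting so it can be restored.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        available = True
    except locale.Error:
        # The requested locale is not installed; keep the current one.
        available = False
    try:
        result = sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)
    return result, available

words = ["All", "Tränen", "Zauber", "aber", "tränen", "zum", "Ärger", "ärgerlich"]
result, available = sort_with_locale(words)
print(available, result)
```

No expected output is given on purpose: whether the result matches the DIN 5007 order depends on the platform.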
So I started to write my own function for sorting words in the correct way, independent of any OS and localization setting, and finally came up with this recipe.
Note that the function returns a tuple of two keys, where the second one is only used if the two words are the same according to the first key. This works because Python compares tuples element by element. It is a feature of Python that is not widely known (I could not find it mentioned in the Python documentation or the HowToSorting), although it is very useful.
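The tuple behaviour can be checked directly. Below is a minimal sketch, assuming Python 3 (the recipe itself is Python 2); din5007_key is a name chosen here for a port of the recipe's function and is not part of the original.

```python
def din5007_key(word):
    """Python 3 port of the recipe's key function (illustrative sketch)."""
    # key1: lowercase, with umlauts and ß replaced by their base letters.
    key1 = word.lower()
    for umlaut, base in (("ä", "a"), ("ö", "o"), ("ü", "u"), ("ß", "ss")):
        key1 = key1.replace(umlaut, base)
    # key2: tie-breaker; swapping the case puts lowercase words first.
    key2 = word.swapcase()
    return (key1, key2)

words = ["All", "Tränen", "Zauber", "aber", "tränen", "zum", "Ärger", "ärgerlich"]
result = sorted(words, key=din5007_key)
print(result)

# Tuples compare element by element: the second element only matters
# when the first elements are equal.
print(din5007_key("tränen"))   # ('tranen', 'TRÄNEN')
print(din5007_key("Tränen"))   # ('tranen', 'tRÄNEN')
print(din5007_key("tränen") < din5007_key("Tränen"))  # True
```

Here "tränen" and "Tränen" share the first key "tranen", so the swapped-case second key decides and the lowercase word comes first.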