latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"
This takes a UNICODE string and replaces Latin-1 characters with something equivalent in 7-bit ASCII and returns a plain ASCII string. This function makes a best effort to convert Latin-1 characters into ASCII equivalents. It does not just strip out the Latin-1 characters. All characters in the standard 7-bit ASCII range are preserved. In the 8th bit range all the Latin-1 accented letters are converted to unaccented equivalents. Most symbol characters are converted to something meaningful. Anything not converted is deleted.
#!/usr/bin/env python
"""
latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"
This takes a UNICODE string and replaces Latin-1 characters with
something equivalent in 7-bit ASCII. This returns a plain ASCII string.
This function makes a best effort to convert Latin-1 characters into
ASCII equivalents. It does not just strip out the Latin1 characters.
All characters in the standard 7-bit ASCII range are preserved.
In the 8th bit range all the Latin-1 accented letters are converted to
unaccented equivalents. Most symbol characters are converted to
something meaningful. Anything not converted is deleted.
Background:
One of my clients gets address data from Europe, but most of their systems
cannot handle Latin-1 characters. With all due respect to the umlaut,
scharfes s, cedilla, and all the other fine accented characters of Europe,
all I needed to do was to prepare addresses for a shipping system.
After getting headaches trying to deal with this problem using Python's
built-in UNICODE support I gave up and decided to use some brute force.
This function converts all accented letters to their unaccented equivalents.
I realize this is dirty, but for my purposes the mail gets delivered.
"""

def latin1_to_ascii(unicrap):
    """This takes a UNICODE string and replaces Latin-1 characters with
    something equivalent in 7-bit ASCII. It returns a plain ASCII string.
    This function makes a best effort to convert Latin-1 characters into
    ASCII equivalents. It does not just strip out the Latin-1 characters.
    All characters in the standard 7-bit ASCII range are preserved.
    In the 8th bit range all the Latin-1 accented letters are converted
    to unaccented equivalents. Most symbol characters are converted to
    something meaningful. Anything not converted is deleted.
    """
    # Keys are Latin-1 code points; values are their ASCII replacements.
    xlate={0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        # 0xde is the capital thorn, so it maps to 'Th' to match 0xfe:'th'.
        0xdd:'Y', 0xde:'Th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        # Symbols and punctuation in the 0xa1-0xbf range.
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
        }
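    r = ''
    for i in unicrap:
        if ord(i) in xlate:
            # Mapped Latin-1 character: substitute its ASCII equivalent.
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            # Unmapped non-ASCII character: drop it.
            pass
        else:
            # Plain 7-bit ASCII: keep it unchanged.
            r += str(i)
    return r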

if __name__ == '__main__':
    s = unicode('','latin-1')
    for c in range(32,256):
        if c != 0x7f:
            s = s + unicode(chr(c),'latin-1')
    plain_ascii = latin1_to_ascii(s)
    print 'INPUT type:', type(s)
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT type:', type(plain_ascii)
    print 'OUTPUT:'
    print plain_ascii

One of my clients gets address data from Europe, but most of their systems cannot handle Latin-1 characters. With all due respect to the umlaut, scharfes s, cedilla, and all the other fine accented characters of Europe, all I needed to do was to prepare addresses for a shipping system. After getting headaches trying to deal with this problem using Python's built-in UNICODE support I gave up and decided to use some brute force. This function converts all accented letters to their unaccented equivalents. I realize this is dirty, but for my purposes the mail gets delivered.
If you run this script from the command line it will run a demo. It will create a UNICODE string with all the Latin-1 characters from 32 to 255. Then it will convert that string to a plain ASCII Python string and print the results.
Better method. For the application for which this was written, the code given is OK, but it would creak a lot with a long string to convert. I'm sure the following is much faster:
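A minimal sketch of one common way to avoid repeated string concatenation, not necessarily what this comment originally proposed: collect the output pieces in a list and join them once at the end. It is written for Python 3, where every str is already Unicode. The name latin1_to_ascii_join and the abbreviated XLATE table are hypothetical; the recipe's full xlate mapping would be used in practice.

```python
# Hypothetical, abbreviated table; the recipe's full xlate mapping would go here.
XLATE = {0xe9: 'e', 0xfc: 'u', 0xdf: 'ss', 0xa9: '{C}'}


def latin1_to_ascii_join(text, table=XLATE):
    """Same rules as the recipe, but the result is joined once at the end."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code in table:
            pieces.append(table[code])   # mapped character: substitute
        elif code < 0x80:
            pieces.append(ch)            # plain ASCII: keep
        # anything else (unmapped and >= 0x80) is dropped
    return ''.join(pieces)


print(latin1_to_ascii_join('Stra\xdfe caf\xe9 \xa9'))  # Strasse cafe {C}
```

Whether this is measurably faster depends on the interpreter and the input length, so it is worth timing on real data before switching.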
unicodedata is your friend. You can save a lot of time by using unicodedata.name and unicodedata.normalize.
The following code collects all unicode characters whose name starts with 'LATIN' in a dictionary:
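A sketch of such a collection, assuming Python 3 (where chr covers the whole Unicode range); the variable name latin_chars is only illustrative:

```python
import sys
import unicodedata

# Map each character whose Unicode name starts with 'LATIN' to that name.
latin_chars = {}
for code in range(sys.maxunicode + 1):
    ch = chr(code)
    name = unicodedata.name(ch, '')   # '' for code points without a name
    if name.startswith('LATIN'):
        latin_chars[ch] = name

print(latin_chars['\xe9'])  # LATIN SMALL LETTER E WITH ACUTE
```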
This gives you a list of all the latin character names you might want to reduce to plain ASCII.
Remember that you can use unicode character names in python unicode strings:
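For example, in Python 3 syntax (in Python 2 the literal would carry a u prefix), the \N{...} escape and the numeric escapes produce the same string:

```python
# The \N{...} escape inserts a character by its Unicode name.
s = 'caf\N{LATIN SMALL LETTER E WITH ACUTE}'

print(s == 'caf\u00e9')  # True
print(len(s))            # 4
```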
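Also, you could use the unicodedata.normalize function to decompose precomposed unicode characters into their components. For instance, u'\u00e9' (LATIN SMALL LETTER E WITH ACUTE) decomposes into u'e' followed by u'\u0301' (COMBINING ACUTE ACCENT).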
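A small sketch of that decomposition, assuming Python 3:

```python
import unicodedata

# NFKD splits a precomposed character into a base letter plus combining marks.
decomposed = unicodedata.normalize('NFKD', '\xe9')

print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
```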
A possible approach to your problem might be:
Replace every unicode character in your text with its KD normal form using unicodedata.normalize;
Create a dictionary associating each unicode character that occurs in your text with its unicode name, using unicodedata.name;
Discard from the dictionary all items that correspond to plain vanilla ASCII characters;
Remove all unicode characters in the dictionary from your text, or replace them with some ASCII representation (e.g. u'\u0301' -> u'\N{COMBINING ACUTE ACCENT}');
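The steps above can be sketched roughly as follows, assuming Python 3. The function name reduce_to_ascii is hypothetical, and this version simply removes the non-ASCII characters rather than replacing them with an ASCII representation:

```python
import unicodedata


def reduce_to_ascii(text):
    """Return (ascii_text, names_of_removed_characters)."""
    # Step 1: replace every character with its KD normal form.
    decomposed = unicodedata.normalize('NFKD', text)
    # Steps 2 and 3: name each character that occurs, keeping only non-ASCII ones.
    names = {ch: unicodedata.name(ch, 'UNKNOWN')
             for ch in set(decomposed) if ord(ch) >= 0x80}
    # Step 4: remove every character that is left in the dictionary.
    cleaned = ''.join(ch for ch in decomposed if ch not in names)
    return cleaned, names


ascii_text, removed = reduce_to_ascii('Cr\xe8me br\xfbl\xe9e')
print(ascii_text)  # Creme brulee
```

The returned dictionary shows what was thrown away, which helps when deciding whether some characters deserve a hand-written replacement instead.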
Hope that helps.
Improve readability by using unicode character names. If you don't want to make any changes to the behaviour of your code, you can still make it more readable by replacing the "xlate" dictionary with the following:
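An abbreviated sketch of what such a dictionary could look like, in Python 3 syntax. Only a handful of entries are shown; the remaining entries of the recipe's table would follow the same pattern:

```python
# Keys are the characters themselves, written by Unicode name.
xlate = {
    '\N{LATIN CAPITAL LETTER A WITH GRAVE}': 'A',
    '\N{LATIN SMALL LETTER SHARP S}': 'ss',
    '\N{LATIN SMALL LETTER AE}': 'ae',
    '\N{COPYRIGHT SIGN}': '{C}',
    '\N{VULGAR FRACTION ONE HALF}': '{1/2}',
}

print(xlate['\xdf'])  # ss
```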
Almost. The code has to change a little bit: since the dictionary keys are now unicode characters, two lines must be changed to:
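A sketch of the changed lookup, assuming Python 3 and a two-entry xlate table for illustration. The two changed lines test and index the dictionary with the character itself instead of ord(i):

```python
# Abbreviated table keyed by character rather than by code point.
xlate = {
    '\N{LATIN SMALL LETTER E WITH ACUTE}': 'e',
    '\N{LATIN SMALL LETTER SHARP S}': 'ss',
}


def latin1_to_ascii(text):
    r = ''
    for ch in text:
        if ch in xlate:         # was: xlate.has_key(ord(i))
            r += xlate[ch]      # was: r += xlate[ord(i)]
        elif ord(ch) >= 0x80:
            pass                # unmapped non-ASCII character: drop it
        else:
            r += ch
    return r


print(latin1_to_ascii('caf\xe9 Stra\xdfe'))  # cafe Strasse
```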
It's even more readable now.
Thanks a lot to both of you for the nice recipe.
a cleaner solution. Unicode strings have a 'translate' function which takes the dictionary mapping unicode values to new text. If not present the character is left unchanged. If the mapped value is None then the character is deleted.
Here's one way to use it to solve this problem. Call the fix_unicode() function defined at the end of this comment. It takes the unicode string and returns the hammered ASCII string.
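A minimal sketch of a fix_unicode() along those lines, assuming Python 3 (where the method is str.translate) and an abbreviated XLATE table. Every Latin-1 code point from 0x80 to 0xff is first mapped to None so that unmapped ones are deleted. Code points above 0xff are left unchanged in this sketch.

```python
# Hypothetical, abbreviated table; the recipe's full xlate mapping would go here.
XLATE = {0xc0: 'A', 0xe9: 'e', 0xdf: 'ss', 0xa9: '{C}'}


def fix_unicode(text, xlate=XLATE):
    """Hammer Latin-1 text down to ASCII with a single translate() call."""
    # Default: every 8-bit Latin-1 code point maps to None, i.e. is deleted.
    table = dict.fromkeys(range(0x80, 0x100))
    # Overlay the explicit replacements; ASCII is absent, so it is unchanged.
    table.update(xlate)
    return text.translate(table)


print(fix_unicode('\xc0 caf\xe9 Stra\xdfe \xa9'))  # A cafe Strasse {C}
```

Building the table once at module level instead of on every call would be the obvious next step for long-running code.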
Using NFKD. A very simple, and obviously-correct way to do this is like so:
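A sketch of that approach, assuming Python 3; the function name strip_accents is illustrative. In Python 3 encode() returns bytes, hence the final decode(), which Python 2 would not need:

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFKD, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(strip_accents('\xc0 la carte, caf\xe9'))  # A la carte, cafe
```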
It has the advantage that you don't need to enumerate any particular conversions: any accented latin characters will be reduced to their base form, and non-ascii characters will be stripped.
By normalizing to NFKD, we transform precomposed characters like \u00C0 (LATIN CAPITAL LETTER A WITH GRAVE) into pairs of base letter \u0041 (A) and combining character \u0300 (COMBINING GRAVE ACCENT).
Converting to ascii using 'ignore' strips all non-ascii characters, e.g. the combining characters. However, it will also strip other non-ascii characters, so if there are no latin characters in the input, the output will be empty.
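A short illustration of that caveat, assuming Python 3; nfkd_ascii is a hypothetical helper name. Characters with no decomposition into an ASCII base letter, such as the sharp s or the o with stroke, simply vanish:

```python
import unicodedata


def nfkd_ascii(text):
    """NFKD-normalize, then drop every non-ASCII character."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


print(repr(nfkd_ascii('Stra\xdfe')))          # 'Strae'  (sharp s is lost)
print(repr(nfkd_ascii('\xd8rsted')))          # 'rsted'  (O with stroke is lost)
print(repr(nfkd_ascii('\u03b1\u03b2\u03b3')))  # ''       (Greek input vanishes)
```

For data such as addresses, this is a reason to combine NFKD with a small explicit table for the characters it cannot reduce.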
Another solution. BTW this is very, very useful. Thanks for this thread.
I had been solving this issue by using a modified version of Skip Montanaro's latscii, which creates a string encoding (a codec) that does that, but I like your unicodedata solution better.
Nice solution with NFKD. This is a really elegant solution. Thank you!