You need to deal with text strings which include non-ASCII characters. Python has a first class unicode type which you should always use instead of str to represent text.
It's easy, once you accept the need to explicitly convert between a bytestring and a unicode string:

>>> german_ae = unicode('\xc3\xa4', 'utf8')

Here german_ae is a unicode string representing the German "lowercase a with umlaut". It has been constructed by interpreting the bytestring '\xc3\xa4' according to the specified UTF8 encoding. There are many encodings, but UTF8 is often used because it is universal and yet fully compatible with the 7-bit ASCII set (any ASCII bytestring is a correct UTF8-encoded string).

Once you have crossed this barrier, life is easy! You can manipulate such a unicode string in practically the same way as a plain str string:

>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])

Note that para is a unicode string, because operations between a unicode string and a byte string always result in a unicode string... unless they fail and raise an exception:

>>> bytestring = '\xc3\xa4'  # Uh-oh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

The byte 0xc3 is not a valid character in the 7-bit ASCII encoding, and Python refuses to guess an encoding. So, being explicit about encodings is the crucial point for successfully using unicode strings with Python.
Unicode is easy to handle in Python, if you respect a few guidelines and learn to deal with common problems. This is not to say that an efficient implementation of unicode is an easy task. Luckily, as with other hard problems, you don't have to care much: you can just use the efficient implementation of unicode which Python provides.
The most important issue is to fully accept the distinction between a byte string and a unicode string. As exemplified in this recipe's solution, you often need to explicitly construct a unicode string by providing a byte string and an encoding. Without an encoding, a bytestring is basically meaningless, unless you happen to be lucky and can just assume that the bytestring is text in ASCII.
The most common problem with using unicode in Python arises when you are doing some text manipulation where only some of your strings are unicode objects, and others are bytestrings. Python makes a shallow attempt to implicitly convert your bytestrings to unicode. It usually assumes an ASCII encoding, though, which gives you UnicodeDecodeError exceptions if you actually have non-ASCII bytes somewhere. A UnicodeDecodeError tells you that you mixed unicode and byte strings in such a way that Python cannot (and will not try to) guess which encoding your bytestring is in.
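The implicit promotion just described amounts to decoding the bytestring with the 'ascii' codec, so the error can be reproduced explicitly. A minimal sketch, written with bytes-literal syntax so it behaves the same way on modern Pythons:

```python
# Python 2's implicit bytestring-to-unicode promotion amounts to an ASCII
# decode; doing that decode explicitly reproduces the same error.
bytestring = b'\xc3\xa4'   # UTF8 bytes, not valid 7-bit ASCII

try:
    bytestring.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)   # 'ascii' codec can't decode byte 0xc3 in position 0 ...
```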
Developers from many big Python projects have come up with simple rules of thumb to prevent such runtime UnicodeDecodeErrors, and the rules may be summarized into one sentence: always do the conversion at IO-barriers. To express this same concept a bit more extensively...:
whenever your program receives text data from the outside (from the network, from a file, from user input, ...), construct unicode objects immediately. Find out the appropriate encoding from, e.g., an HTTP-header, or look for some appropriate convention to determine the encoding to use.
whenever your program sends text data to the outside (to the network, to some file, to the user, ...), determine the correct encoding, and convert your text to a byte string with that encoding. (Otherwise, Python would attempt to convert unicode to an ASCII bytestring, likely producing UnicodeEncodeErrors, which are just the converse of the UnicodeDecodeErrors previously mentioned.)
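The two rules above can be sketched as a pair of thin wrappers around the IO barriers. The helper names read_text and write_text are hypothetical, and the sketch uses bytes literals and .decode()/.encode() so it runs unchanged on modern Pythons:

```python
def read_text(raw_bytes, encoding='utf8'):
    # Inbound IO barrier: turn raw bytes into a unicode string immediately.
    return raw_bytes.decode(encoding)

def write_text(text, encoding='utf8'):
    # Outbound IO barrier: convert the unicode text back to bytes.
    return text.encode(encoding)

incoming = b'This is a \xc3\xa4'   # pretend these bytes arrived from a socket
text = read_text(incoming)         # decode once, at the boundary
text = text.upper()                # all internal manipulation is on unicode
outgoing = write_text(text)        # encode once, on the way out
```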
With these two rules, you will find that most unicode problems just go away. If you still get UnicodeErrors of either kind, look for the place where you forgot to properly construct a unicode object, forgot to properly convert back to an encoded bytestring, or ended up using an inappropriate encoding due to some mistake. (It is quite possible that such encoding mistakes are caused by the user, or by some other program interacting with yours, which is not following the proper rules or conventions regarding which encoding is to be used.)
In order to convert a unicode string back to an encoded bytestring, you usually do something like...:
>>> bytestring = german_ae.encode('latin1')
>>> bytestring
'\xe4'
Now bytestring is the German ä character in the 'latin1' encoding. Note how '\xe4' (in Latin1) and the previously shown '\xc3\xa4' (in UTF8) represent the same German character, but in different encodings.
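That equivalence is easy to verify directly. A small sketch (again using bytes literals and .decode()/.encode() so it works on modern Pythons): the same abstract character produces different bytes under each encoding, and round-tripping through either encoding recovers the identical unicode string.

```python
german_ae = b'\xc3\xa4'.decode('utf8')   # the single character a-with-umlaut

# The same abstract character maps to different bytes in each encoding.
assert german_ae.encode('utf8') == b'\xc3\xa4'
assert german_ae.encode('latin1') == b'\xe4'

# Round-tripping through either encoding recovers the same unicode string.
assert german_ae.encode('latin1').decode('latin1') == german_ae
```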
By now, you can probably imagine why Python refuses to guess among the hundreds of possible encodings. It's a crucial design choice, based on one of the "Zen of Python" principles: In the face of ambiguity, refuse the temptation to guess. At any interactive Python shell prompt, enter the statement "import this" to read all of the important principles that make up the "Zen of Python".
This is good, but I would change this:
"Python has a first class unicode type which you can use in place of the plain byte-string str type."
"Python has a first class unicode type which you should always use instead of str to represent text." :)
Typo: unicode object has no decode() method. Excellent article, with one typo: A unicode obj has an encode() method but not a decode() one.
thanks to Bob and Wade. i incorporated your suggestions in version 1.1.
re: unicode.decode(). Unicode objects have a decode method in Python 2.4.
First example with "german ae" could be better. Two points:
(1) there is nothing specifically German about the letter; it is used in other languages
(2) It's very hard to imagine in practice anybody writing code in terms of utf8 constants. Where do you look up what code to use? Much more practical is using (wait for it!) Unicode -- it can be looked up on the unicode.org website, or using the "charmap" accessory on Windows [presumably similar on other OSes], ...
latin_small_letter_ae = u'\u00e6' # or u'\xe6'
Re: John Machin's point (2) I agree that it's hard to imagine writing code in terms of utf8 constants, but once you know the official name of the unicode character(s) you need (e.g. by looking them up at unicode.org) you can import unicodedata and look them up by name, thus avoiding messy hex altogether e.g.:
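The code example this comment refers to appears to have been lost in extraction; given the surrounding context, it presumably looked something like this sketch using the standard library's unicodedata.lookup:

```python
import unicodedata

# Look the character up by its official Unicode name -- no messy hex needed.
latin_small_letter_ae = unicodedata.lookup('LATIN SMALL LETTER AE')
assert latin_small_letter_ae == u'\u00e6'
```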
Much more self documenting!