Guaranteed conversion to unicode or byte string « Python recipes

Python's built-in functions str() and unicode() return a string representation of an object as a byte string and a unicode string, respectively. The enhanced versions below, safe_str() and safe_unicode(), can be used as handy functions to convert between byte strings and unicode. They are especially useful in debugging, when a mix-up of the string types is suspected.

def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')


# ------------------------------------------------------------------------
# Sample code below to illustrate their usage

def write_unicode_to_file(filename, unicode_text):
    """
    Write unicode_text to filename in UTF-8 encoding.
    Parameter is expected to be unicode. But it will also tolerate byte string.
    """
    fp = open(filename, 'wb')
    # workaround problem if caller gives byte string instead
    unicode_text = safe_unicode(unicode_text)
    utf8_text = unicode_text.encode('utf-8')
    fp.write(utf8_text)
    fp.close()

      

Python's built-in functions str() and unicode() return a string representation of an object as a byte string and a unicode string, respectively. However, they cannot in general be applied when the argument is itself a string of the other type. In that case Python just applies the default encoding (e.g. ASCII, strict), which more often than not results in a UnicodeError. The proper way to convert between the two is to use the encode and decode methods, e.g.

# convert byte string to unicode
unicode_text = byte_string.decode(encoding)

# convert unicode to byte string
byte_string = unicode_string.encode(encoding)

This is considerably more involved. You have to choose the proper encoding. You also have to pay attention to the direction (i.e. call encode() on a unicode string and decode() on a byte string). Any mistake may end in a UnicodeError. In contrast, safe_unicode() and safe_str() can be applied to either a unicode string or a byte string. They apply the string_escape or unicode_escape encoding when necessary, which quotes non-ASCII and non-printable characters using Python's string escape sequences.
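As a minimal sketch of the points above, the following contrasts a proper round trip, a decode with the wrong codec, and the escape codec. The sample string and variable names are illustrative assumptions, and the snippet is written so that it should run on Python 2.6+ as well as Python 3.3+.

```python
# Illustrative sketch; the sample string and variable names are assumptions.

text = u'The message is \u8463'

# Proper conversion: encode() on unicode, decode() on a byte string,
# using the same encoding in both directions.
utf8_bytes = text.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

# Wrong codec: ASCII cannot decode the first non-ASCII byte.
try:
    utf8_bytes.decode('ascii')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Escape codec: does not fail on non-ASCII unicode input, but the result
# is an ASCII-only rendering in Python escape syntax, not a real encoding.
escaped = text.encode('unicode_escape')

print(repr(utf8_bytes))
print(repr(escaped))
print(decode_error)
```

The escaped form can be turned back into the original text with decode('unicode_escape'), which is what makes it useful as a diagnostic: nothing is lost, it is only made visible.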

When would you want to use safe_unicode() and safe_str() rather than encoding the strings properly? Mostly in debugging situations. If you expect unicode input but received a byte string instead, it might result in a UnicodeError with very little information about what the offending string is. Use safe_unicode() to avoid the exception, so that you can collect more information for debugging.

Another use case is no-fuss conversion for printing. One thing that makes Python so usable is that everything has a text representation, not only scalar types but also objects and even complex aggregated data like a list of lists or a map of maps. A Pythonista would instinctively use the print command to check out the structure or content of any object. Ironically, this can fail if the data is a unicode string containing non-ASCII characters. print safe_str(obj) gives you some idea of the data effortlessly. That information helps you determine the steps necessary to convert it the 'proper' way.

Let's look at an example, write_unicode_to_file(). It expects a unicode string as input. It encodes the input as UTF-8 and then writes it to a file. We will compare the results with and without the workaround code inserted:

unicode_text = safe_unicode(unicode_text)

First of all, when a unicode string is passed as intended:

>>> # define a test message
... text = u'The message is \u8463'
>>> write_unicode_to_file('test.tmp', text)


correct output -> The message is 董

However, if there is a simple misunderstanding and a byte string is passed instead, the version without the workaround (called write_unicode_to_file_simple here) raises a UnicodeDecodeError with little information on the offending string besides the 'byte 0xe8 in position 15' message.

>>> e_text = text.encode('utf-8')
>>> write_unicode_to_file_simple('test.tmp', e_text)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 7, in write_unicode_to_file
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 15: ordinal not in range(128)

With the workaround inserted, the exception is averted.

>>> # 3. a more robust version using safe_unicode
... e_text = text.encode('utf-8')
>>> write_unicode_to_file('test.tmp', e_text)


output -> The message is \xe8\x91\xa3

The output is at least readable and provides useful information for tracking down the source of the problem.
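The recipe targets Python 2; unicode() and the string_escape codec do not exist in Python 3. For readers on Python 3.5 or later, a rough analogue can be sketched with the backslashreplace error handler. The names safe_text and safe_bytes and the ASCII default are assumptions made for illustration, not part of the original recipe.

```python
# Python 3.5+ sketch. safe_text and safe_bytes are hypothetical names,
# not part of the original recipe.

def safe_text(obj, encoding='ascii'):
    """Return a str for obj; bytes that cannot be decoded are hex-escaped."""
    if isinstance(obj, (bytes, bytearray)):
        # backslashreplace never raises; undecodable bytes are escaped
        return bytes(obj).decode(encoding, errors='backslashreplace')
    return str(obj)


def safe_bytes(obj, encoding='ascii'):
    """Return bytes for obj; characters that cannot be encoded are escaped."""
    if isinstance(obj, (bytes, bytearray)):
        return bytes(obj)
    return str(obj).encode(encoding, errors='backslashreplace')


if __name__ == '__main__':
    e_text = 'The message is \u8463'.encode('utf-8')
    print(safe_text(e_text))
    print(safe_bytes('The message is \u8463'))
```

Unlike string_escape, backslashreplace only touches bytes that fail to decode, so newlines and other ASCII characters pass through unchanged.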

Tags: text

2 comments

Sridhar Ratnakumar 13 years, 1 month ago # | flag

Note that string_escape (and thus your safe_unicode) will also eat up newlines.

>>> b'df\nsdf'.encode('string_escape')
'df\\nsdf'
>>> print b'df\nsdf'.encode('string_escape')
df\nsdf

Marlon Baptista de Quadros 12 years, 1 month ago # | flag

Thanks, you help me a lot =D

Guaranteed conversion to unicode or byte string (Python recipe) by Wai Yip Tung
ActiveState Code (http://code.activestate.com/recipes/466341/)
