
I ran into UnicodeEncodeError while experimenting with different code pages in source code, files, and standard streams. I also sometimes get UnicodeEncodeError when a script is launched by a scheduler, or in a long-running batch job that parses unpredictable [alien ;)] HTML.

The console() function helps avoid these exceptions by converting the offending characters to their standard Python escape representation.
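
The same escaping can be sketched without trial printing by using the standard 'backslashreplace' codec error handler. This is not the recipe's method, only a minimal alternative sketch. It assumes Python 3, and the name escape_unencodable is hypothetical.

```python
# Assumption: Python 3. escape_unencodable is a hypothetical helper name,
# not part of the recipe.
def escape_unencodable(text, encoding):
    """Return text with characters that `encoding` cannot represent
    replaced by Python-style escapes (\\xNN, \\uNNNN or \\UNNNNNNNN)."""
    # 'backslashreplace' is a standard codec error handler. Encoding and
    # then decoding yields a string that is safe to print in `encoding`.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


if __name__ == "__main__":
    # U+00AB and U+00BB have no mapping in cp866, so both get escaped.
    print(escape_unencodable("\u00ab ok \u00bb", "cp866"))
```

Unlike console(), this sketch needs the target encoding passed in explicitly, but it leaves no trial characters on the screen.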

To do in the future: make a codec wrapper that is safe to use in statements like this:

sys.stdout=codecs.getwriter('cp866')(sys.stdout)
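
One possible shape for such a wrapper, as a sketch only: a codecs StreamWriter accepts an error handler, so the writer itself can escape unencodable characters instead of raising. The code below assumes Python 3 and writes to an in-memory byte buffer that stands in for the real console stream. The name safe_writer is hypothetical.

```python
# Assumption: Python 3. safe_writer is a hypothetical name, and io.BytesIO
# stands in for a real byte stream such as sys.stdout.buffer.
import codecs
import io


def safe_writer(byte_stream, encoding):
    """Wrap a byte stream so unencodable characters are written as
    backslash escapes instead of raising UnicodeEncodeError."""
    # codecs.getwriter() returns a StreamWriter class; its constructor
    # takes the stream and an optional error handler name.
    return codecs.getwriter(encoding)(byte_stream, errors="backslashreplace")


if __name__ == "__main__":
    buf = io.BytesIO()
    out = safe_writer(buf, "cp866")
    out.write("\u00ab\u0434\u0430\u00bb\n")  # guillemets are not in cp866
    print(buf.getvalue())
```

On Python 3.7 and later, sys.stdout.reconfigure(errors='backslashreplace') is another way to get similar behavior on the real stdout without replacing the stream object.
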
Python, 58 lines
# -*- coding: Windows-1251 -*-

_goodchars=dict()

def console(msg):
    '''
    Author: Denis Barmenkov <denis.barmenkov@gmail.com>
    Date: 02-jul-2007

    Write string to stdout.
    On UnicodeEncodeError, every character that cannot be encoded
    is replaced by its Python escape representation.
    '''
    global _goodchars
    try:
        print msg
    except UnicodeEncodeError:
        # printing failed: rebuild the string one character at a time
        res=''
        for i in msg:
            # try to put unknown characters through the print statement:
            if i not in _goodchars:
                try:
                    print i # trial print; note that each new printable
                            # character is echoed to the screen once
                    _goodchars[i]=i # safe character, save it as is
                except UnicodeEncodeError:
                    # format character as python string constant
                    code=ord(i)
                    if code < 256:
                        t='\\x%02x' % code # 8-bit value
                    elif code < 65536:
                        t='\\u%04x' % code # 16-bit value unicode
                    else:
                        t='\\U%08x' % code # other values as 32-bit unicode
                    _goodchars[i]=t # or '.' for readability ;-)
            res+=_goodchars[i]  # append to result
        print res

if __name__=='__main__':
    import codecs
    import sys

    reload(sys)

    # prepare my encodings
    sys.setdefaultencoding('cp1251')                  # default for implicit str/unicode conversion
    sys.stdout=codecs.getwriter('cp866')(sys.stdout)  # set DOS cyrillic codepage

    test_string='\xab'

    try:
        print 'Using print statement:', test_string
    except UnicodeEncodeError:
        print 'UnicodeEncodeError exception while using print!'
        
    print 'Using console():',
    console(test_string)
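
Why the test string triggers the error can be checked in isolation: byte 0xAB decodes under cp1251 to U+00AB (left-pointing guillemet), which has no mapping in cp866. The sketch below assumes Python 3, so the byte string is decoded explicitly instead of relying on the default encoding as the recipe does.

```python
# Assumption: Python 3, so the decode step is explicit here.
raw = b"\xab"                  # the recipe's test byte
char = raw.decode("cp1251")    # U+00AB, left-pointing guillemet

try:
    char.encode("cp866")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# ascii() keeps the output printable on any console
print(ascii(char), "encodable in cp866:", encodable)
```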