ActiveState Code

Recipe 358117: unicode word-wrap function


There are many cool word-wrap recipes, but most don't support unicode, such as Chinese characters. Note: you should use python2.4 to test the recipe for the 'gbk' encoding.

Python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
# This recipe refers:
#
#  http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/148061

import re
rx=re.compile(u"([\u2e80-\uffff])", re.UNICODE)

def cjkwrap(text, width, encoding="utf8"):
     return reduce(lambda line, word, width=width: '%s%s%s' %              
                (line,
                 [' ','\n', ''][(len(line)-line.rfind('\n')-1
                       + len(word.split('\n',1)[0] ) >= width) or
                      line[-1:] == '\0' and 2],
                 word),
                rx.sub(r'\1\0 ', unicode(text,encoding)).split(' ')
            ).replace('\0', '').encode(encoding)

# Here is the Chese test message, it is in Chinese 'gbk' encoding
# Since this site doesn't support Chinese, I have to use hex format :(
msg='\xce\xd2\xc3\xc7\xd7\xd4\xbc\xba\xbf\xc9\xd2\xd4\xb5\xc4\xa3\xac\
\xb2\xbb\xca\xc7\xc2\xf0? whay \xce\xd2\xc3\xc7\xd5\xfd\xb5\xc4\n \
\xd2\xaa\xc7\xf3\xca\xc7\xca\xb2\xc3\xb4\xa1\xa2how to dothat? no \
problem? !! \
\xb5\xab\xca\xc7\xd6\xd0\xce\xc4\xbe\xcd\xb2\xbb\xcd\xac\xc1\xcb\xa3\
\xac\xd2\xbb\xb8\xf6\xba\xba\xd7\xd6\xb5\xc4\xbf\xd5\xbc\xe4\xd5\xbc\
\xd3\xc3\xb5\xc8\xd3\xda2\xb8\xf6\xd3\xa2\xce\xc4\xa1\xa3\xba\xba\xd7\
\xd6\xb5\xc4\xb7\xd6\xb4\xca\xa3\xac\xca\xc7\xc3\xbb\xd3\xd0\xbf\xd5\
\xb8\xf1\xb5\xc4\xa1\xa3\n\n \
\xbd\xe8\xbc\xf8CJKSplitter\xb5\xc4\xbf\xaa\xb7\xa2\xbe\xad\xd1\xe9\
\xa3\xac\xce\xd2\xd7\xbc\xb1\xb8\xd2\xb2\xd7\xf6\xd2\xbb\xb8\xf6\xd6\
\xd0\xce\xc4\xd5\xdb\xd0\xd0\xcb\xe3\xb7\xa8\xa3\xa8\xd3\xa6\xb8\xc3\
\xd2\xb2\xba\xdc\xc8\xdd\xd2\xd7\xd6\xa7\xb3\xd6\xc8\xd5\xba\xab\xce\
\xc4\xd7\xd6\xa3\xa9\xa3\xac\xb5\xb1\xc8\xbb\xb2\xbb\xbf\xc9\xc4\xdc\
\xd3\xd0\xd3\xa2\xce\xc4\xb5\xc4\xc4\xc7\xc3\xb4\xb8\xdf\xd0\xa7\xc1\
\xcb\xa3\xba\n \
\xd3\xa2\xce\xc4\xca\xc7ascii\xa3\xac\xc3\xbf\xb8\xf6\xd7\xd6\xb7\xfb\
\xd5\xbc\xd3\xc3\xbf\xd5\xbc\xe4\xcf\xe0\xcd\xac\xa3\xac\xd6\xb1\xbd\
\xd3\xca\xb9\xd3\xc3\xbf\xd5\xb8\xf1\xbe\xcd\xbf\xc9\xd2\xd4\xb7\xd6\
\xb4\xca\xa1\xa3\xd2\xf2\xb4\xcb\xd3\xa2\xce\xc4\xb0\xe6\xb1\xbe\xb5\
\xc4\xd5\xdb\xd0\xd0\xcb\xe3\xb7\xa8\xd4\xdapython \
cookbook\xc9\xcf\xd3\xd0\xba\xdc\xbe\xad\xb5\xe4\xb5\xc4\xb8\xdf\xd0\
\xa7\xcb\xe3\xb7\xa8\xa3\xba\n\n one-liner word-wrap function'

# ok, now I print it the real msg here, you will see some long lines
print msg

# example: make it fit in 50 columns, and use 'gbk' encoding
print cjkwrap(msg, 50, 'gbk')

# you will see the correct wraped lines ...

Discussion

Chinese character is wider than the ascii charactor(double) and don't use white space to split words. So we have to use another way.

Comments

  1. 1. At 4:39 a.m. on 10 dec 2004, Junyong Pan (the author) said:

    Chinese version of this recipe. Chiese version of this recipe is here:

    http://blog.czug.org/panjunyong/mail-line-wrap

Sign in to comment