
A simple recipe for converting a str that contains literal unicode escape sequences (e.g. string = "\u5982\u679c\u7231") into a unicode string. The result is the same as writing the literal with a 'u' prefix, but unlike the prefix, it also works when the escaped text is held in a variable rather than written as a constant.

Python, 27 lines
def u_converter(string="\u5982\u679c\u7231"):
    """
    Convert a str holding literal unicode escapes (it contains '\u'
    sequences but was written without the 'u' prefix) to a unicode string.

    The result is the same as writing the literal with a 'u' prefix, but
    this also works when the escaped text arrives in a variable.
    """
    result = u''
    # Splitting on the literal two-character sequence '\u' leaves the
    # hexadecimal digits of each code point.
    for chunk in string.split("\u"):
        if not chunk:
            continue
        try:
            result += unichr(int(chunk, 16))
        except ValueError:
            # Not a valid hexadecimal code point; skip it.
            continue
    return result


if __name__ == "__main__":
    pure_string = '\u9633\u5149\u707f\u70c2\u7684\u65e5\u5b50'
    print u_converter(pure_string)

Decoding an encoded string (say, 'gbk') to a unicode string is usually easy. Here, though, I want to convert a str that contains literal unicode escapes (that is, '\u' followed by hexadecimal digits, written without the 'u' prefix, e.g. "\u5982\u679c\u7231") to a unicode string. If the str is a constant, adding a 'u' prefix does it. If the str is held in a variable, neither the 'u' prefix nor the unicode function applies, and the text is treated as a plain string. The key is the unichr function, which maps each code point number to its character.
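For readers on Python 3, where unichr no longer exists and chr plays the same role, the same idea might look roughly like the sketch below. This is not the recipe's own code: the helper name decode_u_escapes and the regular expression are assumptions made for illustration, and the pattern only handles four-digit \uXXXX escapes. The last line shows the built-in 'unicode_escape' codec for comparison.

```python
import re

# Hypothetical helper, not part of the original recipe.
# The pattern matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text):
    # Replace each escape with the character for that code point;
    # any other text is left untouched.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# A raw literal keeps the backslashes, as a variable read from a file would.
raw = r"\u5982\u679c\u7231"
decoded = decode_u_escapes(raw)

# The built-in codec gives the same result for pure-ASCII input.
via_codec = raw.encode("ascii").decode("unicode_escape")
```

Unlike the split-based version, the regular expression leaves surrounding text in place instead of dropping chunks that fail to parse. The codec route is shorter, but the encode("ascii") step assumes the input contains only ASCII characters.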

3 comments

Kent Johnson 12 years, 2 months ago

You have just duplicated what the 'unicode_escape' codec does.

In [1]: s="\u5982\u679c\u7231"

In [3]: s
Out[3]: '\\u5982\\u679c\\u7231'

In [5]: s.decode('unicode_escape')
Out[5]: u'\u5982\u679c\u7231'

Ryan (author) 12 years, 2 months ago

Thanks so much. I'm new to Python and suspected there might be a built-in function or an easy way to do this, but after searching and asking in several forums I got no explicit answer.
'unicode_escape' can indeed handle this. Thanks again~

Gabriel Genellina 12 years, 2 months ago

The comp.lang.python newsgroup is a friendly place to ask such questions. See http://www.python.org/community/lists/

Created by Ryan on Tue, 22 Sep 2009 (MIT)

Required Modules

  • (none specified)
