ActiveState Code

Recipe 576909: Simple reverse converter of unicode code points string


This is a simple recipe for converting a str that contains only unicode code point escapes (e.g. string = "\u5982\u679c\u7231") to a unicode string. It has the same effect as the 'u' prefix, but unlike the prefix it also works when the escaped string is held in a variable rather than written as a constant.

Python
def u_converter(string="\u5982\u679c\u7231"):
    """
    Simple handler for converting a str that holds pure unicode code
    point escapes (that is, it has '\u' in the string but no 'u' prefix)
    to a unicode string.

    This has the same effect as the 'u' prefix, but it also lets you pass
    a variable holding such a string, not only a constant.
    """
    # Python 2 only: in a plain str literal, "\u" is a backslash followed by 'u'.
    chars = string.split("\u")
    result = u''
    for char in chars:
        if not char:
            continue
        try:
            # Each piece should be a hexadecimal code point.
            result += unichr(int(char, 16))
        except ValueError:
            # Skip pieces that are not valid hex or are out of range.
            continue
    return result


if __name__ == "__main__":
    pure_string = '\u9633\u5149\u707f\u70c2\u7684\u65e5\u5b50'
    print u_converter(pure_string)
    

Discussion

Usually we can easily decode an encoded string (say, 'gbk' encoded) to a unicode string. Here, however, I want to convert a str that holds pure unicode code point escapes (that is, it contains '\u' followed by a hexadecimal number but has no 'u' prefix, e.g. "\u5982\u679c\u7231") to a unicode string. If the str is a constant, just adding a 'u' prefix will do. If the str is a variable, neither the 'u' prefix nor the unicode function applies, because the content is treated as a plain string. The key is to split out each code point and pass it to the unichr function.
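
The recipe targets Python 2, where unichr and the separate str/unicode types exist. As a minimal sketch of the same split-and-convert idea on Python 3 (an assumption, since the recipe itself is Python 2 only), chr takes the place of unichr, and the input is written as a raw string so the escapes stay literal. The name u_converter_py3 is hypothetical and not part of the recipe.

```python
def u_converter_py3(text):
    """Python 3 sketch: turn literal backslash-u escapes into characters."""
    result = []
    # Split on the two-character sequence backslash + 'u'.
    for piece in text.split("\\u"):
        if not piece:
            continue
        try:
            # Each piece is expected to be a hexadecimal code point.
            result.append(chr(int(piece, 16)))
        except (ValueError, OverflowError):
            # Like the recipe, skip pieces that are not valid code points.
            continue
    return "".join(result)


print(u_converter_py3(r"\u0048\u0069"))  # prints "Hi"
```

Calling u_converter_py3(r"\u5982\u679c\u7231") should give the same three characters that the literal "\u5982\u679c\u7231" produces on Python 3.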

Comments

  1. At 7:30 a.m. on 23 Sep 2009, Kent Johnson said:

    You have just duplicated what the 'unicode_escape' codec does.

    In [1]: s="\u5982\u679c\u7231"
    
    In [3]: s
    Out[3]: '\\u5982\\u679c\\u7231'
    
    In [5]: s.decode('unicode_escape')
    Out[5]: u'\u5982\u679c\u7231'
    
  2. At 3:26 a.m. on 24 Sep 2009, ryan std (the author) said:

    Thanks so much. I'm new to Python and suspected there might be a built-in function or an easy way to do this job, but after searching and asking in several forums I got no explicit answer.
    'unicode_escape' can indeed handle this. Thanks again~

  3. At 12:26 a.m. on 25 Sep 2009, Gabriel Genellina said:

    The comp.lang.python newsgroup is a friendly place to ask such questions. See http://www.python.org/community/lists/
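
The 'unicode_escape' suggestion in the first comment is written for Python 2, where str has a decode method. On Python 3 a rough equivalent, sketched here under the assumption that the input is pure ASCII (only escapes, as in the recipe), is to encode to bytes first and then decode with the same codec. The helper name decode_escapes is hypothetical.

```python
def decode_escapes(text):
    """Decode literal backslash-u escapes with the 'unicode_escape' codec."""
    # Assumes text is pure ASCII; non-ASCII input would fail in encode().
    return text.encode("ascii").decode("unicode_escape")


print(decode_escapes(r"\u0048\u0069"))  # prints "Hi"
```

Unlike the recipe's loop, which silently skips pieces it cannot parse, the codec raises an error on a malformed escape, so the two are not interchangeable for dirty input.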
