
Simple reverse converter of unicode code points string (Python recipe) by Ryan
ActiveState Code (http://code.activestate.com/recipes/576909/)

This is a simple recipe for converting a str that contains only literal Unicode escape sequences (e.g. string = "\u5982\u679c\u7231") into a unicode string. The result is the same as writing the literal with a 'u' prefix, but unlike the prefix, it also works when the code point string is held in a variable rather than written as a constant.

def u_converter(string="\u5982\u679c\u7231"):
    """
    Convert a str that contains only literal '\uXXXX' escapes (that is, it
    has '\u' in the string but no 'u' prefix) to a unicode string.

    This has the same effect as the 'u' prefix, but it also works when the
    code point string is held in a variable rather than written as a constant.
    """
    # Python 2: in a plain str, "\\u" is a literal backslash followed by 'u'.
    chars = string.split("\\u")
    result = u''
    for char in chars:
        if len(char):
            try:
                ncode = int(char, 16)
            except ValueError:
                # Skip chunks that are not valid hexadecimal numbers.
                continue
            try:
                uchar = unichr(ncode)
            except ValueError:
                # Skip code points outside the range unichr() accepts.
                continue
            result += uchar
    return result


if __name__ == "__main__":
    pure_string = '\u9633\u5149\u707f\u70c2\u7684\u65e5\u5b50'
    print u_converter(pure_string)

Usually we can easily decode an encoded string (say, 'gbk' encoded) to a unicode string. Here, however, I want to convert a str made up of literal Unicode escapes (that is, it contains '\u' followed by a hexadecimal number but has no 'u' prefix, e.g. "\u5982\u679c\u7231") to a unicode string. If the str is a constant, adding a 'u' prefix is enough. If the str is held in a variable, neither the 'u' prefix nor the unicode function applies: the value is treated as a plain string. The key is to use the unichr function, which maps each code point to its character.
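The recipe targets Python 2, where unichr exists. As a rough sketch of the same idea under Python 3 (an assumption, not part of the original recipe), chr can play the role of unichr. The name u_converter_py3 and the regular expression are illustrative only, and unlike the recipe this version leaves any text that is not a four-digit escape untouched instead of dropping it.

```python
import re

# Matches a literal backslash, the letter 'u', and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def u_converter_py3(text):
    """Replace each literal \\uXXXX escape in text with the character it names."""
    # chr() is the Python 3 counterpart of Python 2's unichr().
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


# A raw string keeps the backslashes, mimicking a value read from a variable.
pure_string = r"\u5982\u679c\u7231"
converted = u_converter_py3(pure_string)
print(converted)
```

Because the conversion happens at run time, it works on any variable holding such a string, which is the case the 'u' prefix cannot cover.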

Tags: code, points, prefix, reverse, str, string, u, unicode

3 comments

Kent Johnson 14 years, 7 months ago

You have just duplicated what the 'unicode_escape' codec does.

In [1]: s="\u5982\u679c\u7231"

In [3]: s
Out[3]: '\\u5982\\u679c\\u7231'

In [5]: s.decode('unicode_escape')
Out[5]: u'\u5982\u679c\u7231'
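The session above is Python 2, where str has a decode method. In Python 3 a str has no decode, so a hedged equivalent is to encode to bytes first and then apply the same codec. This sketch assumes the input holds only ASCII characters (otherwise the encode step raises UnicodeEncodeError); the function name decode_escapes is hypothetical.

```python
def decode_escapes(s):
    """Decode literal \\uXXXX escapes in an ASCII-only str via 'unicode_escape'."""
    # Assumption: s contains only ASCII characters, so encoding cannot fail.
    return s.encode("ascii").decode("unicode_escape")


# Doubled backslashes give literal backslash-u sequences, as in Out[3] above.
s = "\\u5982\\u679c\\u7231"
decoded = decode_escapes(s)
print(decoded)
```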

Ryan (author) 14 years, 7 months ago

Thanks so much. I'm new to Python and was afraid there might be a built-in function or an easier way to do this job, but after searching and asking this question in several forums, I got no explicit answer.
'unicode_escape' can indeed handle this. Thanks again~

Gabriel Genellina 14 years, 7 months ago

The comp.lang.python newsgroup is a friendly place to ask such questions. See http://www.python.org/community/lists/

Created by Ryan on Tue, 22 Sep 2009 (MIT)


Required Modules

(none specified)

Other Information and Tasks

Licensed under the MIT License
Viewed 12731 times
Revision 4 (updated 14 years ago)
