Latest release: version 0.2.1 on May 21st, 2013

Zhon is a Python module that provides constants commonly used in Chinese text processing:

  • Chinese characters
  • Chinese punctuation
  • Pinyin and Zhuyin characters
  • Traditional and simplified characters
  • ASCII characters
  • Fullwidth alphanumeric variants
  • Chinese radicals (as used in dictionaries)

Zhon's constants are strings containing Unicode code point ranges, which makes them convenient building blocks for regular expression (RE) character classes. They can be combined as needed when compiling RE pattern objects.
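
As a minimal sketch of how such range strings combine, the following uses two hand-written stand-ins rather than Zhon's real constants. The names ``HAN_STANDIN`` and ``PUNCT_STANDIN`` and their ranges are assumptions chosen for illustration; they cover only the basic CJK Unified Ideographs block and the CJK Symbols and Punctuation block. The sketch assumes Python 3.

.. code:: python

    import re

    # Hypothetical stand-ins, not Zhon's actual values.
    HAN_STANDIN = '\u4e00-\u9fff'    # CJK Unified Ideographs (basic block only)
    PUNCT_STANDIN = '\u3000-\u303f'  # CJK Symbols and Punctuation

    # A single range string dropped into a character class.
    han_re = re.compile('[%s]' % HAN_STANDIN)

    # Two range strings concatenated inside one negated character class.
    not_zh_re = re.compile('[^%s%s]' % (HAN_STANDIN, PUNCT_STANDIN))

    han_chars = han_re.findall('Hello = 你好')
    others = not_zh_re.findall('你好。OK')

Because the constants are plain strings, concatenating them inside the square brackets is enough to build a union or, with a leading ``^``, its complement.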

Finding Chinese characters in a string:

.. code:: python

    >>> import re
    >>> import zhon.unicode
    >>> re.findall('[%s]' % zhon.unicode.HAN_IDEOGRAPHS, 'Hello = 你好')
    ['你', '好']

Splitting Chinese text on punctuation:

.. code:: python

    >>> re.split('[%s]' % zhon.unicode.PUNCTUATION, '有人丢失了一把斧子，怎么找也没有找到。')
    ['有人丢失了一把斧子', '怎么找也没有找到', '']
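
The trailing empty string appears because the text ends with a punctuation mark. A small sketch of one way to drop such empty pieces follows. ``PUNCT_STANDIN`` is a hypothetical stand-in covering two Unicode blocks, not Zhon's actual ``PUNCTUATION`` value, and the sketch assumes Python 3.

.. code:: python

    import re

    # Hypothetical stand-in: CJK Symbols and Punctuation plus
    # Halfwidth and Fullwidth Forms. Zhon's real constant may differ.
    PUNCT_STANDIN = '\u3000-\u303f\uff00-\uffef'

    text = '有人丢失了一把斧子，怎么找也没有找到。'

    # Split on runs of punctuation, then discard empty strings.
    pieces = re.split('[%s]+' % PUNCT_STANDIN, text)
    clauses = [piece for piece in pieces if piece]

Using ``+`` after the character class also keeps consecutive punctuation marks from producing empty pieces in the middle of the result.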

Finding all non-Chinese characters in a string:

.. code:: python

    >>> not_zh_re = re.compile('[^%s%s]' % (zhon.unicode.HAN_IDEOGRAPHS, zhon.unicode.PUNCTUATION))
    >>> not_zh_re.findall('我叫Thomas。你叫什么名字？')
    ['T', 'h', 'o', 'm', 'a', 's']

RE pattern objects for matching Pinyin:

.. code:: python

    >>> import zhon.pinyin
    >>> pinyin_a_re = re.compile(zhon.pinyin.RE_ACCENT, re.I | re.X)
    >>> pinyin_a_re.findall('Yǒurén diūshīle yī bǎ fǔzi, zěnme zhǎo yě méiyǒu zhǎodào.')
    ['Yǒu', 'rén', ' ', 'diū', 'shī', 'le', ' ', 'yī', ' ', 'bǎ', ' ', 'fǔ', 'zi', ',', ' ', 'zěn', 'me', ' ', 'zhǎo', ' ', 'yě', ' ', 'méi', 'yǒu', ' ', 'zhǎo', 'dào', '.']

Overview

zhon.unicode.HAN_IDEOGRAPHS
    This represents every Chinese character (including historic and rare characters). HAN_IDEOGRAPHS includes CJK Unified Ideographs, CJK Unified Ideographs Extensions A-D, CJK Compatibility Ideographs, CJK Compatibility Ideographs Supplement, and the extension to the URO. More information is available in Chapter 12 of the Unicode Standard.

zhon.unicode.PUNCTUATION
    This contains punctuation used in Chinese text.

zhon.unicode.PINYIN
    This contains characters used in Pinyin (both numbered and accented). It can be used to quickly verify that text contains only Pinyin-compatible characters. It does not check for invalid syllables or spelling -- it only contains individual characters.

zhon.unicode.ZHUYIN
    This contains characters used in Zhuyin (Bopomofo).

zhon.unicode.ASCII
    This contains all ASCII characters.

zhon.unicode.FULLWIDTH_ALPHANUMERIC
    This contains the fullwidth variants of A-Z, a-z, and 0-9.

zhon.unicode.RADICALS
    This contains the Kangxi radicals and the CJK Radicals Supplement. They are used in dictionaries to index characters.

zhon.pinyin.RE_NUMBER
    This is an uncompiled regular expression pattern that matches valid numbered Pinyin. It also matches punctuation and whitespace. When compiling, make sure to use the re.X and re.I flags.

zhon.pinyin.RE_ACCENT
    This is an uncompiled regular expression pattern that matches valid accented Pinyin. It also matches punctuation and whitespace. When compiling, make sure to use the re.X and re.I flags.

zhon.cedict.TRADITIONAL
    This contains characters considered by CC-CEDICT to be traditional.

zhon.cedict.SIMPLIFIED
    This contains characters considered by CC-CEDICT to be simplified.
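
The character-level check described for ``zhon.unicode.PINYIN`` can be sketched as follows. ``PINYIN_STANDIN`` and ``looks_like_pinyin`` are hypothetical names introduced here for illustration, and the character set is an assumption rather than Zhon's actual value. The sketch assumes Python 3.

.. code:: python

    import re

    # Hypothetical stand-in for a set of Pinyin-compatible characters:
    # ASCII letters, digits, and lowercase vowels with tone marks.
    PINYIN_STANDIN = 'a-zA-Z0-9üÜāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ'

    # Match only if every character is in the set or is whitespace.
    only_pinyin_re = re.compile('[%s\\s]+\\Z' % PINYIN_STANDIN)

    def looks_like_pinyin(text):
        """Return True if text uses only characters from the stand-in set."""
        return only_pinyin_re.match(text) is not None

    accented = looks_like_pinyin('nǐ hǎo')
    numbered = looks_like_pinyin('ni3 hao3')
    chinese = looks_like_pinyin('你好')
    nonsense = looks_like_pinyin('zzzz')

Here ``nonsense`` is also ``True``, which illustrates the limitation noted above: a character set can reject text containing other scripts, but it cannot tell a valid syllable from an invalid one.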

Narrow Python Builds

If you have a narrow Python 2 build and run the following code, a ValueError is raised:

.. code:: python

    >>> unichr(0x20000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Narrow Python 3.1/3.2 builds have problems compiling RE pattern objects that use character ranges above 0xFFFF:

.. code:: python

    >>> re.compile('[\U00020000-\U00020005]')
    Traceback (most recent call last):
    ...
    sre_constants.error: bad character range

Narrow Python builds incorrectly handle the character U+20000 and others like it. Zhon takes this into account when building its constants so that you don't have to worry about it -- characters with code points above your Python build's sys.maxunicode are not included in Zhon's constants.
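
One way such filtering could work is sketched below. This is not Zhon's actual implementation: ``build_range_string`` and ``RANGES`` are hypothetical, and the two ranges are only illustrative. The sketch assumes Python 3, where ``chr`` replaces ``unichr``.

.. code:: python

    import re
    import sys

    def build_range_string(ranges, max_code_point=sys.maxunicode):
        """Join (start, end) code point pairs into a character class body,
        leaving out anything above max_code_point."""
        parts = []
        for start, end in ranges:
            if start > max_code_point:
                continue  # the whole range is out of reach for this build
            end = min(end, max_code_point)  # clamp a range that straddles the limit
            parts.append('%s-%s' % (chr(start), chr(end)))
        return ''.join(parts)

    # Illustrative ranges: the basic CJK Unified Ideographs block and
    # CJK Unified Ideographs Extension B.
    RANGES = [(0x4E00, 0x9FFF), (0x20000, 0x2A6DF)]

    # Simulate a narrow build by passing its limit explicitly.
    narrow = build_range_string(RANGES, max_code_point=0xFFFF)

    # Use the running interpreter's own limit.
    full = build_range_string(RANGES)
    han_re = re.compile('[%s]' % full)
    found = han_re.findall('a你\U00020001')

From Python 3.3 onward, ``sys.maxunicode`` is always 0x10FFFF, so the default call keeps both ranges; only the simulated narrow limit drops the second one.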

Name

Zhon is short for ZHongwen cONstants. It is pronounced like the name 'John'.

Requirements

Zhon supports Python 2.6, 2.7, 3.1, 3.2, and 3.3.

Install

Just use pip:

.. code:: bash

    $ pip install zhon


Bugs/Feature Requests

Zhon uses its GitHub Issues page to track bugs, feature requests, and support questions.

License

Zhon is released under the OSI-approved MIT License. See the file LICENSE.txt for more information.
