
While processing text, I found that some functions were missing, especially for international text. Here are some helpers.

Python, 56 lines
# -*- coding: ISO-8859-1 -*-
import re
import string

_char_simple = "abcdefghijklmnopqrstuvwxyzaaaaaceeeeiiiidnoooooouuuuyþ"
_char_lower  = "abcdefghijklmnopqrstuvwxyzâãäåæçèéêëìíîïðñòóôõöøùúûüýþ"
_char_upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ"
_char_else = "0123456789ß×ÿ_"
_char_all = _char_lower + _char_upper + _char_else

_char_trans_lower = string.maketrans(_char_upper, _char_lower)
_char_trans_upper = string.maketrans(_char_lower, _char_upper)
_char_trans_simple = string.maketrans(_char_lower, _char_simple)

rx_ischar = re.compile("[^"+_char_all+"]*", re.DOTALL|re.MULTILINE)

def collapse(v):
    return " ".join(str(v).split()).strip()

def ilower(v):
    global _char_trans_lower
    return v.translate(_char_trans_lower)

def iupper(v):
    global _char_trans_upper
    return v.translate(_char_trans_upper)

def inormalize(v):
    global _char_trans_upper 
    v = v.translate(_char_trans_lower)
    return v.translate(_char_trans_simple)

def iwordlist(v, lower=0, minlen=0, simple=0):
    global _char_trans_lower, rx_ischar
    if lower or simple:
        v = v.translate(_char_trans_lower)
    if simple:
        v = v.translate(_char_trans_simple)
    wlist = rx_ischar.split(v)
    wlist.remove('')
    if minlen:
        wlist = filter(lambda x: len(x)>=minlen, wlist)
    return wlist

if __name__=="__main__":
    text = "Däs Äst\t êine 1  2 xx yy zzz xx TÜÖST "
    print text.lower()
    print ilower(text)
    print iupper(text)
    print inormalize(text)
    print collapse(text)
    print iwordlist(text)
    print iwordlist(text, 1)
    print iwordlist(text, 1, 2)
    print iwordlist(text, 1, 3)
    print iwordlist(text, simple=1)

collapse() strips whitespace from the beginning and end of a string and leaves exactly one space between words.
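As a minimal sketch of the same idea in Python 3 (the recipe itself targets Python 2): `str.split()` with no arguments already splits on any run of whitespace and ignores leading and trailing whitespace, so joining the pieces with a single space is enough.

```python
def collapse(value):
    # split() without arguments splits on runs of whitespace and drops
    # leading/trailing whitespace, so no extra strip() is required.
    return " ".join(str(value).split())


print(collapse("  Däs Äst\t êine 1  2 "))
```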

The ilower() and iupper() functions also change the case of special characters such as German umlauts.

inormalize() tries to convert accented variants of characters to their plain Latin base letters, for example "ä" to "a" and "é" to "e".
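A rough Python 3 sketch of the same table-based idea, not the recipe's own code: the names `_GROUPS` and `_SIMPLIFY` are made up for illustration, and the character groups are an assumption about which letters should be folded. Building the table from a base letter to its variants avoids having to keep two long strings aligned position by position.

```python
# Hypothetical grouping: base letter -> accented variants to fold into it.
_GROUPS = {
    "a": "àáâãäå",
    "c": "ç",
    "e": "èéêë",
    "i": "ìíîï",
    "n": "ñ",
    "o": "òóôõöø",
    "u": "ùúûü",
    "y": "ýÿ",
}

# str.maketrans accepts a dict mapping single characters to replacements.
_SIMPLIFY = str.maketrans(
    {accented: base for base, variants in _GROUPS.items() for accented in variants}
)


def inormalize(text):
    # Lowercase first (str.lower handles non-ASCII letters in Python 3),
    # then fold the accented letters onto their base letters.
    return text.lower().translate(_SIMPLIFY)


print(inormalize("Däs Äst êine TÜÖST"))
```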

iwordlist() returns a list of words, e.g. to build a search index. Rather than splitting only on whitespace, it keeps runs of the defined word characters and discards everything else.
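One possible Python 3 variant, offered as a sketch rather than a drop-in replacement: instead of splitting on non-word characters, it collects the word runs directly with `findall`. It relies on `\w`, which in Python 3 matches Unicode letters, digits and the underscore, so its character set is not identical to the recipe's `_char_all` (for instance, "×" is not matched).

```python
import re

# Assumption: "word" means a run of Unicode letters, digits or "_".
_WORD = re.compile(r"\w+")


def iwordlist(text, lower=False, minlen=0):
    if lower:
        text = text.lower()
    # findall never yields empty strings, so no cleanup step is needed.
    return [word for word in _WORD.findall(text) if len(word) >= minlen]


text = "Däs Äst\t êine 1  2 xx yy zzz xx TÜÖST "
print(iwordlist(text))
print(iwordlist(text, lower=True, minlen=3))
```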

2 comments

Peter Otten 17 years, 10 months ago

Random remarks. • You don't need a 'global' declaration just to read a global variable.

• wlist.remove("") in iwordlist(v) will fail when v both starts and ends with a word. When v both starts and ends with non-word chars, an empty string will remain in the word list.

• The .strip() method in collapse() has no effect.
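The remark about wlist.remove("") can be reproduced with a small Python 3 sketch. The pattern below is a simplified stand-in for rx_ischar (ASCII lowercase only), and it uses `+` rather than `*` because Python 3.7 and later also split on empty matches; `naive_words` and `safe_words` are illustrative names, not part of the recipe.

```python
import re

# Simplified stand-in for rx_ischar: anything that is not a-z separates words.
splitter = re.compile(r"[^a-z]+")


def naive_words(text):
    words = splitter.split(text)
    words.remove("")  # removes only the first "", raises ValueError if none
    return words


def safe_words(text):
    # Filtering drops every empty string and never raises.
    return [word for word in splitter.split(text) if word]


try:
    naive_words("abc def")
except ValueError:
    print("remove('') failed: no empty string in the list")

print(naive_words(" abc def "))
print(safe_words(" abc def "))
```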

In general, when you are dealing with non-ascii characters you should use Python's unicode strings. They often do the right thing out of the box:

>>> print u"σιγμα".upper() # sigma in greek letters
ΣΙΓΜΑ # SIGMA in greek letters

Andrew Dalke 17 years, 10 months ago

Stripping diacritics. From Aaron Bentley's comment in "the Unicode Hammer" recipe at

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871

>>> s=u"\N{LATIN CAPITAL LETTER A WITH ACUTE}"
>>> s
u'\xc1'
>>> import unicodedata
>>> unicodedata.normalize('NFKD', s).encode('ASCII', 'ignore')
'A'
>>> s=u"G\N{LATIN SMALL LETTER O WITH DIAERESIS}teborg - Espa\N{LATIN SMALL LETTER N WITH TILDE}a"
>>> unicodedata.normalize('NFKD', s).encode('ASCII', 'ignore')
'Goteborg - Espana'
>>>
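The session above is Python 2. A hedged Python 3 equivalent might look like the following, where `strip_diacritics` is an illustrative name. One caveat: characters with no decomposition into a base letter plus combining mark, such as "ß" or "ø", are dropped entirely rather than transliterated.

```python
import unicodedata


def strip_diacritics(text):
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then discards the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_diacritics("Göteborg - España"))
```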
Created by Dirk Holtwick on Tue, 30 May 2006 (PSF)