While processing text I felt that some functions are missing, especially for international text. Here are some helpers.
# -*- coding: ISO-8859-1 -*-
import re
import string

_char_simple = "abcdefghijklmnopqrstuvwxyzaaaaaceeeeiiiioooooooouuuuyþ"
_char_lower = "abcdefghijklmnopqrstuvwxyzâãäåæçèéêëìíîïðñòóôõöøùúûüýþ"
_char_upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ"
_char_else = "0123456789ß×ÿ_"
_char_all = _char_lower + _char_upper + _char_else

_char_trans_lower = string.maketrans(_char_upper, _char_lower)
_char_trans_upper = string.maketrans(_char_lower, _char_upper)
_char_trans_simple = string.maketrans(_char_lower, _char_simple)

rx_ischar = re.compile("[^"+_char_all+"]*", re.DOTALL|re.MULTILINE)

def collapse(v):
    return " ".join(str(v).split()).strip()

def ilower(v):
    global _char_trans_lower
    return v.translate(_char_trans_lower)

def iupper(v):
    global _char_trans_upper
    return v.translate(_char_trans_upper)

def inormalize(v):
    global _char_trans_upper
    v = v.translate(_char_trans_lower)
    return v.translate(_char_trans_simple)

def iwordlist(v, lower=0, minlen=0, simple=0):
    global _char_trans_lower, rx_ischar
    if lower or simple:
        v = v.translate(_char_trans_lower)
    if simple:
        v = v.translate(_char_trans_simple)
    wlist = rx_ischar.split(v)
    wlist.remove('')
    if minlen:
        wlist = filter(lambda x: len(x)>=minlen, wlist)
    return wlist

if __name__=="__main__":
    text = "Däs Äst\t êine 1 2 xx yy zzz xx TÜÖST "
    print text.lower()
    print ilower(text)
    print iupper(text)
    print inormalize(text)
    print collapse(text)
    print iwordlist(text)
    print iwordlist(text, 1)
    print iwordlist(text, 1, 2)
    print iwordlist(text, 1, 3)
    print iwordlist(text, simple=1)
collapse() strips whitespace from the beginning and end of a string and leaves exactly one space between words.
The ilower() and iupper() functions also convert the case of special characters such as German umlauts.
inormalize() lowercases the text and then tries to convert all variants of special characters to plain Latin characters, e.g. "ä" to "a" and "é" to "e".
iwordlist() returns a list of words, e.g. to build a search index. Rather than splitting only on whitespace, it keeps runs of the defined word characters and discards all other characters as trash.
Random remarks. • You don't need a 'global' declaration just to read a global variable.
• wlist.remove('') in iwordlist(v) will fail with a ValueError when v both starts and ends with a word. When v both starts and ends with non-word chars, an empty string will remain in the word list.
• The .strip() method in collapse() has no effect, because split() without arguments already discards leading and trailing whitespace.
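The remarks above could be addressed roughly as follows. This is only a sketch, not the recipe author's code: it assumes Python 3, where str is a Unicode string, and it uses the Unicode-aware pattern \W+ as a stand-in for the recipe's rx_ischar, so the set of characters counted as word characters is not identical to _char_all.

```python
import re

# Stand-in for the recipe's rx_ischar (an assumption): one or more
# characters that are not Unicode word characters.
rx_nonword = re.compile(r"\W+")

def collapse(v):
    # split() without arguments already drops leading and trailing
    # whitespace, so no extra strip() is needed.
    return " ".join(str(v).split())

def iwordlist(v, minlen=0):
    # Drop every empty string instead of calling remove('') once, which
    # raises ValueError when there is none and leaves one behind when
    # there are two.
    words = [w for w in rx_nonword.split(v) if w]
    if minlen:
        words = [w for w in words if len(w) >= minlen]
    return words
```

No global declarations are needed here, because rx_nonword is only read inside the functions.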
In general, when you are dealing with non-ASCII characters you should use Python's Unicode strings. They often do the right thing out of the box.
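A small illustration of that point, assuming Python 3, where every str is a Unicode string (in Python 2 the same methods exist on u"..." literals). The sample text is borrowed from the recipe's own test string; the variable names are just for illustration.

```python
import re

text = "Däs Äst TÜÖST"

# Case conversion handles the umlauts without any translation table.
lowered = text.lower()
uppered = text.upper()

# \w is Unicode-aware for str patterns, so accented letters count as
# word characters.
words = re.findall(r"\w+", text)
```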
Stripping diacritics. From Aaron Bentley's comment in "The Unicode Hammer" recipe at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871
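The snippet from that comment is not reproduced here. As a hedged sketch of one common way to strip diacritics with the standard unicodedata module (Python 3 assumed; strip_diacritics is a hypothetical name, not taken from the referenced recipe):

```python
import unicodedata

def strip_diacritics(text):
    # NFKD decomposes each accented letter into a base letter followed
    # by one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no Unicode decomposition, such as "ø", "æ" or "ß", pass through unchanged, so this is not a drop-in replacement for the recipe's inormalize() table.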