protect_utf8 is a function decorator that can prevent naive functions from breaking UTF-8.

def protect_utf8(wrapped_function, encoding='UTF-8'):

    """Temporarily convert a UTF-8 string to Unicode to prevent breakage.

    protect_utf8 is a function decorator that can prevent naive
    functions from breaking UTF-8.

    If the wrapped function takes a string, and that string happens to be valid
    UTF-8, convert it to a unicode object and call the wrapped function.  If a
    conversion was done and if a unicode object was returned, convert it back
    to a UTF-8 string.

    The wrapped function should take a string as its first parameter and it may
    return an object of the same type.  Anything else is optional.  For
    example:

        def truncate(s):
            return s[:1]
        truncate = protect_utf8(truncate)

    Pass "encoding" if you want to protect something other than UTF-8.

    Ideally, we'd have unicode objects everywhere, but sometimes life is not
    ideal. :)

    """

    def proxy_function(s, *args, **kargs):
        unconvert = False
        if isinstance(s, str):
            try:
                s = s.decode(encoding)
                unconvert = True
            except UnicodeDecodeError:
                # Not valid in this encoding; pass the byte string through.
                pass
        ret = wrapped_function(s, *args, **kargs)
        # Only convert back if we converted on the way in.
        if unconvert and isinstance(ret, unicode):
            ret = ret.encode(encoding)
        return ret

    return proxy_function


def truncate(s, length=1, etc="..."):

    """Truncate a string to the given length.

    If truncation is necessary, append the value of "etc".

    This is really just a silly test.

    """

    # Use <= so a string that already fits is returned without "etc".
    if len(s) <= length:
        return s
    else:
        return s[:length] + etc

truncate = protect_utf8(truncate)  # I'm stuck on Python 2.3.


if __name__ == '__main__':
    assert (truncate('\xe3\x82\xa6\xe3\x82\xb6\xe3\x83\x86', etc="") ==
            '\xe3\x82\xa6')
    assert truncate('abc') == 'a...'
    assert truncate(u'\u30a0\u30b1\u30c3', etc="") == u'\u30a0'

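The listing above is Python 2 code, where str holds bytes and unicode holds text. For readers on Python 3, here is a rough sketch of the same idea, assuming the byte strings arrive as bytes and text is str. The first_char helper and the use of functools.wraps are my own illustrative choices, not part of the original code.

```python
import functools


def protect_utf8(wrapped_function, encoding="utf-8"):
    """Python 3 sketch: decode bytes to str around the call, then re-encode."""

    @functools.wraps(wrapped_function)
    def proxy_function(s, *args, **kwargs):
        unconvert = False
        if isinstance(s, bytes):
            try:
                s = s.decode(encoding)
                unconvert = True
            except UnicodeDecodeError:
                # Not valid in this encoding; pass the bytes through untouched.
                pass
        ret = wrapped_function(s, *args, **kwargs)
        # Only re-encode if we decoded on the way in and got text back.
        if unconvert and isinstance(ret, str):
            ret = ret.encode(encoding)
        return ret

    return proxy_function


@protect_utf8
def first_char(s):
    """Hypothetical naive function: keep only the first element."""
    return s[:1]


# The first character is three bytes long and comes back whole.
print(first_char(b"\xe3\x82\xa6\xe3\x82\xb6"))
```

Bytes that fail to decode are handed to the wrapped function unchanged, which mirrors the Python 2 behavior of leaving non-UTF-8 strings alone.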
As I mentioned, ideally we'd have unicode objects floating all over the place instead of UTF-8 strings. However, sometimes you're stuck with UTF-8 strings. It's really easy to break them by doing things like truncating, which can cut a multi-byte character in half and leave invalid UTF-8 behind. This function decorator protects UTF-8 strings from naive code by handing the wrapped function a unicode object and encoding the result back afterward.
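To make the breakage concrete, here is a small sketch in Python 3 syntax (an assumption on my part, since the code above is Python 2). It compares slicing the raw bytes with slicing the decoded text, using the same sample bytes as the test above. The variable names are only illustrative.

```python
# Three characters, each encoded as three bytes in UTF-8.
data = b"\xe3\x82\xa6\xe3\x82\xb6\xe3\x83\x86"

# Naive truncation works on bytes and cuts the first character apart.
naive = data[:1]
try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# Decoding first means the slice counts characters, not bytes.
safe = data.decode("utf-8")[:1].encode("utf-8")

print(naive_ok)
print(safe)
```

The naive slice keeps one byte of a three-byte character, so it can no longer be decoded, while the decoded slice keeps the whole first character.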