You have some string input in which special characters are escaped using rules resembling Python's. For example, the two characters '\n' stand for the control character LF. You need to decode these escapes efficiently. Use Python's builtin codecs to do so.
>>> len('a\\nb')
4
>>> len('a\\nb'.decode('string_escape'))
3
>>>
Or, for unicode strings:
>>> len(u'\N{euro sign}\\nB')
4
>>> len(u'\N{euro sign}\\nB'.encode('utf-8').decode('string_escape').decode('utf-8'))
3
Compare this with the naive approach of decoding character escapes by writing
your own scanner in pure Python. For example:
def decode(s):
    # minimal state machine: map the character after '\\' to its meaning
    escapes = {'n': '\n', 't': '\t', 'r': '\r', '\\': '\\'}
    output = []
    iterator = iter(s)
    for c in iterator:
        if c == '\\':
            c = next(iterator, '')
            output.append(escapes.get(c, '\\' + c))
        else:
            output.append(c)
    return ''.join(output)
or
def decode(s):
    return s\
        .replace('\\n', '\n')\
        .replace('\\t', '\t')
        # ...and so on for the few escapes supported...
The naive approaches are expected to be much slower.
Python's builtin codecs not only decode various character encodings into unicode; they also include a number of codecs that perform useful transformations, such as base64 encoding. In this case, the 'string_escape' codec decodes string literals as they appear in Python source code. These builtin codecs are presumably highly optimized, and should be a lot more efficient than scanning the string character by character in pure Python.
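As an illustration of the transforming codecs mentioned above, here is a small sketch using the base64 codec through the uniform codecs.encode/codecs.decode interface (this works on both byte-oriented codecs and modern Python versions):

```python
import codecs

# Round-trip some bytes through the base64 transforming codec.
# codecs.encode/codecs.decode dispatch to the same machinery as
# str.encode/str.decode, but work for bytes-to-bytes codecs too.
encoded = codecs.encode(b'hello', 'base64')
decoded = codecs.decode(encoded, 'base64')
assert decoded == b'hello'
```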
In the case of unicode strings, there is a seemingly parallel 'unicode_escape' codec. However, applying it to a non-ASCII string runs into problems:
>>> len(u'\N{euro sign}\\nB'.decode('unicode_escape'))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 0: ordinal not in range(256)
The issue is that unlike 'string_escape', which converts from byte string to byte string, 'unicode_escape' decodes a byte string into unicode. If the operand is already unicode, Python first tries to encode it into a byte string using a default encoding (latin-1, as the traceback shows), and this fails in many cases.
Steven Bethard has proposed a three-step decoding on the comp.lang.python newsgroup http://groups.google.com/group/comp.lang.python/browse_frm/thread/2c695421c1697432/72a8619ee3fc2631?q=string_escape&rnum=2#72a8619ee3fc2631. It resolves the problem by first encoding the unicode string as UTF-8, then applying 'string_escape', and finally decoding it back from UTF-8. This procedure is shown in the second algorithm of the recipe.
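As an aside for readers on Python 3, where the 'string_escape' codec has been removed: a different round-trip achieves the same effect. This is not the recipe's algorithm, just a hedged sketch of the same idea in modern Python; it relies on 'backslashreplace' turning non-ASCII characters into \uXXXX escapes that 'unicode_escape' then restores:

```python
def decode_escapes(s):
    # Python 3 variant of the round-trip idea: latin-1 with
    # 'backslashreplace' keeps non-ASCII characters as \uXXXX escapes,
    # which the unicode_escape codec then decodes along with \n, \t, etc.
    return s.encode('latin-1', 'backslashreplace').decode('unicode_escape')

# Four characters in, three characters out: '\\n' becomes a real LF.
assert decode_escapes(u'\N{euro sign}\\nB') == u'\u20ac\nB'
```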
Due diligence on UTF-8 encoding
Before we call the problem solved, we should examine the possibility that UTF-8 encoding might introduce byte sequences that accidentally match Python string escapes and thus corrupt the output. A careful look at the mechanism of UTF-8 encoding assures us this will not happen.
From UTF-8 - Wikipedia http://en.wikipedia.org/wiki/Utf-8: "A byte sequence for one character never occurs as part of a longer sequence for another character. For instance, US-ASCII octet values do not appear otherwise in a UTF-8 encoded character stream."
From the Python documentation http://www.python.org/doc/current/ref/strings.html, all string escapes are defined using ASCII characters.
Since non-ASCII characters never appear in a UTF-8 encoded stream as ASCII octets, they cannot introduce extra Python escapes by accident.
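This property is easy to sanity-check: every byte of a UTF-8-encoded non-ASCII character has the high bit set, so none of them can collide with the ASCII characters used in escapes. A quick check over a few sample characters:

```python
# Every byte in the UTF-8 encoding of a non-ASCII character is >= 0x80,
# so no '\\', 'n', 't', etc. can appear inside a multi-byte sequence.
for ch in [u'\u20ac', u'\u5c6e', u'\N{snowman}']:
    for byte in bytearray(ch.encode('utf-8')):
        assert byte >= 0x80, (ch, byte)
```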
For comparison, notice how the valid unicode character \u5c6e would be mistaken for the Python escape \n in the UTF-16 encoding:
>>> u'\u5c6e'.encode('utf-16be')
'\\n'
>>> u'\u5c6e'.encode('utf-16be').decode('string_escape')
'\n'
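The collision above can be checked directly (shown here with Python 3 byte literals):

```python
# U+5C6E happens to encode in UTF-16BE as the bytes 0x5C 0x6E, which is
# exactly the two-character sequence '\' 'n' -- a spurious Python escape.
raw = u'\u5c6e'.encode('utf-16be')
assert raw == b'\\n'
assert bytes(bytearray([0x5c, 0x6e])) == raw
```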