A function that analyzes an open XML file for its character encoding by
- checking for a Unicode BOM, or (on failure)
- searching the XML declaration at the beginning of the file for the "encoding" attribute
def detectXMLEncoding(fp):
    """ Attempts to detect the character encoding of the xml file
    given by a file object fp. fp must not be a codec wrapped file
    object!

    The return value can be:
        - if detection of the BOM succeeds, the codec name of the
          corresponding unicode charset is returned
        - if BOM detection fails, the xml declaration is searched for
          the encoding attribute and its value returned. the "<"
          character has to be the very first in the file then (it's xml
          standard after all).
        - if BOM and xml declaration fail, None is returned. According
          to xml 1.0 it should be utf_8 then, but it wasn't detected by
          the means offered here. at least one can be pretty sure that a
          character coding including most of ASCII is used :-/
    """
    ### detection using BOM

    ## the BOMs we know, by their pattern
    bomDict = { # bytepattern : name
        (0x00, 0x00, 0xFE, 0xFF): "utf_32_be",
        (0xFF, 0xFE, 0x00, 0x00): "utf_32_le",
        (0xFE, 0xFF, None, None): "utf_16_be",
        (0xFF, 0xFE, None, None): "utf_16_le",
        (0xEF, 0xBB, 0xBF, None): "utf_8",
        }

    ## go to beginning of file and get the first 4 bytes
    oldFP = fp.tell()
    fp.seek(0)
    (byte1, byte2, byte3, byte4) = tuple(map(ord, fp.read(4)))

    ## try bom detection using 4 bytes, 3 bytes, or 2 bytes
    bomDetection = bomDict.get((byte1, byte2, byte3, byte4))
    if not bomDetection:
        bomDetection = bomDict.get((byte1, byte2, byte3, None))
        if not bomDetection:
            bomDetection = bomDict.get((byte1, byte2, None, None))

    ## if BOM detected, we're done :-)
    if bomDetection:
        fp.seek(oldFP)
        return bomDetection

    ## still here? BOM detection failed.
    ## now that BOM detection has failed we assume one byte character
    ## encoding behaving ASCII - of course one could think of nice
    ## algorithms further investigating on that matter, but I won't for now.

    ### search xml declaration for encoding attribute
    import re

    ## assume xml declaration fits into the first 2 KB (*cough*)
    fp.seek(0)
    buffer = fp.read(2048)

    ## set up regular expression
    xmlDeclPattern = r"""
    ^<\?xml             # w/o BOM, xmldecl starts with <?xml at the first byte
    .+?                 # some chars (version info), matched minimal
    encoding=           # encoding attribute begins
    ["']                # attribute start delimiter
    (?P<encstr>         # what's matched in the brackets will be named encstr
     [^"']+             # every character not delimiter (not overly exact!)
    )                   # closes the brackets pair for the named group
    ["']                # attribute end delimiter
    .*?                 # some chars optionally (standalone decl or whitespace)
    \?>                 # xmldecl end
    """
    xmlDeclRE = re.compile(xmlDeclPattern, re.VERBOSE)

    ## search and extract encoding string
    match = xmlDeclRE.search(buffer)
    fp.seek(oldFP)
    if match:
        return match.group("encstr")
    else:
        return None
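For completeness, a minimal usage sketch (the file name "example.xml" is made up, and Python 2 is assumed, where read() on a plain file returns byte strings): open the file in binary mode so the function sees raw bytes, then feed whatever name it returns to the codecs module.

import codecs

# "example.xml" is just a placeholder; binary mode is essential, since
# detectXMLEncoding() must see the raw bytes, not decoded characters
fp = open("example.xml", "rb")
encoding = detectXMLEncoding(fp)
fp.close()

# fall back to UTF-8, the XML 1.0 default, if nothing was detected
if encoding is None:
    encoding = "utf_8"

# reopen the file codec-wrapped to read properly decoded text
text = codecs.open("example.xml", "rb", encoding).read()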
If you have to handle XML files without knowing their character encoding a priori, you need some means of discovering it; this function provides two: the Unicode BOM and the encoding attribute of the XML declaration (the latter is of course only interesting when no BOM-marked Unicode encoding is used).
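To illustrate the two detection paths, a small sketch (again assuming Python 2, with StringIO objects standing in for real files; both inputs are made up): the first stream starts with a UTF-16 little endian BOM, the second announces its encoding only in the XML declaration.

from StringIO import StringIO

# a BOM-marked file is recognized by its very first bytes (here 0xFF 0xFE) ...
bomFile = StringIO('\xff\xfe' + u'<?xml version="1.0"?>'.encode('utf_16_le'))
print(detectXMLEncoding(bomFile))     # -> utf_16_le

# ... while a latin-1 file can only be recognized by its declaration
declFile = StringIO('<?xml version="1.0" encoding="iso-8859-1"?><doc/>')
print(detectXMLEncoding(declFile))    # -> iso-8859-1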
This recipe is not entirely new; I took Paul Prescod's recipe at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52257 and improved bits here and there.
Known issues:
- If BOM detection fails, the function assumes a character encoding that behaves like ASCII, at least within the XML declaration. Most character coding schemes do, but of course not all.
- The function assumes that the XML declaration fits into the first 2048 bytes of the file. This should hardly ever be an issue; in theory the declaration can be stretched indefinitely with whitespace, but... cough.
- The regular expression matches "anything but an attribute delimiter" for the encoding string, and I didn't test it against malformed byte patterns. It will handle halfway correct XML declarations well but may behave oddly on a broken one. One COULD improve the regex for bad XML headers; a small step in that direction is sketched below.
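Only as a sketch (this pattern is not part of the original recipe and has not been tested against really broken input): allow the whitespace around "=" that XML 1.0 actually permits, and force the closing quote to match the opening one with a backreference. It would be a drop-in replacement for xmlDeclPattern inside the function.

xmlDeclPattern = r"""
^<\?xml             # xmldecl still has to start at the very first byte
.+?                 # some chars (version info), matched minimal
encoding\s*=\s*     # encoding attribute; whitespace around = is legal XML
(?P<quote>["'])     # opening delimiter, remembered for the closing match
(?P<encstr>[^"']+)  # the encoding name itself
(?P=quote)          # closing delimiter must equal the opening one
.*?                 # some chars optionally (standalone decl or whitespace)
\?>                 # xmldecl end
"""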