
The XML specification (Appendix F) outlines an algorithm for detecting the character encoding of an XML document from its first few bytes. This function implements that outline.

"""Caller will hand this library a buffer and ask it to either convert
it or auto-detect the type."""

import codecs

# None represents a potentially variable byte. "##" in the XML spec...
# The spec calls the 4-byte schemes UCS-4; Python names those codecs utf_32_*.
autodetect_dict={ # bytepattern            : "name"
                (0x00, 0x00, 0xFE, 0xFF) : "utf_32_be", # UCS-4 BOM, big-endian
                (0xFF, 0xFE, 0x00, 0x00) : "utf_32_le", # UCS-4 BOM, little-endian
                (0xFE, 0xFF, None, None) : "utf_16_be", # UTF-16 BOM
                (0xFF, 0xFE, None, None) : "utf_16_le", # UTF-16 BOM
                (0x00, 0x3C, 0x00, 0x3F) : "utf_16_be", # "<?" without a BOM
                (0x3C, 0x00, 0x3F, 0x00) : "utf_16_le", # "<?" without a BOM
                (0x3C, 0x3F, 0x78, 0x6D) : "utf_8",     # "<?xm"
                (0x4C, 0x6F, 0xA7, 0x94) : "EBCDIC",    # "<?xm" in EBCDIC
                 }

def autoDetectXMLEncoding(buffer):
    """ buffer -> encoding_name
    The buffer should be a byte string at least 4 bytes long.
        Returns "utf_8" (the XML default) if nothing more specific
        can be detected.
        Note that encoding_name might not have an installed
        decoder (e.g. EBCDIC)
    """
    # a more efficient implementation would not decode the whole
    # buffer at once but otherwise we'd have to decode a character at
    # a time looking for the quote character...that's a pain

    encoding = "utf_8" # according to the XML spec, this is the default
                          # this code successively tries to refine the default
                          # whenever it fails to refine, it falls back to
                          # the last place encoding was set.
    # bytearray() yields integers on both Python 2 and Python 3
    first_bytes = (byte1, byte2, byte3, byte4) = tuple(bytearray(buffer[0:4]))
    enc_info = autodetect_dict.get(first_bytes, None)

    if not enc_info: # try autodetection again removing potentially
                     # variable bytes
        first_bytes = (byte1, byte2, None, None)
        enc_info = autodetect_dict.get(first_bytes)

    if enc_info:
        encoding = enc_info # we've got a guess... these are
                            # the new defaults

        # try to find a more precise encoding using xml declaration
        try:
            secret_decoder_ring = codecs.lookup(encoding)[1]
        except LookupError:
            return encoding # no decoder installed (e.g. EBCDIC)
        # "replace" keeps undecodable bytes later in the buffer from
        # raising before the declaration has been inspected
        (decoded,length) = secret_decoder_ring(buffer, "replace")
        # NOTE: only the first line is searched, so the declaration is
        # assumed to end before the first line break
        first_line = decoded.split("\n")[0]
        if first_line and first_line.startswith(u"<?xml"):
            encoding_pos = first_line.find(u"encoding")
            if encoding_pos!=-1:
                # look for double quote
                quote_pos=first_line.find('"', encoding_pos)

                if quote_pos==-1:                 # look for single quote
                    quote_pos=first_line.find("'", encoding_pos)

                if quote_pos>-1:
                    quote_char,rest=(first_line[quote_pos],
                                                first_line[quote_pos+1:])
                    encoding=rest[:rest.find(quote_char)]

    return encoding

This code detects a variety of encodings, including some that are not supported by Python's Unicode decoder. So the fact that you can decipher the encoding does not guarantee that you can decipher the document itself!
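
As a rough illustration of that caveat, a caller could check whether the detected name is actually usable before trying to decode the document. The helper name `has_decoder` is hypothetical and not part of the recipe; it only assumes the standard `codecs.lookup` call, which raises `LookupError` for names Python does not know.

```python
import codecs

def has_decoder(encoding_name):
    """Return True if this Python installation can find a codec for
    encoding_name. Hypothetical helper, not part of the recipe."""
    try:
        codecs.lookup(encoding_name)
    except LookupError:
        # Python has no codec registered under this name
        return False
    return True
```

A name that fails this check would have to be mapped by hand to a concrete codec, or the document handed to a parser that understands it.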

3 comments

Mike Brown 21 years ago

Good, but... It makes the assumption that the XML declaration is the only thing on the first line, but this is not necessarily going to be the case; there might not be any line breaks at all. For example, the encoding of '<?xml version="1.0"?><foo encoding="x-bar"/>' is detected as 'x-bar' instead of 'utf-8'. Using a regular expression to find the XML declaration would be more reliable.
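
One way to act on this suggestion, sketched under assumptions: match the XML declaration itself at the start of the already-decoded text, then look for the encoding pseudo-attribute only inside that match. The names `XML_DECL_RE`, `ENCODING_RE` and `declared_encoding` are made up for this illustration, and the patterns are a simplification rather than a full rendering of the XML grammar.

```python
import re

# The declaration must start the document and runs up to the first "?>"
XML_DECL_RE = re.compile(r"^<\?xml\s[^>]*?\?>")
# encoding="..." or encoding='...', with matching quote characters
ENCODING_RE = re.compile(
    r"""\sencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1""")

def declared_encoding(text, default="utf-8"):
    """Return the encoding named in the XML declaration of text,
    or default when there is no declaration or no encoding in it."""
    decl = XML_DECL_RE.match(text)
    if decl is None:
        return default
    # Search only inside the declaration, never in the document body
    enc = ENCODING_RE.search(decl.group(0))
    return enc.group(2) if enc else default

print(declared_encoding('<?xml version="1.0"?><foo encoding="x-bar"/>'))
# prints utf-8
```

Because the search is limited to the matched declaration, attributes named `encoding` elsewhere in the document are ignored, with or without line breaks.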

Lars Tiede 19 years, 2 months ago

Some changes. I made some changes to the code. Besides blunt renaming and minor cosmetic edits, these are worth mentioning:

  • I couldn't find some of the BOM byte patterns you used anywhere, so I removed them.

  • The patterns for the 4-byte schemes fit the name UTF-32 rather than UCS-4.

  • The algorithm that searches the XML declaration is wrong. I worked out a regex which should do for all halfway correct XML 1.0 headers.

My code: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841
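
For readers who only want the byte order mark part, here is a small sketch that is not the linked recipe. It relies on the BOM constants that Python's `codecs` module defines; the table `BOMS` and the function `detect_bom` are names invented for this example.

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE)
BOMS = (
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF8, "utf_8"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
)

def detect_bom(data):
    """Return a codec name if data (a byte string) starts with a
    known byte order mark, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A document without a BOM returns None here and still needs the "<?xml" byte patterns or the declaration itself to settle its encoding.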

Mike Brown 18 years, 6 months ago

Error in 2nd edition. The discussion of this on page 469 of the 2nd print edition of the Python Cookbook acted upon my previous comment incorrectly. The book asserts that the XML declaration must be terminated by a linefeed, and it implies that the recipe does not need to handle such cases of malformed "almost-XML". This is entirely wrong: there does not need to be a linefeed at all, as the XML grammar makes clear in all three editions of XML 1.0.

Also, Lars Tiede has submitted a regex-based version at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841.