Encoding Unicode data for XML and HTML « Python recipes

The "xmlcharrefreplace" encoding error handler was introduced in Python 2.3. It is very useful when encoding Unicode data for HTML and other XML applications with limited (but popular) encodings like ASCII or ISO-Latin-1. The following code contains a trivial function "encode_for_xml" that illustrates the "xmlcharrefreplace" error handler, and a function "_xmlcharref_encode" which emulates "xmlcharrefreplace" for Python pre-2.3.

An HTML demonstration is included. Put the code into a file, run it with Python, and redirect the output to a .html file. Open the output file in a browser to see the results.

A variation of this code is used in the Docutils project. The original idea for backporting "xmlcharrefreplace" to pre-2.3 Python was from Felix Wiemann.

      def encode_for_xml(unicode_data, encoding='ascii'):
    """
    Encode unicode_data for use as XML or HTML, with characters outside
    of the encoding converted to XML numeric character references.
    """
    try:
        return unicode_data.encode(encoding, 'xmlcharrefreplace')
    except ValueError:
        # ValueError is raised if there are unencodable chars in the
        # data and the 'xmlcharrefreplace' error handler is not found.
        # Pre-2.3 Python doesn't support the 'xmlcharrefreplace' error
        # handler, so we'll emulate it.
        return _xmlcharref_encode(unicode_data, encoding)

def _xmlcharref_encode(unicode_data, encoding):
    """Emulate Python 2.3's 'xmlcharrefreplace' encoding error handler."""
    chars = []
    # Step through the unicode_data string one character at a time in
    # order to catch unencodable characters:
    for char in unicode_data:
        try:
            chars.append(char.encode(encoding, 'strict'))
        except UnicodeError:
            chars.append('&#%i;' % ord(char))
    return ''.join(chars)


if __name__ == '__main__':
    # demo
    data = u'''\
<html>
<head>
<title>Encoding Test</title>
</head>
<body>
<p>accented characters:</p>
<ul>
<li>\xe0 (a + grave)
<li>\xe7 (c + cedilla)
<li>\xe9 (e + acute)
<li>\xee (i + circumflex)
<li>\xf1 (n + tilde)
<li>\xfc (u + umlaut)
</ul>
<p>symbols:</p>
<ul>
<li>\xa3 (British pound)
<li>\xa2 (cent)
<li>\u20ac (Euro)
<li>\u221e (infinity)
<li>\xb0 (degree)
</ul>
</body></html>
'''
    print encode_for_xml(data, 'ascii')

      

Unicode data must be encoded before being printed or written out to a file. UTF-8 is an ideal encoding, since it can handle any Unicode character. But for many users and applications, ASCII or ISO-Latin-1 are often preferred over UTF-8. When the Unicode data contains characters that are outside of the given encoding (for example, accented characters and most symbols are not encodable in ASCII, and the "infinity" symbol is not encodable in Latin-1), these encodings cannot handle the data on their own. Python 2.3 introduced an encoding error handler called "xmlcharrefreplace" which replaces unencodable characters with XML numeric character references, like "∞" for the infinity symbol.

Please note that "xmlcharrefreplace" cannot be used for non-XML/HTML output. TeX and other non-XML markup languages do not recognize XML numeric character references.

Once comfortable with Unicode and encodings, there is a more concise and direct interface in the "codecs" module:

Create a file-like object, outfile, that will transparently

handle its own text encoding:

outfile = codecs.open('out.html', mode='w', encoding='ascii', errors='xmlcharrefreplace')

output = u'... some HTML containing non-ASCII characters ...'

No need to encode the output here:

outfile.write(output) outfile.close()

Tags: text

◄	Python recipes (4591)	►
◄	David Goodger's recipes (1)	►
◄	Python Cookbook Edition 2 (117)	►

Encoding Unicode data for XML and HTML (Python recipe) by David Goodger
ActiveState Code (http://code.activestate.com/recipes/303668/)

Create a file-like object, outfile, that will transparently

handle its own text encoding:

No need to encode the output here:

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

Encoding Unicode data for XML and HTML (Python recipe) by David Goodger ActiveState Code (http://code.activestate.com/recipes/303668/)

Create a file-like object, outfile, that will transparently

handle its own text encoding:

No need to encode the output here:

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

Encoding Unicode data for XML and HTML (Python recipe) by David Goodger
ActiveState Code (http://code.activestate.com/recipes/303668/)