Almost forgot.  I'm running Python 2.3.3.

>I'm trying to parse the rdf dumps from dmoz.org (Open Directory
>Project) and am having great difficulty just getting Python to read
>the files.  The files are RDF in UTF-8 encoding according to the
>dmoz.org web site, but I get the following error:
>
>UnicodeDecodeError: 'utf8' codec can't decode bytes in position
>52376-52378: invalid data
>
>Here's a sample of code that will reproduce the problem:
>
>
>import sys
>import codecs
>from xml.sax import make_parser, handler
>
>def main():
>    f = codecs.open(sys.argv[1], 'r', 'utf-8')
>    parser = make_parser()
>    parser.setContentHandler(dmoz())
>    parser.parse(f)
>
>class dmoz(handler.ContentHandler):
>    def startElement(self, name, attrs):
>        print('%s' % name)
>
>if(__name__=='__main__'):
>    main()
>
>
>I'm working with the dump from February 23rd, 2004.  On the dmoz.org
>web site news pertaining to the rdf dumps, there is an entry from
>March 3rd, 2003 which states that they are filtering the data to
>"prevent UTF-8 and XML character encoding problems".  So I am assuming
>that the UTF-8 files I have are valid.  I run into the problem with
>both the structure.rdf.u8 file and the content.rdf.u8 file.
>
>What am I doing wrong?
>
>
>-Richard
>
>
>dmoz.org rdf dumps: http://rdf.dmoz.org/
>
>dmoz.org rdf news: http://rdf.dmoz.org/rdf/Changes.html
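
For what it's worth, one rough way to see which bytes the codec is rejecting is to read the file in binary mode and try the decode by hand. This is only a sketch, not a fix: the helper names (first_invalid_utf8, show_context, report) are made up for illustration, and it is written for a current Python 3 rather than 2.3.3, so the syntax would need adjusting there. It also assumes the file fits in memory, which may not hold for a full content.rdf.u8 dump.

```python
def first_invalid_utf8(data):
    """Return (start, end, reason) for the first undecodable span in the
    bytes object `data`, or None if it is all valid UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        # start/end are byte offsets into `data`; end is exclusive.
        return err.start, err.end, err.reason
    return None


def show_context(data, start, end, margin=40):
    """Return the raw bytes surrounding the bad span, for eyeballing."""
    return data[max(0, start - margin):end + margin]


def report(path):
    """Print the location of the first invalid UTF-8 span in a file.

    Assumption: the whole file is read into memory at once.
    """
    with open(path, 'rb') as f:
        raw = f.read()
    bad = first_invalid_utf8(raw)
    if bad is None:
        print('no invalid UTF-8 found')
    else:
        start, end, reason = bad
        print('bytes %d-%d: %s' % (start, end - 1, reason))
        print(repr(show_context(raw, start, end)))
```

Calling report() on the path to structure.rdf.u8 would then show the offending bytes with a little context on either side, which might help tell whether the dump itself contains bad data or whether something in the reading code is at fault.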

