Almost forgot. I'm running Python 2.3.3.
On Tue, 09 Mar 2004 23:41:30 -0600, Richard West
<rwest004 at opti.cgi.net> wrote:
>I'm trying to parse the rdf dumps from dmoz.org (Open Directory
>Project) and am having great difficulty just getting Python to read
>the files. The files are RDF in UTF-8 encoding according to the
>dmoz.org web site, but I get the following error:
>
>UnicodeDecodeError: 'utf8' codec can't decode bytes in position
>52376-52378: invalid data
>
>Here's a sample of code that will reproduce the problem:
>
>
>import sys
>import codecs
>from xml.sax import make_parser, handler
>
>def main():
>    f = codecs.open(sys.argv[1], 'r', 'utf-8')
>    parser = make_parser()
>    parser.setContentHandler(dmoz())
>    parser.parse(f)
>
>class dmoz(handler.ContentHandler):
>    def startElement(self, name, attrs):
>        print('%s' % name)
>
>if(__name__=='__main__'):
>    main()
>
>
>I'm working with the dump from February 23rd, 2004. On the dmoz.org
>web site news pertaining to the rdf dumps, there is an entry from
>March 3rd, 2003 which states that they are filtering the data to
>"prevent UTF-8 and XML character encoding problems". So I am assuming
>that the UTF-8 files I have are valid. I run into the problem with
>both the structure.rdf.u8 file and the content.rdf.u8 file.
>
>What am I doing wrong?
>
>
>-Richard
>
>
>dmoz.org rdf dumps: http://rdf.dmoz.org/
>dmoz.org rdf news: http://rdf.dmoz.org/rdf/Changes.html
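
One way to check whether the dump itself contains bytes that are not valid UTF-8 is to decode the raw bytes directly and inspect the position the exception reports. The following is only a minimal sketch, not a fix: the helper name find_invalid_utf8 and the sample bytes are made up for illustration, it uses current Python syntax rather than 2.3, and it assumes the data fits in memory (the real dumps are large, so they would need to be read in pieces).

```python
def find_invalid_utf8(data, context=20):
    """Return (start, end, nearby_bytes) for the first invalid UTF-8
    sequence in the byte string `data`, or None if it decodes cleanly.

    Hypothetical helper for illustration; not part of any library.
    """
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        # err.start and err.end are byte offsets into `data`.
        lo = max(0, err.start - context)
        return err.start, err.end, data[lo:err.end + context]
    return None


# Made-up sample: 0xE9 is a Latin-1 "e acute", which is not valid
# UTF-8 when followed directly by "<".
sample = b'<Topic>caf\xe9</Topic>'
print(repr(find_invalid_utf8(sample)))
```

To try it on a real file, open the file in binary mode ('rb') and pass the bytes read from it to the helper. If it reports an offset close to the one in the traceback, that suggests the file contains an invalid byte sequence at that point.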