
Unicode and rdf

From: Richard West <rwes...@opti.cgi.net>
Tue, 09 Mar 2004 23:41:30 -0600

I'm trying to parse the RDF dumps from dmoz.org (Open Directory
Project), but I can't even get Python to read the files.  According
to the dmoz.org web site the files are RDF encoded as UTF-8, but I
get the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position
52376-52378: invalid data

Here's a sample of code that will reproduce the problem:


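import sys
import codecs
from xml.sax import make_parser, handler

class dmoz(handler.ContentHandler):
    # Print the name of every element the parser encounters.
    def startElement(self, name, attrs):
        print(name)

def main():
    # Open the dump named on the command line, decoding it as UTF-8.
    f = codecs.open(sys.argv[1], 'r', 'utf-8')
    parser = make_parser()
    parser.setContentHandler(dmoz())
    # The UnicodeDecodeError is raised while the parser reads from f.
    parser.parse(f)

if __name__ == '__main__':
    main()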


I'm working with the dump from February 23rd, 2004.  The news page for
the RDF dumps on the dmoz.org web site has an entry dated March 3rd,
2003 which states that they are filtering the data to "prevent UTF-8
and XML character encoding problems", so I am assuming that the UTF-8
files I have are valid.  I run into the problem with both the
structure.rdf.u8 file and the content.rdf.u8 file.
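
One way to test that assumption, independently of the SAX parser, is
to decode the raw bytes directly and report where decoding first
fails.  The following is only a minimal sketch: first_invalid_utf8 and
check_file are hypothetical helper names, not part of the script
above, and check_file reads the whole file into memory, which may be
impractical for the full dumps.

```python
def first_invalid_utf8(data, context=8):
    """Return (offset, nearby_bytes) for the first invalid UTF-8
    sequence in the bytes object `data`, or None if it all decodes."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        # exc.start and exc.end are byte offsets into `data`.
        lo = max(exc.start - context, 0)
        return exc.start, data[lo:exc.end + context]
    return None


def check_file(path, context=8):
    """Hypothetical convenience wrapper: read `path` as raw bytes
    (no decoding) and run the check above on its contents."""
    with open(path, 'rb') as f:
        return first_invalid_utf8(f.read(), context)
```

Calling check_file('structure.rdf.u8') would return None if the file
is valid UTF-8, or the byte offset of the first bad sequence together
with a few surrounding bytes, which can then be compared against the
positions given in the error message.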

What am I doing wrong?


-Richard


dmoz.org RDF dumps: http://rdf.dmoz.org/

dmoz.org RDF news: http://rdf.dmoz.org/rdf/Changes.html
