Unicode and rdf

From: Richard West <rwes...@opti.cgi.net>
Tue, 09 Mar 2004 23:45:30 -0600
Almost forgot.  I'm running Python 2.3.3.


On Tue, 09 Mar 2004 23:41:30 -0600, Richard West
<rwest004 at opti.cgi.net> wrote:

>I'm trying to parse the rdf dumps from dmoz.org (Open Directory
>Project) and am having great difficulty just getting Python to read
>the files.  The files are RDF in UTF-8 encoding according to the
>dmoz.org web site, but I get the following error:
>
>UnicodeDecodeError: 'utf8' codec can't decode bytes in position
>52376-52378: invalid data
>
>Here's a sample of code that will reproduce the problem:
>
>
>import sys
>import codecs
>from xml.sax import make_parser, handler
>
>def main():
>    f = codecs.open(sys.argv[1], 'r', 'utf-8')
>    parser = make_parser()
>    parser.setContentHandler(dmoz())
>    parser.parse(f)
>
>class dmoz(handler.ContentHandler):
>    def startElement(self, name, attrs):
>        print('%s' % name)
>
>if(__name__=='__main__'):
>    main()
>
>
>I'm working with the dump from February 23rd, 2004.  On the dmoz.org
>web site news pertaining to the rdf dumps, there is an entry from
>March 3rd, 2003 which states that they are filtering the data to
>"prevent UTF-8 and XML character encoding problems".  So I am assuming
>that the UTF-8 files I have are valid.  I run into the problem with
>both the structure.rdf.u8 file and the content.rdf.u8 file.
>
>What am I doing wrong?
>
>
>-Richard
>
>
>dmoz.org rdf dumps: http://rdf.dmoz.org/
>
>dmoz.org rdf news: http://rdf.dmoz.org/rdf/Changes.html
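
One way to check the assumption that the dump is valid UTF-8 is to scan the raw bytes and report where decoding first fails. The following is only a minimal sketch: the helper name find_invalid_utf8 is made up for illustration, and it is written for a current Python 3 interpreter rather than the 2.3.3 mentioned above. It reads in binary mode and decodes line by line, which is safe because the newline byte never occurs inside a multi-byte UTF-8 sequence.

```python
import io


def find_invalid_utf8(stream):
    """Return (byte_offset, offending_bytes) for the first invalid
    UTF-8 sequence in a binary stream, or None if every line decodes.

    Hypothetical helper for illustration; not part of any library.
    """
    offset = 0  # number of bytes consumed before the current line
    for line in stream:
        try:
            line.decode('utf-8')
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the current line
            return offset + exc.start, line[exc.start:exc.end]
        offset += len(line)
    return None


if __name__ == '__main__':
    # Small in-memory stand-in for a dump file; the second line holds
    # a Latin-1 encoded "e with acute" (0xE9), which is not valid UTF-8 here.
    sample = io.BytesIO(b'<Topic>ok</Topic>\n<Title>caf\xe9</Title>\n')
    print(find_invalid_utf8(sample))
```

With a real dump, the same function could be given a file opened in binary mode, for example open(sys.argv[1], 'rb'). If it reports an offset, the file itself contains bytes that are not valid UTF-8 and the problem is in the data rather than in the parsing code.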
