On Tue, 09 Mar 2004 23:41:30 -0600,
Richard West <rwest004 at opti.cgi.net> wrote:
> I'm trying to parse the rdf dumps from dmoz.org (Open Directory
> Project) and am having great difficulty just getting Python to read
> the files. The files are RDF in UTF-8 encoding according to the
> dmoz.org web site, but I get the following error:
Around 2001/2002 I worked on Python code for processing dmoz dumps, but the
data was very bad: some categories included content in various Chinese
encodings despite the file's claim to be UTF-8. I eventually gave up because
debugging a program that fails after running for six hours is really, really
tedious.
It looks like the problems still aren't fixed. The Google-cached version of
rainwaterreptileranch.org/steve/sw/odp/rdflist.html (the page itself is
inaccessible right now) says:
    Status: Actively being worked on. Autumn has been working on
    UTF-8 validation code for the editor input forms. sfromis has
    been manually deleting any reported UTF-8 sequences from the
    ODP database. I've created a C program that will process data
    dumps and report details about the errors found that should
    assist in locating and fixing them. No illegal UTF-8 sequences
    were present in data dumps between March and July of 2003.
    After completion of the server hardware upgrade, however, the
    proliferation of UTF-8 errors has returned.
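
If you just want to know where the bad bytes are before committing to a
multi-hour parse, a quick scan along the same lines as that C program is easy
to sketch in Python. This is only a rough sketch, assuming a current Python 3;
the function name find_invalid_utf8 and the sample bytes are made up for
illustration, and I haven't compared its output against the C tool.

```python
import io


def find_invalid_utf8(stream, limit=20):
    """Yield (line_number, byte_offset, reason) for lines that fail to decode.

    `stream` is any binary file-like object, e.g. open(path, "rb").
    Only the first decoding error on each line is reported.
    """
    offset = 0  # byte offset of the start of the current line
    found = 0
    for line_number, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the position of the bad byte within this line
            yield line_number, offset + exc.start, exc.reason
            found += 1
            if found >= limit:
                return
        offset += len(raw)


# Hypothetical sample: the second line holds GB2312-style bytes, not UTF-8.
sample = io.BytesIO(b"<Title>ok</Title>\n<Title>\xc4\xe3\xba\xc3</Title>\n")
for line_number, byte_offset, reason in find_invalid_utf8(sample):
    print(line_number, byte_offset, reason)
```

Reading line by line in binary mode keeps memory flat, and the whole scan is
plain I/O plus a decode, so it should finish far sooner than a full parse.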
The same author has a Perl odp2db script at
http://rainwaterreptileranch.org/steve/sw/odp/ ; you could run that to get a
SQL database version, and then access that version from Python, or at least
look at the code to figure out what kind of hackery is required to actually
parse the dumps.
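
If you'd rather stay in Python, one kind of hackery that might be enough is to
scrub each line before the XML parser sees it. The following is a sketch under
assumptions, not what odp2db does (I haven't checked): it assumes Python 3,
replaces undecodable bytes with U+FFFD, and strips characters that XML 1.0
forbids. The names clean_chunks and iter_elements and the sample document are
invented for the example.

```python
import io
import re
import xml.etree.ElementTree as ET

# Characters XML 1.0 does not allow, even if the bytes decode cleanly.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def clean_chunks(stream):
    """Yield each line of a binary stream re-encoded as valid UTF-8."""
    for raw in stream:
        # Bytes that are not valid UTF-8 become U+FFFD instead of raising.
        text = raw.decode("utf-8", errors="replace")
        yield _XML_ILLEGAL.sub("", text).encode("utf-8")


def iter_elements(stream, tag_suffix):
    """Incrementally parse the cleaned stream, yielding matching elements.

    Read what you need from each element inside the loop: it is cleared
    as soon as the generator resumes, to limit memory use.
    """
    parser = ET.XMLPullParser(events=("end",))
    for chunk in clean_chunks(stream):
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag.endswith(tag_suffix):
                yield elem
                elem.clear()


# Hypothetical sample with two stray bytes and a control character.
sample = io.BytesIO(
    b'<RDF xmlns:d="http://purl.org/dc/elements/1.0/">\n'
    b"<d:Title>bad \xc4\xe3 bytes\x01</d:Title>\n"
    b"</RDF>\n"
)
for elem in iter_elements(sample, "Title"):
    print(repr(elem.text))
```

This loses the original text wherever the encoding was wrong, so it only gets
you a parse that completes; recovering the Chinese content would need a guess
at the real encoding for each damaged entry.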