Unicode and rdf

From: A.M. Kuchling <a...@amk.ca>
Wed, 10 Mar 2004 07:08:24 -0600
On Tue, 09 Mar 2004 23:41:30 -0600, 
	Richard West <rwest004 at opti.cgi.net> wrote:
> I'm trying to parse the rdf dumps from dmoz.org (Open Directory
> Project) and am having great difficulty just getting Python to read
> the files.  The files are RDF in UTF-8 encoding according to the
> dmoz.org web site, but I get the following error:

Oh dear.   

Around 2001/2002 I worked on Python code for processing dmoz dumps, but the
data was so bad -- some categories included content in various Chinese
encodings despite the file's claim to be UTF-8 -- that I eventually gave up.
Debugging a program that fails after running for six hours is really,
really tedious.

It looks like the problems still aren't fixed. The Google-cached version of
rainwaterreptileranch.org/steve/sw/odp/rdflist.html (the page itself is
inaccessible right now) says:

	     Status: Actively being worked on. Autumn has been working on
	     UTF-8 validation code for the editor input forms. sfromis has
	     been manually deleting any reported UTF-8 sequences from the
	     ODP database. I've created a C program that will process data
	     dumps and report details about the errors found that should
	     assist in locating and fixing them. No illegal UTF-8 sequences
	     were present in data dumps between March and July of 2003.
	     After completion of the server hardware upgrade, however, the
	     proliferation of UTF-8 errors has returned.

The same author has a Perl odp2db script at 
http://rainwaterreptileranch.org/steve/sw/odp/ ; you could run that to get a
SQL database version, and then access that version from Python, or at least
look at the code to figure out what kind of hackery is required to actually
parse the dumps.
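
If you'd rather find the bad spots up front than have a parser die hours
into a run, something along these lines might help. It is only a rough
sketch, assuming a current Python 3 and a dump that can be read line by
line in binary mode; find_invalid_utf8 and report are names made up for
this example, not part of odp2db or any dmoz tool.

```python
def find_invalid_utf8(lines):
    """Yield (line_number, byte_offset, bad_bytes) for each undecodable span.

    `lines` is any iterable of bytes objects, e.g. a file opened with "rb".
    Offsets are counted in bytes from the start of the line.
    """
    for lineno, raw in enumerate(lines, start=1):
        pos = 0
        while pos < len(raw):
            try:
                raw[pos:].decode("utf-8")
                break  # the rest of this line is valid UTF-8
            except UnicodeDecodeError as exc:
                # exc.start / exc.end are relative to the slice we decoded
                start = pos + exc.start
                end = pos + exc.end
                yield lineno, start, raw[start:end]
                pos = end  # resume scanning just past the bad bytes


def report(path):
    """Print every invalid sequence found in the file at `path` (hypothetical helper)."""
    with open(path, "rb") as dump:
        for lineno, offset, bad in find_invalid_utf8(dump):
            print("line %d, byte %d: %r" % (lineno, offset, bad))
```

Once you know where the damage is, decoding each line with
raw.decode("utf-8", errors="replace") lets the rest of the processing carry
on, at the cost of turning the bad bytes into U+FFFD replacement characters.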

