A.M. Kuchling wrote:
>>I'm trying to parse the rdf dumps from dmoz.org (Open Directory
>>Project) and am having great difficulty just getting Python to read
>>the files. The files are RDF in UTF-8 encoding according to the
>>dmoz.org web site, but I get the following error:
>
> Oh dear.
>
> Around 2001/2002 I worked on Python code for processing dmoz dumps, but gave
> up because the data was so bad -- some categories included content in
> various Chinese encodings despite the file's claim to be UTF-8. I
> eventually gave up because debugging a program that fails after running for
> six hours is really, really tedious.

Unfortunately it seems that some encoding issues are still there. I've
written this little script to convert the RDF/XML dmoz.org dump to Turtle
(really N-Triples in UTF-8) using rdflib, but it fails after
700 lines or so:

import codecs

from rdflib.TripleStore import TripleStore as Store
from rdflib.BNode import BNode
from rdflib.Literal import Literal
from purple.quoting import quote

store = Store()
# NOTE: the step that loads the RDF/XML dump into the store is not shown here.
outfile = codecs.open('structure.ttl', 'w', 'utf-8')

for triple in store.triples((None, None, None)):
    # each triple is a (subject, predicate, object) tuple
    s, p, o = triple

    if isinstance(s, BNode):  # URI or bNode?
        s = '%s' % s
    else:
        s = '<%s>' % s

    if isinstance(o, Literal):  # URI, bNode or Literal?
        if o.language:        # language-tagged literal
            o = '"%s"@%s' % (quote(o), o.language)
        elif o.datatype:      # typed literal
            o = '"%s"^^<%s>' % (quote(o), o.datatype)
        else:                 # plain literal
            o = '"%s"' % quote(o)
    elif isinstance(o, BNode):
        o = '%s' % o
    else:
        o = '<%s>' % o

    outfile.write('%s <%s> %s .\n' % (s, p, o))

outfile.close()

but it stops, giving:
rdf:712:45: not well-formed (invalid token)
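
If the invalid token really is a bad byte sequence (it could also be something else, such as a stray control character), one way to confirm it is to scan the file for lines that do not decode as UTF-8. This is only a rough sketch in present-day Python using the standard library; the function name `find_invalid_utf8` and the sample bytes are made up for illustration and are not taken from the dmoz dump:

```python
import io


def find_invalid_utf8(stream, limit=10):
    """Return (line_number, byte_offset, reason) for lines that are not valid UTF-8.

    `stream` must be opened in binary mode, so that iterating over it
    yields raw byte strings, one per line.
    """
    problems = []
    for lineno, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first offending byte in this line
            problems.append((lineno, exc.start, exc.reason))
            if len(problems) >= limit:
                break
    return problems


# Hypothetical sample: the second line holds two bytes that are not UTF-8.
sample = io.BytesIO(b"<Title>ok</Title>\n<Title>bad \xb0\xa1 bytes</Title>\n")
for lineno, offset, reason in find_invalid_utf8(sample):
    print("line %d, byte %d: %s" % (lineno, offset, reason))
```

On a real dump the same function would be called on a file opened with `open(path, "rb")`. A line number reported here that matches the one in the parser error would point to the encoding as the culprit.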
I'm going to try this script with the MusicBrainz data dump and see if
its UTF-8 data is encoded better.
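
For a dump that does turn out to contain bad bytes, a possible workaround (not something I have tried on the full dmoz data) is to write a cleaned copy first and parse that instead. The sketch below, in present-day Python, replaces undecodable bytes and characters that XML 1.0 forbids with U+FFFD. The names `clean_line` and `clean_stream` are invented here, and replacing data this way is lossy, so it only helps when the damaged text itself is not needed:

```python
import io
import re

# Characters that XML 1.0 does not allow, even when they are correctly encoded.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def clean_line(raw):
    """Decode one raw line, replacing anything the XML parser would reject."""
    text = raw.decode("utf-8", errors="replace")  # bad bytes become U+FFFD
    return _XML_ILLEGAL.sub("\ufffd", text)


def clean_stream(src, dst):
    """Copy binary stream `src` to `dst` as clean UTF-8.

    Returns the number of U+FFFD characters in the output (this also counts
    any that were already present in the input).
    """
    replaced = 0
    for raw in src:
        cleaned = clean_line(raw)
        replaced += cleaned.count("\ufffd")
        dst.write(cleaned.encode("utf-8"))
    return replaced


# Hypothetical sample with two stray bytes and one control character.
src = io.BytesIO(b"ok \xb0\xa1\n\x01x\n")
dst = io.BytesIO()
print(clean_stream(src, dst), "characters replaced")
```

With real files, `src` and `dst` would be opened with `open(..., "rb")` and `open(..., "wb")`. Working line by line keeps memory use flat, which matters for dumps of this size.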
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<#me> a foaf:Person ; foaf:nick "deelan" ;
foaf:weblog <http://www.deelan.com/> .