Unicode and rdf

From: deelan <g...@zzz.it>
Wed, 10 Mar 2004 14:26:20 +0100
A.M. Kuchling wrote:

>>I'm trying to parse the rdf dumps from dmoz.org (Open Directory
>>Project) and am having great difficulty just getting Python to read
>>the files.  The files are RDF in UTF-8 encoding according to the
>>dmoz.org web site, but I get the following error:
>
> Oh dear.
>
> Around 2001/2002 I worked on Python code for processing dmoz dumps, but gave
> up because the data was so bad -- some categories included content in
> various Chinese encodings despite the file's claim to be UTF-8.  I
> eventually gave up because debugging a program that fails after running for
> six hours is really, really tedious.

unfortunately it seems that some encoding issues are still there. i've
written this little script to convert the RDF/XML dmoz.org dump to turtle
(really ntriples in UTF-8) using rdflib, but it fails after
700 lines or so:

from rdflib.TripleStore import TripleStore as Store
from rdflib.BNode import BNode
from rdflib.Literal import Literal

from purple.quoting import quote

store = Store()
store.load('file:structure.rdf')

import codecs
outfile = codecs.open('structure.ttl', 'w', 'utf-8')

for triple in store.triples((None, None, None)):

    # subject: URI or bNode?
    s = triple[0]
    if isinstance(s, BNode):
        s = '%s' % s
    else:
        s = '<%s>' % s

    # predicate: always a URI
    p = triple[1]

    # object: URI, bNode or Literal?
    o = triple[2]
    if isinstance(o, Literal):
        if o.language:
            o = '"%s"@%s' % (quote(o), o.language)
        elif o.datatype:
            o = '"%s"^^<%s>' % (quote(o), o.datatype)
        else:
            o = '"%s"' % quote(o)
    elif isinstance(o, BNode):
        o = '%s' % o
    else:
        o = '<%s>' % o

    outfile.write('%s <%s> %s .\n' % (s, p, o))


outfile.close()




but it stops with this error:

xml.sax._exceptions.SAXParseException:file:///D|/TMPSTU%7E1/dmoz.org/structure.rdf:712:45: not well-formed (invalid token)
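
one way to check whether an "invalid token" like this really comes from bytes that are not valid UTF-8 is to scan the raw file before handing it to the parser. what follows is only a rough sketch, written for Python 3 rather than the Python of the script above. the helper name find_invalid_utf8 is made up for illustration, and the byte offsets it reports are 0-based, so they may not line up exactly with the column the SAX parser prints.

```python
def find_invalid_utf8(data):
    """Return a list of (line_number, byte_offset, bad_bytes) tuples,
    one for each undecodable byte sequence found in `data` (a bytes object).

    Splitting on b'\\n' is safe because a newline byte never occurs
    inside a multi-byte UTF-8 sequence.
    """
    problems = []
    for lineno, line in enumerate(data.split(b'\n'), start=1):
        offset = 0
        while offset < len(line):
            try:
                line[offset:].decode('utf-8')
                break  # the rest of this line decodes cleanly
            except UnicodeDecodeError as exc:
                # exc.start / exc.end are relative to the slice we decoded
                start = offset + exc.start
                end = offset + exc.end
                problems.append((lineno, start, line[start:end]))
                offset = end  # keep scanning after the bad bytes
    return problems


# made-up sample data: two clean lines, then two stray bytes on line 3
sample = b'ok line\ncaf\xc3\xa9\nbad \xb0\xa1 here\n'
for lineno, col, bad in find_invalid_utf8(sample):
    print('line %d, byte offset %d: %r' % (lineno, col, bad))
```

for the real dump the input would come from something like open('structure.rdf', 'rb').read(), assuming the file fits in memory. if the scan finds nothing near line 712, the parse error probably has some other cause, such as a character that XML does not allow.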

i'm gonna try this script with musicbrainz datadump and see if
the UTF-8 data is encoded better.
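
if a dump does turn out to contain stray non-UTF-8 bytes, one possible workaround is to clean the bytes before parsing. this is a lossy sketch under that assumption, again in Python 3 and with a made-up helper name (sanitize_utf8): every undecodable sequence is replaced with U+FFFD, so the original text in those spots is gone, and it does nothing for other kinds of well-formedness errors.

```python
def sanitize_utf8(data):
    """Return `data` (bytes) re-encoded as valid UTF-8.

    Undecodable byte sequences are replaced with U+FFFD
    (the Unicode replacement character), so this is lossy.
    """
    return data.decode('utf-8', errors='replace').encode('utf-8')


# made-up sample: one stray 0xB0 byte between two ASCII letters
cleaned = sanitize_utf8(b'a\xb0b')
print(repr(cleaned))
```

the cleaned bytes could then be written to a new file and loaded in place of the original dump.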

-- 
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<#me> a foaf:Person ; foaf:nick "deelan" ;
foaf:weblog <http://www.deelan.com/> .
