OpenOffice is very popular, and some people may want to index the contents of the documents they write with it. Here is a very simple solution for that.
# -*- coding: Latin-1 -*-
"""
Convert OpenOffice documents to XML and text

USAGE:
    ooconvert [filename]
"""
import re
import sys
import zipfile

# Matches any XML tag, including tags that span several lines.
rx_stripxml = re.compile("<[^>]*?>", re.DOTALL | re.MULTILINE)


class ReadOO:
    def __init__(self, filename):
        # An OpenOffice document is a ZIP archive; the text lives in content.xml.
        with zipfile.ZipFile(filename, "r") as zf:
            # zf.read() returns bytes; content.xml is normally UTF-8 encoded.
            self.data = zf.read("content.xml").decode("utf-8")

    def getXML(self):
        return self.data

    def getData(self, collapse=1):
        # Replace every tag with a blank so adjacent words stay separated.
        text = rx_stripxml.sub(" ", self.data)
        if collapse:
            # Collapse runs of whitespace into single blanks to save space.
            return " ".join(text.split())
        return text


if __name__ == "__main__":
    if len(sys.argv) > 1:
        oo = ReadOO(sys.argv[1])
        print(oo.getXML())
        print(oo.getData())
    else:
        print(__doc__.strip())
OpenOffice files are ZIP archives, and they always contain a file called "content.xml", which holds the document body. We extract this one. In the method getData we throw away the XML markup by replacing every tag with a blank, then split the result on whitespace and join the pieces with single blanks to save space. This part could be done in a better way using an XML parser, but they often don't do what we expect them to do, so some help would be appreciated ;-)
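For what it's worth, here is a rough sketch of how the parser-based variant might look, assuming Python 3 and the standard library's xml.etree.ElementTree. The names extract_text and make_sample are made up for this example, and the namespace URIs in the sample document are placeholders, not the real OpenOffice ones. The sketch treats every element whose local name is "p" or "h" as a paragraph or heading, which is an assumption about the document layout; paragraphs nested inside other paragraphs (for example in frames) would show up twice.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text(source):
    """Return the document text, one paragraph or heading per line.

    `source` is a filename or a binary file-like object.
    (Hypothetical helper, not part of the recipe above.)
    """
    with zipfile.ZipFile(source, "r") as zf:
        # fromstring() accepts bytes and honours the XML encoding declaration.
        root = ET.fromstring(zf.read("content.xml"))

    paragraphs = []
    for elem in root.iter():
        # Tags look like "{namespace-uri}p"; compare only the local name so
        # the sketch does not depend on a particular namespace URI.
        local = elem.tag.rsplit("}", 1)[-1]
        if local in ("p", "h"):
            # itertext() yields the text of the element and of all its
            # children; entities such as &amp; are already decoded.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                paragraphs.append(text)
    return "\n".join(paragraphs)


def make_sample():
    """Build a tiny in-memory stand-in for an OpenOffice file (test data only)."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<office:document-content xmlns:office="urn:example:office" '
        'xmlns:text="urn:example:text">'
        "<office:body><text:h>Title</text:h>"
        "<text:p>Fish &amp; <text:span>chips</text:span></text:p>"
        "</office:body></office:document-content>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(extract_text(make_sample()))
```

Compared with the regular expression, the parser decodes entities and keeps paragraph boundaries, at the cost of having to decide which elements count as text.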