OpenOffice is very popular, and some people may want to index the contents of their OpenOffice documents. Here is a very simple way to get the text out of such a document so it can be indexed.

Python, 35 lines
# -*- coding: Latin-1 -*-

"""
Convert OpenOffice documents to XML and text

USAGE:
ooconvert [filename]
"""

import zipfile
import re
import sys

rx_stripxml = re.compile(r"<[^>]*>")  # any XML tag, including multi-line ones

class ReadOO:

    def __init__(self, filename):
        with zipfile.ZipFile(filename, "r") as zf:  # the document is a ZIP archive
            self.data = zf.read("content.xml").decode("utf-8")  # assumes UTF-8 XML

    def getXML(self):
        return self.data

    def getData(self, collapse=True):
        text = rx_stripxml.sub(" ", self.data)  # replace every tag with a blank
        return " ".join(text.split()) if collapse else text

if __name__ == "__main__":
    if len(sys.argv) > 1:
        oo = ReadOO(sys.argv[1])
        print(oo.getXML())
        print(oo.getData())
    else:
        print(__doc__.strip())

OpenOffice files are ZIP archives, and they always contain a file called "content.xml", which is the one we extract. In the getData method we throw away the XML markup, split the result on whitespace, and then join the pieces again with single blanks to save space. This part could be done in a better way using an XML parser, but they often don't do what we expect them to do, so some help would be appreciated ;-)
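As one possible answer to the XML parser question, here is a minimal sketch using the standard library's xml.etree.ElementTree. It is not part of the original recipe: the names extract_text and make_sample are hypothetical, and the sample archive is a tiny stand-in built in memory, not a real OpenOffice file. The sketch assumes Python 3 and only shows that a parser returns the text nodes with entities such as &amp;amp; already decoded, which the regular expression approach leaves untouched.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text(source):
    """Return the text of content.xml with whitespace collapsed.

    `source` is a file name or a binary file-like object.
    Hypothetical helper, not part of the recipe.
    """
    with zipfile.ZipFile(source, "r") as zf:
        root = ET.fromstring(zf.read("content.xml"))
    # itertext() yields every text node in document order; the parser
    # has already decoded entities by this point.
    return " ".join(" ".join(root.itertext()).split())


def make_sample():
    """Build a small in-memory ZIP that stands in for a real document."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<doc xmlns:text="urn:example:text">'
        "<text:p>Hello <text:span>World</text:span></text:p>"
        "<text:p>Tom &amp; Jerry</text:p>"
        "</doc>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(extract_text(make_sample()))
```

Like the recipe, this joins the text nodes with blanks, so a word that is split across formatting tags will come out as two words. For indexing that is usually acceptable, but it is a design choice rather than a requirement.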

Created by Dirk Holtwick on Mon, 30 Aug 2004 (PSF)