
OpenOffice to xml and/or text (oo2txt) (Python recipe) by Dirk Holtwick
ActiveState Code (http://code.activestate.com/recipes/302633/)

OpenOffice is very popular. Some people may be interested in indexing the contents of their documents written with OpenOffice. Here is a very simple solution for that.

# -*- coding: Latin-1 -*-

"""
Convert OpenOffice documents to XML and text

USAGE:
ooconvert [filename]
"""

import zipfile
import re
import sys

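# Matches any XML tag; [^>] also matches newlines, so tags may span lines
rx_stripxml = re.compile("<[^>]*?>", re.DOTALL|re.MULTILINE)

class ReadOO:

    def __init__(self, filename):
        # An OpenOffice document is a ZIP archive; the text lives in content.xml
        with zipfile.ZipFile(filename, "r") as zf:
            # read() returns bytes; content.xml is UTF-8 encoded XML
            self.data = zf.read("content.xml").decode("utf-8")

    def getXML(self):
        return self.data

    def getData(self, collapse=1):
        # Replace every tag with a blank, then optionally collapse whitespace
        text = rx_stripxml.sub(" ", self.data)
        if collapse:
            return " ".join(text.split())
        return text

if __name__=="__main__":
    if len(sys.argv)>1:
        oo = ReadOO(sys.argv[1])
        print(oo.getXML())
        print(oo.getData())
    else:
        print(__doc__.strip())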

      

OpenOffice files are ZIP archives, and they always contain a file called "content.xml". We extract this one. In the method getData we throw away the XML markup by replacing every tag with a blank, then split the result on whitespace and join the pieces again with single blanks to save space. This part could be done in a better way using an XML parser, but they often don't do what we expect them to do, so some help would be appreciated ;-)
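
For readers who want to try the XML parser route mentioned above, here is a minimal Python 3 sketch using the standard library's xml.etree.ElementTree. It is not the author's method, only one possible variant. The names extract_text and make_sample are hypothetical helpers introduced for illustration, and the sample archive is a tiny stand-in, not a complete OpenOffice document. Unlike the regular expression, the parser also decodes entities such as &amp;amp;.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text(source):
    """Return the text of content.xml with whitespace collapsed.

    `source` is a path or a binary file-like object.
    Hypothetical helper, not part of the original recipe.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    # itertext() yields the text and tail of every element in document
    # order, with entities already decoded. Joining the pieces with a
    # blank mimics the recipe's "every tag becomes a blank" rule, so
    # adjacent paragraphs do not run together.
    return " ".join(" ".join(root.itertext()).split())


def make_sample():
    """Build a tiny stand-in archive in memory (assumption: test data only)."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<doc><p>Fish &amp; chips</p><p>cost <b>5</b> euros</p></doc>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(extract_text(make_sample()))
```

As with the regular expression, joining on blanks will split a word that is only partly formatted (for example a single bold letter inside a word). That is probably acceptable for indexing, but it is a trade-off to be aware of.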

Tags: text

Created by Dirk Holtwick on Mon, 30 Aug 2004 (PSF)


Licensed under the PSF License
Revision 2 (updated 19 years ago)
