Read OpenOffice Spreadsheet as list of lists (without UNO) « Python recipes

This is a quick-and-dirty snippet that read a OpenOffice 1.1 spreadsheet (.sxc files) as a list of lists, every list representing a oocalc row. The row contents are unicode object with the "string" content of the cell. This snippet used as a program will convert .sxc files into comma-separated-values (csv) on the command line. Style is not perfect and could be greatly enhanced, but it works for me as is...

      #! /usr/bin/env python
#
# -*- coding: latin1 -*-
"""
Read OpenOffice spreadsheets.

Read OpenOffice spreadsheets. Can be used as a module: it provides the
class OOspreadData which is simply a list of lists initialized with the
contents of the spreadsheet stored in a (typically .sxc) file (passed as an
argument).

Used as an executable, converts files.sxc to csv

USAGE:
readsxc file.sxc
"""

import sys

class ReadSXCError(Exception):
    pass

import xml.parsers.expat
import zipfile

tabla=[]
row=[]
cell=u''
rept=u'table:number-columns-repeated'
last_repeat_col=0
incol=False
compact=False
str_strip=False

def copyandtrim(l, trim):
    a = l[:]
    if trim:
        x=range(len(a))
        x.reverse()
        for i in x:
            if a[i]=="":
                del a[i]
            else:
                break
    return a

# 3 handler functions
def start_element(name, attrs):
    global tabla, row, cell, rept, last_repeat_col, incol, compact
    if name!="table:table-cell":
        return
    if incol:
        raise ReadSXCError("double cell start")
    incol=True
    cell=u""
    if attrs.has_key(rept):
        last_repeat_col = int(attrs[rept])
    else:
        last_repeat_col = 0

def end_element(name):
    global tabla, row, cell, rept, last_repeat_col, incol, compact, str_strip
    if name=="table:table-cell":
        if not incol:
            raise ReadSXCError("double cell end")
        incol=False
        # add the contents to the row
        if str_strip:
            row.append(cell.strip())
        else:
            row.append(cell)
        # print "append to row %d, col %d : %s" % (len(tabla),len(row),cell)
        # manage the repeater
        if last_repeat_col > 1:
            row.extend([cell]*(last_repeat_col-1))
    elif name=="table:table-row":
        l = copyandtrim(row,compact)
        if l == []:
            row = []
            return
        tabla.append(l)
        row = []

def char_data(data):
    global tabla, row, cell, rept, last_repeat_col, incol
    if incol:
        cell += data


def read_and_parse(inFileName):
    p = xml.parsers.expat.ParserCreate("UTF-8")
    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data
    zf = zipfile.ZipFile(inFileName, "r")
    all = zf.read("content.xml")
    # Start the parse.
    p.returns_unicode=1
    p.Parse(all)
    zf.close()


class OOSpreadData(list):
    """OOspreadData: a=OOspreadData("file",trim=True,strip=False)

the class OOspreadData which is simply a list of lists initialized with the
contents of the spreadsheet stored in a (typically .sxc) file (passed as an
argument). Note: there is no validity analysis on the data.
Garbage in, garbage out, or unexepected execptions.

If trim is true, multiple void cell at the end of a row and void rows are
trimmed out; otherwise, all the cells are reported.

If strip is true, every cell content is stripped of blanks.
    """

    def __init__(self, fname,trim=True,strip=False):
        global tabla, row, cell, rept, last_repeat_col, incol, compact, str_strip
        tabla=[]
        row=[]
        incol=False
        cell=u''
        last_repeat_col=0
        compact=trim
        str_strip=strip
        # ok, do the hard work
        read_and_parse(fname)
        list.__init__(self, tabla)

if __name__=="__main__":

    if len(sys.argv)==2:
        oosxc = readsxc.OOSpreadData(sys.argv[1])
    else:
        print >> sys.stderr, "Usage: %s <OO_calc_file>" % sys.argv[0]
        sys.exit(1)

        for l in oosxc:
            a = ['"%s"' % i for i in l]
            print ",".join(a)

    sys.exit(0)

      

I know it's possible to use UNO and similar things, but this is stand-alone, little and suitable for command-line gimmick with cut, awk and co... 2005-11-28 Corrected a bug and a cut'n'paste typo.

Tags: text

5 comments

Remco Boerma 18 years, 9 months ago # | flag

typo. Thanks for the code, but there's a little typo in it. "dataclass" is the concatenation of data and a new class. So insert a few newlines and it's fixed.

Romano Giannetti (author) 18 years, 9 months ago # | flag

Yes, I missed a few lines during cut'n'paste. I edited the recipe. Thank you!

Remco Boerma 18 years, 9 months ago # | flag

read_and_parse? When trying to run the file standalone, i was prompted to give code for "read_and_parse" as it could not be found... So i guess the code is broken?

Remco Boerma 18 years, 9 months ago # | flag

thanks. Noted you entered the read_and_parse code as well. :) Thank you!

Romano Giannetti (author) 18 years, 4 months ago # | flag

2005-11-28 Bug correction. There where a bug (a missing row = [] in end_element()). Now it is corrected. I hope it helps...

◄	Python recipes (4591)	►
◄	Romano Giannetti's recipes (1)	►

Read OpenOffice Spreadsheet as list of lists (without UNO) (Python recipe) by Romano Giannetti
ActiveState Code (http://code.activestate.com/recipes/436066/)

5 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

Read OpenOffice Spreadsheet as list of lists (without UNO) (Python recipe) by Romano Giannetti ActiveState Code (http://code.activestate.com/recipes/436066/)

5 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

Read OpenOffice Spreadsheet as list of lists (without UNO) (Python recipe) by Romano Giannetti
ActiveState Code (http://code.activestate.com/recipes/436066/)