Welcome, guest | Sign In | My Account | Store | Cart

This is a quick-and-dirty snippet that read a OpenOffice 1.1 spreadsheet (.sxc files) as a list of lists, every list representing a oocalc row. The row contents are unicode object with the "string" content of the cell. This snippet used as a program will convert .sxc files into comma-separated-values (csv) on the command line. Style is not perfect and could be greatly enhanced, but it works for me as is...

Python, 142 lines
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
#! /usr/bin/env python
#
# -*- coding: latin1 -*-
"""
Read OpenOffice spreadsheets.

Read OpenOffice spreadsheets. Can be used as a module: it provides the
class OOspreadData which is simply a list of lists initialized with the
contents of the spreadsheet stored in a (typically .sxc) file (passed as an
argument).

Used as an executable, converts files.sxc to csv

USAGE:
readsxc file.sxc
"""

import sys

class ReadSXCError(Exception):
    pass

import xml.parsers.expat
import zipfile

tabla=[]
row=[]
cell=u''
rept=u'table:number-columns-repeated'
last_repeat_col=0
incol=False
compact=False
str_strip=False

def copyandtrim(l, trim):
    a = l[:]
    if trim:
        x=range(len(a))
        x.reverse()
        for i in x:
            if a[i]=="":
                del a[i]
            else:
                break
    return a

# 3 handler functions
def start_element(name, attrs):
    global tabla, row, cell, rept, last_repeat_col, incol, compact
    if name!="table:table-cell":
        return
    if incol:
        raise ReadSXCError("double cell start")
    incol=True
    cell=u""
    if attrs.has_key(rept):
        last_repeat_col = int(attrs[rept])
    else:
        last_repeat_col = 0

def end_element(name):
    global tabla, row, cell, rept, last_repeat_col, incol, compact, str_strip
    if name=="table:table-cell":
        if not incol:
            raise ReadSXCError("double cell end")
        incol=False
        # add the contents to the row
        if str_strip:
            row.append(cell.strip())
        else:
            row.append(cell)
        # print "append to row %d, col %d : %s" % (len(tabla),len(row),cell)
        # manage the repeater
        if last_repeat_col > 1:
            row.extend([cell]*(last_repeat_col-1))
    elif name=="table:table-row":
        l = copyandtrim(row,compact)
        if l == []:
            row = []
            return
        tabla.append(l)
        row = []

def char_data(data):
    global tabla, row, cell, rept, last_repeat_col, incol
    if incol:
        cell += data


def read_and_parse(inFileName):
    p = xml.parsers.expat.ParserCreate("UTF-8")
    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data
    zf = zipfile.ZipFile(inFileName, "r")
    all = zf.read("content.xml")
    # Start the parse.
    p.returns_unicode=1
    p.Parse(all)
    zf.close()


class OOSpreadData(list):
    """OOspreadData: a=OOspreadData("file",trim=True,strip=False)

the class OOspreadData which is simply a list of lists initialized with the
contents of the spreadsheet stored in a (typically .sxc) file (passed as an
argument). Note: there is no validity analysis on the data.
Garbage in, garbage out, or unexepected execptions.

If trim is true, multiple void cell at the end of a row and void rows are
trimmed out; otherwise, all the cells are reported.

If strip is true, every cell content is stripped of blanks.
    """

    def __init__(self, fname,trim=True,strip=False):
        global tabla, row, cell, rept, last_repeat_col, incol, compact, str_strip
        tabla=[]
        row=[]
        incol=False
        cell=u''
        last_repeat_col=0
        compact=trim
        str_strip=strip
        # ok, do the hard work
        read_and_parse(fname)
        list.__init__(self, tabla)

if __name__=="__main__":

    if len(sys.argv)==2:
        oosxc = readsxc.OOSpreadData(sys.argv[1])
    else:
        print >> sys.stderr, "Usage: %s <OO_calc_file>" % sys.argv[0]
        sys.exit(1)

        for l in oosxc:
            a = ['"%s"' % i for i in l]
            print ",".join(a)

    sys.exit(0)

I know it's possible to use UNO and similar things, but this is stand-alone, little and suitable for command-line gimmick with cut, awk and co... 2005-11-28 Corrected a bug and a cut'n'paste typo.

5 comments

Remco Boerma 16 years, 4 months ago  # | flag

typo. Thanks for the code, but there's a little typo in it. "dataclass" is the concatenation of data and a new class. So insert a few newlines and it's fixed.

Romano Giannetti (author) 16 years, 4 months ago  # | flag

Yes, I missed a few lines during cut'n'paste. I edited the recipe. Thank you!

Remco Boerma 16 years, 4 months ago  # | flag

read_and_parse? When trying to run the file standalone, i was prompted to give code for "read_and_parse" as it could not be found... So i guess the code is broken?

Remco Boerma 16 years, 4 months ago  # | flag

thanks. Noted you entered the read_and_parse code as well. :) Thank you!

2005-11-28 Bug correction. There where a bug (a missing row = [] in end_element()). Now it is corrected. I hope it helps...