This is a quick-and-dirty snippet that read a OpenOffice 1.1 spreadsheet (.sxc files) as a list of lists, every list representing a oocalc row. The row contents are unicode object with the "string" content of the cell. This snippet used as a program will convert .sxc files into comma-separated-values (csv) on the command line. Style is not perfect and could be greatly enhanced, but it works for me as is...
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 | #! /usr/bin/env python
#
# -*- coding: latin1 -*-
"""
Read OpenOffice spreadsheets.
Read OpenOffice spreadsheets. Can be used as a module: it provides the
class OOspreadData which is simply a list of lists initialized with the
contents of the spreadsheet stored in a (typically .sxc) file (passed as an
argument).
Used as an executable, converts files.sxc to csv
USAGE:
readsxc file.sxc
"""
import sys
class ReadSXCError(Exception):
pass
import xml.parsers.expat
import zipfile
tabla=[]
row=[]
cell=u''
rept=u'table:number-columns-repeated'
last_repeat_col=0
incol=False
compact=False
str_strip=False
def copyandtrim(l, trim):
a = l[:]
if trim:
x=range(len(a))
x.reverse()
for i in x:
if a[i]=="":
del a[i]
else:
break
return a
# 3 handler functions
def start_element(name, attrs):
global tabla, row, cell, rept, last_repeat_col, incol, compact
if name!="table:table-cell":
return
if incol:
raise ReadSXCError("double cell start")
incol=True
cell=u""
if attrs.has_key(rept):
last_repeat_col = int(attrs[rept])
else:
last_repeat_col = 0
def end_element(name):
global tabla, row, cell, rept, last_repeat_col, incol, compact, str_strip
if name=="table:table-cell":
if not incol:
raise ReadSXCError("double cell end")
incol=False
# add the contents to the row
if str_strip:
row.append(cell.strip())
else:
row.append(cell)
# print "append to row %d, col %d : %s" % (len(tabla),len(row),cell)
# manage the repeater
if last_repeat_col > 1:
row.extend([cell]*(last_repeat_col-1))
elif name=="table:table-row":
l = copyandtrim(row,compact)
if l == []:
row = []
return
tabla.append(l)
row = []
def char_data(data):
global tabla, row, cell, rept, last_repeat_col, incol
if incol:
cell += data
def read_and_parse(inFileName):
p = xml.parsers.expat.ParserCreate("UTF-8")
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data
zf = zipfile.ZipFile(inFileName, "r")
all = zf.read("content.xml")
# Start the parse.
p.returns_unicode=1
p.Parse(all)
zf.close()
class OOSpreadData(list):
"""OOspreadData: a=OOspreadData("file",trim=True,strip=False)
the class OOspreadData which is simply a list of lists initialized with the
contents of the spreadsheet stored in a (typically .sxc) file (passed as an
argument). Note: there is no validity analysis on the data.
Garbage in, garbage out, or unexepected execptions.
If trim is true, multiple void cell at the end of a row and void rows are
trimmed out; otherwise, all the cells are reported.
If strip is true, every cell content is stripped of blanks.
"""
def __init__(self, fname,trim=True,strip=False):
global tabla, row, cell, rept, last_repeat_col, incol, compact, str_strip
tabla=[]
row=[]
incol=False
cell=u''
last_repeat_col=0
compact=trim
str_strip=strip
# ok, do the hard work
read_and_parse(fname)
list.__init__(self, tabla)
if __name__=="__main__":
if len(sys.argv)==2:
oosxc = readsxc.OOSpreadData(sys.argv[1])
else:
print >> sys.stderr, "Usage: %s <OO_calc_file>" % sys.argv[0]
sys.exit(1)
for l in oosxc:
a = ['"%s"' % i for i in l]
print ",".join(a)
sys.exit(0)
|
I know it's possible to use UNO and similar things, but this is stand-alone, little and suitable for command-line gimmick with cut, awk and co... 2005-11-28 Corrected a bug and a cut'n'paste typo.
typo. Thanks for the code, but there's a little typo in it. "dataclass" is the concatenation of data and a new class. So insert a few newlines and it's fixed.
Yes, I missed a few lines during cut'n'paste. I edited the recipe. Thank you!
read_and_parse? When trying to run the file standalone, i was prompted to give code for "read_and_parse" as it could not be found... So i guess the code is broken?
thanks. Noted you entered the read_and_parse code as well. :) Thank you!
2005-11-28 Bug correction. There where a bug (a missing row = [] in end_element()). Now it is corrected. I hope it helps...