Easily extract data from microsoft excel files using this wrapper class for xlrd (http://www.lexicon.net/sjmachin/xlrd.htm). The class allows you to create a generator which returns excel data one row at a time as either a list or dictionary. I found this very handy for easily pulling in a variety of excel files without having to deal with COM calls or even needing to have windows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 | class readexcel(object):
""" Simple OS Independent Class for Extracting Data from Excel Files
the using xlrd module found at http://www.lexicon.net/sjmachin/xlrd.htm
Versions of Excel supported: 2004, 2002, XP, 2000, 97, 95, 5, 4, 3
xlrd version tested: 0.5.2
Data is extracted by creating a iterator object which can be used to
return data one row at a time. The default extraction method assumes
that the worksheet is in tabular format with the first nonblank row
containing variable names and all subsequent rows containing values.
This method returns a dictionary which uses the variables names as keys
for each piece of data in the row. Data can also be extracted with
each row represented by a list.
Extracted data is represented fairly logically. By default dates are
returned as strings in "yyyy/mm/dd" format or "yyyy/mm/dd hh:mm:ss",
as appropriate. However, dates can be return as a tuple containing
(Year, Month, Day, Hour, Min, Second) which is appropriate for usage
with mxDateTime or DateTime. Numbers are returned as either INT or
FLOAT, whichever is needed to support the data. Text, booleans, and
error codes are also returned as appropriate representations.
Quick Example:
xl = readexcel('testdata.xls')
sheetnames = xl.worksheets()
for sheet in sheetnames:
print sheet
for row in xl.getiter(sheet):
# Do Something here
"""
def __init__(self, filename):
""" Returns a readexcel object of the specified filename - this may
take a little while because the file must be parsed into memory """
import xlrd
import os.path
if not os.path.isfile(filename):
raise NameError, "%s is not a valid filename" % filename
self.__filename__ = filename
self.__book__ = xlrd.open_workbook(filename)
self.__sheets__ = {}
self.__sheetnames__ = []
for i in self.__book__.sheet_names():
uniquevars = []
firstrow = 0
sheet = self.__book__.sheet_by_name(i)
for row in range(sheet.nrows):
types,values = sheet.row_types(row),sheet.row_values(row)
nonblank = False
for j in values:
if j != '':
nonblank=True
break
if nonblank:
# Generate a listing of Unique Variable Names for Use as
# Dictionary Keys In Extraction. Duplicate Names will
# be replaced with "F#"
variables = self.__formatrow__(types,values,False)
unknown = 1
while variables:
var = variables.pop(0)
if var in uniquevars or var == '':
var = 'F' + str(unknown)
unknown += 1
uniquevars.append(str(var))
firstrow = row + 1
break
self.__sheetnames__.append(i)
self.__sheets__.setdefault(i,{}).__setitem__('rows',sheet.nrows)
self.__sheets__.setdefault(i,{}).__setitem__('cols',sheet.ncols)
self.__sheets__.setdefault(i,{}).__setitem__('firstrow',firstrow)
self.__sheets__.setdefault(i,{}).__setitem__('variables',uniquevars[:])
def getiter(self, sheetname, returnlist=False, returntupledate=False):
""" Return an generator object which yields the lines of a worksheet;
Default returns a dictionary, specifing returnlist=True causes lists
to be returned. Calling returntupledate=True causes dates to returned
as tuples of (Year, Month, Day, Hour, Min, Second) instead of as a
string """
if sheetname not in self.__sheets__.keys():
raise NameError, "%s is not present in %s" % (sheetname,\
self.__filename__)
if returnlist:
return __iterlist__(self, sheetname, returntupledate)
else:
return __iterdict__(self, sheetname, returntupledate)
def worksheets(self):
""" Returns a list of the Worksheets in the Excel File """
return self.__sheetnames__
def nrows(self, worksheet):
""" Return the number of rows in a worksheet """
return self.__sheets__[worksheet]['rows']
def ncols(self, worksheet):
""" Return the number of columns in a worksheet """
return self.__sheets__[worksheet]['cols']
def variables(self,worksheet):
""" Returns a list of Column Names in the file,
assuming a tabular format of course. """
return self.__sheets__[worksheet]['variables']
def __formatrow__(self, types, values, wanttupledate):
""" Internal function used to clean up the incoming excel data """
## Data Type Codes:
## EMPTY 0
## TEXT 1 a Unicode string
## NUMBER 2 float
## DATE 3 float
## BOOLEAN 4 int; 1 means TRUE, 0 means FALSE
## ERROR 5
import xlrd
returnrow = []
for i in range(len(types)):
type,value = types[i],values[i]
if type == 2:
if value == int(value):
value = int(value)
elif type == 3:
datetuple = xlrd.xldate_as_tuple(value, self.__book__.datemode)
if wanttupledate:
value = datetuple
else:
# time only no date component
if datetuple[0] == 0 and datetuple[1] == 0 and \
datetuple[2] == 0:
value = "%02d:%02d:%02d" % datetuple[3:]
# date only, no time
elif datetuple[3] == 0 and datetuple[4] == 0 and \
datetuple[5] == 0:
value = "%04d/%02d/%02d" % datetuple[:3]
else: # full date
value = "%04d/%02d/%02d %02d:%02d:%02d" % datetuple
elif type == 5:
value = xlrd.error_text_from_code[value]
returnrow.append(value)
return returnrow
def __iterlist__(excel, sheetname, tupledate):
""" Function Used To Create the List Iterator """
sheet = excel.__book__.sheet_by_name(sheetname)
for row in range(excel.__sheets__[sheetname]['rows']):
types,values = sheet.row_types(row),sheet.row_values(row)
yield excel.__formatrow__(types, values, tupledate)
def __iterdict__(excel, sheetname, tupledate):
""" Function Used To Create the Dictionary Iterator """
sheet = excel.__book__.sheet_by_name(sheetname)
for row in range(excel.__sheets__[sheetname]['firstrow'],\
excel.__sheets__[sheetname]['rows']):
types,values = sheet.row_types(row),sheet.row_values(row)
formattedrow = excel.__formatrow__(types, values, tupledate)
# Pad a Short Row With Blanks if Needed
for i in range(len(formattedrow),\
len(excel.__sheets__[sheetname]['variables'])):
formattedrow.append('')
yield dict(zip(excel.__sheets__[sheetname]['variables'],formattedrow))
|
I use this class all the time to simplify excel imports. I hope some people find it useful for them as well.
Nice Code. Hallo Brian,
I'm new on Python but have to work on a Python-Project.There are a task to handling "Excel-Sheets". And your snippet of code works verry well for my job without reading Modul-Interfaces. It saved me a lot of time. Never the less - in future I'll reading the Modul-Interface of xlrd soon ;-)
Thanks for your work Regards Timothy
a revised, somewhat shorter version. Yes thanks, it certainly got me going faster with using xlrd. Going thru the code do some adjustments I needed for my project, I ended up revising the class, ending up with a shorter version that retains all this functionality, adding also a little more. The revised code is at: http://gizmojo.org/code/readexcel/
Thank you for posting this python class. I have been using the two perl modules, Spreadsheet::ParseExcel and Spreadsheet::WriteExcel to read and write Excel files and I was looking for the equivalent functionality in python.
Regards, B. Zoltick, NIH
Thank you! Your code saved me lots of time.
How do you save off a piece of the information from the generator object?
An issue I had with this code was that floats in excel spreadsheets can be "inaccurate" (e.g. too many digits) and need to be rounded according to the cell's number format before they are "correct". For example, I've seen something like 23.99999999991 when the person who created the spreadsheet thought it was 24. That means you have to find and interpret the format for each cell. Also, I wanted a lower-level version that didn't try to interpret the first row as column headers and that returned everything as strings so that I could read CSV and XLS files using the same API. That also means it assumes you only want to read one sheet (at a time). I thought that the format-checking code was worth mentioning here. The code is at http://raf.org/as/xls.py and it can be used as a module or as a stand-alone xls2csv translator.
I'm using this in production. I'm working on a feature to drop column that's all empty