Welcome, guest | Sign In | My Account | Store | Cart

Easily extract data from microsoft excel files using this wrapper class for xlrd (http://www.lexicon.net/sjmachin/xlrd.htm). The class allows you to create a generator which returns excel data one row at a time as either a list or dictionary. I found this very handy for easily pulling in a variety of excel files without having to deal with COM calls or even needing to have windows.

Python, 153 lines
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
class readexcel(object):
    """ Simple OS Independent Class for Extracting Data from Excel Files 
        the using xlrd module found at http://www.lexicon.net/sjmachin/xlrd.htm
        
        Versions of Excel supported: 2004, 2002, XP, 2000, 97, 95, 5, 4, 3
        xlrd version tested: 0.5.2
        
        Data is extracted by creating a iterator object which can be used to 
        return data one row at a time. The default extraction method assumes 
        that the worksheet is in tabular format with the first nonblank row
        containing variable names and all subsequent rows containing values.
        This method returns a dictionary which uses the variables names as keys
        for each piece of data in the row.  Data can also be extracted with 
        each row represented by a list.
        
        Extracted data is represented fairly logically. By default dates are
        returned as strings in "yyyy/mm/dd" format or "yyyy/mm/dd hh:mm:ss",
        as appropriate.  However, dates can be return as a tuple containing
        (Year, Month, Day, Hour, Min, Second) which is appropriate for usage
        with mxDateTime or DateTime.  Numbers are returned as either INT or 
        FLOAT, whichever is needed to support the data.  Text, booleans, and
        error codes are also returned as appropriate representations.
        
        Quick Example:
        xl = readexcel('testdata.xls')
        sheetnames = xl.worksheets()
        for sheet in sheetnames:
            print sheet
            for row in xl.getiter(sheet):
                # Do Something here
        """ 
    def __init__(self, filename):
        """ Returns a readexcel object of the specified filename - this may
        take a little while because the file must be parsed into memory """
        import xlrd
        import os.path
        if not os.path.isfile(filename):
            raise NameError, "%s is not a valid filename" % filename
        self.__filename__ = filename
        self.__book__ = xlrd.open_workbook(filename)
        self.__sheets__ = {}
        self.__sheetnames__ = []
        for i in self.__book__.sheet_names():
            uniquevars = []
            firstrow = 0
            sheet = self.__book__.sheet_by_name(i)
            for row in range(sheet.nrows):
                types,values = sheet.row_types(row),sheet.row_values(row)
                nonblank = False
                for j in values:
                    if j != '':
                        nonblank=True
                        break
                if nonblank:
                    # Generate a listing of Unique Variable Names for Use as
                    # Dictionary Keys In Extraction. Duplicate Names will
                    # be replaced with "F#"
                    variables = self.__formatrow__(types,values,False)
                    unknown = 1
                    while variables:
                        var = variables.pop(0)
                        if var in uniquevars or var == '':
                            var = 'F' + str(unknown)
                            unknown += 1
                        uniquevars.append(str(var))
                    firstrow = row + 1
                    break
            self.__sheetnames__.append(i)
            self.__sheets__.setdefault(i,{}).__setitem__('rows',sheet.nrows)
            self.__sheets__.setdefault(i,{}).__setitem__('cols',sheet.ncols)
            self.__sheets__.setdefault(i,{}).__setitem__('firstrow',firstrow)
            self.__sheets__.setdefault(i,{}).__setitem__('variables',uniquevars[:])
    def getiter(self, sheetname, returnlist=False, returntupledate=False):
        """ Return an generator object which yields the lines of a worksheet;
        Default returns a dictionary, specifing returnlist=True causes lists
        to be returned.  Calling returntupledate=True causes dates to returned
        as tuples of (Year, Month, Day, Hour, Min, Second) instead of as a
        string """
        if sheetname not in self.__sheets__.keys():
            raise NameError, "%s is not present in %s" % (sheetname,\
                                                          self.__filename__)
        if returnlist:
            return __iterlist__(self, sheetname, returntupledate)
        else:
            return __iterdict__(self, sheetname, returntupledate)
    def worksheets(self):
        """ Returns a list of the Worksheets in the Excel File """
        return self.__sheetnames__
    def nrows(self, worksheet):
        """ Return the number of rows in a worksheet """
        return self.__sheets__[worksheet]['rows']
    def ncols(self, worksheet):
        """ Return the number of columns in a worksheet """
        return self.__sheets__[worksheet]['cols']
    def variables(self,worksheet):
        """ Returns a list of Column Names in the file,
            assuming a tabular format of course. """
        return self.__sheets__[worksheet]['variables']
    def __formatrow__(self, types, values, wanttupledate):
        """ Internal function used to clean up the incoming excel data """
        ##  Data Type Codes:
        ##  EMPTY 0
        ##  TEXT 1 a Unicode string 
        ##  NUMBER 2 float 
        ##  DATE 3 float 
        ##  BOOLEAN 4 int; 1 means TRUE, 0 means FALSE 
        ##  ERROR 5 
        import xlrd
        returnrow = []
        for i in range(len(types)):
            type,value = types[i],values[i]
            if type == 2:
                if value == int(value):
                    value = int(value)
            elif type == 3:
                datetuple = xlrd.xldate_as_tuple(value, self.__book__.datemode)
                if wanttupledate:
                    value = datetuple
                else:
                    # time only no date component
                    if datetuple[0] == 0 and datetuple[1] == 0 and \
                       datetuple[2] == 0: 
                        value = "%02d:%02d:%02d" % datetuple[3:]
                    # date only, no time
                    elif datetuple[3] == 0 and datetuple[4] == 0 and \
                         datetuple[5] == 0:
                        value = "%04d/%02d/%02d" % datetuple[:3]
                    else: # full date
                        value = "%04d/%02d/%02d %02d:%02d:%02d" % datetuple
            elif type == 5:
                value = xlrd.error_text_from_code[value]
            returnrow.append(value)
        return returnrow
    
def __iterlist__(excel, sheetname, tupledate):
    """ Function Used To Create the List Iterator """
    sheet = excel.__book__.sheet_by_name(sheetname)
    for row in range(excel.__sheets__[sheetname]['rows']):
        types,values = sheet.row_types(row),sheet.row_values(row)
        yield excel.__formatrow__(types, values, tupledate)

def __iterdict__(excel, sheetname, tupledate):
    """ Function Used To Create the Dictionary Iterator """
    sheet = excel.__book__.sheet_by_name(sheetname)
    for row in range(excel.__sheets__[sheetname]['firstrow'],\
                     excel.__sheets__[sheetname]['rows']):
        types,values = sheet.row_types(row),sheet.row_values(row)
        formattedrow = excel.__formatrow__(types, values, tupledate)
        # Pad a Short Row With Blanks if Needed
        for i in range(len(formattedrow),\
                       len(excel.__sheets__[sheetname]['variables'])):
            formattedrow.append('')
        yield dict(zip(excel.__sheets__[sheetname]['variables'],formattedrow))

I use this class all the time to simplify excel imports. I hope some people find it useful for them as well.

7 comments

Timothy Kesten 15 years, 4 months ago  # | flag

Nice Code. Hallo Brian,

I'm new on Python but have to work on a Python-Project.There are a task to handling "Excel-Sheets". And your snippet of code works verry well for my job without reading Modul-Interfaces. It saved me a lot of time. Never the less - in future I'll reading the Modul-Interface of xlrd soon ;-)

Thanks for your work Regards Timothy

Mario Ruggier 13 years, 9 months ago  # | flag

a revised, somewhat shorter version. Yes thanks, it certainly got me going faster with using xlrd. Going thru the code do some adjustments I needed for my project, I ended up revising the class, ending up with a shorter version that retains all this functionality, adding also a little more. The revised code is at: http://gizmojo.org/code/readexcel/

Brad J Zoltick 13 years, 3 months ago  # | flag

Thank you for posting this python class. I have been using the two perl modules, Spreadsheet::ParseExcel and Spreadsheet::WriteExcel to read and write Excel files and I was looking for the equivalent functionality in python.

Regards, B. Zoltick, NIH

Daniel Serodio 13 years ago  # | flag

Thank you! Your code saved me lots of time.

Brandon Dixon 12 years, 9 months ago  # | flag

How do you save off a piece of the information from the generator object?

Robert Fabian 12 years, 2 months ago  # | flag

An issue I had with this code was that floats in excel spreadsheets can be "inaccurate" (e.g. too many digits) and need to be rounded according to the cell's number format before they are "correct". For example, I've seen something like 23.99999999991 when the person who created the spreadsheet thought it was 24. That means you have to find and interpret the format for each cell. Also, I wanted a lower-level version that didn't try to interpret the first row as column headers and that returned everything as strings so that I could read CSV and XLS files using the same API. That also means it assumes you only want to read one sheet (at a time). I thought that the format-checking code was worth mentioning here. The code is at http://raf.org/as/xls.py and it can be used as a module or as a stand-alone xls2csv translator.

Roger 7 years, 3 months ago  # | flag

I'm using this in production. I'm working on a feature to drop column that's all empty

Created by Bryan Niederberger on Thu, 6 Apr 2006 (PSF)
Python recipes (4591)
Bryan Niederberger's recipes (2)

Required Modules

Other Information and Tasks