Welcome, guest | Sign In | My Account | Store | Cart

Read in a tab-delimited (or any separator-delimited like CSV) file and store each column in a list that can be referenced from a dictionary. The keys for the dictionary are the headings for the columns (if any). All data is read in as strings.

Python, 38 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
def getColumns(inFile, delim="\t", header=True):
    """
    Get columns of data from inFile. The order of the rows is respected
    
    :param inFile: column file separated by delim
    :param header: if True the first line will be considered a header line
    :returns: a tuple of 2 dicts (cols, indexToName). cols dict has keys that 
    are headings in the inFile, and values are a list of all the entries in that
    column. indexToName dict maps column index to names that are used as keys in 
    the cols dict. The names are the same as the headings used in inFile. If
    header is False, then column indices (starting from 0) are used for the 
    heading names (i.e. the keys in the cols dict)
    """
    cols = {}
    indexToName = {}
    for lineNum, line in enumerate(inFile):
        if lineNum == 0:
            headings = line.split(delim)
            i = 0
            for heading in headings:
                heading = heading.strip()
                if header:
                    cols[heading] = []
                    indexToName[i] = heading
                else:
                    # in this case the heading is actually just a cell
                    cols[i] = [heading]
                    indexToName[i] = i
                i += 1
        else:
            cells = line.split(delim)
            i = 0
            for cell in cells:
                cell = cell.strip()
                cols[indexToName[i]] += [cell]
                i += 1
                
    return cols, indexToName

I find this recipe most useful when I want to find overlaps between different lists.

Suppose this is our sample file called ingredients.txt

Cake    Cookie  Muffin
Flour   Butter  Yogurt
Salt    Sugar   Eggs
Cocoa Powder    Eggs    Blueberries
Butter  Vanilla Vanilla
Sugar   Flour   Flour
Eggs    Salt    Sugar
Vanilla Baking Soda Baking Powder
Baking Soda Chocolate   Baking Soda

Each column is separated by a tab. The first line is a heading line. Note that the script can handle non-tab separator characters and lists with no headings too. All data is read in as strings.

To read in the file:

>>> ingredients = file("ingredients.txt", 'r')
>>> cols, indexToName = getColumns(ingredients)
>>> ingredients.close()

Now you can work with the data in python. Here are some examples:

>>> len(cols['Cake'])
8

>>> cols['Cake'] == cols[indexToName[0]]
True
>>> cols['Cookie'] == cols[indexToName[1]]
True

>>> cols['Muffin']
['Yogurt', 'Eggs', 'Blueberries', 'Vanilla', 'Flour', 'Sugar', 'Baking Powder', 'Baking Soda']
>>> cols[indexToName[2]]
['Yogurt', 'Eggs', 'Blueberries', 'Vanilla', 'Flour', 'Sugar', 'Baking Powder', 'Baking Soda']

>>> commonCakeCookie = set(cols['Cake']).intersection(set(cols['Cookie']))
>>> commonCakeCookie
set(['Butter', 'Flour', 'Eggs', 'Sugar', 'Salt', 'Baking Soda', 'Vanilla'])
>>> len(commonCakeCookie)
7

>>> commonCakeCookieMuffin = commonCakeCookie.intersection(cols['Muffin'])
>>> commonCakeCookieMuffin
set(['Flour', 'Eggs', 'Baking Soda', 'Vanilla', 'Sugar'])

You can also choose to read in the file by not ignoring the heading line. In this case cols is indexed by the column index and you don't need to use the indexToName dictionary:

>>> cols, indexToName = getColumns(ingredients, header=False)

>>> cols[0]
['Cake', 'Flour', 'Salt', 'Cocoa Powder', 'Butter', 'Sugar', 'Eggs', 'Vanilla', 'Baking Soda']
>>> cols[indexToName[0]] == cols[0]
True

Similarly you can specify a different delimiter character when calling getColumns():

>>> getColumns(csvFile, delim=",")

Project home: http://bitbucket.org/alinium/fileparsers

2 comments

Pierre Quentel 11 years ago  # | flag

The csv module in the standard distribution does the same as your recipe, and is more robust : for instance if a field happens to contain the delimiter your recipe will split at the wrong place

alinium (author) 11 years ago  # | flag

That's true. I only discovered the csv module after have written this recipe. I'll update this to use the csv module when I have some spare time.

Created by alinium on Thu, 28 Oct 2010 (MIT)
Python recipes (4591)
alinium's recipes (1)

Required Modules

  • (none specified)

Other Information and Tasks