Read in a tab-delimited (or any separator-delimited like CSV) file and store each column in a list that can be referenced from a dictionary. The keys for the dictionary are the headings for the columns (if any). All data is read in as strings.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 | def getColumns(inFile, delim="\t", header=True):
"""
Get columns of data from inFile. The order of the rows is respected
:param inFile: column file separated by delim
:param header: if True the first line will be considered a header line
:returns: a tuple of 2 dicts (cols, indexToName). cols dict has keys that
are headings in the inFile, and values are a list of all the entries in that
column. indexToName dict maps column index to names that are used as keys in
the cols dict. The names are the same as the headings used in inFile. If
header is False, then column indices (starting from 0) are used for the
heading names (i.e. the keys in the cols dict)
"""
cols = {}
indexToName = {}
for lineNum, line in enumerate(inFile):
if lineNum == 0:
headings = line.split(delim)
i = 0
for heading in headings:
heading = heading.strip()
if header:
cols[heading] = []
indexToName[i] = heading
else:
# in this case the heading is actually just a cell
cols[i] = [heading]
indexToName[i] = i
i += 1
else:
cells = line.split(delim)
i = 0
for cell in cells:
cell = cell.strip()
cols[indexToName[i]] += [cell]
i += 1
return cols, indexToName
|
I find this recipe most useful when I want to find overlaps between different lists.
Suppose this is our sample file called ingredients.txt
Cake Cookie Muffin
Flour Butter Yogurt
Salt Sugar Eggs
Cocoa Powder Eggs Blueberries
Butter Vanilla Vanilla
Sugar Flour Flour
Eggs Salt Sugar
Vanilla Baking Soda Baking Powder
Baking Soda Chocolate Baking Soda
Each column is separated by a tab. The first line is a heading line. Note that the script can handle non-tab separator characters and lists with no headings too. All data is read in as strings.
To read in the file:
>>> ingredients = file("ingredients.txt", 'r')
>>> cols, indexToName = getColumns(ingredients)
>>> ingredients.close()
Now you can work with the data in python. Here are some examples:
>>> len(cols['Cake'])
8
>>> cols['Cake'] == cols[indexToName[0]]
True
>>> cols['Cookie'] == cols[indexToName[1]]
True
>>> cols['Muffin']
['Yogurt', 'Eggs', 'Blueberries', 'Vanilla', 'Flour', 'Sugar', 'Baking Powder', 'Baking Soda']
>>> cols[indexToName[2]]
['Yogurt', 'Eggs', 'Blueberries', 'Vanilla', 'Flour', 'Sugar', 'Baking Powder', 'Baking Soda']
>>> commonCakeCookie = set(cols['Cake']).intersection(set(cols['Cookie']))
>>> commonCakeCookie
set(['Butter', 'Flour', 'Eggs', 'Sugar', 'Salt', 'Baking Soda', 'Vanilla'])
>>> len(commonCakeCookie)
7
>>> commonCakeCookieMuffin = commonCakeCookie.intersection(cols['Muffin'])
>>> commonCakeCookieMuffin
set(['Flour', 'Eggs', 'Baking Soda', 'Vanilla', 'Sugar'])
You can also choose to read in the file by not ignoring the heading line. In this case cols
is indexed by the column index and you don't need to use the indexToName
dictionary:
>>> cols, indexToName = getColumns(ingredients, header=False)
>>> cols[0]
['Cake', 'Flour', 'Salt', 'Cocoa Powder', 'Butter', 'Sugar', 'Eggs', 'Vanilla', 'Baking Soda']
>>> cols[indexToName[0]] == cols[0]
True
Similarly you can specify a different delimiter character when calling getColumns()
:
>>> getColumns(csvFile, delim=",")
Project home: http://bitbucket.org/alinium/fileparsers
The csv module in the standard distribution does the same as your recipe, and is more robust : for instance if a field happens to contain the delimiter your recipe will split at the wrong place
That's true. I only discovered the csv module after have written this recipe. I'll update this to use the csv module when I have some spare time.