How to emulate SQL aggregate functions AVG, COUNT, MAX, MIN and SUM on csv type data files using tools from the standard library.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 | #How to do aggregates using groupby, defaultdict and Counter on flat files
#This Example computes groups for:
#region, count(region), sum(sales), avg(sales), max(sales), min(sales)
#
#By Nestor Nissen
from csv import DictReader
from operator import itemgetter
from itertools import groupby
from collections import defaultdict, Counter
rawdata = '''
Region,City,Sales
Scotland,Edinburgh,20000
Scotland,Glasgow,12500
Wales,Cardiff,29700
Wales,Bangor,12800
England,London,90000
England,Manchester,45600
England,Liverpool,29700
'''.splitlines()[1:]
indata = list(DictReader(rawdata))
print('Using sort and groupby:')
counts = []
sums = []
avgs = []
maxs = []
mins = []
ordered_data = sorted(indata, key=itemgetter('Region'))
for region, group in groupby(ordered_data, key=itemgetter('Region')):
group_list = list(group)
count = sum(1 for city in group_list)
total = sum(int(city['Sales']) for city in group_list)
maxsale = max(int(city['Sales']) for city in group_list)
minsale = min(int(city['Sales']) for city in group_list)
counts.append((region, count))
sums.append((region, total))
avgs.append((region, total/count))
maxs.append((region, maxsale))
mins.append((region, minsale))
print('count:',counts, '\nsum:',sums, '\navg:',avgs,
'\nmax:',maxs, '\nmin:',mins, '\n')
print('Using defaultdict:')
dd_counts = defaultdict(int)
dd_sales = defaultdict(int)
dd_maxs = defaultdict(int)
dd_mins = defaultdict(lambda: 9**99)
for row in indata:
region = row['Region']
sales = int(row['Sales'])
dd_counts[region] += 1
dd_sales[region] += sales
dd_maxs[region] = max(dd_maxs[region], sales)
dd_mins[region] = min(dd_mins[region], sales)
counts = list(dd_counts.items())
sums = list(dd_sales.items())
avgs = [(key, dd_sales[key]/count) for key, count in dd_counts.items()]
maxs = list(dd_maxs.items())
mins = list(dd_mins.items())
print('count:',counts, '\nsum:',sums, '\navg:',avgs,
'\nmax:',maxs, '\nmin:',mins, '\n')
print('Using counter:')
counts = list(Counter(map(itemgetter('Region'), indata)).items())
print('count:',counts)
|
Replace "rawdata" by a file object to read from a file.
This code is a compromise between calculating only one aggregate or many. Many of the lists can (and possibly should) be omitted when adapting the code; they are only here to be able to reuse the data from iterators and for pretty printing. The generator expressions can be converted to traditional loops to do multiple calculations in one go.
See also: http://code.activestate.com/recipes/304162-summary-reports-using-itertoolsgroupby/ for an alternative implementation of sum.
Could you use one Python function for one SQL aggregate function? It’s a bit hard to see what’s the generic code and what’s the example in the recipe.