Implemented an SQL style INNER JOIN for two lists of tuples to be joined on a common field.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | def inner_join(lot0, lot1, key0=None, key1=None, sort=True):
''' Joins two lists of tuples (or namedtuples) on a key field.
If a keyfunc is not specified, it defaults to the first field in each tuple.
Returns a sorted list in the form: result [(key, lot1_tuple, lot2_tuple) ...]
Keys in lot0 but not lot1 raise a KeyError.
Keys in lot1 but not lot0 are silently ignored.
Tuples with a non-unique key (duplicates) are silently ignored.
Keys must be hashable and orderable or a TypeError is raised.
Running time is O(n log n) due to the sorting step.
from collections import namedtuple
Employee = namedtuple('Employee', 'name dept paygrade')
Person = namedtuple('Person', 'name age gender status')
ee = [Employee('adam', 'acct', 8),
... Employee('betty', 'sales', 9),
... Employee('charlie', 'acct', 2),
... ]
pp = [Person('adam', 30, 'm', 'single'),
... Person('betty', 28, 'f', 'married'),
... Person('charlie', 40, 'm', 'single'),
... ]
for k, e, p in inner_join(ee, pp):
... print k, e.dept, p.age
adam acct 30
betty sales 28
charlie acct 40
'''
key0 = key0 or (lambda r: r[0])
key1 = key1 or (lambda r: r[0])
d0 = {key0(r): r for r in lot0}
d1 = {key1(r): r for r in lot1}
seq = sorted(d0) if sort else iter(d0)
return [(k, d0[k], d1[k]) for k in seq]
if __name__ == '__main__':
import doctest
print doctest.testmod()
|
This isn't quite an inner join in the SQL sense, it's more like a merge operation. A SQL inner join would allow there to be key in one of the lots that isn't in the other, and that key wouldn't be present in the output. The simple fix for this is to replace your return statement with:
Unfortunately this changes the running time to
O(n^2)
because of the worst case lookup of entries in a hash table. You could stored1
in a tree based set or sort it as well and use a binary search for lookups. Either of those approaches should get you back to anO(n log n)
running time.Also once you permit there to be entries in one of your lots that doesn't appear in the returned join you'll probably also want to take the sizes of the lots into account since
O(m log n)
can be drastically different fromO(n log m)
.Finally there is yet another approach to solving this problem since you are constraining your lots to be orderable. You can sort each lot then do a parallel iteration over them. Say you're using
p0
andp1
as the pointers intolot0
andlot1
. In this case iflot0[p0] == lot1[p1]
you'll want to emit an entry and increment bothp0
andp1
. Iflot0[p0] < lot1[p1]
then you'll increment justp0
, and the opposite for thelot0[p0] > lot1[p1]
case. This gives you anO(n log n)
running time that doesn't require any extra space (except that for the returned tuples of course).It might be a join, even a useful join, but it's definitely not a SQL inner join :-)
You can find the three fundamental join algorithms on wikipedia: http://en.wikipedia.org/wiki/Nested_loop_join, http://en.wikipedia.org/wiki/Sort-merge_join, and http://en.wikipedia.org/wiki/Hash_join
It might not be as simple as your code, but you might want to check recipe 577825. The module is a first-attempt at writing a rudimentary database engine. Inner, full, left, and right joins are implemented along with unions. Most of the work is done in the
_join_loop
function.