Implemented an SQL style INNER JOIN for two lists of tuples to be joined on a common field.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | ```
def inner_join(lot0, lot1, key0=None, key1=None, sort=True):
''' Joins two lists of tuples (or namedtuples) on a key field.
If a keyfunc is not specified, it defaults to the first field in each tuple.
Returns a sorted list in the form: result [(key, lot1_tuple, lot2_tuple) ...]
Keys in lot0 but not lot1 raise a KeyError.
Keys in lot1 but not lot0 are silently ignored.
Tuples with a non-unique key (duplicates) are silently ignored.
Keys must be hashable and orderable or a TypeError is raised.
Running time is O(n log n) due to the sorting step.
from collections import namedtuple
Employee = namedtuple('Employee', 'name dept paygrade')
Person = namedtuple('Person', 'name age gender status')
ee = [Employee('adam', 'acct', 8),
... Employee('betty', 'sales', 9),
... Employee('charlie', 'acct', 2),
... ]
pp = [Person('adam', 30, 'm', 'single'),
... Person('betty', 28, 'f', 'married'),
... Person('charlie', 40, 'm', 'single'),
... ]
for k, e, p in inner_join(ee, pp):
... print k, e.dept, p.age
adam acct 30
betty sales 28
charlie acct 40
'''
key0 = key0 or (lambda r: r[0])
key1 = key1 or (lambda r: r[0])
d0 = {key0(r): r for r in lot0}
d1 = {key1(r): r for r in lot1}
seq = sorted(d0) if sort else iter(d0)
return [(k, d0[k], d1[k]) for k in seq]
if __name__ == '__main__':
import doctest
print doctest.testmod()
``` |

This isn't quite an inner join in the SQL sense, it's more like a merge operation. A SQL inner join would allow there to be key in one of the lots that isn't in the other, and that key wouldn't be present in the output. The simple fix for this is to replace your return statement with:

Unfortunately this changes the running time to

`O(n^2)`

because of the worst case lookup of entries in a hash table. You could store`d1`

in a tree based set or sort it as well and use a binary search for lookups. Either of those approaches should get you back to an`O(n log n)`

running time.Also once you permit there to be entries in one of your lots that doesn't appear in the returned join you'll probably also want to take the sizes of the lots into account since

`O(m log n)`

can be drastically different from`O(n log m)`

.Finally there is yet another approach to solving this problem since you are constraining your lots to be orderable. You can sort each lot then do a parallel iteration over them. Say you're using

`p0`

and`p1`

as the pointers into`lot0`

and`lot1`

. In this case if`lot0[p0] == lot1[p1]`

you'll want to emit an entry and increment both`p0`

and`p1`

. If`lot0[p0] < lot1[p1]`

then you'll increment just`p0`

, and the opposite for the`lot0[p0] > lot1[p1]`

case. This gives you an`O(n log n)`

running time that doesn't require any extra space (except that for the returned tuples of course).It might be a join, even a useful join, but it's definitely not a SQL inner join :-)

You can find the three fundamental join algorithms on wikipedia: http://en.wikipedia.org/wiki/Nested_loop_join, http://en.wikipedia.org/wiki/Sort-merge_join, and http://en.wikipedia.org/wiki/Hash_join

It might not be as simple as your code, but you might want to check recipe 577825. The module is a first-attempt at writing a rudimentary database engine. Inner, full, left, and right joins are implemented along with unions. Most of the work is done in the

`_join_loop`

function.