Welcome, guest | Sign In | My Account | Store | Cart

Implemented an SQL style INNER JOIN for two lists of tuples to be joined on a common field.

Python, 41 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def inner_join(lot0, lot1, key0=None, key1=None, sort=True):
    ''' Joins two lists of tuples (or namedtuples) on a key field.
        If a keyfunc is not specified, it defaults to the first field in each tuple.
        Returns a sorted list in the form:  result [(key, lot1_tuple, lot2_tuple) ...]

        Keys in lot0 but not lot1 raise a KeyError.
        Keys in lot1 but not lot0 are silently ignored.
        Tuples with a non-unique key (duplicates) are silently ignored.

        Keys must be hashable and orderable or a TypeError is raised.
        Running time is O(n log n) due to the sorting step.

        from collections import namedtuple
        Employee = namedtuple('Employee', 'name dept paygrade')
        Person = namedtuple('Person', 'name age gender status')
        ee = [Employee('adam', 'acct', 8),
        ...   Employee('betty', 'sales', 9),
        ...   Employee('charlie', 'acct', 2),
        ... ]
        pp = [Person('adam', 30, 'm', 'single'),
        ...   Person('betty', 28, 'f', 'married'),
        ...   Person('charlie', 40, 'm', 'single'),
        ... ]
        for k, e, p in inner_join(ee, pp):
        ...    print k, e.dept, p.age
        adam acct 30
        betty sales 28
        charlie acct 40

    '''
    key0 = key0 or (lambda r: r[0])
    key1 = key1 or (lambda r: r[0])
    d0 = {key0(r): r for r in lot0}
    d1 = {key1(r): r for r in lot1}
    seq = sorted(d0) if sort else iter(d0)
    return [(k, d0[k], d1[k]) for k in seq]


if __name__ == '__main__':
    import doctest
    print doctest.testmod()

3 comments

Brandon Beck 12 years, 5 months ago  # | flag

This isn't quite an inner join in the SQL sense, it's more like a merge operation. A SQL inner join would allow there to be key in one of the lots that isn't in the other, and that key wouldn't be present in the output. The simple fix for this is to replace your return statement with:

return [(k, d0[k], d1[k]) for k in seq if k in d1]

Unfortunately this changes the running time to O(n^2) because of the worst case lookup of entries in a hash table. You could store d1 in a tree based set or sort it as well and use a binary search for lookups. Either of those approaches should get you back to an O(n log n) running time.

Also once you permit there to be entries in one of your lots that doesn't appear in the returned join you'll probably also want to take the sizes of the lots into account since O(m log n) can be drastically different from O(n log m).

Finally there is yet another approach to solving this problem since you are constraining your lots to be orderable. You can sort each lot then do a parallel iteration over them. Say you're using p0 and p1 as the pointers into lot0 and lot1. In this case if lot0[p0] == lot1[p1] you'll want to emit an entry and increment both p0 and p1. If lot0[p0] < lot1[p1] then you'll increment just p0, and the opposite for the lot0[p0] > lot1[p1] case. This gives you an O(n log n) running time that doesn't require any extra space (except that for the returned tuples of course).

Bjorn Pettersen 12 years, 5 months ago  # | flag

It might be a join, even a useful join, but it's definitely not a SQL inner join :-)

You can find the three fundamental join algorithms on wikipedia: http://en.wikipedia.org/wiki/Nested_loop_join, http://en.wikipedia.org/wiki/Sort-merge_join, and http://en.wikipedia.org/wiki/Hash_join

Stephen Chappell 12 years, 5 months ago  # | flag

It might not be as simple as your code, but you might want to check recipe 577825. The module is a first-attempt at writing a rudimentary database engine. Inner, full, left, and right joins are implemented along with unions. Most of the work is done in the _join_loop function.