Welcome, guest | Sign In | My Account | Store | Cart

Implements the three standard relational join algorithms: nested loops join, hash join, and merge join, using the iterator algebra support in Python 2.4. This recipe presents code that can be used for inner join, left outer join, full outer join, and semijoins. The nested loops join supports a theta join.

Python, 227 lines
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
"""
Iterator algebra implementations of join algorithms: hash join, merge
join, and nested loops join, as well as a variant I dub "bisect join".

Requires Python 2.4.

Author: Jim Baker, jbaker@zyasoft.com

Define some relations to join:
>>> accounts = [
...    ('spender', 'Big', 'Spender', '123-456-7890'),
...    ('saver', 'Thrifty', 'Saver', '234-567-8901'),
...    ('nobody', 'A', 'Nobody', '999-999-9999')]
>>> transactions = [
...    ('spender', 'deposited', 100),
...    ('spender', 'withdraw', 40),
...    ('spender', 'withdraw', 15),
...    ('spender', 'withdraw', 25),
...    ('saver',  'deposited', 30)]

Given the above data, a useful join predicate is the first element of
each tuple. This will give us a usable equijoin:
>>> def first(x): return x[0]

Test that the implemented join methods are equivalent:
>>> nj = set(nested_loops_join(accounts, transactions, first))
>>> mj = set(merge_join(accounts, transactions, first))
>>> hj = set(hash_join(accounts, transactions, first))
>>> bj = set(bisect_join(accounts, transactions, first))
>>> nj == mj
True
>>> nj == hj
True
>>> mj == hj
True
>>> bj == hj
True

Perform a Cartesian join. In this case, we need a predicate that
returns the same value for every tuple, so in this case returning True
is arbitrary:
>>> def always(x): return True

With 5 tuples in transactions and 3 tuples in accounts, we should get
15 tuples:
>>> len(list(hash_join(accounts, transactions, always)))
15

With semijoins and outer joins, we now see the power of the
iterator approach. A left join will return a tuple even if there is
not a match on the predicate:
>>> lj = set(hash_join(accounts, transactions, first, 
...                    join=lambda X: left(X, when_empty=tuple([None]*3))))
>>> len(lj)
6

A full outer join is simply the union of a left join and a right
join. Let's make a right join by reversing the order of the join. We
will also to need to reverse how we combine the tuples:
>>> rj = set(nested_loops_join(transactions, accounts, first,
...                            combine=lambda r, s: s + r,
...                            join=lambda X: left(X, when_empty=tuple([None]*2))))
>>> len(rj)
5

Note that we have no activity in this example without an accompanying
account, so it's the same as an inner join. Now let's compute the full
outer join:
>>> foj = lj | rj
>>> len(foj)
6

A semijoin will return only the first matching tuple:
>>> len(list(hash_join(accounts, transactions, first, semi)))
2

>>> len(list(merge_join(accounts, transactions, first, semi)))
2

Semijoins are useful, especially in conjunction with hierarchical
queries. In SQL, if supported, they are the efficient alternative to
writing 'select distinct', although they make one's code difficult to
understand by the unversed.  Hopefully showing this with procedural
code can make this useful feature less mysterious.

"""

import operator

def identity(x):
    """x -> x

    As a predicate, this function is useless, but it provides a
    useful/correct default.

    >>> identity(('apple', 'banana', 'cherry'))
    ('apple', 'banana', 'cherry')
    """
    return x

def inner(X):
    """
    >>> X = [1, 2, 3, 4, 5]
    >>> list(inner(X))
    [1, 2, 3, 4, 5]
    """
    for x in X:
        yield x

def semi(X):
    """Returns X[0] if X is not empty

    >>> X = [1, 2, 3, 4, 5]
    >>> list(semi(X))
    [1]
    """
    
    for x in X:
        yield x
        break

def left(X, when_empty=()):
    empty = True
    for x in X:
        yield x
        empty = False
    if empty:
        yield when_empty
        

# Xp can mean X-prime or X-with-predicate; we use both meanings
# interchangeably here in the shorthand of this code.
# Xp -> (p(x0), x0), (p(x1), x1), ... 

# throughout, we assume S is the smaller relation of R and S.

def hash_join(R, S, predicate=identity, join=inner, combine=operator.concat):
    hashed = {}
    for s in S:
        hashed.setdefault(predicate(s),[]).append(s)
    for r in R:
        for s in join(hashed.get(predicate(r),())):
            yield combine(r, s)

def nested_loops_join(R, S, predicate=identity, join=inner,
                      combine=operator.concat, theta=operator.eq):
    Sp = [(predicate(s), s) for s in S]
    for r in R:
        rp = predicate(r) 
        for s in join(s for sp, s in Sp if theta(rp, sp)):
            yield combine(r, s)

def bisect_join(R, S, predicate=identity, join=inner, combine=operator.concat):
    """
    I have not found discussion of this variant on the sort-merge join anywhere.
    """

    from bisect import bisect_left

    def consume(Sp, si, rp):
        """This needs a better name..."""
        length = len(S)
        while si < length:
            sp, s = Sp[si]
            if rp == sp:
                yield s
            else:
                break
            si += 1

    Rp = sorted((predicate(r), r) for r in R)
    Sp = sorted((predicate(s), s) for s in S)

    for rp, r in Rp:
        si = bisect_left(Sp, (rp,))
        for s in join(consume(Sp, si, rp)):
            yield combine(r, s)


def merge_join(R, S, predicate=identity, join=inner, combine=operator.concat):
    """
    For obvious reasons, we depend on the predicate providing a
    sortable relation.

    Compare this presentation using iterator algebra with the much
    more difficult to follow presentation (IMHO) in
    http://en.wikipedia.org/wiki/Sort-Merge_Join
    """

    from itertools import groupby


    def advancer(Xp):
        """A simple wrapper of itertools.groupby, we simply need
        to follow our convention that Xp -> (xp0, x0), (xp1, x1), ...
        """

        for k, g in groupby(Xp, key=operator.itemgetter(0)):
            yield k, list(g)
    
    R_grouped = advancer(sorted((predicate(r), r) for r in R))
    S_grouped = advancer(sorted((predicate(s), s) for s in S))

    # in the join we need to distinguish rp from rk in the unpack, so
    # just use rk, sk
    rk, R_matched = R_grouped.next()
    sk, S_matched = S_grouped.next()

    while R_grouped and S_grouped:
        comparison = cmp(rk, sk)
        if comparison == 0:
            # standard Cartesian join here on the matched tuples, as
            # subsetted by the join method
            for rp, r in R_matched:
                for sp, s in join(S_matched):
                    yield combine(r, s)
            rk, R_matched = R_grouped.next()
            sk, S_matched = S_grouped.next()
        elif comparison > 0:
            sk, S_matched = S_grouped.next()
        else:
            rk, R_matched = R_grouped.next()


if __name__ == "__main__":
    import doctest
    doctest.testmod()

(First, I don't believe this recipe has any practical usage, but I would be very curious to be proven wrong!)

There are three standard algorithms for performing relational joins - nested loops, hash, and merge. Each has different requirements on the predicates (nlj - testable for equality, hj - hashable, mj - comparable). Standard pedagogical presentations, like this one in wikipedia (http://en.wikipedia.org/wiki/Sort-Merge_Join) typically assume a simple key, an inner join, and no mention of the predicate space. Executable pseudocode with no such restrictions seems to be hugely advantageous in comparison.

The merge_join function demonstrates the power of Python's iterator algebra support. Although hash and nested loops joins are comparatively simple, merge join is not so easy to describe. In this implementation we can see that the critical steps are to subgroup by the predicates, then to selectively advance one or both of the resulting iterators.

The nested loops join (no example here at this time) supports a theta join, that is a join on an arbitrary function of the predicates from the two relations, as opposed to just arbitrary equijoins.

Similar recipes include

Group by summarization: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/304162, http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/259173

Iterator merge: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/491285, which employs a priority queue heap to efficiently buffer the merge.

REFERENCES Joins - http://en.wikipedia.org/wiki/Join_(SQL) Hellerstein, Joseph, http://bnrg.cs.berkeley.edu/~adj/cs262/Lec_11_5.html Graefe, G. 1993,"Query Evaluation Techniques for Large Databases". ACM Computing Surveys 25(2): 73-170 Relational Python - http://jtauber.com/relational_python