Implements the three standard relational join algorithms: nested loops join, hash join, and merge join, using the iterator algebra support in Python 2.4. This recipe presents code that can be used for inner join, left outer join, full outer join, and semijoins. The nested loops join supports a theta join.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 | """
Iterator algebra implementations of join algorithms: hash join, merge
join, and nested loops join, as well as a variant I dub "bisect join".
Requires Python 2.4.
Author: Jim Baker, jbaker@zyasoft.com
Define some relations to join:
>>> accounts = [
... ('spender', 'Big', 'Spender', '123-456-7890'),
... ('saver', 'Thrifty', 'Saver', '234-567-8901'),
... ('nobody', 'A', 'Nobody', '999-999-9999')]
>>> transactions = [
... ('spender', 'deposited', 100),
... ('spender', 'withdraw', 40),
... ('spender', 'withdraw', 15),
... ('spender', 'withdraw', 25),
... ('saver', 'deposited', 30)]
Given the above data, a useful join predicate is the first element of
each tuple. This will give us a usable equijoin:
>>> def first(x): return x[0]
Test that the implemented join methods are equivalent:
>>> nj = set(nested_loops_join(accounts, transactions, first))
>>> mj = set(merge_join(accounts, transactions, first))
>>> hj = set(hash_join(accounts, transactions, first))
>>> bj = set(bisect_join(accounts, transactions, first))
>>> nj == mj
True
>>> nj == hj
True
>>> mj == hj
True
>>> bj == hj
True
Perform a Cartesian join. In this case, we need a predicate that
returns the same value for every tuple, so in this case returning True
is arbitrary:
>>> def always(x): return True
With 5 tuples in transactions and 3 tuples in accounts, we should get
15 tuples:
>>> len(list(hash_join(accounts, transactions, always)))
15
With semijoins and outer joins, we now see the power of the
iterator approach. A left join will return a tuple even if there is
not a match on the predicate:
>>> lj = set(hash_join(accounts, transactions, first,
... join=lambda X: left(X, when_empty=tuple([None]*3))))
>>> len(lj)
6
A full outer join is simply the union of a left join and a right
join. Let's make a right join by reversing the order of the join. We
will also to need to reverse how we combine the tuples:
>>> rj = set(nested_loops_join(transactions, accounts, first,
... combine=lambda r, s: s + r,
... join=lambda X: left(X, when_empty=tuple([None]*2))))
>>> len(rj)
5
Note that we have no activity in this example without an accompanying
account, so it's the same as an inner join. Now let's compute the full
outer join:
>>> foj = lj | rj
>>> len(foj)
6
A semijoin will return only the first matching tuple:
>>> len(list(hash_join(accounts, transactions, first, semi)))
2
>>> len(list(merge_join(accounts, transactions, first, semi)))
2
Semijoins are useful, especially in conjunction with hierarchical
queries. In SQL, if supported, they are the efficient alternative to
writing 'select distinct', although they make one's code difficult to
understand by the unversed. Hopefully showing this with procedural
code can make this useful feature less mysterious.
"""
import operator
def identity(x):
"""x -> x
As a predicate, this function is useless, but it provides a
useful/correct default.
>>> identity(('apple', 'banana', 'cherry'))
('apple', 'banana', 'cherry')
"""
return x
def inner(X):
"""
>>> X = [1, 2, 3, 4, 5]
>>> list(inner(X))
[1, 2, 3, 4, 5]
"""
for x in X:
yield x
def semi(X):
"""Returns X[0] if X is not empty
>>> X = [1, 2, 3, 4, 5]
>>> list(semi(X))
[1]
"""
for x in X:
yield x
break
def left(X, when_empty=()):
empty = True
for x in X:
yield x
empty = False
if empty:
yield when_empty
# Xp can mean X-prime or X-with-predicate; we use both meanings
# interchangeably here in the shorthand of this code.
# Xp -> (p(x0), x0), (p(x1), x1), ...
# throughout, we assume S is the smaller relation of R and S.
def hash_join(R, S, predicate=identity, join=inner, combine=operator.concat):
hashed = {}
for s in S:
hashed.setdefault(predicate(s),[]).append(s)
for r in R:
for s in join(hashed.get(predicate(r),())):
yield combine(r, s)
def nested_loops_join(R, S, predicate=identity, join=inner,
combine=operator.concat, theta=operator.eq):
Sp = [(predicate(s), s) for s in S]
for r in R:
rp = predicate(r)
for s in join(s for sp, s in Sp if theta(rp, sp)):
yield combine(r, s)
def bisect_join(R, S, predicate=identity, join=inner, combine=operator.concat):
"""
I have not found discussion of this variant on the sort-merge join anywhere.
"""
from bisect import bisect_left
def consume(Sp, si, rp):
"""This needs a better name..."""
length = len(S)
while si < length:
sp, s = Sp[si]
if rp == sp:
yield s
else:
break
si += 1
Rp = sorted((predicate(r), r) for r in R)
Sp = sorted((predicate(s), s) for s in S)
for rp, r in Rp:
si = bisect_left(Sp, (rp,))
for s in join(consume(Sp, si, rp)):
yield combine(r, s)
def merge_join(R, S, predicate=identity, join=inner, combine=operator.concat):
"""
For obvious reasons, we depend on the predicate providing a
sortable relation.
Compare this presentation using iterator algebra with the much
more difficult to follow presentation (IMHO) in
http://en.wikipedia.org/wiki/Sort-Merge_Join
"""
from itertools import groupby
def advancer(Xp):
"""A simple wrapper of itertools.groupby, we simply need
to follow our convention that Xp -> (xp0, x0), (xp1, x1), ...
"""
for k, g in groupby(Xp, key=operator.itemgetter(0)):
yield k, list(g)
R_grouped = advancer(sorted((predicate(r), r) for r in R))
S_grouped = advancer(sorted((predicate(s), s) for s in S))
# in the join we need to distinguish rp from rk in the unpack, so
# just use rk, sk
rk, R_matched = R_grouped.next()
sk, S_matched = S_grouped.next()
while R_grouped and S_grouped:
comparison = cmp(rk, sk)
if comparison == 0:
# standard Cartesian join here on the matched tuples, as
# subsetted by the join method
for rp, r in R_matched:
for sp, s in join(S_matched):
yield combine(r, s)
rk, R_matched = R_grouped.next()
sk, S_matched = S_grouped.next()
elif comparison > 0:
sk, S_matched = S_grouped.next()
else:
rk, R_matched = R_grouped.next()
if __name__ == "__main__":
import doctest
doctest.testmod()
|
(First, I don't believe this recipe has any practical usage, but I would be very curious to be proven wrong!)
There are three standard algorithms for performing relational joins - nested loops, hash, and merge. Each has different requirements on the predicates (nlj - testable for equality, hj - hashable, mj - comparable). Standard pedagogical presentations, like this one in wikipedia (http://en.wikipedia.org/wiki/Sort-Merge_Join) typically assume a simple key, an inner join, and no mention of the predicate space. Executable pseudocode with no such restrictions seems to be hugely advantageous in comparison.
The merge_join function demonstrates the power of Python's iterator algebra support. Although hash and nested loops joins are comparatively simple, merge join is not so easy to describe. In this implementation we can see that the critical steps are to subgroup by the predicates, then to selectively advance one or both of the resulting iterators.
The nested loops join (no example here at this time) supports a theta join, that is a join on an arbitrary function of the predicates from the two relations, as opposed to just arbitrary equijoins.
Similar recipes include
Group by summarization: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/304162, http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/259173
Iterator merge: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/491285, which employs a priority queue heap to efficiently buffer the merge.
REFERENCES Joins - http://en.wikipedia.org/wiki/Join_(SQL) Hellerstein, Joseph, http://bnrg.cs.berkeley.edu/~adj/cs262/Lec_11_5.html Graefe, G. 1993,"Query Evaluation Techniques for Large Databases". ACM Computing Surveys 25(2): 73-170 Relational Python - http://jtauber.com/relational_python