For a number of years Python has provided developers with the special parameters 'cmp' and 'key' on list.sort and __builtin__.sorted. However Python does not provide a built-in mechanism for doing binary searches on such sorted lists. This recipe provides a simple function that allows you to perform binary searches on your sorted sequences.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 | #! /usr/bin/env python
######################################################################
# Written by Kevin L. Sitze on 2011-02-04
# This code may be used pursuant to the MIT License.
######################################################################
import __builtin__
__all__ = (
'lower_bound',
'upper_bound'
)
def lower_bound(haystack, needle, lo = 0, hi = None, cmp = None, key = None):
"""lower_bound(haystack, needle[, lo = 0[, hi = None[, cmp = None[, key = None]]]]) => n
Find \var{needle} via a binary search on \var{haystack}. Returns the
index of the first match if \var{needle} is found, else a negative
value \var{N} is returned indicating the index where \var{needle}
belongs with the formula "-\var{N}-1".
\var{haystack} - the ordered, indexable sequence to search.
\var{needle} - the value to locate in \var{haystack}.
\var{lo} and \var{hi} - the range in \var{haystack} to search.
\var{cmp} - the cmp function used to order the \var{haystack} items.
\var{key} - the key function used to extract keys from the elements.
"""
if cmp is None: cmp = __builtin__.cmp
if key is None: key = lambda x: x
if lo < 0: raise ValueError( 'lo cannot be negative' )
if hi is None: hi = len(haystack)
val = None
while lo < hi:
mid = (lo + hi) >> 1
val = cmp(key(haystack[mid]), needle)
if val < 0:
lo = mid + 1
else:
hi = mid
if val is None: return -1
elif val == 0: return lo
elif lo >= len(haystack): return -1 - lo
elif cmp(key(haystack[lo]), needle) == 0: return lo
else: return -1 - lo
def upper_bound(haystack, needle, lo = 0, hi = None, cmp = None, key = None):
"""upper_bound(haystack, needle[, lo = 0[, hi = None[, cmp = None[, key = None]]]]) => n
Find \var{needle} via a binary search on \var{haystack}. Returns the
non-negative index \var{N} of the element that immediately follows the
last match of \var{needle} if \var{needle} is found, else a negative
value \var{N} is returned indicating the index where \var{needle}
belongs with the formula "-\var{N}-1".
\var{haystack} - the ordered, indexable sequence to search.
\var{needle} - the value to locate in \var{haystack}.
\var{lo} and \var{hi} - the range in \var{haystack} to search.
\var{cmp} - the cmp function used to order the \var{haystack} items.
\var{key} - the key function used to extract keys from the elements.
"""
if cmp is None: cmp = __builtin__.cmp
if key is None: key = lambda x: x
if lo < 0: raise ValueError( 'lo cannot be negative' )
if hi is None: hi = len(haystack)
val = None
while lo < hi:
mid = (lo + hi) >> 1
val = cmp(key(haystack[mid]), needle)
if val > 0:
hi = mid
else:
lo = mid + 1
if val is None: return -1
elif val == 0: return lo
elif lo > 0 and cmp(key(haystack[lo - 1]), needle) == 0: return lo
else: return -1 - lo
if __name__ == '__main__':
from assertions import *
a = [0, 2, 4, 6, 8]
b = [0, 2, 2, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 8]
test_left = (
(-1, [] , -2), (-1, [] , -1), (-1, [] , 0), (-1, [] , 1), (-1, [] , 2),
(-1, [0], -2), (-1, [0], -1), ( 0, [0], 0), (-2, [0], 1), (-2, [0], 2),
(-1, [1], -2), (-1, [1], -1), (-1, [1], 0), ( 0, [1], 1), (-2, [1], 2),
(-1, [0, 0], -2), (-1, [0, 0], -1), ( 0, [0, 0], 0), (-3, [0, 0], 1), (-3, [0, 0], 2),
(-1, [0, 1], -2), (-1, [0, 1], -1), ( 0, [0, 1], 0), ( 1, [0, 1], 1), (-3, [0, 1], 2),
(-1, [1, 1], -2), (-1, [1, 1], -1), (-1, [1, 1], 0), ( 0, [1, 1], 1), (-3, [1, 1], 2),
(-1, a, -1),
( 0, a, 0), (-2, a, 1), ( 1, a, 2), (-3, a, 3), ( 2, a, 4),
(-4, a, 5), ( 3, a, 6), (-5, a, 7), ( 4, a, 8), (-6, a, 9),
(-6, a, 10),
(-1, b, -1),
( 0, b, 0), (-2, b, 1), ( 1, b, 2), (-4, b, 3), ( 3, b, 4),
(-7, b, 5), ( 6, b, 6), (-11, b, 7), (10, b, 8), (-16, b, 9),
(-16, b, 10)
)
for expect, haystack, needle in test_left:
assertEquals(expect, lower_bound(haystack, needle), 'haystack: %r - needle: %r' % (haystack, needle))
a = list(a)
for i in xrange(11):
n = lower_bound(a, i - 1)
if n < 0:
a[-n-1:-n-1] = [i-1]
assertEquals([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a)
b = list(b)
for i in xrange(11):
n = lower_bound(b, i - 1)
if n < 0:
b.insert(-n-1, i-1)
assertEquals([-1, 0, 1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 8, 8, 8, 9], b)
a = [0, 2, 4, 6, 8]
b = [0, 2, 2, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 8]
test_right = (
(-1, [] , -2), (-1, [] , -1), (-1, [] , 0), (-1, [] , 1), (-1, [] , 2),
(-1, [0], -2), (-1, [0], -1), ( 1, [0], 0), (-2, [0], 1), (-2, [0], 2),
(-1, [1], -2), (-1, [1], -1), (-1, [1], 0), ( 1, [1], 1), (-2, [1], 2),
(-1, [0, 0], -2), (-1, [0, 0], -1), ( 2, [0, 0], 0), (-3, [0, 0], 1), (-3, [0, 0], 2),
(-1, [0, 1], -2), (-1, [0, 1], -1), ( 1, [0, 1], 0), ( 2, [0, 1], 1), (-3, [0, 1], 2),
(-1, [1, 1], -2), (-1, [1, 1], -1), (-1, [1, 1], 0), ( 2, [1, 1], 1), (-3, [1, 1], 2),
(-1, a, -1),
( 1, a, 0), (-2, a, 1), ( 2, a, 2), (-3, a, 3), ( 3, a, 4),
(-4, a, 5), ( 4, a, 6), (-5, a, 7), ( 5, a, 8), (-6, a, 9),
(-6, a, 10),
(-1, b, -1),
( 1, b, 0), (-2, b, 1), ( 3, b, 2), (-4, b, 3), ( 6, b, 4),
(-7, b, 5), (10, b, 6), (-11, b, 7), (15, b, 8), (-16, b, 9),
(-16, b, 10)
)
for expect, haystack, needle in test_right:
assertEquals(expect, upper_bound(haystack, needle), 'haystack: %r - needle: %r' % (haystack, needle))
for i in xrange(11):
n = upper_bound(a, i - 1)
if n < 0:
a.insert(-n-1, i-1)
assertEquals([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a)
import random
a = list(xrange(100))
b = list(a)
for j in xrange(10):
c = list()
random.shuffle(b)
for i in b:
n = lower_bound(c, i)
assertTrue(n < 0)
c[-n-1:-n-1] = [i]
assertEquals(a, c)
def rcmp(x, y):
return cmp(y, x)
a.sort(cmp = rcmp)
b.sort(cmp = rcmp)
for j in xrange(10):
c = list()
random.shuffle(b)
for i in b:
n = lower_bound(c, i, cmp = rcmp)
assertTrue(n < 0)
c.insert(-n-1, i)
assertEquals(a, c)
|
The Why
Using the bisect
package can be very trying at times.
This package works great so long as the sequence elements are also the key values.
What happens with object elements with keys that are object properties,
or elements that are subsequences with the key stored in a location other
than the first element of the subsequence?
Thankfully, sorting such a sequence is trivial thanks to the key
argument on list.sort
.
But bisect
is useless for subsequences and often objects too.
Other bisect
problems
bisect
is useless on objects that have no natural sort order. Unless the object class provides a natural ordering applicable to the problem domainbisect
is unable to determine how the objects should be compared.bisect
is useless in cases even where objects have a natural sort order but you are searching using only the key value itself.bisect
does not distinguish between "found" and "not-found" scenarios when searching. You still have to check the sequence after usingbisect_left
orbisect_right
, because the return value does not tell you whether the desired item is actually in the sequence.bisect
has a similar problem if you use theinsort_left
orinsort_right
functions. The item you are insorting will always be inserted, even if a duplicate exists.
Thus, if you're using bisect
to insert an element into a list, but only if it does not exist there already, then you have to write code that looks something like this:
def insort_left_unique(a, x, lo, hi):
pos = bisect.bisect_left(a, x, lo, hi)
# because __eq__ might not be available...
if pos == len(a) or (not a[pos] < x and not x < a[pos]):
a.insert(pos, x)
Doing some poking around on the net I determined that no one else out there has a decent Pythonic binary search function. So, of course the correct answer is to write my own.
The How
The functions in this package perform the equivalent function to bisect_left
and bisect_right
with the exception of differentiating between success and failure on finding the desired item.
It finds needle
via a binary search on haystack
.
The functions have guaranteed O(log(N))
performance on a sequence having N
elements.
If needle
is found in haystack
the index of the first element (the element whose index is closest to zero) that matches needle
is returned.
If the needle
is NOT found in haystack
then a negative value n
shall be returned
such that -n-1
is the location in haystack
that needle
should have been found.
This means that needle
can be sliced into the correct location in haystack
using the idiom haystack[-n-1:-n-1] = [needle]
or alternatively using haystack.insert(-n-1, needle)
.
Of course these assignment operations will work only if haystack
is mutable.
haystack
is a sequence with the following properties:
- The sequence must be indexable.
- The sequence must be ordered such that for any two sequence elements
A
andB
where the index ofA
is less than the index ofB
then the boolean expressioncmp(key(A), key(B)) <= 0
isTrue
.
needle
is the key of the element to locate.lo
is the minimum index inhaystack
to search. This parameter is optional, if not specified then index0
is used.hi
is one past the maximum index inhaystack
to search. This parameter is optional, if not specified then the length ofhaystack
is used (i.e.,len(haystack)
).cmp
is a function of two arguments returning -1, 0 or 1 if the first argument is less than, equal to or greater than the second argument respectively. The first argument will always be a key taken from thehaystack
and the second argument will always beneedle
. Ifcmp
isNone
then the builtin comparison function will be used. This implies that the actual function called for each processed element iscmp(key(x), needle)
wherex
is an element inhaystack
. The functioncmp
will be called a maximum of O(log(n
)) times.key
- a function that takes one argument. Its return value will be used as the key value for the comparison. If key isNone
then the identity function will be used. The functionkey
will be called a maximum of O(log(n
)) times.
You can find the
assertions
package used for the unit tests for this package in my recipe Poor Man unit tests.If you want a more Pythonic API to sorted lists, have a look at Daniel Stutzbach's amazing blist -- it provides an efficient implementation using B+trees of sorted lists, letting you modify and inspect the object with the usual interface of python containers. For example you can do l.add(x), l.remove(x), test x in l, etc.
A pure Python implementation using a regular list and a bisect implementation would be very easy with the abstract base classes in the collections module, but a much better approach -- with logarithmic insertion and deletion -- would be using the indexable skiplist that Raymond Hettinger used in these two recipes.
If you're looking for an API similar to that provided by a blist.sortedlist, check out the sortedcontainers module. It implements sorted list, sorted dict, and sorted set data types in pure-Python and is fast-as-C implementations (even faster!). Learn more about sortedcontainers, available on PyPI and github.