For a number of years Python has provided developers with the special parameters 'cmp' and 'key' on list.sort and __builtin__.sorted. However Python does not provide a built-in mechanism for doing binary searches on such sorted lists. This recipe provides a simple function that allows you to perform binary searches on your sorted sequences.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 | ```
#! /usr/bin/env python
######################################################################
# Written by Kevin L. Sitze on 2011-02-04
# This code may be used pursuant to the MIT License.
######################################################################
import __builtin__
__all__ = (
'lower_bound',
'upper_bound'
)
def lower_bound(haystack, needle, lo = 0, hi = None, cmp = None, key = None):
"""lower_bound(haystack, needle[, lo = 0[, hi = None[, cmp = None[, key = None]]]]) => n
Find \var{needle} via a binary search on \var{haystack}. Returns the
index of the first match if \var{needle} is found, else a negative
value \var{N} is returned indicating the index where \var{needle}
belongs with the formula "-\var{N}-1".
\var{haystack} - the ordered, indexable sequence to search.
\var{needle} - the value to locate in \var{haystack}.
\var{lo} and \var{hi} - the range in \var{haystack} to search.
\var{cmp} - the cmp function used to order the \var{haystack} items.
\var{key} - the key function used to extract keys from the elements.
"""
if cmp is None: cmp = __builtin__.cmp
if key is None: key = lambda x: x
if lo < 0: raise ValueError( 'lo cannot be negative' )
if hi is None: hi = len(haystack)
val = None
while lo < hi:
mid = (lo + hi) >> 1
val = cmp(key(haystack[mid]), needle)
if val < 0:
lo = mid + 1
else:
hi = mid
if val is None: return -1
elif val == 0: return lo
elif lo >= len(haystack): return -1 - lo
elif cmp(key(haystack[lo]), needle) == 0: return lo
else: return -1 - lo
def upper_bound(haystack, needle, lo = 0, hi = None, cmp = None, key = None):
"""upper_bound(haystack, needle[, lo = 0[, hi = None[, cmp = None[, key = None]]]]) => n
Find \var{needle} via a binary search on \var{haystack}. Returns the
non-negative index \var{N} of the element that immediately follows the
last match of \var{needle} if \var{needle} is found, else a negative
value \var{N} is returned indicating the index where \var{needle}
belongs with the formula "-\var{N}-1".
\var{haystack} - the ordered, indexable sequence to search.
\var{needle} - the value to locate in \var{haystack}.
\var{lo} and \var{hi} - the range in \var{haystack} to search.
\var{cmp} - the cmp function used to order the \var{haystack} items.
\var{key} - the key function used to extract keys from the elements.
"""
if cmp is None: cmp = __builtin__.cmp
if key is None: key = lambda x: x
if lo < 0: raise ValueError( 'lo cannot be negative' )
if hi is None: hi = len(haystack)
val = None
while lo < hi:
mid = (lo + hi) >> 1
val = cmp(key(haystack[mid]), needle)
if val > 0:
hi = mid
else:
lo = mid + 1
if val is None: return -1
elif val == 0: return lo
elif lo > 0 and cmp(key(haystack[lo - 1]), needle) == 0: return lo
else: return -1 - lo
if __name__ == '__main__':
from assertions import *
a = [0, 2, 4, 6, 8]
b = [0, 2, 2, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 8]
test_left = (
(-1, [] , -2), (-1, [] , -1), (-1, [] , 0), (-1, [] , 1), (-1, [] , 2),
(-1, [0], -2), (-1, [0], -1), ( 0, [0], 0), (-2, [0], 1), (-2, [0], 2),
(-1, [1], -2), (-1, [1], -1), (-1, [1], 0), ( 0, [1], 1), (-2, [1], 2),
(-1, [0, 0], -2), (-1, [0, 0], -1), ( 0, [0, 0], 0), (-3, [0, 0], 1), (-3, [0, 0], 2),
(-1, [0, 1], -2), (-1, [0, 1], -1), ( 0, [0, 1], 0), ( 1, [0, 1], 1), (-3, [0, 1], 2),
(-1, [1, 1], -2), (-1, [1, 1], -1), (-1, [1, 1], 0), ( 0, [1, 1], 1), (-3, [1, 1], 2),
(-1, a, -1),
( 0, a, 0), (-2, a, 1), ( 1, a, 2), (-3, a, 3), ( 2, a, 4),
(-4, a, 5), ( 3, a, 6), (-5, a, 7), ( 4, a, 8), (-6, a, 9),
(-6, a, 10),
(-1, b, -1),
( 0, b, 0), (-2, b, 1), ( 1, b, 2), (-4, b, 3), ( 3, b, 4),
(-7, b, 5), ( 6, b, 6), (-11, b, 7), (10, b, 8), (-16, b, 9),
(-16, b, 10)
)
for expect, haystack, needle in test_left:
assertEquals(expect, lower_bound(haystack, needle), 'haystack: %r - needle: %r' % (haystack, needle))
a = list(a)
for i in xrange(11):
n = lower_bound(a, i - 1)
if n < 0:
a[-n-1:-n-1] = [i-1]
assertEquals([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a)
b = list(b)
for i in xrange(11):
n = lower_bound(b, i - 1)
if n < 0:
b.insert(-n-1, i-1)
assertEquals([-1, 0, 1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 8, 8, 8, 9], b)
a = [0, 2, 4, 6, 8]
b = [0, 2, 2, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 8]
test_right = (
(-1, [] , -2), (-1, [] , -1), (-1, [] , 0), (-1, [] , 1), (-1, [] , 2),
(-1, [0], -2), (-1, [0], -1), ( 1, [0], 0), (-2, [0], 1), (-2, [0], 2),
(-1, [1], -2), (-1, [1], -1), (-1, [1], 0), ( 1, [1], 1), (-2, [1], 2),
(-1, [0, 0], -2), (-1, [0, 0], -1), ( 2, [0, 0], 0), (-3, [0, 0], 1), (-3, [0, 0], 2),
(-1, [0, 1], -2), (-1, [0, 1], -1), ( 1, [0, 1], 0), ( 2, [0, 1], 1), (-3, [0, 1], 2),
(-1, [1, 1], -2), (-1, [1, 1], -1), (-1, [1, 1], 0), ( 2, [1, 1], 1), (-3, [1, 1], 2),
(-1, a, -1),
( 1, a, 0), (-2, a, 1), ( 2, a, 2), (-3, a, 3), ( 3, a, 4),
(-4, a, 5), ( 4, a, 6), (-5, a, 7), ( 5, a, 8), (-6, a, 9),
(-6, a, 10),
(-1, b, -1),
( 1, b, 0), (-2, b, 1), ( 3, b, 2), (-4, b, 3), ( 6, b, 4),
(-7, b, 5), (10, b, 6), (-11, b, 7), (15, b, 8), (-16, b, 9),
(-16, b, 10)
)
for expect, haystack, needle in test_right:
assertEquals(expect, upper_bound(haystack, needle), 'haystack: %r - needle: %r' % (haystack, needle))
for i in xrange(11):
n = upper_bound(a, i - 1)
if n < 0:
a.insert(-n-1, i-1)
assertEquals([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a)
import random
a = list(xrange(100))
b = list(a)
for j in xrange(10):
c = list()
random.shuffle(b)
for i in b:
n = lower_bound(c, i)
assertTrue(n < 0)
c[-n-1:-n-1] = [i]
assertEquals(a, c)
def rcmp(x, y):
return cmp(y, x)
a.sort(cmp = rcmp)
b.sort(cmp = rcmp)
for j in xrange(10):
c = list()
random.shuffle(b)
for i in b:
n = lower_bound(c, i, cmp = rcmp)
assertTrue(n < 0)
c.insert(-n-1, i)
assertEquals(a, c)
``` |

#### The Why

Using the `bisect`

package can be very trying at times.
This package works great so long as the sequence elements are also the key values.
What happens with object elements with keys that are object properties,
or elements that are subsequences with the key stored in a location other
than the first element of the subsequence?
Thankfully, sorting such a sequence is trivial thanks to the `key`

argument on `list.sort`

.
But `bisect`

is useless for subsequences and often objects too.

##### Other `bisect`

problems

`bisect`

is useless on objects that have no natural sort order. Unless the object class provides a natural ordering applicable to the problem domain`bisect`

is unable to determine how the objects should be compared.`bisect`

is useless in cases even where objects have a natural sort order but you are searching using only the key value itself.`bisect`

does not distinguish between "found" and "not-found" scenarios when searching. You still have to check the sequence after using`bisect_left`

or`bisect_right`

, because the return value does not tell you whether the desired item is actually in the sequence.`bisect`

has a similar problem if you use the`insort_left`

or`insort_right`

functions. The item you are insorting will always be inserted, even if a duplicate exists.

Thus, if you're using `bisect`

to insert an element into a list, but only if it does not exist there already, then you have to write code that looks something like this:

```
def insort_left_unique(a, x, lo, hi):
pos = bisect.bisect_left(a, x, lo, hi)
# because __eq__ might not be available...
if pos == len(a) or (not a[pos] < x and not x < a[pos]):
a.insert(pos, x)
```

Doing some poking around on the net I determined that no one else out there has a decent Pythonic binary search function. So, of course the correct answer is to write my own.

#### The How

The functions in this package perform the equivalent function to `bisect_left`

and `bisect_right`

with the exception of differentiating between success and failure on finding the desired item.
It finds `needle`

via a binary search on `haystack`

.
The functions have guaranteed `O(log(N))`

performance on a sequence having `N`

elements.

If `needle`

is found in `haystack`

the index of the first element (the element whose index is closest to zero) that matches `needle`

is returned.

If the `needle`

is NOT found in `haystack`

then a negative value `n`

shall be returned
such that `-n-1`

is the location in `haystack`

that `needle`

should have been found.
This means that `needle`

can be sliced into the correct location in `haystack`

using the idiom `haystack[-n-1:-n-1] = [needle]`

or alternatively using `haystack.insert(-n-1, needle)`

.
Of course these assignment operations will work only if `haystack`

is mutable.

`haystack`

is a sequence with the following properties:

- The sequence must be indexable.
- The sequence must be ordered such that for any two sequence elements
`A`

and`B`

where the index of`A`

is less than the index of`B`

then the boolean expression`cmp(key(A), key(B)) <= 0`

is`True`

.

`needle`

is the key of the element to locate.`lo`

is the minimum index in`haystack`

to search. This parameter is optional, if not specified then index`0`

is used.`hi`

is one past the maximum index in`haystack`

to search. This parameter is optional, if not specified then the length of`haystack`

is used (i.e.,`len(haystack)`

).`cmp`

is a function of two arguments returning -1, 0 or 1 if the first argument is less than, equal to or greater than the second argument respectively. The first argument will always be a key taken from the`haystack`

and the second argument will always be`needle`

. If`cmp`

is`None`

then the builtin comparison function will be used. This implies that the actual function called for each processed element is`cmp(key(x), needle)`

where`x`

is an element in`haystack`

. The function`cmp`

will be called a maximum of O(log(`n`

)) times.`key`

- a function that takes one argument. Its return value will be used as the key value for the comparison. If key is`None`

then the identity function will be used. The function`key`

will be called a maximum of O(log(`n`

)) times.

You can find the

`assertions`

package used for the unit tests for this package in my recipe Poor Man unit tests.If you want a more Pythonic API to sorted lists, have a look at Daniel Stutzbach's amazing blist -- it provides an efficient implementation using B+trees of sorted lists, letting you modify and inspect the object with the usual interface of python containers. For example you can do l.add(x), l.remove(x), test x in l, etc.

A pure Python implementation using a regular list and a bisect implementation would be very easy with the abstract base classes in the collections module, but a much better approach -- with logarithmic insertion and deletion -- would be using the indexable skiplist that Raymond Hettinger used in these two recipes.

If you're looking for an API similar to that provided by a blist.sortedlist, check out the sortedcontainers module. It implements sorted list, sorted dict, and sorted set data types in pure-Python and is fast-as-C implementations (even faster!). Learn more about sortedcontainers, available on PyPI and github.