Welcome, guest | Sign In | My Account | Store | Cart

For a number of years Python has provided developers with the special parameters 'cmp' and 'key' on list.sort and __builtin__.sorted. However Python does not provide a built-in mechanism for doing binary searches on such sorted lists. This recipe provides a simple function that allows you to perform binary searches on your sorted sequences.

Python, 174 lines
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 #! /usr/bin/env python ###################################################################### # Written by Kevin L. Sitze on 2011-02-04 # This code may be used pursuant to the MIT License. ###################################################################### import __builtin__ __all__ = ( 'lower_bound', 'upper_bound' ) def lower_bound(haystack, needle, lo = 0, hi = None, cmp = None, key = None): """lower_bound(haystack, needle[, lo = 0[, hi = None[, cmp = None[, key = None]]]]) => n Find \var{needle} via a binary search on \var{haystack}. Returns the index of the first match if \var{needle} is found, else a negative value \var{N} is returned indicating the index where \var{needle} belongs with the formula "-\var{N}-1". \var{haystack} - the ordered, indexable sequence to search. \var{needle} - the value to locate in \var{haystack}. \var{lo} and \var{hi} - the range in \var{haystack} to search. \var{cmp} - the cmp function used to order the \var{haystack} items. \var{key} - the key function used to extract keys from the elements. """ if cmp is None: cmp = __builtin__.cmp if key is None: key = lambda x: x if lo < 0: raise ValueError( 'lo cannot be negative' ) if hi is None: hi = len(haystack) val = None while lo < hi: mid = (lo + hi) >> 1 val = cmp(key(haystack[mid]), needle) if val < 0: lo = mid + 1 else: hi = mid if val is None: return -1 elif val == 0: return lo elif lo >= len(haystack): return -1 - lo elif cmp(key(haystack[lo]), needle) == 0: return lo else: return -1 - lo def upper_bound(haystack, needle, lo = 0, hi = None, cmp = None, key = None): """upper_bound(haystack, needle[, lo = 0[, hi = None[, cmp = None[, key = None]]]]) => n Find \var{needle} via a binary search on \var{haystack}. Returns the non-negative index \var{N} of the element that immediately follows the last match of \var{needle} if \var{needle} is found, else a negative value \var{N} is returned indicating the index where \var{needle} belongs with the formula "-\var{N}-1". \var{haystack} - the ordered, indexable sequence to search. \var{needle} - the value to locate in \var{haystack}. \var{lo} and \var{hi} - the range in \var{haystack} to search. \var{cmp} - the cmp function used to order the \var{haystack} items. \var{key} - the key function used to extract keys from the elements. """ if cmp is None: cmp = __builtin__.cmp if key is None: key = lambda x: x if lo < 0: raise ValueError( 'lo cannot be negative' ) if hi is None: hi = len(haystack) val = None while lo < hi: mid = (lo + hi) >> 1 val = cmp(key(haystack[mid]), needle) if val > 0: hi = mid else: lo = mid + 1 if val is None: return -1 elif val == 0: return lo elif lo > 0 and cmp(key(haystack[lo - 1]), needle) == 0: return lo else: return -1 - lo if __name__ == '__main__': from assertions import * a = [0, 2, 4, 6, 8] b = [0, 2, 2, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 8] test_left = ( (-1, [] , -2), (-1, [] , -1), (-1, [] , 0), (-1, [] , 1), (-1, [] , 2), (-1, , -2), (-1, , -1), ( 0, , 0), (-2, , 1), (-2, , 2), (-1, , -2), (-1, , -1), (-1, , 0), ( 0, , 1), (-2, , 2), (-1, [0, 0], -2), (-1, [0, 0], -1), ( 0, [0, 0], 0), (-3, [0, 0], 1), (-3, [0, 0], 2), (-1, [0, 1], -2), (-1, [0, 1], -1), ( 0, [0, 1], 0), ( 1, [0, 1], 1), (-3, [0, 1], 2), (-1, [1, 1], -2), (-1, [1, 1], -1), (-1, [1, 1], 0), ( 0, [1, 1], 1), (-3, [1, 1], 2), (-1, a, -1), ( 0, a, 0), (-2, a, 1), ( 1, a, 2), (-3, a, 3), ( 2, a, 4), (-4, a, 5), ( 3, a, 6), (-5, a, 7), ( 4, a, 8), (-6, a, 9), (-6, a, 10), (-1, b, -1), ( 0, b, 0), (-2, b, 1), ( 1, b, 2), (-4, b, 3), ( 3, b, 4), (-7, b, 5), ( 6, b, 6), (-11, b, 7), (10, b, 8), (-16, b, 9), (-16, b, 10) ) for expect, haystack, needle in test_left: assertEquals(expect, lower_bound(haystack, needle), 'haystack: %r - needle: %r' % (haystack, needle)) a = list(a) for i in xrange(11): n = lower_bound(a, i - 1) if n < 0: a[-n-1:-n-1] = [i-1] assertEquals([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a) b = list(b) for i in xrange(11): n = lower_bound(b, i - 1) if n < 0: b.insert(-n-1, i-1) assertEquals([-1, 0, 1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 8, 8, 8, 9], b) a = [0, 2, 4, 6, 8] b = [0, 2, 2, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 8] test_right = ( (-1, [] , -2), (-1, [] , -1), (-1, [] , 0), (-1, [] , 1), (-1, [] , 2), (-1, , -2), (-1, , -1), ( 1, , 0), (-2, , 1), (-2, , 2), (-1, , -2), (-1, , -1), (-1, , 0), ( 1, , 1), (-2, , 2), (-1, [0, 0], -2), (-1, [0, 0], -1), ( 2, [0, 0], 0), (-3, [0, 0], 1), (-3, [0, 0], 2), (-1, [0, 1], -2), (-1, [0, 1], -1), ( 1, [0, 1], 0), ( 2, [0, 1], 1), (-3, [0, 1], 2), (-1, [1, 1], -2), (-1, [1, 1], -1), (-1, [1, 1], 0), ( 2, [1, 1], 1), (-3, [1, 1], 2), (-1, a, -1), ( 1, a, 0), (-2, a, 1), ( 2, a, 2), (-3, a, 3), ( 3, a, 4), (-4, a, 5), ( 4, a, 6), (-5, a, 7), ( 5, a, 8), (-6, a, 9), (-6, a, 10), (-1, b, -1), ( 1, b, 0), (-2, b, 1), ( 3, b, 2), (-4, b, 3), ( 6, b, 4), (-7, b, 5), (10, b, 6), (-11, b, 7), (15, b, 8), (-16, b, 9), (-16, b, 10) ) for expect, haystack, needle in test_right: assertEquals(expect, upper_bound(haystack, needle), 'haystack: %r - needle: %r' % (haystack, needle)) for i in xrange(11): n = upper_bound(a, i - 1) if n < 0: a.insert(-n-1, i-1) assertEquals([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a) import random a = list(xrange(100)) b = list(a) for j in xrange(10): c = list() random.shuffle(b) for i in b: n = lower_bound(c, i) assertTrue(n < 0) c[-n-1:-n-1] = [i] assertEquals(a, c) def rcmp(x, y): return cmp(y, x) a.sort(cmp = rcmp) b.sort(cmp = rcmp) for j in xrange(10): c = list() random.shuffle(b) for i in b: n = lower_bound(c, i, cmp = rcmp) assertTrue(n < 0) c.insert(-n-1, i) assertEquals(a, c)

The Why

Using the bisect package can be very trying at times. This package works great so long as the sequence elements are also the key values. What happens with object elements with keys that are object properties, or elements that are subsequences with the key stored in a location other than the first element of the subsequence? Thankfully, sorting such a sequence is trivial thanks to the key argument on list.sort. But bisect is useless for subsequences and often objects too.

Other bisect problems
• bisect is useless on objects that have no natural sort order. Unless the object class provides a natural ordering applicable to the problem domain bisect is unable to determine how the objects should be compared.
• bisect is useless in cases even where objects have a natural sort order but you are searching using only the key value itself.
• bisect does not distinguish between "found" and "not-found" scenarios when searching. You still have to check the sequence after using bisect_left or bisect_right, because the return value does not tell you whether the desired item is actually in the sequence.
• bisect has a similar problem if you use the insort_left or insort_right functions. The item you are insorting will always be inserted, even if a duplicate exists.

Thus, if you're using bisect to insert an element into a list, but only if it does not exist there already, then you have to write code that looks something like this:

def insort_left_unique(a, x, lo, hi):
pos = bisect.bisect_left(a, x, lo, hi)
# because __eq__ might not be available...
if pos == len(a) or (not a[pos] < x and not x < a[pos]):
a.insert(pos, x)

Doing some poking around on the net I determined that no one else out there has a decent Pythonic binary search function. So, of course the correct answer is to write my own.

The How

The functions in this package perform the equivalent function to bisect_left and bisect_right with the exception of differentiating between success and failure on finding the desired item. It finds needle via a binary search on haystack. The functions have guaranteed O(log(N)) performance on a sequence having N elements.

If needle is found in haystack the index of the first element (the element whose index is closest to zero) that matches needle is returned.

If the needle is NOT found in haystack then a negative value n shall be returned such that -n-1 is the location in haystack that needle should have been found. This means that needle can be sliced into the correct location in haystack using the idiom haystack[-n-1:-n-1] = [needle] or alternatively using haystack.insert(-n-1, needle). Of course these assignment operations will work only if haystack is mutable.

• haystack is a sequence with the following properties:
1. The sequence must be indexable.
2. The sequence must be ordered such that for any two sequence elements A and B where the index of A is less than the index of B then the boolean expression cmp(key(A), key(B)) <= 0 is True.
• needle is the key of the element to locate.
• lo is the minimum index in haystack to search. This parameter is optional, if not specified then index 0 is used.
• hi is one past the maximum index in haystack to search. This parameter is optional, if not specified then the length of haystack is used (i.e., len(haystack)).
• cmp is a function of two arguments returning -1, 0 or 1 if the first argument is less than, equal to or greater than the second argument respectively. The first argument will always be a key taken from the haystack and the second argument will always be needle. If cmp is None then the builtin comparison function will be used. This implies that the actual function called for each processed element is cmp(key(x), needle) where x is an element in haystack. The function cmp will be called a maximum of O(log(n)) times.
• key - a function that takes one argument. Its return value will be used as the key value for the comparison. If key is None then the identity function will be used. The function key will be called a maximum of O(log(n)) times. Kevin L. Sitze (author) 10 years, 11 months ago

You can find the assertions package used for the unit tests for this package in my recipe Poor Man unit tests. Matteo Dell'Amico 10 years, 11 months ago

If you want a more Pythonic API to sorted lists, have a look at Daniel Stutzbach's amazing blist -- it provides an efficient implementation using B+trees of sorted lists, letting you modify and inspect the object with the usual interface of python containers. For example you can do l.add(x), l.remove(x), test x in l, etc.

A pure Python implementation using a regular list and a bisect implementation would be very easy with the abstract base classes in the collections module, but a much better approach -- with logarithmic insertion and deletion -- would be using the indexable skiplist that Raymond Hettinger used in these two recipes. Grant Jenks 7 years, 4 months ago

If you're looking for an API similar to that provided by a blist.sortedlist, check out the sortedcontainers module. It implements sorted list, sorted dict, and sorted set data types in pure-Python and is fast-as-C implementations (even faster!). Learn more about sortedcontainers, available on PyPI and github. Created by Kevin L. Sitze on Mon, 7 Feb 2011 (MIT)