Welcome, guest | Sign In | My Account | Store | Cart

For a number of years Python has provided developers with the special parameters 'cmp' and 'key' on list.sort and __builtin__.sorted. However Python does not provide a built-in mechanism for doing binary searches on such sorted lists. This recipe provides a simple function that allows you to perform binary searches on your sorted sequences.

Python, 174 lines
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
#! /usr/bin/env python
######################################################################
#  Written by Kevin L. Sitze on 2011-02-04
#  This code may be used pursuant to the MIT License.
######################################################################

import __builtin__

__all__ = (
    'lower_bound',
    'upper_bound'
)

def lower_bound(haystack, needle, lo = 0, hi = None, cmp = None, key = None):
    """lower_bound(haystack, needle[, lo = 0[, hi = None[, cmp = None[, key = None]]]]) => n

Find \var{needle} via a binary search on \var{haystack}.  Returns the
index of the first match if \var{needle} is found, else a negative
value \var{N} is returned indicating the index where \var{needle}
belongs with the formula "-\var{N}-1".

\var{haystack} - the ordered, indexable sequence to search.
\var{needle} - the value to locate in \var{haystack}.
\var{lo} and \var{hi} - the range in \var{haystack} to search.
\var{cmp} - the cmp function used to order the \var{haystack} items.
\var{key} - the key function used to extract keys from the elements.
"""
    if cmp is None: cmp = __builtin__.cmp
    if key is None: key = lambda x: x
    if lo < 0: raise ValueError( 'lo cannot be negative' )
    if hi is None: hi = len(haystack)

    val = None
    while lo < hi:
        mid = (lo + hi) >> 1
        val = cmp(key(haystack[mid]), needle)
        if val < 0:
            lo = mid + 1
        else:
            hi = mid
    if val is None: return -1
    elif val == 0: return lo
    elif lo >= len(haystack): return -1 - lo
    elif cmp(key(haystack[lo]), needle) == 0: return lo
    else: return -1 - lo

def upper_bound(haystack, needle, lo = 0, hi = None, cmp = None, key = None):
    """upper_bound(haystack, needle[, lo = 0[, hi = None[, cmp = None[, key = None]]]]) => n

Find \var{needle} via a binary search on \var{haystack}.  Returns the
non-negative index \var{N} of the element that immediately follows the
last match of \var{needle} if \var{needle} is found, else a negative
value \var{N} is returned indicating the index where \var{needle}
belongs with the formula "-\var{N}-1".

\var{haystack} - the ordered, indexable sequence to search.
\var{needle} - the value to locate in \var{haystack}.
\var{lo} and \var{hi} - the range in \var{haystack} to search.
\var{cmp} - the cmp function used to order the \var{haystack} items.
\var{key} - the key function used to extract keys from the elements.
"""
    if cmp is None: cmp = __builtin__.cmp
    if key is None: key = lambda x: x
    if lo < 0: raise ValueError( 'lo cannot be negative' )
    if hi is None: hi = len(haystack)

    val = None
    while lo < hi:
        mid = (lo + hi) >> 1
        val = cmp(key(haystack[mid]), needle)
        if val > 0:
            hi = mid
        else:
            lo = mid + 1
    if val is None: return -1
    elif val == 0: return lo
    elif lo > 0 and cmp(key(haystack[lo - 1]), needle) == 0: return lo
    else: return -1 - lo

if __name__ == '__main__':
    from assertions import *

    a = [0, 2, 4, 6, 8]
    b = [0, 2, 2, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 8]
    test_left = (
        (-1, [] , -2), (-1, [] , -1), (-1, [] , 0), (-1, [] , 1), (-1, [] , 2),
        (-1, [0], -2), (-1, [0], -1), ( 0, [0], 0), (-2, [0], 1), (-2, [0], 2),
        (-1, [1], -2), (-1, [1], -1), (-1, [1], 0), ( 0, [1], 1), (-2, [1], 2),

        (-1, [0, 0], -2), (-1, [0, 0], -1), ( 0, [0, 0], 0), (-3, [0, 0], 1), (-3, [0, 0], 2),
        (-1, [0, 1], -2), (-1, [0, 1], -1), ( 0, [0, 1], 0), ( 1, [0, 1], 1), (-3, [0, 1], 2),
        (-1, [1, 1], -2), (-1, [1, 1], -1), (-1, [1, 1], 0), ( 0, [1, 1], 1), (-3, [1, 1], 2),

        (-1, a, -1),
        ( 0, a,  0), (-2, a,  1), ( 1, a,  2), (-3, a,  3), ( 2, a,  4),
        (-4, a,  5), ( 3, a,  6), (-5, a,  7), ( 4, a,  8), (-6, a,  9),
        (-6, a, 10),

        (-1, b, -1),
        ( 0, b,  0), (-2, b,  1), (  1, b,  2), (-4, b,  3), (  3, b,  4),
        (-7, b,  5), ( 6, b,  6), (-11, b,  7), (10, b,  8), (-16, b,  9),
        (-16, b, 10)
    )
    for expect, haystack, needle in test_left:
        assertEquals(expect, lower_bound(haystack, needle), 'haystack: %r - needle: %r' % (haystack, needle))

    a = list(a)
    for i in xrange(11):
        n = lower_bound(a, i - 1)
        if n < 0:
            a[-n-1:-n-1] = [i-1]
    assertEquals([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a)

    b = list(b)
    for i in xrange(11):
        n = lower_bound(b, i - 1)
        if n < 0:
            b.insert(-n-1, i-1)
    assertEquals([-1, 0, 1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 8, 8, 8, 9], b)

    a = [0, 2, 4, 6, 8]
    b = [0, 2, 2, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 8]
    test_right = (
        (-1, [] , -2), (-1, [] , -1), (-1, [] , 0), (-1, [] , 1), (-1, [] , 2),
        (-1, [0], -2), (-1, [0], -1), ( 1, [0], 0), (-2, [0], 1), (-2, [0], 2),
        (-1, [1], -2), (-1, [1], -1), (-1, [1], 0), ( 1, [1], 1), (-2, [1], 2),

        (-1, [0, 0], -2), (-1, [0, 0], -1), ( 2, [0, 0], 0), (-3, [0, 0], 1), (-3, [0, 0], 2),
        (-1, [0, 1], -2), (-1, [0, 1], -1), ( 1, [0, 1], 0), ( 2, [0, 1], 1), (-3, [0, 1], 2),
        (-1, [1, 1], -2), (-1, [1, 1], -1), (-1, [1, 1], 0), ( 2, [1, 1], 1), (-3, [1, 1], 2),

        (-1, a, -1),
        ( 1, a,  0), (-2, a,  1), ( 2, a,  2), (-3, a,  3), ( 3, a,  4),
        (-4, a,  5), ( 4, a,  6), (-5, a,  7), ( 5, a,  8), (-6, a,  9),
        (-6, a, 10),

        (-1, b, -1),
        ( 1, b,  0), (-2, b,  1), (  3, b,  2), (-4, b,  3), (  6, b,  4),
        (-7, b,  5), (10, b,  6), (-11, b,  7), (15, b,  8), (-16, b,  9),
        (-16, b, 10)
    )
    for expect, haystack, needle in test_right:
        assertEquals(expect, upper_bound(haystack, needle), 'haystack: %r - needle: %r' % (haystack, needle))
    for i in xrange(11):
        n = upper_bound(a, i - 1)
        if n < 0:
            a.insert(-n-1, i-1)
    assertEquals([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a)

    import random
    a = list(xrange(100))
    b = list(a)
    for j in xrange(10):
        c = list()
        random.shuffle(b)
        for i in b:
            n = lower_bound(c, i)
            assertTrue(n < 0)
            c[-n-1:-n-1] = [i]
        assertEquals(a, c)

    def rcmp(x, y):
        return cmp(y, x)

    a.sort(cmp = rcmp)
    b.sort(cmp = rcmp)
    for j in xrange(10):
        c = list()
        random.shuffle(b)
        for i in b:
            n = lower_bound(c, i, cmp = rcmp)
            assertTrue(n < 0)
            c.insert(-n-1, i)
        assertEquals(a, c)

The Why

Using the bisect package can be very trying at times. This package works great so long as the sequence elements are also the key values. What happens with object elements with keys that are object properties, or elements that are subsequences with the key stored in a location other than the first element of the subsequence? Thankfully, sorting such a sequence is trivial thanks to the key argument on list.sort. But bisect is useless for subsequences and often objects too.

Other bisect problems
  • bisect is useless on objects that have no natural sort order. Unless the object class provides a natural ordering applicable to the problem domain bisect is unable to determine how the objects should be compared.
  • bisect is useless in cases even where objects have a natural sort order but you are searching using only the key value itself.
  • bisect does not distinguish between "found" and "not-found" scenarios when searching. You still have to check the sequence after using bisect_left or bisect_right, because the return value does not tell you whether the desired item is actually in the sequence.
  • bisect has a similar problem if you use the insort_left or insort_right functions. The item you are insorting will always be inserted, even if a duplicate exists.

Thus, if you're using bisect to insert an element into a list, but only if it does not exist there already, then you have to write code that looks something like this:

def insort_left_unique(a, x, lo, hi):
    pos = bisect.bisect_left(a, x, lo, hi)
    # because __eq__ might not be available...
    if pos == len(a) or (not a[pos] < x and not x < a[pos]):
        a.insert(pos, x)

Doing some poking around on the net I determined that no one else out there has a decent Pythonic binary search function. So, of course the correct answer is to write my own.

The How

The functions in this package perform the equivalent function to bisect_left and bisect_right with the exception of differentiating between success and failure on finding the desired item. It finds needle via a binary search on haystack. The functions have guaranteed O(log(N)) performance on a sequence having N elements.

If needle is found in haystack the index of the first element (the element whose index is closest to zero) that matches needle is returned.

If the needle is NOT found in haystack then a negative value n shall be returned such that -n-1 is the location in haystack that needle should have been found. This means that needle can be sliced into the correct location in haystack using the idiom haystack[-n-1:-n-1] = [needle] or alternatively using haystack.insert(-n-1, needle). Of course these assignment operations will work only if haystack is mutable.

  • haystack is a sequence with the following properties:
  1. The sequence must be indexable.
  2. The sequence must be ordered such that for any two sequence elements A and B where the index of A is less than the index of B then the boolean expression cmp(key(A), key(B)) <= 0 is True.
  • needle is the key of the element to locate.
  • lo is the minimum index in haystack to search. This parameter is optional, if not specified then index 0 is used.
  • hi is one past the maximum index in haystack to search. This parameter is optional, if not specified then the length of haystack is used (i.e., len(haystack)).
  • cmp is a function of two arguments returning -1, 0 or 1 if the first argument is less than, equal to or greater than the second argument respectively. The first argument will always be a key taken from the haystack and the second argument will always be needle. If cmp is None then the builtin comparison function will be used. This implies that the actual function called for each processed element is cmp(key(x), needle) where x is an element in haystack. The function cmp will be called a maximum of O(log(n)) times.
  • key - a function that takes one argument. Its return value will be used as the key value for the comparison. If key is None then the identity function will be used. The function key will be called a maximum of O(log(n)) times.

3 comments

Kevin L. Sitze (author) 13 years, 1 month ago  # | flag

You can find the assertions package used for the unit tests for this package in my recipe Poor Man unit tests.

Matteo Dell'Amico 13 years, 1 month ago  # | flag

If you want a more Pythonic API to sorted lists, have a look at Daniel Stutzbach's amazing blist -- it provides an efficient implementation using B+trees of sorted lists, letting you modify and inspect the object with the usual interface of python containers. For example you can do l.add(x), l.remove(x), test x in l, etc.

A pure Python implementation using a regular list and a bisect implementation would be very easy with the abstract base classes in the collections module, but a much better approach -- with logarithmic insertion and deletion -- would be using the indexable skiplist that Raymond Hettinger used in these two recipes.

Grant Jenks 9 years, 6 months ago  # | flag

If you're looking for an API similar to that provided by a blist.sortedlist, check out the sortedcontainers module. It implements sorted list, sorted dict, and sorted set data types in pure-Python and is fast-as-C implementations (even faster!). Learn more about sortedcontainers, available on PyPI and github.