An O(n), quicksort-style algorithm for looking up data based on rank order. Useful for finding medians, percentiles, quartiles, and deciles. Equivalent to data[n] when the data is already sorted.
import random

def select(data, n):
    "Find the nth rank ordered element (the least value has rank 0)."
    data = list(data)
    if not 0 <= n < len(data):
        raise ValueError('not enough elements for the given rank')
    while True:
        pivot = random.choice(data)
        # Partition into values below the pivot, values above it, and a
        # count (pcount) of values equal to it.
        pcount = 0
        under, over = [], []
        uappend, oappend = under.append, over.append
        for elem in data:
            if elem < pivot:
                uappend(elem)
            elif elem > pivot:
                oappend(elem)
            else:
                pcount += 1
        # Continue with whichever partition holds rank n; if n falls among
        # the pivot-equal values, the pivot is the answer.
        if n < len(under):
            data = under
        elif n < len(under) + pcount:
            return pivot
        else:
            data = over
            n -= len(under) + pcount
The input data can be any iterable.

The randomization of pivots makes the algorithm perform consistently even with unfavorable data orderings (the same kind that wreak havoc on quicksort). Makes approximately lg2(N) calls to random.choice().

Revised to include the pivot counts after David Eppstein pointed out that the originally posted algorithm ran slowly when all the inputs were equal.
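For example, a quick usage sketch (assuming the select() function defined above; the sample values are illustrative only):

data = [7, 1, 5, 3, 9, 3, 2]
print(select(data, 0))                # least value  -> 1
print(select(data, len(data) // 2))   # median       -> 3
print(select(data, len(data) - 1))    # greatest     -> 9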
Not always O(n). This takes quadratic time when all items are equal, no?
O(n) for a sequence of identical values.
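A quick way to check this (a sketch, assuming the revised select() shown above with pivot counts):

# Every element equals the pivot, so pcount covers rank n and the answer
# comes back after a single partitioning pass:
same = [5] * 1000000
print(select(same, 500000))    # -> 5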
Invalid previous post. Don't know what's up, but I can't post the entire modified algorithm...
Using cmp(). Even within the <pre> block, some additional markup is necessary to post code that includes less-than or greater-than comparisons. The markup is &lt; and &gt; respectively.
The idea you were starting to post looks like an attempt to use cmp() to avoid doing two comparisons. My benchmarks show that this runs more slowly than using two comparisons but that may be dependent on the type of data being selected.
My previous comment was for v1.2. v1.3 seems to fix the worst-case performance, but at the expense of a larger number of comparisons. I wonder if something like the following code would be faster. I'm not sure the ops[cmp(...)] trickery is very Pythonic, though, especially as cmp() doesn't seem to be guaranteed to return -1, 0, 1.
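(The commenter's code did not come through; the following is only a guess at what an ops[cmp(...)] variant might look like. It relies on the Python 2 cmp() builtin, which no longer exists in Python 3.)

import random

def select_cmp(data, n):
    "Quickselect variant making one cmp() call per element (Python 2 only)."
    data = list(data)
    if not 0 <= n < len(data):
        raise ValueError('not enough elements for the given rank')
    while True:
        pivot = random.choice(data)
        under, equal, over = [], [], []
        # cmp(elem, pivot) indexes the tuple: -1 -> under, 0 -> equal, 1 -> over.
        # Caveat from the comment above: cmp() is only documented to return a
        # negative, zero, or positive integer, not strictly -1/0/1.
        ops = (equal.append, over.append, under.append)
        for elem in data:
            ops[cmp(elem, pivot)](elem)
        if n < len(under):
            data = under
        elif n < len(under) + len(equal):
            return pivot
        else:
            data = over
            n -= len(under) + len(equal)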
Using cmp() and computed append reference. This approach seems inspired but does not win in timings with integer data. Ultimately, it should win whenever comparisons are very expensive and the data objects implement a clever __cmp__() method (otherwise, the builtin cmp() function will end up making two comparisons to differentiate the three cases).
Why? I'm unclear as to what kind of data you would use this for. I did some very simple tests, on lists of integers and strings, and in all cases that I tried, the obvious solution runs faster:
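(The timing code was not included; presumably the "obvious solution" is a full sort followed by indexing, roughly:)

def select_by_sorting(data, n):
    "Rank lookup by sorting a copy: O(n log n), but with a small constant."
    return sorted(data)[n]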
Can you give some guidelines as to when your solution would be better?
Why use select()? The difference between an O(n) algorithm and an O(n lg n) algorithm becomes more pronounced with long list lengths. This is particularly important when the constant factor is large (perhaps due to an expensive compare function).
For instance, try generating a million and one random points and finding the median point using a custom ordering function (dist, angle):
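(The benchmark code is not shown here; the sketch below is one way such an experiment might look, assuming the select() defined above and an assumed (dist, angle) ordering key. The names are illustrative, not from the original post.)

import math, random, time

points = [(random.random(), random.random()) for i in range(1000001)]

def polar(p):
    "Ordering key: distance from the origin, then angle."
    x, y = p
    return (math.hypot(x, y), math.atan2(y, x))

keys = [polar(p) for p in points]
mid = len(points) // 2

start = time.time()
median_by_select = select(keys, mid)       # O(n) expected
print('select(): %.2fs' % (time.time() - start))

start = time.time()
median_by_sort = sorted(keys)[mid]         # O(n log n)
print('sorted(): %.2fs' % (time.time() - start))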
Also, when you run comparisons with sort(), be sure to randomize the data beforehand (sort() will take shortcuts if the data is already partially ordered).