Welcome, guest | Sign In | My Account | Store | Cart

Each item has an equal chance of being picked. The iterator is processed only once, and only the selected items are stored, making this function memory efficient.

This is based on idea from Richard Papworth's recipe that picks a random from a file - see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/59865 - but is more general.

Python, 9 lines
1
2
3
4
5
6
7
8
9
def random_items(iterator, items_wanted=1):
    selected_items = [None] * items_wanted

    for item_index, item in enumerate(iterator):
        for selected_item_index in xrange(items_wanted):
            if not random.randint(0, item_index):
                selected_items[selected_item_index] = item

    return selected_items

7 comments

Steven Bethard 16 years, 5 months ago  # | flag

random.sample handles iterators. Worth noting that random.sample handles iterators:

py> import random
py> i = iter(range(10))
py> random.sample(i, 2)
[2, 0]

I don't know if it makes any guarantees about efficiency though.

Raymond Hettinger 16 years, 5 months ago  # | flag

Comparison with random.sample. Unlike random.sample, the OP's recipe samples with replacement -- whether that is a feature or bug is another question. Given an iterator input, the OP's function has the virtue of not pulling all the data into memory at once; in contrast, random.sample will sometimes (depending on the ratio of population size to sample size) pull all the data into memory. In terms of operating efficiency, the OP's recipe uses far too many calls to the underlying random number generator; in contrast, random.sample makes at many fewer calls. Here's an alternate recipe that gives the best of both worlds (memory efficiency and fewer calls to the generator):

from random import random, shuffle

def random_items(iterable, k=1):
    result = [None] * k
    for i, item in enumerate(iterable):
        if i < k:
            result[i] = item
        else:
            j = int(random() * (i+1))
            if j < k:
                result[j] = item
    shuffle(result)
    return result
Simon Brunning (author) 16 years, 5 months ago  # | flag

Clever. Oooh, nice, Raymond.

Martin Blais 16 years, 5 months ago  # | flag

further split your loop. assuming that generally len(iterable) >> k, for efficiency's sake you'd be better off splitting the loop in two parts, to avoid the soon-to-be-useless comparison i assuming that generally len(iterable) >> k, for efficiency's sake you'd be better off splitting the loop in two parts, to avoid the soon-to-be-useless comparison i

Steven James 16 years, 5 months ago  # | flag

Correctness. Please correct me if I'm wrong, but it seems like the first items in the iterable will have a much greater chance of being picked than the higher ones.

Consider

random_items(iter(range(1000)), 5)

In this case, after the first 5 items are in result the algorithm will begin generating random numbers and replacing items in result with k/n probability, with k being the desired number of random items and n being the index of each item in the iterable. So the chance for 20 being in the result (at some point during the algorithm) is 5/20, and the chance for 950 would be 5/950. Also, the lower numbers get replaced so in this case, there is almost no chance that the numbers 0-45 will be in the final array.

This is a good algorithm for picking some numbers from an iterable, but it does not seem to pick uniformly random items.

Amos Newcombe 16 years, 5 months ago  # | flag

And yet ... When we actually try it, say running it 1000 times so each item can expect to be picked 5 times in total:

lcTotalPicks = [0] * 1000
for i in range(1000):
    lcPicks = random_items(iter(range(1000)), 5)
    for j in lcPicks:
        lcTotalPicks[j] += 1

we find no significant difference between the first 20 and the last 20 items:

print lcTotalPicks[0:20]
print lcTotalPicks[980:1000]

[4, 4, 7, 9, 4, 2, 4, 2, 7, 4, 8, 9, 4, 7, 6, 7, 8, 8, 5, 1]
[4, 3, 1, 9, 9, 3, 4, 9, 8, 8, 3, 5, 6, 5, 4, 5, 5, 7, 4, 6]

Indeed, lower numbers are more likely to be in the result, and also more likely to be replaced. The two effects cancel out, and every number has the same "almost no chance" (5/1000) to be included in the final result.

Joel Miller 7 years, 3 months ago  # | flag

You can actually speed this up quite a bit. A lot of the 'j's that are created will result in no new entry. So our goal would be to generate a single random number that predicts how many steps forward to go to reach the next one that becomes a new entry in result. It's similar in spirit to the Gillespie algorithm.

The approach that I was initially going to suggest uses the fact that the negative binomial distribution can be thought of as giving the number of failures (assuming a constant probability) before a success. In this case however, they each have different probability. But if we know that each event happens with probability less than, say, 0.1, we can consider an equivalent problem: for each event happening with probability p, run a quick test with probability 0.1. If that passes, then run a second test with probability p/0.1 (I believe this is called rejection sampling). If both pass, then the event is a success. If the first fails, we don't do the second. Obviously the implementation I just described slows things down, but does give the same result.

The reason to consider the slower algorithm I described is that we can dramatically speed up the implementation of the first test. The negative binomial lets us use a single random number generation to jump to the first item that passes the test with 0.1. Then we do a second random number generation to determine if this one is really a success.

The use of 0.1 was just good for an example. What we really should do is when we look at a given i, we calculate its probability q of being added to the list. Every following probability p is less than q. So we do a negative binomial jump with q. Then whichever one we land at is a success with probability p/q (using its p). Now we reset q to be p. Repeat - the jump is probably longer now. So our average jump keeps increasing.

I believe a slightly more efficient algorithm is described in by Vitter 1985 Random sampling with a reservoir: http://www.cs.umd.edu/~samir/498/vitter.pdf The improvement comes from using a better jump than negative binomial. In principle, we not only know that each successive individual has p less than q, but we actually know in advance what those values of p are. So I think Vitter figures out the exact jump, but I haven't read the paper closely. I've just seen enough of it to know that the bright idea I had when I was looking at this algorithm is 30 years old.

Created by Simon Brunning on Wed, 22 Jun 2005 (PSF)
Python recipes (4591)
Simon Brunning's recipes (5)

Required Modules

  • (none specified)

Other Information and Tasks