Each item has an equal chance of being picked. The iterator is processed only once, and only the selected items are stored, making this function memory efficient.
This is based on idea from Richard Papworth's recipe that picks a random from a file - see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/59865 - but is more general.
1 2 3 4 5 6 7 8 9 | def random_items(iterator, items_wanted=1):
selected_items = [None] * items_wanted
for item_index, item in enumerate(iterator):
for selected_item_index in xrange(items_wanted):
if not random.randint(0, item_index):
selected_items[selected_item_index] = item
return selected_items
|
Tags: algorithms
random.sample handles iterators. Worth noting that random.sample handles iterators:
I don't know if it makes any guarantees about efficiency though.
Comparison with random.sample. Unlike random.sample, the OP's recipe samples with replacement -- whether that is a feature or bug is another question. Given an iterator input, the OP's function has the virtue of not pulling all the data into memory at once; in contrast, random.sample will sometimes (depending on the ratio of population size to sample size) pull all the data into memory. In terms of operating efficiency, the OP's recipe uses far too many calls to the underlying random number generator; in contrast, random.sample makes at many fewer calls. Here's an alternate recipe that gives the best of both worlds (memory efficiency and fewer calls to the generator):
Clever. Oooh, nice, Raymond.
further split your loop. assuming that generally len(iterable) >> k, for efficiency's sake you'd be better off splitting the loop in two parts, to avoid the soon-to-be-useless comparison i assuming that generally len(iterable) >> k, for efficiency's sake you'd be better off splitting the loop in two parts, to avoid the soon-to-be-useless comparison i
Correctness. Please correct me if I'm wrong, but it seems like the first items in the iterable will have a much greater chance of being picked than the higher ones.
Consider
In this case, after the first 5 items are in result the algorithm will begin generating random numbers and replacing items in result with k/n probability, with k being the desired number of random items and n being the index of each item in the iterable. So the chance for 20 being in the result (at some point during the algorithm) is 5/20, and the chance for 950 would be 5/950. Also, the lower numbers get replaced so in this case, there is almost no chance that the numbers 0-45 will be in the final array.
This is a good algorithm for picking some numbers from an iterable, but it does not seem to pick uniformly random items.
And yet ... When we actually try it, say running it 1000 times so each item can expect to be picked 5 times in total:
we find no significant difference between the first 20 and the last 20 items:
Indeed, lower numbers are more likely to be in the result, and also more likely to be replaced. The two effects cancel out, and every number has the same "almost no chance" (5/1000) to be included in the final result.
You can actually speed this up quite a bit. A lot of the 'j's that are created will result in no new entry. So our goal would be to generate a single random number that predicts how many steps forward to go to reach the next one that becomes a new entry in result. It's similar in spirit to the Gillespie algorithm.
The approach that I was initially going to suggest uses the fact that the negative binomial distribution can be thought of as giving the number of failures (assuming a constant probability) before a success. In this case however, they each have different probability. But if we know that each event happens with probability less than, say, 0.1, we can consider an equivalent problem: for each event happening with probability p, run a quick test with probability 0.1. If that passes, then run a second test with probability p/0.1 (I believe this is called rejection sampling). If both pass, then the event is a success. If the first fails, we don't do the second. Obviously the implementation I just described slows things down, but does give the same result.
The reason to consider the slower algorithm I described is that we can dramatically speed up the implementation of the first test. The negative binomial lets us use a single random number generation to jump to the first item that passes the test with 0.1. Then we do a second random number generation to determine if this one is really a success.
The use of 0.1 was just good for an example. What we really should do is when we look at a given i, we calculate its probability q of being added to the list. Every following probability p is less than q. So we do a negative binomial jump with q. Then whichever one we land at is a success with probability p/q (using its p). Now we reset q to be p. Repeat - the jump is probably longer now. So our average jump keeps increasing.
I believe a slightly more efficient algorithm is described in by Vitter 1985 Random sampling with a reservoir: http://www.cs.umd.edu/~samir/498/vitter.pdf The improvement comes from using a better jump than negative binomial. In principle, we not only know that each successive individual has p less than q, but we actually know in advance what those values of p are. So I think Vitter figures out the exact jump, but I haven't read the paper closely. I've just seen enough of it to know that the bright idea I had when I was looking at this algorithm is 30 years old.