|
-1
|
Each item has an equal chance of being picked. The iterator is processed only once, and only the selected items are stored, making this function memory efficient. This is based on idea from Richard Papworth's recipe that picks a random from a file - see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/59865 - but is more general.
Tags: algorithms
|
6 comments
Add a comment
Sign in to comment
Download
Copy to clipboard

random.sample handles iterators. Worth noting that random.sample handles iterators:
I don't know if it makes any guarantees about efficiency though.
Comparison with random.sample. Unlike random.sample, the OP's recipe samples with replacement -- whether that is a feature or bug is another question. Given an iterator input, the OP's function has the virtue of not pulling all the data into memory at once; in contrast, random.sample will sometimes (depending on the ratio of population size to sample size) pull all the data into memory. In terms of operating efficiency, the OP's recipe uses far too many calls to the underlying random number generator; in contrast, random.sample makes at many fewer calls. Here's an alternate recipe that gives the best of both worlds (memory efficiency and fewer calls to the generator):
Clever. Oooh, nice, Raymond.
further split your loop. assuming that generally len(iterable) >> k, for efficiency's sake you'd be better off splitting the loop in two parts, to avoid the soon-to-be-useless comparison i assuming that generally len(iterable) >> k, for efficiency's sake you'd be better off splitting the loop in two parts, to avoid the soon-to-be-useless comparison i
Correctness. Please correct me if I'm wrong, but it seems like the first items in the iterable will have a much greater chance of being picked than the higher ones.
Consider
In this case, after the first 5 items are in result the algorithm will begin generating random numbers and replacing items in result with k/n probability, with k being the desired number of random items and n being the index of each item in the iterable. So the chance for 20 being in the result (at some point during the algorithm) is 5/20, and the chance for 950 would be 5/950. Also, the lower numbers get replaced so in this case, there is almost no chance that the numbers 0-45 will be in the final array.
This is a good algorithm for picking some numbers from an iterable, but it does not seem to pick uniformly random items.
And yet ... When we actually try it, say running it 1000 times so each item can expect to be picked 5 times in total:
we find no significant difference between the first 20 and the last 20 items:
Indeed, lower numbers are more likely to be in the result, and also more likely to be replaced. The two effects cancel out, and every number has the same "almost no chance" (5/1000) to be included in the final result.