Welcome, guest | Sign In | My Account | Store | Cart
-1

Each item has an equal chance of being picked. The iterator is processed only once, and only the selected items are stored, making this function memory efficient.

This is based on idea from Richard Papworth's recipe that picks a random from a file - see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/59865 - but is more general.

Python, 9 lines
1
2
3
4
5
6
7
8
9
def random_items(iterator, items_wanted=1):
    selected_items = [None] * items_wanted

    for item_index, item in enumerate(iterator):
        for selected_item_index in xrange(items_wanted):
            if not random.randint(0, item_index):
                selected_items[selected_item_index] = item

    return selected_items

6 comments

Steven Bethard 7 years, 10 months ago  # | flag

random.sample handles iterators. Worth noting that random.sample handles iterators:

py> import random
py> i = iter(range(10))
py> random.sample(i, 2)
[2, 0]

I don't know if it makes any guarantees about efficiency though.

Raymond Hettinger 7 years, 10 months ago  # | flag

Comparison with random.sample. Unlike random.sample, the OP's recipe samples with replacement -- whether that is a feature or bug is another question. Given an iterator input, the OP's function has the virtue of not pulling all the data into memory at once; in contrast, random.sample will sometimes (depending on the ratio of population size to sample size) pull all the data into memory. In terms of operating efficiency, the OP's recipe uses far too many calls to the underlying random number generator; in contrast, random.sample makes at many fewer calls. Here's an alternate recipe that gives the best of both worlds (memory efficiency and fewer calls to the generator):

from random import random, shuffle

def random_items(iterable, k=1):
    result = [None] * k
    for i, item in enumerate(iterable):
        if i < k:
            result[i] = item
        else:
            j = int(random() * (i+1))
            if j < k:
                result[j] = item
    shuffle(result)
    return result
Simon Brunning (author) 7 years, 10 months ago  # | flag

Clever. Oooh, nice, Raymond.

Martin Blais 7 years, 10 months ago  # | flag

further split your loop. assuming that generally len(iterable) >> k, for efficiency's sake you'd be better off splitting the loop in two parts, to avoid the soon-to-be-useless comparison i assuming that generally len(iterable) >> k, for efficiency's sake you'd be better off splitting the loop in two parts, to avoid the soon-to-be-useless comparison i

Steven James 7 years, 10 months ago  # | flag

Correctness. Please correct me if I'm wrong, but it seems like the first items in the iterable will have a much greater chance of being picked than the higher ones.

Consider

random_items(iter(range(1000)), 5)

In this case, after the first 5 items are in result the algorithm will begin generating random numbers and replacing items in result with k/n probability, with k being the desired number of random items and n being the index of each item in the iterable. So the chance for 20 being in the result (at some point during the algorithm) is 5/20, and the chance for 950 would be 5/950. Also, the lower numbers get replaced so in this case, there is almost no chance that the numbers 0-45 will be in the final array.

This is a good algorithm for picking some numbers from an iterable, but it does not seem to pick uniformly random items.

Amos Newcombe 7 years, 10 months ago  # | flag

And yet ... When we actually try it, say running it 1000 times so each item can expect to be picked 5 times in total:

lcTotalPicks = [0] * 1000
for i in range(1000):
    lcPicks = random_items(iter(range(1000)), 5)
    for j in lcPicks:
        lcTotalPicks[j] += 1

we find no significant difference between the first 20 and the last 20 items:

print lcTotalPicks[0:20]
print lcTotalPicks[980:1000]

[4, 4, 7, 9, 4, 2, 4, 2, 7, 4, 8, 9, 4, 7, 6, 7, 8, 8, 5, 1]
[4, 3, 1, 9, 9, 3, 4, 9, 8, 8, 3, 5, 6, 5, 4, 5, 5, 7, 4, 6]

Indeed, lower numbers are more likely to be in the result, and also more likely to be replaced. The two effects cancel out, and every number has the same "almost no chance" (5/1000) to be included in the final result.

Add a comment

Sign in to comment

Created by Simon Brunning on Wed, 22 Jun 2005 (PSF)
Python recipes (4001)
Simon Brunning's recipes (5)

Required Modules

  • (none specified)

Other Information and Tasks