ActiveState Code

Recipe 426332: Picking random items from an iterator


Each item has an equal chance of being picked. The iterator is processed only once, and only the selected items are stored, making this function memory efficient.

This is based on idea from Richard Papworth's recipe that picks a random from a file - see http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/59865 - but is more general.

Python
1
2
3
4
5
6
7
8
9
def random_items(iterator, items_wanted=1):
    selected_items = [None] * items_wanted

    for item_index, item in enumerate(iterator):
        for selected_item_index in xrange(items_wanted):
            if not random.randint(0, item_index):
                selected_items[selected_item_index] = item

    return selected_items

Comments

  1. 1. At 3:15 p.m. on 25 jun 2005, Steven Bethard said:

    random.sample handles iterators. Worth noting that random.sample handles iterators:

    py> import random
    py> i = iter(range(10))
    py> random.sample(i, 2)
    [2, 0]
    

    I don't know if it makes any guarantees about efficiency though.

  2. 2. At 7:23 a.m. on 27 jun 2005, Raymond Hettinger said:

    Comparison with random.sample. Unlike random.sample, the OP's recipe samples with replacement -- whether that is a feature or bug is another question. Given an iterator input, the OP's function has the virtue of not pulling all the data into memory at once; in contrast, random.sample will sometimes (depending on the ratio of population size to sample size) pull all the data into memory. In terms of operating efficiency, the OP's recipe uses far too many calls to the underlying random number generator; in contrast, random.sample makes at many fewer calls. Here's an alternate recipe that gives the best of both worlds (memory efficiency and fewer calls to the generator):

    from random import random, shuffle
    
    def random_items(iterable, k=1):
        result = [None] * k
        for i, item in enumerate(iterable):
            if i < k:
                result[i] = item
            else:
                j = int(random() * (i+1))
                if j < k:
                    result[j] = item
        shuffle(result)
        return result
    
  3. 3. At 9:15 a.m. on 5 jul 2005, Simon Brunning (the author) said:

    Clever. Oooh, nice, Raymond.

  4. 4. At 6:05 a.m. on 6 jul 2005, Martin Blais said:

    further split your loop. assuming that generally len(iterable) >> k, for efficiency's sake you'd be better off splitting the loop in two parts, to avoid the soon-to-be-useless comparison i assuming that generally len(iterable) >> k, for efficiency's sake you'd be better off splitting the loop in two parts, to avoid the soon-to-be-useless comparison i

  5. 5. At 7 a.m. on 6 jul 2005, Steven James said:

    Correctness. Please correct me if I'm wrong, but it seems like the first items in the iterable will have a much greater chance of being picked than the higher ones.

    Consider

    random_items(iter(range(1000)), 5)
    

    In this case, after the first 5 items are in result the algorithm will begin generating random numbers and replacing items in result with k/n probability, with k being the desired number of random items and n being the index of each item in the iterable. So the chance for 20 being in the result (at some point during the algorithm) is 5/20, and the chance for 950 would be 5/950. Also, the lower numbers get replaced so in this case, there is almost no chance that the numbers 0-45 will be in the final array.

    This is a good algorithm for picking some numbers from an iterable, but it does not seem to pick uniformly random items.

  6. 6. At 4:59 a.m. on 9 jul 2005, Amos Newcombe said:

    And yet ... When we actually try it, say running it 1000 times so each item can expect to be picked 5 times in total:

    lcTotalPicks = [0] * 1000
    for i in range(1000):
        lcPicks = random_items(iter(range(1000)), 5)
        for j in lcPicks:
            lcTotalPicks[j] += 1
    

    we find no significant difference between the first 20 and the last 20 items:

    print lcTotalPicks[0:20]
    print lcTotalPicks[980:1000]
    
    [4, 4, 7, 9, 4, 2, 4, 2, 7, 4, 8, 9, 4, 7, 6, 7, 8, 8, 5, 1]
    [4, 3, 1, 9, 9, 3, 4, 9, 8, 8, 3, 5, 6, 5, 4, 5, 5, 7, 4, 6]
    

    Indeed, lower numbers are more likely to be in the result, and also more likely to be replaced. The two effects cancel out, and every number has the same "almost no chance" (5/1000) to be included in the final result.

Sign in to comment