Welcome, guest | Sign In | My Account | Store | Cart

This is a translation of the Fortran subroutine RANKSB which appears on pp 38-9 of Nijenhuis & Wilk (1975) Combinatorial Algorithms, Academic Press.

Python, 66 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
from random import random, randrange

def ranksb ( N, K ) :
    if N < K :
        raise Exception, "N must be no less than K"
    if K == 0 : return [ ]

    L2 = K + 1
    R = L2
    A = K * [ 0 ]
    while 1 :
        M = 1 + int ( random ( ) * N )
        I = 1 + ( M - 1 ) % K
        breakthencontinue = 0
        if A [ I - 1 ]  != 0 :
            while M != A [ I - 1 ] / L2 :
                LINK = A [ I - 1 ] % L2
                if LINK == 0 :
                    while 1 :
                        R -= 1
                        if R == 0 : return map ( lambda a : a / L2, A )
                        if A [ R - 1 ] <= 0 :
                            A [ I - 1 ]  += R
                            I = R
                            A [ I - 1 ] = L2 * M
                            break
                    breakthencontinue = 1
                    break
                I = LINK
            else :
                continue
        if breakthencontinue :
            continue
        A [ I - 1 ] = L2 * M

if __name__ == "__main__" :
    from fpformat import fix
    from time import time

    counts = { }
    n , k = 105, 90
    sampleSize = 1000

    timeStart = time ( )
    for s in xrange ( sampleSize ) :
        a = ranksb ( n, k )
        for i in a :
            if i in counts :
                counts [ i ] += 1
            else :
                counts [ i ] = 1
    print "Time to generate %i %i-subsets from set of size %i: %s seconds" \
        % ( sampleSize, k, n, fix ( time ( ) - timeStart, 3 ) )

    keys = counts . keys ( )
    keys . sort ( )
    totalCount = 0
    idealCount = sampleSize * k / n
    ChiSquare = 0
    print "Counts of occurrences of each sample element, "
    print "and difference between 'ideal' count and actual"
    for key in keys :
        print key, counts [ key ], abs ( counts [ key ] - idealCount )
        totalCount += counts [ key ]
        ChiSquare +=float ( pow ( counts [ key ] - idealCount, 2 ) ) / idealCount
    print "Chi-squared test of uniformity: %s on %i d.f." % ( fix ( ChiSquare, 3), n - 1 )

Originally written for a research project, this algorithm is useful wherever one needs to resample for use in jackknifing and related methods.

5 comments

Raymond Hettinger 21 years, 6 months ago  # | flag

Take advantage of dictionaries. Use alternate approach for difficult cases. The spirit of the above algorithm is to use the result array as a double chained hash table, make repeated random picks, and query the table to see if the number was already picked. The method works well with large values of n, consumes memory space proportional to k, and runs with average time proportional to k when k is less than 2/3's of n. If k is closer to n, there will be many collisions and the algorithm will thrash unacceptably.

The following code adopts an equivalent method but takes advantage of the efficient C-coded hashing already built into Python's dictionaries. In cases where k is more than 2/3's of n, the code switches to a shuffle algorithm which takes time proportional to k and space proportional to n.

import random

def ranksb1(n,k):
    if k > n: raise Exception, "N must be no less than K"
    if k > n * 2 // 3:
        pool = range(1,n+1)
        for i in xrange(n-1, n-k-1, -1):
            j = random.randrange(i+1)
            pool[i], pool[j] = pool[j], pool[i]
        return pool[-k:]
    selections = {}
    while k > len(selections):
        value = random.randrange(1, n+1)
        if value in selections: continue
        selections[value] = True
    return selections.keys()
Bill Bell (author) 21 years, 4 months ago  # | flag

Error in algorithm. The routine fails (or appears to fail) when n=k. Add a statement that returns the entire set when the subset size must be the same as the set size.

Raymond Hettinger 21 years, 4 months ago  # | flag

Refined version. After further development and peer review, the above code was refined to have a more parallel structure, have consistent variable names, use local variable optimization, and return selections in selection order. The new algorithm switchover point minimizes memory consumption while maintaining best speed. The new function name better indicates that the routine implements random sampling without replacement.

from random import random

def sample(n, k, random=random, int=int):
    """Chooses k unique random elements from [0,n).

    Used for random sampling without replacement.
    """

    if not n >= k >= 0:
        raise ValueError, "sample larger than population"
    result = [None] * k
    if k * 6 > n:     # if n len list takes less space than a k len dict
        pool = range(n)
        for i in xrange(k):         # invariant:  non-selected at [0,n-i)
            j = int(random() * (n-i))
            result[i] = pool[j]
            pool[j] = pool[n-i-1]
    else:
        selected = {}
        for i in xrange(k):
            j = int(random() * n)
            while j in selected:
                j = int(random() * n)
            result[i] = selected[j] = j
    return result
sasa sasa 19 years, 5 months ago  # | flag

random.sample? Isn't this just a bad implementation of the (more general) built-in random.sample?

Bogdano Arendartchuk 19 years, 1 month ago  # | flag

Re: random.sample? I think this is the implementation.