Welcome, guest | Sign In | My Account | Store | Cart

This is a translation of the Fortran subroutine RANKSB which appears on pp 38-9 of Nijenhuis & Wilk (1975) Combinatorial Algorithms, Academic Press.

Python, 66 lines
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66``` ```from random import random, randrange def ranksb ( N, K ) : if N < K : raise Exception, "N must be no less than K" if K == 0 : return [ ] L2 = K + 1 R = L2 A = K * [ 0 ] while 1 : M = 1 + int ( random ( ) * N ) I = 1 + ( M - 1 ) % K breakthencontinue = 0 if A [ I - 1 ] != 0 : while M != A [ I - 1 ] / L2 : LINK = A [ I - 1 ] % L2 if LINK == 0 : while 1 : R -= 1 if R == 0 : return map ( lambda a : a / L2, A ) if A [ R - 1 ] <= 0 : A [ I - 1 ] += R I = R A [ I - 1 ] = L2 * M break breakthencontinue = 1 break I = LINK else : continue if breakthencontinue : continue A [ I - 1 ] = L2 * M if __name__ == "__main__" : from fpformat import fix from time import time counts = { } n , k = 105, 90 sampleSize = 1000 timeStart = time ( ) for s in xrange ( sampleSize ) : a = ranksb ( n, k ) for i in a : if i in counts : counts [ i ] += 1 else : counts [ i ] = 1 print "Time to generate %i %i-subsets from set of size %i: %s seconds" \ % ( sampleSize, k, n, fix ( time ( ) - timeStart, 3 ) ) keys = counts . keys ( ) keys . sort ( ) totalCount = 0 idealCount = sampleSize * k / n ChiSquare = 0 print "Counts of occurrences of each sample element, " print "and difference between 'ideal' count and actual" for key in keys : print key, counts [ key ], abs ( counts [ key ] - idealCount ) totalCount += counts [ key ] ChiSquare +=float ( pow ( counts [ key ] - idealCount, 2 ) ) / idealCount print "Chi-squared test of uniformity: %s on %i d.f." % ( fix ( ChiSquare, 3), n - 1 ) ```

Originally written for a research project, this algorithm is useful wherever one needs to resample for use in jackknifing and related methods. Raymond Hettinger 20 years, 11 months ago

Take advantage of dictionaries. Use alternate approach for difficult cases. The spirit of the above algorithm is to use the result array as a double chained hash table, make repeated random picks, and query the table to see if the number was already picked. The method works well with large values of n, consumes memory space proportional to k, and runs with average time proportional to k when k is less than 2/3's of n. If k is closer to n, there will be many collisions and the algorithm will thrash unacceptably.

The following code adopts an equivalent method but takes advantage of the efficient C-coded hashing already built into Python's dictionaries. In cases where k is more than 2/3's of n, the code switches to a shuffle algorithm which takes time proportional to k and space proportional to n.

``````import random

def ranksb1(n,k):
if k > n: raise Exception, "N must be no less than K"
if k > n * 2 // 3:
pool = range(1,n+1)
for i in xrange(n-1, n-k-1, -1):
j = random.randrange(i+1)
pool[i], pool[j] = pool[j], pool[i]
return pool[-k:]
selections = {}
while k > len(selections):
value = random.randrange(1, n+1)
if value in selections: continue
selections[value] = True
return selections.keys()
`````` Bill Bell (author) 20 years, 9 months ago

Error in algorithm. The routine fails (or appears to fail) when n=k. Add a statement that returns the entire set when the subset size must be the same as the set size. Raymond Hettinger 20 years, 9 months ago

Refined version. After further development and peer review, the above code was refined to have a more parallel structure, have consistent variable names, use local variable optimization, and return selections in selection order. The new algorithm switchover point minimizes memory consumption while maintaining best speed. The new function name better indicates that the routine implements random sampling without replacement.

``````from random import random

def sample(n, k, random=random, int=int):
"""Chooses k unique random elements from [0,n).

Used for random sampling without replacement.
"""

if not n >= k >= 0:
raise ValueError, "sample larger than population"
result = [None] * k
if k * 6 > n:     # if n len list takes less space than a k len dict
pool = range(n)
for i in xrange(k):         # invariant:  non-selected at [0,n-i)
j = int(random() * (n-i))
result[i] = pool[j]
pool[j] = pool[n-i-1]
else:
selected = {}
for i in xrange(k):
j = int(random() * n)
while j in selected:
j = int(random() * n)
result[i] = selected[j] = j
return result
`````` sasa sasa 18 years, 10 months ago

random.sample? Isn't this just a bad implementation of the (more general) built-in random.sample? Bogdano Arendartchuk 18 years, 7 months ago

Re: random.sample? I think this is the implementation. Created by Bill Bell on Sun, 13 Oct 2002 (PSF)