This is a translation of the Fortran subroutine RANKSB, which appears on pp. 38-39 of Nijenhuis & Wilf (1975), Combinatorial Algorithms, Academic Press.
from random import random

def ranksb(N, K):
    """Return a random K-subset of {1, ..., N}."""
    if N < K:
        raise ValueError("N must be no less than K")
    if K == 0:
        return []
    L2 = K + 1
    R = L2
    # A doubles as the result array and a chained hash table:
    # each slot packs L2 * value + link-to-next-slot.
    A = K * [0]
    while True:
        M = 1 + int(random() * N)    # random candidate in 1..N
        I = 1 + (M - 1) % K          # its home slot in the table
        breakthencontinue = 0
        if A[I - 1] != 0:
            # Slot occupied: walk the chain looking for M.
            while M != A[I - 1] / L2:
                LINK = A[I - 1] % L2
                if LINK == 0:
                    # M is not yet in the table; find a free slot for it.
                    while True:
                        R -= 1
                        if R == 0:
                            # Table full: decode and return the subset.
                            return map(lambda a: a / L2, A)
                        if A[R - 1] <= 0:
                            A[I - 1] += R    # link chain tail to slot R
                            I = R
                            A[I - 1] = L2 * M
                            break
                    breakthencontinue = 1
                    break
                I = LINK
            else:
                # Loop ended normally: M was already picked, try again.
                continue
        if breakthencontinue:
            continue
        A[I - 1] = L2 * M

if __name__ == "__main__":
    from fpformat import fix
    from time import time
    counts = {}
    n, k = 105, 90
    sampleSize = 1000
    timeStart = time()
    for s in xrange(sampleSize):
        a = ranksb(n, k)
        for i in a:
            if i in counts:
                counts[i] += 1
            else:
                counts[i] = 1
    print "Time to generate %i %i-subsets from set of size %i: %s seconds" \
          % (sampleSize, k, n, fix(time() - timeStart, 3))
    keys = counts.keys()
    keys.sort()
    totalCount = 0
    idealCount = sampleSize * k / n
    ChiSquare = 0
    print "Counts of occurrences of each sample element,"
    print "and difference between 'ideal' count and actual"
    for key in keys:
        print key, counts[key], abs(counts[key] - idealCount)
        totalCount += counts[key]
        ChiSquare += float(pow(counts[key] - idealCount, 2)) / idealCount
    print "Chi-squared test of uniformity: %s on %i d.f." % (fix(ChiSquare, 3), n - 1)
Originally written for a research project, this algorithm is useful wherever one needs to draw random subsets, for example when resampling in the jackknife and related methods.
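As an illustration of that use, a delete-d jackknife recomputes a statistic on copies of the data with a random d-subset removed. The sketch below is illustrative only (the function name and parameters are mine, and it uses the standard library's random.sample in place of ranksb):

```python
from random import sample, seed
from statistics import mean

def delete_d_jackknife_means(data, d, replicates):
    # Each replicate deletes a random d-subset of the indices and
    # recomputes the statistic (here: the mean) on what remains --
    # exactly the kind of resampling a k-subset generator supports.
    n = len(data)
    means = []
    for _ in range(replicates):
        drop = set(sample(range(n), d))
        kept = [x for i, x in enumerate(data) if i not in drop]
        means.append(mean(kept))
    return means

seed(1)  # reproducible for demonstration only
reps = delete_d_jackknife_means([2.0, 4.0, 6.0, 8.0], d=1, replicates=10)
```

The spread of the replicate means then feeds the usual jackknife variance formula.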
Tags: algorithms
Take advantage of dictionaries. Use an alternate approach for difficult cases. The spirit of the above algorithm is to use the result array as a chained hash table (each slot packs a value and a link), make repeated random picks, and query the table to see whether the number was already picked. The method works well for large values of n, consumes memory proportional to k, and runs in average time proportional to k when k is less than two-thirds of n. If k is close to n, there are many collisions and the algorithm thrashes unacceptably.
The following code adopts an equivalent method but takes advantage of the efficient C-coded hashing already built into Python's dictionaries. When k is more than two-thirds of n, the code switches to a shuffle algorithm that takes time proportional to k and space proportional to n.
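A minimal sketch of that hybrid, with my own names rather than the recipe's (it samples from 0..n-1, and a set stands in for the hand-rolled hash table):

```python
from random import randrange

def sample_subset(n, k):
    # Hybrid strategy: hashing for sparse picks, shuffling for dense ones.
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    if 3 * k > 2 * n:
        # k close to n: partial Fisher-Yates shuffle -- O(n) space,
        # exactly k swaps, and no collisions even when k == n.
        pool = list(range(n))
        for i in range(k):
            j = randrange(i, n)
            pool[i], pool[j] = pool[j], pool[i]
        return pool[:k]
    # k small relative to n: repeated random picks, with Python's
    # built-in hashing detecting duplicates.
    seen = set()
    result = []
    while len(result) < k:
        x = randrange(n)
        if x not in seen:
            seen.add(x)
            result.append(x)
    return result
```

Below the switchover the expected number of retries per pick stays bounded, so the dictionary branch keeps the O(k) average time the recipe advertises.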
Error in algorithm. The routine fails (or appears to fail) when n = k. Add a statement that returns the entire set when the subset size equals the set size.
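One way to apply that fix without touching the routine itself is a wrapper guard; this is a sketch with hypothetical names, not the reporter's actual patch:

```python
from random import shuffle

def ranksb_with_guard(N, K, ranksb_impl):
    # When K == N the only possible K-subset is the whole set
    # {1, ..., N}, so return it directly instead of letting the
    # hash-table method thrash; otherwise defer to the original routine.
    if K == N:
        whole = list(range(1, N + 1))
        shuffle(whole)  # keep the output order random, as a sampler would
        return whole
    return ranksb_impl(N, K)
```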
Refined version. After further development and peer review, the code above was refined to have a more parallel structure, use consistent variable names, take advantage of local-variable optimization, and return selections in selection order. The new switchover point minimizes memory consumption while maintaining the best speed. The new function name better indicates that the routine implements random sampling without replacement.
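As the comments below suggest, this refined routine corresponds to the standard library's random.sample, which likewise returns a k-length list in selection order. Equivalent usage today:

```python
from random import sample

# A 90-subset of 0..104, unique elements, in selection order.
subset = sample(range(105), 90)
assert len(subset) == len(set(subset)) == 90
assert all(0 <= x < 105 for x in subset)
```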
random.sample? Isn't this just a bad implementation of the (more general) built-in random.sample?
Re: random.sample? I think this is the implementation.