Takes a sequence and yields K partitions of it into training and validation sets. Each training set has about (K-1)*len(X)/K items and each validation set has about len(X)/K items; when len(X) is not a multiple of K, the sizes differ by at most one between folds.
def k_fold_cross_validation(X, K, randomise=False):
    """
    Generates K (training, validation) pairs from the items in X.

    Each pair is a partition of X, where validation is a list of
    about len(X)/K items, so each training list has about
    (K-1)*len(X)/K items.

    If randomise is true, a copy of X is shuffled before partitioning,
    otherwise its order is preserved in training and validation.
    """
    # Work on a copy: the caller's sequence is left untouched, and a
    # one-shot iterator can still be traversed once per fold.
    X = list(X)
    if randomise:
        from random import shuffle
        shuffle(X)
    for k in range(K):
        # Item i goes to the validation set of fold i % K.
        training = [x for i, x in enumerate(X) if i % K != k]
        validation = [x for i, x in enumerate(X) if i % K == k]
        yield training, validation


# Check that every item lands in exactly one of the two sets per fold.
X = [i for i in range(97)]
for training, validation in k_fold_cross_validation(X, K=7):
    for x in X:
        assert (x in training) ^ (x in validation), x
This is a common task in machine learning.
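As an illustration of how the folds are typically consumed, here is a minimal sketch that scores a deliberately trivial "model" (predict the mean of the training items) by averaging its squared error over the K validation sets. The names `folds` and `cross_validated_mse` are hypothetical and not part of the recipe; `folds` is just a compact restatement of the same index-modulo split so that the block runs on its own.

```python
def folds(X, K):
    """Compact restatement of the recipe's split: item i validates in fold i % K."""
    X = list(X)
    for k in range(K):
        training = [x for i, x in enumerate(X) if i % K != k]
        validation = [x for i, x in enumerate(X) if i % K == k]
        yield training, validation


def cross_validated_mse(values, K):
    """Average, over K folds, the squared error of predicting the training mean."""
    fold_errors = []
    for training, validation in folds(values, K):
        # The "model" is simply the mean of the training items.
        prediction = sum(training) / float(len(training))
        mse = sum((v - prediction) ** 2 for v in validation) / float(len(validation))
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)


data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
score = cross_validated_mse(data, K=3)
```

Any real model would replace the training-mean line; the point is only that each item is used for validation exactly once and for training K-1 times.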
Any improvements welcome. There's probably a one liner out there :)
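One candidate for a shorter form, offered as a sketch rather than a definitive answer: an extended slice `X[k::K]` picks exactly the items whose index i satisfies i % K == k, so the validation set needs no comprehension. The name `k_fold_one_liner` is hypothetical, and this version assumes X supports slicing (a list or tuple, not an arbitrary iterator) and does not shuffle.

```python
def k_fold_one_liner(X, K):
    """Hypothetical compact variant; assumes X supports slicing (e.g. a list)."""
    # X[k::K] selects the items at indices k, k+K, k+2K, ...
    return (([x for i, x in enumerate(X) if i % K != k], X[k::K]) for k in range(K))


X = list(range(97))
sizes = [(len(training), len(validation))
         for training, validation in k_fold_one_liner(X, K=7)]
```

With 97 items and K=7, `sizes` shows that the folds are not all the same length: the validation sets differ in size by one item because 97 is not a multiple of 7.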