This recipe solves the problem of shuffle-merging files -- interlacing (shuffle-merging) many small text files into one large text file, while preserving the order of the lines from within the small files.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 | #!/usr/bin/python
"""
NAME
shuffle-merge -- shuffle-merge text files
SYNOPSIS
%(progname)s [OPTIONS] <File Name Prefix>
DESCRIPTION
shuffle-merge merges a number of text files. The order of merging is
selected with a random policy.
OPTIONS:
Arguments:
--help
Print a summary of the program options and exit.
--nprocs=<int>, -n <int>
number of processors [default=8]
--maxlines=<int>, -m <int>
max number of lines read [default=20]
"""
__rev = "1.0"
__author__ = 'Alexandru Iosup'
__email__ = 'A.Iosup at ewi.tudelft.nl'
__file__ = 'shuffle-merge.py'
__version__ = '$Revision: %s$' % __rev
__date__ = '$Date: 2005/08/15 16:59:00 $'
__copyright__ = 'Copyright (c) 2005 Alexandru IOSUP'
__license__ = 'Python'
import sys
import os
import getopt
import string
import random
import time
def ShuffleMerge( InFilePrefix, NProcs, MaxLines ):
"""
shuffle-merges files InFilePrefix_X, X in { 0, 1, ... NProcs } and
stores the result into sm-InFilePrefix.
Notes: does NOT check if the input files are available.
"""
NProcs = int(NProcs)
MaxLines = int(MaxLines)
#-- init random seed
random.seed(time.time())
OutFileName = "sm-%s" % InFilePrefix
OutFile = open( OutFileName, "w" )
InFileNames = {}
InFiles = {}
InFileFinished = {}
ProcsIDList = range(NProcs)
for index in ProcsIDList:
InFileNames[index] = "%s_%d" % (InFilePrefix, index)
InFiles[index] = open( InFileNames[index], "r" )
InFileFinished[index] = 0
nReadLines = 0
while 1:
#-- make a list of all input files not finished yet
ListOfNotFinished = []
for index in ProcsIDList:
if InFileFinished[index] == 0:
ListOfNotFinished.append(index)
#-- randomly select an input file
lenListOfNotFinished = len(ListOfNotFinished)
if lenListOfNotFinished == 0:
break
elif lenListOfNotFinished == 1:
ProcID = ListOfNotFinished[0]
else:
# at least 2 elements in this list -> pick at random the proc ID
ProcID = ListOfNotFinished[random.randint(0, lenListOfNotFinished - 1)]
#-- randomly copy 1 to MaxLines lines of it to the output file
nLinesToGet = random.randint( 1, MaxLines )
try:
for index in range(nLinesToGet):
line = InFiles[ProcID].readline()
if len(line) > 0:
OutFile.write( line )
nReadLines = nReadLines + 1
if nReadLines % 10000 == 0:
print "nReadLines", nReadLines, "[last read", nLinesToGet, \
"from", ProcID, "/", ListOfNotFinished, "]"
else:
InFileFinished[ProcID] = 1
except KeyError, e:
print "Got wrong array index:", e
except IOError, (errno, strerror):
print "I/O error(%s): %s" % (errno, strerror)
InFileFinished[ProcID] = 1
print "nReadLines", nReadLines, "[last read", nLinesToGet, \
"from", ProcID, "/", ListOfNotFinished, "]"
OutFile.close()
for index in ProcsIDList:
InFiles[index].close()
def usage(progname):
print __doc__ % vars()
def main(argv):
OneCharOpts = "hn:m:"
MultiCharList = [
"help", "nprocs=", "maxlines="
]
try:
opts, args = getopt.getopt( argv, OneCharOpts, MultiCharList )
except getopt.GetoptError:
usage(os.path.basename(sys.argv[0]))
sys.exit(2)
NProcs = 8
MaxLines = 20
FileNamePrefix = "ttt"
for opt, arg in opts:
if opt in ["-h", "--help"]:
usage(os.path.basename(sys.argv[0]))
sys.exit()
elif opt in ["-n", "--nprocs"]:
NProcs = arg.strip()
elif opt in ["-m", "--maxlines"]:
MaxLines = arg.strip()
if len(args) >= 1:
FileNamePrefix = args[0]
ShuffleMerge( FileNamePrefix, NProcs, MaxLines )
if __name__ == "__main__":
main(sys.argv[1:])
|
In a scientific simulation process, it is not uncommon to need to combine multiple source files into one final file, which will become input for another stage in the process, while preserving the order of the lines from within the source files. For example, when simulating the way checked messages arrive to a centralized component (e.g., a central database server in a distributed banking service), the final file needs to combine all source files in a random way (e.g., the messages arrived at their pace, disturbed by the transfer over Internet), while preserving the order between lines of the same source file (e.g., the receiving-end of the messaging service ensured the messages from the same client arrived in a fixed order).
This recipe solves this problem, under the following assumptions: o the source files are named "Prefix_X", with the same Prefix, and X being a 0-bades integer index of the file (e.g., 0, 1, ..., n-1 for n source files) o the shuffle-merge (output) file is "sm-Prefix"
The script prints a verification message every 10K lines parsed, and seems to be performing in under 10s on a 1M lines set of input.
Possible improvements [est.difficulty: trivial In a scientific simulation process, it is not uncommon to need to combine multiple source files into one final file, which will become input for another stage in the process, while preserving the order of the lines from within the source files. For example, when simulating the way checked messages arrive to a centralized component (e.g., a central database server in a distributed banking service), the final file needs to combine all source files in a random way (e.g., the messages arrived at their pace, disturbed by the transfer over Internet), while preserving the order between lines of the same source file (e.g., the receiving-end of the messaging service ensured the messages from the same client arrived in a fixed order).
This recipe solves this problem, under the following assumptions: o the source files are named "Prefix_X", with the same Prefix, and X being a 0-bades integer index of the file (e.g., 0, 1, ..., n-1 for n source files) o the shuffle-merge (output) file is "sm-Prefix"
The script prints a verification message every 10K lines parsed, and seems to be performing in under 10s on a 1M lines set of input.
Possible improvements [est.difficulty: trivial