Welcome, guest | Sign In | My Account | Store | Cart

This recipe solves the problem of shuffle-merging files -- interlacing (shuffle-merging) many small text files into one large text file, while preserving the order of the lines from within the small files.

Python, 158 lines
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
#!/usr/bin/python

"""
NAME

    shuffle-merge -- shuffle-merge text files

SYNOPSIS
    %(progname)s [OPTIONS] <File Name Prefix>

DESCRIPTION
    shuffle-merge merges a number of text files. The order of merging is
    selected with a random policy.
    
OPTIONS:
    Arguments:
    --help 
      Print a summary of the program options and exit.
    
    --nprocs=<int>, -n <int>
      number of processors [default=8]
    
    --maxlines=<int>, -m <int>
      max number of lines read [default=20]
      
"""

__rev = "1.0"
__author__ = 'Alexandru Iosup'
__email__ = 'A.Iosup at ewi.tudelft.nl'
__file__ = 'shuffle-merge.py'
__version__ = '$Revision: %s$' % __rev
__date__ = '$Date: 2005/08/15 16:59:00 $'
__copyright__ = 'Copyright (c) 2005 Alexandru IOSUP'
__license__ = 'Python'


import sys
import os
import getopt
import string 
import random
import time


def ShuffleMerge( InFilePrefix, NProcs, MaxLines ):
    """ 
    shuffle-merges files InFilePrefix_X, X in { 0, 1, ... NProcs } and
    stores the result into sm-InFilePrefix.
    
    Notes: does NOT check if the input files are available.
    """
    
    NProcs = int(NProcs)
    MaxLines = int(MaxLines)
    
    #-- init random seed
    random.seed(time.time())
    
    
    OutFileName = "sm-%s" % InFilePrefix
    OutFile = open( OutFileName, "w" )
    
    InFileNames = {}
    InFiles = {}
    InFileFinished = {}
    
    ProcsIDList = range(NProcs)
    
    for index in ProcsIDList:
        InFileNames[index] = "%s_%d" % (InFilePrefix, index)
        InFiles[index] = open( InFileNames[index], "r" )
        InFileFinished[index] = 0
        
    nReadLines = 0
    while 1:
        
        #-- make a list of all input files not finished yet
        ListOfNotFinished = []
        for index in ProcsIDList:
            if InFileFinished[index] == 0:
                ListOfNotFinished.append(index)
                
        #-- randomly select an input file
        lenListOfNotFinished = len(ListOfNotFinished)
        if lenListOfNotFinished == 0:
            break
        elif lenListOfNotFinished == 1:
            ProcID = ListOfNotFinished[0]
        else: 
            # at least 2 elements in this list -> pick at random the proc ID
            ProcID = ListOfNotFinished[random.randint(0, lenListOfNotFinished - 1)]
            
        #-- randomly copy 1 to MaxLines lines of it to the output file
        nLinesToGet = random.randint( 1, MaxLines )
        try:
            for index in range(nLinesToGet):
                line = InFiles[ProcID].readline()
                if len(line) > 0:
                    OutFile.write( line )
                    nReadLines = nReadLines + 1
                    if nReadLines % 10000 == 0:
                        print "nReadLines", nReadLines, "[last read", nLinesToGet, \
                              "from", ProcID, "/", ListOfNotFinished, "]"
                else:
                    InFileFinished[ProcID] = 1
        except KeyError, e:
            print "Got wrong array index:", e
        except IOError, (errno, strerror):
            print "I/O error(%s): %s" % (errno, strerror)
            InFileFinished[ProcID] = 1
        
    print "nReadLines", nReadLines, "[last read", nLinesToGet, \
                  "from", ProcID, "/", ListOfNotFinished, "]"
        
    OutFile.close()
    for index in ProcsIDList:
        InFiles[index].close()
        

def usage(progname):
    print __doc__ % vars() 


def main(argv):                  

    OneCharOpts = "hn:m:"
    MultiCharList = [
        "help", "nprocs=", "maxlines="
        ]
        
    try:                                
        opts, args = getopt.getopt( argv, OneCharOpts, MultiCharList )
    except getopt.GetoptError:
        usage(os.path.basename(sys.argv[0]))
        sys.exit(2)
    
    NProcs = 8
    MaxLines = 20
    FileNamePrefix = "ttt"
    
    for opt, arg in opts:
        if opt in ["-h", "--help"]:
            usage(os.path.basename(sys.argv[0]))
            sys.exit()
        elif opt in ["-n", "--nprocs"]: 
            NProcs = arg.strip() 
        elif opt in ["-m", "--maxlines"]: 
            MaxLines = arg.strip() 
            
    if len(args) >= 1:
        FileNamePrefix = args[0]
            
    ShuffleMerge( FileNamePrefix, NProcs, MaxLines )

if __name__ == "__main__":

    main(sys.argv[1:]) 
    

In a scientific simulation process, it is not uncommon to need to combine multiple source files into one final file, which will become input for another stage in the process, while preserving the order of the lines from within the source files. For example, when simulating the way checked messages arrive to a centralized component (e.g., a central database server in a distributed banking service), the final file needs to combine all source files in a random way (e.g., the messages arrived at their pace, disturbed by the transfer over Internet), while preserving the order between lines of the same source file (e.g., the receiving-end of the messaging service ensured the messages from the same client arrived in a fixed order).

This recipe solves this problem, under the following assumptions: o the source files are named "Prefix_X", with the same Prefix, and X being a 0-bades integer index of the file (e.g., 0, 1, ..., n-1 for n source files) o the shuffle-merge (output) file is "sm-Prefix"

The script prints a verification message every 10K lines parsed, and seems to be performing in under 10s on a 1M lines set of input.

Possible improvements [est.difficulty: trivial In a scientific simulation process, it is not uncommon to need to combine multiple source files into one final file, which will become input for another stage in the process, while preserving the order of the lines from within the source files. For example, when simulating the way checked messages arrive to a centralized component (e.g., a central database server in a distributed banking service), the final file needs to combine all source files in a random way (e.g., the messages arrived at their pace, disturbed by the transfer over Internet), while preserving the order between lines of the same source file (e.g., the receiving-end of the messaging service ensured the messages from the same client arrived in a fixed order).

This recipe solves this problem, under the following assumptions: o the source files are named "Prefix_X", with the same Prefix, and X being a 0-bades integer index of the file (e.g., 0, 1, ..., n-1 for n source files) o the shuffle-merge (output) file is "sm-Prefix"

The script prints a verification message every 10K lines parsed, and seems to be performing in under 10s on a 1M lines set of input.

Possible improvements [est.difficulty: trivial