Shuffle Merge Files « Python recipes

This recipe shuffle-merges files: it interlaces many small text files into one large text file, while preserving the order of the lines within each small file.

#!/usr/bin/python

"""
NAME

    shuffle-merge -- shuffle-merge text files

SYNOPSIS
    %(progname)s [OPTIONS] <File Name Prefix>

DESCRIPTION
    shuffle-merge merges a number of text files. The order of merging is
    selected with a random policy.
    
OPTIONS:
    Arguments:
    --help, -h
      Print a summary of the program options and exit.
    
    --nprocs=<int>, -n <int>
      number of input files, one per processor [default=8]
    
    --maxlines=<int>, -m <int>
      max number of lines copied from one file per turn [default=20]
      
"""

__rev = "1.0"
__author__ = 'Alexandru Iosup'
__email__ = 'A.Iosup at ewi.tudelft.nl'
__file__ = 'shuffle-merge.py'
__version__ = '$Revision: %s$' % __rev
__date__ = '$Date: 2005/08/15 16:59:00 $'
__copyright__ = 'Copyright (c) 2005 Alexandru IOSUP'
__license__ = 'Python'


import sys
import os
import getopt
import random
import time


def ShuffleMerge( InFilePrefix, NProcs, MaxLines ):
    """ 
    shuffle-merges files InFilePrefix_X, X in { 0, 1, ... NProcs-1 } and
    stores the result into sm-InFilePrefix.
    
    Notes: does NOT check if the input files are available.
    """
    
    NProcs = int(NProcs)
    MaxLines = int(MaxLines)
    
    #-- init random seed
    random.seed(time.time())
    
    
    OutFileName = "sm-%s" % InFilePrefix
    OutFile = open( OutFileName, "w" )
    
    InFileNames = {}
    InFiles = {}
    InFileFinished = {}
    
    ProcsIDList = range(NProcs)
    
    for index in ProcsIDList:
        InFileNames[index] = "%s_%d" % (InFilePrefix, index)
        InFiles[index] = open( InFileNames[index], "r" )
        InFileFinished[index] = 0
        
    nReadLines = 0
    while 1:
        
        #-- make a list of all input files not finished yet
        ListOfNotFinished = []
        for index in ProcsIDList:
            if InFileFinished[index] == 0:
                ListOfNotFinished.append(index)
                
        #-- randomly select an input file
        if len(ListOfNotFinished) == 0:
            break
        # random.choice also covers the case of a single remaining file
        ProcID = random.choice(ListOfNotFinished)
            
        #-- randomly copy 1 to MaxLines lines of it to the output file
        nLinesToGet = random.randint( 1, MaxLines )
        try:
            for lineNo in range(nLinesToGet):
                line = InFiles[ProcID].readline()
                if len(line) > 0:
                    OutFile.write( line )
                    nReadLines = nReadLines + 1
                    if nReadLines % 10000 == 0:
                        # single-argument print() works on Python 2 and 3
                        print("nReadLines %d [last read %d from %s / %s]" %
                              (nReadLines, nLinesToGet, ProcID, ListOfNotFinished))
                else:
                    # end of file: mark it finished and stop reading from it
                    InFileFinished[ProcID] = 1
                    break
        except KeyError as e:
            print("Got wrong array index: %s" % e)
        except IOError as e:
            print("I/O error(%s): %s" % (e.errno, e.strerror))
            InFileFinished[ProcID] = 1
        
    # final report; uses only names that are defined even when NProcs is 0
    print("nReadLines %d [all input files finished]" % nReadLines)
        
    OutFile.close()
    for index in ProcsIDList:
        InFiles[index].close()
        

def usage(progname):
    print(__doc__ % vars())


def main(argv):                  

    OneCharOpts = "hn:m:"
    MultiCharList = [
        "help", "nprocs=", "maxlines="
        ]
        
    try:                                
        opts, args = getopt.getopt( argv, OneCharOpts, MultiCharList )
    except getopt.GetoptError:
        usage(os.path.basename(sys.argv[0]))
        sys.exit(2)
    
    NProcs = 8
    MaxLines = 20
    FileNamePrefix = "ttt"
    
    for opt, arg in opts:
        if opt in ["-h", "--help"]:
            usage(os.path.basename(sys.argv[0]))
            sys.exit()
        elif opt in ["-n", "--nprocs"]: 
            NProcs = arg.strip() 
        elif opt in ["-m", "--maxlines"]: 
            MaxLines = arg.strip() 
            
    if len(args) >= 1:
        FileNamePrefix = args[0]
            
    ShuffleMerge( FileNamePrefix, NProcs, MaxLines )

if __name__ == "__main__":

    main(sys.argv[1:]) 
    

      

In a scientific simulation process, it is common to need to combine multiple source files into one final file that becomes the input for another stage of the process, while preserving the order of the lines within each source file. For example, when simulating the way checked messages arrive at a centralized component (e.g., a central database server in a distributed banking service), the final file needs to combine all source files in a random way (e.g., the messages arrive at their own pace, disturbed by the transfer over the Internet), while preserving the order between lines of the same source file (e.g., the receiving end of the messaging service ensures that messages from the same client arrive in a fixed order).
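The idea can be illustrated in memory, without any files. The sketch below is a minimal Python 3 illustration and not the recipe's own code: the name `shuffle_merge_lists` and its parameters are hypothetical, and plain lists of strings stand in for the source files.

```python
import random


def shuffle_merge_lists(sources, max_lines=20, rng=None):
    """Interleave several lists at random, keeping each list's own order.

    Hypothetical helper for illustration only; `sources` is a list of
    lists standing in for the source files.
    """
    if rng is None:
        rng = random.Random()
    positions = [0] * len(sources)      # next unread index per source
    merged = []
    while True:
        # indices of the sources that still have unread items
        unfinished = [i for i, src in enumerate(sources)
                      if positions[i] < len(src)]
        if not unfinished:
            break
        i = rng.choice(unfinished)
        # copy a run of 1..max_lines consecutive items from that source
        count = rng.randint(1, max_lines)
        start = positions[i]
        merged.extend(sources[i][start:start + count])
        positions[i] = min(start + count, len(sources[i]))
    return merged


clients = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
merged = shuffle_merge_lists(clients, max_lines=2, rng=random.Random(0))
print(merged)
```

Whatever interleaving is drawn, "a1" comes before "a2" and "a2" before "a3", because items are only ever taken from the front of the unread part of each list.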

This recipe solves this problem under the following assumptions:
  o the source files are named "Prefix_X", with the same Prefix, where X is the 0-based integer index of the file (e.g., 0, 1, ..., n-1 for n source files)
  o the shuffle-merge (output) file is "sm-Prefix"
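The naming convention can be spelled out in a few lines. This is a sketch under the assumptions listed above; the helper names `input_names`, `output_name` and `missing_inputs` are hypothetical and do not appear in the recipe, which does not check that its input files exist.

```python
import os


def input_names(prefix, n_files):
    """Names of the n_files source files: prefix_0 ... prefix_(n_files-1)."""
    return ["%s_%d" % (prefix, i) for i in range(n_files)]


def output_name(prefix):
    """Name of the shuffle-merged output file."""
    return "sm-%s" % prefix


def missing_inputs(prefix, n_files):
    """Hypothetical pre-flight check: which source files do not exist?"""
    return [name for name in input_names(prefix, n_files)
            if not os.path.isfile(name)]


print(input_names("ttt", 3))   # ['ttt_0', 'ttt_1', 'ttt_2']
print(output_name("ttt"))      # sm-ttt
```

Calling `missing_inputs` before the merge would be one way to report absent files up front instead of failing on `open`.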

The script prints a verification message every 10K lines read, and seems to run in under 10 s on an input set of 1M lines.
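The periodic message only reports how many lines have been copied. To check that the per-file order actually survived the merge, one option is a small after-the-fact test like the sketch below. It is an illustration under assumptions: `is_subsequence` and `preserves_order` are hypothetical names, and the check is only conclusive when no line text occurs in more than one source.

```python
def is_subsequence(needle, haystack):
    """True if all items of needle occur in haystack in the same order."""
    it = iter(haystack)
    # `item in it` consumes the iterator, so matches must occur in order
    return all(item in it for item in needle)


def preserves_order(merged, sources):
    """Hypothetical check: merged has every source line, in source order."""
    same_size = len(merged) == sum(len(src) for src in sources)
    return same_size and all(is_subsequence(src, merged) for src in sources)


sources = [["a1", "a2", "a3"], ["b1", "b2"]]
print(preserves_order(["a1", "b1", "a2", "a3", "b2"], sources))  # True
print(preserves_order(["a2", "a1", "b1", "a3", "b2"], sources))  # False
```

For real files, the lists could be obtained with `readlines()` on each "Prefix_X" file and on "sm-Prefix".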

Possible improvements [est.difficulty: trivial

Tags: files

Shuffle Merge Files (Python recipe) by Alexandru Iosup
ActiveState Code (http://code.activestate.com/recipes/439319/)
