
This recipe can be used to sort big files (much bigger than the available RAM) according to a key. The sort is guaranteed to be stable on Python 2.3.

Python, 135 lines
from heapq import heapify, heappop, heappush
from itertools import islice, cycle
from tempfile import gettempdir
import os

def merge(chunks, key=None):
    if key is None:
        key = lambda x: x

    values = []

    # Prime the heap with the first line of each chunk. The index field
    # breaks ties on equal keys, so chunks are consumed in their original
    # order (keeping the sort stable) and the iterator and file objects
    # in the tuple are never compared. Iterating over a copy of the list
    # avoids skipping entries when an exhausted chunk is removed.
    for index, chunk in enumerate(list(chunks)):
        try:
            iterator = iter(chunk)
            value = iterator.next()
        except StopIteration:
            try:
                chunk.close()
                os.remove(chunk.name)
                chunks.remove(chunk)
            except Exception:
                pass
        else:
            heappush(values, (key(value), index, value, iterator, chunk))

    # n-way merge: pop the smallest line, then push the next line from
    # the same chunk, until every chunk is exhausted.
    while values:
        k, index, value, iterator, chunk = heappop(values)
        yield value
        try:
            value = iterator.next()
        except StopIteration:
            try:
                chunk.close()
                os.remove(chunk.name)
                chunks.remove(chunk)
            except Exception:
                pass
        else:
            heappush(values, (key(value), index, value, iterator, chunk))

def batch_sort(input, output, key=None, buffer_size=32000, tempdirs=None):
    # A mutable default argument would be shared between calls, so the
    # fallback to the system temporary directory happens here instead.
    if not tempdirs:
        tempdirs = [gettempdir()]

    input_file = file(input, 'rb', 64*1024)
    try:
        input_iterator = iter(input_file)

        # Slice the input into sorted chunk files, cycling over the
        # available temporary directories.
        chunks = []
        output_chunk = None
        try:
            for tempdir in cycle(tempdirs):
                current_chunk = list(islice(input_iterator, buffer_size))
                if not current_chunk:
                    break
                current_chunk.sort(key=key)
                output_chunk = file(os.path.join(tempdir, '%06i' % len(chunks)), 'w+b', 64*1024)
                output_chunk.writelines(current_chunk)
                output_chunk.flush()
                output_chunk.seek(0)
                chunks.append(output_chunk)
        except:
            # Clean up every chunk file, including a partially written
            # one that was never appended to the chunks list.
            if output_chunk is not None and output_chunk not in chunks:
                chunks.append(output_chunk)
            for chunk in chunks:
                try:
                    chunk.close()
                    os.remove(chunk.name)
                except Exception:
                    pass
            return
    finally:
        input_file.close()

    # Merge the sorted chunks into the output file; merge() removes the
    # temporary files as they are exhausted, and the finally clause
    # catches any that remain after an error.
    output_file = file(output, 'wb', 64*1024)
    try:
        output_file.writelines(merge(chunks, key))
    finally:
        for chunk in chunks:
            try:
                chunk.close()
                os.remove(chunk.name)
            except Exception:
                pass
        output_file.close()

if __name__ == '__main__':
    import optparse
    parser = optparse.OptionParser()
    parser.add_option(
        '-b','--buffer',
        dest='buffer_size',
        type='int',default=32000,
        help='''Size of the line buffer. The file to sort is
            divided into chunks of that many lines. Default : 32,000 lines.'''
    )
    parser.add_option(
        '-k','--key',
        dest='key',
        help='''Python expression used to compute the key for each
            line, "lambda line:" is prepended.\n
            Example : -k "line[5:10]". By default, the whole line is the key.'''
    )
    parser.add_option(
        '-t','--tempdir',
        dest='tempdirs',
        action='append',
        default=[],
        help='''Temporary directory to use. You might get performance
            improvements if the temporary directory is not on the same
            physical disk as the input and output directories. You can
            even try providing multiple directories on different
            physical disks. Use multiple -t options to do that.'''
    )
    parser.add_option(
        '-p','--psyco',
        dest='psyco',
        action='store_true',
        default=False,
        help='''Use Psyco.'''
    )
    options,args = parser.parse_args()
    
    if options.key:
        options.key = eval('lambda line : (%s)'%options.key)
    
    if options.psyco:
        import psyco
        psyco.full()
    
    batch_sort(args[0],args[1],options.key,options.buffer_size,options.tempdirs) 

The memory used is at most buffer_size * (maximum line length), plus a 64 KB buffer per chunk (there are about (number of lines in the file) / buffer_size chunks). In practice one should provide a big buffer_size, since the limiting factor is often the maximum number of files a process can open simultaneously. The merge performs an n-way merge between all the chunks; I should try to perform partial merges to work around this limitation.
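The chunk-then-merge structure can be sketched in a few lines of modern Python (a simplified sketch of the idea, not the recipe itself; `heapq.merge`, available since Python 2.6, performs the n-way merge):

```python
import heapq
import tempfile
from itertools import islice

def external_sort(lines, chunk_size, key=None):
    # Sort an iterable of text lines via sorted on-disk chunks plus an
    # n-way heapq.merge -- the same structure as the recipe above.
    chunks = []
    it = iter(lines)
    try:
        while True:
            chunk = sorted(islice(it, chunk_size), key=key)
            if not chunk:
                break
            f = tempfile.TemporaryFile('w+')
            f.writelines(chunk)
            f.seek(0)          # rewind so the merge reads from the start
            chunks.append(f)
        return list(heapq.merge(*chunks, key=key))
    finally:
        for f in chunks:
            f.close()
```

Each chunk is at most `chunk_size` lines in memory at once; the merge only ever holds one line per chunk.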

The Python Cookbook already contains a big file sort implementation (look for "big file sorting"), but I feel that this version is more terse and Pythonic.

Usage example:

sort.py -k "(line[5:10],int(line[10:16]))" input.txt output.txt

This sorts lexicographically on the text found in columns 5 to 9 inclusive, then numerically on the number found in columns 10 to 15 inclusive.
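Spelled out, the -k expression becomes this key function (the sample line below is made up for illustration):

```python
# Key from the example above: text in columns 5-9, integer in 10-15.
key = lambda line: (line[5:10], int(line[10:16]))

sample = "AAAAAhello000042 rest of the line\n"
assert key(sample) == ("hello", 42)
```

Because the key is a tuple, the tuple comparison gives the two-level sort for free.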

Performance:

See for yourself :) I find this script quite competitive thanks to:

1) the awesome speed of Python's sort
2) the no less awesome speed of Python's heapq
3) the use of file.writelines
4) a trick I've found to implement the merge sort, a variant of the Schwartzian transform in which the iterator is included in the transformed tuple.

Update (1.1): integrated changes by Ian Bicking.

Update (1.2): added the possibility of specifying temporary directories, which can give a performance boost if they reside on different physical disks than the input and output files.

Update (1.3, 1.4): fixed typos.

Update (1.5): this version now does its best to clean up the temporary files even when an error is raised during the sort.

7 comments

Ian Bicking 15 years, 10 months ago

tweaks and times. You can do x.sort(key=None), so you don't have to test for "key is not None". Testing "if len(current_chunk)>0" is a long way of saying "if current_chunk". And the variable "position" in merge doesn't seem necessary -- the stability of the first .sort() should be sufficient.
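Both simplifications are easy to check (a trivial sketch, not from the recipe):

```python
# list.sort accepts key=None explicitly: it behaves as a plain sort.
data = [3, 1, 2]
data.sort(key=None)
assert data == [1, 2, 3]

# An empty chunk is already falsy, so "if current_chunk" suffices.
assert not []
assert ['some line\n']
```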

It would be interesting to benchmark this against GNU sort, which I hear is really good at sorting large files. Obviously GNU sort will win, but by how much? Well, I can figure that out for myself I suppose; with a file with a million random lines:

$ time sort test_big.txt > tmp.txt
real    0m4.017s
user    0m2.584s
sys     0m0.396s

$ time python large_sort.py test_big.txt tmp.txt
real    0m8.436s
user    0m7.454s
sys     0m0.322s

Three times slower than GNU sort... which is actually a lot better than I expected.

Ed Blake 15 years, 10 months ago

Yeah, and how much did Python's startup time contribute to the difference?

Nicolas Lehuen (author) 15 years, 10 months ago

Thanks for the tips. I've modified the recipe according to your remarks. You're right, adding a position field to the merged tuples is unneeded, since the iterators used by the merging function return items in a stable way. The index field is needed, though: otherwise, for equal key values, the iterators might be consumed in a different order than the original one.
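The role of the index field can be demonstrated with plain tuples (illustrative values, not real file chunks): on equal keys the heap falls back to the index, so chunks are consumed in their original order, and the iterator and file objects that follow in the real tuples are never compared.

```python
import heapq

heap = []
for index, value in enumerate(['b1', 'a1', 'a2']):
    # key is the first letter; 'a1' and 'a2' share the key 'a'
    heapq.heappush(heap, (value[0], index, value))

# Equal keys come out in index (i.e. original chunk) order.
order = [heapq.heappop(heap)[2] for _ in range(3)]
assert order == ['a1', 'a2', 'b1']
```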

I've tested this script against GNU sort (textutils) 2.1 running under Win32, and it was quite competitive, only two to three times slower.

The point is that you get much more flexibility in the way you define keys; I wouldn't know how to easily build a two-part key with different semantics (lexicographic and numeric) using GNU sort. The key definition on the command line is only a start; you can define much more complicated keys.

Maris Fogels 15 years, 9 months ago

gotchas, cleanup, and common usage. A very useful recipe! But there is one 'gotcha' in v1.4. If the input file does not end with a newline, then the output file will combine the last input line with the next line in the sort sequence. This is because file.writelines() does not add line separators.

In my case, I am working with text files that end with '\x1a' instead of a newline. GNU sort will put the '\x1a' character on its own line at the very beginning of the file, but the Python version combines the first and second lines. I hacked in a workaround by checking for end-of-file using input_file.tell(), but it added 20% to the processing time :( I figure there must be a more elegant solution.
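One possible guard (a sketch; `ensure_newline` is a name made up here, not part of the recipe) is to normalize lines while filling the chunks, so the last line of the input always carries a separator before writelines() runs:

```python
def ensure_newline(lines, sep='\n'):
    # Yield each line with a guaranteed trailing separator, so that
    # writelines() on the sorted chunks cannot glue two lines together.
    for line in lines:
        if not line.endswith(sep):
            line += sep
        yield line
```

Wrapping the input iterator this way only pays the cost of one endswith() check per line, rather than a tell() call.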

Another problem I encountered is that chunk files are not cleaned up if the chunk creation process is interrupted. This could happen when:

  • the 'key' function throws an exception

  • KeyboardInterrupt (the user cancels the program; I do this all the time)

  • MemoryError (buffer_size is too large)

  • IOError (the disk is full)

I timed the (modified) function against our previous implementation, which uses os.system():

$ timeit.py -s 'import batchsort' \
'batchsort.batch_sort("test.TXT", "sorted.txt", lambda line: (line[210:223]))'
10 loops, best of 3: 2.74e+05 usec per loop

$ timeit.py -s 'import os' \
'os.system("sort -k 1.210,1.223 test.TXT > sorted2.txt")'
10 loops, best of 3: 3.95e+05 usec per loop

Not bad, 30% faster on an 800 MHz Athlon running Python 2.3, against a 2 MB file with 2200 rows. The difference is probably the overhead of starting and monitoring the external process. Could someone verify this?

I would consider replacements for existing os.system() calls to be a more common case than command-line usage, but the time advantage may disappear since you will probably be calling 'sort' once on a large file, not 10 times on some small files.

One advantage this recipe has over GNU sort is that it is predictable. On Linux Fedora Core 3, GNU sort does some funny and annoying things because of the default locale settings. I prefer predictable pure Python to checking and setting environment variables in sort's calling shell.

Nicolas Lehuen (author) 13 years, 2 months ago

As someone asked me, you can numerically sort a file containing one integer per line with this command:

python sort.py -k "int(line)" test.txt output.txt

Maybe I should add a -K my_custom_key.py option that would load the given script and use the function it contains to compute the key? That would let anyone use this sort with complicated key-extraction functions, which is pretty much the point of this script, since for simple keys GNU sort will be much faster.
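The -k expression above corresponds to this key function; a plain sorted() on a small list shows the difference from a lexicographic sort:

```python
lines = ['10\n', '2\n', '33\n', '4\n']

# Lexicographic order puts '10' before '2'.
assert sorted(lines) == ['10\n', '2\n', '33\n', '4\n']

# The -k "int(line)" key sorts numerically (int() ignores the newline).
assert sorted(lines, key=lambda line: int(line)) == ['2\n', '4\n', '10\n', '33\n']
```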

anatoly techtonik 13 years ago

Performance is still not as critical as the ability to sort really big files within the system limit on the maximum number of open files. The maximum file size that can be sorted on the Windows platform, with a 512 max-files limit and the default buffer of 32,000 lines of 100 characters each on average, is 512*32000*100 = 1562.5 MB.

I've compiled a spreadsheet to find out the maximum possible file size that can be sorted (not accounting for filesystem limits): https://spreadsheets.google.com/ccc?key=p7lZ2JoyePLYmKlds2oaLdg&hl=en#

A 32,000-line buffer weighs about 3.2 MB per chunk, given 100-character strings. Total memory consumed in this case is 32000*100 + 512*64*1024 = 35 MB.

There are many options for balancing the buffer size. I would keep it at maximum, but it is not clear how efficient sort() is, in both speed and memory, with large buffers. Having 512 chunks scattered over the disk will probably involve too many disk seeks and make caching inefficient. We can increase the buffer to a reasonable amount to keep the chunk count small. For a 1562.5 MB file, increasing the buffer to 64,000 lines (6.4 MB) drops the handle count to 256, and total memory usage from 35 MB to just 22 MB. In general, for files under 4 GB, a buffer size of 128,000 lines (100 chars each, about 12 MB) is optimal (an approximate calculation). For files greater than 6 GB, the buffer should be increased to avoid depleting file descriptors.
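The figures above can be reproduced with a little arithmetic (a sketch; the helper names are made up, and sizes use 1 MB = 1024*1024 bytes):

```python
MB = 1024.0 * 1024.0
max_files, line_len = 512, 100          # Windows handle limit, average line

def max_sortable_bytes(buffer_lines):
    # One chunk file per handle, buffer_lines lines per chunk.
    return max_files * buffer_lines * line_len

def memory_bytes(buffer_lines, chunk_count):
    # In-memory line buffer plus one 64 KB I/O buffer per open chunk.
    return buffer_lines * line_len + chunk_count * 64 * 1024

assert max_sortable_bytes(32000) / MB == 1562.5
assert round(memory_bytes(32000, 512) / MB) == 35
assert round(memory_bytes(64000, 256) / MB) == 22
```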

Another question is the memory usage and effectiveness of the heapq module. If I remember correctly, the heapq module in Python 2.5 doesn't use the accelerated _heapq C implementation, even though the latter is accessible through an import.

Gabriel Genellina 12 years, 6 months ago

Recipe 576755 is just a rewrite of this one for Python 2.6, taking advantage of heapq.merge, context managers, and other niceties of newer Python versions.
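For reference, the shape of such a rewrite looks roughly like this (a minimal sketch of the idea, not the actual Recipe 576755 code):

```python
import heapq
import os
import tempfile
from itertools import islice

def batch_sort(input_path, output_path, key=None, buffer_size=32000):
    # Split the input into sorted chunk files, then let heapq.merge
    # (Python 2.6+) do the n-way merge that the original recipe hand-rolls.
    chunks = []
    try:
        with open(input_path) as input_file:
            while True:
                chunk = sorted(islice(input_file, buffer_size), key=key)
                if not chunk:
                    break
                f = tempfile.NamedTemporaryFile('w+', delete=False)
                f.writelines(chunk)
                f.seek(0)
                chunks.append(f)
        with open(output_path, 'w') as output_file:
            output_file.writelines(heapq.merge(*chunks, key=key))
    finally:
        for f in chunks:
            f.close()
            os.remove(f.name)
```

The context managers and the single finally clause replace the recipe's nested try/except cleanup blocks.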