This recipe can be used to sort big files (much bigger than the available RAM) according to a key. The sort is guaranteed to be stable on Python 2.3 and later, where list.sort() itself is stable.
from heapq import heappop, heappush
from itertools import islice, cycle
from tempfile import gettempdir
import os

def merge(chunks, key=None):
    """Merge already-sorted chunk files into a single sorted stream,
    deleting each chunk file as soon as it is exhausted."""
    if key is None:
        key = lambda x: x

    # Prime the heap with the first value of each chunk. The index acts as a
    # tie-breaker, keeping the merge stable for equal keys.
    values = []
    for index, chunk in enumerate(chunks):
        try:
            iterator = iter(chunk)
            value = iterator.next()
        except StopIteration:
            try:
                chunk.close()
                os.remove(chunk.name)
                chunks.remove(chunk)
            except:
                pass  # best-effort cleanup
        else:
            heappush(values, (key(value), index, value, iterator, chunk))

    while values:
        k, index, value, iterator, chunk = heappop(values)
        yield value
        try:
            value = iterator.next()
        except StopIteration:
            try:
                chunk.close()
                os.remove(chunk.name)
                chunks.remove(chunk)
            except:
                pass  # best-effort cleanup
        else:
            heappush(values, (key(value), index, value, iterator, chunk))

def batch_sort(input, output, key=None, buffer_size=32000, tempdirs=[]):
    if not tempdirs:
        tempdirs.append(gettempdir())

    # Split the input into sorted chunk files of at most buffer_size lines.
    input_file = file(input, 'rb', 64*1024)
    try:
        input_iterator = iter(input_file)
        chunks = []
        output_chunk = None
        try:
            for tempdir in cycle(tempdirs):
                current_chunk = list(islice(input_iterator, buffer_size))
                if current_chunk:
                    current_chunk.sort(key=key)
                    output_chunk = file(os.path.join(tempdir, '%06i' % len(chunks)), 'w+b', 64*1024)
                    output_chunk.writelines(current_chunk)
                    output_chunk.flush()
                    output_chunk.seek(0)
                    chunks.append(output_chunk)
                else:
                    break
        except:
            # Something went wrong while building the chunks : remove them all.
            for chunk in chunks:
                try:
                    chunk.close()
                    os.remove(chunk.name)
                except:
                    pass
            if output_chunk is not None and output_chunk not in chunks:
                try:
                    output_chunk.close()
                    os.remove(output_chunk.name)
                except:
                    pass
            return
    finally:
        input_file.close()

    # Merge the sorted chunks into the output file, then clean up.
    output_file = file(output, 'wb', 64*1024)
    try:
        output_file.writelines(merge(chunks, key))
    finally:
        for chunk in chunks:
            try:
                chunk.close()
                os.remove(chunk.name)
            except:
                pass
        output_file.close()

if __name__ == '__main__':
    import optparse
    parser = optparse.OptionParser()
    parser.add_option(
        '-b', '--buffer',
        dest='buffer_size',
        type='int', default=32000,
        help='''Size of the line buffer. The file to sort is
        divided into chunks of that many lines. Default : 32,000 lines.'''
    )
    parser.add_option(
        '-k', '--key',
        dest='key',
        help='''Python expression used to compute the key for each
        line, "lambda line:" is prepended.\n
        Example : -k "line[5:10]". By default, the whole line is the key.'''
    )
    parser.add_option(
        '-t', '--tempdir',
        dest='tempdirs',
        action='append',
        default=[],
        help='''Temporary directory to use. You might get performance
        improvements if the temporary directory is not on the same physical
        disk as the input and output directories. You can even try
        providing multiple directories on different physical disks.
        Use multiple -t options to do that.'''
    )
    parser.add_option(
        '-p', '--psyco',
        dest='psyco',
        action='store_true',
        default=False,
        help='''Use Psyco.'''
    )
    options, args = parser.parse_args()

    if options.key:
        options.key = eval('lambda line : (%s)' % options.key)
    if options.psyco:
        import psyco
        psyco.full()

    batch_sort(args[0], args[1], options.key, options.buffer_size, options.tempdirs)
The memory used is at most buffer_size * (maximum line length) bytes, plus a 64 Kb buffer per chunk (there are as many chunks as lines in the file divided by buffer_size). In practice one should provide a big buffer_size, since the limiting factor is often the maximum number of files a process can open simultaneously. The merge step performs an n-way merge between all the chunks at once; I should try to perform partial merges to work around this limitation.
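A rough sketch of what such partial merging could look like, reusing merge() from the recipe (the group_size value, the intermediate file naming and the tempdir argument below are illustrative choices, not part of the recipe):

import os

def multipass_merge(chunks, tempdir, key=None, group_size=100):
    # Merge at most group_size chunk files per pass, writing each merged
    # group to an intermediate file, until a single final merge fits under
    # the open-file limit. merge() deletes its input files as it goes.
    pass_number = 0
    while len(chunks) > group_size:
        next_round = []
        for start in range(0, len(chunks), group_size):
            group = chunks[start:start + group_size]
            name = os.path.join(tempdir, 'pass%02i_%06i' % (pass_number, len(next_round)))
            merged = file(name, 'w+b', 64*1024)
            merged.writelines(merge(group, key))
            merged.flush()
            merged.seek(0)
            next_round.append(merged)
        chunks = next_round
        pass_number += 1
    return merge(chunks, key)   # final merge, at most group_size files open

Each pass keeps at most group_size + 1 files open, at the cost of rewriting the data once per pass.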
The Python Cookbook already contains a big file sort implementation (look for "big file sorting"), but I feel that this version is more terse and Pythonic.
Usage example :
sort.py -k "(line[5:10],int(line[10:16]))" input.txt output.txt
This sorts lexicographically on the text found in columns 5 to 9 inclusive, then numerically on the number found in columns 10 to 15 inclusive.
Performance :
See for yourself :) I find this script quite competitive thanks to :
1) the awesome speed of Python's sort 2) the no less awesome speed of Python's heapq 3) the use of file.writelines 4) a trick I found to implement the merge, which is a variant of the Schwartzian transform wherein the iterator is included in the transformed tuple.
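Stripped of the file handling and cleanup, that trick boils down to something like this (an illustrative in-memory version, not the recipe's merge()):

from heapq import heappush, heappop

def merge_sorted(iterables, key=None):
    # Each heap entry carries (key(value), index, value, iterator), so popping
    # the smallest key also hands back the iterator to refill the heap from.
    if key is None:
        key = lambda x: x
    heap = []
    for index, iterable in enumerate(iterables):
        iterator = iter(iterable)
        try:
            value = iterator.next()
        except StopIteration:
            continue
        heappush(heap, (key(value), index, value, iterator))
    while heap:
        k, index, value, iterator = heappop(heap)
        yield value
        try:
            value = iterator.next()
        except StopIteration:
            continue
        heappush(heap, (key(value), index, value, iterator))

print list(merge_sorted([[1, 4, 7], [2, 5], [3, 6]]))   # -> [1, 2, 3, 4, 5, 6, 7]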
Update (1.1) : integrated changes by Ian Bicking.
Update (1.2) : added the possibility of specifying temporary directories, which can give a performance boost if they reside on different physical disks than the input or output files.
Update (1.3,1.4) : fixed typos.
Update (1.5) : This version now does its best to clean up the temporary files even when an error is raised during the sort.
tweaks and times. You can do x.sort(key=None), so you don't have to test for "key is not None". Testing "if len(current_chunk)>0" is a long way of saying "if current_chunk". And the variable "position" in merge doesn't seem necessary -- the stability of the first .sort() should be sufficient.
It would be interesting to benchmark this against GNU sort, which I hear is really good at sorting large files. Obviously GNU sort will win, but by how much? Well, I can figure that out for myself I suppose; with a file with a million random lines:
Three times slower than GNU sort... which is actually a lot better than I expected.
Yeah, and how much did Python's startup time contribute to the difference?
Thanks for the tips. I've modified the recipe according to your remarks. You're right, adding a position field to the merged tuples is unneeded since the iterators used by the merging function return items in a stable way. The index field is needed, though; otherwise, for equal key values, we would risk using the iterators in a different order than the original one.
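A tiny made-up example of that tie-break (not from the recipe): with equal keys, the heap falls back to comparing the index, so the chunk that came first in the original order wins.

from heapq import heappush, heappop

heap = []
heappush(heap, ('a', 1, 'a from chunk 1'))
heappush(heap, ('a', 0, 'a from chunk 0'))
print heappop(heap)[2]   # -> 'a from chunk 0'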
I've tested this script against GNU sort (textutils) 2.1 running under Win32, and it was quite competitive, only two to three times slower.
The point is that you get much more flexibility in the way you can define keys; I wouldn't know how to easily build a two-part key with different semantics (lexicographic and numeric) using GNU sort. The key definition on the command line is only a start; you can define much more complicated keys.
gotchas, cleanup, and common usage. A very useful recipe! But there is one 'gotcha' in v1.4: if the input file does not end with a newline, the output file will combine the last input line with the next line in the sort sequence. This is because file.writelines() does not add line separators.
In my case, I am working with text files that end with '\x1a' instead of a newline. GNU sort will put the '\x1a' character on its own line at the very beginning of the file, but the Python version combines the first and second lines. I hacked in a workaround by checking for end-of-file using input_file.tell(), but it added 20% to the processing time :( I figure there must be a more elegant solution.
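One possible direction (just a sketch, not something the recipe does): wrap the input iterator so that every line, including the last one, ends with a newline before it is sorted and written out.

def ensure_newlines(lines, newline='\n'):
    # Append a newline to any line that lacks one (normally only the last).
    for line in lines:
        if line.endswith(newline):
            yield line
        else:
            yield line + newline

# Inside batch_sort(), iter(input_file) would become ensure_newlines(input_file).
# The newline argument assumes '\n'-terminated lines; adjust for other endings.

This does append a newline to a final line that lacked one, which seems to match the GNU sort behaviour described above.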
Another problem I encountered is that chunk files are not cleaned up if the chunk creation process is interrupted. This could happen when:
the 'key' function throws an exception
KeyboardInterrupt (the user cancels the program; I do this all the time)
MemoryError (buffer_size is too large)
IOError (the disk is full)
I timed the (modified) function against our previous implementation, which uses os.system():
Not bad, 30% faster on an Athlon 800 MHz running Python 2.3, against a 2 Mb file with 2200 rows. The difference probably comes from the overhead of starting and monitoring the external process. Could someone verify this?
I would consider replacements for existing os.system() calls to be a more common case than command-line usage, but the time advantage may disappear since you will probably be calling 'sort' once on a large file, not 10 times on some small files.
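For that use case the function can be called directly; a minimal sketch (assuming the recipe is saved as sort.py on the path; the file names and key are made up):

from sort import batch_sort

batch_sort('transactions.txt', 'sorted_transactions.txt',
           key=lambda line: line[5:10], buffer_size=32000)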
One advantage this recipe has over GNU sort is that it is predictable. On Linux Fedora Core 3, GNU sort does some funny and annoying things because of the default locale settings. I prefer to use predictable pure Python rather than checking and setting environment variables in sort's calling shell.
As someone asked me, you can numerically sort a file containing one integer per line with this command :
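With the -k option described above, such a command would look something like this (the file names are placeholders):

sort.py -k "int(line)" input.txt output.txt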
Maybe I should add a
-K my_custom_key.py
option that would load the given script and use the function it contains to compute the key? This would enable anyone to use this sort with complicated key extraction functions, which is pretty much the point of this script, since for simple keys GNU sort will be much faster.

Performance is still not as critical as the ability to sort really big files within the system limit on the maximum number of open files. The maximum file size that can be sorted on the Windows platform, with a 512 max open files limit and the default buffer of 32,000 lines of 100 characters each on average, is 512*32000*100 = 1562.5 Mb.
I've compiled a spreadsheet to find out the maximum possible file size that can be sorted (not accounting for filesystem limits): https://spreadsheets.google.com/ccc?key=p7lZ2JoyePLYmKlds2oaLdg&hl=en#
A 32,000-line buffer weighs about 3.2 Mb per chunk, given 100-character strings. Total memory consumed in this case is 32000*100 + 512*64*1024 = 35 Mb.
There are many options for how to balance the buffer size. I would keep it at the maximum, but it is not clear how efficient sort() is, both in speed and memory, with large buffers. Having 512 chunks all over the disk will probably involve too many disk seeks and make caching inefficient. We can increase the buffer to a reasonable amount to keep the number of chunks small. For a 1562.5 Mb file, increasing the buffer to 64,000 lines (6.4 Mb) drops the handle count to 256, and also cuts total memory usage from 35 Mb to just 22 Mb. In general, for files under 4 Gb, a buffer size of 128,000 lines (100 chars each, about 12 Mb) is optimal (an approximate calculation). For files greater than 6 Gb, the buffer should be increased further to avoid running out of file descriptors.
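A quick back-of-the-envelope helper (my own sketch, not part of the recipe) that reproduces this arithmetic; all figures are rough estimates:

def estimate(buffer_lines=32000, avg_line_bytes=100, max_open_files=512,
             chunk_io_buffer=64*1024):
    chunk_bytes = buffer_lines * avg_line_bytes          # one sorted chunk on disk
    max_sortable = max_open_files * chunk_bytes          # limited by file handles
    # mirrors the recipe's bound : buffer_size * line length + 64 Kb per chunk
    peak_memory = chunk_bytes + max_open_files * chunk_io_buffer
    return max_sortable, peak_memory

size, memory = estimate()
print 'max sortable : %.1f Mb, peak memory : %.1f Mb' % (
    size / (1024.0*1024.0), memory / (1024.0*1024.0))
# with the defaults : about 1562.5 Mb sortable using roughly 35 Mb of memory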
Another question is the memory usage and effectiveness of the heapq module. If I remember correctly, the heapq module in Python 2.5 doesn't use the accelerated _heapq C implementation even though the latter is accessible through an import.
Recipe 576755 is just a rewrite of this one for Python 2.6, taking advantage of heapq.merge, context managers, and other niceties of newer Python versions.