Python script to figure out how to best fit a collection of data into a container (such as a DVD) while avoiding to waste free space.
#!/usr/bin/env python
# mixnmatch.py - find combination of files/dirs that sum below a given threshold
# -- Jose Fonseca
import os
import os.path
import optparse
import sys
from sets import ImmutableSet as set
def get_size(path, block_size):
    if os.path.isdir(path):
        result = 0
        for name in os.listdir(path):
            size = get_size(os.path.join(path, name), block_size)
            size = (size + block_size - 1)//block_size*block_size
            result += size
        return result
    else:
        return os.path.getsize(path)
def mix_and_match(limit, items, verbose = False):
    # filter items
    items = [(size, name) for size, name in items if size <= limit]
    # sort them by size
    items.sort(lambda (xsize, xname), (ysize, yname): cmp(xsize, ysize))
    # initialize variables
    added_collections = dict([(set([name]), size) for size, name in items])
    collections = added_collections
    while True:
        if verbose:
            sys.stderr.write("%d\n" % len(collections))
        # find unique combinations of the recent collections
        new_collections = {}
        for names1, size1 in added_collections.iteritems():
            for size2, name2 in items:
                size3 = size1 + size2
                if size3 > limit:
                    # we can break here as all collections that follow are
                    # bigger in size due to the sorting above
                    break
                if name2 in names1:
                    continue
                names3 = names1.union(set([name2]))
                if names3 in new_collections:
                    continue
                new_collections[names3] = size3
        if len(new_collections) == 0:
            break
        collections.update(new_collections)
        added_collections = new_collections
    return [(size, names) for names, size in collections.iteritems()]
def main():
    parser = optparse.OptionParser(usage="\n\t%prog [options] path ...")
    parser.add_option(
        '-l', '--limit',
        type="int", dest="limit", default=4700000000,
        help="total size limit")
    parser.add_option(
        '-B', '--block-size',
        type="int", dest="size", default=2048,
        help="use this block size")
    parser.add_option(
        '-s', '--show',
        type="int", dest="show", default=10,
        help="number of combinations to show")
    parser.add_option(
        '-v', '--verbose',
        action="store_true", dest="verbose", default=False,
        help="verbose output")
    (options, args) = parser.parse_args(sys.argv[1:])
    limit = options.limit
    block_size = options.size
    items = [(get_size(arg, block_size), arg) for arg in args]
    collections = mix_and_match(limit, items, options.verbose)
    collections.sort(lambda (xsize, xnames), (ysize, ynames): cmp(xsize, ysize))
    if options.show != 0:
        collections = collections[-options.show:]
    for size, names in collections:
        percentage = 100.0*float(size)/float(limit)
        try:
            sys.stdout.write("%10d\t%02.2f%%\t%s\n" % (size, percentage, " ".join(names)))
        except IOError:
            # ignore broken pipe
            pass

if __name__ == '__main__':
    main()
Have you ever done mental math to figure out how best to fit a collection of data onto a set of DVDs, trying to squeeze the most onto every single disc? It happens to me more and more, so I wrote a Python script to do it for me.
The algorithm used to efficiently find the largest path combinations below a threshold is inspired by the Apriori algorithm for association rule discovery. Since every large combination below the threshold is a superset of smaller combinations that are also below it, we can build them bottom-up: start from single paths, combine those with the initial items to make two-item sets while discarding everything over the threshold, then three-item sets, four-item sets, and so on, until no larger combination below the threshold can be found.
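The recipe above targets Python 2 (sets.ImmutableSet, dict.iteritems, tuple-argument lambdas). As a rough illustration of the same level-wise growth for modern Python, here is a minimal standalone sketch; the name mix_and_match3 is just illustrative, not part of the recipe:

```python
def mix_and_match3(limit, items):
    # items: list of (size, name) tuples; drop anything that cannot fit at all
    items = sorted((s, n) for s, n in items if s <= limit)
    # level 1: every single item forms a one-element collection
    collections = {frozenset([n]): s for s, n in items}
    frontier = dict(collections)
    while frontier:
        new = {}
        # grow each k-item collection by one more item
        for names, size in frontier.items():
            for s2, n2 in items:
                total = size + s2
                if total > limit:
                    break  # items are sorted by size: the rest are bigger
                if n2 in names:
                    continue
                grown = names | {n2}
                if grown not in new:
                    new[grown] = total
        collections.update(new)
        frontier = new
    return sorted(((s, names) for names, s in collections.items()),
                  key=lambda t: t[0])
```

The structure mirrors the recipe: each pass only extends the collections added in the previous pass, and the early break relies on the items being sorted by size.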
Note 2006/03/30: the recipe has been updated to count sizes using a block size, and to show the results in reverse order.
Minor detail. I didn't try your algorithm, I just browsed it quickly. Since I wrote a similar one (in my case, for nighttime downloading of update packages in rpm format and then splitting them to record on CDs -- DVDs weren't that popular then), I believe you could get more accurate results if you round the file sizes up to sector granularity (i.e., size = (size + 2047)//2048*2048).
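The rounding the commenter suggests can be expressed as a small helper (2048 bytes is the DVD sector size; the recipe applies the same expression per directory entry inside get_size):

```python
def round_up(size, block=2048):
    # round a byte count up to the next multiple of the block (sector) size
    return (size + block - 1) // block * block
```

A 1-byte file thus still occupies a full 2048-byte sector on disc.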
Trying... I tried to run your script and found these issues:
- the best match is written at the top of the list, so I must scroll the console back (on Linux) or run the script through a paging program (on Windows) to see it. My fix for this script: reversed(collections) in the last loop (requires Python 2.4);
- the user must specify every item on the command line, which is unhelpful;
- grouped items are printed as one big line, which kills my eyes ;)
Thank you for the reminder; I'll publish my script for this job here later.
Granularity. At some point I was considering adding a block size option, but then I realized there was the issue of what to do with directories and gave up. But you're right: it would be more accurate.
Usability. As you may have guessed, the program is fitted to my usage pattern, which (on Linux) is always something like:
So I always select the items from one of the top lines and paste them back onto the command line. I admit a GUI would be nice, but providing one is a tad beyond my intent.
Nevertheless, reverse sorting is a good idea. It's just a matter of taking away the minus sign in
I shall post an updated version of the script later, after gathering all the feedback.
Symlink problem. This program crashes badly if there is a symbolic link whose target does not exist.
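One way to harden get_size against this (a sketch only; the recipe itself does not do it) is to stat entries with os.lstat, which examines the link itself rather than following it, so a dangling symlink reports its own small size instead of raising:

```python
import os

def safe_size(path):
    # os.lstat looks at the entry itself, so a dangling symlink
    # yields the link's own size instead of raising OSError
    try:
        return os.lstat(path).st_size
    except OSError:
        # unreadable or vanished entry: count it as empty
        return 0
```

The hypothetical safe_size would replace the os.path.getsize call in the leaf branch of get_size.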
Brute force approach. Here, just performing a brute-force computation always seems faster... Here is a drop-in replacement for the mix_and_match() function:

def mix_and_match(limit, items, verbose = False):
    def powerset(seq):
        if len(seq) > 0:
            head = powerset(seq[:-1])
            return head + [item + [seq[-1]] for item in head]
        else:
            return [[]]
    return [item
            for item in [(sum([size for size, name in subset]),
                          set([name for size, name in subset]))
                         for subset in powerset(items)]
            if item[0] <= limit]
Brute force approach (reloaded). No idea what happened above... Here it goes:
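A self-contained Python 3 rendering of the same brute-force idea (the name mix_and_match_bf is illustrative; the original posted it as a drop-in replacement for mix_and_match()):

```python
def mix_and_match_bf(limit, items):
    # enumerate the full power set of items (2**n subsets) and keep
    # those whose total size stays within the limit
    def powerset(seq):
        if seq:
            head = powerset(seq[:-1])
            return head + [subset + [seq[-1]] for subset in head]
        return [[]]

    results = []
    for subset in powerset(items):
        total = sum(size for size, name in subset)
        if total <= limit:
            results.append((total, frozenset(name for size, name in subset)))
    return results
```

Note that, unlike the Apriori-style version, this also returns the empty combination, and its cost grows exponentially with the number of items, so it only wins for small inputs.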