
Python script to figure out how to best fit a collection of data into a container (such as a DVD) while minimizing wasted free space.

Python, 105 lines
#!/usr/bin/env python
# mixnmatch.py - find combination of files/dirs that sum below a given threshold
# -- Jose Fonseca

import os
import os.path
import optparse
import sys

from sets import ImmutableSet as set


def get_size(path, block_size):
    if os.path.isdir(path):
        result = 0
        for name in os.listdir(path):
            size = get_size(os.path.join(path, name), block_size)
            size = (size + block_size - 1)//block_size*block_size
            result += size
        return result
    else:
        return os.path.getsize(path)


def mix_and_match(limit, items, verbose = False):

    # filter items
    items = [(size, name) for size, name in items if size <= limit]
    # sort them by size
    items.sort(lambda (xsize, xname), (ysize, yname): cmp(xsize, ysize))
    
    # initialize variables
    added_collections = dict([(set([name]), size) for size, name in items])
    collections = added_collections

    while True:
        if verbose:
            sys.stderr.write("%d\n" % len(collections))
        
        # find unique combinations of the recent collections 
        new_collections = {}
        for names1, size1 in added_collections.iteritems():
            for size2, name2 in items:
                size3 = size1 + size2
                if size3 > limit:
                    # we can break here as all collections that follow are
                    #  bigger in size due to the sorting above
                    break
                if name2 in names1:
                    continue
                names3 = names1.union(set([name2]))
                if names3 in new_collections:
                    continue
                new_collections[names3] = size3
    
        if len(new_collections) == 0:
            break
        
        collections.update(new_collections)
        added_collections = new_collections

    return [(size, names) for names, size in collections.iteritems()]


def main():
    parser = optparse.OptionParser(usage="\n\t%prog [options] path ...")
    parser.add_option(
        '-l', '--limit', 
        type="int", dest="limit", default=4700000000, 
        help="total size limit")
    parser.add_option(
        '-B', '--block-size', 
        type="int", dest="size", default=2048, 
        help="use this block size")
    parser.add_option(
        '-s', '--show', 
        type="int", dest="show", default=10, 
        help="number of combinations to show")
    parser.add_option(
        '-v', '--verbose', 
        action="store_true", dest="verbose", default=False, 
        help="verbose output")
    (options, args) = parser.parse_args(sys.argv[1:])

    limit = options.limit
    block_size = options.size
    
    items = [(get_size(arg, block_size), arg) for arg in args]
    
    collections = mix_and_match(limit, items, options.verbose)
    collections.sort(lambda (xsize, xnames), (ysize, ynames): cmp(xsize, ysize))
    if options.show != 0:
        collections = collections[-options.show:]
    
    for size, names in collections:
        percentage = 100.0*float(size)/float(limit)
        try:
            sys.stdout.write("%10d\t%02.2f%%\t%s\n" % (size, percentage, " ".join(names)))
        except IOError:
            # ignore broken pipe
            pass 


if __name__ == '__main__':
    main()

Have you ever done mental math to figure out how to best fit a collection of data onto a set of DVDs, trying to squeeze the most onto every single DVD? It happens to me more and more, so I wrote a Python script to do it for me.

The algorithm used to efficiently find the largest path combinations below a threshold is inspired by the Apriori algorithm for association rule discovery. Since every large path combination is a superset of smaller combinations, we can build them bottom-up: start from single paths, combine them with the initial items to make two-item sets while discarding anything larger than the threshold, then three-item sets, four-item sets, and so on, until no larger combination below the threshold can be found.
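
For reference, here is a compact sketch of that level-wise growth; the function name and the frozenset keys are illustrative only, while the recipe below keeps the same idea in Python 2 syntax:

def level_wise_combinations(limit, items):
    # items: (size, name) pairs, assumed pre-filtered to size <= limit
    items = sorted(items)                                # ascending by size
    current = dict((frozenset([name]), size) for size, name in items)
    all_sets = dict(current)
    while current:
        grown_sets = {}
        for names, size in current.items():
            for extra_size, extra_name in items:
                total = size + extra_size
                if total > limit:
                    break                                # later items are even bigger
                if extra_name in names:
                    continue
                grown = names | frozenset([extra_name])
                if grown not in all_sets and grown not in grown_sets:
                    grown_sets[grown] = total
        all_sets.update(grown_sets)                      # keep every level found so far
        current = grown_sets                             # only grow the newest level next
    return [(size, names) for names, size in all_sets.items()]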

Note 2006/03/30: the recipe has been updated to count sizes using a block size, and to show the results in reverse order.

7 comments

Christos Georgiou 18 years, 1 month ago

Minor detail. I didn't try your algorithm, I just browsed it quickly. Since I have a similar algorithm (in my case it was night-time downloading of update packages in rpm format and then splitting them to record on CDs; DVDs weren't that popular then), I believe you could get more accurate results if you round the file sizes up to sector granularity (i.e. size = (size + 2047) // 2048).
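
In code, the commenter's rounding, expressed back in bytes the way the updated recipe does it (round_up_to_sector is a hypothetical helper name):

def round_up_to_sector(size, sector=2048):
    # Round a byte count up to whole 2048-byte sectors.
    return (size + sector - 1) // sector * sector

print(round_up_to_sector(1))     # 2048
print(round_up_to_sector(4096))  # 4096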

Denis Barmenkov 18 years, 1 month ago

Trying it out. I tried to run your script and found these issues:

  1. the best match is printed at the top of the list, so I have to scroll the console back (on Linux) or pipe the script through a pager (Windows) to see it. My fix for this script: use reversed(collections) in the last loop (requires Python 2.4); see the sketch after this comment;

  2. the user has to specify every item on the command line, which is unhelpful;

  3. grouped items are printed as one long line, which kills my eyes ;)

Thanks for the reminder; I'll publish my own script for this job here later.
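
A minimal sketch of the output changes described in points 1 and 3, assuming Python 2.4+ for reversed() and reusing the recipe's collections and limit variables:

# Best matches first, one path per line instead of one long line.
for size, names in reversed(collections):
    percentage = 100.0 * size / limit
    print("%10d\t%02.2f%%" % (size, percentage))
    for name in sorted(names):
        print("\t%s" % name)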

José Fonseca (author) 18 years, 1 month ago

Granularity. At some point I was considering adding a block size option, but then I ran into the issue of what to do with directories and gave up. Then again, you're right: it would be more accurate.

José Fonseca (author) 18 years, 1 month ago

Usability. As you may have guessed, the program is fitted to my usage pattern, which (on Linux) is always something like:

$ cd /dir/to/stuff/to/burn/in/dvds
$ mixnmatch.py *
4482695104      95.38%  xxx yyy zzz
4480609827      95.33%  xxx aaa bbb
4397360001      93.56%  xxx ccc
...
$ growisofs -Z /dev/dvdrw xxx yyy zzz

So I always select the items from one of the top lines and paste them back on the command line. I admit a GUI would be nice, but providing one is a tad beyond my intentions.

Nevertheless, reverse sorting is a good idea. It's just a matter of taking away the minus sign in

collections.sort(lambda (xsize, xnames), (ysize, ynames): -cmp(xsize, ysize))

I shall post an updated version of the script later, after gathering all the feedback.

Keith Briggs 18 years ago

symlink problem. This program crashes badly if there is a symbolic link whose target does not exist.
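
A possible guard (a sketch, not part of the recipe): treat paths whose target is missing, such as dangling symlinks, as zero-sized so get_size() never calls os.path.getsize() on them:

def get_size(path, block_size):
    if os.path.islink(path) and not os.path.exists(path):
        return 0                      # dangling symlink: nothing to burn
    if os.path.isdir(path):
        result = 0
        for name in os.listdir(path):
            size = get_size(os.path.join(path, name), block_size)
            result += (size + block_size - 1) // block_size * block_size
        return result
    return os.path.getsize(path)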

Sylvain Fourmanoit 15 years, 11 months ago

Brute force approach. Here, just performing a brute-force computation always seems faster... Here is a drop-in replacement for the mix_and_match() function:


Sylvain Fourmanoit 15 years, 11 months ago

Brute force approach (reloaded). No idea what happened above... Here it goes:

def mix_and_match(limit, items, verbose = False):
    def powerset(seq):
        if len(seq) > 0:
            head = powerset(seq[:-1])
            return head + [item + [seq[-1]] for item in head]
        else:
            return [[]]
    return [item
            for item in [(sum([size for size, name in subset]),
                          set([name for size, name in subset]))
                         for subset in powerset(items)]
            if item[0] < limit]
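
Note that the powerset version enumerates all 2**n subsets, so it only pays off for a fairly small number of paths. With either implementation, a quick call with made-up sizes looks something like this:

items = [(4000, 'xxx'), (3000, 'yyy'), (1500, 'zzz')]   # hypothetical byte counts
for size, names in sorted(mix_and_match(5000, items), key=lambda pair: pair[0]):
    print("%d\t%s" % (size, ' '.join(sorted(names))))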