Welcome, guest | Sign In | My Account | Store | Cart

Implement Smalltalk's bag class (similar to MultiSets in C++ or bag's in Objective C or Haskell's Edison module).

Python, 96 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
from operator import itemgetter
from heapq import nlargest

class bag(object):

    def __init__(self, iterable=()):
        self._data = {}
        self._len = 0
        self.update(iterable)

    def update(self, iterable):
        if isinstance(iterable, dict):
            for elem, n in iterable.iteritems():
                self[elem] += n
        else:
            for elem in iterable:
                self[elem] += 1 

    def __contains__(self, elem):
        return elem in self._data

    def __getitem__(self, elem):
        return self._data.get(elem, 0)

    def __setitem__(self, elem, n):
        self._len += n - self[elem]
        self._data[elem] = n
        if n == 0:
            del self._data[elem]

    def __delitem__(self, elem):
        self._len -= self[elem]
        del self._data[elem]

    def __len__(self):
        assert self._len == sum(self._data.itervalues())
        return self._len

    def __eq__(self, other):
        if not isinstance(other, bag):
            return False
        return self._data == other._data

    def __ne__(self, other):
        if not isinstance(other, bag):
            return True
        return self._data != other._data

    def __hash__(self):
        raise TypeError

    def __repr__(self):
        return 'bag(%r)' % self._data

    def copy(self):
        return self.__class__(self)

    __copy__ = copy # For the copy module

    def __deepcopy__(self, memo):
        from copy import deepcopy
        result = self.__class__()
        memo[id(self)] = result
        data = result._data
        result._data = deepcopy(self._data, memo)
        result._len = self._len
        return result

    def __getstate__(self):
        return self._data.copy(), self._len

    def __setstate__(self, data):
        self._data = data[0].copy()        
        self._len = data[1]

    def clear(self):
        self._data.clear()
        self._len = 0

    def __iter__(self):
        for elem, cnt in self._data.iteritems():
            for i in xrange(cnt):
                yield elem
                
    def iterunique(self):
        return self._data.iterkeys()

    def itercounts(self):
        return self._data.iteritems()     

    def mostcommon(self, n=None):
        if n is None:
            return sorted(self.itercounts(), key=itemgetter(1), reverse=True)
        it = enumerate(self.itercounts())
        nl = nlargest(n, ((cnt, i, elem) for (i, (elem, cnt)) in it))
        return [(elem, cnt) for cnt, i, elem in nl]

I'm proposing this API for inclusion in a Py2.5 collections module. Please comment if you see any way the API can be improved.

Example: <pre>

>>> from collections import bag
>>> b = bag('banana')
>>> b.mostcommon()
[('a', 3), ('n', 2), ('b', 1)]
>>> list(b)
['a', 'a', 'a', 'b', 'n', 'n']
>>> set(b.iterunique())  # faster than set(b)
set(['a', 'b', 'n'])
>>> b['a']
3
>>> b['a'] += 4
>>> b['a']
7
>>> del b['a']
>>> b['a'] += 1
>>> b['a']
1
</pre>

Reposted after incorporating most of the suggestions below.

17 comments

Matt R 20 years, 3 months ago  # | flag

subscripting? Finding this recipe has just saved me an hour or so -- thanks!

API: How about using subscripting to access the counts?

>>> b = bag('banana')
>>> b['a']
3
average 20 years, 3 months ago  # | flag

API. I agree with previous comment from Matt R and prefer subscripting interface so i can:

>>> mybag = bag("abacab")



>>> mybag['d'] += 1 #automatically adds 'a' to bag if doesn't exist



>>> mybag['c'] -= 2 #clear (or throw exception) if less than 2 'c's in bag



>>> mybag['a'] = 0  #removes all 'a' from bag



>>> mybag['e']      #queries bag for the count of 'e'

0

This gets rid of count, add, remove methods.

Since I inheret my bag from dict, I redefine pop() to both return and remove all of a given item of bag. Also since I have to redefine dict.__setitem__() anyway, I use that opportunity to validate the dict(non-zero integer) values before insertion. This also makes it easy for subclasses to redefine the policy of how the bag should behave (like in example #2 above). (Alternatively, one could pass in a policy function or "filter" to the bag constructor.)

I also allow add/subtract arithmetic on bags:

>>> bag("word counts from file one".split()) + bag("word counts from another file".split())

bag({"word": 2, "counts": 2, "from": 2, "another": 1, "file": 2, "one": 1})

Mike

average 20 years, 3 months ago  # | flag

addendum. I should add that the suggested API allows me to easily create an weighted graph class as a dict of bags with a nice API for the graph:

>>> g = graph()
>>> g['here']['there']  #returns edge weight



>>> g['here']['newthere'] = 1 #add new vertex 'newthere' with weight=1
Remy Blank 19 years, 4 months ago  # | flag

Inequality operator. Shouldn't the __ne__() operator be:

def __ne__(self, other):
        if not isinstance(other, bag):
            return True
        return self._data != other._data

I.e. if the other object is not a bag, it is not equal to the bag?

Steven Bethard 19 years, 4 months ago  # | flag

__iter__. Why does __iter__ iterate over (item, count) pairs as opposed to just items like dict? That is, why not try to parallel the current Python objects more?

Steven Bethard 19 years, 4 months ago  # | flag

remove. I think maybe remove needs renamed. If I called the remove method on a bag, I would expect the count to be zeroed, not decremented. Maybe 'decrement' is a better name? I would also like a remove method that really does zero the count -- this is useful for pruning a bag by count... Of course, a 'prune' method that removes all items with less than or equal to a given count would be even better for my purposes...

Steven Bethard 19 years, 4 months ago  # | flag

sortedbycount. I don't think sortedbycount should be a method. As your implementation shows, this functionality is already available with sorted(). I'm not sure that sortedbycount actually gains us much here...

Andreas Kloss 19 years, 4 months ago  # | flag

Thinking in abstract vs. implementation. If you think of the bag as a key:count dict, then yes, "remove" should remove the whole key, and "decrement" could decrement until 0 (then remove).

On the other hand, if you think of the bag as an unordered container for multiple possibly-equal things, then it is sensible for "remove" to remove just one element instead of all elements being equal to the one you happened to select.

Case in point: Suppose you have a bag of many black and white marbles. You remove a white marble. Would you expect the bag to not contain any white marble? The method to remove all elements of an equality class should be aptly named "removeAll" or something to this effect.

-- Andre

Providing high quality math zealotry since 1977 ;)

Steven Bethard 19 years, 4 months ago  # | flag

My typical use for a bag class would be to hold word counts, not marbles. ;) So my abstraction is keys with counts. Of course, it's a moot point since the remove method's been removed in favor of __getitem__ and __setitem__.

Steven Bethard 19 years, 4 months ago  # | flag

mostcommon. Thanks, I like mostcommon a lot better!

Jim Jewett 19 years, 4 months ago  # | flag

specialcase update with another bag. I would expect to able to write

b1 = Bag()
b2 = Bag()
...
b2.update(b1)

As written, this turns out to be a fairly inefficient way of doing it. If b1[k] == 20, it will take 20 separate calls to the iterator, the length adjuster, and the dictionary's __setitem__

Steven Bethard 19 years, 2 months ago  # | flag

itercounts is confusing. I was looking at this again today because I need to use something like this right now, and it strikes me that itercounts is kind of confusing. For my current task, I want just the counts (i.e. the equivalent of _data.itervalues()). To me, itercounts sounds like it does this, not iterate through the (item, count) pairs.

So I guess I have two suggestions: (1) Make itercounts iterate just through the counts, and (2) Rename itercounts to something like iteritems or iteritemcounts.

Thanks again!

Martin v. Löwis 19 years ago  # | flag

Still need add and remove. Users coming from Smalltalk or Objective C, or users familiar with C++ multiset will expect that a collections.bag is a container that is capable of holding the same thing (say, a string) multiple times. Currently, there appears to be no .add method, which I think there should be. Also, .remove should remove a single item, not all copies of the same item. For convenience, .addAll might be useful, to implement a frequency count.

Jon Blunt 18 years, 10 months ago  # | flag

Bags should do not have -ve quantities. If you decrement a value below 0 there is no logical meaning and you can break the validity of len(myBag).

>>> x =bag.bag()



>>> x['a']

0

>>> x['a']-=1



>>> x['a']

-1

>>> len (x)

Traceback (most recent call last):

File "", line 1, in -toplevel-

len (x)

ValueError: __len__() should return >= 0

Dan Stromberg 16 years, 4 months ago  # | flag

Union, intersection, and difference of two bags? This class would be more general if it supported union, intersection and difference of two bags:

http://www.brpreiss.com/books/opus5/html/page398.html

...making it more truly like "sets with duplicates".

I'll probably need a method that does a modified union, where b.minunion(c) would give a bag with elements e s.t. their repetition counts are max(|e| in b, |e| in c). But I'm not sure I should foist this weird operation on the rest of the python world. :)

Matteo Dell'Amico 15 years, 2 months ago  # | flag

This is my implementation inheriting from dict. This changes the viewpoint from a "set with duplicates" to a counting dictionary which is probably more simple to grasp for the average Python user. The default __iter__ becomes thus iterunique, and __len__ returns the number of unique items.

from itertools import repeat
from heapq import nlargest
from operator import itemgetter

class bag(dict):

    def __init__(self, data=()):
        self.update(data)

    def __missing__(self, key):
        return 0

    def update(self, other):
        if hasattr(other, 'items'):
            super(bag, self).update(other)
        else:
            for elem in other:
                self[elem] += 1

    def __setitem__(self, elem, n):
        if n <= 0:
            if elem in self:
                del self[elem]
        else:
            super(bag, self).__setitem__(elem, n)

    def itermultiple(self):
        for elem, cnt in self.iteritems():
            for _ in xrange(cnt):
                yield elem

    def nitems(self):
        return sum(self.itervalues())

    def mostcommon(self, n=None):
        if n is None:
            return sorted(self.iteritems(), key=itemgetter(1), reverse=True)
        else:
            return nlargest(n, self.iteritems(), key=itemgetter(1))

    def __repr__(self):
        return '%s(%s)' % (self.__class__.__name__, dict.__repr__(self))

    def add(self, item, n=1):
        self[item] += n

    def discard(self, item, n=1):
        self[item] -= n
Grant Jenks 9 years, 7 months ago  # | flag

If you're looking for an API similar to that provided by a bag, check out the sortedcontainers module. It implements sorted list, sorted dict, and sorted set data types in pure-Python and is fast-as-C implementations (even faster!). Learn more about sortedcontainers, available on PyPI and github.