Implement Smalltalk's bag class (similar to multisets in C++, bags in Objective-C, or the bag types in Haskell's Edison module).
<pre>
from operator import itemgetter
from heapq import nlargest

class bag(object):
    """A multiset: an unordered collection whose elements may occur more than once."""

    def __init__(self, iterable=()):
        self._data = {}                 # mapping of elem -> count
        self._len = 0                   # total number of elements, counting duplicates
        self.update(iterable)

    def update(self, iterable):
        if isinstance(iterable, dict):
            for elem, n in iterable.iteritems():
                self[elem] += n
        else:
            for elem in iterable:
                self[elem] += 1

    def __contains__(self, elem):
        return elem in self._data

    def __getitem__(self, elem):
        return self._data.get(elem, 0)  # missing elements have a count of zero

    def __setitem__(self, elem, n):
        self._len += n - self[elem]
        self._data[elem] = n
        if n == 0:                      # a zero count removes the element entirely
            del self._data[elem]

    def __delitem__(self, elem):
        self._len -= self[elem]
        del self._data[elem]

    def __len__(self):
        assert self._len == sum(self._data.itervalues())
        return self._len

    def __eq__(self, other):
        if not isinstance(other, bag):
            return False
        return self._data == other._data

    def __ne__(self, other):
        if not isinstance(other, bag):
            return True
        return self._data != other._data

    def __hash__(self):
        raise TypeError                 # bags are mutable, hence unhashable

    def __repr__(self):
        return 'bag(%r)' % self._data

    def copy(self):
        return self.__class__(self)

    __copy__ = copy                     # for the copy module

    def __deepcopy__(self, memo):
        from copy import deepcopy
        result = self.__class__()
        memo[id(self)] = result
        result._data = deepcopy(self._data, memo)
        result._len = self._len
        return result

    def __getstate__(self):             # support pickling
        return self._data.copy(), self._len

    def __setstate__(self, data):
        self._data = data[0].copy()
        self._len = data[1]

    def clear(self):
        self._data.clear()
        self._len = 0

    def __iter__(self):
        for elem, cnt in self._data.iteritems():
            for i in xrange(cnt):       # yield each element once per occurrence
                yield elem

    def iterunique(self):
        return self._data.iterkeys()

    def itercounts(self):
        return self._data.iteritems()

    def mostcommon(self, n=None):
        if n is None:
            return sorted(self.itercounts(), key=itemgetter(1), reverse=True)
        # decorate with (count, index, elem) so ties never compare the elements themselves
        it = enumerate(self.itercounts())
        nl = nlargest(n, ((cnt, i, elem) for (i, (elem, cnt)) in it))
        return [(elem, cnt) for cnt, i, elem in nl]
</pre>
I'm proposing this API for inclusion in the Py2.5 collections module. Please comment if you see any way the API can be improved.
Example: <pre>
>>> from collections import bag
>>> b = bag('banana')
>>> b.mostcommon()
[('a', 3), ('n', 2), ('b', 1)]
>>> list(b)
['a', 'a', 'a', 'b', 'n', 'n']
>>> set(b.iterunique()) # faster than set(b)
set(['a', 'b', 'n'])
>>> b['a']
3
>>> b['a'] += 4
>>> b['a']
7
>>> del b['a']
>>> b['a'] += 1
>>> b['a']
1
</pre>
Reposted after incorporating most of the suggestions below.
Tags: algorithms
subscripting? Finding this recipe has just saved me an hour or so -- thanks!
API: How about using subscripting to access the counts?
API. I agree with the previous comment from Matt R and prefer the subscripting interface, so I can read and adjust counts directly. This gets rid of the count, add, and remove methods.
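For illustration (not from the original comment), here is how the subscripting interface covers those operations, using the bag('banana') example from above:
<pre>
>>> b = bag('banana')
>>> b['a']            # what a count() method would return
3
>>> b['a'] += 1       # what add() would do
>>> b['a'] -= 1       # remove a single 'a'
>>> del b['a']        # remove every 'a'
>>> b['a']
0
</pre>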
Since I inherit my bag from dict, I redefine pop() to both return and remove all of a given item in the bag. Also, since I have to redefine dict.__setitem__() anyway, I use that opportunity to validate the (non-zero integer) values before insertion. This also makes it easy for subclasses to redefine the policy of how the bag should behave (as in example #2 above). (Alternatively, one could pass a policy function or "filter" to the bag constructor.)
I also allow add/subtract arithmetic on bags:
<pre>
bag({"word": 2, "counts": 2, "from": 2, "another": 1, "file": 2, "one": 1})
</pre>
Mike
addendum. I should add that the suggested API allows me to easily create a weighted graph class as a dict of bags, with a nice API for the graph:
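A rough illustration of that idea (a hypothetical helper, not the commenter's original code): the graph is a plain dict mapping each vertex to a bag of neighbours, and the count doubles as the edge weight.
<pre>
def add_edge(graph, u, v, weight=1):
    # graph: dict mapping vertex -> bag of neighbouring vertices
    graph.setdefault(u, bag())[v] += weight
    graph.setdefault(v, bag())[u] += weight      # undirected

g = {}
add_edge(g, 'a', 'b', 3)
add_edge(g, 'a', 'c')
print g['a']['b']        # edge weight between 'a' and 'b': 3
</pre>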
Inequality operator. Shouldn't the __ne__() operator be:
I.e. if the other object is not a bag, it is not equal to the bag?
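Presumably the suggestion was something like the version that now appears in the recipe (the commenter's snippet was not preserved here):
<pre>
def __ne__(self, other):
    if not isinstance(other, bag):
        return True              # a non-bag is never equal to a bag
    return self._data != other._data
</pre>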
__iter__. Why does __iter__ iterate over (item, count) pairs as opposed to just items like dict? That is, why not try to parallel the current Python objects more?
remove. I think maybe remove needs to be renamed. If I called the remove method on a bag, I would expect the count to be zeroed, not decremented. Maybe 'decrement' is a better name? I would also like a remove method that really does zero the count -- this is useful for pruning a bag by count... Of course, a 'prune' method that removes all items with less than or equal to a given count would be even better for my purposes...
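A possible prune(), sketched as a method to add to the bag class above (the name and semantics follow the comment; this is not part of the posted recipe):
<pre>
def prune(self, n=1):
    "Remove every element whose count is less than or equal to n."
    # collect first, since deleting while iterating over itercounts() would break
    for elem in [e for e, cnt in self.itercounts() if cnt <= n]:
        del self[elem]
</pre>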
sortedbycount. I don't think sortedbycount should be a method. As your implementation shows, this functionality is already available with sorted(). I'm not sure that sortedbycount actually gains us much here...
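For reference, with b = bag('banana') from the example above, the same result comes straight from sorted() plus operator.itemgetter, which is essentially what mostcommon() does when n is None:
<pre>
>>> from operator import itemgetter
>>> sorted(b.itercounts(), key=itemgetter(1), reverse=True)
[('a', 3), ('n', 2), ('b', 1)]
</pre>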
Thinking in abstract vs. implementation. If you think of the bag as a key:count dict, then yes, "remove" should remove the whole key, and "decrement" could decrement until 0 (then remove).
On the other hand, if you think of the bag as an unordered container for multiple possibly-equal things, then it is sensible for "remove" to remove just one element instead of all elements being equal to the one you happened to select.
Case in point: Suppose you have a bag of many black and white marbles. You remove a white marble. Would you expect the bag to not contain any white marble? The method to remove all elements of an equality class should be aptly named "removeAll" or something to this effect.
-- Andre
Providing high quality math zealotry since 1977 ;)
My typical use for a bag class would be to hold word counts, not marbles. ;) So my abstraction is keys with counts. Of course, it's a moot point since the remove method's been removed in favor of __getitem__ and __setitem__.
mostcommon. Thanks, I like mostcommon a lot better!
specialcase update with another bag. I would expect to be able to call update() with another bag as the argument. As written, this turns out to be a fairly inefficient way of doing it: if b1[k] == 20, it will take 20 separate calls to the iterator, the length adjuster, and the dictionary's __setitem__.
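One way to special-case it (a sketch of a replacement update() for the class above, adding counts in one step per unique element rather than once per occurrence):
<pre>
def update(self, iterable):
    if isinstance(iterable, bag):
        counts = iterable.itercounts()      # one (elem, n) pair per unique element
    elif isinstance(iterable, dict):
        counts = iterable.iteritems()
    else:
        for elem in iterable:
            self[elem] += 1
        return
    for elem, n in counts:
        self[elem] += n
</pre>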
itercounts is confusing. I was looking at this again today because I need to use something like this right now, and it strikes me that itercounts is kind of confusing. For my current task, I want just the counts (i.e. the equivalent of _data.itervalues()). To me, itercounts sounds like it does this, not iterate through the (item, count) pairs.
So I guess I have two suggestions: (1) Make itercounts iterate just through the counts, and (2) Rename itercounts to something like iteritems or iteritemcounts.
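In other words, something like the following (the names are only suggestions):
<pre>
def itercounts(self):
    return self._data.itervalues()          # just the counts
def iteritemcounts(self):
    return self._data.iteritems()           # (item, count) pairs
</pre>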
Thanks again!
Still need add and remove. Users coming from Smalltalk or Objective-C, or users familiar with C++ multisets, will expect that a collections.bag is a container capable of holding the same thing (say, a string) multiple times. Currently, there appears to be no .add method, which I think there should be. Also, .remove should remove a single item, not all copies of the same item. For convenience, .addAll might be useful, to implement a frequency count.
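In terms of the subscripting interface, those methods could be thin wrappers (a sketch, not part of the posted recipe):
<pre>
def add(self, elem):
    self[elem] += 1                 # add one occurrence

def remove(self, elem):
    if not self[elem]:
        raise KeyError(elem)
    self[elem] -= 1                 # remove a single occurrence, not all copies

def addAll(self, iterable):
    for elem in iterable:
        self[elem] += 1             # convenient for frequency counts
</pre>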
Bags should not have negative quantities. If you decrement a value below 0 there is no logical meaning, and you can break the validity of len(myBag):
<pre>
>>> b = bag()
>>> len(b)
0
>>> b['x'] -= 1
>>> b['x']
-1
>>> len(b)
Traceback (most recent call last):
  File "", line 1, in -toplevel-
ValueError: __len__() should return >= 0
</pre>
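One way to enforce that invariant is to reject negative counts in __setitem__ (a sketch; the posted recipe does not currently do this):
<pre>
def __setitem__(self, elem, n):
    if n < 0:
        raise ValueError('bag counts cannot be negative: %r' % n)
    self._len += n - self[elem]
    self._data[elem] = n
    if n == 0:
        del self._data[elem]
</pre>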
Union, intersection, and difference of two bags? This class would be more general if it supported union, intersection and difference of two bags:
http://www.brpreiss.com/books/opus5/html/page398.html
...making it more truly like "sets with duplicates".
I'll probably need a method that does a modified union, where b.minunion(c) would give a bag with elements e s.t. their repetition counts are max(|e| in b, |e| in c). But I'm not sure I should foist this weird operation on the rest of the python world. :)
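A sketch of those set-like operations on the recipe's bag, written as plain functions (union keeps the larger count, intersection the smaller, difference subtracts and stops at zero); these names are only suggestions:
<pre>
def union(b, c):
    result = b.copy()
    for elem, n in c.itercounts():
        result[elem] = max(result[elem], n)
    return result

def intersection(b, c):
    result = bag()
    for elem, n in c.itercounts():
        if elem in b:
            result[elem] = min(b[elem], n)
    return result

def difference(b, c):
    result = b.copy()
    for elem, n in c.itercounts():
        result[elem] = max(result[elem] - n, 0)
    return result
</pre>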
This is my implementation, inheriting from dict. This changes the viewpoint from a "set with duplicates" to a counting dictionary, which is probably simpler for the average Python user to grasp. The default __iter__ thus becomes iterunique, and __len__ returns the number of unique items.
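A minimal sketch of that viewpoint (reconstructed; the commenter's original code was not preserved here):
<pre>
class countingdict(dict):
    "Hypothetical dict-based bag: keys are items, values are their counts."
    def __missing__(self, elem):            # Py2.5 hook: absent items count as zero
        return 0
    def addall(self, iterable):
        for elem in iterable:
            self[elem] += 1
    # The inherited dict behaviour gives the viewpoint described above:
    #   iter(d)       -> unique items (what the recipe calls iterunique)
    #   len(d)        -> number of unique items
    #   d.iteritems() -> (item, count) pairs
</pre>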
If you're looking for an API similar to that provided by a bag, check out the sortedcontainers module. It implements sorted list, sorted dict, and sorted set data types in pure Python and is as fast as C implementations (or even faster). Learn more about sortedcontainers, available on PyPI and github.