This implements a "unique" filter. Its input is an iterator of hashable items. It returns an iterator containing only the unique items of the input, in input order. That is, list(unique("cabbage")) produces ["c", "a", "b", "g"]. The implementation is lazy. The function supports the "key" parameter, which provides an alternate form of comparison.
(Note: a better version of this is available from the itertools documentation as unique_everseen )
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | def unique(iterable, key=None):
seen = set()
if key is None:
# Optimize the common case
for item in iterable:
if item in seen:
continue
seen.add(item)
yield item
else:
for item in iterable:
keyitem = key(item)
if keyitem in seen:
continue
seen.add(keyitem)
yield item
if __name__ == "__main__":
assert list(unique("abcd")) == list("abcd"), (list(unique("abcd")), "abcd".split())
assert list(unique("abca")) == list("abc")
assert list(unique("baaca")) == list("bac")
assert list(unique("")) == []
assert list(unique("to be or not to be".split())) == "to be or not".split()
assert list(unique("to be or not to be".split(), key=len)) == "to not".split()
assert list(unique(set("cabbage"))) == list(unique(set("cabbage")))
print("All tests passed.")
|
There are several recipes which find the unique items in a collection (502263, 438599, and 52560). These consume the entire input before returning the unique items, often as a list.
Python code is evolving to prefer more iterator-based solutions. This recipe implements a lazy iterator. Internally it keeps track of all the previously seen items using a set. To produce a new unique item it consumes items from the input until there's one which hasn't been seen before. When found, this new item is added to the set.
The use case is a task like "find the first 5 unique items in the input".
for item in itertools.islice(unique(input_iterator), 5):
print(item)
A more real-world example is to generate a list of speakers in order of appearance in a play:
for speaker in unique(node.text for node in play.findall("//SPEAKER")):
print(speaker)
The "key" option is useful when there is another way to determine uniqueness. For example, if the speaker node in the previous example contains information about the actor in the attribute "actor", then
for speaker in unique(node for node in play.findall("//SPEAKER"), key=lambda node: node.text):
print("%s as %s" % (speaker.attrib["actor"], speaker.text))
will show the actor and the corresponding role.
As a simpler example of the "key" parameter, the following takes a list of words and returns the first of each word of a given length, in input order.
for word in unique(words, key=len):
print(word)
Do note that if you only want unique elements in an iterator then your can do set(input_iterator). This recipe preserves the input order. You can preserve the input order with iter(collections.OrderedDict.from_keys(input_iterator)) but that does not use the "key" function and it consumes the entire input before generating output. You can implement the key function yourself with collections.OrderedDict((key(item), item) for item in input_iterator).values() (or itervalues() for pre-3.x Python), but that again consumes the entire input before generating output.
Some of the other unique() recipes support fallback solutions which use a sorted list when elements implement __lt__ but not __hash__, and even failsafe from that and revert to the order-quadratic list algorithm if the elements only implement __eq__. I decided to not implement those backup strategies since I don't like the run-time check, I don't like the strange behavior if some elements implement __lt_ and others implement __hash__ but not vice versa, and because I want an exception instead of occasional quadratic behavior.
I instead suggest that a "unique_sortable" or similar function be done for input objects which are sortable but not hashable.