|
1
|
Python's "in" operator is extremely handy, but O(N) when applied to an N-item sequence; if a sequence is subject to frequent "in" tests, an auxiliary dictionary at its side can boost performance A LOT if the values are hashable.
A membership check ("in" operator) on a sequence of N items is O(N); if M such tests are performed, overall time is O(M*N). Preparing an auxiliary dictionary whose keys are the sequence's items is also [roughly] O(N), but then the M tests are just [roughly] O(M), so overall we have [roughly] O(N+M). Even better overall performance may often be obtained by 'permanently' placing the auxiliary dictionary on the side of the sequence, encapsulating both into one object; however, in this case, the dictionary must be "maintained", as the sequence is modified, so that it stays "in sync" with the actual membership of the sequence. The above code, for example, extends UserList, and delegates to it every method -- this is done explicitly, for methods that may modify the membership, to enable resetting a flag that asserts "the auxiliary dict is in sync"; the auxdict gets rebuilt, if needed, when __contains__ is used (the "in" operator calls the __contains__ method when applied to an instance that has it). Of course, performance characteristics depend on the actual pattern of membership tests versus membership modifications, and it may need some careful profiling to find the right approach for a given use. This recipe, however, caters for a very common pattern of use -- where sequence-modifying operations happen "in bunches", then, for some time, no sequence modification is performed, but several membership-tests may be done. Rebuilding the dictionary when needed is also simpler than incrementally maintaining it at each sequence-modifying step (which requires careful analysis of what is being removed, and what inserted, particularly upon such operations as slice-assignment; if incremental maintenance of the auxdict is desired, it will likely be wise to have the values in the auxdict be a count of the number of occurrences of each key's value in the sequence -- a list of the indices where the value is present being another lively possibility, but taking more work to maintain), although, depending on usage patterns, the resulting software might happen to be substantially slower. Of course, all of this is only needed if _the sequence itself_ is needed -- i.e., the order of items in the sequence is significant; otherwise, keeping just the dictionary is obviously simpler and more effective! (Again, the dictionary can map values to _counts_ if the only need is for the data structure to be a 'bag' rather than a 'set'). An important prerequisite for any of these membership-test optimizations is that the values in the sequence must be _hashable_ ones -- else, of course, they cannot be keys in a dictionary! For example, a list of tuples might be subjected to this treatment, but for a list of lists they would not be applicable.
Tags: search
|
Add a comment
Sign in to comment

Download
Copy to clipboard
