ActiveState Code

Recipe 252524: Length-limited O(1) LRU Cache implementation


Everyone knows about LRU. Here is an implementation that takes O(1) time to insert, update or delete.

Python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
class Node(object):
    __slots__ = ['prev', 'next', 'me']
    def __init__(self, prev, me):
        self.prev = prev
        self.me = me
        self.next = None

class LRU:
    """
    Implementation of a length-limited O(1) LRU queue.
    Built for and used by PyPE:
    http://pype.sourceforge.net
    Copyright 2003 Josiah Carlson.
    """
    def __init__(self, count, pairs=[]):
        self.count = max(count, 1)
        self.d = {}
        self.first = None
        self.last = None
        for key, value in pairs:
            self[key] = value
    def __contains__(self, obj):
        return obj in self.d
    def __getitem__(self, obj):
        a = self.d[obj].me
        self[a[0]] = a[1]
        return a[1]
    def __setitem__(self, obj, val):
        if obj in self.d:
            del self[obj]
        nobj = Node(self.last, (obj, val))
        if self.first is None:
            self.first = nobj
        if self.last:
            self.last.next = nobj
        self.last = nobj
        self.d[obj] = nobj
        if len(self.d) > self.count:
            if self.first == self.last:
                self.first = None
                self.last = None
                return
            a = self.first
            a.next.prev = None
            self.first = a.next
            a.next = None
            del self.d[a.me[0]]
            del a
    def __delitem__(self, obj):
        nobj = self.d[obj]
        if nobj.prev:
            nobj.prev.next = nobj.next
        else:
            self.first = nobj.next
        if nobj.next:
            nobj.next.prev = nobj.prev
        else:
            self.last = nobj.prev
        del self.d[obj]
    def __iter__(self):
        cur = self.first
        while cur != None:
            cur2 = cur.next
            yield cur.me[1]
            cur = cur2
    def iteritems(self):
        cur = self.first
        while cur != None:
            cur2 = cur.next
            yield cur.me
            cur = cur2
    def iterkeys(self):
        return iter(self.d)
    def itervalues(self):
        for i,j in self.iteritems():
            yield j
    def keys(self):
        return self.d.keys()

Discussion

Having not seen anything of the sort during my undergraduate years, upon writing it initially in the summer of 2002, I was enthusiastic as to it's usefulness in database caching or even operating system paging. Unfortunately, it seems as though no one is even interested in this thing, which suggests that it has been done before or something else. As a result, I'm offering it in the Python Cookbook before releasing a variant in my own Python editor PyPE (for keeping information about the 128 most recently opened documents) available at http://pype.sourceforge.net .

Use and enjoy.

Edit: Oh wow, you can edit them when there's a bug? Wow. Fixed the bugs.

Comments

  1. 1. At 11:56 a.m. on 17 dec 2003, Fazal Majid said:

    Not quite O(1) or LRU. You basically maintain a dictionary and a linked list. Dictionary accesses are not O(1). I believe the Python implementation uses a hash table, which has O(N) worst case behavior. The best you can hope for in a dictionary is O(log(N)).

    Furthermore, your code is not correct and does not maintain LRU semantics, as your linked list is not updated to move the most recently accessed node to the end of the list when __getitem__ is called. You are actually implementing a FIFO.

    If you want to have good complexity behavior in a cache, you will have to use a priority queue algorithm like binomial heaps. I haven't tried the new Priority Queue class in Python 2.3 yet, but I have used the one at http://www.csse.monash.edu.au/hons/projects/1999/Andrew.Snare/ to good effect.

  2. 2. At 5:37 a.m. on 18 dec 2003, Michael Hudson said:

    over pedantry. While saying "the best you can hope for from a dict is O(log n)" may be pedantically correct (and I'm not even sure about that), in practice assuming dicts have O(1) access isn't going to get you into hot water.

    Python's hash functions are pretty good.

  3. 3. At 1:03 a.m. on 21 dec 2003, Josiah Carlson (the author) said:

    Bugfixes that slipped my mind. Small bugfixes:

    self.first = a
    a.next = None
    del self.d[a[0]]
    

    Should be replaced by:

    self.first = a.next
    a.next = None
    del self.d[a.me[0]]
    
  4. 4. At 1:13 a.m. on 21 dec 2003, Josiah Carlson (the author) said:

    Hashing and O(1). Check the Python distribution source code, namely dictobject.c. If you notice the probe sequence rules, unless every object has the same hash, it will not perform O(n). In fact, with a maximally 2/3 filled table, the odds of colliding more than 10 times is less than ~1%.

    To me, 10 probes is good enough. Heck, at 20 probes, we're talking about ~.01%.

    Plus it has the benefit of not having to deal with pointers for a tree structure.

  5. 5. At 1:24 a.m. on 21 dec 2003, Josiah Carlson (the author) said:

    It is an LRU (it was buggy). After the bugfix I posted, it is indeed an LRU. Why?

    If you follow the flow of code, certainly it always deletes the 'first' item, but that is the oldest item...all the newer items are inserted as the 'last' item.

    Here is some output (after my bugfix given in another comment):

    >>> a = LRU(4)
    >>> a[1] = 1
    >>> a[2] = 2
    >>> a[3] = 3
    >>> a[4] = 4
    >>> for i in a.iteritems():
    ...     print i,
    ...
    (1, 1) (2, 2) (3, 3) (4, 4)
    >>> a[5] = 5
    >>> for i in a.iteritems():
    ...     print i,
    ...
    (2, 2) (3, 3) (4, 4) (5, 5)
    >>> a[3] = 6
    >>> for i in a.iteritems():
    ...     print i,
    ...
    (2, 2) (4, 4) (5, 5) (3, 6)
    

    Seems to work for me.

  6. 6. At 1:28 a.m. on 21 dec 2003, Josiah Carlson (the author) said:

    I see... That's what everyone is talking about.

    I only update it when writing, everyone cares about reading.

    There is a bugfix in another comment.

  7. 7. At 1:30 a.m. on 21 dec 2003, Josiah Carlson (the author) said:

    Make LRU on read (as well as write). If you want to make it LRU on read, as well as write (which is all I had been concerned about), change:

    def __getitem__(self, obj):
        return self.d[obj].me[1]
    

    To:

    def __getitem__(self, obj):
        a = self.d[obj].me
        self[a[0]] = a[1]
        return a[1]
    
  8. 8. At 2:14 a.m. on 21 dec 2003, Josiah Carlson (the author) said:

    A note on binomial heaps. I've implemented both a standard binary heap, as well as a binomial heap in Python before. I was not impressed by my binomial heap implementation.

    The author that you link to claims to have creted a fibonacci heap that is fast...but how fast? The graphs for his distribution show that it is quite fast, but I wanted to test it myself, so I did.

    Below are some test runs against my sample implementations. dict_heap and list_heap are identical, except that dict_heap uses a dictionary and list_heap uses a Python list (which is really an array). Notice how the dictionary is only ever about 33% slower than list_heap during the removal of all the elements.

                  init     popall
    10000:
    dict_heap     0.265    1.829
    list_heap     0.141    1.406
    BinomialHeap  0.484    4.25
    PQ0           0.047    0.734
    PQueue        0.359    1.594
    PQEquivBig    1.062    2.172
    20000:
    dict_heap     0.563    3.968
    list_heap     0.281    3.047
    BinomialHeap  0.969    9.125
    PQ0           0.11     2.765
    PQueue        0.766    3.437
    PQEquivBig    2.218    4.625
    30000:
    dict_heap     0.844    6.234
    list_heap     0.422    4.75
    BinomialHeap  1.469   14.313
    PQ0           0.172    8.859
    PQueue        1.125    5.407
    PQEquivBig    3.313    7.141
    40000:
    dict_heap     1.141    8.562
    list_heap     0.578    6.531
    BinomialHeap  1.937   19.828
    PQ0           0.234   16.094
    PQueue        1.625    7.438
    PQEquivBig    4.563    9.797
    

    PQ0 gets slower as the sizes get larger because deleting the first item in a list (as PQ0 does) is O(n).

    PQueue just seems to be poorly implemented.

    PQEquivBig seems to be exactly 50% slower than list_heap during deletion, and much slower in insertion.

    I'm not impressed.

  9. 9. At 4:12 p.m. on 10 sep 2004, Josiah Carlson (the author) said:

    Ammortized O(1) when the dictionary is sparse enough. I was thinking about it for a bit; if Python's hash functions are good enough, and the dictionary was sparse enough, one can ammortize all accesses to O(1).

    Assume Python dictionaries use double hashing; first for the start location, second for a probe sequence increment. If the dictionary is 1/2 full, then the probability of you not colliding in your first shot is 1/2. For those that collide, subsequent tests have a 1/2 chance of not colliding. If we sum up the probabilities to get to probe number k... 11/2 + 21/4 + 31/8 + 41/16 + ..., which ends up summing to 2.

    Very nifty.

  10. 10. At 5:20 p.m. on 31 may 2006, Tim Mitchell said:

    Needs to use weakref. This recipie looks great for my application with a cache of about 1000 objects. However the doubly linked Nodes form reference cycles and will not get garbage collected when the LRU instance is deleted. Using a weakref for the Node.prev solves this. You could also use __slots__ on the Node class to save space - given that that there will be alot of them.

  11. 11. At 8:37 p.m. on 6 aug 2006, Josiah Carlson (the author) said:

    You make a good point about the use of slots. I've updated the recipe to reflect it.

    As for the reference cycles, since 2.2, Python has had a cycle-detecting garbage collector, so these reference cycles aren't a big deal. Also, only mangling the previous pointers ignores the fact that next pointers introduce their own reference cycle. To prove that reference cycles aren't a big deal, the following is a sample interaction using the LRU with Python 2.3 (as Python 2.2 is no longer supported by the PSF).

    >>> import gc
    >>> import lru
    >>> x = lru.LRU(2000, zip(range(1000), range(1000)))
    >>> del x
    >>> gc.garbage
    []
    >>> gc.collect()
    3000
    >>> gc.garbage
    []
    >>>
    

Sign in to comment