Trees are very common data structures and are usually considered to be very efficient. However, this is only true if the tree is balanced, meaning that all branches have roughly the same number of nodes.
There are good balancing trees, such as rb-trees or avl-trees. Unfortunately they are quite difficult to implement. An alternative tree structure is the highly branched b-tree (https://en.wikipedia.org/wiki/B-tree). In the c language, binary trees are preferable in most cases. However, in python things are different. This recipe shows how simple it is to implement a b-tree in python. The example is a sorted dict.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 | nmax = 20 # max number of branches per node
class Values:
key = None
def __init__(self, data):
self.data = tuple(data)
if len(data):
self.key = data[0][0]
def split(self):
l = self.data
if len(l)<nmax:
return (self,)
n = nmax/2
r = []
while l:
r.append(Values(l[:n]))
l = l[n:]
return tuple(r)
def insert(self, key, value):
for i, (k, v) in enumerate(self.data):
if key<k:
l = self.data[:i]+((key, value),)+self.data[i:]
return Values(l)
elif key == k:
l = self.data[:i]+((key, value),)+self.data[i+1:]
return Values(l)
l = self.data+((key, value),)
return Values(l)
class Node:
def __init__(self, childs):
assert childs
self.data = tuple(childs)
self.key = childs[0].key
def split(self):
l = self.data
if len(l)<nmax:
return (self,)
n = nmax/2
r = []
while l:
r.append(Node(l[:n]))
l = l[n:]
return tuple(r)
def insert(self, key, value):
last = None
for i, child in enumerate(self.data):
if key<child.key:
if last is not None:
l = self.data[:i-1]+last.insert(key, value).split()+self.data[i:]
else:
l = child.insert(key, value).split()+self.data[1:]
return Node(l)
last = child
l = self.data[:-1]+last.insert(key, value).split()
return Node(l)
|
The key idea of b-trees is that they grow in width and not in depth. Hence the tree which is returned from the insert-method always has the same depth as the original tree before the insert. If a node is full, it is split by the parent node (see the split method). That way branches always have the same length and the tree stays balanced.
Insertion of keys and values is done as follows:
tree = Values(())
for i in range(100000):
tree = tree.insert(key=i, value=i)
if len(tree.data)>nmax:
tree = Node(tree.split())
Given the simplicity of the algorithm, the performance is quite ok. Inserting 100 000 numbers takes 2.4 seconds on my pc. The rb-tree from http://code.activestate.com/recipes/576817-red-black-tree needs for the same task about twice the time: 5.2 seconds.
Surely these numbers have to be interpreted with caution since both trees are not optimized.
f you're looking for an API similar to that provided by a b-tree, check out the sortedcontainers module. It implements sorted list, sorted dict, and sorted set data types in pure-Python and is fast-as-C implementations (even faster!). Learn more about sortedcontainers, available on PyPI and github.