foldsh in this recipe is a general purpose tool for transforming tree-like recursive data structures while keeping track of shared subtrees.
# By default, a branch is encoded as a list of subtrees; each subtree can be a # branch or a leaf (=anything non-iterable). Subtrees can be shared: >>> subtree = [42,44] >>> tree = [subtree,[subtree]] # We can apply a function to all leaves: >>> foldsh(tree, leaf= lambda x: x+1) [[43, 45], [[43, 45]]] # Or apply a function to the branches: >>> foldsh(tree, branch= lambda t,c: list(reversed(c))) [[[44, 42]], [44, 42]] # The sharing is preserved: >>> _ is _ True # Summing up the leaves without double counting of shared subtrees: >>> foldsh(tree, branch= lambda t,c: sum(c), shared= lambda x: 0) 86
In particular, it is useful for transforming YAML documents. An example of this is given below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
def identity(x): return x def foldsh(tree, branch= lambda tree,ch: list(ch), leaf=identity, shared=identity, getchildren=iter): '''Sharing-aware tree transformations. foldsh(tree,branch,leaf,shared,getchildren) :: tree -> result branch :: tree, list of children's results -> result leaf :: tree -> result shared :: result -> result getchildren :: tree -> iterable of children''' mem = dict() def _fold(tree): if id(tree) in mem: return shared(mem[id(tree)]) else: try: children = getchildren(tree) except: res = leaf(tree) else: reschildren = [_fold(child) for child in children] res = branch(tree,reschildren) mem[id(tree)] = res return res return _fold(tree)
Revision 2: rewrite using the try...except...else idiom which I didn't know before and makes the logical structure clearer. Functionality is unchanged.
Many transformations on trees follow the same general pattern. In a recursive traversal,
- first, leaves are transformed into results (here, this is the task of the function
- when all subtrees (children) of a branch have been transformed, the results of these children are combined into a result for the branch (the task of function
- this way, results are passed up the tree until the result for the whole tree is constructed.
In this recipe, the recursion is performed by
foldsh; the transformation functions
branch are supplied by the user. Note that a "result" can be anything, for example
- an integer counting the number of leaves, depth of the tree, maximal breadth of branches;
- a list of all the leaves in the tree;
- another tree.
In addition to implementing a regular fold, this recipe supports sharing of subtrees. (So the data structure is actually a DAG, but we'll still say tree.) If during the recursive traversal
foldsh encounters a subtree that it has seen before, it will not recurse into it, but instead take the previously constructed result, and apply
shared to it.
foldsh recurse over other tree-like structures, supply it with a different
getchildren function. This function will be applied to the tree (and subtrees) to determine whether we have reached a branch or a leaf. If
getchildren is applied to a branch, it should return an iterable of subtrees; if applied to a leaf, it should throw an exception.
With no arguments except its first,
foldsh(tree) returns a copy of
tree where the structure consists of newly constructed lists, and the leaves are shared with
tree. In the new structure, the sharing of subtrees is preserved.
def countsh(tree): '''Count the number of leaf+branch nodes (count shared subtrees once).''' def branch(tree,reschildren): return 1 + sum(reschildren) def leaf(tree): return 1 def shared(tree): return 0 return foldsh(tree,branch,leaf,shared) def listleaves(tree): '''List all leaves (but those from shared subtrees only once).''' def leaf(tree): return [tree] def branch(tree,reschildren): result =  for rc in reschildren: result.extend(rc) return result def shared(res): return  return foldsh(tree,branch,leaf,shared) # Actually, this can also be elegantly solved by returning iterators as # result. We do not have to make an exception for shared subtrees, because # these (shared!) iterators will already be exhausted when encountered for # the second time: def iterleavesonce(tree): '''Iterate all leaves (but those from shared subtrees only once).''' def leaf(tree): return iter([tree]) def branch(tree,reschildren): return itertools.chain(*reschildren) return foldsh(tree,branch,leaf) # Build a tree out of frozensets instead of lists: subtree = [42,44] tree = [subtree,[subtree]] fstree = foldsh(tree, branch= lambda t,c: frozenset(c)) # fstree is now: # frozenset([frozenset([42, 44]), frozenset([frozenset([42, 44])])]) # with shared subtrees
Watch out with strings as leaves. The default
getchildren will go into infinite
iter('s') will yield 's' again. To solve this, you can use:
def getchildren(tree): assert not isinstance(tree,(str,unicode)) return iter(tree)
YAML is a human-readable language for data structure serialization with support for sharing. A YAML sequence corresponds to a Python
list and a YAML mapping to a
dict. We can use
foldsh to recurse into the data structure (for the dicts, we only recurse into the values), and, for example, reverse all the lists:
import yaml # from http://pyyaml.org/ def reverselists(tree): def branch(tree,reschildren): if isinstance(tree,dict): return dict(zip(tree.iterkeys(),reschildren)) else: return list(reversed(reschildren)) def getchildren(tree): try: return tree.itervalues() except: assert not isinstance(tree,str) return iter(tree) return foldsh(tree,branch,getchildren=getchildren) y = yaml.load(''' ingredients: - &s # this element is given label s and can be shared later - spam - SPAM - eggs - *s # like here - *s # and here ''')
Sharing is preserved in the resulting YAML:
>>> print(yaml.dump(reverselists(y),default_flow_style=False)) ingredients: - &id001 - SPAM - spam - *id001 - eggs - *id001
This recipe does not support cyclic data structures (it will go into infinite recursion). For a way to deal with this, see Recipe 578118.