The function foldsh
in this recipe is a general purpose tool for transforming tree-like recursive data structures while keeping track of shared subtrees.
# By default, a branch is encoded as a list of subtrees; each subtree can be a
# branch or a leaf (=anything non-iterable). Subtrees can be shared:
>>> subtree = [42,44]
>>> tree = [subtree,[subtree]]
# We can apply a function to all leaves:
>>> foldsh(tree, leaf= lambda x: x+1)
[[43, 45], [[43, 45]]]
# Or apply a function to the branches:
>>> foldsh(tree, branch= lambda t,c: list(reversed(c)))
[[[44, 42]], [44, 42]]
# The sharing is preserved:
>>> _[0][0] is _[1]
True
# Summing up the leaves without double counting of shared subtrees:
>>> foldsh(tree, branch= lambda t,c: sum(c), shared= lambda x: 0)
86
In particular, it is useful for transforming YAML documents. An example of this is given below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | def identity(x):
return x
def foldsh(tree, branch= lambda tree,ch: list(ch), leaf=identity, shared=identity, getchildren=iter):
'''Sharing-aware tree transformations.
foldsh(tree,branch,leaf,shared,getchildren) :: tree -> result
branch :: tree, list of children's results -> result
leaf :: tree -> result
shared :: result -> result
getchildren :: tree -> iterable of children'''
mem = dict()
def _fold(tree):
if id(tree) in mem:
return shared(mem[id(tree)])
else:
try:
children = getchildren(tree)
except:
res = leaf(tree)
else:
reschildren = [_fold(child) for child in children]
res = branch(tree,reschildren)
mem[id(tree)] = res
return res
return _fold(tree)
|
Changelog
Revision 2: rewrite using the try...except...else idiom which I didn't know before and makes the logical structure clearer. Functionality is unchanged.
Background
Many transformations on trees follow the same general pattern. In a recursive traversal,
- first, leaves are transformed into results (here, this is the task of the function
leaf
); - when all subtrees (children) of a branch have been transformed, the results of these children are combined into a result for the branch (the task of function
branch
); - this way, results are passed up the tree until the result for the whole tree is constructed.
In this recipe, the recursion is performed by foldsh
; the transformation functions leaf
and branch
are supplied by the user. Note that a "result" can be anything, for example
- an integer counting the number of leaves, depth of the tree, maximal breadth of branches;
- a list of all the leaves in the tree;
- another tree.
The general pattern is known as a fold (and in Python as reduce
; but both these terms are mostly used for inductive lists) or catamorphism.
In addition to implementing a regular fold, this recipe supports sharing of subtrees. (So the data structure is actually a DAG, but we'll still say tree.) If during the recursive traversal foldsh
encounters a subtree that it has seen before, it will not recurse into it, but instead take the previously constructed result, and apply shared
to it.
To make foldsh
recurse over other tree-like structures, supply it with a different getchildren
function. This function will be applied to the tree (and subtrees) to determine whether we have reached a branch or a leaf. If getchildren
is applied to a branch, it should return an iterable of subtrees; if applied to a leaf, it should throw an exception.
With no arguments except its first, foldsh(tree)
returns a copy of tree
where the structure consists of newly constructed lists, and the leaves are shared with tree
. In the new structure, the sharing of subtrees is preserved.
Examples
def countsh(tree):
'''Count the number of leaf+branch nodes (count shared subtrees once).'''
def branch(tree,reschildren):
return 1 + sum(reschildren)
def leaf(tree):
return 1
def shared(tree):
return 0
return foldsh(tree,branch,leaf,shared)
def listleaves(tree):
'''List all leaves (but those from shared subtrees only once).'''
def leaf(tree):
return [tree]
def branch(tree,reschildren):
result = []
for rc in reschildren:
result.extend(rc)
return result
def shared(res):
return []
return foldsh(tree,branch,leaf,shared)
# Actually, this can also be elegantly solved by returning iterators as
# result. We do not have to make an exception for shared subtrees, because
# these (shared!) iterators will already be exhausted when encountered for
# the second time:
def iterleavesonce(tree):
'''Iterate all leaves (but those from shared subtrees only once).'''
def leaf(tree):
return iter([tree])
def branch(tree,reschildren):
return itertools.chain(*reschildren)
return foldsh(tree,branch,leaf)
# Build a tree out of frozensets instead of lists:
subtree = [42,44]
tree = [subtree,[subtree]]
fstree = foldsh(tree, branch= lambda t,c: frozenset(c))
# fstree is now:
# frozenset([frozenset([42, 44]), frozenset([frozenset([42, 44])])])
# with shared subtrees
Strings
Watch out with strings as leaves. The default getchildren
will go into infinite
recursion because iter('s')
will yield 's' again. To solve this, you can use:
def getchildren(tree):
assert not isinstance(tree,(str,unicode))
return iter(tree)
YAML
YAML is a human-readable language for data structure serialization with support for sharing. A YAML sequence corresponds to a Python list
and a YAML mapping to a dict
. We can use foldsh
to recurse into the data structure (for the dicts, we only recurse into the values), and, for example, reverse all the lists:
import yaml # from http://pyyaml.org/
def reverselists(tree):
def branch(tree,reschildren):
if isinstance(tree,dict):
return dict(zip(tree.iterkeys(),reschildren))
else:
return list(reversed(reschildren))
def getchildren(tree):
try:
return tree.itervalues()
except:
assert not isinstance(tree,str)
return iter(tree)
return foldsh(tree,branch,getchildren=getchildren)
y = yaml.load('''
ingredients:
- &s # this element is given label s and can be shared later
- spam
- SPAM
- eggs
- *s # like here
- *s # and here
''')
Sharing is preserved in the resulting YAML:
>>> print(yaml.dump(reverselists(y),default_flow_style=False))
ingredients:
- &id001
- SPAM
- spam
- *id001
- eggs
- *id001
Cycles
This recipe does not support cyclic data structures (it will go into infinite recursion). For a way to deal with this, see Recipe 578118.