Welcome, guest | Sign In | My Account | Store | Cart

The function foldsh in this recipe is a general purpose tool for transforming tree-like recursive data structures while keeping track of shared subtrees.

# By default, a branch is encoded as a list of subtrees; each subtree can be a
# branch or a leaf (=anything non-iterable). Subtrees can be shared:
>>> subtree = [42,44]
>>> tree = [subtree,[subtree]]
# We can apply a function to all leaves:
>>> foldsh(tree, leaf= lambda x: x+1)
[[43, 45], [[43, 45]]]
# Or apply a function to the branches:
>>> foldsh(tree, branch= lambda t,c: list(reversed(c)))
[[[44, 42]], [44, 42]]
# The sharing is preserved:
>>> _[0][0] is _[1]
True
# Summing up the leaves without double counting of shared subtrees:
>>> foldsh(tree, branch= lambda t,c: sum(c), shared= lambda x: 0)
86

In particular, it is useful for transforming YAML documents. An example of this is given below.

Python, 28 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
def identity(x):
    return x

def foldsh(tree, branch= lambda tree,ch: list(ch), leaf=identity, shared=identity, getchildren=iter):
    '''Sharing-aware tree transformations.
    
    foldsh(tree,branch,leaf,shared,getchildren) :: tree -> result
    
    branch      :: tree, list of children's results -> result
    leaf        :: tree -> result
    shared      :: result -> result
    getchildren :: tree -> iterable of children'''
    
    mem = dict()
    def _fold(tree):
        if id(tree) in mem:
            return shared(mem[id(tree)])
        else:
            try:
                children = getchildren(tree)
            except:
                res = leaf(tree)
            else:
                reschildren = [_fold(child) for child in children]
                res = branch(tree,reschildren)
            mem[id(tree)] = res
            return res
    return _fold(tree)
Changelog

Revision 2: rewrite using the try...except...else idiom which I didn't know before and makes the logical structure clearer. Functionality is unchanged.

Background

Many transformations on trees follow the same general pattern. In a recursive traversal,

  • first, leaves are transformed into results (here, this is the task of the function leaf);
  • when all subtrees (children) of a branch have been transformed, the results of these children are combined into a result for the branch (the task of function branch);
  • this way, results are passed up the tree until the result for the whole tree is constructed.

In this recipe, the recursion is performed by foldsh; the transformation functions leaf and branch are supplied by the user. Note that a "result" can be anything, for example

  • an integer counting the number of leaves, depth of the tree, maximal breadth of branches;
  • a list of all the leaves in the tree;
  • another tree.

The general pattern is known as a fold (and in Python as reduce; but both these terms are mostly used for inductive lists) or catamorphism.

In addition to implementing a regular fold, this recipe supports sharing of subtrees. (So the data structure is actually a DAG, but we'll still say tree.) If during the recursive traversal foldsh encounters a subtree that it has seen before, it will not recurse into it, but instead take the previously constructed result, and apply shared to it.

To make foldsh recurse over other tree-like structures, supply it with a different getchildren function. This function will be applied to the tree (and subtrees) to determine whether we have reached a branch or a leaf. If getchildren is applied to a branch, it should return an iterable of subtrees; if applied to a leaf, it should throw an exception.

With no arguments except its first, foldsh(tree) returns a copy of tree where the structure consists of newly constructed lists, and the leaves are shared with tree. In the new structure, the sharing of subtrees is preserved.

Examples
def countsh(tree):
    '''Count the number of leaf+branch nodes (count shared subtrees once).'''
    def branch(tree,reschildren):
        return 1 + sum(reschildren)
    def leaf(tree):
        return 1
    def shared(tree):
        return 0
    return foldsh(tree,branch,leaf,shared)

def listleaves(tree):
    '''List all leaves (but those from shared subtrees only once).'''
    def leaf(tree):
        return [tree]
    def branch(tree,reschildren):
        result = []
        for rc in reschildren:
            result.extend(rc)
        return result
    def shared(res):
        return []
    return foldsh(tree,branch,leaf,shared)

# Actually, this can also be elegantly solved by returning iterators as
# result. We do not have to make an exception for shared subtrees, because
# these (shared!) iterators will already be exhausted when encountered for
# the second time:

def iterleavesonce(tree):
    '''Iterate all leaves (but those from shared subtrees only once).'''
    def leaf(tree):
        return iter([tree])
    def branch(tree,reschildren):
        return itertools.chain(*reschildren)
    return foldsh(tree,branch,leaf)

# Build a tree out of frozensets instead of lists:
subtree = [42,44]
tree = [subtree,[subtree]]
fstree = foldsh(tree, branch= lambda t,c: frozenset(c))
# fstree is now:
# frozenset([frozenset([42, 44]), frozenset([frozenset([42, 44])])])
# with shared subtrees
Strings

Watch out with strings as leaves. The default getchildren will go into infinite recursion because iter('s') will yield 's' again. To solve this, you can use:

def getchildren(tree):
    assert not isinstance(tree,(str,unicode))
    return iter(tree)
YAML

YAML is a human-readable language for data structure serialization with support for sharing. A YAML sequence corresponds to a Python list and a YAML mapping to a dict. We can use foldsh to recurse into the data structure (for the dicts, we only recurse into the values), and, for example, reverse all the lists:

import yaml    # from http://pyyaml.org/

def reverselists(tree):
   def branch(tree,reschildren):
      if isinstance(tree,dict):
         return dict(zip(tree.iterkeys(),reschildren))
      else:
         return list(reversed(reschildren))
   def getchildren(tree):
      try:
         return tree.itervalues()
      except:
         assert not isinstance(tree,str)
         return iter(tree)
   return foldsh(tree,branch,getchildren=getchildren)

y = yaml.load('''
ingredients:
  - &s       # this element is given label s and can be shared later
    - spam
    - SPAM
  - eggs
  - *s       # like here
  - *s       # and here
''')

Sharing is preserved in the resulting YAML:

>>> print(yaml.dump(reverselists(y),default_flow_style=False))
ingredients:
- &id001
  - SPAM
  - spam
- *id001
- eggs
- *id001
Cycles

This recipe does not support cyclic data structures (it will go into infinite recursion). For a way to deal with this, see Recipe 578118.

Created by Sander Evers on Wed, 2 May 2012 (MIT)
Python recipes (4591)
Sander Evers's recipes (6)

Required Modules

  • (none specified)

Other Information and Tasks