Strongly connected components of a directed graph.

Two linear-time algorithms for finding the strongly connected components of a directed graph. strongly_connected_components_tree implements (a variant of) Tarjan's well-known algorithm for finding strongly connected components, while strongly_connected_components_path implements a path-based algorithm due (in this form) to Gabow.

Edit: I added an iterative function strongly_connected_components_iterative; this is a direct conversion of strongly_connected_components_path into iterative form. It's therefore safe to use on high-depth graphs, without risk of running into Python's recursion limit.

def strongly_connected_components_path(vertices, edges):
    """
    Find the strongly connected components of a directed graph.

    Uses a recursive linear-time algorithm described by Gabow [1]_ to find all
    strongly connected components of a directed graph.

    Parameters
    ----------
    vertices : iterable
        A sequence or other iterable of vertices.  Each vertex should be
        hashable.

    edges : mapping
        Dictionary (or mapping) that maps each vertex v to an iterable of the
        vertices w that are linked to v by a directed edge (v, w).

    Returns
    -------
    components : iterator
        An iterator that yields sets of vertices.  Each set produced gives the
        vertices of one strongly connected component.

    Raises
    ------
    RuntimeError
        If the graph is deep enough that the algorithm exceeds Python's
        recursion limit.

    Notes
    -----
    The algorithm has running time proportional to the total number of vertices
    and edges.  It's practical to use this algorithm on graphs with hundreds of
    thousands of vertices and edges.

    The algorithm is recursive.  Deep graphs may cause Python to exceed its
    recursion limit.

    `vertices` will be iterated over exactly once, and `edges[v]` will be
    iterated over exactly once for each vertex `v`.  `edges[v]` is permitted to
    specify the same vertex multiple times, and it's permissible for `edges[v]`
    to include `v` itself.  (In graph-theoretic terms, loops and multiple edges
    are permitted.)

    References
    ----------
    .. [1] Harold N. Gabow, "Path-based depth-first search for strong and
       biconnected components," Inf. Process. Lett. 74 (2000) 107--114.

    .. [2] Robert E. Tarjan, "Depth-first search and linear graph algorithms,"
       SIAM J.Comput. 1 (2) (1972) 146--160.

    Examples
    --------
    Example from Gabow's paper [1]_.

    >>> vertices = [1, 2, 3, 4, 5, 6]
    >>> edges = {1: [2, 3], 2: [3, 4], 3: [], 4: [3, 5], 5: [2, 6], 6: [3, 4]}
    >>> for scc in strongly_connected_components_path(vertices, edges):
    ...     print(scc)
    ...
    set([3])
    set([2, 4, 5, 6])
    set([1])

    Example from Tarjan's paper [2]_.

    >>> vertices = [1, 2, 3, 4, 5, 6, 7, 8]
    >>> edges = {1: [2], 2: [3, 8], 3: [4, 7], 4: [5],
    ...          5: [3, 6], 6: [], 7: [4, 6], 8: [1, 7]}
    >>> for scc in strongly_connected_components_path(vertices, edges):
    ...     print(scc)
    ...
    set([6])
    set([3, 4, 5, 7])
    set([8, 1, 2])

    """
    identified = set()
    stack = []
    index = {}
    boundaries = []

    def dfs(v):
        index[v] = len(stack)
        stack.append(v)
        boundaries.append(index[v])

        for w in edges[v]:
            if w not in index:
                # For Python >= 3.3, replace with "yield from dfs(w)"
                for scc in dfs(w):
                    yield scc
            elif w not in identified:
                while index[w] < boundaries[-1]:
                    boundaries.pop()

        if boundaries[-1] == index[v]:
            boundaries.pop()
            scc = set(stack[index[v]:])
            del stack[index[v]:]
            identified.update(scc)
            yield scc

    for v in vertices:
        if v not in index:
            # For Python >= 3.3, replace with "yield from dfs(v)"
            for scc in dfs(v):
                yield scc


def strongly_connected_components_tree(vertices, edges):
    """
    Find the strongly connected components of a directed graph.

    Uses a recursive linear-time algorithm described by Tarjan [2]_ to find all
    strongly connected components of a directed graph.

    Parameters
    ----------
    vertices : iterable
        A sequence or other iterable of vertices.  Each vertex should be
        hashable.

    edges : mapping
        Dictionary (or mapping) that maps each vertex v to an iterable of the
        vertices w that are linked to v by a directed edge (v, w).

    Returns
    -------
    components : iterator
        An iterator that yields sets of vertices.  Each set produced gives the
        vertices of one strongly connected component.

    Raises
    ------
    RuntimeError
        If the graph is deep enough that the algorithm exceeds Python's
        recursion limit.

    Notes
    -----
    The algorithm has running time proportional to the total number of vertices
    and edges.  It's practical to use this algorithm on graphs with hundreds of
    thousands of vertices and edges.

    The algorithm is recursive.  Deep graphs may cause Python to exceed its
    recursion limit.

    `vertices` will be iterated over exactly once, and `edges[v]` will be
    iterated over exactly once for each vertex `v`.  `edges[v]` is permitted to
    specify the same vertex multiple times, and it's permissible for `edges[v]`
    to include `v` itself.  (In graph-theoretic terms, loops and multiple edges
    are permitted.)

    References
    ----------
    .. [1] Harold N. Gabow, "Path-based depth-first search for strong and
       biconnected components," Inf. Process. Lett. 74 (2000) 107--114.

    .. [2] Robert E. Tarjan, "Depth-first search and linear graph algorithms,"
       SIAM J.Comput. 1 (2) (1972) 146--160.

    Examples
    --------
    Example from Gabow's paper [1]_.

    >>> vertices = [1, 2, 3, 4, 5, 6]
    >>> edges = {1: [2, 3], 2: [3, 4], 3: [], 4: [3, 5], 5: [2, 6], 6: [3, 4]}
    >>> for scc in strongly_connected_components_tree(vertices, edges):
    ...     print(scc)
    ...
    set([3])
    set([2, 4, 5, 6])
    set([1])

    Example from Tarjan's paper [2]_.

    >>> vertices = [1, 2, 3, 4, 5, 6, 7, 8]
    >>> edges = {1: [2], 2: [3, 8], 3: [4, 7], 4: [5],
    ...          5: [3, 6], 6: [], 7: [4, 6], 8: [1, 7]}
    >>> for scc in strongly_connected_components_tree(vertices, edges):
    ...     print(scc)
    ...
    set([6])
    set([3, 4, 5, 7])
    set([8, 1, 2])

    """
    identified = set()
    stack = []
    index = {}
    lowlink = {}

    def dfs(v):
        index[v] = len(stack)
        stack.append(v)
        lowlink[v] = index[v]

        for w in edges[v]:
            if w not in index:
                # For Python >= 3.3, replace with "yield from dfs(w)"
                for scc in dfs(w):
                    yield scc
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w not in identified:
                lowlink[v] = min(lowlink[v], lowlink[w])

        if lowlink[v] == index[v]:
            scc = set(stack[index[v]:])
            del stack[index[v]:]
            identified.update(scc)
            yield scc

    for v in vertices:
        if v not in index:
            # For Python >= 3.3, replace with "yield from dfs(v)"
            for scc in dfs(v):
                yield scc


def strongly_connected_components_iterative(vertices, edges):
    """
    This is a non-recursive version of strongly_connected_components_path.
    See the docstring of that function for more details.

    Examples
    --------
    Example from Gabow's paper [1]_.

    >>> vertices = [1, 2, 3, 4, 5, 6]
    >>> edges = {1: [2, 3], 2: [3, 4], 3: [], 4: [3, 5], 5: [2, 6], 6: [3, 4]}
    >>> for scc in strongly_connected_components_iterative(vertices, edges):
    ...     print(scc)
    ...
    set([3])
    set([2, 4, 5, 6])
    set([1])

    Example from Tarjan's paper [2]_.

    >>> vertices = [1, 2, 3, 4, 5, 6, 7, 8]
    >>> edges = {1: [2], 2: [3, 8], 3: [4, 7], 4: [5],
    ...          5: [3, 6], 6: [], 7: [4, 6], 8: [1, 7]}
    >>> for scc in strongly_connected_components_iterative(vertices, edges):
    ...     print(scc)
    ...
    set([6])
    set([3, 4, 5, 7])
    set([8, 1, 2])

    """
    identified = set()
    stack = []
    index = {}
    boundaries = []

    for v in vertices:
        if v not in index:
            to_do = [('VISIT', v)]
            while to_do:
                operation_type, v = to_do.pop()
                if operation_type == 'VISIT':
                    index[v] = len(stack)
                    stack.append(v)
                    boundaries.append(index[v])
                    to_do.append(('POSTVISIT', v))
                    # We reverse to keep the search order identical to that of
                    # the recursive code;  the reversal is not necessary for
                    # correctness, and can be omitted.
                    to_do.extend(
                        reversed([('VISITEDGE', w) for w in edges[v]]))
                elif operation_type == 'VISITEDGE':
                    if v not in index:
                        to_do.append(('VISIT', v))
                    elif v not in identified:
                        while index[v] < boundaries[-1]:
                            boundaries.pop()
                else:
                    # operation_type == 'POSTVISIT'
                    if boundaries[-1] == index[v]:
                        boundaries.pop()
                        scc = set(stack[index[v]:])
                        del stack[index[v]:]
                        identified.update(scc)
                        yield scc

      

A "strongly connected component" of a directed graph is a maximal subgraph such that any vertex in the subgraph is reachable from any other; any directed graph can be decomposed into its strongly connected components.
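To make the definition concrete, here is a minimal sketch that computes components straight from it: two vertices share a component exactly when each is reachable from the other. The names reachable_from and naive_components are made up for this illustration and are not part of the recipe. The approach takes quadratic time, so it is useful only as a cross-check on small graphs, and it reports components in order of first appearance rather than in the order the recipe's functions yield them.

```python
def reachable_from(start, edges):
    """Return the set of vertices reachable from start (including start)."""
    seen = {start}
    to_do = [start]
    while to_do:
        v = to_do.pop()
        for w in edges[v]:
            if w not in seen:
                seen.add(w)
                to_do.append(w)
    return seen


def naive_components(vertices, edges):
    """Quadratic-time components, computed directly from the definition."""
    vertices = list(vertices)
    reach = {v: reachable_from(v, edges) for v in vertices}
    components = []
    assigned = set()
    for v in vertices:
        if v not in assigned:
            # w shares a component with v when each can reach the other.
            scc = {w for w in reach[v] if v in reach[w]}
            assigned |= scc
            components.append(scc)
    return components


# Graph from the Gabow example in the docstrings above.
edges = {1: [2, 3], 2: [3, 4], 3: [], 4: [3, 5], 5: [2, 6], 6: [3, 4]}
components = naive_components([1, 2, 3, 4, 5, 6], edges)
```

On this graph the sketch finds the same three components as the doctests: {1}, {2, 4, 5, 6} and {3}.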

These recipes arose from code to find CPython reference cycles, and will quite happily run on graphs containing hundreds of thousands of vertices and edges. It's striking how similar the two algorithms look in this form: they both do a depth-first traversal of the whole graph, yielding strongly connected components as they're found, and they differ only in the single auxiliary structure (boundaries in the case of the path-based algorithm; lowlink in the case of the tree-based algorithm) that's used to detect that a strongly connected component has been identified.

The implementation of Tarjan's algorithm here has some minor variations from the published version, but still retains the characteristic use of lowlink to identify strongly connected components. The first variation is that we maintain a set identified containing all vertices that belong to the strongly connected components identified so far, and use this instead of checking whether w is in the current stack in the elif condition of dfs. (At any point in the algorithm, each vertex is exactly one of (1) not yet visited, (2) in identified, or (3) in stack. The vertices in index are the union of those in identified and those in stack.) The second variation is that instead of being numbered consecutively starting at 1, vertices are numbered according to their depth in the current stack. A nice side-effect of this is that once a strongly connected component has been identified, it's easy to extract it from the stack with a slicing operation.
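The invariant in the parenthetical remark can be checked mechanically. The following is a sketch only: it is a cut-down, non-generator copy of the tree-based function with assertions added, and the names components_with_checks and check_invariant are invented for this illustration. The checks rebuild a set from the stack on every call, so this version is not linear-time and is meant for small test graphs.

```python
def components_with_checks(vertices, edges):
    identified, stack, index, lowlink = set(), [], {}, {}
    result = []

    def check_invariant():
        on_stack = set(stack)
        # No vertex is both on the stack and already identified ...
        assert not (on_stack & identified)
        # ... and index holds exactly the vertices in one of the two.
        assert set(index) == on_stack | identified
        # Each stacked vertex is numbered by its depth in the stack.
        assert all(index[v] == i for i, v in enumerate(stack))

    def dfs(v):
        index[v] = lowlink[v] = len(stack)
        stack.append(v)
        check_invariant()
        for w in edges[v]:
            if w not in index:
                dfs(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w not in identified:
                lowlink[v] = min(lowlink[v], lowlink[w])
        if lowlink[v] == index[v]:
            # Depth numbering makes the component a tail slice of the stack.
            scc = set(stack[index[v]:])
            del stack[index[v]:]
            identified.update(scc)
            result.append(scc)
        check_invariant()

    for v in vertices:
        if v not in index:
            dfs(v)
    return result


# Graph from the Tarjan example in the docstrings above.
edges = {1: [2], 2: [3, 8], 3: [4, 7], 4: [5],
         5: [3, 6], 6: [], 7: [4, 6], 8: [1, 7]}
result = components_with_checks([1, 2, 3, 4, 5, 6, 7, 8], edges)
```

The stack only ever shrinks by losing a tail slice, which is why a vertex still on the stack keeps the position it was given when it was pushed.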

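The functions strongly_connected_components_path and strongly_connected_components_tree are recursive, and so can raise RuntimeError (RecursionError from Python 3.5 onwards, itself a subclass of RuntimeError) on really deep graphs; it's unusual for this to happen on graphs of objects and object references. Converting either algorithm to iterative form was originally left as a challenge; strongly_connected_components_iterative is such a conversion of the path-based algorithm.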
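A recursive generator can also be made iterative by keeping each vertex's partly consumed successor iterator on an explicit stack. The sketch below applies that idea to the path-based algorithm. It is an alternative to the VISIT/VISITEDGE/POSTVISIT conversion used in strongly_connected_components_iterative, not a replacement for it, and the name components_iterator_stack is made up for this illustration.

```python
def components_iterator_stack(vertices, edges):
    identified, stack, index, boundaries = set(), [], {}, []
    for root in vertices:
        if root in index:
            continue
        index[root] = len(stack)
        stack.append(root)
        boundaries.append(index[root])
        # Each entry pairs a vertex with an iterator over the successors
        # of that vertex that have not been examined yet.
        to_do = [(root, iter(edges[root]))]
        while to_do:
            v, successors = to_do[-1]
            for w in successors:
                if w not in index:
                    # Descend into w; v's iterator resumes here later.
                    index[w] = len(stack)
                    stack.append(w)
                    boundaries.append(index[w])
                    to_do.append((w, iter(edges[w])))
                    break
                elif w not in identified:
                    while index[w] < boundaries[-1]:
                        boundaries.pop()
            else:
                # No break: every successor of v has been examined.
                to_do.pop()
                if boundaries[-1] == index[v]:
                    boundaries.pop()
                    scc = set(stack[index[v]:])
                    del stack[index[v]:]
                    identified.update(scc)
                    yield scc


# A chain far deeper than Python's default recursion limit.
n = 100000
chain = {i: [i + 1] for i in range(n - 1)}
chain[n - 1] = []
deep_components = list(components_iterator_stack(range(n), chain))
```

Because each edges[v] is wrapped in iter() once and consumed lazily, this version still iterates over edges[v] exactly once per vertex and never builds a reversed copy of the successor list.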

Tags: connected, directed, graph, strong, tarjan

4 comments

Robin Becker 11 years ago

I looked at the last of these algorithms and noticed that you are using a dictionary for index. Given that the vertices are denoted by integers, would it not be more sensible to use a list to store the values, since list indexing is faster than dict lookups?

index = {} ==> index = (max(vertices)+1)*[None]

v not in index ==> index[v] is None

Even if vertices and edges aren't actual integers there's an easy O(n+m) conversion to integers which can be applied before starting the algorithm. I've tested a modified version and it does seem a few percent faster on your examples.

Mark Dickinson (author) 11 years ago

In the applications that I care about, the vertices are not consecutive integers. So no, a list wouldn't work here. Yes, you could convert, but that conversion would almost certainly involve building another dictionary. Given that these are linear-time algorithms, the cost of the conversion would likely outweigh any speedup from the algorithm.
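For illustration, the integer conversion discussed in this exchange might look like the sketch below. The helper name relabel and the sample graph are hypothetical and not part of the recipe. Note that the sketch builds exactly the kind of extra dictionary mentioned above, and that it assumes every vertex appearing in edges also appears in vertices.

```python
def relabel(vertices, edges):
    """Map arbitrary hashable vertices to 0..n-1 (hypothetical helper)."""
    vertices = list(vertices)
    # This dictionary is the extra structure the conversion needs.
    number = {v: i for i, v in enumerate(vertices)}
    # adjacency[i] lists the integer labels of the successors of vertex i.
    adjacency = [[number[w] for w in edges[v]] for v in vertices]
    return vertices, adjacency


names, adjacency = relabel(
    ['a', 'b', 'c'], {'a': ['b'], 'b': ['c', 'a'], 'c': []})
```

Here names[i] recovers the original vertex for integer label i, so results computed on the integer graph can be translated back.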

Robin Becker 11 years ago

I guess the storage requirement for a sparse integer vertex set is an issue. However, your assumption that the algorithm is linear time depends on the set/get time of Python dicts, which are used for both the digraph structure and index.

According to http://wiki.python.org/moin/TimeComplexity the worst case amortized time could be O(n) which would make the algorithms quite expensive.

The worst case is unlikely, but after the recent kerfuffle about dictionary indexing attacks (http://bugs.python.org/issue13703) we do know they can happen.

Tim Leslie 10 years, 11 months ago

If you are after a highly optimised SCC algorithm, then SciPy provides an implementation as part of its sparse graph library.

http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.connected_components.html

The algorithm used here is an improved version of Tarjan's algorithm which is optimised for memory usage without any loss of speed. Details of the implementation can be found here

Strongly connected components of a directed graph. (Python recipe) by Mark Dickinson
ActiveState Code (http://code.activestate.com/recipes/578507/)
