Two linear-time algorithms for finding the strongly connected components of a directed graph. `strongly_connected_components_tree`

implements (a variant of) Tarjan's well-known algorithm for finding strongly connected components, while `strongly_connected_components_path`

implements a path-based algorithm due (in this form) to Gabow.

Edit: I added an iterative function `strongly_connected_components_iterative`

; this is a direct conversion of `strongly_connected_components_path`

into iterative form. It's therefore safe to use on high-depth graphs, without risk of running into Python's recursion limit.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 | ```
def strongly_connected_components_path(vertices, edges):
"""
Find the strongly connected components of a directed graph.
Uses a recursive linear-time algorithm described by Gabow [1]_ to find all
strongly connected components of a directed graph.
Parameters
----------
vertices : iterable
A sequence or other iterable of vertices. Each vertex should be
hashable.
edges : mapping
Dictionary (or mapping) that maps each vertex v to an iterable of the
vertices w that are linked to v by a directed edge (v, w).
Returns
-------
components : iterator
An iterator that yields sets of vertices. Each set produced gives the
vertices of one strongly connected component.
Raises
------
RuntimeError
If the graph is deep enough that the algorithm exceeds Python's
recursion limit.
Notes
-----
The algorithm has running time proportional to the total number of vertices
and edges. It's practical to use this algorithm on graphs with hundreds of
thousands of vertices and edges.
The algorithm is recursive. Deep graphs may cause Python to exceed its
recursion limit.
`vertices` will be iterated over exactly once, and `edges[v]` will be
iterated over exactly once for each vertex `v`. `edges[v]` is permitted to
specify the same vertex multiple times, and it's permissible for `edges[v]`
to include `v` itself. (In graph-theoretic terms, loops and multiple edges
are permitted.)
References
----------
.. [1] Harold N. Gabow, "Path-based depth-first search for strong and
biconnected components," Inf. Process. Lett. 74 (2000) 107--114.
.. [2] Robert E. Tarjan, "Depth-first search and linear graph algorithms,"
SIAM J.Comput. 1 (2) (1972) 146--160.
Examples
--------
Example from Gabow's paper [1]_.
>>> vertices = [1, 2, 3, 4, 5, 6]
>>> edges = {1: [2, 3], 2: [3, 4], 3: [], 4: [3, 5], 5: [2, 6], 6: [3, 4]}
>>> for scc in strongly_connected_components_path(vertices, edges):
... print(scc)
...
set([3])
set([2, 4, 5, 6])
set([1])
Example from Tarjan's paper [2]_.
>>> vertices = [1, 2, 3, 4, 5, 6, 7, 8]
>>> edges = {1: [2], 2: [3, 8], 3: [4, 7], 4: [5],
... 5: [3, 6], 6: [], 7: [4, 6], 8: [1, 7]}
>>> for scc in strongly_connected_components_path(vertices, edges):
... print(scc)
...
set([6])
set([3, 4, 5, 7])
set([8, 1, 2])
"""
identified = set()
stack = []
index = {}
boundaries = []
def dfs(v):
index[v] = len(stack)
stack.append(v)
boundaries.append(index[v])
for w in edges[v]:
if w not in index:
# For Python >= 3.3, replace with "yield from dfs(w)"
for scc in dfs(w):
yield scc
elif w not in identified:
while index[w] < boundaries[-1]:
boundaries.pop()
if boundaries[-1] == index[v]:
boundaries.pop()
scc = set(stack[index[v]:])
del stack[index[v]:]
identified.update(scc)
yield scc
for v in vertices:
if v not in index:
# For Python >= 3.3, replace with "yield from dfs(v)"
for scc in dfs(v):
yield scc
def strongly_connected_components_tree(vertices, edges):
"""
Find the strongly connected components of a directed graph.
Uses a recursive linear-time algorithm described by Tarjan [2]_ to find all
strongly connected components of a directed graph.
Parameters
----------
vertices : iterable
A sequence or other iterable of vertices. Each vertex should be
hashable.
edges : mapping
Dictionary (or mapping) that maps each vertex v to an iterable of the
vertices w that are linked to v by a directed edge (v, w).
Returns
-------
components : iterator
An iterator that yields sets of vertices. Each set produced gives the
vertices of one strongly connected component.
Raises
------
RuntimeError
If the graph is deep enough that the algorithm exceeds Python's
recursion limit.
Notes
-----
The algorithm has running time proportional to the total number of vertices
and edges. It's practical to use this algorithm on graphs with hundreds of
thousands of vertices and edges.
The algorithm is recursive. Deep graphs may cause Python to exceed its
recursion limit.
`vertices` will be iterated over exactly once, and `edges[v]` will be
iterated over exactly once for each vertex `v`. `edges[v]` is permitted to
specify the same vertex multiple times, and it's permissible for `edges[v]`
to include `v` itself. (In graph-theoretic terms, loops and multiple edges
are permitted.)
References
----------
.. [1] Harold N. Gabow, "Path-based depth-first search for strong and
biconnected components," Inf. Process. Lett. 74 (2000) 107--114.
.. [2] Robert E. Tarjan, "Depth-first search and linear graph algorithms,"
SIAM J.Comput. 1 (2) (1972) 146--160.
Examples
--------
Example from Gabow's paper [1]_.
>>> vertices = [1, 2, 3, 4, 5, 6]
>>> edges = {1: [2, 3], 2: [3, 4], 3: [], 4: [3, 5], 5: [2, 6], 6: [3, 4]}
>>> for scc in strongly_connected_components_tree(vertices, edges):
... print(scc)
...
set([3])
set([2, 4, 5, 6])
set([1])
Example from Tarjan's paper [2]_.
>>> vertices = [1, 2, 3, 4, 5, 6, 7, 8]
>>> edges = {1: [2], 2: [3, 8], 3: [4, 7], 4: [5],
... 5: [3, 6], 6: [], 7: [4, 6], 8: [1, 7]}
>>> for scc in strongly_connected_components_tree(vertices, edges):
... print(scc)
...
set([6])
set([3, 4, 5, 7])
set([8, 1, 2])
"""
identified = set()
stack = []
index = {}
lowlink = {}
def dfs(v):
index[v] = len(stack)
stack.append(v)
lowlink[v] = index[v]
for w in edges[v]:
if w not in index:
# For Python >= 3.3, replace with "yield from dfs(w)"
for scc in dfs(w):
yield scc
lowlink[v] = min(lowlink[v], lowlink[w])
elif w not in identified:
lowlink[v] = min(lowlink[v], lowlink[w])
if lowlink[v] == index[v]:
scc = set(stack[index[v]:])
del stack[index[v]:]
identified.update(scc)
yield scc
for v in vertices:
if v not in index:
# For Python >= 3.3, replace with "yield from dfs(v)"
for scc in dfs(v):
yield scc
def strongly_connected_components_iterative(vertices, edges):
"""
This is a non-recursive version of strongly_connected_components_path.
See the docstring of that function for more details.
Examples
--------
Example from Gabow's paper [1]_.
>>> vertices = [1, 2, 3, 4, 5, 6]
>>> edges = {1: [2, 3], 2: [3, 4], 3: [], 4: [3, 5], 5: [2, 6], 6: [3, 4]}
>>> for scc in strongly_connected_components_iterative(vertices, edges):
... print(scc)
...
set([3])
set([2, 4, 5, 6])
set([1])
Example from Tarjan's paper [2]_.
>>> vertices = [1, 2, 3, 4, 5, 6, 7, 8]
>>> edges = {1: [2], 2: [3, 8], 3: [4, 7], 4: [5],
... 5: [3, 6], 6: [], 7: [4, 6], 8: [1, 7]}
>>> for scc in strongly_connected_components_iterative(vertices, edges):
... print(scc)
...
set([6])
set([3, 4, 5, 7])
set([8, 1, 2])
"""
identified = set()
stack = []
index = {}
boundaries = []
for v in vertices:
if v not in index:
to_do = [('VISIT', v)]
while to_do:
operation_type, v = to_do.pop()
if operation_type == 'VISIT':
index[v] = len(stack)
stack.append(v)
boundaries.append(index[v])
to_do.append(('POSTVISIT', v))
# We reverse to keep the search order identical to that of
# the recursive code; the reversal is not necessary for
# correctness, and can be omitted.
to_do.extend(
reversed([('VISITEDGE', w) for w in edges[v]]))
elif operation_type == 'VISITEDGE':
if v not in index:
to_do.append(('VISIT', v))
elif v not in identified:
while index[v] < boundaries[-1]:
boundaries.pop()
else:
# operation_type == 'POSTVISIT'
if boundaries[-1] == index[v]:
boundaries.pop()
scc = set(stack[index[v]:])
del stack[index[v]:]
identified.update(scc)
yield scc
``` |

A "strongly connected component" of a directed graph is a maximal subgraph such that any vertex in the subgraph is reachable from any other; any directed graph can be decomposed into its strongly connected components.

These recipes arose from code to find CPython reference cycles, and will quite happily run on graphs containing hundreds of thousands of vertices and edges. It's striking how similar the two algorithms look in this form: they both do a depth-first traversal of the whole graph, yielding strongly connected components as they're found, and they differ only in the single auxiliary structure (`boundaries`

in the case of the path-based algorithm; `lowlink`

in the case of the tree-based algorithm) that's used to detect that a strongly connected component has been identified.

Tarjan's algorithm has some minor variations from the published version, but still retains the characteristic use of `lowlink`

to identify strongly connected components. The first variation is that we maintain a set `identified`

containing all vertices that belong to the strongly connected components identified so far, and use this instead of checking whether w is in the current stack in the `elif`

condition of `dfs`

. (At any point in the algorithm, each vertex is exactly one of (1) not yet visited, (2) in `identified`

, or (3) in `stack`

. The vertices in `index`

are a union of those in `identified`

and `stack`

.) The second variation is that instead of being numbered consecutively starting at 1, vertices are numbered according to their depth in the current stack. A nice side-effect of this is that once a strongly connected component has been identified, it's easy to extract it from the stack with a slicing operation.

Both functions are recursive, and so can raise `RuntimeError`

on really deep graphs; it's unusual for this to happen on graphs of objects and object references. It's left as a challenge to convert either algorithm to iterative form.

Looked at the last of these algorithms and notice that you are using a dictionary for index. Given that the vertices are denoted by integers would it not be more sensible to use a list to store the values since list indexing is faster than dict look ups?

index = {} ==> index = (max(vertices)+1)*[None]

v not in index ==> index[v] is None

Even if vertices and edges aren't actual integers there's an easy O(n+m) conversion to integers which can be applied before starting the algorithm. I've tested a modified version and it does seem a few percent faster on your examples.

In the applications that I care about, the vertices are not consecutive integers. So no, a list wouldn't work here. Yes, you could convert, but that conversion would almost certainly involve building another dictionary. Given that these are linear-time algorithms, the cost of the conversion would likely outweigh any speedup from the algorithm.

I guess the storage requirement for a sparse integer vertex set is an issue, however your assumption that the algorithm is linear time depends on the set/get time of python dicts which are used for both the digraph structure and index.

According to http://wiki.python.org/moin/TimeComplexity the worst case amortized time could be O(n) which would make the algorithms quite expensive.

The worst case is unlikely, but after the recent kerfuffle about dictionary indexing attacks (http://bugs.python.org/issue13703) we do know they can happen.

If you are after a highly optimised SCC algorithm, then Scipy provides an implementation as part of its sparse graph library.

http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.connected_components.html

The algorithm used here is an improved versions of Tarjan's algorithm which is optimised for memory usage without any loss of speed. Details of the implementation can be found here