ActiveState Code

Recipe 161542: Directory Walker Generator


A generator that does what os.path.walk does, without the need for a callback function.

Python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
from __future__ import generators # needed for Python 2.2
import os
import stat
import types

def walktree(top=".", depthfirst=False):
    """
    Walk a directory tree, starting from 'top'

    This code is based on os.path.walk, with the callback function
    replaced by a yield and recursion replaced by iteration
    """
    if type(top) != types.StringType:
        raise TypeError("top must be a string")

    # decide which end of the stack to pop
    if depthfirst:
        index = -1
    else:
        index = 0
    
    dirstack = [top]
    while dirstack:
        top = dirstack.pop(index)
        try:
            names = os.listdir(top)
        except os.error:
             return
        yield top, names
        dirs = []
        for name in names:
            name = os.path.join(top, name)
            try:
                st = os.lstat(name)
            except os.error:
                continue
            if stat.S_ISDIR(st.st_mode):
                dirs.append(name)
        # the depth-first results look 'backwards' to me
        # so I'm changing them to fit my expectations
        if depthfirst:
            dirs.reverse()
        dirstack.extend(dirs)

if __name__ == "__main__":
    """
    a sample:
    look for python scripts starting in the current directory
    """
    
    for top, names in walktree():
        print "searching folder %s" % (top,)
        for name in names:
            if name[-3:] == ".py":
                print "I think %s is a python script" % (name,)

Discussion

This is a rewrite of os.path.walk. It's convenient for processing files that inhabit directory trees. Its original purpose was to vist the '.h' files of a large C++ project and wrap them in#ifdef include guards.

Thanks to Tim Peters for pointing out that using an internal stack instead of recursion gives one the choice of depth-first or width-first visits.

Thanks to Raymond Hettinger for pointing out st.st_mode and urging me to get rid of the global search mode definitions.

Thanks to Noah Spurrier for suggesting a sample.

Comments

  1. 1. At 9:45 p.m. on 24 mar 2003, Noah Spurrier said:

    Neat, but give an example! This looks like a neat use of generators, but it needs an example of its use. Maybe give an example of converting all filenames to lowercase or something.

  2. 2. At 12:23 p.m. on 19 apr 2003, Noah Spurrier said:

    Depth First is broken. I just noticed that Depthfirst is broken. It still walks the tree breadth first. A fix is not easy as it would involve recursion or an additional stack. I'm not sure how recurssion works with generators.

    For example, take the following directory tree:

        AAA
       /   \
    BBB     CCC
           /   \
        DDD     EEE
    

    This structure can be created using these commands:

    mkdir AAA
    mkdir AAA/BBB
    mkdir AAA/CCC
    mkdir AAA/CCC/DDD
    mkdir AAA/CCC/EEE
    

    Then run this python code to walk the tree:

    root = 'AAA'
    w = walktree (root, True)
    for (basepath, children) in w:
        print basepath, ':', children
    

    The expected output for depth first tree traversal is this:

    AAA/BBB : []
    AAA/CCC/DDD : []
    AAA/CCC/EEE : []
    AAA/CCC : ['DDD', 'EEE']
    AAA : ['BBB', 'CCC']
    

    But it does not produce this. The actual output is below. This is actually still BREADTH first, but following the RIGHTMOST path first.

    AAA : ['BBB', 'CCC']
    AAA/CCC : ['DDD', 'EEE']
    AAA/CCC/EEE : []
    AAA/CCC/DDD : []
    AAA/BBB : []
    

    With depthfirst set to False you still get BREADTH first, but following the LEFTMOST path first:

    AAA : ['BBB', 'CCC']
    AAA/BBB : []
    AAA/CCC : ['DDD', 'EEE']
    AAA/CCC/DDD : []
    AAA/CCC/EEE : []
    

    Yours, Noah

  3. 3. At 2:38 p.m. on 19 apr 2003, Noah Spurrier said:

    Fix for depth first problem. The depth first problem is important. Consider how would you easily write a directory walker to rename all files AND directories to lower case? The obvious code does not do what one would expect:

    from walktree import *
    for (basepath, children) in walktree():
        for child in children:
            old_name = os.path.join(basepath, child)
            new_name = os.path.join(basepath, child.lower())
                print old_name, "-->", new_name
            os.rename (old_name, new_name)
    

    That will try to rename the directories in the wrong order. It will try to rename a directory before it has visited the children of that directory. Given the following directory tree:

        AAA
       /   \
    BBB     CCC
           /   \
        DDD     EEE
    

    That code will try to do this:

    os.rename ('./AAA/CCC', './AAA/ccc')
    

    then later:

    os.rename ('./AAA/CCC/DDD', './AAA/CCC/ddd') # ERROR!
    

    But AAA/CCC does not exist anymore. It was previously renamed to AAA/ccc!

    The easiest way to fix walktree() is to use recursion. It's interesting to note that generators may be nested recursively. The following code fixes walktree to make it work properly for depth first tree walking.

    Yours,

    Noah Spurrier

    from __future__ import generators # needed for Python 2.2
    import os
    import stat
    
    def walktree (top = ".", depthfirst = True):
        """This walks a directory tree, starting from 'top'.
        This is somewhat like os.path.walk, but using
        generators instead of a visit function.
        One important difference is that walktree() defaults
        to DEPTH first with optional BREADTH first,
        whereas the os.path.walk function allows only BREADTH first.
        Depth first was made the default because it is safer if
        you are going to be modifying directory names. This avoids
        the problem of renaming a directory before visiting the children
        of that directory.
    
        """
        names = os.listdir(top)
        if not depthfirst:
            yield top, names
        for name in names:
            try:
                st = os.lstat(os.path.join(top, name))
            except os.error:
                continue
            if stat.S_ISDIR(st.st_mode):
                for (newtop, children) in walktree (os.path.join(top, name), depthfirst):
                    yield newtop, children
        if depthfirst:
            yield top, names
    
    def test():
        for (basepath, children) in walktree():
            for child in children:
            print os.path.join(basepath, child)
    
    if __name__ == '__main__':
        test()
    
  4. 4. At 11:29 p.m. on 29 jul 2003, Andrew Dalke said:

    os.walk. Also, see 'os.walk', added in Python 2.3

Sign in to comment