
Walker encapsulates os.walk's directory traversal as an object with the added features of excluded directories and a hook for calling an outside function to act on each file.

Walker can easily be subclassed for more functionality, as with ReWalker which filters filenames in traversal by a regular expression.

Python, 72 lines
"""
Walker encapsulates os.walk's directory traversal as an object with 
the added features of excluded directories and a hook for calling 
an outside function to act on each file.  Walker can easily be 
subclassed for more functionality.

ReWalker filters filenames in traversal by a regular expression.

Jack Trainor 2007
"""
import os, os.path
import re

class Walker(object):
    def __init__(self, dir, executeHook=None, excludeDirs=[]):
        self.dir = dir
        self.executeHook = executeHook
        self.excludeDirs = excludeDirs
        
    def isValidFile(self, fileName):
        return True
        
    def isValidDir(self, dir):
        head, tail = os.path.split(dir)
        valid = tail not in self.excludeDirs
        return valid
                  
    def executeFile(self, path):
        if self.executeHook:
            self.executeHook(self, path)
        # else subclass Walker and override executeFile
            
    def execute(self):
        for root, dirs, fileNames in os.walk(self.dir):
            for fileName in fileNames:
                if self.isValidDir(root) and self.isValidFile(fileName):
                    path = os.path.join(root, fileName)
                    self.executeFile(path)
        return self 
    
class ReWalker(Walker):
    def __init__(self, dir, fileMatchRe, executeHook=None, excludeDirs=[]):
        Walker.__init__(self, dir, executeHook, excludeDirs)
        self.fileMatchPat = re.compile(fileMatchRe)

    def isValidFile(self, fileName):
        return self.fileMatchPat.match(fileName)

#######################################################
""" For testing: """
def RenameFile(path, matchRe, subRe):
    dir, name = os.path.split(path)
    newName = re.sub(matchRe, subRe, name)
    if newName != name:
        print("%s -> %s" % (name, newName))
        newPath = os.path.join(dir, newName)
        os.rename(path, newPath)

def Rename1(walker, path):
    RenameFile(path, r"(.*)\.pyc$", r"#\1.pyc#")

def Rename2(walker, path):
    RenameFile(path, r"#(.*)\.pyc#$", r"\1.pyc")

def Test():
    """ renames pyc files to #.*pyc# then restores them back again """
    walker = ReWalker(r"C:\Dev\Copy of PyUtils", r".*\.pyc$", Rename1, [".svn"]).execute()
    walker = ReWalker(r"C:\Dev\Copy of PyUtils", r".*\.pyc#$", Rename2, [".svn"]).execute()


if __name__ == "__main__":
    Test()
    

At one point I found myself having to do frequent housekeeping/utility chores on a large body of source code which included Subversion directories that I obviously didn't want to touch. Over time I accumulated and refined code to make that easy and straightforward. There are many ways to do this, of course. This is my latest version.

I also prefer using regular expressions for file names instead of glob.

The motivation here is DRY (Don't Repeat Yourself). os.walk makes it easy to traverse directories, but I didn't want to keep cutting and pasting that block of code. I wanted to write a function that did what I wanted to one file and just plug that function into a larger call.

6 comments

Andrew Hill 15 years, 9 months ago

Exclude dirs won't work as intended.

I'm fairly new to Python so I'm not 100% sure on this, but it looks like directory exclusion won't work as intended (or at least as I assume it's intended).

If I am excluding ".svn" directories, this will prevent executeFile() from being called on the files in the .svn/ directory itself, but not on files in directories under .svn (which contain, for example, a copy of the currently checked-out version of the repository).

E.g. the following directory structure for hello.py which is stored in a subversion repository:

./code/
./code/hello.py
./code/hello.pyc
./code/.svn/entries
./code/.svn/format
./code/.svn/prop-base/
./code/.svn/prop-base/hello.py.svn-base
./code/.svn/props/
./code/.svn/text-base/
./code/.svn/text-base/hello.py.svn-base
./code/.svn/tmp/

If I were matching the regular expression "^hello\." using a ReWalker and excluding ".svn" dirs:

walker = ReWalker( os.getcwd(), r"^hello\.", doSomething, [".svn"] ).execute()

This would match:

./code/hello.py
./code/hello.pyc
./code/.svn/prop-base/hello.py.svn-base
./code/.svn/text-base/hello.py.svn-base

Presumably to make it work as I assume it was intended, the isValidDir function needs to check more than just the last component of the path (the tail from os.path.split), and instead iterate through each directory (probably ignoring the self.dir prefix in the dir being tested by isValidDir)...

For example:

def isValidDir(self, dir):
    # Remove self.dir prefix (exclusions don't apply to the supplied root dir for walking)
    if dir.startswith( self.dir ):
        dir = dir[len(self.dir):]
    # Check all sub-directories in the path of the file
    # (normalize the separator first so Windows paths split correctly too)
    subdirs = dir.replace(os.sep, '/').split('/')
    for subdir in subdirs:
        if subdir in self.excludeDirs:
            return False
    return True

Though I'm sure there's a better way (more elegant in Python at the very least) to code this...
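
The behaviour described in this comment can be reproduced without touching the file system. The following is only a sketch in Python 3: the function names tail_only_check and any_component_check are made up for illustration, and the paths are taken from the directory listing above.

```python
import os.path

exclude_dirs = [".svn"]

def tail_only_check(dir_path):
    # Mirrors the recipe's isValidDir: only the last path component is tested
    head, tail = os.path.split(dir_path)
    return tail not in exclude_dirs

def any_component_check(dir_path):
    # Tests every component of the path instead of just the last one
    parts = dir_path.replace("\\", "/").split("/")
    return not any(part in exclude_dirs for part in parts)

for d in ["./code", "./code/.svn", "./code/.svn/text-base"]:
    print(d, tail_only_check(d), any_component_check(d))
# ./code True True
# ./code/.svn False False
# ./code/.svn/text-base True False
```

The last line is the problem case: the tail-only check accepts a directory that sits below an excluded one.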

Gui R 15 years, 8 months ago

Nice work Jack, but I have to agree, the excluded directory feature doesn't work. Andrew, your solution could work, but there's actually a simpler way:

os.walk() yields a 3-tuple at each step: the root dir, a list of sub-directories, and a list of files. As the API doc says (http://docs.python.org/lib/os-file-dir.html), it is possible to modify the list of sub-directories in place to limit where os.walk() will descend next (this works in the default top-down mode). So, in the Walker class, first we can get rid of the isValidDir() method, and here's the new execute() method:

def execute(self):
    for root, dirs, fileNames in os.walk(self.dir):
        for exdir in self.excludeDirs:
            if exdir in dirs:
                dirs.remove(exdir)
        for fileName in fileNames:
            path = os.path.join(root, fileName)
            self.executeFile(path)

dar 15 years, 2 months ago

Regarding the new execute method by Guillaume Rava: the modified code leaves out the check for a valid file, which is needed when, for example, a regular expression is used to filter the files.

The revised code should now be:

def execute( self ):
    for root, dirs, fileNames in os.walk( self.dir ):
        for exdir in self.excludeDirs:
            if exdir in dirs:
                dirs.remove( exdir )
        for fileName in fileNames:
            if self.isValidFile( fileName ):
                path = os.path.join( root, fileName )
                self.executeFile( path )
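
Putting the suggestions from the comments so far together, a self-contained sketch might look like the following. It is written for Python 3 and is not the recipe author's code: the class name PruningWalker, the default of None for excludeDirs, and the throwaway directory tree are all assumptions made for illustration. It prunes excluded directories with slice assignment on dirs, keeps the isValidFile check, and returns self from execute() as the original recipe does.

```python
import os
import re
import tempfile

class PruningWalker(object):
    """Hypothetical name: a sketch combining the fixes discussed above."""
    def __init__(self, dir, fileMatchRe=r".*", executeHook=None, excludeDirs=None):
        self.dir = dir
        self.fileMatchPat = re.compile(fileMatchRe)
        self.executeHook = executeHook
        self.excludeDirs = list(excludeDirs or [])

    def isValidFile(self, fileName):
        return self.fileMatchPat.match(fileName) is not None

    def executeFile(self, path):
        if self.executeHook:
            self.executeHook(self, path)

    def execute(self):
        for root, dirs, fileNames in os.walk(self.dir):
            # Slice assignment edits the list os.walk itself holds, so the
            # excluded directories are never descended into (top-down mode only)
            dirs[:] = [d for d in dirs if d not in self.excludeDirs]
            for fileName in fileNames:
                if self.isValidFile(fileName):
                    self.executeFile(os.path.join(root, fileName))
        return self

# Build a throwaway tree shaped like the example in the first comment
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "code", ".svn", "text-base"))
for rel in [("code", "hello.py"),
            ("code", ".svn", "entries"),
            ("code", ".svn", "text-base", "hello.py.svn-base")]:
    open(os.path.join(top, *rel), "w").close()

seen = []
hook = lambda walker, path: seen.append(os.path.relpath(path, top))
PruningWalker(top, r"^hello\.", hook, [".svn"]).execute()
print(seen)  # only hello.py directly under code/ is reported
```

Because the pruning happens before os.walk descends, nothing under .svn is visited at all, so no per-file directory check is needed.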

itwasntme 14 years, 9 months ago

Hm, it would be better to use os.listdir:

def listdir( self, dirname ):
  for name in os.listdir( dirname ):
    # os.listdir returns bare names, so build the full path before testing it
    path = os.path.join( dirname, name )
    if os.path.isdir( path ):
      if self.isValidDir( name ):
        self.listdir( path )
    else:
      if self.isValidFile( name ):
        self.executeFile( path )

def execute( self ):
  self.listdir( self.dir )

itwasntme 14 years, 9 months ago

...and isValidDir should be changed if you go this way:

    def isValidDir( self, dirname ):
      return dirname not in self.excludeDirs

albert kao 13 years, 11 months ago

How can I use your functions to walk a directory and ignore all files or directories whose names begin with '.' (e.g. '.svn')? I added the following code but it has bugs. Please help. Thanks.

""" For testing: """
def ProcessFile(walker, path):
    print("walker " + walker + " path " + path)

if __name__ == "__main__":
    walker = ReWalker(r"C:\test\com.comp.hw.prod.proj.war\bin", r".*", ProcessFile, ["."]).execute()

C:\python>ReWalker.py
Traceback (most recent call last):
  File "C:\python\ReWalker.py", line 80, in <module>
    walker = ReWalker(r"C:\test\com.comp.hw.prod.proj.war\bin", r".*", ProcessFile, ["."]).execute()
AttributeError: 'ReWalker' object has no attribute 'execute'
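
A few observations on the question above, offered tentatively. The recipe compares excludeDirs entries against whole directory names, so ["."] would only exclude a directory literally named ".", not every name that starts with a dot. Concatenating "walker " + walker would also raise a TypeError once the hook runs, since walker is an object rather than a string. The AttributeError itself suggests that execute is not defined inside the class in that copy of the file, possibly because of indentation when pasting in a replacement method, though that cannot be confirmed from the traceback alone.

Below is a minimal Python 3 sketch of prefix-based skipping. The function name walk_skipping_hidden and the sample file names are made up for illustration. It uses os.walk directly rather than the recipe's classes.

```python
import os
import tempfile

def walk_skipping_hidden(top):
    """Yield paths of files under top, skipping names that begin with '.'."""
    for root, dirs, file_names in os.walk(top):
        # Prune dot-directories in place so os.walk never enters them
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for file_name in file_names:
            if not file_name.startswith("."):
                yield os.path.join(root, file_name)

# Throwaway tree with made-up file names
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "bin", ".svn"))
for rel in [("bin", "app.class"), ("bin", ".hidden"), ("bin", ".svn", "entries")]:
    open(os.path.join(top, *rel), "w").close()

found = sorted(os.path.relpath(p, top) for p in walk_skipping_hidden(top))
for path in found:
    # Format the path instead of concatenating unrelated types
    print("path %s" % path)
```

The same prefix test could be moved into isValidFile and into the directory-pruning step of a Walker subclass if the hook-based interface is preferred.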