Some file processing tasks are quite time-consuming, especially when COM is involved. Repeating them unnecessarily is then unbearable. Here is a module that helps avoid such repetitions.
Python, 177 lines
The fcache module avoids unnecessary file processing operations (such as parsing a file). It assumes that such operations are implemented as functions or methods with the following signature: <pre>f(filename1, filename2, ...) -> object</pre> or <pre>f(*filenames) -> object</pre> The basic assumption is that the operation is time-invariant (stateless): the result of f depends only on the input (source) files, not on any kind of internal state.
The mechanism used is caching. Each result is stored (pickled) in a file. The file name is derived from the function/method name and the arguments. If any of the files denoted by 'filenames' is newer than the cache file, or the cache file does not exist, the function is called and the result is stored. Otherwise, the result is recycled from the cache without invoking the function. (The approach partly mimics the strategy of build tools like make or ant.)
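The time stamp comparison described above can be sketched in a few lines. This is a minimal illustration, not the recipe's actual code: the names is_stale and cached_call are made up for this example, and the cache file path is passed in explicitly rather than derived from the function and its arguments.

```python
import os
import pickle


def is_stale(cache_path, filenames):
    """Return True if the cache file is missing or older than any source file."""
    if not os.path.exists(cache_path):
        return True
    cache_mtime = os.path.getmtime(cache_path)
    return any(os.path.getmtime(name) > cache_mtime for name in filenames)


def cached_call(func, cache_path, *filenames):
    """Call func(*filenames) only when the cached result is out of date."""
    if is_stale(cache_path, filenames):
        result = func(*filenames)
        # Store the fresh result so that the next call can skip func.
        with open(cache_path, "wb") as fh:
            pickle.dump(result, fh)
        return result
    # Cache is up to date: recycle the pickled result.
    with open(cache_path, "rb") as fh:
        return pickle.load(fh)
```

As with make, a source file that carries exactly the same time stamp as the cache file is treated as up to date here; only a strictly newer source triggers a new call.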
This, of course, only makes sense where function execution is significantly slower than the unpickling process itself. That is usually the case with:
- complex text file parsing, involving for example many regular expression operations
- processing of files in proprietary formats using COM (COM introduces a run-time overhead that is especially striking when many inter-process COM invocations are made, as when processing MS Office files)
- ...
This recipe has been used to speed up an automated testing process that involved analyzing office documents via COM to retrieve data for the tests.
About the implementation:
The class CacheManager is responsible for wrapping functions in cache handling code. This is what its primary method, wrap, does. It also keeps track of the created cache files internally, so that they can be deleted via deleteCacheFiles.
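To give a feel for the division of labour, here is a rough sketch of such a manager. It is an assumption-laden stand-in, not the recipe's CacheManager: the class name SketchCacheManager, the constructor parameters and the simplified file naming (function name plus an MD5 digest) are invented for illustration; only the method names wrap and deleteCacheFiles are taken from the description above.

```python
import functools
import hashlib
import os
import pickle


class SketchCacheManager:
    """Hypothetical stand-in for the recipe's CacheManager."""

    def __init__(self, folder, extension=".cache"):
        self.folder = folder
        self.extension = extension
        self.created = set()  # cache files written through this manager

    def _cache_path(self, func, filenames):
        # Simplified naming (assumption): function name plus a digest of the arguments.
        key = hashlib.md5(repr(filenames).encode("utf-8")).hexdigest()
        return os.path.join(self.folder, func.__name__ + "." + key + self.extension)

    def wrap(self, func):
        """Return a version of func that consults the cache before running."""
        @functools.wraps(func)
        def wrapper(*filenames):
            path = self._cache_path(func, filenames)
            # Fresh means: cache file exists and no source file is newer than it.
            fresh = os.path.exists(path) and all(
                os.path.getmtime(name) <= os.path.getmtime(path)
                for name in filenames
            )
            if fresh:
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = func(*filenames)
            os.makedirs(self.folder, exist_ok=True)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            self.created.add(path)
            return result
        return wrapper

    def deleteCacheFiles(self):
        """Remove every cache file this manager has written."""
        for path in self.created:
            if os.path.exists(path):
                os.remove(path)
        self.created.clear()
```

With this sketch, usage would look like parse = manager.wrap(parse), after which parse(filename) only does real work when the cache is missing or out of date.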
Upon instantiation, a cache file folder, a cache file name extension and an encoding function other than the defaults may be specified.
The default cache folder is the subfolder "@cache" of the directory where the fcache.py module resides. This way, when the module is distributed with different applications (which usually go into different folders), each application automatically gets a cache folder of its own.
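One way to compute such a default is shown below. The helper name default_cache_folder is hypothetical, and passing the module file in as an argument (instead of reading __file__ directly) is a choice made only to keep the example easy to try out.

```python
import os


def default_cache_folder(module_file):
    """Return the "@cache" subfolder next to the given module file."""
    module_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(module_dir, "@cache")
```

Inside a module, the call would typically be default_cache_folder(__file__).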
The default cache file extension is ".cache" (not very inventive, I know).
The encoding function is used to produce a unique key from the function arguments, and defaults to fcache.hashhex. There is a second encoding function, fcache.md2hex, which is slower but has a much lower probability of generating the same key for different arguments.
The key is used to produce a unique cache file name for every combination of a function and its arguments. The cache file name is generated as follows: First, all source file names are converted to absolute paths via sources = map(os.path.abspath, sources).
Then the encoding function is invoked with repr(sources), and the generated key is prefixed with the function's qualified identifier, which is ModuleName.FunctionName for functions and ModuleName.ClassName.MethodName for methods. (In the future, this could be extended to include package names as well.)
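The naming steps can be sketched roughly as follows. Several details are assumptions made for this example rather than facts about fcache: the helper names qualified_name, cache_file_name and md5_hex, the use of hashlib.md5 as the encoding function, the "." separator between identifier and key, and the use of __qualname__ (which needs Python 3).

```python
import hashlib
import os


def qualified_name(func):
    """ModuleName.FunctionName, or ModuleName.ClassName.MethodName for methods."""
    # Fallback name (assumption) for callables that carry no module information.
    module = getattr(func, "__module__", None) or "unknown_module"
    return module + "." + func.__qualname__


def md5_hex(text):
    """Example encoding function: hex MD5 digest of the given string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def cache_file_name(func, sources, encode=md5_hex, extension=".cache"):
    """Build a cache file name from a function and its source file arguments."""
    # Absolute paths make the key independent of the current working directory.
    sources = list(map(os.path.abspath, sources))
    key = encode(repr(sources))
    return qualified_name(func) + "." + key + extension
```

Because the paths are made absolute first, calling the same function with "a.txt" and with the absolute path of that file yields the same cache file name.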
A word of caution:
The module only compares the time stamps of the source files against that of the cache file to determine whether a cached result is out of date. However, a saved result may also become invalid when the implementation itself changes. While it would be easy to check whether the module defining the operation is newer than the cache file, the result may also depend on arbitrary other modules. A rigorous check would require a complete module dependency analysis, which is beyond the scope of this recipe. Without it, my advice is to clear the cache repository (delete all files) after EVERY code change.
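Clearing the repository can be automated with a small helper, for example as part of a build or test script. The sketch below is not part of fcache; the function name clear_cache_folder is made up, and it is deliberately a bit more conservative than "delete all files" in that it only removes files carrying the cache extension.

```python
import os


def clear_cache_folder(folder, extension=".cache"):
    """Delete all cache files in folder and return the list of removed paths."""
    removed = []
    if not os.path.isdir(folder):
        return removed  # nothing cached yet
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        # Only touch regular files with the cache extension.
        if name.endswith(extension) and os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```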
My current feature wish list (feel free to extend it):
Cheers and happy caching!
See also the recipe "Memoizing (cacheing) function return values" by Paul Moore, and especially the comment on closures by Hannu Kankaanpää: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52201