Some tasks require reading from a set of files in no predictable order. Opening and reading files is time-consuming even when the operating system caches their contents in memory. Caching files explicitly can speed up processing greatly, especially when the cached form is optimised for the likely access patterns.

Python, 19 lines
class FileCache:
	'''Caches the contents of a set of files.
	Avoids reading files repeatedly from disk by holding onto the
	contents of each file as a list of strings.
	'''

	def __init__(self):
		# Maps each filename to the list of lines read from it.
		self.filecache = {}

	def grabFile(self, filename):
		'''Return the contents of a file as a list of strings.
		New line characters are removed.
		'''
		if filename not in self.filecache:
			# First request for this file: read it once and keep the lines.
			with open(filename, "r") as f:
				self.filecache[filename] = f.read().split('\n')
		return self.filecache[filename]

To produce contextual help in the form of calltips in the SciTE editor, the output of ctags is correlated with the contents of the header files that ctags indexed. The ctags output is alphabetised by identifier rather than grouped by the header file in which each identifier is defined, so processing it line by line, opening and reading the relevant header file each time to find the identifier's context, is very time-consuming. Caching each header file when it is first read makes this task much faster.

Since the ctags output includes the line number on which an identifier is defined as well as the name of the header file, storing the file as a list of lines allows rapid access to the definition line.
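The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the ctags output is taken to be already parsed into (identifier, filename, line number) tuples with 1-based line numbers, and the tag entries, the sample header and the definition_line helper are all hypothetical. A compact copy of the cache class is included so the sketch runs on its own.

```python
import os
import tempfile


class FileCache:
    """Compact copy of the recipe's cache, repeated so the sketch runs alone."""

    def __init__(self):
        self.filecache = {}

    def grabFile(self, filename):
        if filename not in self.filecache:
            with open(filename, "r") as f:
                self.filecache[filename] = f.read().split("\n")
        return self.filecache[filename]


def definition_line(cache, filename, lineno):
    """Hypothetical helper: return line `lineno` (assumed 1-based) of `filename`."""
    return cache.grabFile(filename)[lineno - 1]


# Hypothetical header file standing in for a real one.
header_path = os.path.join(tempfile.mkdtemp(), "sample.h")
with open(header_path, "w") as f:
    f.write("int Alpha(int a);\nvoid Beta(void);\nchar *Gamma(char *s);\n")

# Hypothetical, already-parsed tag entries: (identifier, filename, line number).
tags = [
    ("Alpha", header_path, 1),
    ("Beta", header_path, 2),
    ("Gamma", header_path, 3),
]

cache = FileCache()
definitions = {
    name: definition_line(cache, path, lineno) for name, path, lineno in tags
}
print(definitions["Beta"])
```

However many identifiers point into the same header, the file is opened only once; every later lookup is a dictionary access followed by a list index.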

The speed-up is immense: about a factor of 100. Processing the subset of Win32 headers delivered with mingw gcc takes 3.3 seconds with the cache versus 330 seconds without it. That subset is 1.8 MB; the full set of Win32 headers from the Microsoft Platform SDK is now 42 MB, so, if the time scales roughly linearly with size, it would take over 2 hours to process without the cache. The set of headers commonly installed on a Linux development machine is of the same order of magnitude.

Small speed-ups can often be achieved by looking closely at the file access code. An earlier version of this recipe used readlines to read each file, but reading the whole file and splitting it on newline characters is 10% faster. The xreadlines method available in Python 2.1 may be faster in some circumstances, but it performed the same as readlines in my testing. The mmap module could also enhance performance, but it did not for this code; it is a better fit where read/write access is required or the files are accessed randomly at the byte level rather than the line level.
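For anyone wanting to repeat the comparison, a minimal timing sketch might look like the one below. It assumes a modern Python with the timeit module, and the sample file and both function names are made up for illustration. Timings depend heavily on the Python version, file sizes and platform, so the 10% figure above should not be expected to reproduce exactly.

```python
import os
import tempfile
import timeit

# Hypothetical sample file; real results depend on file size and platform.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("".join("line %d\n" % i for i in range(10000)))


def with_readlines(filename):
    """Read with readlines; each string keeps its trailing newline."""
    with open(filename, "r") as f:
        return f.readlines()


def with_read_split(filename):
    """Read the whole file at once, then split on newline characters."""
    with open(filename, "r") as f:
        return f.read().split("\n")


for func in (with_readlines, with_read_split):
    seconds = timeit.timeit(lambda: func(path), number=50)
    print("%s: %.4f s for 50 reads" % (func.__name__, seconds))
```

The two approaches do not return identical lists: readlines keeps the newline on each line, while splitting on newlines removes them and leaves a final empty string when the file ends with a newline.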

This same class can be used for many similar file processing tasks.

1 comment

Neil Hodgson (author) 22 years, 12 months ago

Read the standard library thoroughly. This recipe implements the same feature as the standard library's linecache module, which shows the importance of reading Python's documentation: the library contains solutions to many common problems.
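As a rough illustration of the linecache alternative, the sketch below fetches a single line by number. The sample header file is hypothetical; linecache.getline and linecache.getlines are the standard library functions.

```python
import linecache
import os
import tempfile

# Hypothetical header file standing in for a real one.
path = os.path.join(tempfile.mkdtemp(), "sample.h")
with open(path, "w") as f:
    f.write("int Alpha(int a);\nvoid Beta(void);\n")

# getline takes a 1-based line number and keeps the trailing newline.
second = linecache.getline(path, 2)
print(repr(second))

# A line number past the end gives an empty string instead of an exception.
missing = linecache.getline(path, 99)

# getlines returns the whole file as a list of lines.
all_lines = linecache.getlines(path)
```

Unlike grabFile in the recipe, linecache keeps the newline characters on each line, so callers that want bare text need to strip them.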

Created by Neil Hodgson on Sat, 31 Mar 2001 (PSF)