This recipe presents a general purpose file object iterator cum file object proxy class. It provides a class that gives several iterator functions to read a text file by characters, words, lines, paragraphs or blocks. It also acts as a proxy for the wrapped file object.
import re

class FileIterator(object):
    """ A general purpose file object iterator cum
    file object proxy """

    def __init__(self, fw):
        self._fw = fw

    # Attribute proxy for wrapped file object
    def __getattr__(self, name):
        try:
            return self.__dict__[name]
        except KeyError:
            if hasattr(self._fw, name):
                return getattr(self._fw, name)

        return None

    def readlines(self):
        """ Line iterator """

        for line in self._fw:
            yield line

    def readwords(self):
        """ Word iterator. Newlines are omitted """

        # 'Words' are defined as those things
        # separated by whitespace.
        wspacere = re.compile(r'\s+')

        for line in self._fw:
            words = wspacere.split(line)
            for w in words:
                yield w

    def readchars(self):
        """ Character iterator """

        for c in self._fw.read():
            yield c

    def readblocks(self, block_size):
        """ Block iterator """

        while True:
            block = self._fw.read(block_size)
            if block=='':
                break
            yield block

    def readparagraphs(self):
        """ Paragraph iterator """

        # This re-uses Alex Martelli's
        # paragraph reading recipe.
        # Python Cookbook 2nd edition 19.10, Page 713
        paragraph = []
        for line in self._fw:
            if line.isspace():
                if paragraph:
                    yield "".join(paragraph)
                    paragraph = []
            else:
                paragraph.append(line)
        if paragraph:
            yield "".join(paragraph)

if __name__=="__main__":
    def dosomething(item):
        print item,

    try:
        fw = open("myfile.txt")
        iter = FileIterator(fw)
        for item in iter.readlines():
            dosomething(item)

        # Rewind - method will be
        # proxied to wrapped file object
        iter.seek(0)
        for item in iter.readblocks(100):
            dosomething(item)

        # Seek to a different position
        pos = 200
        iter.seek(pos)
        for item in iter.readwords():
            dosomething(item)
        iter.close()
    except (OSError, IOError), e:
        print e
The idea for this recipe came from Alex Martelli's recipe that supplies a generator to read a text file by paragraph (Recipe 19.10, Python Cookbook 2nd Edition, Page 713). I thought it would be nice to have a single class that provides different methods to read a file - by character, word, line, paragraph and custom-size blocks.
One way to do it is by subtyping the "file" type and implementing the methods in the new type. This recipe instead uses the aggregator cum proxy design pattern. It aggregates an open file object and defines iterator methods on top of it. Unresolved methods are proxied to the wrapped file object, so you can use the iterator object to perform operations on the file object directly, as shown in the examples.
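The aggregate-and-proxy idea can be sketched in a few lines. This is only an illustration, written for Python 3 rather than the Python 2 of the recipe: the class name `BlockProxy` is hypothetical, `io.StringIO` stands in for an open text file, and `__getattr__` is simplified to a plain forward.

```python
import io

class BlockProxy:
    """Hypothetical wrapper: adds one iterator method, forwards the rest."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so
        # methods defined on BlockProxy win over the wrapped object's.
        return getattr(self._wrapped, name)

    def readblocks(self, block_size):
        """Yield successive chunks of at most block_size characters."""
        while True:
            block = self._wrapped.read(block_size)
            if not block:
                break
            yield block

f = BlockProxy(io.StringIO("abcdefgh"))
blocks = list(f.readblocks(3))  # defined on the proxy itself
f.seek(0)                       # not defined on the proxy: forwarded
first = f.read(2)               # also forwarded to the StringIO object
```

Here `seek()` and `read()` are never written on the wrapper, yet they work because the lookup falls through to the aggregated object.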
Better versions... The consistency (and efficiency in some cases) of your version is lacking. The below fixes those cases.
readwords. Why not add wspacere as a class member, so that the regex isn't recompiled on each method invocation? Or does Python optimize that?
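A rough sketch of that suggestion, in Python 3 and with `io.StringIO` standing in for a file. The class name `WordReader` and the attribute name `_wspace_re` are made up for illustration. Unlike the recipe, this sketch also skips the empty strings that `split()` produces at the edges of a line. As far as the standard library goes, the `re` module keeps an internal cache of recently compiled patterns, so the saving is probably small, but a class attribute makes the "compile once" intent explicit.

```python
import io
import re

class WordReader:
    # Compiled once, when the class body runs, and shared by all instances.
    _wspace_re = re.compile(r'\s+')

    def __init__(self, fw):
        self._fw = fw

    def readwords(self):
        """Yield whitespace-separated words, one at a time."""
        for line in self._fw:
            for w in self._wspace_re.split(line):
                if w:  # assumption: drop the empty strings split() leaves
                    yield w

words = list(WordReader(io.StringIO("one two\n  three\n")).readwords())
```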
bogus __getattr__. 1. If self.__dict__ contains name, __getattr__ won't get called
This version addresses both issues:
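The commenter's code is not reproduced here. As a hedged illustration only, the sketch below (Python 3, hypothetical class names) assumes the second issue is that the recipe returns None for attributes the file does not have, and contrasts that with simply letting `getattr` raise `AttributeError`.

```python
import io

class LenientProxy:
    """Mirrors the recipe's behaviour: unknown names come back as None."""

    def __init__(self, fw):
        self._fw = fw

    def __getattr__(self, name):
        if hasattr(self._fw, name):
            return getattr(self._fw, name)
        return None

class StrictProxy:
    """Forwards the lookup and lets a missing name fail loudly."""

    def __init__(self, fw):
        self._fw = fw

    def __getattr__(self, name):
        # No self.__dict__ check: __getattr__ runs only after normal
        # lookup has already failed. getattr raises AttributeError
        # by itself if the wrapped file lacks the name.
        return getattr(self._fw, name)

lenient = LenientProxy(io.StringIO("text"))
strict = StrictProxy(io.StringIO("text"))

# With the lenient version every name appears to exist.
lenient_has = hasattr(lenient, "no_such_method")
strict_has = hasattr(strict, "no_such_method")
content = strict.read()  # real attributes are still forwarded
```

With the lenient variant, a misspelled method name surfaces later as "'NoneType' object is not callable" rather than as an attribute error at the point of the mistake.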
Re: Better versions... Given that a file object is its own iterator, that iter(f) returns f (unless f is closed), and that a file object iterator's next() method returns the next input line, the readlines() method can be simplified a little bit further, to just:
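The simplified method itself is not reproduced here. One guess at what it could look like, written for Python 3 with `io.StringIO` standing in for a file, is the following; it is a sketch under that assumption, not the commenter's code.

```python
import io

class FileIterator:
    def __init__(self, fw):
        self._fw = fw

    def readlines(self):
        # The file object is its own line iterator, so hand it back
        # directly instead of looping and re-yielding each line.
        return iter(self._fw)

buf = io.StringIO("first\nsecond\n")
same_object = iter(buf) is buf          # the file is its own iterator
lines = list(FileIterator(buf).readlines())
```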
readlines. SOME SUBTLETIES NOTED:
The behavior of the readlines() method of this file-object proxy/iterator class differs from the one with the same name in a standard file object, which reads until EOF and returns a list containing all the lines thus read.
Since the class effectively hides the conventional method in the wrapped file-object, it's not -- pedantically -- a true proxy class...
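The difference is easy to see side by side. The snippet below is a small Python 3 illustration using `io.StringIO` in place of a real file; `readlines_lazy` is a hypothetical stand-in for the recipe's generator-style method.

```python
import io

def readlines_lazy(fw):
    """Generator in the style of the recipe's readlines()."""
    for line in fw:
        yield line

text = "a\nb\nc\n"

eager = io.StringIO(text).readlines()    # standard method: a list, read to EOF
lazy = readlines_lazy(io.StringIO(text)) # recipe style: nothing read yet

first = next(lazy)   # lines are produced one at a time, on demand
rest = list(lazy)    # the remaining lines
```

Code that calls `len()` on the result, or indexes into it, works with the standard method but fails with the generator version, which is why the name clash matters for a class that otherwise behaves as a proxy.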
Re: Re: Better versions... On further reflection, the best and simplest thing to do would be to just leave out the FileIterator class's readlines() method. It doesn't add any desired functionality, and leaving it out will allow/cause the underlying file's readlines() to be used (given the definition of the __getattr__() method).
Doing this will also make the class a true proxy class (see my other earlier comment about subtleties below).