
This recipe presents a general-purpose file object iterator that doubles as a file object proxy. The class provides several iterator methods for reading a text file by characters, words, lines, paragraphs, or blocks, and it forwards any other attribute access to the wrapped file object.

Python, 94 lines
import re

class FileIterator(object):
    """ A general purpose file object iterator cum
    file object proxy """
    
    def __init__(self, fw):
        self._fw = fw

    # Attribute proxy for wrapped file object
    def __getattr__(self, name):
        try:
            return self.__dict__[name]
        except KeyError:
            if hasattr(self._fw, name):
                return getattr(self._fw, name)

        return None
        
    def readlines(self):
        """ Line iterator """

        for line in self._fw:
            yield line
                
    def readwords(self):
        """ Word iterator. Newlines are omitted """
        
        # 'Words' are defined as those things
        # separated by whitespace.
        wspacere = re.compile(r'\s+')
        for line in self._fw:
            words = wspacere.split(line)
            for w in words:
                yield w

    def readchars(self):
        """ Character iterator """
        
        for c in self._fw.read():
            yield c

    def readblocks(self, block_size):
        """ Block iterator """

        while True:
            block = self._fw.read(block_size)
            if not block:  # empty read means EOF (text or binary mode)
                break
            yield block
        
    def readparagraphs(self):
        """ Paragraph iterator """

        # This re-uses Alex Martelli's
        # paragraph reading recipe.
        # Python Cookbook 2nd edition 19.10, Page 713
        paragraph = []
        for line in self._fw:
            if line.isspace():
                if paragraph:
                    yield "".join(paragraph)
                    paragraph = []
            else:
                paragraph.append(line)
        if paragraph:
            yield "".join(paragraph)
        
if __name__=="__main__":
    
    def dosomething(item):
        # Python 3 form of the Python 2 statement "print item,"
        print(item, end=' ')
        
    try:
        fw = open("myfile.txt")
        # Named fiter rather than iter so the built-in iter() stays usable
        fiter = FileIterator(fw)
        for item in fiter.readlines():
            dosomething(item)
            
        # Rewind - method will be
        # proxied to wrapped file object
        fiter.seek(0)
        for item in fiter.readblocks(100):
            dosomething(item)

        # Seek to a different position
        pos = 200
        fiter.seek(pos)
        for item in fiter.readwords():
            dosomething(item)

        fiter.close()
    except (OSError, IOError) as e:
        print(e)
    

The idea for this recipe came from Alex Martelli's recipe that supplies a generator to read a text file by paragraph (Recipe 19.10, Python Cookbook 2nd Edition, Page 713). I thought it would be nice to have a single class that provides different methods to read a file - by character, word, line, paragraph and custom-size blocks.

One way to do this is to subtype the "file" type and implement the methods in the new type. This recipe instead uses the aggregator cum proxy design pattern: it aggregates an open file object and defines iterator methods on top of it. Unresolved methods are proxied to the wrapped file object, so you can use the iterator object to perform operations on the file object directly, as shown in the examples.
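The delegation described above can be tried without a file on disk. The following is a minimal sketch written for Python 3 (the recipe itself uses Python 2 syntax). `FileProxy` and the sample text are illustrative names that are not part of the recipe, and `io.StringIO` stands in for an open text file.

```python
import io


class FileProxy:
    """Hypothetical, cut-down stand-in for the recipe's FileIterator."""

    def __init__(self, fw):
        self._fw = fw

    def __getattr__(self, name):
        # Called only when normal lookup fails, so anything the proxy
        # does not define itself is looked up on the wrapped object.
        return getattr(self._fw, name)

    def readparagraphs(self):
        """Yield runs of non-blank lines, as in the recipe."""
        paragraph = []
        for line in self._fw:
            if line.isspace():
                if paragraph:
                    yield "".join(paragraph)
                    paragraph = []
            else:
                paragraph.append(line)
        if paragraph:
            yield "".join(paragraph)


proxy = FileProxy(io.StringIO("one\ntwo\n\nthree\n"))
paragraphs = list(proxy.readparagraphs())   # method defined on the proxy
proxy.seek(0)                               # forwarded to the StringIO
first_line = proxy.readline()               # forwarded as well
print(paragraphs, repr(first_line))
```

Here `seek` and `readline` are never defined on the proxy; they reach the wrapped object only through `__getattr__`.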

6 comments

Josiah Carlson 18 years, 11 months ago

Better versions... The consistency (and, in some cases, the efficiency) of your version is lacking. The versions below fix those cases.

def readlines(self):
    """ Line iterator """
    return iter(self._fw)

def readwords(self):
    """ Word iterator. Newlines are omitted """

    # 'Words' are defined as those things
    # separated by whitespace.
    for line in self._fw:
        for w in line.split():
            yield w

def readchars(self, linebuff=True):
    """ Character iterator """

    # By default, handle a line-at-a-time, like
    # everything else in this class does, otherwise
    # only byte-at-a-time.

    if linebuff:
        for line in self._fw:
            for ch in line:
                yield ch
    else:
        fr = self._fw.read
        a = 1
        while a:
            a = fr(1)
            if a:
                yield a
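One concrete difference between the two `readwords` versions: splitting with the regular expression `\s+` leaves an empty string wherever a line begins or ends with whitespace (including the trailing newline), whereas `str.split()` with no arguments discards them. A small Python 3 check, using a made-up sample line:

```python
import re

line = "  hello world\n"   # sample input, not from the recipe

# The recipe's approach: split on runs of whitespace with a regex.
regex_words = re.compile(r"\s+").split(line)

# The approach in the comment above: plain str.split() with no arguments.
plain_words = line.split()

print(regex_words)
print(plain_words)
```

With the regex version, a caller receives empty "words" at line boundaries unless they are filtered out.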

Christopher Smith 18 years, 11 months ago

readwords. Why not make wspacere a class member, so that the regex isn't recompiled on each method invocation? Or does Python optimize that?
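On the question above: the `re` module keeps an internal cache of recently compiled patterns, so calling `re.compile` again with the same pattern is normally a cache lookup rather than a full recompilation, though the lookup itself still has a small cost. Compiling once at class level sidesteps the question. A minimal Python 3 sketch, where `WordReader` and `_wspace_re` are hypothetical names and `io.StringIO` stands in for a file:

```python
import io
import re


class WordReader:
    # Compiled once, when the class body is executed.
    _wspace_re = re.compile(r"\s+")

    def __init__(self, fw):
        self._fw = fw

    def readwords(self):
        for line in self._fw:
            for w in self._wspace_re.split(line):
                if w:           # skip the empty strings re.split can produce
                    yield w


words = list(WordReader(io.StringIO("a b\n c\n")).readwords())
print(words)
```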

Just van Rossum 18 years, 11 months ago

bogus __getattr__.

  1. If self.__dict__ contains name, __getattr__ won't get called.
  2. If the named attribute is not found, you MUST raise AttributeError.

This version addresses both issues:

def __getattr__(self, name):
    return getattr(self._fw, name)
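The practical effect of the second point can be seen by comparing the two fallbacks side by side. This is a Python 3 sketch under stated assumptions: `NoneProxy` and `RaisingProxy` are hypothetical cut-down classes, and `no_such_method` is a deliberately bogus attribute name.

```python
import io


class NoneProxy:
    """Mimics the recipe's original fallback: unknown names give None."""

    def __init__(self, fw):
        self._fw = fw

    def __getattr__(self, name):
        if hasattr(self._fw, name):
            return getattr(self._fw, name)
        return None


class RaisingProxy:
    """Uses the one-line delegation suggested in the comment."""

    def __init__(self, fw):
        self._fw = fw

    def __getattr__(self, name):
        # getattr raises AttributeError itself if the name is missing.
        return getattr(self._fw, name)


loose = NoneProxy(io.StringIO("text"))
strict = RaisingProxy(io.StringIO("text"))

# hasattr() relies on AttributeError to report a missing attribute.
loose_has = hasattr(loose, "no_such_method")
strict_has = hasattr(strict, "no_such_method")

try:
    strict.no_such_method
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(loose_has, strict_has, error_name)
```

Because the first class never raises, `hasattr` reports every name as present, and a misspelled method call fails later with a confusing "NoneType is not callable" style error instead of an `AttributeError`.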

Martin Miller 18 years, 11 months ago

Re: Better versions... Given that a file object is its own iterator, that iter(f) returns f (unless f is closed), and that a file object iterator's next() method returns the next input line, the readlines() method can be simplified a little further, to just:

def readlines(self):
    """ Line iterator """
    return self._fw
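The claim that a file object is its own iterator is easy to check. The sketch below is for Python 3, where the built-in `next(f)` replaces the Python 2 `f.next()` method mentioned above, and it uses `io.StringIO` as an in-memory stand-in for an open file.

```python
import io

f = io.StringIO("first\nsecond\n")   # in-memory stand-in for an open file

same_object = iter(f) is f   # iter() hands back the file object itself
first = next(f)              # one step of iteration reads one line
rest = list(f)               # iteration continues from the current position

print(same_object, repr(first), rest)
```

One consequence worth keeping in mind: since the returned iterator is the file itself, it shares the file position with every other read on that object.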

Martin Miller 18 years, 11 months ago

readlines. SOME SUBTLETIES NOTED:

The behavior of the readlines() method of this file-object proxy/iterator class differs from the one with the same name in a standard file object, which reads until EOF and returns a list containing all the lines thus read.

Since the class effectively hides the conventional method in the wrapped file-object, it's not -- pedantically -- a true proxy class...
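The difference is visible in the type of the returned value. In this Python 3 sketch, `LineProxy` is a hypothetical cut-down class that keeps only the recipe's `readlines()`, and `io.StringIO` stands in for a file:

```python
import io


class LineProxy:
    """Hypothetical cut-down proxy with the recipe's readlines()."""

    def __init__(self, fw):
        self._fw = fw

    def readlines(self):
        for line in self._fw:
            yield line


text = "a\nb\n"
from_file = io.StringIO(text).readlines()               # standard method
from_proxy = LineProxy(io.StringIO(text)).readlines()   # recipe's method

print(type(from_file).__name__, type(from_proxy).__name__)
```

Code that calls `len()` on the result, indexes it, or iterates over it twice works with the standard method but not with the generator returned by the proxy.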

Martin Miller 18 years, 6 months ago

Re: Re: Better versions... On further reflection, the best and simplest thing to do would be to just leave out the FileIterator class's readlines() method. It doesn't add any desired functionality, and leaving it out will cause the underlying file's readlines() to be used (given the definition of the __getattr__() method).

Doing this will also make the class a true proxy class (see my earlier comment about subtleties).

Created by Anand on Wed, 6 Apr 2005 (PSF)