
Wrap a file handle to allow seeks back to the beginning

Sometimes data coming from a socket or other input file handle isn't what it was supposed to be. For example, suppose you are reading from a buggy server which is supposed to return an XML stream but can also return an unformatted error message. (This often happens because the server doesn't handle incorrect input very well.)

A ReseekFile helps solve this problem. It wraps the original input stream and adds a buffer. Read requests to the ReseekFile are forwarded to the input stream, and the data read is appended to the buffer and then returned to the caller. The buffer therefore contains all the data read so far.

The ReseekFile can be told to reseek to the start position. The next read request will come from the buffer, until the buffer has been read, in which case it gets the data from the input stream. This newly read data is also appended to the buffer.

When buffering is no longer needed, use the 'nobuffer()' method. This tells the ReseekFile that once it has read from the buffer it should throw the buffer away. After nobuffer is called, the behaviour of 'seek' is no longer defined.
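The read, reseek, and nobuffer behaviour described above can be sketched in a few lines. This is a simplified Python 3 illustration, not the recipe's class: the names OneWayStream, ReplaySketch, and rewind are made up for the example, and read() here assumes a positive size.

```python
import io


class OneWayStream:
    """Hypothetical stand-in for a socket-like input: read() only, no seek()."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def read(self, size=-1):
        return self._inner.read(size)


class ReplaySketch:
    """Simplified illustration of the buffering idea; not the recipe's class."""

    def __init__(self, stream):
        self.stream = stream
        self.buffer = io.BytesIO()  # holds everything read so far
        self.buffering = True

    def read(self, size):
        # Serve bytes from the buffer first, then from the wrapped stream.
        data = b"" if self.buffer is None else self.buffer.read(size)
        if len(data) < size:
            fresh = self.stream.read(size - len(data))
            if self.buffering:
                # The buffer position is at its end here, so this appends.
                self.buffer.write(fresh)
            else:
                # Buffer fully replayed and no longer wanted: drop it.
                self.buffer = None
            data += fresh
        return data

    def rewind(self):
        # Only valid while the buffer still exists.
        self.buffer.seek(0)

    def nobuffer(self):
        self.buffering = False


reader = ReplaySketch(OneWayStream(b"ERROR: cannot do that\n"))
head = reader.read(6)        # comes from the stream and is buffered
reader.rewind()
reader.nobuffer()
replayed = reader.read(10)   # 6 bytes from the buffer, 4 new bytes from the stream
rest = reader.read(100)      # buffer is gone; read straight from the stream
```

In this sketch the buffer is dropped on the first read that runs past its end after nobuffer() has been called, which is why a later rewind() would fail, matching the note that 'seek' is no longer defined at that point.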

For example, suppose the server described above returns either an error message of the form:

  ERROR: cannot do that

or an XML data stream, starting with "

   import urllib2
   import ReseekFile

   infile = urllib2.urlopen("http://somewhere/")
   infile = ReseekFile.ReseekFile(infile)
   s = infile.readline()
   if s.startswith("ERROR:"):
     raise Exception(s[:-1])
   infile.seek(0)
   infile.nobuffer() # Don't buffer the data
   ... process the XML from infile ...

This module also implements 'prepare_input_source(source)' modeled on xml.sax.saxutils.prepare_input_source. This opens a URL and if the input stream is not already seekable, wraps it in a ReseekFile.
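The seekability test behind this can be sketched as follows. It is a Python 3 illustration under stated assumptions: xml.sax.saxutils.prepare_input_source is the real standard-library function, while OneWayStream and needs_wrapping are hypothetical names used only here, and the check simply treats a failing tell() as "not seekable".

```python
import io
from xml.sax import saxutils


class OneWayStream:
    """Hypothetical non-seekable input: read() only, no tell() or seek()."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def read(self, size=-1):
        return self._inner.read(size)


def needs_wrapping(source):
    """Return True if the stream behind 'source' cannot report a position."""
    source = saxutils.prepare_input_source(source)
    f = source.getCharacterStream() or source.getByteStream()
    try:
        f.tell()
    except (AttributeError, OSError):
        return True
    return False


seekable = needs_wrapping(io.BytesIO(b"<doc/>"))
one_way = needs_wrapping(OneWayStream(b"<doc/>"))
```

Only the second case would need a buffering wrapper; an already seekable stream is left alone.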

Python, 161 lines
# Written in 2003 by Andrew Dalke, Dalke Scientific Software, LLC.
# This software has been released to the public domain.  No
# copyright is asserted.

from cStringIO import StringIO

class ReseekFile:
    """wrap a file handle to allow seeks back to the beginning

    Takes a file handle in the constructor.
    
    See the module docstring for more documentation.
    """
    def __init__(self, file):
        self.file = file
        self.buffer_file = StringIO()
        self.at_beginning = 1
        try:
            self.beginning = file.tell()
        except (IOError, AttributeError):
            self.beginning = 0
        self._use_buffer = 1
        
    def seek(self, offset, whence = 0):
        """offset, whence = 0

        Seek to a given byte position.  Only supports whence == 0
        and offset == the initial value of ReseekFile.tell() (which
        is usually 0, but not always.)
        """
        if whence != 0:
            raise TypeError("Unexpected whence value of %s; expecting 0" % \
                            (whence,))
        if offset != self.beginning:
            raise TypeError("Unexpected offset value of %r; expecting '%s'" % \
                             (offset, self.beginning))
        self.buffer_file.seek(0)
        self.at_beginning = 1
        
    def tell(self):
        """the current position of the file

        The initial position may not be 0 if the underlying input
        file supports tell and is not at position 0.
        """
        if not self.at_beginning:
            raise TypeError("ReseekFile cannot tell except at the beginning of file")
        return self.beginning

    def _read(self, size):
        if size < 0:
            y = self.file.read()
            z = self.buffer_file.read() + y
            if self._use_buffer:
                self.buffer_file.write(y)
            return z
        if size == 0:
            return ""
        x = self.buffer_file.read(size)
        if len(x) < size:
            y = self.file.read(size - len(x))
            if self._use_buffer:
                self.buffer_file.write(y)
            return x + y
        return x
        
    def read(self, size = -1):
        """read up to 'size' bytes from the file

        Default is -1, which means to read to end of file.
        """
        x = self._read(size)
        if self.at_beginning and x:
            self.at_beginning = 0
        self._check_no_buffer()
        return x

    def readline(self):
        """read a line from the file"""

        # Can we get it out of the buffer_file?
        s = self.buffer_file.readline()
        if s[-1:] != "\n":
            # No, so read the rest of the line from the input file
            t = self.file.readline()

            # Append the new data to the buffer, if still buffering
            if self._use_buffer:
                self.buffer_file.write(t)
            s = s + t

        # Returning data moves us away from the beginning, as in read()
        if self.at_beginning and s:
            self.at_beginning = 0
        self._check_no_buffer()

        return s

    def readlines(self):
        """read all remaining lines from the file"""
        s = self.read()
        lines = []
        i, j = 0, s.find("\n")
        while j > -1:
            lines.append(s[i:j+1])
            i = j+1
            j = s.find("\n", i)
        if i < len(s):
            # Only get here if the last line doesn't have a newline
            lines.append(s[i:])
        return lines

    def _check_no_buffer(self):
        # If 'nobuffer' called and finished with the buffer file
        # then get rid of the buffer and redirect everything to
        # the original input file.
        if self._use_buffer == 0 and self.buffer_file.tell() == \
                                        len(self.buffer_file.getvalue()):
            # I'm doing this for the slightly better performance
            self.seek = getattr(self.file, "seek", None)
            self.tell = getattr(self.file, "tell", None)
            self.read = self.file.read
            self.readline = self.file.readline
            self.readlines = self.file.readlines
            del self.buffer_file

    def nobuffer(self):
        """tell the ReseekFile to stop using the buffer once it's exhausted"""
        self._use_buffer = 0

def prepare_input_source(source):
    """given a URL, returns an xml.sax.xmlreader.InputSource

    Works like xml.sax.saxutils.prepare_input_source.  Wraps the
    InputSource in a ReseekFile if the URL returns a non-seekable
    file.

    To turn the buffer off if that happens, you'll need to do
    something like

    f = source.getCharacterStream()
     ...
    try:
       f.nobuffer()
    except AttributeError:
       pass

    or

    if isinstance(f, ReseekFile):
      f.nobuffer()
    
    """
    from xml.sax import saxutils
    source = saxutils.prepare_input_source(source)
    # Is this correct?  Don't know - don't have Unicode experience
    f = source.getCharacterStream() or source.getByteStream()
    try:
        f.tell()
    except (AttributeError, IOError):
        # Inside this module the name ReseekFile is the class itself
        f = ReseekFile(f)
        source.setByteStream(f)
        source.setCharacterStream(None)
    return source

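Don't save bound methods of a ReseekFile (for example, 'read = infile.read') and call them later. Once nobuffer() has been called and the buffer has been read to its end, the ReseekFile rebinds the input file's read/readline/readlines methods as instance attributes, so a bound method saved earlier still points at the old implementation. This gives slightly better performance at the cost of not allowing an infrequently used idiom.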
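A small illustration of why a saved bound method goes stale when an instance replaces its own method. Swapper and its methods are hypothetical names for this sketch; the only point carried over from the recipe is the instance-attribute rebinding.

```python
class Swapper:
    """Hypothetical class that replaces one of its methods on the instance."""

    def read(self):
        # On the first call, shadow the class method with a faster one.
        self.read = self._fast_read
        return "slow"

    def _fast_read(self):
        return "fast"


obj = Swapper()
saved = obj.read                 # bound method captured before the switch
results = [saved(), saved(), obj.read()]
```

The saved reference keeps calling the original method, while a fresh attribute lookup picks up the replacement. In the listing above the old methods would also touch buffer_file after it has been deleted, so looking the method up on the object each time is the safe pattern.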

Use tell() to get the beginning byte location. ReseekFile will attempt to get the real position from the wrapped file and use that as the beginning location. If the wrapped file does not support tell(), ReseekFile.tell() will return 0.
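The fallback can be sketched as a small helper. This is a Python 3 illustration; starting_offset and NoTell are hypothetical names, and OSError stands in for the IOError caught in the listing.

```python
import io


def starting_offset(f):
    """Return f's current position, or 0 if f cannot report one."""
    try:
        return f.tell()
    except (AttributeError, OSError):
        return 0


class NoTell:
    """Hypothetical stream that offers read() only."""

    def read(self, size=-1):
        return b""


partly_read = io.BytesIO(b"header\n<doc/>")
partly_read.readline()           # consume 7 bytes before wrapping

offset_a = starting_offset(partly_read)
offset_b = starting_offset(NoTell())
```

For a file that was partly consumed before being wrapped, the value to pass back to seek() is that starting offset rather than 0.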

readlines does not yet support a sizehint. Want to contribute an implementation?
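One possible shape for a sizehint-aware readlines, written as a standalone Python 3 function on top of readline(). The name readlines_with_hint is hypothetical, and the stopping rule (stop once the total length of the lines read reaches the hint) is an assumption modelled on how the standard io classes treat the hint.

```python
import io


def readlines_with_hint(f, sizehint=-1):
    """Read lines from f, stopping once their total length reaches sizehint.

    A sizehint of zero or less means "read everything".
    """
    lines = []
    total = 0
    while True:
        line = f.readline()
        if not line:
            break                # end of file
        lines.append(line)
        total += len(line)
        if 0 < sizehint <= total:
            break                # enough data collected
    return lines


f = io.StringIO("ab\ncd\nef\n")
first_batch = readlines_with_hint(f, 4)
remainder = readlines_with_hint(f)
```

Because it only relies on readline(), the same loop could be adapted as a method that calls self.readline(), so buffering behaviour would stay in one place.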

The latest version of this code can be found at http://www.dalkescientific.com/Python/