Text files are most often read line by line, which Python supports directly and well. Sometimes we need other units, such as the paragraph -- a sequence of non-empty lines set off from its neighbors by empty lines. Python doesn't support that directly, but, as usual, it's not too hard to add such functionality.
class Paragraphs:

    def __init__(self, fileobj, separator=None):
        # self.seq: the underlying line-sequence
        # self.line_num: current index into self.seq (line number)
        # self.para_num: current index into self (paragraph number)
        import xreadlines
        try: self.seq = fileobj.xreadlines()
        except AttributeError: self.seq = xreadlines.xreadlines(fileobj)
        self.line_num = 0
        self.para_num = 0
        # allow for optional passing of separator-function
        if separator is None:
            def separator(line): return line == '\n'
        elif not callable(separator):
            raise TypeError, "separator argument must be callable"
        self.separator = separator

    def __getitem__(self, index):
        if index != self.para_num:
            raise TypeError, "Only sequential access supported"
        self.para_num += 1
        # start where we left off, and skip 0+ separator lines
        i = self.line_num
        while 1:
            # note: if this raises IndexError, it's OK to propagate
            # it, since we're also a finished-sequence in this case
            line = self.seq[i]
            i += 1
            if not self.separator(line): break
        # accumulate 1+ non-blank lines into list result
        result = [line]
        while 1:
            # here we must intercept IndexError, since we're not
            # finished, even when the underlying sequence is --
            # we have one or more lines in result to be returned
            try: line = self.seq[i]
            except IndexError: break
            i += 1
            if self.separator(line): break
            result.append(line)
        # update self state, return string result
        self.line_num = i
        return ''.join(result)

# here's an example function, showing off usage:
def show_paragraphs(filename, numpars=5):
    pp = Paragraphs(open(filename))
    for p in pp:
        print "Par#%d, line# %d: %s" % (
            pp.para_num, pp.line_num, repr(p))
        if pp.para_num > numpars: break
We define a 'paragraph' as the string formed by joining a non-empty sequence of consecutive non-separator lines; adjoining paragraphs are separated from each other by non-empty sequences of separator lines. By default, a separator line is one that equals '\n' (an empty line), but the concept is easy to generalize: client code may pass a separator-discriminant function at instantiation time, and it may be any callable that takes a line and returns true for a separator line.
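To make the separator hook concrete, here is a minimal sketch of two candidate predicates, written for current Python 3 rather than the Python 2.1 of the recipe. The names is_empty_line and is_whitespace_only are hypothetical, chosen only for illustration; the first mirrors the recipe's default test, the second is one possible generalization.

```python
def is_empty_line(line):
    # Same test the recipe applies by default: the line is just a newline.
    return line == '\n'

def is_whitespace_only(line):
    # Hypothetical looser variant: a line holding only spaces or tabs
    # (plus its newline) also counts as a separator.
    return not line.strip()

sample = ['first\n', '   \n', 'second\n', '\n', 'third\n']
default_flags = [is_empty_line(line) for line in sample]
loose_flags = [is_whitespace_only(line) for line in sample]
```

With the class above, such a predicate would be passed as the second argument, along the lines of Paragraphs(fileobj, is_whitespace_only). Under the default test the line of three spaces stays inside a paragraph; under the looser test it splits 'first' from 'second'.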
This adapter class is a special case of sequence adaptation by bunching: an underlying sequence (here, a sequence of lines, provided by xreadlines on a file or file-like object) is bunched up into another sequence of larger units (here, a sequence of paragraph-strings). The pattern is easy to generalize to other sequence-bunching needs. (Of course, it's even easier with iterators and generators in Python 2.2, but even good old Python 2.1 is pretty good already :-).)
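For comparison, the generator approach mentioned above might look roughly like the following. This is a sketch under assumptions, not part of the recipe: it targets current Python 3, the function name paragraphs is hypothetical, and it does not expose line or paragraph counters the way the class does.

```python
import io

def paragraphs(lines, separator=None):
    """Yield each paragraph found in `lines` as one joined string."""
    if separator is None:
        # Same default as the recipe: a separator is an empty line.
        def separator(line):
            return line == '\n'
    bunch = []
    for line in lines:
        if separator(line):
            # A separator closes the paragraph in progress, if there is one;
            # runs of separators therefore produce no empty paragraphs.
            if bunch:
                yield ''.join(bunch)
                bunch = []
        else:
            bunch.append(line)
    # Flush a final paragraph that is not followed by a separator.
    if bunch:
        yield ''.join(bunch)

text = io.StringIO("one\ntwo\n\n\nthree\n")
result = list(paragraphs(text))
```

Any iterable of lines works as input, so an open file can be passed in place of the io.StringIO object used here.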
We need an index into the underlying sequence, and a way to check that our __getitem__ is being called with properly sequential indices (as the for statement does). We take the occasion to expose both indices as potentially useful attributes, .line_num and .para_num, of our object. Thus, during a sequential scan, client code can determine where we are in the underlying line sequence, in the paragraph sequence, or in both, without needing to keep track of things itself.
The code emphasizes clarity and linearity -- no special tricks. Thus, we have two separate loops, each in the usual "while 1: ... if xxx: break" pattern: first, one to skip over 0+ separators that may occur; then, a separate one to accumulate non-separators into a result list. The loops might be merged to save a few lines, but only at the cost of extra complexity (a status variable recalling whether we're skipping separators or accumulating non-separators), not a good trade-off here (nor in most other places!-). We use a separate local variable i rather than operating on self.line_num directly in the body of method __getitem__ -- a stylistic choice that seems preferable here, since it is more concise and enhances both clarity and speed. We could save a couple of lines by eschewing this and using self.line_num directly in the four spots where we currently use i (two in each while loop). The second loop might be made shorter by using 'while not self.separator(line)' as the head, since we know that line is a non-separator at the start, but I preferred to keep the similarity and symmetry between the two loops intact -- again, a stylistic choice promoting simplicity.
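To show what the merged-loop alternative would cost, here is a rough sketch of a single loop driven by a status variable. It is an illustration only, written for current Python 3 as a free function over a list of lines; the names next_paragraph and accumulating are hypothetical, and the function stands in for the body of __getitem__ rather than reproducing it.

```python
def next_paragraph(lines, start, separator=lambda line: line == '\n'):
    """Return (paragraph, next_index); raise IndexError if none is left."""
    i = start
    result = []
    accumulating = False  # the extra status variable the merge requires
    while True:
        try:
            line = lines[i]
        except IndexError:
            if accumulating:
                break   # input ended mid-paragraph: return what we have
            raise       # input ended while skipping: the sequence is finished
        i += 1
        if separator(line):
            if accumulating:
                break   # a separator closes the paragraph
            # otherwise we are still skipping leading separators
        else:
            accumulating = True
            result.append(line)
    return ''.join(result), i

lines = ['\n', 'a\n', 'b\n', '\n', '\n', 'c\n']
first, pos = next_paragraph(lines, 0)
second, end = next_paragraph(lines, pos)
```

Every branch now has to consult the status variable, including the IndexError handler, which is the kind of extra complexity the two-loop version avoids.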
Function show_paragraphs shows off all the simple features of class Paragraphs and can be used to unit-test the class by feeding it a known text file.