Read a text file by-paragraph

Text files are most often read by-line, with excellent direct Python support. Sometimes we need to use other units, such as the paragraph -- a sequence of non-empty lines separated by empty lines. Python doesn't support that directly, but, as usual, it's not too hard to add such functionality.

class Paragraphs:
    def __init__(self, fileobj, separator=None):
        # self.seq: the underlying line-sequence
        # self.line_num: current index into self.seq (line number)
        # self.para_num: current index into self (paragraph number)
        import xreadlines
        try: self.seq = fileobj.xreadlines()
        except AttributeError: self.seq = xreadlines.xreadlines(fileobj)
        self.line_num = 0
        self.para_num = 0
        # allow for optional passing of separator-function
        if separator is None:
            def separator(line): return line == '\n'
        elif not callable(separator):
            raise TypeError, "separator argument must be callable"
        self.separator = separator
    def __getitem__(self, index):
        if index != self.para_num:
            raise TypeError, "Only sequential access supported"
        self.para_num += 1
        # start where we left off, and skip 0+ separator lines
        i = self.line_num
        while 1:
            # note: if this raises IndexError, it's OK to propagate
            # it, since we're also a finished-sequence in this case
            line = self.seq[i]
            i += 1
            if not self.separator(line): break
        # accumulate 1+ non-blank lines into list result
        result = [line]
        while 1:
            # here we must intercept IndexError, since we're not
            # finished, even when the underlying sequence is --
            # we have one or more lines in result to be returned
            try: line = self.seq[i]
            except IndexError: break
            i += 1
            if self.separator(line): break
            result.append(line)
        # update self state, return string result
        self.line_num = i
        return ''.join(result)

# here's an example function, showing off usage:
def show_paragraphs(filename,numpars=5):
    pp = Paragraphs(open(filename))
    for p in pp:
        print "Par#%d, line# %d: %s" % (
            pp.para_num, pp.line_num, repr(p))
        if pp.para_num >= numpars: break

      

We define a 'paragraph' as the string formed by joining a non-empty sequence of non-separator lines; adjoining paragraphs are separated by non-empty sequences of separator lines. By default, a separator line is one that equals '\n' (an empty line), but the concept is easy to generalize: client code may pass a separator-discriminant function at instantiation time, which may be any callable that takes a line and returns true for a separator line.
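To make the separator-discriminant idea concrete, here is a minimal sketch in Python 3. It is a generator rather than the recipe's class, and the names iter_paragraphs and is_blank are invented for this illustration. It assumes the same default as the recipe (a separator is a line equal to '\n') and rejects a non-callable separator instead of ignoring it.

```python
def iter_paragraphs(lines, separator=None):
    """Yield paragraphs (joined runs of non-separator lines) from an iterable of lines."""
    if separator is None:
        # same default as the recipe: a separator line is exactly '\n'
        def separator(line):
            return line == '\n'
    elif not callable(separator):
        raise TypeError("separator argument must be callable")
    paragraph = []
    for line in lines:
        if separator(line):
            if paragraph:            # a separator closes a non-empty paragraph
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:                    # last paragraph, if the input did not end on a separator
        yield ''.join(paragraph)


def is_blank(line):
    # hypothetical custom discriminant: whitespace-only lines also count as separators
    return not line.strip()


text = ["one\n", "two\n", "\n", "  \n", "three\n"]
default_result = list(iter_paragraphs(text))           # '  \n' is NOT a separator here
blank_result = list(iter_paragraphs(text, is_blank))   # '  \n' IS a separator here
```

With the default discriminant, the whitespace-only line is treated as ordinary text and ends up inside the second paragraph; with is_blank it is skipped along with the empty line.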

This adapter class is a special case of sequence adaptation by bunching: an underlying sequence (here, a sequence of lines, provided by xreadlines on a file or file-like object) is bunched up into another sequence of larger units (here, a sequence of paragraph-strings). The pattern is easy to generalize to other sequence-bunching needs. (Of course, it's even easier with iterators and generators in Python 2.2, but even good old Python 2.1 is pretty good already :-).)
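As a rough illustration of how the bunching pattern generalizes, the following Python 3 sketch uses itertools.groupby, which is a different technique from the recipe's index-based adapter. The helper name bunch and its parameters are assumptions made up for this example.

```python
from itertools import groupby


def bunch(items, is_separator, combine):
    """Yield combine(run) for each maximal run of non-separator items."""
    # groupby splits the input into consecutive runs sharing the same key;
    # bool() normalizes whatever truth value the discriminant returns
    for is_sep, run in groupby(items, key=lambda item: bool(is_separator(item))):
        if not is_sep:
            yield combine(list(run))


# lines bunched into paragraph-strings
lines = ["a\n", "b\n", "\n", "\n", "c\n"]
paragraphs = list(bunch(lines, lambda line: line == "\n", "".join))

# the same helper bunches numbers between zeros into sums
sums = list(bunch([1, 2, 0, 0, 3, 0, 4, 5], lambda n: n == 0, sum))
```

Only the discriminant and the combining function change between the two uses; the bunching logic itself is shared.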

We need an index into the underlying sequence, and a way to check that our __getitem__ is being called with properly sequential indices (as the for statement does). We take the occasion to expose both indices as potentially useful attributes, .line_num and .para_num, of our object. Client code can thus determine, at any point during a sequential scan, where we are in the underlying line sequence, in the paragraph sequence, or both, without needing to keep track of things itself.
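The same bookkeeping can be sketched with the Python 3 iterator protocol instead of __getitem__. This is only an illustrative variant, not the recipe's code: the class name ParagraphReader is hypothetical, the separator is fixed to '\n' for brevity, and io.StringIO stands in for a real file.

```python
import io


class ParagraphReader:
    """Iterate over paragraphs while exposing line_num and para_num."""

    def __init__(self, fileobj):
        self._lines = iter(fileobj)
        self.line_num = 0   # lines consumed so far from the underlying file
        self.para_num = 0   # paragraphs returned so far

    def __iter__(self):
        return self

    def __next__(self):
        result = []
        for line in self._lines:
            self.line_num += 1
            if line == '\n':
                if result:
                    break       # a separator ends the current paragraph
                continue        # skip separators before the paragraph starts
            result.append(line)
        if not result:
            raise StopIteration  # underlying lines exhausted, nothing accumulated
        self.para_num += 1
        return ''.join(result)


reader = ParagraphReader(io.StringIO("a\nb\n\n\nc\n"))
# record where the reader stands right after each paragraph is produced
positions = [(reader.para_num, reader.line_num, p) for p in reader]
```

After the first paragraph the reader has consumed three lines (two text lines plus the separator that ended the paragraph), which mirrors how the recipe's .line_num points just past the terminating separator.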

The code emphasizes clarity and linearity -- no special tricks. Thus, we have two separate loops, each in the usual "while 1: ... if xxx: break" pattern: first, one to skip over 0+ separators that may occur; then, a separate one to accumulate non-separators into a result list. The loops might be merged to save a few lines, but only at the cost of extra complexity (a status variable recalling if we're skipping separators or accumulating non-separators), not a good trade-off here (nor in most other places!-). We use a separate local variable i rather than operating on self.line_num directly in the body of method __getitem__ -- a stylistic choice that seems preferable here (more concision, reached in a way that enhances clarity as well as speed in this case). Again, we could save a couple of lines by eschewing this and using self.line_num more directly in the four spots we currently use i (two in each while loop). The second loop might be made shorter by using 'while not self.separator(line)' as the head, since we know we have line as a non-separator at the start, but I preferred to keep the current similitude and symmetry between the two loops intact -- again, a stylistic choice promoting simplicity.
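For comparison only, here is a sketch of what the merged single-loop alternative might look like, written in Python 3 over a plain list of lines. The function name next_paragraph and the flag accumulating are invented for this illustration; the flag is the extra status variable mentioned above.

```python
def next_paragraph(lines, start):
    """Scan lines from index start; return (paragraph, next_index).

    The paragraph is '' when no non-separator line remains.
    """
    result = []
    accumulating = False      # status variable: False while skipping separators
    i = start
    while i < len(lines):
        line = lines[i]
        i += 1
        if line == '\n':
            if accumulating:
                break         # separator after text: the paragraph is complete
            # otherwise still skipping leading separators
        else:
            accumulating = True
            result.append(line)
    return ''.join(result), i


lines = ["\n", "a\n", "b\n", "\n", "c\n"]
first, pos = next_paragraph(lines, 0)
second, pos2 = next_paragraph(lines, pos)
third, pos3 = next_paragraph(lines, pos2)
```

One loop now does both jobs, but every separator line has to consult the flag to decide whether it ends a paragraph or is merely being skipped, which is the added complexity the two-loop version avoids.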

Function show_paragraphs shows off all the simple features of class Paragraphs and can be used to unit-test the latter by feeding it a known textfile.

Tags: files

4 comments

Magnus Lie Hetland 22 years, 9 months ago

Using iterators and generators. A simple generator version (requires "from __future__ import generators" in 2.2):

def paragraphs(file, separator=None):
    if not callable(separator):
        def separator(line): return line == '\n'
    paragraph = []
    for line in file:
        if separator(line):
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    yield ''.join(paragraph)

Magnus Lie Hetland 22 years, 9 months ago

Addendum. If one doesn't want an empty paragraph at the end when the file has one or more trailing separators (and one usually wouldn't want that), the last line should probably read:

if paragraph: yield ''.join(paragraph)

Terry Reedy 22 years, 9 months ago

Wrong separator. If someone misunderstands the call sequence and passes in a string such as '++\n', you silently replace it with the wrong thing:

if not callable(separator):
    def separator(line): return line == '\n'
self.separator = separator

Better to generate the right thing, or raise TypeError, by inserting:

if separator != None: raise TypeError, "separator must be callable"

Alex Martelli (author) 22 years, 8 months ago

Good point! Fixed in this new version, thanks for pointing it out. It's definitely better to diagnose an unusable argument than to silently ignore it :-).


Read a text file by-paragraph (Python recipe) by Alex Martelli
ActiveState Code (http://code.activestate.com/recipes/66063/)
