You have a file with long lines split over two or more lines, with backslashes to indicate that a continuation line follows. You want to rejoin those split lines.
class LogicalLines:
    def __init__(self, fileobj, continued=None):
        # self.seq: the underlying line-sequence
        # self.phys_num: current index into self.seq (physical line number)
        # self.logi_num: current index into self (logical line number)
        import xreadlines
        try: self.seq = fileobj.xreadlines()
        except AttributeError: self.seq = xreadlines.xreadlines(fileobj)
        self.phys_num = 0
        self.logi_num = 0
        # allow for optional passing of continued-function
        if not callable(continued):
            def continued(line):
                if line.endswith('\\\n'): return 1,line[:-2]
                else: return 0, line
        self.continued = continued
    def __getitem__(self, index):
        if index != self.logi_num:
            raise TypeError, "Only sequential access supported"
        self.logi_num += 1
        result = []
        while 1:
            # Note: we must intercept IndexError, since we may not
            # be finished, even when the underlying sequence is --
            # we may have one or more lines in result to be returned
            try: line = self.seq[self.phys_num]
            except IndexError:
                if result: break
                else: raise
            self.phys_num += 1
            continues, line = self.continued(line)
            result.append(line)
            if not continues: break
        # return string result
        return ''.join(result)

# here's an example function, showing off usage:
def show_logicals(fileob,numlines=5):
    ll = LogicalLines(fileob)
    for l in ll:
        print "Log#%d, phys# %d: %s" % (
            ll.logi_num, ll.phys_num, repr(l))
        if ll.logi_num>numlines: break
if __name__=='__main__':
    from cStringIO import StringIO
    # raw string, so each backslash-newline pair reaches the parser intact
    ff = StringIO(
r"""prima \
seconda \
terza
quarta \
quinta
sesta
settima \
ottava
""")
    show_logicals( ff )

# a simpler approach, if the need is of a 1-off kind, might be:
# logical_line = []
# for physical_line in fileobj.xreadlines():
#     if physical_line.endswith('\\\n'):
#         logical_line.append(physical_line[:-2])
#     else:
#         logical_line = ''.join(logical_line) + physical_line
#         process_full_record(logical_line)
#         logical_line = []
# if logical_line: process_full_record(''.join(logical_line))

Inspired by Recipe 8.1 in O'Reilly's Perl Cookbook. See also http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66063, recipe "Read a text file by-paragraph", since the structure is quite similar. We could have picked a more ad-hoc approach, closer to the logic of the Perl recipe, here shown in the ending comment of this recipe.
However, a class wrapper is a much more natural, reusable approach in Python. It illustrates a kind of line-bunching similar to, but distinct from, that of recipe 66063, and it is extensible in the same way: you can pass a "continued" function that takes a physical line and returns a pair. The first item is true if the line is to be continued and false if it finishes the logical line; the second item is the part (or all) of the physical line to use in composing the logical line. Again, wrapping a sequence of lines in a class that bunches them is an important general approach.
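The "continued" hook described above can be sketched in present-day Python 3 as a generator rather than a class. This is a port for illustration, not the recipe's original code; the names `logical_lines` and `ampersand_continued`, and the trailing-`&` convention, are assumptions made for the example.

```python
import io

def logical_lines(lines, continued=None):
    """Yield logical lines built from an iterable of physical lines.

    `continued` takes a physical line and returns (flag, text): flag is
    true when another physical line follows, and text is the part of the
    physical line to keep in the logical line.
    """
    if not callable(continued):
        def continued(line):
            # Default rule: a backslash right before the newline continues.
            if line.endswith('\\\n'):
                return True, line[:-2]
            return False, line
    parts = []
    for line in lines:
        more, text = continued(line)
        parts.append(text)
        if not more:
            yield ''.join(parts)
            parts = []
    if parts:
        # Input ended on a continued line: emit what was collected.
        yield ''.join(parts)

def ampersand_continued(line):
    # Hypothetical rule: a trailing '&' marks a continued line.
    stripped = line.rstrip('\n')
    if stripped.endswith('&'):
        return True, stripped[:-1]
    return False, line

sample = io.StringIO("alpha &\nbeta\ngamma\n")
result = list(logical_lines(sample, ampersand_continued))
print(result)
```

Swapping in a different rule only requires another function with the same (flag, text) return shape; the joining loop stays untouched.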
Here, the ending "if __name__=='__main__'" part performs a simple test, in this case with a simulated file object, just to show the basic functionality.
A generator version. I was going to use Alex's implementation, but since xreadlines has been deprecated, I wrote a generator version instead:
This has the downside of needing the whole file at once (or of having to chunk it manually), but has the nice upside of using splitlines to handle DOS/Unix/Mac line-ending conventions seamlessly.
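A splitlines-based generator with that trade-off might look like the sketch below. It is not the commenter's code; the name `logical_lines_from_text` and the choice to drop the trailing backslash are assumptions made for illustration.

```python
def logical_lines_from_text(text):
    """Yield logical lines from text that is already fully in memory."""
    parts = []
    # str.splitlines() drops the terminators, whatever convention they use.
    for line in text.splitlines():
        if line.endswith('\\'):
            parts.append(line[:-1])   # continued: keep text, drop the backslash
        else:
            parts.append(line)
            yield ''.join(parts)
            parts = []
    if parts:
        # Text ended on a continued line: emit what was collected.
        yield ''.join(parts)

# Physical lines end in DOS (\r\n), Unix (\n) and old Mac (\r) style.
mixed = "one \\\r\ntwo\nthree \\\rfour\r"
joined = list(logical_lines_from_text(mixed))
print(joined)
```

Because `splitlines` removes the terminators, the logical lines come back without a trailing newline, unlike those produced by the class above.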
daniel wang's generator version doesn't remove the continuation characters from the lines it generates, plus it reads the entire file into memory at once (and converts it to a list).
Here's a generator version which processes its input iteratively and gets rid of the continuation characters.
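One way such an iterative generator could be written in Python 3 is sketched below. It is an illustration under assumptions, not the commenter's listing: the name `iter_logical_lines` is hypothetical, and the input is assumed to be any iterable of text lines, such as an open file.

```python
import io

def iter_logical_lines(fileobj):
    """Lazily yield logical lines, reading one physical line at a time."""
    parts = []
    for line in fileobj:
        body = line.rstrip('\r\n')
        if body.endswith('\\'):
            # Continued line: drop the backslash and the line terminator.
            parts.append(body[:-1])
        else:
            # Final physical line: keep it whole, terminator included.
            parts.append(line)
            yield ''.join(parts)
            parts = []
    if parts:
        # Input ended on a continued line: emit what was collected.
        yield ''.join(parts)

source = io.StringIO("prima \\\nseconda \\\nterza\nquarta \\\nquinta\nsesta\n")
logicals = list(iter_logical_lines(source))
print(logicals)
```

Only the physical lines of the logical line currently being assembled are held in memory, so the input can be arbitrarily large.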