
You have a file with long lines split over two or more lines, with backslashes to indicate that a continuation line follows. You want to rejoin those split lines.

Python, 64 lines
class LogicalLines:
    def __init__(self, fileobj, continued=None):
        # self.seq: the underlying line-sequence
        # self.phys_num: current index into self.seq (physical line number)
        # self.logi_num: current index into self (logical line number)
        import xreadlines
        try: self.seq = fileobj.xreadlines()
        except AttributeError: self.seq = xreadlines.xreadlines(fileobj)
        self.phys_num = 0
        self.logi_num = 0
        # allow for optional passing of continued-function
        if not callable(continued):
            def continued(line):
                if line.endswith('\\\n'): return 1,line[:-2]
                else: return 0, line
        self.continued = continued
    def __getitem__(self, index):
        if index != self.logi_num:
            raise TypeError, "Only sequential access supported"
        self.logi_num += 1
        result = []
        while 1:
            # Note: we must intercept IndexError, since we may not
            # be finished, even when the underlying sequence is --
            # we may have one or more lines in result to be returned
            try: line = self.seq[self.phys_num]
            except IndexError:
                if result: break
                else: raise
            self.phys_num += 1
            continues, line = self.continued(line)
            result.append(line)
            if not continues: break
        # return string result
        return ''.join(result)

# here's an example function, showing off usage:
def show_logicals(fileob,numlines=5):
    ll = LogicalLines(fileob)
    for l in ll:
        print "Log#%d, phys# %d: %s" % (
            ll.logi_num, ll.phys_num, repr(l))
        if ll.logi_num>numlines: break

if __name__=='__main__':
    from cStringIO import StringIO
    ff = StringIO(
r"""prima \
seconda \
terza
quarta \
quinta
sesta
settima \
ottava
""")
    show_logicals( ff )

# a simpler approach, if the need is of a 1-off kind, might be:
# logical_line = []
# for physical_line in fileobj.xreadlines():
#     if physical_line.endswith('\\\n'):
#         logical_line.append(physical_line[:-2])
#     else:
#         logical_line = ''.join(logical_line) + physical_line
#         process_full_record(logical_line)
#         logical_line = []
# if logical_line: process_full_record(''.join(logical_line))

Inspired by Recipe 8.1 in O'Reilly's Perl Cookbook. See also http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66063, recipe "Read a text file by-paragraph", since the structure is quite similar. A more ad hoc approach, closer to the logic of the Perl recipe, is shown in the comment at the end of this recipe's code.

However, a class wrapper is a more natural and reusable approach in Python. It illustrates a kind of line-bunching that is similar to, but distinct from, the one in recipe 66063, and it is extensible in the same way: pass a "continued" function that takes a physical line and returns a pair. The first item is true if the line is to be continued and false if it finishes the logical line; the second item is the part (or all) of the physical line to use in composing the logical line. As in recipe 66063, wrapping a line sequence in a class that regroups its items is a general approach worth knowing.
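The "continued" protocol described above can be sketched in Python 3 as follows. This is a minimal illustration, not the recipe's class: it uses a generator instead of `__getitem__`, and the names `logical_lines`, `default_continued`, and `comma_continued` are hypothetical. The trailing-comma rule is an invented example of an alternative continuation convention.

```python
import io

def default_continued(line):
    """Default rule from the recipe: a trailing backslash-newline continues the line."""
    if line.endswith("\\\n"):
        return True, line[:-2]
    return False, line

def comma_continued(line):
    """Hypothetical alternative rule: a trailing comma continues the line."""
    stripped = line.rstrip("\n")
    if stripped.endswith(","):
        return True, stripped + " "
    return False, line

def logical_lines(fileobj, continued=default_continued):
    """Yield logical lines, asking `continued` about each physical line."""
    parts = []
    for physical in fileobj:
        continues, text = continued(physical)
        parts.append(text)
        if not continues:
            yield "".join(parts)
            parts = []
    # The input may end in the middle of a continued line.
    if parts:
        yield "".join(parts)

backslash_demo = list(logical_lines(io.StringIO("prima \\\nseconda\nterza\n")))
comma_demo = list(logical_lines(io.StringIO("a,\nb,\nc\nd\n"), comma_continued))
```

Here `backslash_demo` is `["prima seconda\n", "terza\n"]` and `comma_demo` is `["a, b, c\n", "d\n"]`: only the function passed as `continued` changes, while the joining loop stays the same.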

Here, the ending "if __name__=='__main__'" part performs a simple test, in this case with a simulated file object, just to show the basic functionality.

2 comments

daniel wang 17 years, 10 months ago

A generator version. I was going to use Alex's implementation, but since xreadlines has been deprecated, I wrote a generator version instead:

def loglines(rawdata):
    lines = []
    for i in rawdata.splitlines():
        lines.append(i)
        if not i.endswith("\\"):
            yield "".join(lines)
            lines = []
    if len(lines) > 0: yield "".join(lines)

This has the downside of needing the whole file at once (or having to chunk it manually), but has the nice upside of using splitlines to handle DOS/Unix/Mac line-ending conventions seamlessly.

# print out the merged lines of 'test.txt':
for i in loglines(open('test.txt').read()):
  print i
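The line-ending point can be checked with a small Python 3 sketch. It assumes a direct port of `loglines` (same logic, so the backslashes are still kept in the output) and three hypothetical sample strings that differ only in their line terminators.

```python
def loglines(rawdata):
    """Python 3 port of the generator above; continuation backslashes are kept."""
    lines = []
    for physical in rawdata.splitlines():
        lines.append(physical)
        if not physical.endswith("\\"):
            yield "".join(lines)
            lines = []
    if lines:
        yield "".join(lines)

# The same text with Unix, DOS, and classic Mac line endings.
unix = "one \\\ntwo\nthree\n"
dos = unix.replace("\n", "\r\n")
mac = unix.replace("\n", "\r")

results = [list(loglines(text)) for text in (unix, dos, mac)]
```

All three entries of `results` are the same list, `['one \\two', 'three']`, because `splitlines` removes whichever terminator is present before the backslash test runs.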

Martin Miller 9 years, 4 months ago

daniel wang's generator version doesn't remove the continuation characters from the lines it generates, plus it reads the entire file into memory at once (and converts it to a list).

Here's a generator version which processes its input iteratively and gets rid of the continuation characters.

def logical_lines(rawdata):
    parts = []
    for line in rawdata:
        line = line.rstrip()  # strip trailing whitespace, including the newline
        if line.endswith('\\'):
            parts.append(line[:-1])
        else:
            yield ''.join(parts) + line
            parts = []
    if parts: yield ''.join(parts)
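One detail of the generator above: `rstrip()` with no argument removes all trailing whitespace, so trailing spaces are dropped and a backslash followed by spaces still counts as a continuation. If that is not wanted, a possible Python 3 variant strips only the line terminator. This is a sketch under that assumption; the name `logical_lines_keep_spaces` and the sample text are hypothetical.

```python
import io

def logical_lines_keep_spaces(lines):
    """Join backslash-continued lines, stripping only the line terminator."""
    parts = []
    for line in lines:
        line = line.rstrip("\r\n")  # drop only "\n", "\r\n", or "\r"
        if line.endswith("\\"):
            parts.append(line[:-1])  # remove the continuation backslash
        else:
            yield "".join(parts) + line
            parts = []
    # The input may end in the middle of a continued line.
    if parts:
        yield "".join(parts)

sample = io.StringIO("alpha \\\nbeta  \ngamma \\  \ndelta\n")
joined = list(logical_lines_keep_spaces(sample))
```

With this sample, `joined` is `["alpha beta  ", "gamma \\  ", "delta"]`: the trailing spaces after `beta` survive, and the backslash followed by spaces on the third line is not treated as a continuation.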