ilines is a generator that takes an iterable and produces lines of text. The input iterable should produce blocks of bytes (as type str), such as those obtained by reading a file in binary mode. The output lines are formed by the same rule as the "universal newlines" file mode [f = file(name, 'U')]: CR, LF, and CR LF each end a line, and every line ending is returned as '\n'. Lines are produced "on-line": each line is yielded as soon as it is discovered.
def ilines(source_iterable):
    '''yield lines as in universal-newlines from a stream of data blocks'''
    tail = ''   # Unfinished line carried over from earlier blocks.
    for block in source_iterable:
        if not block:
            continue
        if tail.endswith('\015'):
            # The previous block ended in CR, so that line is complete.
            yield tail[:-1] + '\012'
            if block.startswith('\012'):
                pos = 1     # CR LF straddled two blocks: skip the LF.
            else:
                tail = ''
                pos = 0     # Was left unset (stale) in this branch.
        else:
            pos = 0
        # pos == 0 means nothing in this block has been consumed yet,
        # so the first line found must be prefixed with tail.
        try:
            while True: # While we are finding LF.
                npos = block.index('\012', pos) + 1
                try:
                    # Index just before the LF. Clamp at 0 so a block
                    # that starts with LF does not give rend == -1.
                    rend = max(npos - 2, 0)
                    rpos = block.index('\015', pos, rend)
                    if pos:
                        yield block[pos : rpos] + '\n'
                    else:
                        yield tail + block[:rpos] + '\n'
                    pos = rpos + 1
                    while True: # While CRs 'inside' the LF
                        rpos = block.index('\015', pos, rend)
                        yield block[pos : rpos] + '\n'
                        pos = rpos + 1
                except ValueError:
                    pass
                if '\015' == block[rend]:
                    # CR LF pair: drop the CR and emit a single '\n'.
                    if pos:
                        yield block[pos : rend] + '\n'
                    else:
                        yield tail + block[:rend] + '\n'
                elif pos:
                    yield block[pos : npos]
                else:
                    yield tail + block[:npos]
                pos = npos
        except ValueError:
            pass
        # No LFs left in block. Do all but final CR (in case LF)
        try:
            while True:
                rpos = block.index('\015', pos, -1)
                if pos:
                    yield block[pos : rpos] + '\n'
                else:
                    yield tail + block[:rpos] + '\n'
                pos = rpos + 1
        except ValueError:
            pass
        if pos:
            tail = block[pos:]
        else:
            tail += block
    if tail:
        # A CR at the very end of the data is a line ending too.
        if tail.endswith('\015'):
            tail = tail[:-1] + '\012'
        yield tail
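One way to build the block iterable that ilines expects is to read a file-like object in fixed-size pieces. This is only a sketch: the helper name read_blocks and the default block size are assumptions for illustration and are not part of the recipe.

```python
import io

def read_blocks(fileobj, size=4096):
    """Yield successive blocks of at most `size` items until EOF.

    Hypothetical helper (not part of the recipe); any object with a
    read(n) method will do.
    """
    while True:
        block = fileobj.read(size)
        if not block:
            return
        yield block

# Demo on an in-memory stream that mixes CR LF, CR and LF endings.
# A tiny block size forces the CR LF pair to straddle two blocks.
source = io.StringIO('one\r\ntwo\rthree\n')
blocks = list(read_blocks(source, size=4))
print(blocks)
```

The resulting blocks can then be passed to the recipe, for example `for line in ilines(read_blocks(f)): ...`.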
Many data sources produce their data in fits and starts: sockets, decompression, and (at its heart) most I/O. The data often doesn't arrive at convenient boundaries, but you usually want to consume it in logical units; for text, that usually means line by line. Python's "universal newline" file mode 'U' handles files written with any system's end-of-line convention. There are, however, other data sources (RSS feeds, decompression, timeout-controlled input) that produce raw bytes and would benefit from the same conversion.
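As an illustration of such a source, the sketch below turns a stream of zlib-compressed chunks into decompressed blocks as they arrive. The name inflate_blocks is hypothetical, and decoding as latin-1 is an assumption made here so that each byte maps to exactly one character and block boundaries cannot split a character.

```python
import zlib

def inflate_blocks(compressed_chunks):
    """Yield decompressed text blocks as compressed chunks arrive.

    Hypothetical helper; latin-1 is assumed so every byte becomes
    exactly one character.
    """
    decomp = zlib.decompressobj()
    for chunk in compressed_chunks:
        data = decomp.decompress(chunk)
        if data:
            yield data.decode('latin-1')
    rest = decomp.flush()   # Anything still buffered at end of stream.
    if rest:
        yield rest.decode('latin-1')

# Demo: compress some text, then hand it over in 16-byte pieces,
# much as a socket might deliver it.
text = 'alpha\r\nbeta\rgamma\n' * 50
packed = zlib.compress(text.encode('latin-1'))
chunks = [packed[i:i + 16] for i in range(0, len(packed), 16)]
blocks = list(inflate_blocks(chunks))
```

The block boundaries here have nothing to do with line boundaries, which is exactly the situation the recipe is meant to handle.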
Generators and iteration provide clear ways of expressing on-line operations. Often you don't need processes connected by pipes or threads connected by queues to get "buffering" behavior, and this recipe is an example of how generators can provide it. By connecting to a data source this way, a program showing the first screenful of text from a data source can fill that screen while the data is still being retrieved.
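The "first screenful" idea can be sketched with itertools.islice. To keep the example short it uses split_lf, a simplified LF-only stand-in for the recipe's generator, and counting_source, a wrapper that records which blocks were actually pulled; both names are assumptions for illustration.

```python
from itertools import islice

def counting_source(blocks, pulled):
    """Pass blocks through, recording each one as it is requested."""
    for block in blocks:
        pulled.append(block)
        yield block

def split_lf(blocks):
    """Simplified stand-in for ilines: splits on LF only."""
    tail = ''
    for block in blocks:
        pieces = (tail + block).split('\n')
        tail = pieces.pop()         # Last piece is an unfinished line.
        for piece in pieces:
            yield piece + '\n'
    if tail:
        yield tail

blocks = ['ab\ncd', '\nef\n', 'gh\n', 'ij\n']
pulled = []
lines = split_lf(counting_source(blocks, pulled))

# Take only the first two lines, as a pager showing one screen would.
first_two = list(islice(lines, 2))
print(first_two, len(pulled))
```

Only as many blocks are requested from the source as are needed to produce the lines asked for; the remaining blocks are never read.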
This recipe provides a useful tool for extracting text lines from arbitrary data sources. The recipe also provides a relatively simple example of how to build on-line algorithms.
It is very useful with the csv reader, especially when I need to process a stream that cannot be opened in universal newline mode, e.g. BlobStore in Google AppEngine.
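A minimal sketch of that csv use: csv.reader accepts any iterator that yields one line of text at a time, so a line generator can be passed to it directly. The generator normalized_lines below is a hypothetical stand-in for the output of ilines.

```python
import csv

def normalized_lines():
    """Stand-in for ilines output: lines that already end in LF."""
    yield 'name,qty\n'
    yield 'apple,3\n'
    yield '"pear, ripe",5\n'

# csv.reader pulls lines from the generator one at a time.
rows = list(csv.reader(normalized_lines()))
print(rows)
```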