A common task, especially in text processing, is to break some input sequence into chunks (lines, paragraphs, records, etc.) and process them one by one. The iterators of built-in strings and files can be considered such chunkers, breaking the object into characters and lines respectively. This recipe is a generic way to break any iterable into consecutive blocks, specified by a single delimiter or a pair of (start, end) delimiters.

from operator import itemgetter
from itertools import groupby

_undefined = object()

def iterblocks(iterable, start, end=_undefined, skip_delim=True):
    '''Create an iterator over consecutive items (I{blocks}) of the given iterable.

    @param start: The delimiter denoting the start of a block. It can be:
        1. a predicate C{p(item)} that returns true if the item is a
           delimiter and false otherwise, or
        2. a non-callable object C{obj}: equivalent to C{lambda item: item==start}
    @param end: If given, the delimiter denoting the end of a block. Items
        after an end delimiter but before the next start delimiter are skipped.
        Takes the same values as C{start}.
    @param skip_delim: True if the delimiter(s) are to be skipped, false otherwise.
    @type skip_delim: C{bool}
    '''
    def get_predicate(arg):
        # A bound arg.__eq__ can return NotImplemented (which is truthy) for
        # mixed types, so non-callables are always compared with ==.
        return arg if callable(arg) else (lambda item: item == arg)

    def stamped_items(items):
        # Pair every kept item with the number of the block it belongs to.
        count = 0
        startsblock = get_predicate(start)
        if end is _undefined:
            for item in items:
                if startsblock(item):
                    count += 1
                    if skip_delim: continue
                yield count, item
        else:
            endsblock = get_predicate(end)
            inblock = False
            for item in items:
                if inblock:
                    if endsblock(item):
                        inblock = False
                        if skip_delim: continue
                elif startsblock(item):
                    count += 1
                    inblock = True
                    if skip_delim: continue
                else: continue
                yield count, item

    for count, block in groupby(stamped_items(iterable), itemgetter(0)):
        # Drop the stamps; the generator expression keeps each block lazy.
        yield (item for _, item in block)

if __name__ == '__main__':
    import re
    # a slow version of str.split
    for chars in iterblocks('Hello World', re.compile(r'\s').match):
        print(''.join(chars))
    source = """\
> name1....
line_11
line_12
line_13
...
> name2 ...
line_21
line_22
...""".splitlines()
    for lines in iterblocks(source, start=re.compile('>').match, end='...'):
        print(list(lines))

v1.1: Fixed bug for consecutive delimiters
v1.2: Replaced "end=None" with "end=_undefined", since None can be a valid delimiter
v1.3: Added forgotten imports
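
The v1.2 note describes a general idiom: a private sentinel object as the default value, so that "argument omitted" can be told apart from "argument is None". A minimal sketch of the idea, written for Python 3; the function `describe` is hypothetical and exists only for illustration, it is not part of the recipe:

```python
# Sentinel default: distinguishes "argument omitted" from "argument is None".
_undefined = object()  # unique object; no caller-supplied value is identical to it

def describe(end=_undefined):
    """Hypothetical helper: report how the 'end' argument was supplied."""
    if end is _undefined:          # identity test, not equality
        return "no end delimiter"
    return "end delimiter: %r" % (end,)

print(describe())        # no end delimiter
print(describe(None))    # end delimiter: None
print(describe("..."))   # end delimiter: '...'
```

With a plain `end=None` default, the second call would be indistinguishable from the first, and None could never be used as a delimiter.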
Variant with itertools.groupby(). FWIW, with a well-crafted key= function, groupby() makes short work of data partitioning problems:
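
The commenter's snippet is not reproduced here. As an independent illustration of the general idea (not the commenter's code), here is a minimal Python 3 sketch in which the delimiter predicate itself serves as the groupby() key; `split_on` is a hypothetical name, and it only covers the single-delimiter case with delimiters dropped:

```python
from itertools import groupby

def split_on(iterable, is_delim):
    """Hypothetical helper: yield lists of items between runs of delimiters."""
    # bool() normalizes truthy results (e.g. regex match objects) so that a
    # run of consecutive delimiters collapses into a single group.
    for delim, group in groupby(iterable, key=lambda item: bool(is_delim(item))):
        if not delim:
            yield list(group)

print(["".join(chars) for chars in split_on("Hello  World", str.isspace)])
# ['Hello', 'World']
```

Because groupby() only merges adjacent items with equal keys, the items alternate between delimiter groups (key True) and block groups (key False), and keeping the latter is enough.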
Neat! It's also trivial to retain the delimiters by changing "return result" to "return result if skip_delim else result>=1". groupby is a real Swiss Army knife!
Incorporating comment #1: Here's a rewrite, using the method described in comment #1, and retaining the optional end and skip_delim logic of the original post:
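
The rewrite itself is not reproduced here. The following is only a rough Python 3 sketch of what a key-function version with the optional end and skip_delim arguments could look like, under the assumption that the key returns a block number for kept items and None for dropped ones. The names `_predicate`, `state` and `key` are illustrative, and blocks are returned as lists rather than lazy iterators for simplicity:

```python
import re
from itertools import groupby

_undefined = object()   # sentinel: None can itself be a delimiter

def _predicate(arg):
    # Callables are used as-is; anything else is compared by equality.
    return arg if callable(arg) else (lambda item: item == arg)

def iterblocks(iterable, start, end=_undefined, skip_delim=True):
    starts = _predicate(start)
    ends = None if end is _undefined else _predicate(end)
    state = {"count": 0, "inblock": False}

    def key(item):
        """Return the block number of item, or None if the item is dropped."""
        if ends is None:                      # single-delimiter mode
            if starts(item):
                state["count"] += 1
                return None if skip_delim else state["count"]
            return state["count"]
        if state["inblock"]:                  # (start, end) mode, inside a block
            if ends(item):
                state["inblock"] = False
                return None if skip_delim else state["count"]
            return state["count"]
        if starts(item):                      # (start, end) mode, block begins
            state["count"] += 1
            state["inblock"] = True
            return None if skip_delim else state["count"]
        return None                           # item lies between blocks

    for number, block in groupby(iterable, key):
        if number is not None:
            yield list(block)

source = ["> name1....", "line_11", "line_12", "line_13", "...",
          "> name2 ...", "line_21", "line_22", "..."]
for lines in iterblocks(source, start=re.compile(">").match, end="..."):
    print(lines)
# ['line_11', 'line_12', 'line_13']
# ['line_21', 'line_22']
```

The sketch relies on groupby() calling the key function once per item, in order, which is what lets the key carry state from one item to the next.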