
You want to access portions of a string. For example, you've read a fixed-width record and want to extract the fields.

# slicing is great, of course, but it only does one field at a time:
afield = theline[3:8]

# if you want to think in terms of field-length, struct.unpack may
# sometimes be handier (on Python 3, theline must then be a bytes object):
import struct

# get a 5-byte string, skip 3, then get 2 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
numremain = len(theline)-struct.calcsize(baseformat)
format = "%s %ds" % (baseformat, numremain)
leading, s1, s2, trailing = struct.unpack(format, theline)

# of course, the computation of the last field's length is best
# encapsulated in a function:
def fields(baseformat, theline, lastfield=None):
    numremain = len(theline)-struct.calcsize(baseformat)
    format = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)
# note that caching/memoizing on (baseformat, len(theline), lastfield) may
# well be useful here if this is called in a loop -- an easy speedup

# split at five byte boundaries:
numfives, therest = divmod(len(theline), 5)
form5 = "%s %dx" % ("5s "*numfives, therest)
fivers = struct.unpack(form5, theline)

# again, this is no doubt best encapsulated:
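def split_by(theline, n, lastfield=None):
    # cut into as many n-byte blocks as fit, then keep or skip the remainder
    numblocks, therest = divmod(len(theline), n)
    baseblock = "%ds " % n
    format = "%s%d%s" % (baseblock*numblocks, therest, lastfield and "s" or "x")
    return struct.unpack(format, theline)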

# chopping a string into individual characters is of course easier:
chars = list(theline)

# if you prefer to think of your data as being cut up at specific columns,
# then slicing and list comprehensions may be handier:
cuts = [8,14,20,26,30]
# a slice end of None means "to the end of the string"
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]

# once more, encapsulation is advisable:
def split_at(theline, cuts, lastfield=None):
    pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts) ]
    if lastfield:
        # keep everything after the last cut as one extra piece
        pieces.append(theline[cuts[-1]:])
    return pieces

This recipe is inspired by O'Reilly's "Perl Cookbook" Recipe 1.1. Python's slicing takes the place of Perl's substr. Perl's unpack and Python's struct.unpack are rather similar, though Perl's is slightly handier as it accepts a "field-length" of "*" for the last field to mean "all the rest", while, in Python, we have to compute and insert the exact length for either extraction or skipping. This shouldn't be a major issue, since such extraction tasks will most often be encapsulated into small, probably-local functions (where "memoizing", aka automatic caching, may help a lot with performance if the function is called in a loop, to avoid repeating some computations).
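The memoizing mentioned above could look roughly like the following sketch. It assumes Python 3 (so the line is a bytes object) and uses functools.lru_cache; the helper name compiled_format and the sample record are made up for illustration and are not part of the recipe.

```python
import struct
from functools import lru_cache

@lru_cache(maxsize=None)
def compiled_format(baseformat, linelength, lastfield=False):
    """Build (once per distinct argument set) a struct.Struct for the line."""
    numremain = linelength - struct.calcsize(baseformat)
    if numremain < 0:
        raise ValueError("line is shorter than the base format requires")
    tail = "s" if lastfield else "x"   # keep or skip the trailing bytes
    return struct.Struct("%s %d%s" % (baseformat, numremain, tail))

def fields(baseformat, theline, lastfield=False):
    # bool() normalizes the flag so None and False share one cache entry
    return compiled_format(baseformat, len(theline), bool(lastfield)).unpack(theline)

# hypothetical 28-byte record: 5 + 3 (skipped) + 8 + 8 + 4 trailing bytes
record = b"ABCDE---field001field002tail"
print(fields("5s 3x 8s 8s", record, lastfield=True))
```

When many lines share the same length, the format string is built and compiled only on the first call; later calls reuse the cached struct.Struct object.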

In a purely-Python context, the point of this recipe is to remind you that struct.unpack IS often a viable, and not rarely a preferable, alternative to slicing -- not quite as often as unpack vs substr in Perl, given the lack of a *-valued field-length, but often enough to be worth keeping in mind.
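To make the comparison concrete, here is a small side-by-side sketch on a made-up fixed-width record (the record layout is an assumption for illustration only). It is written for Python 3, where struct.unpack expects a bytes object.

```python
import struct

# hypothetical record: 5-byte code, 3 filler bytes, then two 8-byte fields
record = b"ABCDE---field001field002"

# slicing: one expression per field, with every offset worked out by hand
code_s, first_s, second_s = record[0:5], record[8:16], record[16:24]

# struct.unpack: a single format string that lists the field widths
code_u, first_u, second_u = struct.unpack("5s 3x 8s 8s", record)

# both approaches extract the same three fields
assert (code_s, first_s, second_s) == (code_u, first_u, second_u)
print(code_u, first_u, second_u)
```

With slicing, changing one field's width means adjusting every later offset; with the format string, only that field's count changes.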

In the code as presented, a decision worth noticing (and perhaps worth criticizing) is that of having a "lastfield=None" optional parameter to each of the encapsulation functions -- this reflects the observation that often we want to skip the last, unknown-length subfield, but often enough we want to retain it instead. The use of lastfield in the "cutesy" expression 'lastfield and "s" or "x"' (equivalent to C's "lastfield?'s':'x'") saves an 'if/else', but it's unclear whether the saving is worth the cuteness -- '"sx"[not lastfield]' and other similar alternatives being roughly equivalent in this respect. When lastfield is false, applying struct.unpack to just a prefix of theline (specifically theline[:struct.calcsize(format)]) is an alternative, but that's not easy to merge with the case of lastfield being true, when the format does need a supplementary Ns field for N=len(theline)-struct.calcsize(format).
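The alternatives discussed above can be tried out with the sketch below. The names tail_code and leading_fields, and the sample record, are hypothetical; the conditional expression ("s" if lastfield else "x") is one more spelling, available in later Python versions than the recipe was written for. Python 3 and a bytes record are assumed.

```python
import struct

def tail_code(lastfield):
    """Return "s" to keep the trailing bytes, "x" to skip them."""
    return "s" if lastfield else "x"

# the three spellings agree for the usual truthy and falsy arguments
for flag in (None, False, 0, True, 1, "yes"):
    assert tail_code(flag) == (flag and "s" or "x") == "sx"[not flag]

def leading_fields(baseformat, theline):
    """Unpack only the bytes that baseformat describes; ignore any remainder."""
    size = struct.calcsize(baseformat)
    return struct.unpack(baseformat, theline[:size])

# hypothetical record with 4 trailing bytes that we do not want
record = b"ABCDE---field001field002tail"
print(leading_fields("5s 3x 8s 8s", record))
```

The standard library's struct.unpack_from(baseformat, theline) reads from the start of a buffer that may be longer than the format, so it should give the same result without the explicit slice.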

Performance is not emphasized as crucial to any of these idioms, except for the reminder of memoizing as an often-useful technique. "Premature optimization is the root of all evil". Make your code CLEAR, SIMPLE, and SOLID, first, and worry about making it truly optimal only afterwards... if at all (most often, in real life, the clear, simple, solid solution will be fast enough!-).

1 comment

John Pywtorak 12 years, 1 month ago

I think the intention was the following slight patch for the last function:

def split_at(theline, cuts, lastfield=True):
    pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts) ]
    if lastfield:
        pieces.append(theline[cuts[-1]:])
    return pieces

Notice the boolean value vs. None, and the proper slicing in the conditional vs. the "()".

Thanks for the nice recipe

John Pywtorak
