Splits a string hierarchically, applying several separators in turn.

Python, 30 lines
def nestedSplit(astring, sep=None, *subsep):
    r"""nestedSplit(astring, sep=None, *subsep): given astring and one or more separator
    strings, it splits astring hierarchically. The first separator is the highest-level one.
    Ex.: nestedSplit("a b\nc d", "\n", " ") => [['a', 'b'], ['c', 'd']] """
    if subsep:
        return [nestedSplit(fragment, *subsep) for fragment in astring.split(sep)]
    return astring.split(sep)


if __name__ == '__main__':
    st = "a b\nc d"
    print(st)
    print(nestedSplit(st, "\n", " "))
    print()

    tetris = """\
    ....
    .##.
    .##.
    ....

    ####
    ####
    ..##
    ..##"""

    from textwrap import dedent
    tetris = dedent(tetris)
    print(tetris)
    print(nestedSplit(tetris, "\n\n", "\n"))

This is only a very basic implementation; it can be extended or improved in many ways:

  • removing the recursion to speed it up a little;
  • removing empty lines, filtering out unwanted things, etc;
  • mapping a given function on the leaves of this tree of lists;
  • adding another feature, alternative splitting strings: a sequence of alternative separators can be given in place of a single string, for example:

s = "a b\nc,d"
multiSplit(s, "\n", [" ", ","]) ==> [['a', 'b'], ['c', 'd']]
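
The last three ideas in the list above could be combined roughly as follows. This is only a sketch under stated assumptions, not the recipe's own code: the name multi_split and the keyword arguments leaf and keep_empty are hypothetical, the alternatives are matched with a regular expression built from re.escape, and the signature relies on Python 3 keyword-only arguments.

```python
import re


def multi_split(text, sep=None, *subsep, leaf=None, keep_empty=True):
    """Hypothetical variant of nestedSplit.

    Each separator may be None (split on whitespace), a single string,
    or a list/tuple of alternative strings.
    """
    if sep is None or isinstance(sep, str):
        parts = text.split(sep)
    else:
        if not sep:
            raise ValueError("empty sequence of alternative separators")
        # Try longer alternatives first so that "\n\n" wins over "\n".
        alternatives = sorted(sep, key=len, reverse=True)
        pattern = "|".join(re.escape(alt) for alt in alternatives)
        parts = re.split(pattern, text)

    if not keep_empty:
        # Drop empty fragments, e.g. those produced by blank lines.
        parts = [part for part in parts if part]

    if subsep:
        # Split each fragment again with the remaining separators.
        return [multi_split(part, *subsep, leaf=leaf, keep_empty=keep_empty)
                for part in parts]

    # Lowest level: optionally map a function over the leaves.
    if leaf is not None:
        return [leaf(part) for part in parts]
    return parts


if __name__ == '__main__':
    print(multi_split("a b\nc,d", "\n", [" ", ","]))
    print(multi_split("1 2\n\n3 4", "\n", " ", leaf=int, keep_empty=False))
```

With these assumptions the first call gives [['a', 'b'], ['c', 'd']] and the second gives [[1, 2], [3, 4]].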

V. 1.1: fixed a silly bug (originally the parameters were inverted). Thank you, Shane Holloway; next time I'll test things more.
V. 1.2: changed the description to use a word that exists in the English language.
V. 1.3: changed the name and improved the function based on Erik Wilsher's code (although nestedSplit tests subsep directly instead of using not).

5 comments

Shane Holloway 18 years, 5 months ago

splits[1:] vs splits[-1:].

def multiSplit(astring, *splits):
    """multiSplit(astring, *splits): given astring, and one or more split strings, it
    splits astring gerarchically. The first split key is the higher level one. Ex.:
    s = "a b\nc d"
    split(s, "\n", " ") => [['a', 'b'], ['c', 'd']]  """
    if len(splits) <= 1:
        return astring.split(*splits)
    else:
        sp = splits[1:]
        return [multiSplit(st, *sp) for st in astring.split(splits[0])]

Notice the splits[1:] instead of splits[-1:] -- in case you want to split on more than three parameters.
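
A quick illustration of the difference, assuming three separators like the ones used for the tetris example:

```python
splits = ("\n\n", "\n", " ")

# Dropping only the first separator keeps every remaining level.
print(splits[1:])    # ('\n', ' ')

# Taking just the last element loses the middle separator.
print(splits[-1:])   # (' ',)
```

With only two separators both slices give the same result, so the bug shows up only at three or more levels.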

Now you can correctly split:

tetris = """\
. . . .
. # # .
. # # .
. . . .

# # # #
# # # #
. . # #
. . # #"""

from textwrap import dedent
tetris = dedent(tetris)
print(tetris)
print(multiSplit(tetris, "\n\n", "\n", " "))

Martin Miller 18 years, 5 months ago

gerarchically? Took me a moment, but I think the term you mean in the description is "hierarchically", not "gerarchically". The latter isn't a word in the English language.

It's not a coding issue, but fixing it might help people understand and find the recipe...

Erik Wilsher 18 years, 2 months ago

You could rearrange the argument list to have a mandatory first separator. With a mandatory first separator argument, you can better emulate the behaviour of string.split. In addition, you avoid slicing the separator arguments.

def deepsplit(s, sep=None, *subsep):
    r"""deepsplit -- a nested string splitting function
    usage:
    >>> s = 'a b\nc d'
    >>> deepsplit(s)  #split on whitespace, flat
    ['a', 'b', 'c', 'd']
    >>> deepsplit(s, ' ')  #split on space, flat
    ['a', 'b\nc', 'd']
    >>> deepsplit(s, '\n', ' ') #split on newline, then space
    [['a', 'b'], ['c', 'd']]
    """
    if not subsep:
        return s.split(sep)
    return [deepsplit(fragment, *subsep) for fragment in s.split(sep)]

if __name__ == '__main__':
    import doctest
    doctest.testmod()

Erik Wilsher 18 years, 2 months ago

Mistake in comment above. Slight mistake in the comment above. The first separator arg is not mandatory, but has a default value.

bearophile - (author) 18 years, 1 month ago

Thank you. Your solution is much better than mine. I don't know if this deserves to become a standard string method. It's slick and elegant, and I use it now and then, but I don't know how often other people would use something like this.

Created by bearophile - on Wed, 5 Oct 2005 (PSF)