ActiveState Code

Recipe 440663: Hierarchical Split


Splits a string repeatedly, level by level, returning nested lists.

Python
def nestedSplit(astring, sep=None, *subsep):
    r"""nestedSplit(astring, sep=None, *subsep): given astring and one or more
    separators, split astring hierarchically. The first separator is the
    highest-level one; sep=None splits on runs of whitespace, as str.split does.
    Ex.: nestedSplit("a b\nc d", "\n", " ") => [['a', 'b'], ['c', 'd']] """
    if subsep:
        return [nestedSplit(fragment, *subsep) for fragment in astring.split(sep)]
    return astring.split(sep)


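if __name__ == '__main__':
    st = "a b\nc d"
    print(st)
    print(nestedSplit(st, "\n", " "))
    print()

    tetris = """\
    ....
    .##.
    .##.
    ....

    ####
    ####
    ..##
    ..##"""

    from textwrap import dedent
    # dedent() strips the common leading indentation from the literal
    tetris = dedent(tetris)
    print(tetris)
    # Blank lines separate the pieces; single newlines separate the rows
    print(nestedSplit(tetris, "\n\n", "\n"))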

Discussion

This is only a very basic implementation; it can be extended or improved in many ways:

  • removing the recursion to speed it up a little;
  • removing empty lines, filtering out unwanted items, etc.;
  • mapping a given function over the leaves of this tree of lists;
  • supporting alternative separators, so that a sequence of alternative splitting strings can be given in place of a single string, for example:

s = "a b\nc,d"
multiSplit(s, "\n", [" ", ","]) ==> [['a', 'b'], ['c', 'd']]
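
The ideas in the list above (alternative separators, filtering out empty pieces, mapping a function over the leaves) might be sketched as follows. This is a minimal Python 3 sketch under stated assumptions, not the author's implementation: the name `multi_split`, the keyword arguments `leaf` and `drop_empty`, and the use of `re.split` to handle alternative separators are all choices made here for illustration.

```python
import re


def multi_split(text, *seps, leaf=None, drop_empty=False):
    """Hypothetical extension of nestedSplit.

    Each entry of seps is None (split on whitespace runs), a single
    separator string, or a sequence of alternative separator strings.
    leaf, if given, is applied to every innermost piece.
    drop_empty removes empty pieces at every level.
    """
    if not seps:
        # No separators left: text is a leaf of the tree.
        return leaf(text) if leaf else text

    sep, rest = seps[0], seps[1:]
    if sep is None:
        parts = text.split()
    else:
        alternatives = [sep] if isinstance(sep, str) else list(sep)
        # Try longer alternatives first so "\n\n" wins over "\n".
        alternatives.sort(key=len, reverse=True)
        pattern = "|".join(re.escape(a) for a in alternatives)
        parts = re.split(pattern, text)

    if drop_empty:
        parts = [p for p in parts if p]
    return [multi_split(p, *rest, leaf=leaf, drop_empty=drop_empty)
            for p in parts]


if __name__ == "__main__":
    print(multi_split("a b\nc,d", "\n", [" ", ","]))
    print(multi_split("1 2\n3 4", "\n", " ", leaf=int))
    print(multi_split("a  b\n\nc", "\n", " ", drop_empty=True))
```

With these definitions the three calls print `[['a', 'b'], ['c', 'd']]`, `[[1, 2], [3, 4]]` and `[['a', 'b'], ['c']]`. Empty separator strings are not handled here; a fuller version would probably reject them, as `str.split` does.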

V. 1.1: fixed a silly bug (originally the parameters were inverted). Thank you, Shane Holloway; next time I'll test things more.

V. 1.2: replaced a non-English word in the description ("gerarchically") with "hierarchically".

V. 1.3: changed the name and improved the function on the basis of Erik Wilsher's code (though nestedSplit does not use the `not`).

Comments

  1. At 10:51 a.m. on 6 Oct 2005, Shane Holloway said:

    splits[1:] vs splits[-1:].

    def multiSplit(astring, *splits):
        r"""multiSplit(astring, *splits): given astring, and one or more split strings, it
        splits astring gerarchically. The first split key is the higher level one. Ex.:
        s = "a b\nc d"
        multiSplit(s, "\n", " ") => [['a', 'b'], ['c', 'd']]  """
        if len(splits) <= 1:
            return astring.split(*splits)
        else:
            sp = splits[1:]
            return [multiSplit(st, *sp) for st in astring.split(splits[0])]
    

    Notice the splits[1:] instead of splits[-1:] -- in case you want to split on more than three parameters.

    Now you can correctly split:

    tetris = """\
    . . . .
    . # # .
    . # # .
    . . . .
    
    # # # #
    # # # #
    . . # #
    . . # #"""
    
    from textwrap import dedent
    tetris = dedent(tetris)
    print(tetris)
    print(multiSplit(tetris, "\n\n", "\n", " "))
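
The difference between the two slices only shows up with three or more separators. The sketch below is an illustration added here, not part of the comment: the helper `split_with` and the two slice functions are hypothetical names, used only to run the same recursion with each slicing rule side by side.

```python
def split_with(astring, splits, rest):
    """Nested split; rest() chooses the separators handed to the next level."""
    if len(splits) <= 1:
        return astring.split(*splits)
    return [split_with(part, rest(splits), rest)
            for part in astring.split(splits[0])]


def all_but_first(splits):
    # Corrected behaviour: drop only the separator that was just used.
    return splits[1:]


def only_last(splits):
    # Earlier behaviour: every middle separator is silently skipped.
    return splits[-1:]


text = "a b,c d\ne f,g h"
seps = ("\n", ",", " ")

correct = split_with(text, seps, all_but_first)
buggy = split_with(text, seps, only_last)

if __name__ == "__main__":
    print(correct)
    print(buggy)
```

Here `correct` is `[[['a', 'b'], ['c', 'd']], [['e', 'f'], ['g', 'h']]]`, while `buggy` is `[['a', 'b,c', 'd'], ['e', 'f,g', 'h']]` because the comma level is never applied. With only two separators both slices give the same result.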
    
  2. At 10:36 a.m. on 12 Oct 2005, Martin Miller said:

    gerarchically?

    Took me a moment, but I think the term you mean in the description is "hierarchically", not "gerarchically". The latter isn't a word in the English language.

    It's not a coding issue, but fixing it might help people understand and find the recipe...

  3. At 12:45 p.m. on 23 Jan 2006, Erik Wilsher said:

    You could rearrange the argument list to have a mandatory first separator.

    If you have a mandatory first separator argument, you can better emulate the behaviour of string.split. In addition, you avoid slicing the separator argument.

    def deepsplit(s, sep=None, *subsep):
        r"""deepsplit -- a nested string splitting function
        usage:
        >>> s = 'a b\nc d'
        >>> deepsplit(s)  #split on whitespace, flat
        ['a', 'b', 'c', 'd']
        >>> deepsplit(s, ' ')  #split on space, flat
        ['a', 'b\nc', 'd']
        >>> deepsplit(s, '\n', ' ') #split on newline, then space
        [['a', 'b'], ['c', 'd']]
        """
        if not subsep:
            return s.split(sep)
        return [deepsplit(fragment, *subsep) for fragment in s.split(sep)]
    
    if __name__ == '__main__':
        import doctest
        doctest.testmod()
    
  4. At 12:53 p.m. on 23 Jan 2006, Erik Wilsher said:

    Mistake in comment above.

    Slight mistake in the comment above. The first separator arg is not mandatory, but has a default value.

  5. At 4:55 a.m. on 19 Feb 2006, bearophile - (the author) said:

    Thank you.

    Your solution is much better than mine. I don't know if this deserves to become a standard string method. It's slick and elegant, and I use it now and then, but I don't know how often other people would use something like this.
