Cheap-date trick; a different way to parse

... a light meal with a heavy dose of "tutorial mash" on the side.

In the constructive spirit of "more ways to solve a problem"; this is a portion of my lateral, occasionally oblique, solutions. Nothing new in le régime de grande, but hopefully the conceptual essence will amuse.

This started as a response to recipe 577135, which parses incremental date fragments and preserves microseconds where available. That script does more work than this, for sure, but requires special flow-control and iterates a potentially encumbering shopping list (multi-dimensional, with some detail).

So here's a different box for others to play with. Upside-down in a sense, it doesn't hunt for anything but a numerical "pulse"; sequences of digits punctuated by other 'stuff' we don't much care about.

Missing a lot of things, intentionally, this snippet provides several examples demoin' flexibility. Easy to button-up, redecorate and extend later for show, till then the delightful commentary makes it hard enough to see bones already -- all six lines or so!

Note: The core script is repeated for illustrative purposes. The first is step-by-step, the second is lean and condensed for utilitarian purposes. It is the second, shorter, version that I yanked from a file and gussied up.

#!/usr/bin/env python
"""
2012-03-05, weeee!

This is a really simple script, the docs are WAY longer, that
dices a date-string returning a list of integers or a dict if
key-words are supplied.

... IT SLICES, IT DICES, IT HAS SHARP EDGES!
    ===============================================
    Not production-ready, 'nless you like to play with razors.
    There is no type-checking, no assertion for field-order etc.
    This simply, blindly and unintelligently guts the string.   
    If the order changes, it bites... you get the idea.

Some examples, more plus arg-defs b'low

    TEST_DATE = "2012-03-05 13:05:14.453728"
    
    # return list of int's in original order
    cheap_date(TEST_DATE)
    [2012, 3, 5, 13, 5, 14, 453728]
    
    ISO_KEYS = ['t_year','t_mon', ... ,'t_sec','t_usec']

    # return same list, mapped into a dict
    cheap_date(TEST_DATE, ISO_KEYS)
    {'t_mon': 3, 't_min': 5, 't_sec': 14, 't_hour': 13,
              't_day': 5, 't_year': 2012, 't_usec': 453728}
    
    # Keep the decimal t'gether using non-default regex
    # Note: list is str's, int("12.34") razors a ValueError
    cheap_date(TEST_DATE, [], DIG_N_DEC, str)
    ['2012', '03', '05', '13', '05', '14.453728']
    
    # dict's and format strings, naturally sweeeeet
    FMT_STR % cheap_date(TEST_DATE, ISO_KEYS, DIG_N_DEC, PAT)
    2012-03-05T13:05:14.454
    
    FMT_STR2 % cheap_date(TEST_DATE, ISO_KEYS, val_conv = str)
    13:05-03/05/2012

"""

import re

def cheap_date(dt_str, kw_list = [], reg_xp = r'\D', val_conv = int):
    """ Cheap incremental date parser preserving ISO microseconds

    dt_str:     String representing source date
           pass "2012-03-05 13:05:14.453728"
        returns [2012, 3, 5, 13, 5, 14, 453728]

      optional- ['2012', '03', '05', '13', '05', '14.453728']
        returns {'tm_year': 2012, 'tm_mday': 5, 'tm_mon': 3 ... }

    Optional arguments:

    kw_list:    Ordered list of return-dictionary keys
    reg_xp:     Regular expression used to split the string
    val_conv:   Function applied to each value for conversion, e.g. str --> int

    >>> cheap_date(TEST_DATE)
    [2012, 3, 5, 13, 5, 14, 453728]
    >>> cheap_date(TEST_DATE, ISO_KEYS[:3]) == {'t_year': 2012, 't_mon': 3, 't_day': 5}
    True
    >>> FMT_STR2 % cheap_date(TEST_DATE, ISO_KEYS, val_conv = str)
    '13:05-03/05/2012'
    """
    # shake the numbers out with re ['2012', '03'...]
    tm_list = re.split(reg_xp, dt_str)
    
    # juice 'em: apply function to each list-value [2012, 3...]
    # you _could_ test if val_conv == str and omit this step
    # list() keeps the result a real list where map() is lazy (Python 3)
    tm_list = list(map(val_conv, tm_list))

    # Existence of this list, enables return of a dictionary
    if kw_list:
        # fabricate list of key-value pairs [['yr',2012],[....],]
        tm_list = zip(kw_list, tm_list)
        
        # map the key-val pairs into a dict to be proud of
        tm_list = dict(tm_list)
    
    return tm_list


def cheaper_date(dt_str, kw_list = [], reg_xp = r'\D', val_conv = int):
    """ Cheaper date parser, with a few less teeth

    >>> FMT_STR2 % cheaper_date(TEST_DATE, ISO_KEYS, val_conv = str)
    '13:05-03/05/2012'
    """
    # The functionality above, tucked in a thin blankie.
    try:
        # list() so a lazy map() still raises inside this try block
        tm_list = list(map(val_conv, re.split(reg_xp, dt_str)))
    except ValueError as e:
        print("Conversion proc (int?) spewed a matched value")
        print(e)
        raise
        
    if kw_list:
        tm_list = dict(zip(kw_list, tm_list))

    return tm_list

if __name__ == '__main__':
    import doctest

    # Some Q&D convenience, ta get 'r done.
    TEST_DATE = "2012-03-05 13:05:14.453728"

    # Keys match  number-seq. order of date to parse
    ISO_KEYS = ['t_year','t_mon','t_day','t_hour','t_min','t_sec','t_usec']

    # Slow lrner moi! Ages till I grep'd the non-obv. & betwix da lines.
    # A.K.A: "To select or not select? TITQ!" I mean "\d" to "\D" 

    # The following exp. splits, and discards, NON-number sequences.
    DIGITS_ONLY = r'\D' # DEFAULT, digits only: 12.56-> ['12','56']

    # \d is inverse|not \D, [^....] inverse|not's the match 
    DIG_N_DEC = r'[^\d\.]' # retain decimal no's 34.78-> ['34.78',]
    
    # With a dict[ionary] and format strings, it happens eh?
    FMT_STR = "T".join(["%(t_year)d-%(t_mon)02d-%(t_day)02d",
                                "%(t_hour)02d:%(t_min)02d:%(t_sec)0.3f"])
    # Same info, just shuffled for my simple-minded amusement
    FMT_STR2 = "%(t_hour)s:%(t_min)s-%(t_mon)s/%(t_day)s/%(t_year)s"
    
    # Pick-A-Type... for demo. It's stupid, assumes a string of digits only.
    # There are safer/elegantisher ways to do this... more calories though.
    def PAT(s): 
        try:
            return int(s)
        except ValueError:
            return float(s)

    doctest.testmod()

      

Sadly, the working bits are only a few lines; it took obscenely longer to carve them from the lib., document them, then add my 'helpful' commentary. But why stop there? There's room here too!

Joking aside, the following is intended for beginners trying to understand what's going on here, and a bit of elsewhere too.

Date and time strings are encoded-sequences of characters with implied labels. As long as we infer the positional labels properly, all is golden.

The prime example for this discussion is this puppy:

2012-03-01T13:00:00

Large, coarse values to small, progressing from most significant on the left to the least on the right. This clear date-structure is predictable and adaptable.

It is this way for good reason, as ambiguity is expensive, but it is also part of the ISO 8601 spec. It can maintain integrity even while losing precision, meaning '2012-03' is still valid, interpretable and meets the standard.

A common approach is to go hunting with pre-defined patterns, seeking a match within the target data. The following pattern precisely interprets an isolated sequence for specific elements of date and time.

"%Y-%m-%d %H:%M:%S"

Stalking "conventional" patterns within character sequences is effective as long as the data is consistent and clean. Python can fish wee date sequences from oceans of non-sequences. Using date & time libraries is common and generally practical; however, it is the space between the raw data and the libraries that gets interesting.
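
For comparison, here is a minimal sketch of the pattern-hunting approach using the standard library's datetime.strptime. It assumes Python 3; the sample strings and variable names are mine, not part of the recipe.

```python
from datetime import datetime

# The pattern must describe the string exactly, field by field.
PATTERN = "%Y-%m-%d %H:%M:%S"

stamp = datetime.strptime("2012-03-01 13:00:00", PATTERN)
print(stamp.year, stamp.month, stamp.day, stamp.hour)

# A shorter fragment no longer fits the pattern and raises ValueError.
try:
    datetime.strptime("2012-03", PATTERN)
    matched = True
except ValueError:
    matched = False
print(matched)
```

One pattern, one shape of string: anything shorter or longer needs its own pattern.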

But we don't need to hunt, sometimes a simple shake will relieve branches of their prize(s). With ISO format, and others, we know exactly which patterns are where within the sequence.

The relevant date-portion of the sequence is 4-1-2-1-2 when viewing the digit to delimiter relationships. With a teeny bit of work, python's string manipulation makes short work of this predictable format. All fine till an exception rolls along.

The old newspaper joke "Editorial is useful to space the advertising" is essentially true here too. It's all about the digits, the other stuff is just filler.

It's the exceptions that trip us, alterations that frustrate us and unexpected change that often 'toasts' us.

Recipe 577135 solves the ISO incremental representation problems. It can extract month-resolution as easily as "by the second" date-time sequences. It employs a table of increasing sequences to extract what it can.

formats = ["YYYY", "YYYY-MM", "YYYY-MM-DD", ...]

The concept is to begin from either end and apply each format to the target until reaching the success vs. failure threshold. The largest successful match is usually the keeper. From one small script, extract '2012-03' and '2012-03-05T12:02', etc., nice.
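
A rough sketch of that table-driven idea follows. This is not the code from recipe 577135: the FORMATS table and the parse_incremental helper are hypothetical stand-ins of mine, written with datetime.strptime patterns and assuming Python 3. It walks from the most detailed pattern to the coarsest and keeps the first one that fits.

```python
from datetime import datetime

# Hypothetical table, most detailed first; NOT the list from recipe 577135.
FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M",
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
]

def parse_incremental(text):
    """Return (datetime, format) for the first pattern that fits text."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt), fmt
        except ValueError:
            continue  # this pattern does not fit, try the next one
    raise ValueError("no known format matches %r" % text)

month_only = parse_incremental("2012-03")
to_minute = parse_incremental("2012-03-05T12:02")
print(month_only)
print(to_minute)
```

Note the flow-control: a loop, a try/except and a table to maintain. That is the "shopping list" this recipe sidesteps.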

For the "coding rodeos" I've been in, date-strings haven't been so pretty, especially those with a mix of important text and digits. In fact, they often look about as good as real rodeo arenas and pens or stables after a weekend stampede!

For the above recipe, the 're' library and its 'split' function made shredding date-strings trivial. Following up with its documentation, you'll discover it is a steroid-driven, monster-version of Python's str.split().

Without delving deeply into regular expressions, the difference is how divisions are specified. With Python's str.split(), a single literal delimiter can be applied to a string per call. Subsequent splits require another specification and iteration across the previous segments.

With 're' and 'grep' pattern specifications, a vast array of factors can be considered before a line is cleaved. In the case of this script, though, it isn't so high-brow.

The default pattern matches anything that is not a digit, divides the string at that point, then continues onto the next. This means the hyphens, spaces and colons are all recognized delimiters enabling the following.

re.split(r"\D", "2012-03-05")
        returns --> ['2012', '03', '05']
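
To make the str.split versus re.split difference concrete, here is a small side-by-side sketch, assuming Python 3. The sample string and variable names are mine.

```python
import re

stamp = "2012-03-05 13:05:14"

# str.split takes one literal delimiter per call, so every extra
# delimiter means another pass across the previous segments.
pieces = [stamp]
for delim in (" ", "-", ":"):
    pieces = [part for piece in pieces for part in piece.split(delim)]

# re.split treats every non-digit as a delimiter in a single call.
shaken = re.split(r"\D", stamp)

print(pieces)
print(shaken)
```

Both roads lead to the same list of digit strings; the regex version just doesn't need to be told which delimiters to expect.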

Using python's slice syntax, the string is blindly carvable. Fragile if any of the positions shift due to extra characters, etc.

t_year = date_var[:4]
t_mon  = date_var[5:7]
t_day  = date_var[-2:]
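
A quick sketch of how the slices bite when positions shift, assuming Python 3. The by_slice and by_split helpers are names of mine, just for the demonstration.

```python
import re

def by_slice(date_var):
    """Carve by fixed positions, as in the three slices above."""
    return [date_var[:4], date_var[5:7], date_var[-2:]]

def by_split(date_var):
    """Shake the digits loose wherever the delimiters happen to be."""
    return re.split(r"\D", date_var)

padded = "2012-03-05"
unpadded = "2012-3-5"   # same date, no zero padding

print(by_slice(padded), by_split(padded))
print(by_slice(unpadded), by_split(unpadded))
```

With the unpadded string the slices drag delimiters along ('3-' and '-5'), while the split still hands back clean digits.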

As well, the following non-ISO dates are easily parsable based upon this simple approach. Modifications are needed for some things, but overall it's adaptable. Like above, this is accomplished with only one 'split' instruction; Python slicing, etc., would require more.

2/6/91          --> ['2', '6', '91']
19880601-120231 --> ['19880601', '120231']
23.01.09        --> ['23', '01', '09']
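
The compact middle example is the one needing "some things". A possible sketch, assuming Python 3 and assuming the layout is YYYYMMDD-HHMMSS (my reading of the sample, not something the data states): split once, then cut each half into fixed widths. The chop helper is hypothetical.

```python
import re

def chop(text, widths):
    """Cut text into consecutive fixed-width pieces."""
    out, pos = [], 0
    for width in widths:
        out.append(text[pos:pos + width])
        pos += width
    return out

# One split separates the date run from the time run...
date_part, time_part = re.split(r"\D", "19880601-120231")

# ...then fixed widths recover the individual fields.
fields = chop(date_part, (4, 2, 2)) + chop(time_part, (2, 2, 2))
print(fields)
```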

Note: To the specific matching pattern "it all tastes like chicken" when it comes to punctuation, and any other non-digit characters.

Given some insight into significance of sequential order, the above is easily mapped into a dictionary, other formats or into a value for computational purposes.

The observant may have noticed the results are lists of strings. '01' is not equal to 1, but int('01') is. The map function, using int by default, visits each element of the list and attempts to replace the strings with integers. There are circumstances, if the expression is altered, in which a ValueError can occur.

map(int, ['2012', '03', '05']) --> [2012, 3, 5]
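
A sketch of where those ValueErrors come from, assuming Python 3 (where map is lazy, hence the list() wrappers). The variable names are mine. Two cases are shown: a pattern that keeps the decimal point, and two delimiters in a row.

```python
import re

stamp = "2012-03-05 13:05:14.453728"

# Default pattern: every piece is pure digits, so int() is safe.
as_ints = list(map(int, re.split(r"\D", stamp)))

# Keeping the decimal point leaves '14.453728', which int() rejects.
with_decimal = re.split(r"[^\d.]", stamp)
try:
    list(map(int, with_decimal))
    converted = True
except ValueError:
    converted = False

# Two delimiters in a row leave an empty string, which int() also
# rejects; r"\D+" swallows the whole run of non-digits instead.
gappy = re.split(r"\D", "2012-03-05, 13:05")
tidy = re.split(r"\D+", "2012-03-05, 13:05")

print(as_ints, converted)
print(gappy, tidy)
```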

Note: This 3-element integer list can be extended for use with date & time libraries:

date_list = list(map(int, ['2012', '03', '05'])) # convert to integers
    --> [2012, 3, 5]
date_list.extend([0,0,0,0,0,0])
    --> [2012, 3, 5, 0, 0, 0, 0, 0, 0]
time.mktime(tuple(date_list))
    --> 1330930800.0   (the exact value depends on the local time zone)
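
Another route into the libraries, sketched under the assumption of Python 3: since the integers already arrive in year, month, day, hour... order, they can be unpacked straight into datetime.datetime. The to_datetime helper is a name of mine, not part of the recipe.

```python
import re
from datetime import datetime

def to_datetime(dt_str):
    """Split on non-digits, convert to int, unpack into datetime()."""
    fields = [int(piece) for piece in re.split(r"\D", dt_str)]
    return datetime(*fields)

stamp = to_datetime("2012-03-05 13:05:14.453728")
date_only = to_datetime("2012-03-05")
print(stamp.isoformat())
print(date_only.isoformat())
```

Sharp edges remain: datetime() wants at least year, month and day, and a fraction with fewer than six digits (say "14.5") would be read as 5 microseconds rather than half a second.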

The next pertinent consideration is the significance of sequential order. In the main example it's obviously year, month, day, etc. However, with others it isn't so obvious. The first non-ISO example above is ambiguous, the second is clear but requires some work, and the third is clear.

Python's zip function, not to be confused with the compression application, marries two separate lists into a fresh list of key-value pairs. Converting the resultant ordered pairs into a dict keys the values.

zip([1,2,3],['a','b','c']) --> [(1,'a'),(2,'b'),(3,'c')]
dict([(1,'a'),(2,'b'),(3,'c')]) --> {1:'a',2:'b',3:'c'}


2/6/91:  dict(zip(['mth','day','year'],['2','6','91']))
        --> {'mth': '2','day': '6','year': '91'}
23.01.09:  dict(zip(['day','mth','year'],['23', '01', '09']))
        --> {'day': '23','mth': '01','year': '09'}
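
The same pairing as a runnable sketch, assuming Python 3 (where zip is lazy, so list() is used to peek at the pairs). The us_style and eu_style names are mine; they show how the ambiguous 2/6/91 reads two ways depending purely on key order.

```python
import re

pieces = re.split(r"\D", "2/6/91")

# zip is lazy in Python 3; list() shows the pairs it would produce.
pairs = list(zip(["mth", "day", "year"], pieces))

# Same digits, two readings: the key order carries all the meaning.
us_style = dict(zip(["mth", "day", "year"], pieces))
eu_style = dict(zip(["day", "mth", "year"], pieces))

print(pairs)
print(us_style)
print(eu_style)
```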

Bonus points!

Mis-matched lengths can cause range, key and value errors when sequences are manipulated outside their bounds. The zip function, however, gracefully stops iterating as soon as the shorter list is exhausted.

This cleanly accommodates the variable sized ISO dates allowing the year-month-day dates to function as easily as those including the microseconds portion. This also means, if you only require the first three fields in a dictionary, passing only those three keys automatically truncates the remaining values. This will not work out of sequence, only left-to-right at this time.
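
A short sketch of that truncation at work, assuming Python 3. The to_dict helper is a condensed stand-in of mine for the recipe's function, reusing the ISO_KEYS list from the script.

```python
import re

ISO_KEYS = ['t_year', 't_mon', 't_day', 't_hour', 't_min', 't_sec', 't_usec']

def to_dict(dt_str, keys=ISO_KEYS):
    """Pair keys with values; zip stops at whichever runs out first."""
    values = map(int, re.split(r"\D", dt_str))
    return dict(zip(keys, values))

# Fewer values than keys: the unused keys are simply left out.
short = to_dict("2012-03")

# As many values as keys: everything lands.
full = to_dict("2012-03-05 13:05:14.453728")

# Fewer keys than values: the trailing values are dropped.
first_three = to_dict("2012-03-05 13:05:14.453728", ISO_KEYS[:3])

print(short)
print(full)
print(first_three)
```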

Happy trails!

Tags: cheap, date, format, grep, parse, regex, sharp


Cheap-date trick; a different way to parse (Python recipe) by Scott S-Allen
ActiveState Code (http://code.activestate.com/recipes/578064/)
