Break all of time up into "slices" in order to categorize events.

import weblog.combined, sys, time, math

def getTimeslice(period, utime):
    # Round utime down to the nearest multiple of period; that multiple
    # identifies the slice the event belongs to.
    base = int(math.floor(utime))
    return base - base % period

def main(files):
    # mktime() takes a 9-item local-time tuple; the window is 09:00-10:00.
    START = time.mktime((2001,11,12,9,0,0,0,0,0))
    END   = time.mktime((2001,11,12,10,0,0,0,0,0))
    t = 0
    slices = {}
    for filename in files:
        print(filename)
        log = weblog.combined.Parser(open(filename))
        count = 0  # hits from this file that fall inside the window
        while log.getlogent():
            if log.utime < START or log.utime > END:
                continue
            # Key each hit by the start of the minute it falls in.
            ts = getTimeslice(60, log.utime)
            slices[ts] = slices.get(ts, 0) + 1
            count = count + 1
        print(count)
        t = t + count

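    # Mean hits per slice that saw at least one hit. Every counted hit
    # went into exactly one slice, so t is also the sum over all slices.
    avg = None
    if slices:
        avg = float(t) / len(slices)
    peak = 0
    peak_ts = 0
    for ts in slices.keys():
        if slices[ts] > peak:
            peak = slices[ts]
            peak_ts = ts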
        
    print("Total: %s" % t)
    print("Average: %s" % avg)
    print("Peak: %s (at %s seconds)" % (peak, peak_ts))

if __name__ == '__main__':
    files = sys.argv[1:]
    main(files)

When analyzing some kinds of logs, such as web server logs, you often want to attribute "hits" to "time buckets" in order to answer questions like "What is the busiest hour of the day for my website?"

The script above uses the "weblog" web log analysis framework by Mark Nottingham, available from http://www.mnot.net/scripting/python/WebLog/ (it seems to be usable only with Python 1.5, due to backwards incompatibilities with Python 2.1). It analyzes a set of Apache web server access logs for a given time period and outputs the total number of "hits" as well as the peak and average number of hits per minute. The main() function builds on the weblog framework, calling getTimeslice() to obtain an integer that identifies a unique 60-second period within the log period: the entry's timestamp rounded down to a multiple of 60. main() then uses this timeslice as a key in a dictionary that maps each timeslice to its number of hits, which lets the script report the "peak" 60-second period.
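
The bucketing step itself does not depend on the weblog framework. Here is a minimal standalone sketch, assuming the events are available as a plain list of timestamps in seconds. The sample values and the count_by_slice() helper are made up for illustration, and get_timeslice() is a closed-form equivalent of getTimeslice(), not the recipe's own code.

```python
import math

def get_timeslice(period, utime):
    """Return the start of the period-second slice that contains utime."""
    base = int(math.floor(utime))
    return base - base % period

def count_by_slice(period, timestamps):
    """Map each slice start to the number of timestamps that fall in it."""
    counts = {}
    for utime in timestamps:
        ts = get_timeslice(period, utime)
        counts[ts] = counts.get(ts, 0) + 1
    return counts

# Hypothetical hit times, in seconds from an arbitrary starting point.
hits = [0, 1.5, 59, 60, 125]
counts = count_by_slice(60, hits)

# The busiest slice is the key with the largest count.
peak_ts = max(counts, key=counts.get)
```

Changing the period argument from 60 to 3600 would give hourly buckets instead of per-minute ones.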

I've also successfully used this strategy for things like opportunistic garbage collection, where it's useful to place collections of items into "buckets", each represented by a timeslice, and to dump a bucket's contents only once that bucket has expired.
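
The recipe does not show how that garbage collection was implemented, so the following is only one possible sketch of the idea. The ExpiringBuckets class, its method names, and the max_age parameter are all hypothetical. Items are grouped by the timeslice in which they were added, and collect() drops whole buckets at once whenever the caller finds it convenient to run it.

```python
import math

def get_timeslice(period, utime):
    """Return the start of the period-second slice that contains utime."""
    base = int(math.floor(utime))
    return base - base % period

class ExpiringBuckets:
    """Hypothetical container that groups items by timeslice."""

    def __init__(self, period, max_age):
        self.period = period    # bucket width in seconds
        self.max_age = max_age  # how long a finished bucket is kept, in seconds
        self.buckets = {}       # slice start -> list of items

    def add(self, item, now):
        """File item under the slice that contains the time now."""
        ts = get_timeslice(self.period, now)
        self.buckets.setdefault(ts, []).append(item)

    def collect(self, now):
        """Drop every bucket whose slice ended at least max_age seconds ago.

        Returns the dropped items, oldest bucket first.
        """
        cutoff = now - self.max_age
        expired = [ts for ts in self.buckets if ts + self.period <= cutoff]
        dropped = []
        for ts in sorted(expired):
            dropped.extend(self.buckets.pop(ts))
        return dropped

# Example with made-up times: two items land in the first minute, one in the second.
store = ExpiringBuckets(period=60, max_age=120)
store.add("a", now=0)
store.add("b", now=30)
store.add("c", now=90)
```

Because expiry is decided per bucket rather than per item, collect() only has to look at one key per timeslice, however many items each bucket holds.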