Welcome, guest | Sign In | My Account | Store | Cart

Below you will find a simple python class that I wrote to test performance of various different sorting techniques. In this cas: list, heapq, and bisect. For printing out data, I make use of the very cool decimal module to limit errant floating point numbers with many digits. It helps me decide when I want to think about different types of sorts.

Python's list.sort is so good that generally you are not going to want to write your own sort, and instead use it and some of the new features recently added to it. However, with a list.sort you pay for sorting at retrieval time. A different type of sort using data structures to sort at creation time instead of retrieval can offer some performance characteristics, that may make one consider them instead.

Python, 99 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
import timeit,heapq,bisect,random,decimal
   
class actions:
    def __init__(self,data):
        self.data=data
        self.data_types=('heapq','bisect','list')
    def create(self,the_type):
        new_data=[]
        if the_type=='heapq':
            for i in Data: heapq.heappush(new_data,i)
            print len(new_data)
        elif the_type=='list':
            for i in Data: new_data.append(i)
        elif the_type=='bisect':
            for i in Data: bisect.insort(new_data,i)
        self.new_data=new_data
    def pop1(self,the_type):
        if the_type=='heapq':
            heapq.heappop(self.new_data)
        elif the_type=='list':
            self.new_data.sort(reverse=True)
            self.new_data.pop()
        elif the_type=='bisect':
            self.new_data.pop(0)
    def pop2(self,the_type):
        if the_type=='heapq':
            heapq.heappush(self.new_data,0)
            heapq.heappop(self.new_data)
        elif the_type=='list':
            self.new_data.append(0)
            self.new_data.sort()
            self.new_data.pop()
        elif the_type=='bisect':
            bisect.insort(self.new_data,0)
            self.new_data.pop(0)
    def show500(self,the_type):
        if the_type=='heapq':
            heapq.nlargest(500,self.new_data)
        elif the_type=='list' or 'bisect':      
                self.new_data[-500:]
#do rounding here w/decimal library to limit presence
#off odd floating point numbers
def rnd(flt):
    flt=round(flt,4)
    thread_context = decimal.getcontext()
    thread_context.prec=5
    dec_num=thread_context.create_decimal(str((flt)))
    return str(dec_num)


Results={}
Actions=['create','pop1','pop2','show500']
for num in (10000,100000,200000,1000000,):
    print 'at',num
    Data=[ random.randrange(1000000) for i in range(num) ]
    action_obj=actions(Data)
    for data_type in action_obj.data_types:
        Results[data_type]={}
        for action in Actions:
            specific_action=getattr(action_obj,action)
            begin=time.clock()
            specific_action(data_type)
            end=time.clock()-begin
            Results[data_type][action]=rnd(end)
    print '\t',
    for action in Actions: print action,'\t',
    print ''
    ref=0
    for data_type in Results:
        print data_type,'\t',
        for action in Actions:
            print Results[data_type][action],'\t',
        print ''
#######Results
at 10000
        create  pop1    pop2    show500
bisect  0.0782  0.0001  0.0001  0.0001
heapq   0.0101  0.0     0.0     0.0069
list    0.0061  0.0119  0.0013  0.0001
at 100000
        create  pop1    pop2    show500
bisect  10.994  0.0008  0.0013  0.0001
heapq   0.1156  0.0     0.0     0.0318
list    0.0968  0.1908  0.0577  0.0001
at 200000
        create  pop1    pop2    show500
bisect  78.837  0.0017  0.0046  0.0001
heapq   0.245   0.0     0.0     0.06
list    0.2039  0.4406  0.1402  0.0001
at 300000
        create  pop1    pop2    show500
bisect  201.39  0.0027  0.0065  0.0002
heapq   0.3438  0.0     0.0     0.0684
list    0.2888  0.7267  0.214   0.0001

at 1000000 #(skipping bisect)
        create  pop1    pop2    show500
heapq   1.4689  0.0004  0.0     0.1984
list    0.8342  3.1588  0.8904  0.0002

Sometimes something that has poor performance doesn't matter if you gain clarity or ease of use instead. For example, joining strings with 'a'+'b'+'c' can be slower than ''.join('a','b','c'). Unless the strings are big enough and you do it often enough, I wouldn't recommend ''.join. The clarity of 'a'+'b'+'c' is worthwhile.

Using heapq or bisect versus list.sort probably sit in the same boat as ''.join versus using +. Howeever, when you need heapq or bisect you''ll be happy to have it.

Heapq and bisect both pay for sorting at creation time instead of retrieval time. Interestingly, Heapq is to iterators what bisect is it sequences. With an iterator you only get the next item, cannot go backwards or retreive sequences of items. Heapq similarly only allows you to pop off of the data structure to get the lowest item. Pure heap operations do not get you sequences or allow you go go back and forth through the data(The special functions heapq.nlargest and nsmallest help limit this rigidity of a heap).

With Bisect the data is fully sorted, so you have full sequence power available to you. Bisect's major weakness is creation time easily reaching minutes while list.sort and heap are in tenths of a second.

Heapq obvious win is having extremely fast access to the lowest item, despite adding new data. It's creation time is also very good, almost as good as the list. It's one weakness looking at sequences of data (getting 500 largest items). Even then, heap doesn't perform badly with a million item list.

Having to sort after adding data is list.sort's weakness. Even at 100,000 items, sort is fast, especially with presorted data (look at pop2), so it may not be an issue.

You can reach my at pyguy2 on yahoo.

2 comments

Ian Bicking 19 years, 6 months ago  # | flag

printing floats. You can also print errant floats like '%.4f' % aFloat. There's a bunch of other ways to control float printing using %f, like '%10.4f' means left-pad (right justify) it out to ten characters (including .####), using spaces. Note the 10 applies to the entire length, including the decimal part and the -. '%010.4' means left-pad it with zeros. I think tehre's some other options as well, e.g., '%-10.4' will right-pad (left justify).

John Nielsen (author) 19 years, 6 months ago  # | flag

good idea. I've used them myself before and forgot about that. Thanks for the idea!

Created by John Nielsen on Fri, 24 Sep 2004 (PSF)
Python recipes (4591)
John Nielsen's recipes (36)

Required Modules

  • (none specified)

Other Information and Tasks