sorting -- old to new python 2.4 style (heapq,bisect,list.sort keywords,itemgetter) « Python recipes

Python 2.4 has added a new feature to handle the problem where you needed to do a sort based off of part of the data. In effect, it has simplified the Schwartzian Transform (which I learned many years ago from an perl article written by Randall Schwartz). The following code will start with the older style python sorting approaches, and then show the bisect and heapq modules, and finally the key,cmp,reverse methods newly added to sort. The sort method of the list is an in-place sort. There is also a new sorted() function which returns a copy of the list sorted for you.

      #return data which has groups of a name following by 2 scores.
def get_data():
    return [('fred',10,1),('bill',3,0),('betty',5,8),('jim',0,4)]

####basic older sorting approaches

#By default list's sort method will first sort the names, then the numbers
a=get_data()
a.sort()
print a
>>>[('betty', 5, 8), ('bill', 3, 0), ('fred', 10, 1), ('jim', 0, 4)]


#Now use a special function to sort by only the 1st score. This happens to be
element 1 of the 3 member tuple, so we reference elements a[1] and b[1] Notice
that betty is not first now, jim who has the lowest 1st score is now first
def cmp1(a,b):
    if a[1]<b[1]: return -1
    if a[1]==b[1]: return 0
    if a[1]>b[1]: return 1

#Even better, you can use of python's builtin cmp
def cmp1(a,b): return cmp(a[1],b[1])
    
a=get_data()
a.sort(cmp1)
print a
>>>[('jim', 0, 4), ('bill', 3, 0), ('betty', 5, 8), ('fred', 10, 1)]

#If you know you are only going to be sorting numbers, you can make the special 
function small enough so using an anonymous function lambda makes sense
a=get_data()
a.sort(lambda a, b: a[1]-b[1])
print a
>>>[('jim', 0, 4), ('bill', 3, 0), ('betty', 5, 8), ('fred', 10, 1)]


#Transform approach which does the same as above, except has the advantage of
avoiding the special function to which speeds up list.sort()
In this case we build a list whose 1st element is the part of the
data structure we want to be sorted.
a=get_data()
b=[]
for i in a:
    b.append((i[1],i))
b.sort()    
a=[ j[1] for j in b ]
print a
>>>[('jim', 0, 4), ('bill', 3, 0), ('betty', 5, 8), ('fred', 10, 1)]

#####Using the datastructure itself to do the sorting
#####python 2.4 adds: heapq and bisect implemented in C

#The default sort with bisect is simple, once the data is inserted into the list
#it is already sorted. No need to do a list.sort()
import bisect
a=[]
for i in get_data(): bisect.insort(a,i)
print a 
>>>[('betty', 5, 8), ('bill', 3, 0), ('fred', 10, 1), ('jim', 0, 4)]

#If you want to sort based off of the 1st score at element one,
#we insert pairs of data, in which the element zero of the pair
#is the item we want the sort to be based off of and then convert
#the data back to the original
import bisect
a=[]
for i in get_data():
        bisect.insort(a,(i[1],i))
#map out the original data, now it starts with jim instead of betty
print [ j[1] for j in a ]
>>> [('jim', 0, 4), ('bill', 3, 0), ('betty', 5, 8), ('fred', 10, 1)]

#The heapq is similar but it only guarantees element 0 always has the smallest
#item in the list. It has less overhead than bisect, but to get a sorted result,
#you have to pop from the heap, since the data at higher indexes isn't fully #sorted
import heapq
heaped_list=[]
for i in get_data():
        heapq.heappush(heaped_list,(i[1],i))
while 1:
    try:
        print heapq.heappop(heaped_list)[1],
    except IndexError:
        break
>>> ('jim', 0, 4) ('bill', 3, 0) ('betty', 5, 8) ('fred', 10, 1)

#####python 2.4 sorting approaches. 
#adds: cmp,key, and reverse keywords to list.sort()
#adds: operator.itemgetter which helps limit use of functions

#reverse argument, will do highest to lowest, jim then fred now
a=get_data()
a.sort(reverse=True)
print a
>>>[('jim', 0, 4), ('fred', 10, 1), ('bill', 3, 0), ('betty', 5, 8)]

#keyword argument cmp for the comparison function as above
a=get_data()
a.sort(cmp=cmp1)
print a
>>>[('jim', 0, 4), ('bill', 3, 0), ('betty', 5, 8), ('fred', 10, 1)]

#The keyword argument key is an easy way to do the Transform. You simply provide
#a function which returns the part of the data structure that you want the
#sort to be based off of, instead of creating a list yourself to house it like
#we do above.

This enables a really easy way to do a numberical sort of strings. Note: I use the new python 2.4 sorted function:
>>> j=['100','11','17','1000']
>>> sorted(j)
['100', '1000', '11', '17']
>>> sorted(j,key=int)
['11', '17', '100', '1000']
>>> 

a=get_data()
#anonymous function lambda returns index one, which is a score
a.sort(key=lambda i:i[1]) 
print a
>>>[('jim', 0, 4), ('bill', 3, 0), ('betty', 5, 8), ('fred', 10, 1)]



#This can be further enhanced with a new addition to the operator
#module called itemgetter. It let's you avoid the function entirely to get an
#index (check out operator.attrgetter for attributes).
import operator
a=get_data()
a.sort(key=operator.itemgetter(1)) #get index 1 of the item for sorting
print a
>>>[('jim', 0, 4), ('bill', 3, 0), ('betty', 5, 8), ('fred', 10, 1)]


#You can also use the 2 keywords together. This would be useful
#if for example you want to sort based off of the SUM of the scores
#rather than an individual one

#In this function we add the 2 scores together to enable a sort based off
#of both values. Bill who has the lowest sum is now first, not jim
#who had the lowest 1st score.
def cmp2(a,b):
    sum1=a[0]+a[1]
    sum2=b[0]+b[1]
    return cmp(sum1,sum2)

#The anonymous function lambda returns everything but the 1st element
#to avoid the name. We only want scores for summing.
a=get_data()
a.sort(key=lambda i:i[1:],cmp=cmp2)

print a 
>>>[('bill', 3, 0), ('jim', 0, 4), ('fred', 10, 1), ('betty', 5, 8)]

      

Earlier versions of python had list.sort() which combined with the use of special functions or lambda offered a good way to deal with sorting. The sort is an in-place sort, meaning another list is not returned from your sort.

The problem is a lot of times you would end up calling python functions to help with the sorting which adds overhead or you end up creating temporary data structures to so the sort.

Python 2.4 adds 1)3 keywords to list.sort method: key,cmp,reverse 2)heapq and bisect implemented in C -- sorted datastructures 3) operator.itemgetter(index) - avoids the overhead of a function call operator.attrgetter(attr) 4)sorted() -- if you want to return a sorted list, instead of in-place

They allow you to avoid using user defined functions in some cases, more clearly apply the idioms of sorting in functions that you build, and avoid building ancillary data structures to complete the sort.

You can reach me at pyguy2 on yahoo.

Tags: algorithms

7 comments

Simon Percivall 19 years, 7 months ago # | flag

user itemgetter and attrgetter. This recipe might be an even better introduction to sorting in Python 2.4 if you were to user operator.itemgetter and operator.attrgetter instead of defining your own functions to do their work.

John Nielsen (author) 19 years, 7 months ago # | flag

wanted to limit what you had to know -- but you are right. I wanted to specifically show off some basics of how list.sort() changed, to keep things simple and understandable. Way back, when I was teaching people perl, I tended to lose people if I wasn't careful. Python may not suffer as much from that, but I am still cautious. I'll put up another post adding to it some other ideas.

John Nielsen (author) 19 years, 7 months ago # | flag

decided to put it all in one big one. I changed my mind and put it all together.

Nick Coghlan 19 years, 4 months ago # | flag

sorted() is a builtin, rather than a static method of list.

Cheers,

Nick.

John Nielsen (author) 19 years, 1 month ago # | flag

thanks. Thanks, I fixed that.

john

Andreas Kloss 18 years, 9 months ago # | flag

Also, why do you build your own cmp?

def mycmp(a, b):
    return cmp(a[1], b[1])

will do.

John Nielsen (author) 18 years, 9 months ago # | flag

good point, should have mentioned it. I was trying to make it easier to see what is happening when you look at cmp1 and cmp2. But, I should have mentioned the builtin cmp. Thanks for bringing that up. I'll fix that.

◄	Python recipes (4591)	►
◄	John Nielsen's recipes (36)	►

sorting -- old to new python 2.4 style (heapq,bisect,list.sort keywords,itemgetter) (Python recipe) by John Nielsen
ActiveState Code (http://code.activestate.com/recipes/305304/)

7 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

sorting -- old to new python 2.4 style (heapq,bisect,list.sort keywords,itemgetter) (Python recipe) by John Nielsen ActiveState Code (http://code.activestate.com/recipes/305304/)

7 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

sorting -- old to new python 2.4 style (heapq,bisect,list.sort keywords,itemgetter) (Python recipe) by John Nielsen
ActiveState Code (http://code.activestate.com/recipes/305304/)