Downloads the "All Recipe Authors" pages from ActiveState and uses regular expressions to parse each author's name and recipe count on every page. Finally, it displays the recipe submission distribution (how many authors have submitted how many recipes each).
    import urllib2
    import re

    page = 1
    contrib = []  # each element of contrib is a tuple consisting of the name of the user and the number of submitted recipes.

    while 1:  # loop over pages
        print "Processing page %s" % (page)
        f = urllib2.urlopen("http://code.activestate.com/recipes/users/?page=%s" % (page))
        html = f.read()
        f.close()
        pattern = r'<li><a href="/recipes/users/.*/">(.*)</a>\s*<span class="secondary">\((.*) recipe[s]?\)</span>'
        res = re.findall(pattern, html)
        if res:
            contrib.extend(res)
        if html.find('<span class="next disabled">') != -1:  # found at the last page
            break
        else:
            page += 1

    # Print users and number of recipes on screen
    #for p in contrib:
    #    print p[0], p[1]

    # Number of recipes as a list:
    nrecipes = [int(p[1]) for p in contrib]

    # Print the distribution
    n = 1
    while n <= max(nrecipes):
        c = nrecipes.count(n)
        if c:
            print "%s people contribute %s recipes each" % (c, n)
        n += 1
As on most social sharing sites, contributions to ActiveState recipes follow an uneven, non-Gaussian distribution: a few users have a large number of contributions, while most users contribute only one or two recipes. This program parses and analyzes the recipe contribution statistics on ActiveState.
The program downloads and processes the ActiveState author list starting at page 1, appends each user name and the corresponding number of recipes to the list "contrib", and repeats for the next page. The main loop ends when a page contains the string '<span class="next disabled">', which marks the last page.
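The per-page parsing step and the last-page check can be tried without any network access. The following is a minimal sketch in Python 3 (the listing itself targets Python 2). The helper names `parse_authors` and `is_last_page` and the sample HTML fragment are hypothetical, made up for illustration. The pattern is a slightly tightened variant of the one in the listing, with non-greedy groups and `\d+` for the count, and is not the recipe's exact expression.

```python
import re

# Variant of the listing's pattern: non-greedy groups so one match cannot
# swallow several entries on the same line, and \d+ so the count is numeric.
AUTHOR_PATTERN = re.compile(
    r'<li><a href="/recipes/users/.*?/">(.*?)</a>\s*'
    r'<span class="secondary">\((\d+) recipes?\)</span>'
)
LAST_PAGE_MARKER = '<span class="next disabled">'


def parse_authors(html):
    """Return a list of (name, recipe_count) tuples found in one page."""
    return [(name, int(count)) for name, count in AUTHOR_PATTERN.findall(html)]


def is_last_page(html):
    """True when the 'next' link is disabled, so no further page exists."""
    return LAST_PAGE_MARKER in html


# Hypothetical fragment, invented for illustration only (not real site data).
sample_html = (
    '<ul>'
    '<li><a href="/recipes/users/111/">Alice Example</a> '
    '<span class="secondary">(3 recipes)</span></li>'
    '<li><a href="/recipes/users/222/">Bob Example</a> '
    '<span class="secondary">(1 recipe)</span></li>'
    '</ul>'
    '<span class="next disabled">Next</span>'
)

print(parse_authors(sample_html))
print(is_last_page(sample_html))
```

Keeping the parsing separate from the download loop makes it possible to check the regular expression against a saved page before running the full crawl.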
At the time of this submission, the output shows that 875 of 1427 authors (61%) have submitted a single recipe, and the top four authors (0.3%) together account for 10% of all recipes.
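Figures like these can be derived from the list of per-author recipe counts. Below is a minimal sketch in Python 3, assuming the counts are already available as a list of integers. The function names and the `toy_counts` data are hypothetical and only illustrate the arithmetic; the last two lines recheck the percentages quoted above from the numbers given in the text.

```python
from collections import Counter


def distribution(recipe_counts):
    """Map 'number of recipes' to 'number of authors with that many'."""
    return dict(sorted(Counter(recipe_counts).items()))


def single_recipe_share(recipe_counts):
    """Fraction of authors who submitted exactly one recipe."""
    return sum(1 for n in recipe_counts if n == 1) / len(recipe_counts)


def top_share(recipe_counts, k):
    """Fraction of all recipes contributed by the k most prolific authors."""
    top = sorted(recipe_counts, reverse=True)[:k]
    return sum(top) / sum(recipe_counts)


# Toy data, invented for illustration only (not the real site statistics).
toy_counts = [1, 1, 1, 1, 1, 1, 2, 2, 3, 7]

for n, c in distribution(toy_counts).items():
    print("%s people contribute %s recipes each" % (c, n))
print(single_recipe_share(toy_counts))
print(top_share(toy_counts, 1))

# Recheck of the percentages quoted in the text.
print(round(100 * 875 / 1427))    # share of single-recipe authors, in percent
print(round(100 * 4 / 1427, 1))   # share of authors in the top four, in percent
```

Using `Counter` builds the whole distribution in one pass over the list, instead of calling `count` once for every value up to the maximum.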