
Script for calculating the 20 most popular Linux distributions. It fetches the list of known Linux distributions from http://lwn.net/Distributions/, then sends an automated query for each one to the Yahoo search engine and records the number of pages returned. Each distribution's rating is its hit count expressed as a percentage of the total hits of the top 20 distributions.

IMPORTANT: the script sends about 300 queries to Yahoo, which puts a noticeable load on the search service, so each query is followed by a 2-second delay. Even so, there is a good chance of getting a temporary ban from the Yahoo search service because of the many requests coming from one IP address. USE AT YOUR OWN RISK!
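
The delay between queries could be factored out of the main loop into a small helper. The following is only a sketch in Python 3 syntax (the recipe itself targets Python 2); `fetch_all`, its `delay` parameter and the injectable `sleep` argument are hypothetical names, not part of the recipe.

```python
import time

def fetch_all(items, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(item) for every item, pausing `delay` seconds between calls.

    `fetch` stands in for a function such as DistroRank; `sleep` is
    injectable so the pacing can be checked without really waiting.
    """
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        results.append((fetch(item), item))
    return results
```

Passing a larger `delay` lowers the request rate further, at the cost of a longer run; whether any particular value avoids a ban is not something this sketch can promise.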

Python, 44 lines
import urllib2
import re
import time

def LinuxDistros():
  req = urllib2.Request("http://lwn.net/Distributions/")
  f = urllib2.urlopen(req)
  t = f.read()
  f.close()
  rc = re.compile('<li> <b><a href.*>(.*)</a></b><br>')
  return rc.findall(t)

def DistroRank(nix):
  enc = "http://search.yahoo.com/search?p="+urllib2.quote('"'+nix+'" "linux distribution"')
  req = urllib2.Request(enc)
  req.add_header('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20051111 Firefox/1.5 BAVM/1.0.0')
  f = urllib2.urlopen(req)
  t = f.read()
  f.close()
  rc = re.compile('<span id="infotext">1 - 10 of (.*) for <strong>')
  rez = rc.search(t)
  if rez:
    return int(rez.groups()[0].replace(',',''))
  else:
    return 0

def TopDistros():
  print 'Fetching ranks from search engine...'
  distros = LinuxDistros()
  res = []
  for d in distros:
    res.append((DistroRank(d),d))
    print 'Fetched', len(res),'distro of',len(distros)
    time.sleep(2)
  res = sorted(res,reverse=True)[:20]
  total = sum(r for r,d in res) or 1  # avoid division by zero if every query returned 0 hits
  res = [(round(100.*r/total,2), d) for r,d in res]
  print '-'*20
  print 'Distro  Rating(%)'
  print '-'*20
  for r,d in res:
    print d,r

TopDistros()
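
The ranking arithmetic at the end of TopDistros() can also be written as a pure function, which makes it easy to check without any network access. This is a sketch in Python 3 syntax rather than the recipe's Python 2; `top_ratings` and the sample numbers are hypothetical and chosen only for illustration.

```python
def top_ratings(hits, n=20):
    """Return [(percent, name), ...] for the n entries with the most hits.

    `hits` is a list of (hit_count, name) pairs, the same shape that
    TopDistros() builds. Percentages are relative to the top-n total.
    """
    top = sorted(hits, reverse=True)[:n]
    total = sum(count for count, _ in top)
    if total == 0:
        # No hits at all (for example, the result page layout changed).
        return [(0.0, name) for _, name in top]
    return [(round(100.0 * count / total, 2), name) for count, name in top]


# Made-up hit counts: only the top 3 contribute to the total of 1000.
sample = [(600, "A"), (300, "B"), (100, "C"), (50, "D")]
print(top_ratings(sample, n=3))  # [(60.0, 'A'), (30.0, 'B'), (10.0, 'C')]
```

Because the percentages are taken over the top-n total only, they always describe shares within the top 20, not shares of all queried distributions.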

Results of 2009-03-15:


Distro Rating(%)
  1. Ubuntu 27.1
  2. Fedora 15.35
  3. SuSE Linux 5.17
  4. KNOPPIX 4.97
  5. CentOS 4.4
  6. Mandriva Linux 3.7
  7. Debian GNU/Linux 3.41
  8. TINY 3.39
  9. PCLinuxOS 3.31
  10. Arch Linux 3.26
  11. FAN 3.03
  12. Red Hat Enterprise 3.0
  13. Xandros Linux 2.98
  14. Absolute 2.77
  15. RULE 2.72
  16. Moblin 2.48
  17. Gentoo Linux 2.39
  18. FIRE 2.38
  19. Linspire 2.12
  20. Zenwalk 2.08

It is interesting to compare these results with http://distrowatch.com/dwres.php?resource=major. DistroWatch puts Ubuntu in first place, and the Yahoo hit counts agree. From second place on, however, the two rankings diverge. DistroWatch places openSUSE second and Fedora third, but by Yahoo hits Fedora gets about 3 times as many hits as SuSE Linux, so Fedora would be second and SuSE third. The lower positions disagree with DistroWatch even more.

1 comment

Anand 15 years, 1 month ago

Perhaps it would be a good idea to provide an option to limit the number of queries? 337 queries seems too many just to calculate the top 20 distributions. Add an option specifying how many of the LWN distributions to fetch.
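
Such a limit could be as small as a helper that truncates the list returned by LinuxDistros() before the query loop starts. This is only a sketch; `limit_distros` and its `limit` parameter are hypothetical names and not part of the recipe. Note that truncating the LWN list keeps its first entries in page order, not the most popular ones, so a low limit can change which distributions reach the top 20.

```python
def limit_distros(distros, limit=None):
    """Return at most `limit` names; None or a negative value means no limit."""
    names = list(distros)
    if limit is None or limit < 0:
        return names
    return names[:limit]


# Hypothetical use inside TopDistros():
#   distros = limit_distros(LinuxDistros(), 50)
```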