
This code lets you search Google Scholar from Python. The results are returned as a list of dictionaries, with each field addressed by its key.

Python, 141 lines
import httplib
import urllib
from BeautifulSoup import BeautifulSoup
import re

class GoogleScholarSearch:
	"""
	@brief This class searches Google Scholar (http://scholar.google.com)

	Search for articles and publications containing terms of interest.
	
	Usage example:\n
	<tt>
	> from google_search import *\n
	> searcher = GoogleScholarSearch()\n
	> searcher.search(['breast cancer', 'gene'])
	</tt>
	"""
	def __init__(self):
		"""
		@brief Empty constructor.
		"""
		self.SEARCH_HOST = "scholar.google.com"
		self.SEARCH_BASE_URL = "/scholar"

	def search(self, terms, limit=10):
		"""
		@brief This function searches Google Scholar using the specified terms.
		
		Returns a list of dictionaries. Each
		dictionary contains the information related to the article:
			"URL"		: link to the article\n
			"Title"		: title of the publication\n
			"Authors"	: authors (example: DF Easton, DT Bishop, D Ford)\n
			"JournalYear" 	: journal name & year (example: Nature, 2001)\n
			"JournalURL"	: link to the journal main website (example: www.nature.com)\n
			"Abstract"	: abstract of the publication\n
			"NumCited"	: number of times the publication is cited\n
			"Terms"		: list of search terms used in the query\n

		@param terms List of search terms
		@param limit Maximum number of results to be returned (default=10)
		@return List of results, this is the empty list if nothing is found
		"""
		params = urllib.urlencode({'q': " ".join(terms), 'num': limit})  # urlencode turns spaces into '+'
		headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

		url = self.SEARCH_BASE_URL+"?"+params
		conn = httplib.HTTPConnection(self.SEARCH_HOST)
		conn.request("GET", url, None, headers)
    
		resp = conn.getresponse()      
        
		if resp.status==200:
			html = resp.read()
			results = []
			html = html.decode('ascii', 'ignore')
                        
			# Screen-scrape the result to obtain the publication information
			soup = BeautifulSoup(html)
			citations = 0
			for record in soup('p', {'class': 'g'}):
             
				# Includes error checking
				topPart = record.first('span', {'class': 'w'})                                
                
				pubURL = topPart.a['href']
				# Clean up the URL, make sure it does not contain '\' but '/' instead
				pubURL = pubURL.replace('\\', '/')

				pubTitle = ""
                
				for part in topPart.a.contents:
					pubTitle += str(part)
                
				if pubTitle == "":
					match1 = re.findall('<b>\[CITATION\]<\/b><\/font>(.*)- <a',str(record))
					match2 = re.split('- <a',match1[citations])
					pubTitle = re.sub('<\/?(\S)+>',"",match2[0])
					citations = citations + 1
               
				authorPart = record.first('font', {'color': 'green'}).string
				if str(authorPart)=='Null':	
					authorPart = ''
					# Sometimes even BeautifulSoup can fail, fall back to regex
					m = re.findall('<font color="green">(.*)</font>', str(record))
					if len(m)>0:
						authorPart = m[0]
				num = authorPart.count(" - ")
				# Assume that the fields are delimited by ' - ', the first entry will be the
				# list of authors, the last entry is the journal URL, anything in between
				# should be the journal year
				idx_start = authorPart.find(' - ')
				idx_end = authorPart.rfind(' - ')
				pubAuthors = authorPart[:idx_start]				
				pubJournalYear = authorPart[idx_start + 3:idx_end]
				pubJournalURL = authorPart[idx_end + 3:]
				# If (only one ' - ' is found) and (the end bit contains '\d\d\d\d')
				# then the last bit is journal year instead of journal URL
				if pubJournalYear=='' and re.search('\d\d\d\d', pubJournalURL)!=None:
					pubJournalYear = pubJournalURL
					pubJournalURL = ''
                               
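				# This can potentially fail if all of the abstract can be contained in the space
				# provided such that no '...' is found. Search within this record only, so that
				# each result gets its own abstract rather than the first one on the page
				delimiter = record.firstText("...").parent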
				pubAbstract = ""
				while str(delimiter)!='Null' and (str(delimiter)!='<b>...</b>' or pubAbstract==""):
					pubAbstract += str(delimiter)
					delimiter = delimiter.nextSibling
				pubAbstract += '<b>...</b>'
                
				match = re.search("Cited by ([^<]*)", str(record))
				pubCitation = ''
				if match != None:
					pubCitation = match.group(1)
				results.append({
					"URL": pubURL,
					"Title": pubTitle,
					"Authors": pubAuthors,
					"JournalYear": pubJournalYear,
					"JournalURL": pubJournalURL,
					"Abstract": pubAbstract,
					"NumCited": pubCitation,
					"Terms": terms
				})
			return results
		else:
			print "ERROR: ",
			print resp.status, resp.reason
			return []

if __name__ == '__main__':
    search = GoogleScholarSearch()
    pubs = search.search(["breast cancer", "gene"], 10)
    for pub in pubs:
        print pub['Title']
        print pub['Authors']
        print pub['JournalYear']
        print pub['Terms']
        print "======================================"

As far as I know, this is the only way to retrieve Google Scholar search results from Python, since Google does not provide an API for Google Scholar. Note that you will need an older version of BeautifulSoup (v2.1.1), which can be downloaded from http://www.physics.ox.ac.uk/users/santoso/BeautifulSoup.py.
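
The recipe splits the green author line on ' - ' into authors, journal/year, and journal site. The following is a minimal standalone sketch of that rule, written for Python 3 rather than the Python 2 of the recipe. The function name split_author_line and the handling of a line with no ' - ' at all are assumptions for illustration, not part of the original code.

```python
import re


def split_author_line(author_part):
    """Split 'authors - journal, year - site' into its three fields."""
    idx_start = author_part.find(" - ")
    idx_end = author_part.rfind(" - ")
    if idx_start == -1:
        # Assumption: with no delimiter, treat the whole string as the authors
        return author_part, "", ""
    authors = author_part[:idx_start]
    journal_year = author_part[idx_start + 3:idx_end]
    journal_url = author_part[idx_end + 3:]
    # Only one ' - ' was found and the tail contains a four-digit number:
    # the tail is then the journal/year rather than the journal site
    if journal_year == "" and re.search(r"\d{4}", journal_url):
        journal_year, journal_url = journal_url, ""
    return authors, journal_year, journal_url


print(split_author_line(
    "DF Easton, DT Bishop, D Ford - Nature, 2001 - www.nature.com"))
```

Keeping this step in its own function makes it possible to check the delimiter rule against sample strings without contacting Google Scholar.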

4 comments

Taoufik En-Najjary 14 years ago

Mistake. It did not work. I think you forgot to define authorPart's attributes.

urania 11 years, 7 months ago

Thanks for this great script. However, I think using the BibTeX function of Google Scholar is more useful in an academic setting, so here is my alternative:

import urllib, urllib2, cookielib, urllister, sys

def get_page(url):
  filename = "cookies.txt"
  headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12'
  }
  request = urllib2.Request(url, None, headers)
  cookies = cookielib.MozillaCookieJar(filename, None, None)
  cookies.load()
  cookie_handler= urllib2.HTTPCookieProcessor(cookies)
  redirect_handler= urllib2.HTTPRedirectHandler()
  opener = urllib2.build_opener(redirect_handler,cookie_handler)
  response = opener.open(request)
  return response.read()

def get_bibtex(termstring, base_url, url_subdir):
  # Drop the trailing newline left by readline() before building the query
  terms=termstring.strip().split(" ")
  # Quote the title for an exact-phrase search; urlencode turns spaces into '+'
  searchstring="\""+" ".join(terms)+"\""
  params = urllib.urlencode({'q': searchstring})
  url = base_url+url_subdir+"?"+params
  page_data = get_page(url)
  parser = urllister.URLLister()
  parser.feed(page_data)
  parser.close()
  # Collect the links that point at BibTeX exports
  biburls=[]
  for url in parser.urls:
      if url.find('scholar.bib')>0:
          biburls.append(url)
  if biburls:
    retpage=get_page(base_url+biburls[0])
  else:
    retpage=""
  return retpage

if __name__ == "__main__":
  if len(sys.argv)==2:
    base_url = "http://scholar.google.at"
    url_subdir = "/scholar"
    titles_file=sys.argv[1]
    file = open(titles_file)
    lines=[]
    while 1:
      termstring=file.readline()
      if not termstring:
          break
      else:
          print get_bibtex(termstring, base_url, url_subdir)
  else:
    print "usage: python "+sys.argv[0]+" titles.txt\n titles.txt must contain a list of publication titles."

The usage of this script is simple. Provide a list of terms or titles in a text file, with one search per line. The text file should look something like this:

Resonance of an Optical Monopole Antenna Probed by Single Molecule Fluorescence
Optical antennas direct single-molecule emission
...

Additionally, a cookies.txt file is needed to tell Google that you want BibTeX entries to be added (normally done by changing the "Scholar Preferences"). Here is a sample cookies.txt file:

# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This is a generated file!  Do not edit.
.scholar.google.com TRUE    /   FALSE   XXX GSP ID=XXX:IN=XXX+XXX:CF=4

This data can easily be extracted from your browser. For Firefox I can recommend the "Edit Cookies" add-on.
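
To check that a cookies.txt file is well-formed before running the script, something like the following sketch may help. It is written for Python 3, where cookielib became http.cookiejar. The cookie value and the expiry timestamp are made-up placeholders, and the helper name load_cookies is hypothetical. Note that the loader expects the seven fields on each cookie line to be separated by tabs.

```python
import http.cookiejar
import os
import tempfile

# Placeholder values only; a real file carries your own GSP cookie.
# Tab-separated fields: domain, subdomain flag, path, secure, expiry, name, value
SAMPLE = (
    "# Netscape HTTP Cookie File\n"
    ".scholar.google.com\tTRUE\t/\tFALSE\t2147483647\tGSP\t"
    "ID=0123456789abcdef:CF=4\n"
)


def load_cookies(text):
    """Write text to a temporary file, load it, and list the cookies found."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write(text)
        path = handle.name
    try:
        jar = http.cookiejar.MozillaCookieJar(path)
        jar.load()  # raises LoadError if the header line is missing
        return [(cookie.domain, cookie.name, cookie.value) for cookie in jar]
    finally:
        os.remove(path)


print(load_cookies(SAMPLE))
```

Cookies whose expiry time has already passed are skipped by load() unless ignore_expires=True is given, which is worth checking first if the jar comes back empty.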

Warning: Google will flag you as a bot if you use this script intensively, so be responsible (and use a random wait interval between get_bibtex iterations).
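
One way to add the random wait interval is sketched below in Python 3. The name fetch_politely and the 5 to 15 second bounds are assumptions chosen for illustration; the fetch argument stands in for a call such as get_bibtex with its URL arguments already filled in.

```python
import random
import time


def fetch_politely(titles, fetch, min_wait=5.0, max_wait=15.0, sleep=time.sleep):
    """Call fetch(title) for each title, pausing a random interval between calls."""
    results = []
    for index, title in enumerate(titles):
        if index > 0:
            # Wait before every request except the first
            sleep(random.uniform(min_wait, max_wait))
        results.append(fetch(title))
    return results


if __name__ == "__main__":
    # Stand-in fetch function and short waits, so the demo makes no requests
    print(fetch_politely(["first title", "second title"], str.upper,
                         min_wait=0.1, max_wait=0.2))
```

Passing the sleep function as a parameter keeps the loop easy to try out without actually waiting.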

urania 11 years, 7 months ago

Since I could not fit everything into the 3000 characters allowed per comment, I would like to mention here that importing urllister requires urllister.py:

"""Extract list of URLs in a web page

This program is part of "Dive Into Python", a free Python book for
experienced programmers.  Visit http://diveintopython.org/ for the
latest version.
"""

__author__ = "Mark Pilgrim (mark@diveintopython.org)"
__version__ = "$Revision: 1.2 $"
__date__ = "$Date: 2004/05/05 21:57:19 $"
__copyright__ = "Copyright (c) 2001 Mark Pilgrim"
__license__ = "Python"

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen("http://diveintopython.org/")
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls: print url

which I got from http://diveintopython.org/html_processing/extracting_data.html

haench 9 years, 9 months ago

I used the second script from urania (thanks!). But there seems to be a small issue: I got it working only by removing one (1) space in front of the line "return retpage" to fix the indentation.