This recipe shows how to retrieve word definitions from the website www.dictionary.com.
"""The following routines are specific to queries to
www.dictionary.com (as of 2003-07-23)"""

def get_def_page(word):
    """Retrieve the definition page for the word of interest.
    """
    import urllib
    url = "http://www.dictionary.com/cgi-bin/dict.pl?term=%s" % word
    fo = urllib.urlopen(url)
    page = fo.read()
    return page

def get_definitions(wlist):
    """Return a dictionary mapping each word (key) to its list of
    definitions (value).
    """
    ddict = {}
    for word in wlist:
        text = get_def_page(word)
        defs = extract_defs(text)
        ddict[word] = defs
    return ddict

def extract_defs(text):
    """The site formats its definitions as list items <LI>definition</LI>
    We first look for all of the list items and then strip them of any
    remaining tags (like <ul>, <CITE>, etc.). This is done using simple
    regular expressions, but could probably be done more robustly by
    the method detailed in
    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52281 .
    """
    import re
    clean_defs = []
    LI_re = re.compile(r'<LI>(.*)</LI>')
    HTML_re = re.compile(r'<[^>]+>\s*')
    defs = LI_re.findall(text)
    # remove internal tags
    for d in defs:
        clean_d = HTML_re.sub('', d)
        if clean_d:
            clean_defs.append(clean_d)
    return clean_defs

#--------------------------------------------------------------------
#
#--------------------------------------------------------------------
if __name__ == "__main__":
    defdict = get_definitions(['monty', 'python', 'language'])
    print defdict
I had a need to look up definitions for a list of words and couldn't find a convenient way to do this programmatically. The code above seems to work well for this purpose, but may not be robust under all circumstances. It will also break if the queried website changes its definition page format.
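As the docstring of extract_defs notes, a parser-based approach is likely to be more robust than the regular expressions. The following is only a minimal sketch of that idea, written for Python 3 with the standard-library html.parser module. The names ListItemExtractor and extract_defs_with_parser, as well as the sample markup, are hypothetical and are not part of the recipe above or of the site's real page source.

```python
from html.parser import HTMLParser


class ListItemExtractor(HTMLParser):
    """Collect the text content of every <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished list items
        self._parts = None   # text pieces of the item being read, or None

    def _close_item(self):
        if self._parts is not None:
            # Join the pieces and collapse runs of whitespace
            text = " ".join("".join(self._parts).split())
            if text:
                self.items.append(text)
            self._parts = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case
        if tag == "li":
            self._close_item()   # tolerate a missing </li>
            self._parts = []

    def handle_endtag(self, tag):
        if tag in ("li", "ul", "ol"):
            self._close_item()

    def handle_data(self, data):
        # Keep text only while inside a list item; nested tags are dropped
        if self._parts is not None:
            self._parts.append(data)


def extract_defs_with_parser(text):
    parser = ListItemExtractor()
    parser.feed(text)
    parser.close()
    parser._close_item()   # flush an item left open at end of input
    return parser.items


# Assumed sample markup, for illustration only
sample = ("<ul><LI>A venomous <i>snake</i>.</LI>\n"
          "<li class='x'>A spiteful\nperson.</li></ul>")
print(extract_defs_with_parser(sample))
```

Because the parser tracks elements rather than matching text, upper-case or lower-case tags, attributes on the list item, and items that span several lines are all handled the same way.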
Some problems with the regular expression.
Hi,
It is always amazing how short and compact Python is. However, I tested the script with several words and saw that it does not show the first definition on the page. Try "viper", for example: the page shows 4 hits, but only the last 3 are returned. The page source for the first hit is different, the hit being embedded in other HTML code.
Greetings
Chris
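One possible reason for a missed hit is that the pattern `<LI>(.*)</LI>` is quite strict: it is case-sensitive, requires a bare `<LI>` with no attributes, is greedy, and cannot match across line breaks. Whether any of this explains the missing first definition is not known, since the real page source is not shown here. The sketch below (Python 3) only illustrates a more tolerant pattern. The name extract_defs_tolerant and the sample markup are assumptions made for the example.

```python
import re

# Match <li ...> ... </li> in any letter case, with optional attributes,
# non-greedily, and across line breaks.
LI_RE = re.compile(r"<li\b[^>]*>(.*?)</li\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_defs_tolerant(text):
    clean_defs = []
    for raw in LI_RE.findall(text):
        # Remove nested tags, then normalise whitespace
        clean = " ".join(TAG_RE.sub("", raw).split())
        if clean:
            clean_defs.append(clean)
    return clean_defs


# Hypothetical markup, not the real page source
sample = (
    '<table><tr><td><li type="a">First\n sense</li></td></tr></table>\n'
    "<LI>Second sense</LI> <LI>Third <CITE>sense</CITE></LI>"
)
print(extract_defs_tolerant(sample))
```

On this made-up input the tolerant pattern returns three separate items, including the one wrapped in table markup. It remains a regular-expression approach, so nested lists or malformed HTML can still defeat it.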