This recipe shows how to retrieve word definitions from the website www.dictionary.com.
"""The following routines are specific to queries to
www.dictionary.com (as of 2003-07-23)"""

def get_def_page(word):
    """Retrieve the definition page for the word of interest.
    """
    import urllib
    url = "http://www.dictionary.com/cgi-bin/dict.pl?term=%s" % word
    fo = urllib.urlopen(url)
    page = fo.read()
    return page

def get_definitions(wlist):
    """Return a dictionary comprising words (keys) and definition
    lists (values).
    """
    ddict = {}
    for word in wlist:
        text = get_def_page(word)
        defs = extract_defs(text)
        ddict[word] = defs
    return ddict

def extract_defs(text):
    """The site formats its definitions as list items <LI>definition</LI>
    We first look for all of the list items and then strip them of any
    remaining tags (like <ul>, <CITE>, etc.). This is done using simple
    regular expressions, but could probably be done more robustly by
    the method detailed in
    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52281 .
    """
    import re
    clean_defs = []
    LI_re = re.compile(r'<LI[^>]*>(.*)</LI>')
    HTML_re = re.compile(r'<[^>]+>\s*')
    defs = LI_re.findall(text)
    # remove internal tags
    for d in defs:
        clean_d = HTML_re.sub('', d)
        if clean_d:
            clean_defs.append(clean_d)
    return clean_defs

#--------------------------------------------------------------------
#
#--------------------------------------------------------------------
if __name__ == "__main__":
    defdict = get_definitions(['monty', 'python', 'language'])
    print defdict
I needed to look up definitions for a list of words and couldn't find a convenient way to do this programmatically. The code above seems to work well for this purpose, but it may not be robust under all circumstances, and it will break if the queried website changes the format of its definition page.
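Since robustness is the main concern here, the following is a minimal sketch of a parser-based alternative to the regex extraction, written for Python 3 with the standard library's `html.parser`. The names `ListItemExtractor` and `extract_defs_parsed` are hypothetical and not part of the recipe, and the sketch assumes that every definition still sits inside a list item, as it did when the recipe was written.

```python
from html.parser import HTMLParser


class ListItemExtractor(HTMLParser):
    """Collect the text of each <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished definitions
        self._inside = False  # True while we are inside an <li>
        self._parts = []      # text fragments of the current item

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <LI> and <li> both arrive as "li".
        if tag == "li":
            self._flush()     # tolerate a missing </li> on the previous item
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush()

    def handle_data(self, data):
        # Text inside nested tags such as <CITE> is kept; the tags are dropped.
        if self._inside:
            self._parts.append(data)

    def _flush(self):
        # Join the fragments and collapse runs of whitespace and newlines.
        text = " ".join("".join(self._parts).split())
        if text:
            self.items.append(text)
        self._parts = []
        self._inside = False


def extract_defs_parsed(text):
    """Return the cleaned text of every list item in an HTML string."""
    parser = ListItemExtractor()
    parser.feed(text)
    parser.close()
    parser._flush()  # keep a trailing item that was never closed
    return parser.items


if __name__ == "__main__":
    # Made-up markup for illustration, not actual dictionary.com output.
    sample = '<UL><LI type="disc">A <CITE>large</CITE>\nsnake.</LI><li>A soothsayer.</li></UL>'
    print(extract_defs_parsed(sample))
```

Because the parser tracks tags instead of matching text patterns, attributes on the list tags, mixed-case tag names, and items that span several lines need no special handling. Nested lists are not handled by this sketch.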
regex broken now, for dictionary.com.
dictionary.com may have added some attributes to their LI tags, which broke this script. A working regex is:
regex broken?
Thanks for the comment. Although I haven't found cases where the original regex would not work, I have updated the recipe with your suggested regex (which seems a little more robust to changes in the list tags).
A few more regex problems.
First off, thanks for the clean code sample. I'm new to Python, and I learned a lot from this.
I came across some minor regex problems when running this:
When I looked up 'pig', several parts of the definition were left out. The problem appears to be that sometimes the <LI> and </LI> tags are on different lines. I modified the regex compile to allow '.' to match newline, which seemed to fix the problem:
LI_re = re.compile(r'<LI[^>]*>(.*)</LI>', re.DOTALL)
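One caveat with `re.DOTALL`: once '.' can match newlines, a greedy `(.*)` runs from the first `<LI>` to the last `</LI>` on the page and returns everything as a single match. A possible variation, sketched below for Python 3, uses the non-greedy `(.*?)`. The `\b` and `re.IGNORECASE` are further additions of mine and not part of the recipe, and the sample markup is made up for illustration.

```python
import re

# Greedy pattern with DOTALL, for comparison.
greedy_re = re.compile(r'<LI[^>]*>(.*)</LI>', re.DOTALL)

# Non-greedy (.*?) stops at the first </LI>. DOTALL lets '.' cross newlines.
# \b keeps tags such as <LINK> from matching, and IGNORECASE also accepts <li>.
LI_re = re.compile(r'<LI\b[^>]*>(.*?)</LI>', re.DOTALL | re.IGNORECASE)

# Made-up markup: the first item spans two lines.
sample = "<LI>An animal of the\nswine family.</LI>\n<LI>A greedy person.</LI>"

print(len(greedy_re.findall(sample)))  # 1: both items swallowed in one match
print(LI_re.findall(sample))           # two separate items
```

The extracted items still contain the embedded newline, so the tag-stripping step may also want to collapse whitespace.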
When I looked up 'octothorpe', I didn't get any definition. The problem appears to be that some definitions use the DD tag instead of the LI tag.
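One way to cover both cases is to accept either tag in the same pattern. This is only a sketch for Python 3: the function name `extract_defs_li_or_dd` is hypothetical, the sample markup is invented, and the pattern assumes that every `<DD>` has an explicit closing `</DD>`, which HTML does not require.

```python
import re

# Match <LI>...</LI> or <DD>...</DD>; \1 requires the closing tag to have the
# same name as the opening tag. Assumes closing tags are present.
ITEM_re = re.compile(r'<(LI|DD)\b[^>]*>(.*?)</\1>', re.DOTALL | re.IGNORECASE)
TAG_re = re.compile(r'<[^>]+>')


def extract_defs_li_or_dd(text):
    """Return cleaned definitions found in either LI or DD elements."""
    defs = []
    for _tag, body in ITEM_re.findall(text):
        # Strip inner tags, then collapse whitespace and newlines.
        clean = " ".join(TAG_re.sub("", body).split())
        if clean:
            defs.append(clean)
    return defs


if __name__ == "__main__":
    # Made-up markup, not actual dictionary.com output.
    sample = ('<DL><DD>The symbol <B>#</B>\non a telephone.</DD></DL>'
              '<UL><LI>Another sense.</LI></UL>')
    print(extract_defs_li_or_dd(sample))
```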