
This recipe shows how to retrieve word definitions from the website www.dictionary.com.

Python, 56 lines
"""The following routines are specific to queries to 
www.dictionary.com (as of 2003-07-23)"""

def get_def_page(word):
    """Retrieve the definition page for the word of interest."""
    import urllib
    # Quote the word so that spaces and other special characters are URL-safe.
    url = "http://www.dictionary.com/cgi-bin/dict.pl?term=%s" % urllib.quote(word)
    fo = urllib.urlopen(url)
    page = fo.read()
    fo.close()
    return page

def get_definitions(wlist):
    """Return a dictionary mapping each word (key) to its list of
    definitions (value).

    """
    ddict = {}
    for word in wlist:
        text = get_def_page(word)
        defs = extract_defs(text)
        ddict[word] = defs
    return ddict

def extract_defs(text):
    """The site formats its definitions as list items: <LI>definition</LI>.

    We first look for all of the list items and then strip them of any
    remaining tags (like <ul>, <CITE>, etc.). This is done using simple
    regular expressions, but could probably be done more robustly by
    the method detailed in
    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52281.

    """
    import re

    clean_defs = []
    LI_re = re.compile(r'<LI[^>]*>(.*)</LI>')
    HTML_re = re.compile(r'<[^>]+>\s*')
    defs = LI_re.findall(text)
    # remove internal tags                                                                                      
    for d in defs:
        clean_d = HTML_re.sub('',d)
        if clean_d: clean_defs.append(clean_d)

    return clean_defs


#--------------------------------------------------------------------                                           
#                                                                                                               
#--------------------------------------------------------------------                                           
if __name__ == "__main__":

    defdict = get_definitions(['monty','python','language'])
    print defdict

I had a need to look up definitions for a list of words and couldn't find a convenient way to do this programmatically. The code above seems to work well for this purpose, but may not be robust under all circumstances. It will also break if the queried website changes its definition page format.
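
As a hedged illustration of a less fragile extraction step, here is a minimal sketch that walks the page with a real HTML parser instead of a regular expression. It assumes Python 3 and the standard library's `html.parser` module (the recipe itself targets Python 2), and it is not the technique from the recipe linked in the docstring. The names `ListItemExtractor` and `extract_defs_with_parser` are hypothetical, and the sample markup is invented for the example rather than taken from the site.

```python
from html.parser import HTMLParser


class ListItemExtractor(HTMLParser):
    """Collect the text of every <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished definitions, in page order
        self._chunks = None   # text pieces of the current item, or None

    def _flush(self):
        # Finish the current item, collapsing runs of whitespace and newlines.
        if self._chunks is not None:
            text = " ".join("".join(self._chunks).split())
            if text:
                self.items.append(text)
        self._chunks = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <LI> and <li> both arrive as "li".
        if tag == "li":
            self._flush()     # also copes with an <li> that was never closed
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in ("li", "ul", "ol"):
            self._flush()

    def handle_data(self, data):
        # Keep text only while inside a list item; inner tags are simply skipped.
        if self._chunks is not None:
            self._chunks.append(data)


def extract_defs_with_parser(text):
    """Return the cleaned text of each list item found in an HTML string."""
    parser = ListItemExtractor()
    parser.feed(text)
    parser.close()
    return parser.items


sample = ('<ul><LI type="a">A <i>large</i>\nsnake</LI>'
          '<li>A <CITE>comedy</CITE> troupe</li></ul>')
print(extract_defs_with_parser(sample))
# ['A large snake', 'A comedy troupe']
```

Because the parser tracks where each list item starts and ends, attributes on the tag, mixed-case tag names, and items that span several lines need no special handling. It still depends on the site marking definitions up as list items.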

3 comments

fooman activestate jones 18 years, 4 months ago

regex broken now, for dictionary.com. dictionary.com may have added some arguments to their LI tags, which broke this script. A working regex is:

LI_re = re.compile(r'<LI[^>]*>(.*)</LI>')

gyro funch (author) 18 years, 3 months ago

regex broken? Thanks for the comment. Although I haven't found cases where the original regex would not work, I have updated the recipe with your suggested regex (which seems a little more robust to changes in the list tags).

maxhearn 18 years, 1 month ago

A few more regex problems. First off, thanks for the clean code sample. I'm new to Python, and I learned a lot from this.

I came across some minor regex problems when running this:

  1. When I looked up 'pig', several parts of the definition were left out. The problem appears to be that sometimes the <LI> and </LI> tags are on different lines. I modified the regex compile to allow '.' to match newline, which seemed to fix the problem:

    LI_re = re.compile(r'<LI[^>]*>(.*)</LI>', re.DOTALL)

  2. When I looked up 'octothorpe', I didn't get any definition. The problem appears to be that some definitions use the DD tag instead of the LI tag.
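
The two problems above can be handled in one pattern. The following is a minimal sketch, assuming Python 3; the names `ITEM_RE`, `TAG_RE`, and `extract_defs_multiline` are hypothetical, and the sample page is invented for illustration rather than copied from dictionary.com. One caveat about combining `re.DOTALL` with the greedy `(.*)`: once `.` can cross newlines, a greedy match runs from the first opening tag to the last closing tag on the page, so the sketch uses the non-greedy `(.*?)` instead.

```python
import re

# (LI|DD) accepts either tag; \b keeps tags such as <LINK> from matching.
# (.*?) is non-greedy, so each match ends at the nearest closing tag.
# \1 requires the closing tag to have the same name as the opening tag.
# DOTALL lets '.' cross newlines; IGNORECASE accepts <li> as well as <LI>.
ITEM_RE = re.compile(r'<(LI|DD)\b[^>]*>(.*?)</\1\s*>',
                     re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]+>')


def extract_defs_multiline(text):
    """Return cleaned definitions found in <LI> or <DD> elements."""
    clean_defs = []
    for _tag, body in ITEM_RE.findall(text):
        # Drop inner tags, then collapse newlines and repeated spaces.
        clean = " ".join(TAG_RE.sub('', body).split())
        if clean:
            clean_defs.append(clean)
    return clean_defs


page = ('<OL><LI type="1">A domesticated\nhog.</LI>\n'
        '<LI>A <i>greedy</i> person.</LI></OL>'
        '<DL><DD>The # symbol.</DD></DL>')
print(extract_defs_multiline(page))
# ['A domesticated hog.', 'A greedy person.', 'The # symbol.']
```

This remains a regex heuristic: nested lists or unclosed tags can still confuse it, and it breaks if the site changes its markup.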

Created by gyro funch on Wed, 23 Jul 2003 (PSF)