
This recipe shows how to retrieve word definitions from the website www.dictionary.com.

Python
"""The following routines are specific to queries to 
www.dictionary.com (as of 2003-07-23)"""

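def get_def_page(word):
    """Retrieve the definition page for the word of interest."""
    try:                                  # Python 3
        from urllib.request import urlopen
        from urllib.parse import quote
    except ImportError:                   # Python 2
        from urllib import urlopen, quote
    # Quote the word so spaces and punctuation are safe in the query string.
    url = "http://www.dictionary.com/cgi-bin/dict.pl?term=%s" % quote(word)
    fo = urlopen(url)
    try:
        page = fo.read()
    finally:
        fo.close()
    # Decode bytes to text; UTF-8 is an assumption, so bad bytes are replaced.
    return page.decode("utf-8", "replace")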

def get_definitions(wlist):
    """Return a dictionary mapping each word (key) to its list of
    definitions (value).

    """
    ddict = {}
    for word in wlist:
        text = get_def_page(word)
        defs = extract_defs(text)
        ddict[word] = defs
    return ddict

def extract_defs(text):
    """The site formats its definitions as list items <LI>definition</LI>.

    We first look for all of the list items and then strip them of any
    remaining tags (like <ul>, <CITE>, etc.). This is done using simple
    regular expressions, but could probably be done more robustly by
    the method detailed in
    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52281.

    """
    import re

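    clean_defs = []
    # Non-greedy, case-insensitive match that may span lines, so adjacent
    # items on one line are not merged and multi-line items are not skipped.
    LI_re = re.compile(r'<LI\b[^>]*>(.*?)</LI>', re.IGNORECASE | re.DOTALL)
    HTML_re = re.compile(r'<[^>]+>\s*')
    defs = LI_re.findall(text)
    # remove internal tags
    for d in defs:
        clean_d = HTML_re.sub('', d).strip()
        if clean_d:
            clean_defs.append(clean_d)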

    return clean_defs


#--------------------------------------------------------------------                                           
#                                                                                                               
#--------------------------------------------------------------------                                           
if __name__ == "__main__":

    defdict = get_definitions(['monty','python','language'])
    print(defdict)

I needed to look up definitions for a list of words and couldn't find a convenient way to do this programmatically. The code above seems to work well for this purpose, but it relies on simple pattern matching of the page's HTML, so it may not be robust under all circumstances and will break if the website changes the format of its definition pages.
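
The docstring of extract_defs suggests that a real HTML parser might be more robust than regular expressions. A minimal sketch of that idea follows, assuming Python 3 and its standard html.parser module. The names ListItemParser and extract_defs_with_parser are hypothetical and not part of the recipe, and the sample markup is invented for illustration rather than taken from the site.

```python
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Collect the text of every <li> element, ignoring nested tags."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished list-item texts
        self._parts = None   # text chunks of the item being read, or None

    def _finish_item(self):
        # Join the collected chunks, collapse whitespace, keep non-empty text.
        if self._parts is not None:
            text = " ".join("".join(self._parts).split())
            if text:
                self.items.append(text)
            self._parts = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":          # the parser lower-cases tag names
            self._finish_item()  # tolerate a missing </li>
            self._parts = []

    def handle_endtag(self, tag):
        if tag in ("li", "ul", "ol"):
            self._finish_item()

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


def extract_defs_with_parser(text):
    """Hypothetical alternative to extract_defs that avoids regexes."""
    parser = ListItemParser()
    parser.feed(text)
    parser.close()
    parser._finish_item()    # flush an item left open at end of input
    return parser.items


# Invented sample markup, for illustration only.
sample = ("<ul><LI>A venomous <i>snake</i>.</LI>\n"
          "<li class='x'>A person\nregarded as malicious.</li></ul>")
print(extract_defs_with_parser(sample))
# prints ['A venomous snake.', 'A person regarded as malicious.']
```

This sketch does not care about tag case, attributes on the list item, or line breaks inside an item, but it still depends on definitions being marked up as list items, so a change in page format would break it too.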

1 comment

Christian Ergh 20 years, 4 months ago

Some problems with the regular expression.

Hi,

It is always amazing how short and compact Python is. However, I tested the script with several words and saw that it does not show the first definition on the page. Try Viper, for example: the page shows 4 hits, but the script returns only the last 3. The page source for the first hit is different, with the hit being embedded in other HTML code.

Greetings

Chris
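
One plausible cause of the behaviour described in this comment, offered only as an assumption because the page source is not shown here, is that the first list item spans several lines. In a pattern such as `<LI>(.*)</LI>`, the `.` does not match a newline, so an item whose closing tag sits on a later line is skipped. The sketch below uses invented markup, not the real page, to show the difference the re.DOTALL flag makes.

```python
import re

# Hypothetical markup, not copied from the real site: the first item is
# wrapped in extra HTML and spans several lines.
sample = """<ol>
<LI><table><tr><td>
Any of several venomous snakes.
</td></tr></table></LI>
<LI>A treacherous person.</LI>
<LI>A type of military aircraft.</LI>
</ol>"""

line_bound = re.compile(r'<LI>(.*)</LI>')              # '.' stops at newlines
multi_line = re.compile(r'<LI>(.*?)</LI>', re.DOTALL)  # '.' also matches newlines
strip_tags = re.compile(r'<[^>]+>\s*')


def clean(items):
    """Remove remaining tags and surrounding whitespace from each item."""
    return [strip_tags.sub('', item).strip() for item in items]


missed = clean(line_bound.findall(sample))
found = clean(multi_line.findall(sample))

print(len(missed))  # prints 2: the multi-line first item is skipped
print(len(found))   # prints 3: all items are captured
```

If the real cause is different, for example attributes on the opening tag, the pattern would need a different adjustment.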