Retrieve word definitions from online dictionary site (Python recipe) by gyro funch
ActiveState Code (http://code.activestate.com/recipes/211886/)

This recipe shows how to retrieve word definitions from the website www.dictionary.com.

"""The following routines are specific to queries to
www.dictionary.com (as of 2003-07-23)."""

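def get_def_page(word):
    """Retrieve the definition page for the word of interest."""
    try:  # Python 3
        from urllib.request import urlopen
        from urllib.parse import quote
    except ImportError:  # Python 2
        from urllib import urlopen, quote
    # Quote the word so spaces and other special characters are URL-safe.
    url = "http://www.dictionary.com/cgi-bin/dict.pl?term=%s" % quote(word)
    fo = urlopen(url)
    try:
        # Decode so the text regular expressions in extract_defs() also work
        # on Python 3; the page encoding is assumed to be UTF-8.
        page = fo.read().decode("utf-8", "replace")
    finally:
        fo.close()
    return page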

def get_definitions(wlist):
    """Return a dictionary mapping each word (key) to its list of
    definitions (value).
    """
    ddict = {}
    for word in wlist:
        text = get_def_page(word)
        defs = extract_defs(text)
        ddict[word] = defs
    return ddict

def extract_defs(text):
    """The site formats its definitions as list items: <LI>definition</LI>.

    We first look for all of the list items and then strip them of any
    remaining tags (like <ul>, <CITE>, etc.). This is done using simple
    regular expressions, but could probably be done more robustly by
    the method detailed in
    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52281.
    """
    import re

    clean_defs = []
    LI_re = re.compile(r'<LI>(.*)</LI>')
    HTML_re = re.compile(r'<[^>]+>\s*')
    defs = LI_re.findall(text)
    # Remove internal tags and skip items that end up empty.
    for d in defs:
        clean_d = HTML_re.sub('', d)
        if clean_d:
            clean_defs.append(clean_d)

    return clean_defs


#--------------------------------------------------------------------
# Example usage
#--------------------------------------------------------------------
if __name__ == "__main__":
    defdict = get_definitions(['monty', 'python', 'language'])
    print(defdict)

      

I had a need to look up definitions for a list of words and couldn't find a convenient way to do this programmatically. The code above seems to work well for this purpose, but may not be robust under all circumstances. It will also break if the queried website changes its definition page format.
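
As a rough sketch of the more robust, parser-based approach hinted at in the extract_defs docstring, the following uses the standard html.parser module from Python 3 instead of regular expressions. The names ListItemText and extract_defs_parsed are hypothetical, and the sample HTML is invented for illustration; it is not the real markup of the site.

```python
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Collect the visible text of each <li> element, ignoring nested tags."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished list items
        self._parts = None   # text fragments of the open item, or None

    def _close_item(self):
        # Store the current item (if any) with whitespace normalized.
        if self._parts is not None:
            text = " ".join("".join(self._parts).split())
            if text:
                self.items.append(text)
            self._parts = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":          # HTMLParser lowercases tag names
            self._close_item()   # tolerate a missing </li>
            self._parts = []

    def handle_endtag(self, tag):
        if tag in ("li", "ul", "ol"):
            self._close_item()

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


def extract_defs_parsed(text):
    """Hypothetical drop-in alternative to extract_defs()."""
    parser = ListItemText()
    parser.feed(text)
    parser.close()
    parser._close_item()  # flush an item left open at end of input
    return parser.items


# Invented sample: mixed-case tags, a nested tag, a line break, a missing </li>.
sample = "<OL><LI>A <I>venomous</I>\nsnake.</LI><li>A malicious person</OL>"
print(extract_defs_parsed(sample))
# ['A venomous snake.', 'A malicious person']
```

Because the parser tracks tags rather than matching text, it does not depend on tag case or on each item sitting on a single line, although it would still need adjusting if the site stopped using list items for definitions.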

1 comment

Christian Ergh 20 years, 4 months ago

Some problems with the regular expression.

Hi,

Always amazing how short and compact Python is. However, I tested the script with several words and saw that it does not show the first definition on the page. Try "viper", for example: the page lists 4 hits, but only the last 3 are shown. The page source for the first hit is different, with the hit being embedded in other HTML code.

Greetings

Chris
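
One possible cause, sketched with made-up markup because the actual page source is not shown here: the recipe's pattern <LI>(.*)</LI> is case-sensitive, and "." does not match newlines, so an item written as <li> or spread over several lines is skipped. The relaxed pattern and the sample page below are illustrative assumptions, not a confirmed fix for the real site.

```python
import re

# Hypothetical markup: the first item is lowercase and spans two lines.
page = ("<ol><li><b>viper</b>\nA venomous snake.</li>\n"
        "<LI>A treacherous person.</LI></ol>")

# Pattern as used in the recipe.
strict = re.compile(r'<LI>(.*)</LI>')
# Assumed alternative: any case, optional attributes, non-greedy, multi-line.
relaxed = re.compile(r'<li[^>]*>(.*?)</li>', re.IGNORECASE | re.DOTALL)

print(len(strict.findall(page)))   # 1
print(len(relaxed.findall(page)))  # 2
```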

Created by gyro funch on Wed, 23 Jul 2003 (PSF)

Required Modules

urllib
re

Other Information and Tasks

Licensed under the PSF License
Revision 1
