Welcome, guest | Sign In | My Account | Store | Cart

Notice! PyPM is being replaced with the ActiveState Platform, which enhances PyPM’s build and deploy capabilities. Create your free Platform account to download ActivePython or customize Python with the packages you require and get automatic updates.

Download
ActivePython
INSTALL>
pypm install serpextract

How to install serpextract

  1. Download and install ActivePython
  2. Open Command Prompt
  3. Type pypm install serpextract
 Python 2.7Python 3.2Python 3.3
Windows (32-bit)
Windows (64-bit)
Mac OS X (10.5+)
Linux (32-bit)
Linux (64-bit)
0.2.5 Available View build log
 
License
LICENSE.txt
Imports
Lastest release
version 0.2.5 on Jan 9th, 2014
https://travis-ci.org/Parsely/serpextract.png?branch=master

serpextract provides easy extraction of keywords from search engine results pages (SERPs).

This module is possible in large part to the very hard work of the Piwik team. Specifically, we make extensive use of their list of search engines.

Installation

Latest release on PyPI:

$ pip install serpextract

Or the latest development version (not recommended):

$ pip install -e git://github.com/Parsely/serpextract.git#egg=serpextract

Usage

Command Line

Command-line usage, returns the engine name and keyword components separated by a comma and enclosed in quotes:

$ serpextract "http://www.google.ca/url?sa=t&rct=j&q=ars%20technica"
"Google","ars technica"

You can also print out a list of all the SearchEngineParsers currently available in your local cache via:

$ serpextract -l
Python

System Message: ERROR/3 (<string>, line 43)

Unknown directive type "code-block".

.. code-block:: python

    from serpextract import get_parser, extract, is_serp, get_all_query_params

    non_serp_url = 'http://arstechnica.com/'
    serp_url = ('http://www.google.ca/url?sa=t&rct=j&q=ars%20technica&source=web&cd=1&ved=0CCsQFjAA'
                '&url=http%3A%2F%2Farstechnica.com%2F&ei=pf7RUYvhO4LdyAHf9oGAAw&usg=AFQjCNHA7qjcMXh'
                'j-UX9EqSy26wZNlL9LQ&bvm=bv.48572450,d.aWc')

    get_all_query_params()
    # ['key', 'text', 'search_for', 'searchTerm', 'qrs', 'keyword', ...]

    is_serp(serp_url)
    # True
    is_serp(non_serp_url)
    # False

    get_parser(serp_url)
    # SearchEngineParser(engine_name='Google', keyword_extractor=['q'], link_macro='search?q={k}', charsets=['utf-8'])
    get_parser(non_serp_url)
    # None

    extract(serp_url)
    # ExtractResult(engine_name='Google', keyword=u'ars technica', parser=SearchEngineParser(...))
    extract(non_serp_url)
    # None

Naive Detection

The list of search engine parsers that Piwik and therefore serpextract uses is far from exhaustive. If you want serpextract to attempt to guess if a given referring URL is a SERP, you can specify use_naive_method=True to serpextract.is_serp or serpextract.extract. By default, the naive method is disabled.

Naive search engine detection tries to find an instance of r'\.?search\.' in the netloc of a URL. If found, serpextract will then try to find a keyword in the query portion of the URL by looking for the following params in order:

_naive_params = ('q', 'query', 'k', 'keyword', 'term',)

If one of these are found, a keyword is extracted and an ExtractResult is constructed as:

ExtractResult(domain, keyword, None)  # No parser, but engine name and keyword

System Message: ERROR/3 (<string>, line 87)

Unknown directive type "code-block".

.. code-block:: python

    # Not a recognized search engine by serpextract
    serp_url = 'http://search.piccshare.com/search.php?cat=web&channel=main&hl=en&q=test'

    is_serp(serp_url)
    # False

    extract(serp_url)
    # None

    is_serp(serp_url, use_naive_method=True)
    # True

    extract(serp_url, use_naive_method=True)
    # ExtractResult(engine_name=u'piccshare', keyword=u'test', parser=None)

Custom Parsers

In the event that you have a custom search engine that you'd like to track which is not currently supported by Piwik/serpextract, you can create your own instance of serpextract.SearchEngineParser and either pass it explicitly to either serpextract.is_serp or serpextract.extract or add it to the internal list of parsers.

System Message: ERROR/3 (<string>, line 112)

Unknown directive type "code-block".

.. code-block:: python

    # Create a parser for PiccShare
    from serpextract import SearchEngineParser, is_serp, extract

    my_parser = SearchEngineParser(u'PiccShare',          # Engine name
                                   u'q',                  # Keyword extractor
                                   u'/search.php?q={k}',  # Link macro
                                   u'utf-8')              # Charset
    serp_url = 'http://search.piccshare.com/search.php?cat=web&channel=main&hl=en&q=test'

    is_serp(serp_url)
    # False

    extract(serp_url)
    # None

    is_serp(serp_url, parser=my_parser)
    # True

    extract(serp_url, parser=my_parser)
    # ExtractResult(engine_name=u'PiccShare', keyword=u'test', parser=SearchEngineParser(engine_name=u'PiccShare', keyword_extractor=[u'q'], link_macro=u'/search.php?q={k}', charsets=[u'utf-8']))


You can also permanently add a custom parser to the internal list of parsers that serpextract maintains so that you no longer have to explicitly pass a parser object to serpextract.is_serp or serpextract.extract.

System Message: ERROR/3 (<string>, line 140)

Unknown directive type "code-block".

.. code-block:: python

    from serpextract import SearchEngineParser, add_custom_parser, is_serp, extract

    my_parser = SearchEngineParser(u'PiccShare',          # Engine name
                                   u'q',                  # Keyword extractor
                                   u'/search.php?q={k}',  # Link macro
                                   u'utf-8')              # Charset
    add_custom_parser(u'search.piccshare.com', my_parser)

    serp_url = 'http://search.piccshare.com/search.php?cat=web&channel=main&hl=en&q=test'
    is_serp(serp_url)
    # True

    extract(serp_url)
    # ExtractResult(engine_name=u'PiccShare', keyword=u'test', parser=SearchEngineParser(engine_name=u'PiccShare', keyword_extractor=[u'q'], link_macro=u'/search.php?q={k}', charsets=[u'utf-8']))


Tests

There are some basic tests for popular search engines, but more are required:

$ pip install -r requirements.txt
$ nosetests

Caching

Internally, this module caches an OrderedDict representation of Piwik's list of search engines which is stored in serpextract/search_engines.pickle. This isn't intended to change that often and so this module ships with a cached version.

Subscribe to package updates

Last updated Jan 9th, 2014

What does the lock icon mean?

Builds marked with a lock icon are only available via PyPM to users with a current ActivePython Business Edition subscription.

Need custom builds or support?

ActivePython Enterprise Edition guarantees priority access to technical support, indemnification, expert consulting and quality-assured language builds.

Plan on re-distributing ActivePython?

Get re-distribution rights and eliminate legal risks with ActivePython OEM Edition.