Welcome, guest | Sign In | My Account | Store | Cart

Notice! PyPM is being replaced with the ActiveState Platform, which enhances PyPM’s build and deploy capabilities. Create your free Platform account to download ActivePython or customize Python with the packages you require and get automatic updates.

Download
ActivePython

ragstoriches is unavailable in PyPM, because there aren't any builds for it in the package repositories. Click the linked icons to find out why.

 Python 2.7Python 3.2Python 3.3
Windows (32-bit)
Windows (64-bit)
Mac OS X (10.5+)
Linux (32-bit)
Linux (64-bit)
 
Links
License
MIT

ragstoriches is a combined library/framework to ease writing web scrapers using gevent and requests.

A simple example to tell the story:

System Message: ERROR/3 (<string>, line 9)

Unknown directive type "code-block".

.. code-block:: python

   import re

   from lxml.html import document_fromstring
   from ragstoriches.scraper import Scraper

   scraper = Scraper(__name__)

   @scraper
   def index(requests, url='http://eastidaho.craigslist.org/search/act?query=+'):
       html = document_fromstring(requests.get(url).content)

       for ad_link in html.cssselect('.row a'):
           yield 'posting', ad_link.get('href')

       nextpage = html.cssselect('.nextpage a')
       if nextpage:
           yield 'index', nextpage[0].get('href')


   @scraper
   def posting(requests, url):
       html = document_fromstring(requests.get(url).content)

       title = html.cssselect('.postingtitle')[0].text.strip()
       id = re.findall(r'\d+', html.cssselect('div.postinginfos p')[0].text)[0]
       date = html.cssselect('.postinginfos date')[0].text.strip()
       body = html.cssselect('#postingbody')[0].text_content().strip()

       print title
       print '=' * len(title)
       print 'post *%s*, posted on %s' % (id, date)
       print body
       print

Install the library and lxml, cssselect using pip install ragstoriches lxml cssselect, then save the above as craigs.py, finally run with ragstoriches craigs.py.

You will get a bunch of jumbled output, so next step is redirecting stdout to a file:

ragstoriches craigs.py > output.md

Try giving different urls for this scraper on the command-line:

System Message: ERROR/3 (<string>, line 57)

Unknown directive type "code-block".

.. code-block:: sh

   ragstoriches craigs.py http://newyork.craigslist.org/mnh/acc/  > output.md  # hustle
   ragstoriches craigs.py http://orangecounty.craigslist.org/wet/ > output.md  # writing OC
   ragstoriches craigs.py http://seattle.craigslist.org/w4m/      > output.md  # sleepless in seattle

There are a lot of commandline-options available, see ragstoriches --help for a list.

Parsing HTML

ragstoriches works with almost any kind of HTML-parsing library, while using lxml is recommend, you can easily use BeautifulSoup4 or another library (lxml, in my tests, turned out to be about five times as fast as BeautifulSoup though and the CSS-like selectors are a joy to use).

Writing scrapers

A scraper module consists of some initialization code and a number of subscrapers. Scraping starts by calling the a scraper named index (see the example above).

Scrapers make use of dependency injection - the argument names are looked up in a scope and filled with the relevant instance. This means that if your subscraper takes an argument called url, it will always be the URL to scrape, requests always a pool instance and so on.

The following predefined injections are available:

  • requests: A requests session. Can be treated like the top-level API of requests. As long as you use it for fetching webpages, you never have to worry about blocking or exceeding concurrency limits.
  • url: The url to scrape and parse.
  • data: A callback for passing data out of the scraper. See the example below.

Return values of scrapers are ignored. However, if a scraper is a generater (i.e. contains a yield statement), it should yield tuples of subscraper_name, url or subscraper_name, url, context. These are added to the queue of jobs to scrape, the contents of context are added to the scope for all following subscraper calls.

The url-yield is urlparse.urljoin-ed onto the url passed into the scraper, this means that you do not have to worry about relative links, they just work.

Using receivers

You can decouple scraping/parsing and the actual processing of the resulting data by using receivers. Let's rewrite the example above a slight bit by replacing the second function with this:

System Message: ERROR/3 (<string>, line 117)

Unknown directive type "code-block".

.. code-block:: python

   @scraper
   def posting(requests, url, data):
       html = document_fromstring(requests.get(url).content)

       data('posting',
           title=html.cssselect('.postingtitle')[0].text.strip(),
           id=re.findall(r'\d+', html.cssselect('div.postinginfos p')[0].text)[0],
           date=html.cssselect('.postinginfos date')[0].text.strip(),
           body=html.cssselect('#postingbody')[0].text_content().strip(),
       )

Two differences: We inject data as an argument and instead of printing our data, we pass it to the new callable.

When you call data, the first argument is the name of a subreceiver and everything passed into it gets passed on to every receiver loaded. We don't load any receivers, so running the scraper will do nothing but fill up the data-queue.

To illustrate, put the following into a file called printer.py:

System Message: ERROR/3 (<string>, line 140)

Unknown directive type "code-block".

.. code-block:: python

   from ragstoriches.receiver import Receiver

   receiver = Receiver(__name__)


   @receiver
   def posting(**kwargs):
       print 'New posting: %r' % kwargs

Afterwards, run ragstoriches -q craigs.py printer.py. The result will be that the receiver prints the extracted data to stdout, nicely decoupling extraction and processing.

Caching

You can transparently cache downloaded data, this is especially useful when developing. Simply pass --cache some_name to ragstoriches, which will use requests-cache for caching.

Usage as a library

You can use ragstoriches as a library (instead of as a framework, by using the commandline utilities) as well, but there is no detailed documentation. Drop me a line if this is important for you.

Subscribe to package updates

What does the lock icon mean?

Builds marked with a lock icon are only available via PyPM to users with a current ActivePython Business Edition subscription.

Need custom builds or support?

ActivePython Enterprise Edition guarantees priority access to technical support, indemnification, expert consulting and quality-assured language builds.

Plan on re-distributing ActivePython?

Get re-distribution rights and eliminate legal risks with ActivePython OEM Edition.