How to install screp

  1. Download and install ActivePython
  2. Open Command Prompt
  3. Type pypm install screp

Builds listed for version 0.3.2:

  • Python versions: Python 2.7, Python 3.2, Python 3.3
  • Platforms: Windows (32-bit), Windows (64-bit), Mac OS X (10.5+), Linux (32-bit), Linux (64-bit)

License: LGPL
Latest release: version 0.3.2 on Jan 9th, 2014

What is screp?

screp is a command-line utility that provides easy and flexible scraping of HTML documents. It works by finding a set of anchors (specified using a CSS selector) and then extracting information relative to those anchors, optionally post-processing it using a set of standard operations. For each anchor it outputs a record formatted according to one of the supported formats (CSV, JSON or general).

Invoking screp

screp is invoked using the following syntax:

$ screp [OPTION] FORMAT_SPEC PRIMARY_SELECTOR [FILES]

where:

  • FORMAT_SPEC is a format specification, one of:
    • -c CSV_FORMAT_SPEC, formats each record as a comma-separated-values row
    • -j JSON_FORMAT_SPEC, formats each record as a JSON object and the whole output as a list of JSON objects
    • -f GENERAL_FORMAT_SPEC, formats each record according to a general format where computed values are substituted for their specifications (similar to bash parameter substitution)
  • PRIMARY_SELECTOR is a CSS selector that specifies the primary anchor, as detailed below
  • each FILE can be either a local file or an absolute URL; if no FILES are specified, the standard input is read
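
As an illustration of the syntax above, the following Python sketch assembles a screp command line as an argument list. The helper build_screp_argv, the selector a.title, the term and the file name page.html are hypothetical; only the argument order and the -c, -j and -f flags come from the syntax above.

```python
import shlex


def build_screp_argv(format_flag, format_spec, primary_selector, files=()):
    """Assemble: screp FORMAT_SPEC PRIMARY_SELECTOR [FILES] (hypothetical helper)."""
    if format_flag not in ("-c", "-j", "-f"):
        raise ValueError("format flag must be one of -c, -j, -f")
    return ["screp", format_flag, format_spec, primary_selector, *files]


# Hypothetical selector, term and file name, used only for illustration.
argv = build_screp_argv("-c", "$.text | trim", "a.title", ["page.html"])

# shlex.join quotes each argument the way a POSIX shell expects.
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv) would avoid shell quoting issues altogether, since the format specification usually contains characters such as $ and | that a shell would otherwise interpret.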

How does screp work?

screp tries to automate many of the steps taken when writing your own scraper, steps like:

  • fetching the HTML documents, if necessary
  • parsing HTML
  • locating areas of interest in the DOM of the document
  • locating interesting information around those areas
  • simple processing of these pieces of information
  • formatting of the information
  • outputting the information
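
For comparison, here is a minimal sketch of the kind of hand-written scraper these steps describe, using only the Python standard library. It is not how screp is implemented; the HTML snippet, the class names and the ItemParser class are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical input document (the "fetching" step is skipped here).
HTML = """
<ul>
  <li class="item" title="Widget"><span class="price"> 9.99 </span></li>
  <li class="item" title="Gadget"><span class="price"> 19.50 </span></li>
</ul>
"""


class ItemParser(HTMLParser):
    """Collects [title, price] records from <li class="item"> elements."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            # Area of interest: start a new record with the title attribute.
            self.records.append([attrs.get("title", ""), ""])
        elif tag == "span" and attrs.get("class") == "price" and self.records:
            # Interesting information located relative to that area.
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            # Simple processing: drop surrounding whitespace.
            self.records[-1][1] += data.strip()


parser = ItemParser()
parser.feed(HTML)

# Formatting and outputting, one line per record.
for name, price in parser.records:
    print(f"{name}: {price}")
```

Every line of bookkeeping in the parser class corresponds to something that a screp invocation expresses declaratively through a selector and a format specification.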

To use screp, you need to take a series of steps:

  • tell screp where to take the HTML documents from; it works with multiple documents, from sources such as the web, the local file system or STDIN
  • define the primary anchor using a CSS selector: these are elements through which you access records of interest in the HTML documents
  • specify the output format; this implies specifying:
    • terms, which are strings computed relative to the anchors
    • how these terms are combined to produce a record; currently screp supports three methods of specifying formats:
      • CSV
      • JSON
      • general format
  • optionally, you can also define secondary anchors, which are elements computed relative to the primary anchor that can be used to define terms in a more succinct way

Defining terms

A term has the following format:

anchor.accessor.accessor.accessor|filter|filter|filter

In other words, a term is an anchor (primary or secondary) followed by zero or more accessors followed by zero or more filters.
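
The structure of a term can be illustrated with a small Python sketch that splits a term string into its anchor, accessors and filters. This is not screp's own parser; split_top_level and parse_term are hypothetical helpers, and the sketch assumes that quotes inside parameters are never escaped.

```python
def split_top_level(text, sep):
    """Split text on sep, ignoring separators inside quotes or parentheses."""
    parts, current, depth, quote = [], [], 0, None
    for ch in text:
        if quote:
            if ch == quote:
                quote = None  # closing quote
        elif ch in "\"'":
            quote = ch  # opening quote
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_term(term):
    """Return the anchor, accessors and filters of a term string."""
    head, *filters = split_top_level(term, "|")
    anchor, *accessors = split_top_level(head, ".")
    return {"anchor": anchor, "accessors": accessors, "filters": filters}


print(parse_term("$.parent.parent.attr(title)|upper"))
print(parse_term('@.desc(".record").first'))
```

Tracking quotes and parentheses matters because a CSS selector passed as a parameter, such as ".record", contains the same dot that separates accessors.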

Accessors and filters (also collectively called actions) are functions that take the output value of the last function (or the anchor, if this is the first action) and output another value. In other words, they form a pipeline. Accessors act on DOM elements and sets (actually ordered lists) of elements, whereas filters act on strings. Each action has an in_type and an out_type. For a term to be correctly defined the out_type of an action needs to match the in_type of the following action.

The supported types are: 'string', 'element', 'element_set'.

Actions can have zero or more parameters. When the action takes parameters it is specified as a function:

action(parameter1, parameter2, parameter3)

When not, only the action name is specified (no parentheses).

Finally, terms have restrictions on the out_type of their last action (also called the out_type of the term):

  • if a term is used inside a format specification, its out_type must be 'string'
  • if a term is used to define a secondary anchor, its out_type must be 'element'
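
The type rules can be sketched in Python as a walk over the action names of a term. This is only an illustration of the rule, not screp's code: ACTION_TYPES holds a small subset of the predefined actions with the types documented for them, the function names are hypothetical, and the sketch assumes an anchor always has type 'element'.

```python
# (in_type, out_type) for a subset of the predefined actions.
ACTION_TYPES = {
    "first": ("element_set", "element"),
    "parent": ("element", "element"),
    "text": ("element", "string"),
    "attr": ("element", "string"),
    "desc": ("element", "element_set"),
    "upper": ("string", "string"),
    "trim": ("string", "string"),
}


def term_out_type(action_names, anchor_type="element"):
    """Check that each action accepts the previous output; return the final type."""
    current = anchor_type
    for name in action_names:
        in_type, out_type = ACTION_TYPES[name]
        if in_type != current:
            raise TypeError(f"{name} expects {in_type!r}, got {current!r}")
        current = out_type
    return current


def usable_in_format(action_names):
    """A term inside a format specification must output a string."""
    return term_out_type(action_names) == "string"


def usable_as_secondary_anchor(action_names):
    """A term defining a secondary anchor must output an element."""
    return term_out_type(action_names) == "element"


print(term_out_type(["parent", "parent", "attr", "upper"]))
print(usable_as_secondary_anchor(["desc", "first"]))
```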

Examples of terms

These are correct term definitions:

'$.parent.parent.attr(title)|upper' outputs 'string'
'@.desc(".record").first' outputs 'element'
'anchor.ancestors(".box").children(".price")' outputs 'element_set'

Predefined anchors and actions

The following anchors are predefined:

  • $ is the primary anchor defined by the primary anchor selector
  • @ is the anchor representing the root of the current document

The following accessors are predefined:

  • first [in_type='element_set', out_type='element']: returns the first element in an element_set
  • last [in_type='element_set', out_type='element']: returns the last element in an element_set
  • nth(n) [in_type='element_set', out_type='element']: returns the n-th element in an element_set; it also supports negative indexes, where -1 represents the last element, -2 the second-to-last element, and so on
  • class [in_type='element', out_type='string']: returns the value of the 'class' attribute
  • id [in_type='element', out_type='string']: returns the value of the 'id' attribute
  • parent [in_type='element', out_type='element']: returns the parent of the current element
  • text [in_type='element', out_type='string']: returns the text enclosed by the current element
  • tag [in_type='element', out_type='string']: returns the tag of the current element
  • attr(attr_name) [in_type='element', out_type='string']: returns the value of the current element's attribute with name 'attr_name'
  • desc(css_sel) [in_type='element', out_type='element_set']: returns the ordered list of descendants of the current element selected by the CSS selector specified by 'css_sel'
  • fdesc(css_sel) [in_type='element', out_type='element']: equivalent to .desc(css_sel).first
  • ancestors(css_sel) [in_type='element', out_type='element_set']: returns the list of ancestors of the current element that satisfy the CSS selector specified by 'css_sel'
  • children(css_sel) [in_type='element', out_type='element_set']: returns the list of children of the current element that satisfy the CSS selector specified by 'css_sel'
  • psiblings(css_sel) [in_type='element', out_type='element_set']: returns the list of preceding siblings of the current element that satisfy the CSS selector specified by 'css_sel'
  • fsiblings(css_sel) [in_type='element', out_type='element_set']: returns the list of following siblings of the current element that satisfy the CSS selector specified by 'css_sel'
  • siblings(css_sel) [in_type='element', out_type='element_set']: returns the list of siblings of the current element that satisfy the CSS selector specified by 'css_sel'
  • matching(css_sel) [in_type='element_set', out_type='element_set']: filters an element_set and returns all elements that match the CSS selector specified by 'css_sel'

The following filters are predefined:

  • upper [in_type='string', out_type='string']: converts string to uppercase
  • lower [in_type='string', out_type='string']: converts string to lowercase
  • trim [in_type='string', out_type='string']: removes spaces at the beginning and end of the string
  • strip(chars) [in_type='string', out_type='string']: removes characters specified by 'chars' at the beginning and end of the string
  • replace(old, new) [in_type='string', out_type='string']: replaces all occurrences of 'old' with 'new'
  • resub(pattern, repl) [in_type='string', out_type='string']: performs a regular expression substitution; pattern and repl have the formats accepted by the re.sub function from the Python standard library
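
The filters have close counterparts among Python's string methods, which the following sketch uses to show how a chain of filters transforms a value. The FILTERS table and apply_filters are hypothetical, and the mapping is an assumption: in particular, trim is modelled with str.strip(), which removes all surrounding whitespace and may differ from what screp does.

```python
import re

# Hypothetical Python equivalents of the predefined filters.
FILTERS = {
    "upper": lambda s: s.upper(),
    "lower": lambda s: s.lower(),
    "trim": lambda s: s.strip(),  # assumption: all surrounding whitespace
    "strip": lambda s, chars: s.strip(chars),
    "replace": lambda s, old, new: s.replace(old, new),
    "resub": lambda s, pattern, repl: re.sub(pattern, repl, s),
}


def apply_filters(value, steps):
    """Apply (name, args) steps left to right, like term|filter|filter."""
    for name, args in steps:
        value = FILTERS[name](value, *args)
    return value


# Roughly: term | trim | resub("[^0-9.]", "")
price = apply_filters("  Price: $12.00 ", [("trim", ()), ("resub", (r"[^0-9.]", ""))])
print(price)
```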

Specifying output formats

CSV format

The CSV output format is specified using the -c option. Optionally, using the -H option you can specify a CSV header to output before outputting records.

Example:

-c '$.attr(title), $.parent.desc(".price").text | trim' -H 'name, price'
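
To show how a header and a set of records fit together as CSV, here is a sketch using Python's csv module. The record values are invented stand-ins for computed terms, and the quoting shown is that of the csv module; whether screp quotes fields the same way is not stated here.

```python
import csv
import io

# Header as it would be passed to -H, split into column names.
header = [name.strip() for name in "name, price".split(",")]

# Hypothetical computed values, one row per anchor.
records = [["Widget", "9.99"], ["Gadget, large", "19.50"]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(records)
print(buffer.getvalue(), end="")
```

A value that contains a comma, like the second name, must be quoted for the row to keep two columns, which is why a CSV writer is preferable to joining values by hand.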

JSON format

The JSON output format is specified using the -j option. It formats the output as a JSON list of objects, one for each record. The --indent-json flag tells screp to indent each object. The format is specified as a comma-separated list of key=value pairs, where key is the JSON key in the record object and value is a term specification.

Example:

-j 'text=$.text, ptext=$.parent.text | upper, gptext=$.parent.parent.text'
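
The key=value structure of the example can be made concrete with a short Python sketch that splits the specification into keys and terms and then serialises some records. The record values are invented, the indentation width is arbitrary, and the naive split on commas is an assumption that holds only while no term contains a comma in its parameters.

```python
import json

spec = "text=$.text, ptext=$.parent.text | upper, gptext=$.parent.parent.text"

# Map each JSON key to its term specification.
terms = {}
for item in spec.split(","):
    key, term = item.split("=", 1)
    terms[key.strip()] = term.strip()

# Hypothetical computed values for two anchors.
records = [
    {"text": "one", "ptext": "PARENT ONE", "gptext": "grandparent"},
    {"text": "two", "ptext": "PARENT TWO", "gptext": "grandparent"},
]

# The whole output is a JSON list with one object per record.
print(json.dumps(records, indent=2))
```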

General format

The general format is specified by a general string containing term specifications. To distinguish them from the literal text, term specifications are surrounded by braces. When formatting a record, each term specification is substituted with the computed value for that term.

Example:

-f 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some tail'
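
The substitution step can be sketched in Python with a regular expression that finds each brace-delimited term and replaces it with a computed value. The render function and the values in the lookup table are hypothetical; in screp the values would come from evaluating each term against the current anchor.

```python
import re


def render(format_spec, evaluate):
    """Replace every {term} in format_spec with evaluate(term)."""
    return re.sub(
        r"\{([^{}]*)\}",
        lambda match: evaluate(match.group(1).strip()),
        format_spec,
    )


# Hypothetical computed values, keyed by term specification.
values = {
    '$.parent.text | replace("X", "Y")': "parent text",
    "$.tag": "a",
}

spec = 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some tail'
print(render(spec, values.__getitem__))
```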

Specifying secondary anchors

Secondary anchors are specified using the -a option. There can be any number of secondary anchor definitions. Each definition has the format <name>=<term>, where <name> is an identifier and <term> is a term definition that is relative to any of the previously defined anchors (primary or secondary) and outputs an element. Secondary anchors can be redefined in later -a options, but only the last definition is retained.

Secondary anchor examples

These are examples of secondary anchor definitions:

-a 'p=$.parent' -a 'gp=p.parent'

-a 'interesting=$.fdesc(".interesting-class")' -a 'interesting=interesting.parent'
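
One way to read these definitions is to expand each secondary anchor into a term relative to a predefined anchor, processing the -a options in order. The Python sketch below does that; expand_anchors is a hypothetical helper that illustrates the semantics described above and is not taken from screp.

```python
def expand_anchors(definitions, predefined=("$", "@")):
    """Expand <name>=<term> definitions, in order, into terms on predefined anchors."""
    expanded = {}
    for definition in definitions:
        name, _, term = definition.partition("=")
        # The part before the first dot is the anchor the term starts from.
        base, dot, rest = term.strip().partition(".")
        if base in expanded:
            base = expanded[base]  # a previously defined secondary anchor
        elif base not in predefined:
            raise ValueError(f"unknown anchor {base!r} in {definition!r}")
        # A later definition of the same name replaces the earlier one.
        expanded[name.strip()] = base + dot + rest
    return expanded


print(expand_anchors(["p=$.parent", "gp=p.parent"]))
print(expand_anchors(['interesting=$.fdesc(".interesting-class")',
                      "interesting=interesting.parent"]))
```

In the second call the redefinition refers to the earlier value of interesting, so the retained definition is the descendant's parent rather than the descendant itself.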

Last updated Jan 9th, 2014
