html2data | Python Package Manager Index (PyPM)

How to install html2data

1. Download and install ActivePython
2. Open Command Prompt
3. Type pypm install html2data

Python 2.7

Python 3.2

Python 3.3

Windows (32-bit)

0.4.3	Available	View build log
0.4.2	Available	View build log
0.4.1	Failed	View build log
0.4	Failed	View build log
0.3	Available	View build log
0.2	Available	View build log

Windows (64-bit)

0.4.3	Available	View build log
0.4.2	Available	View build log
0.4.1	Failed	View build log
0.4	Failed	View build log
0.3	Available	View build log
0.2	Available	View build log

Mac OS X (10.5+)

0.4.3	Available	View build log
0.4.2	Available	View build log
0.4.1	Failed	View build log
0.4	Failed	View build log
0.3	Available	View build log
0.2	Available	View build log

Linux (32-bit)

0.4.3	Available	View build log
0.4.2	Available	View build log
0.4.1	Failed	View build log
0.4	Failed	View build log
0.3	Available	View build log
0.2	Available	View build log

Linux (64-bit)

0.4.3	Available	View build log
0.4.2	Available	View build log
0.4.1	Failed	View build log
0.4	Failed	View build log
0.3	Available	View build log
0.2	Available	View build log

html2data html data xpath crawler transform

Author

Daniel Perez Rada

License

BSD

Dependencies

Imports

html2data
tests

Latest release

version 0.4.3 on Nov 26th, 2011

Welcome to Html2Data

Author:	Daniel Perez Rada <dperezrada@gmail.com>

Description

A simple way to transform an HTML file or URL into structured data. You only need to define the XPath to each element. Optionally, you can define functions to be applied after extraction. An easy way to write the XPath is the "Copy XPath" command of the Firebug extension; I recommend editing the XPath that Firebug gives you to make it shorter.
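
To make the idea concrete, here is a minimal sketch of config-driven extraction with an optional function applied afterwards. It uses only the standard library and is not html2data's own API: the `path` and `post` keys and the `extract` helper are hypothetical names chosen for illustration, and the sample page is well-formed because the standard-library parser is strict.

```python
import xml.etree.ElementTree as ET

# Well-formed sample page (the standard-library parser rejects broken HTML).
PAGE = """<html><head><title>Example Page</title></head>
<body><h1><b> Title </b></h1>
<div class="description">Some text</div></body></html>"""

# Each entry names a field and the path to its element.
# 'post' is a hypothetical key: a function applied to the extracted text.
CONFIG = [
    {'name': 'header_title', 'path': './head/title'},
    {'name': 'body_title', 'path': './/h1/b', 'post': str.strip},
    {'name': 'description', 'path': ".//div[@class='description']", 'post': str.upper},
]


def extract(page, config):
    """Return a dict mapping each field name to its (post-processed) text."""
    root = ET.fromstring(page)
    result = {}
    for field in config:
        text = root.findtext(field['path'], default='')
        post = field.get('post')
        result[field['name']] = post(text) if post else text
    return result


data = extract(PAGE, CONFIG)
print(data)
```

The point of the design is that the page layout lives in data (the config list) rather than in code, so adapting to a changed page usually means editing a path, not the extraction logic.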

Example

Import

>>> from html2data import HTML2Data

Create instance

>>> html = """<!DOCTYPE html><html lang="en"><head>
        <meta charset="utf-8" />
        <title>Example Page</title>
        <link rel="stylesheet" href="css/main.css" type="text/css" />
                </head>
                <body>
                <h1><b>Title</b></h1>
                <div class="description">This is not a valid HTML
                </body>
        </html>"""
>>> h2d_instance = HTML2Data(html = html)  # You can also create it from a URL by passing url = url
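
The sample page above deliberately leaves the div unclosed. As a rough illustration of why a tolerant HTML parser matters (html2data lists lxml as a requirement), the following standard-library sketch compares a strict XML parse with a lenient event-based parse of similar markup. `strict_parse_ok` and `TextCollector` are hypothetical helper names, and this does not show how html2data parses internally.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Same flaw as the sample page: the <div> is never closed.
BROKEN = ('<html><body><h1><b>Title</b></h1>'
          '<div class="description">This is not a valid HTML</body></html>')


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Lenient parser that records every non-blank text node it sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


collector = TextCollector()
collector.feed(BROKEN)
collector.close()

print(strict_parse_ok(BROKEN))
print(collector.chunks)
```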

Using XPATH config

Once you have the object, define a config and parse it:

>>> config = [
    {'name': 'header_title', 'xpath': '//head/title/text()'},
    {'name': 'body_title', 'xpath': '//h1/b/text()'},
    {'name': 'description', 'xpath': '//div[@class="description"]/text()'},
]

>>> h2d_instance.parse(config = config)
{'header_title': 'Example Page', 'body_title': 'Title', 'description': 'This is not a valid HTML'}

Using CSS SELECTOR config

>>> config = [
        {'name': 'header_title', 'css': 'head title'},
        {'name': 'body_title', 'css': 'h1 b'},
        {'name': 'description', 'css': 'div.description'},
    ]

>>> h2d_instance.parse(config = config)
{'header_title': 'Example Page', 'body_title': 'Title', 'description': 'This is not a valid HTML'}
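
Both configs describe the same queries in two notations. The sketch below shows that correspondence for the three selectors used here by translating simple CSS selectors into path expressions for the standard library's ElementTree. `css_to_path` is a hypothetical helper that handles only tag names, a single `.class`, and descendant combinators; it is an illustration, not a description of how html2data resolves CSS selectors.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the sample page (ElementTree is a strict parser).
PAGE = """<html><head><title>Example Page</title></head>
<body><h1><b>Title</b></h1>
<div class="description">This is a description</div></body></html>"""


def css_to_path(selector):
    """Translate a simple descendant CSS selector into an ElementTree path.

    Supports `tag`, `tag.class` and whitespace (descendant) combinators.
    The class test is an exact match on the whole class attribute.
    """
    steps = []
    for part in selector.split():  # split() also drops stray whitespace
        tag, _, css_class = part.partition('.')
        step = tag or '*'
        if css_class:
            step += "[@class='%s']" % css_class
        steps.append(step)
    return './/' + '//'.join(steps)


root = ET.fromstring(PAGE)
for selector in ('head title', 'h1 b', 'div.description'):
    path = css_to_path(selector)
    print(selector, '->', path, '->', root.findtext(path))
```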

Real life example

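# Python 2 example (urllib2 and the print statement are Python 2 only)
import urllib2

from html2data import HTML2Data

# Download the raw HTML of the page
response = urllib2.urlopen('http://sil.senado.cl/cgi-bin/sil_ultproy.pl')
html = response.read()

# One entry per table column, selected by its position within a row
config = [
    {'name': 'fecha', 'css': 'td:nth-child(1)'},
    {'name': 'id', 'css': 'td:nth-child(2) a'},
    {'name': 'nombre', 'css': 'td:nth-child(3)'},
    {'name': 'estado', 'css': 'td:nth-child(4)'},
]

html_instance = HTML2Data(html = html)
# Select every matching table row; the results are passed on as trees below
rows = html_instance.parse_one(css = 'td td tr', multiple = True, text = False)
for row_element in rows:
    # Wrap each row so the column selectors apply to that row only
    row_in_html = HTML2Data(tree = row_element)
    print row_in_html.parse(config = config)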

You will get something like:

{'nombre': 'Reforma Constitucional que restablece obligatoriedad del voto.', 'fecha': '24/11/2011', 'estado': 'En tramitación', 'id': '8062-07'}
...
{'nombre': 'Prohíbe el anatocismo.', 'fecha': '02/11/2011', 'estado': 'En tramitación', 'id': '8007-03'}
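
The pattern in this example is two-stage: first select the rows, then apply the same column config to each row on its own. A minimal standard-library sketch of that pattern follows. The table is a hand-written stand-in built from the two output rows shown above, not the real page, and `COLUMNS` and `parse_rows` are hypothetical names. The positional paths `td[1]` to `td[4]` play the role of `td:nth-child(n)`, which is equivalent here only because every child of a row is a `td`.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the real page, built from the sample output.
TABLE = """<table>
<tr><td>24/11/2011</td><td><a>8062-07</a></td><td>Reforma Constitucional que restablece obligatoriedad del voto.</td><td>En tramitación</td></tr>
<tr><td>02/11/2011</td><td><a>8007-03</a></td><td>Prohíbe el anatocismo.</td><td>En tramitación</td></tr>
</table>"""

# Field name and path relative to a single row.
COLUMNS = [
    ('fecha', 'td[1]'),
    ('id', 'td[2]/a'),
    ('nombre', 'td[3]'),
    ('estado', 'td[4]'),
]


def parse_rows(markup):
    """Return one dict per table row, using paths relative to that row."""
    table = ET.fromstring(markup)
    return [
        {name: row.findtext(path) for name, path in COLUMNS}
        for row in table.iter('tr')
    ]


rows = parse_rows(TABLE)
for row in rows:
    print(row['fecha'], row['id'])
```

Keeping the paths relative to the row is what prevents values from different rows being mixed together, which is the reason the original example builds a new HTML2Data object per row.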

Requirements

lxml 2.0+

httplib2

Tests

Requirements

ludibrio

nose

Run

$ nosetests

html2data 0.4.3 (experimental)

A simple way to transform an HTML file or URL to structured data.
