metadata_parser 0.5.1 (experimental)

A module to parse metadata out of documents

How to install metadata_parser

1. Download and install ActivePython
2. Open Command Prompt
3. Type: pypm install metadata-parser

Python 2.7

Python 3.2

Python 3.3

Platform          Available builds
Windows (32-bit)  0.5.1, 0.4.12, 0.4.4
Windows (64-bit)  0.5.1, 0.4.12, 0.4.4
Mac OS X (10.5+)  0.5.1, 0.4.12, 0.4.4
Linux (32-bit)    0.5.1, 0.4.12, 0.4.11, 0.4.4
Linux (64-bit)    0.5.1, 0.4.12, 0.4.11, 0.4.4

Keywords: opengraph, protocol, facebook

Author

Jonathan Vanasco

License

MIT

Dependencies

Imports

metadata_parser

Latest release

version 0.5.1 on May 21st, 2013

MetadataParser is a Python module for pulling metadata out of web documents.

It requires BeautifulSoup, and was largely based on Erik River's opengraph module (https://github.com/erikriver/opengraph).

I needed something more aggressive than Erik's module, so I had to fork.

Installation

pip install metadata_parser

Features

It pulls as much metadata out of a document as possible.
You can set a 'strategy' for finding metadata (e.g., only accept opengraph or page attributes).

Notes

This requires BeautifulSoup 3 or 4. If it can import bs4, it uses that; otherwise it tries BeautifulSoup (3).
For speed, it instantiates the BeautifulSoup parser with lxml, and falls back to 'none' (the internal pure-Python parser) if it can't load lxml.
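
The import fallback described in these notes can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the helper name `first_importable` and the variable names are assumptions made for the example, and it runs even when neither BeautifulSoup nor lxml is installed.

```python
import importlib


def first_importable(candidates):
    """Return (name, module) for the first importable candidate.

    Hypothetical helper for illustration; returns (None, None) when
    none of the candidates can be imported.
    """
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    return None, None


# Prefer BeautifulSoup 4 (bs4), then fall back to BeautifulSoup 3.
soup_name, soup_module = first_importable(["bs4", "BeautifulSoup"])

# Prefer lxml for speed; None stands in for the built-in pure-Python parser.
lxml_name, _ = first_importable(["lxml"])
parser_name = "lxml" if lxml_name else None
```

Trying the preferred module first and catching `ImportError` keeps the optional dependency optional: the slower path is only taken when the faster one is missing.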

The default 'strategy' is to look in this order::

    og,dc,meta,page

    og = OpenGraph
    dc = DublinCore
    meta = metadata
    page = page elements

You can specify a strategy as a comma-separated list of the above.
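
To make the idea of a strategy concrete, here is a small sketch of an ordered lookup over a metadata dictionary. It is an assumption about how such a lookup could work, not the module's implementation: `lookup_field`, `DEFAULT_STRATEGY`, and the `sample` data are all hypothetical names invented for this example.

```python
DEFAULT_STRATEGY = "og,dc,meta,page"


def lookup_field(metadata, field, strategy=DEFAULT_STRATEGY):
    """Return the first non-empty value for `field`, trying each source
    named in the comma-separated `strategy` in order.

    Hypothetical helper; `metadata` maps a source name ('og', 'dc',
    'meta', 'page') to a dict of field values.
    """
    for source in strategy.split(","):
        value = metadata.get(source.strip(), {}).get(field)
        if value:
            return value
    return None


# Made-up sample data: no OpenGraph title, but the other sources have one.
sample = {
    "og": {},
    "dc": {"title": "DC title"},
    "meta": {"title": "Meta title"},
    "page": {"title": "Page title"},
}

default_title = lookup_field(sample, "title")
page_first_title = lookup_field(sample, "title", strategy="page,og,dc")
og_only_title = lookup_field(sample, "title", strategy="og")
```

With the default order, the lookup skips the empty `og` section and returns the DublinCore title; restricting the strategy to `og` alone yields nothing for this sample.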

The only 2 page elements currently supported are::

    <title>VALUE</title> -> metadata['page']['title']
    <link rel="canonical" href="VALUE"> -> metadata['page']['link']
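
As an illustration of the two page-element mappings, the sketch below pulls a title and a canonical link out of an HTML string using only the standard library's `html.parser`. The module itself uses BeautifulSoup, so this is a stand-in technique, and `PageElementParser` and `parse_page_elements` are hypothetical names.

```python
from html.parser import HTMLParser


class PageElementParser(HTMLParser):
    """Collect <title> text and the canonical <link> href (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.page = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            if attrs.get("href"):
                self.page["link"] = attrs["href"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.page["title"] = self.page.get("title", "") + data


def parse_page_elements(html):
    """Return {'page': {...}} with any 'title' and 'link' values found."""
    parser = PageElementParser()
    parser.feed(html)
    parser.close()
    return {"page": parser.page}


doc = (
    "<html><head><title>Example</title>"
    '<link rel="canonical" href="http://example.com/a">'
    "</head><body></body></html>"
)
metadata = parse_page_elements(doc)
```

Here `metadata['page']['title']` holds the text of the `<title>` element and `metadata['page']['link']` holds the canonical URL, mirroring the mapping above.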

Usage

From a URL

>>> import metadata_parser
>>> page = metadata_parser.MetadataParser(url="http://www.cnn.com")
>>> print(page.metadata)
>>> print(page.get_field('title'))
>>> print(page.get_field('title', strategy='og'))
>>> print(page.get_field('title', strategy='page,og,dc'))

From HTML

>>> HTML = """<here>"""
>>> page = metadata_parser.MetadataParser(html=HTML)
>>> print(page.metadata)
>>> print(page.get_field('title'))
>>> print(page.get_field('title', strategy='og'))
>>> print(page.get_field('title', strategy='page,og,dc'))

Last updated May 21st, 2013

Download Stats

Last month:	2
