


How to install metadata_parser

  1. Download and install ActivePython
  2. Open Command Prompt
  3. Type pypm install metadata-parser
Python 2.7 | Python 3.2 | Python 3.3

Platform            Available versions
Windows (32-bit)    0.5.1, 0.4.12, 0.4.4
Windows (64-bit)    0.5.1, 0.4.12, 0.4.4
Mac OS X (10.5+)    0.5.1, 0.4.12, 0.4.4
Linux (32-bit)      0.5.1, 0.4.12, 0.4.11, 0.4.4
Linux (64-bit)      0.5.1, 0.4.12, 0.4.11, 0.4.4

License: MIT
Latest release: version 0.5.1 on May 21st, 2013

MetadataParser is a Python module for pulling metadata out of web documents.

It requires BeautifulSoup and is largely based on Erik River's opengraph module (https://github.com/erikriver/opengraph).

I needed something more aggressive than Erik's module, so I had to fork it.

Installation

pip install metadata_parser

Features

  • It pulls as much metadata out of a document as possible.
  • You can set a 'strategy' for finding metadata (e.g., only accept OpenGraph or page attributes).

Notes

  1. This requires BeautifulSoup 3 or 4. If it can import bs4, it does; otherwise it tries BeautifulSoup (3).
  2. For speed, it instantiates a BeautifulSoup parser with lxml, and falls back to 'none' (the internal pure-Python parser) if it can't load lxml.
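
The import fallback described in the notes can be sketched roughly as follows. This is an illustration of the general pattern, not the module's actual source; the helper name first_importable and the variable names are hypothetical.

```python
import importlib


def first_importable(candidates):
    """Return (name, module) for the first candidate that imports, else (None, None)."""
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    return None, None


# Prefer BeautifulSoup 4 ("bs4"), then fall back to BeautifulSoup 3.
soup_name, soup_module = first_importable(["bs4", "BeautifulSoup"])

# Prefer lxml when it is installed; None stands in for "no explicit parser".
lxml_name, _ = first_importable(["lxml"])
parser_name = "lxml" if lxml_name else None
```

Trying the preferred package first and catching ImportError keeps the choice in one place, so the rest of the code does not need to know which library was found.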

The default 'strategy' is to look in this order:

og,dc,meta,page

  • og = OpenGraph
  • dc = DublinCore
  • meta = metadata
  • page = page elements

You can specify a strategy as a comma-separated list of the above.
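
To make the lookup order concrete, here is a minimal sketch of how a comma-separated strategy could be resolved against a metadata dictionary. It is not the library's implementation: the function lookup_field, the sample data, and the assumption that the dictionary is keyed by 'og', 'dc', 'meta' and 'page' are all illustrative.

```python
DEFAULT_STRATEGY = "og,dc,meta,page"


def lookup_field(metadata, field, strategy=DEFAULT_STRATEGY):
    """Return the first non-empty value for `field`, trying each source in order."""
    for source in strategy.split(","):
        source = source.strip()
        value = metadata.get(source, {}).get(field)
        if value:
            return value
    return None


# Hypothetical parsed metadata: no OpenGraph or DublinCore title is present.
sample = {
    "og": {},
    "dc": {},
    "meta": {"title": "Meta title"},
    "page": {"title": "Page title", "link": "http://example.com/"},
}

default_title = lookup_field(sample, "title")
page_first_title = lookup_field(sample, "title", strategy="page,og,dc")
og_only_title = lookup_field(sample, "title", strategy="og")
```

With this sample, the default order skips the empty og and dc sources and settles on the meta value, while a strategy starting with page picks the title element instead.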

The only two page elements currently supported are:

<title>VALUE</title> -> metadata['page']['title']
<link rel="canonical" href="VALUE"> -> metadata['page']['link']
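
As a rough illustration of the mapping above, the two page elements can be collected with the standard library's html.parser. The module itself uses BeautifulSoup, so this is only a sketch under that substitution; the class PageElementParser and the function extract_page_elements are hypothetical names.

```python
from html.parser import HTMLParser


class PageElementParser(HTMLParser):
    """Collects the <title> text and the canonical <link> href."""

    def __init__(self):
        super().__init__()
        self.page = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            href = attrs.get("href")
            if href:
                self.page["link"] = href

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.page["title"] = self.page.get("title", "") + data


def extract_page_elements(html):
    """Return a dict shaped like metadata['page'] nested under the 'page' key."""
    parser = PageElementParser()
    parser.feed(html)
    parser.close()
    return {"page": parser.page}


sample_html = (
    "<html><head><title>Example</title>"
    '<link rel="canonical" href="http://example.com/">'
    "</head><body></body></html>"
)
metadata = extract_page_elements(sample_html)
```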

Usage

From a URL

>>> import metadata_parser
>>> page = metadata_parser.MetadataParser(url="http://www.cnn.com")
>>> print(page.metadata)
>>> print(page.get_field('title'))
>>> print(page.get_field('title', strategy='og'))
>>> print(page.get_field('title', strategy='page,og,dc'))

From HTML

>>> HTML = """<here>"""
>>> page = metadata_parser.MetadataParser(html=HTML)
>>> print(page.metadata)
>>> print(page.get_field('title'))
>>> print(page.get_field('title', strategy='og'))
>>> print(page.get_field('title', strategy='page,og,dc'))


Last updated May 21st, 2013

