pypm install topia.termextract

How to install topia.termextract

  1. Download and install ActivePython
  2. Open Command Prompt
  3. Type pypm install topia.termextract
Python 2.7 | Python 3.2 | Python 3.3
Windows (32-bit): 1.1.0 available
Windows (64-bit): 1.1.0 available
Mac OS X (10.5+): 1.1.0 available
Linux (32-bit): 1.1.0 available
Linux (64-bit): 1.1.0 available
 
License
ZPL 2.1
Dependencies
Latest release
version 1.1.0 on Jan 5th, 2011

This package determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.

Detailed Documentation

Term Extraction

This package implements text term extraction by making use of a simple Parts-Of-Speech (POS) tagging algorithm.

http://bioie.ldc.upenn.edu/wiki/index.php/Part-of-Speech

The POS Tagger

POS Taggers use a lexicon to mark words with a tag. A list of available tags can be found at:

http://bioie.ldc.upenn.edu/wiki/index.php/POS_tags

Since words can have multiple tags, the determination of the correct tag is not always simple. This implementation, however, does not try to infer linguistic use and simply chooses the first tag in the lexicon.

>>> from topia.termextract import tag
>>> tagger = tag.Tagger()
>>> tagger
<Tagger for english>

To get the tagger ready for its work, we need to initialize it. In this implementation the lexicon is loaded.

>>> tagger.initialize()
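
The "first tag in the lexicon" strategy described above can be illustrated with a minimal sketch. The toy lexicon, the `first_tag` helper, and the `NN` fallback for unknown words are assumptions made for illustration; they are not taken from the package's actual lexicon or code.

```python
# Hypothetical toy lexicon: each word maps to its tags in listed order.
LEXICON = {
    'simple': ['JJ'],
    'example': ['NN'],
    'can': ['MD', 'NN', 'VB'],
    'jump': ['NN', 'VB'],
}


def first_tag(word, lexicon=LEXICON, default='NN'):
    """Return the first tag listed for the word, ignoring context.

    Unknown words fall back to a default noun tag (an assumption here).
    """
    tags = lexicon.get(word)
    return tags[0] if tags else default


print(first_tag('can'))
print(first_tag('Ikea'))
```

Because no context is consulted, an ambiguous word always receives the same tag; in a scheme like this, a later rule pass has to correct the cases where the first tag is wrong.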

Now we are ready to rock and roll.

Tokenizing

The first step of tagging is to tokenize the text into terms.

>>> tagger.tokenize('This is a simple example.')
['This', 'is', 'a', 'simple', 'example', '.']

While most tokenizers ignore punctuation, it is important for us to keep it, since we need it later for the term extraction. Let's now look at some more complex cases:

  • Quoted Text
>>> tagger.tokenize('This is a "simple" example.')
['This', 'is', 'a', '"', 'simple', '"', 'example', '.']
>>> tagger.tokenize('"This is a simple example."')
['"', 'This', 'is', 'a', 'simple', 'example', '."']
  • Non-letters within words.
>>> tagger.tokenize('Parts-Of-Speech')
['Parts-Of-Speech']
>>> tagger.tokenize('amazon.com')
['amazon.com']
>>> tagger.tokenize('Go to amazon.com.')
['Go', 'to', 'amazon.com', '.']
  • Various punctuation.
>>> tagger.tokenize('Quick, go to amazon.com.')
['Quick', ',', 'go', 'to', 'amazon.com', '.']
>>> tagger.tokenize('Live free; or die?')
['Live', 'free', ';', 'or', 'die', '?']
  • Tolerance to incorrect punctuation.
>>> tagger.tokenize('Hi , I am here.')
['Hi', ',', 'I', 'am', 'here', '.']
  • Possessive structures.
>>> tagger.tokenize("my parents' car")
['my', 'parents', "'", 'car']
>>> tagger.tokenize("my father's car")
['my', 'father', "'s", 'car']
  • Numbers.
>>> tagger.tokenize("12.4")
['12.4']
>>> tagger.tokenize("-12.4")
['-12.4']
>>> tagger.tokenize("$12.40")
['$12.40']
  • Dates.
>>> tagger.tokenize("10/3/2009")
['10/3/2009']
>>> tagger.tokenize("3.10.2009")
['3.10.2009']
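
To make the idea of a punctuation-preserving tokenizer concrete, here is a minimal sketch. It is not the package's tokenizer: the `tokenize` function and its regular expression are assumptions, and it covers only some of the cases above. It does not split possessives, and it separates a leading `-` or `$` from a number.

```python
import re

# Leading punctuation, a core that starts and ends with a word character,
# then trailing punctuation. Inner characters such as '.' or '-' stay in the core.
_CHUNK = re.compile(r"^(\W*)(\w(?:.*\w)?)(\W*)$")


def tokenize(text):
    """Split on whitespace, then peel punctuation off both ends of each chunk."""
    tokens = []
    for chunk in text.split():
        match = _CHUNK.match(chunk)
        if match is None:
            # The chunk contains no word character at all, e.g. a lone comma.
            tokens.append(chunk)
            continue
        lead, core, trail = match.groups()
        tokens.extend(lead)    # each leading punctuation mark is its own token
        tokens.append(core)
        tokens.extend(trail)   # each trailing punctuation mark is its own token
    return tokens


print(tokenize('Quick, go to amazon.com.'))
```

Unlike the output shown above, this sketch emits every trailing punctuation mark as a separate token, so a closing `."` becomes two tokens rather than one.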

Okay, that's it.

Tagging

The next step is tagging, which is done in two phases. In the first phase, each term is assigned a tag by looking it up in the lexicon, and its normalized form is set to the term itself. In the second phase, a set of rules is applied to each tagged term, and the tag and normalized form are adjusted where needed.

>>> tagger('This is a simple example.')
[['This', 'DT', 'This'],
['is', 'VBZ', 'is'],
['a', 'DT', 'a'],
['simple', 'JJ', 'simple'],
['example', 'NN', 'example'],
['.', '.', '.']]

So wow, this determination was dead on. Let's try a plural form noun and see what happens:

>>> tagger('These are simple examples.')
[['These', 'DT', 'These'],
['are', 'VBP', 'are'],
['simple', 'JJ', 'simple'],
['examples', 'NNS', 'example'],
['.', '.', '.']]

So far so good. Let's test a few more cases:

>>> tagger("The fox's tail is red.")
[['The', 'DT', 'The'],
['fox', 'NN', 'fox'],
["'s", 'POS', "'s"],
['tail', 'NN', 'tail'],
['is', 'VBZ', 'is'],
['red', 'JJ', 'red'],
['.', '.', '.']]
>>> tagger("The fox can't really jump over the fox's tail.")
[['The', 'DT', 'The'],
['fox', 'NN', 'fox'],
['can', 'MD', 'can'],
["'t", 'RB', "'t"],
['really', 'RB', 'really'],
['jump', 'VB', 'jump'],
['over', 'IN', 'over'],
['the', 'DT', 'the'],
['fox', 'NN', 'fox'],
["'s", 'POS', "'s"],
['tail', 'NN', 'tail'],
['.', '.', '.']]
Rules
  • Correct Default Noun Tag
>>> tagger('Ikea')
[['Ikea', 'NN', 'Ikea']]
>>> tagger('Ikeas')
[['Ikeas', 'NNS', 'Ikea']]
  • Verify proper nouns at beginning of sentence.
>>> tagger('. Police')
[['.', '.', '.'], ['police', 'NN', 'police']]
>>> tagger('Police')
[['police', 'NN', 'police']]
>>> tagger('. Stephan')
[['.', '.', '.'], ['Stephan', 'NNP', 'Stephan']]
  • Determine Verb after Modal Verb
>>> tagger('The fox can jump')
[['The', 'DT', 'The'],
['fox', 'NN', 'fox'],
['can', 'MD', 'can'],
['jump', 'VB', 'jump']]
>>> tagger("The fox can't jump")
[['The', 'DT', 'The'],
['fox', 'NN', 'fox'],
['can', 'MD', 'can'],
["'t", 'RB', "'t"],
['jump', 'VB', 'jump']]
>>> tagger('The fox can really jump')
[['The', 'DT', 'The'],
['fox', 'NN', 'fox'],
['can', 'MD', 'can'],
['really', 'RB', 'really'],
['jump', 'VB', 'jump']]
  • Normalize Plural Forms
>>> tagger('examples')
[['examples', 'NNS', 'example']]
>>> tagger('stresses')
[['stresses', 'NNS', 'stress']]
>>> tagger('cherries')
[['cherries', 'NNS', 'cherry']]

Some cases that do not work:

>>> tagger('men')
[['men', 'NNS', 'men']]
>>> tagger('feet')
[['feet', 'NNS', 'feet']]
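
Two of the rules above can be sketched as small functions. These are illustrative approximations inferred from the example outputs, not the package's implementation; the names `normalize_plural` and `verb_after_modal` and the exact suffix checks are assumptions.

```python
def normalize_plural(term):
    """Strip a regular English plural suffix; meant for terms tagged NNS."""
    if term.endswith('ies') and len(term) > 3:
        return term[:-3] + 'y'            # cherries -> cherry
    if term.endswith('es') and term[:-2].endswith(('ss', 'x', 'ch', 'sh')):
        return term[:-2]                  # stresses -> stress
    if term.endswith('s') and not term.endswith('ss'):
        return term[:-1]                  # examples -> example
    return term                           # irregular plurals are left alone


def verb_after_modal(tagged):
    """Retag a noun as a verb when it follows a modal, skipping adverbs.

    `tagged` is a list of [term, tag, normalized] lists and is changed in place.
    """
    for i, (_, tag, _) in enumerate(tagged):
        if tag != 'MD':
            continue
        j = i + 1
        while j < len(tagged) and tagged[j][1] == 'RB':
            j += 1
        if j < len(tagged) and tagged[j][1] == 'NN':
            tagged[j][1] = 'VB'
    return tagged


print(normalize_plural('cherries'))
print(verb_after_modal([['can', 'MD', 'can'], ['jump', 'NN', 'jump']]))
```

Suffix stripping of this kind explains why `men` and `feet` are left unchanged: irregular plurals carry no suffix to remove.
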
Term Extraction

Now that we can tag a text, let's have a look at the term extractions.

>>> from topia.termextract import extract
>>> extractor = extract.TermExtractor()
>>> extractor
<TermExtractor using <Tagger for english>>

As you can see, the extractor maintains a tagger:

>>> extractor.tagger
<Tagger for english>

When creating an extractor, you can also pass in a tagger to avoid frequent tagger initialization:

>>> extractor = extract.TermExtractor(tagger)
>>> extractor.tagger is tagger
True

Let's get the terms for a simple text.

>>> extractor("The fox can't jump over the fox's tail.")
[]

We got no terms. That's because, by default, a single-word term must occur at least 3 times to be selected.

The extractor maintains a filter component. Let's register the trivial permissive filter, which simply returns everything that the extractor suggests:

>>> extractor.filter = extract.permissiveFilter
>>> extractor("The fox can't jump over the fox's tail.")
[('tail', 1, 1), ('fox', 2, 1)]

But let's look at the default filter again, since it allows tweaking its parameters:

>>> extractor.filter = extract.DefaultFilter(singleStrengthMinOccur=2)
>>> extractor("The fox can't jump over the fox's tail.")
[('fox', 2, 1)]

Let's now have a look at multi-word terms. Oftentimes multi-word nouns and proper names occur only once or twice in a text. But they are often great terms! To handle this scenario, the concept of "strength" was introduced. Currently the strength is simply the number of words in the term. By default, all terms with a strength larger than 1 are selected regardless of the number of occurrences.

>>> extractor('The German consul of Boston resides in Newton.')
[('German consul', 1, 2)]
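
The selection rule described above (single-word terms need a minimum number of occurrences, stronger terms always pass) can be sketched as follows. This is a stand-alone illustration, not the package's `DefaultFilter`; the names `make_filter`, `apply_filter`, `single_min_occur`, and `no_limit_strength` are hypothetical.

```python
def make_filter(single_min_occur=3, no_limit_strength=2):
    """Build a predicate over (term, occurrences, strength) tuples."""
    def keep(term, occur, strength):
        # Single-word terms must occur often enough ...
        if strength == 1 and occur >= single_min_occur:
            return True
        # ... while multi-word terms are kept regardless of occurrences.
        return strength >= no_limit_strength
    return keep


def apply_filter(candidates, keep):
    """Return only the candidates accepted by the predicate."""
    return [c for c in candidates if keep(*c)]


candidates = [('tail', 1, 1), ('fox', 2, 1), ('German consul', 1, 2)]
print(apply_filter(candidates, make_filter()))
print(apply_filter(candidates, make_filter(single_min_occur=2)))
```

With the default threshold only the two-word term survives; lowering the threshold to 2 also lets `fox` through, which mirrors the behaviour shown in the examples above.
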
An Example - A News Article

This document provides a simple example of extracting the terms of a BBC article from May 29, 2009. We will use several term extraction tools to compare the outcome.

>>> text ='''
... Police shut Palestinian theatre in Jerusalem.
...
... Israeli police have shut down a Palestinian theatre in East Jerusalem.
...
... The action, on Thursday, prevented the closing event of an international
... literature festival from taking place.
...
... Police said they were acting on a court order, issued after intelligence
... indicated that the Palestinian Authority was involved in the event.
...
... Israel has occupied East Jerusalem since 1967 and has annexed the
... area. This is not recognised by the international community.
...
... The British consul-general in Jerusalem , Richard Makepeace, was
... attending the event.
...
... "I think all lovers of literature would regard this as a very
... regrettable moment and regrettable decision," he added.
...
... Mr Makepeace said the festival's closing event would be reorganised to
... take place at the British Council in Jerusalem.
...
... The Israeli authorities often take action against events in East
... Jerusalem they see as connected to the Palestinian Authority.
...
... Saturday's opening event at the same theatre was also shut down.
...
... A police notice said the closure was on the orders of Israel's internal
... security minister on the grounds of a breach of interim peace accords
... from the 1990s.
...
... These laid the framework for talks on establishing a Palestinian state
... alongside Israel, but left the status of Jerusalem to be determined by
... further negotiation.
...
... Israel has annexed East Jerusalem and declares it part of its eternal
... capital.
...
... Palestinians hope to establish their capital in the area.
... '''
Yahoo Keyword Extractor

Yahoo provides a service that extracts terms from a piece of content using its immense search database.

http://developer.yahoo.com/search/content/V1/termExtraction.html

As you can see, the result is excellent:

<ResultSet>
<Result>british consul general</Result>
<Result>east jerusalem</Result>
<Result>literature festival</Result>
<Result>richard makepeace</Result>
<Result>international literature</Result>
<Result>israeli authorities</Result>
<Result>eternal capital</Result>
<Result>peace accords</Result>
<Result>security minister</Result>
<Result>israeli police</Result>
<Result>internal security</Result>
<Result>palestinian state</Result>
<Result>palestinian authority</Result>
<Result>british council</Result>
<Result>palestinians</Result>
<Result>negotiation</Result>
<Result>breach</Result>
<Result>1990s</Result>
<Result>closure</Result>
<Result>israel</Result>
</ResultSet>

Unfortunately, the service allows only 5000 requests per 24 hours. Also, there is no strength indicator on the terms.
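
For comparison with the tuples returned by the extractor, a result set in the shape shown above can be turned into a plain list of terms. This is a minimal sketch using the standard library; the `parse_terms` helper and the shortened `SAMPLE` are illustrative, and a real response may declare an XML namespace, in which case the tag name would need to be qualified.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the result set shown above.
SAMPLE = """<ResultSet>
<Result>british consul general</Result>
<Result>east jerusalem</Result>
<Result>literature festival</Result>
</ResultSet>"""


def parse_terms(xml_text):
    """Return the text of every <Result> element, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter('Result')]


print(parse_terms(SAMPLE))
```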

TreeTagger

TreeTagger is a POS tagger that uses some linguistics to tag a text. Here is its output:

Police          NNS   Police
shut            VVD   shut
Palestinian     JJ    Palestinian
theatre         NN    theatre
in              IN    in
Jerusalem       NP    Jerusalem
.               SENT  .
Israeli         JJ    Israeli
police          NNS   police
have            VHP   have
shut            VVN   shut
down            RP    down
a               DT    a
Palestinian     JJ    Palestinian
theatre         NN    theatre
in              IN    in
East            NP    East
Jerusalem       NP    Jerusalem
.               SENT  .
The             DT    the
action          NN    action
,               ,     ,
on              IN    on
Thursday        NP    Thursday
,               ,     ,
prevented       VVD   prevent
the             DT    the
closing         NN    closing
event           NN    event
of              IN    of
an              DT    an
international   JJ    international
literature      NN    literature
festival        NN    festival
from            IN    from
taking          VVG   take
place           NN    place
.               SENT  .
Police          NNS   Police
said            VVD   say
they            PP    they
were            VBD   be
acting          VVG   act
on              IN    on
a               DT    a
court           NN    court
order           NN    order
,               ,     ,
issued          VVN   issue
after           IN    after
intelligence    NN    intelligence
indicated       VVN   indicate
that            IN    that
the             DT    the
Palestinian     NP    Palestinian
Authority       NP    Authority
was             VBD   be
involved        VVN   involve
in              IN    in
the             DT    the
event           NN    event
.               SENT  .
Israel          NP    Israel
has             VHZ   have
occupied        VVN   occupy
East            NP    East
Jerusalem       NP    Jerusalem
since           IN    since
1967            CD    @card@
and             CC    and
has             VHZ   have
annexed         VVN   annex
the             DT    the
area            NN    area
.               SENT  .
This            DT    this
is              VBZ   be
not             RB    not
recognised      VVN   recognise
by              IN    by
the             DT    the
international   JJ    international
community       NN    community
.               SENT  .
The             DT    the
British         JJ    British
consul-general  NN    <unknown>
in              IN    in
Jerusalem       NP    Jerusalem
,               ,     ,
Richard         NP    Richard
Makepeace       NP    Makepeace
,               ,     ,
was             VBD   be
attending       VVG   attend
the             DT    the
event           NN    event
.               SENT  .
"               ``    "
I               PP    I
think           VVP   think
all             DT    all
lovers          NNS   lover
of              IN    of
literature      NN    literature
would           MD    would
regard          VV    regard
this            DT    this
as              IN    as
a               DT    a
very            RB    very
regrettable     JJ    regrettable
moment          NN    moment
and             CC    and
regrettable     JJ    regrettable
decision        NN    decision
,               ,     ,
"               ''    "
he              PP    he
added           VVD   add
.               SENT  .
Mr              NP    Mr
Makepeace       NP    Makepeace
said            VVD   say
the             DT    the
festival        NN    festival
's              POS   's
closing         NN    closing
event           NN    event
would           MD    would
be              VB    be
reorganised     VVN   <unknown>
to              TO    to
take            VV    take
place           NN    place
at              IN    at
the             DT    the
British         NP    British
Council         NP    Council
in              IN    in
Jerusalem       NP    Jerusalem
.               SENT  .
The             DT    the
Israeli         JJ    Israeli
authorities     NNS   authority
often           RB    often
take            VVP   take
action          NN    action
against         IN    against
events          NNS   event
in              IN    in
East            NP    East
Jerusalem       NP    Jerusalem
they            PP    they
see             VVP   see
as              RB    as
connected       VVN   connect
to              TO    to
the             DT    the
Palestinian     JJ    Palestinian
Authority       NP    Authority
.               SENT  .
Saturday        NP    Saturday
's              POS   's
opening         NN    opening
event           NN    event
at              IN    at
the             DT    the
same            JJ    same
theatre         NN    theatre
was             VBD   be
also            RB    also
shut            VVN   shut
down            RP    down
.               SENT  .
A               DT    a
police          NN    police
notice          NN    notice
said            VVD   say
the             DT    the
closure         NN    closure
was             VBD   be
on              IN    on
the             DT    the
orders          NNS   order
of              IN    of
Israel          NP    Israel
's              POS   's
internal        JJ    internal
security        NN    security
minister        NN    minister
on              IN    on
the             DT    the
grounds         NNS   ground
of              IN    of
a               DT    a
breach          NN    breach
of              IN    of
interim         JJ    interim
peace           NN    peace
accords         NNS   accord
from            IN    from
the             DT    the
1990s           NNS   1990s
.               SENT  .
These           DT    these
laid            VVD   lay
the             DT    the
framework       NN    framework
for             IN    for
talks           NNS   talk
on              IN    on
establishing    VVG   establish
a               DT    a
Palestinian     JJ    Palestinian
state           NN    state
alongside       IN    alongside
Israel          NP    Israel
,               ,     ,
but             CC    but
left            VVD   leave
the             DT    the
status          NN    status
of              IN    of
Jerusalem       NP    Jerusalem
to              TO    to
be              VB    be
determined      VVN   determine
by              IN    by
further         JJR   further
negotiation     NN    negotiation
.               SENT  .
Israel          NP    Israel
has             VHZ   have
annexed         VVN   annex
East            NP    East
Jerusalem       NP    Jerusalem
and             CC    and
declares        VVZ   declare
it              PP    it
part            NN    part
of              IN    of
its             PP$   its
eternal         JJ    eternal
capital         NN    capital
.               SENT  .
Palestinians    NPS   Palestinians
hope            VVP   hope
to              TO    to
establish       VV    establish
their           PP$   their
capital         NN    capital
in              IN    in
the             DT    the
area            NN    area
.               SENT  .

As you can see, TreeTagger's tag identification is pretty good, but the output would need some analysis to produce a useful set of terms. Furthermore, TreeTagger is not free for commercial use.
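
As a rough idea of the kind of analysis such output would need, the sketch below groups consecutive noun-tagged tokens into candidate terms. It assumes one whitespace-separated `token tag lemma` triple per line; the `noun_runs` helper and the chosen set of noun tags are assumptions made for illustration.

```python
# Tags treated as nouns in this sketch (common and proper, singular and plural).
NOUN_TAGS = {'NN', 'NNS', 'NP', 'NPS'}


def noun_runs(lines):
    """Join each run of adjacent noun tokens into one candidate term."""
    terms, run = [], []
    for line in lines:
        token, tag, _lemma = line.split()
        if tag in NOUN_TAGS:
            run.append(token)
        elif run:
            terms.append(' '.join(run))
            run = []
    if run:
        terms.append(' '.join(run))
    return terms


sample = [
    'The DT the',
    'British JJ British',
    'consul-general NN <unknown>',
    'in IN in',
    'Jerusalem NP Jerusalem',
    ', , ,',
    'Richard NP Richard',
    'Makepeace NP Makepeace',
    ', , ,',
]
print(noun_runs(sample))
```

A real term extractor would still have to count occurrences and decide which candidates are worth keeping.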

Topia's Term Extractor

Topia's Term Extractor tries to produce results somewhere between a POS tagger like TreeTagger and Yahoo Keyword Extraction.

Since we are only interested in nouns, a very simple POS tagging algorithm can be deployed, which will provide good results most of the time. We then use some simple statistics and linguistics to produce a narrow but strong list of terms for the content.

>>> from topia.termextract import extract
>>> extractor = extract.TermExtractor()

Let's look at the result of the tagger first:

>>> printTaggedTerms(extractor.tagger(text)) #doctest: +REPORT_NDIFF
police          NN    police
shut            VBN   shut
Palestinian     JJ    Palestinian
theatre         NN    theatre
in              IN    in
Jerusalem       NNP   Jerusalem
.               .     .
Israeli         JJ    Israeli
police          NN    police
have            VBP   have
shut            VBN   shut
down            RB    down
a               DT    a
Palestinian     JJ    Palestinian
theatre         NN    theatre
in              IN    in
East            NNP   East
Jerusalem       NNP   Jerusalem
.               .     .
The             DT    The
action          NN    action
,               ,     ,
on              IN    on
Thursday        NNP   Thursday
,               ,     ,
prevented       VBN   prevented
the             DT    the
closing         VBG   closing
event           NN    event
of              IN    of
an              DT    an
international   JJ    international
literature      NN    literature
festival        NN    festival
from            IN    from
taking          VBG   taking
place           NN    place
.               .     .
police          NN    police
said            VBD   said
they            PRP   they
were            VBD   were
acting          VBG   acting
on              IN    on
a               DT    a
court           NN    court
order           NN    order
,               ,     ,
issued          VBN   issued
after           IN    after
intelligence    NN    intelligence
indicated       VBD   indicated
that            IN    that
the             DT    the
Palestinian     JJ    Palestinian
Authority       NNP   Authority
was             VBD   was
involved        VBN   involved
in              IN    in
the             DT    the
event           NN    event
.               .     .
Israel          NNP   Israel
has             VBZ   has
occupied        VBN   occupied
East            NNP   East
Jerusalem       NNP   Jerusalem
since           IN    since
1967            NN    1967
and             CC    and
has             VBZ   has
annexed         VBD   annexed
the             DT    the
area            NN    area
.               .     .
This            DT    This
is              VBZ   is
not             RB    not
recognised      VBD   recognised
by              IN    by
the             DT    the
international   JJ    international
community       NN    community
.               .     .
The             DT    The
British         JJ    British
consul-general  NN    consul-general
in              IN    in
Jerusalem       NNP   Jerusalem
,               ,     ,
Richard         NNP   Richard
Makepeace       NNP   Makepeace
,               ,     ,
was             VBD   was
attending       VBG   attending
the             DT    the
event           NN    event
.               .     .
"               "     "
I               PRP   I
think           VBP   think
all             DT    all
lovers          NNS   lover
of              IN    of
literature      NN    literature
would           MD    would
regard          VB    regard
this            DT    this
as              IN    as
a               DT    a
very            RB    very
regrettable     JJ    regrettable
moment          NN    moment
and             CC    and
regrettable     JJ    regrettable
decision        NN    decision
,"              ,     ,"
he              PRP   he
added           VBD   added
.               .     .
Mr              NNP   Mr
Makepeace       NNP   Makepeace
said            VBD   said
the             DT    the
festival        NN    festival
's              POS   's
closing         VBG   closing
event           NN    event
would           MD    would
be              VB    be
reorganised     NN    reorganised
to              TO    to
take            VB    take
place           NN    place
at              IN    at
the             DT    the
British         JJ    British
Council         NNP   Council
in              IN    in
Jerusalem       NNP   Jerusalem
.               .     .
The             DT    The
Israeli         JJ    Israeli
authorities     NNS   authority
often           RB    often
take            VB    take
action          NN    action
against         IN    against
events          NNS   event
in              IN    in
East            NNP   East
Jerusalem       NNP   Jerusalem
they            PRP   they
see             VB    see
as              IN    as
connected       VBN   connected
to              TO    to
the             DT    the
Palestinian     JJ    Palestinian
Authority       NNP   Authority
.               .     .
Saturday        NNP   Saturday
's              POS   's
opening         NN    opening
event           NN    event
at              IN    at
the             DT    the
same            JJ    same
theatre         NN    theatre
was             VBD   was
also            RB    also
shut            VBN   shut
down            RB    down
.               .     .
A               DT    A
police          NN    police
notice          NN    notice
said            VBD   said
the             DT    the
closure         NN    closure
was             VBD   was
on              IN    on
the             DT    the
orders          NNS   order
of              IN    of
Israel          NNP   Israel
's              POS   's
internal        JJ    internal
security        NN    security
minister        NN    minister
on              IN    on
the             DT    the
grounds         NNS   ground
of              IN    of
a               DT    a
breach          NN    breach
of              IN    of
interim         JJ    interim
peace           NN    peace
accords         NNS   accord
from            IN    from
the             DT    the
1990            NN    1990
s               PRP   s
.               .     .
These           DT    These
laid            VBN   laid
the             DT    the
framework       NN    framework
for             IN    for
talks           NNS   talk
on              IN    on
establishing    VBG   establishing
a               DT    a
Palestinian     JJ    Palestinian
state           NN    state
alongside       IN    alongside
Israel          NNP   Israel
,               ,     ,
but             CC    but
left            VBN   left
the             DT    the
status          NN    status
of              IN    of
Jerusalem       NNP   Jerusalem
to              TO    to
be              VB    be
determined      VBN   determined
by              IN    by
further         JJ    further
negotiation     NN    negotiation
.               .     .
Israel          NNP   Israel
has             VBZ   has
annexed         VBD   annexed
East            NNP   East
Jerusalem       NNP   Jerusalem
and             CC    and
declares        VBZ   declares
it              PRP   it
part            NN    part
of              IN    of
its             PRP$  its
eternal         JJ    eternal
capital         NN    capital
.               .     .
Palestinians    NNPS  Palestinian
hope            NN    hope
to              TO    to
establish       VB    establish
their           PRP$  their
capital         NN    capital
in              IN    in
the             DT    the
area            NN    area
.               .     .
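
The `printTaggedTerms` helper used above is not defined in this document. A minimal sketch that would produce the same column layout might look like this; the column widths are inferred from the output shown, and `formatTaggedTerms` is a hypothetical companion function.

```python
def formatTaggedTerms(taggedTerms):
    """Return one aligned line per [term, tag, normalized] entry."""
    # Term padded to 16 columns, tag padded to 6, normalized form last.
    return ['%-16s%-6s%s' % (term, tag, norm)
            for term, tag, norm in taggedTerms]


def printTaggedTerms(taggedTerms):
    """Print the aligned lines, one tagged term per line."""
    for line in formatTaggedTerms(taggedTerms):
        print(line)


printTaggedTerms([['police', 'NN', 'police'],
                  ['consul-general', 'NN', 'consul-general']])
```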

Let's now apply the extractor.

>>> sorted(extractor(text))
[('British Council', 1, 2),
('British consul-general', 1, 2),
('East', 4, 1),
('East Jerusalem', 4, 2),
('Israel', 4, 1),
('Israeli authorities', 1, 2),
('Israeli police', 1, 2),
('Jerusalem', 8, 1),
('Mr Makepeace', 1, 2),
('Palestinian', 6, 1),
('Palestinian Authority', 2, 2),
('Palestinian state', 1, 2),
('Palestinian theatre', 2, 2),
('Palestinians hope', 1, 2),
('Richard Makepeace', 1, 2),
('court order', 1, 2),
('event', 6, 1),
('literature festival', 1, 2),
('opening event', 1, 2),
('peace accords', 1, 2),
('police', 4, 1),
('police notice', 1, 2),
('security minister', 1, 2),
('theatre', 3, 1)]
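
Since each result is a `(term, occurrences, strength)` tuple, the list can be ranked with ordinary Python sorting. The sketch below orders a few of the tuples above by strength, then by occurrences; the `rank_terms` helper and this particular ordering are illustrative choices, not part of the package.

```python
def rank_terms(terms):
    """Sort by strength, then occurrences (both descending), then by term."""
    return sorted(terms, key=lambda t: (-t[2], -t[1], t[0]))


subset = [
    ('East Jerusalem', 4, 2),
    ('Jerusalem', 8, 1),
    ('Palestinian Authority', 2, 2),
    ('event', 6, 1),
    ('theatre', 3, 1),
]
for term, occurrences, strength in rank_terms(subset):
    print('%-24s%3d%3d' % (term, occurrences, strength))
```
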
CHANGES
1.1.0 (2009-06-29)
  • Improved the dictionary a little bit to improve real scenarios.
1.0.0 (2009-05-30)
  • Initial Release
  • Part-Of-Speech Text Tagging using existing lexicon and very simplistic linguistic rules.

  • Term Extraction based on occurrences and term strength.
