License: GPL 3.0

Latest release: version 1.0dev-r781 on Jan 5th, 2011

Language Detector
-----------------

This is a simple (yet powerful) automatic language detector. Currently
the only languages it can detect are:

* English
* Spanish
* French

Installation and Usage
----------------------

To install, just run the easy_install_ tool::

    easy_install oice.langdet

.. _easy_install: http://peak.telecommunity.com/DevCenter/EasyInstall

This will install a console script, ``langdet``. Run ``langdet``,
passing a plain text filename as the first parameter. Example::

    langdet simple.txt

This will return the two-letter `ISO 639-1`_ code of
the detected language.

.. _ISO 639-1: http://en.wikipedia.org/wiki/ISO_639

You may also use ``oice.langdet`` in Python scripts like this::

    #!/usr/bin/env python2.5
    # -*- coding: utf-8 -*-
    from StringIO import StringIO

    from oice.langdet import langdet
    from oice.langdet import streams
    from oice.langdet import languages

    # The input must be Python Unicode text, wrapped in a Stream.
    text = streams.Stream(StringIO(u"Must be a Python Unicode text"))
    lang = langdet.LanguageDetector.detect(text)
    if lang == languages.spanish:
        print u'Texto en español'
    elif lang == languages.english:
        print u'English text'
    else:
        print u'France' # I don't speak/write French

Caveats
~~~~~~~

Currently there are some restrictions:

* ``langdet`` does not work properly with standard input or
  pipelines.

* You cannot use a file-like object directly with
  ``LanguageDetector``; i.e., you must use the ``Stream`` wrapper.

  This is because we try to guess the text encoding and normalize
  it to a Python Unicode string. However, we plan to remove this
  normalization step and count the frequency of octets and pairs of
  octets instead.

* If the piece of text is not written in any of the languages we can
  detect, the best match (see `How it works`_) is still selected.
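
The encoding guess mentioned in the second caveat could look roughly
like the following. This is only a Python 3 sketch under assumptions,
not the code inside ``Stream``: the ``to_unicode`` name and the list of
candidate encodings are hypothetical::

    def to_unicode(raw, encodings=("utf-8", "latin-1")):
        """Decode ``raw`` bytes with the first candidate encoding that works."""
        for encoding in encodings:
            try:
                return raw.decode(encoding)
            except UnicodeDecodeError:
                # This candidate does not fit the bytes; try the next one.
                continue
        # Reached only when no candidate could decode the input. With the
        # default candidates this cannot happen, because latin-1 accepts
        # every byte sequence.
        raise ValueError("could not guess the text encoding")

The order of the candidates matters: a permissive encoding such as
``latin-1`` never fails, so it only makes sense as the last resort.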

Work in progress
~~~~~~~~~~~~~~~~

In a sentence: trying to solve the first two caveats, and considering
support for Python 2.6 and Python 3.0.


How it works
------------

Language detection is based on statistics of the frequency of letters and
pairs of letters in the input text.

The modules in the package ``oice.langdet.languages`` each contain a
"footprint" of text in the corresponding language.

The texts used in the generation of the footprints were:

* El ingenioso hidalgo Don Quijote de la Mancha

* The Holy Bible

* La Folle Journée, ou Le Mariage de Figaro

When trying to detect the language of some piece of text, we first
count the frequencies of letters and pairs of letters in the text and
then compare the results with the footprint of each language. The
best match is selected.

We use the simple `cosine similarity`__ equation to compare the text
with the footprints of those texts.

__ `Cosine Similarity Wikipedia`_

.. _Cosine Similarity Wikipedia: http://en.wikipedia.org/wiki/Cosine_similarity
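
The procedure described above can be sketched in a few lines of
Python 3. This is an illustration under assumptions, not the package's
actual code: the names ``footprint``, ``cosine_similarity`` and
``detect`` are hypothetical, and the real footprints may be built and
stored differently::

    import math
    from collections import Counter


    def footprint(text):
        """Count letters and pairs of adjacent letters in ``text``."""
        counts = Counter()
        for word in text.lower().split():
            # Non-letters are dropped, so pairs never span a space.
            letters = [ch for ch in word if ch.isalpha()]
            counts.update(letters)
            counts.update(a + b for a, b in zip(letters, letters[1:]))
        return counts


    def cosine_similarity(a, b):
        """Cosine of the angle between two sparse count vectors."""
        dot = sum(a[key] * b[key] for key in a if key in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)


    def detect(text, footprints):
        """Return the key of the footprint most similar to ``text``."""
        sample = footprint(text)
        return max(
            footprints,
            key=lambda code: cosine_similarity(sample, footprints[code]),
        )

Here ``footprints`` would be a dictionary mapping a language code to
the footprint of a reference text in that language. Because ``detect``
always returns the highest-scoring key, it also shows why a text in an
unsupported language is still assigned to one of the known languages.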


Accuracy of the detection
-------------------------

To test the accuracy of this implementation we downloaded the full
`European Parliament Proceedings Parallel Corpus 1996-2006`__ and ran
the ``langdet`` script on the sets of English, Spanish and French
documents.

__ http://www.statmt.org/europarl/

For each language we counted the times the correct `ISO 639-1`_ code was
returned by ``langdet`` like this (here, counting the documents detected as
written in Spanish)::

    find -type f -exec langdet {} \; | grep es | wc -l
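
The pipeline above counts one code at a time. As a rough Python 3
sketch, all codes could be tallied in a single pass, assuming the lines
printed by ``langdet`` have already been collected into a list. The
``tally`` helper is hypothetical and is not part of the package::

    from collections import Counter


    def tally(codes):
        """Return the percentage of documents detected as each code."""
        # Strip newlines and skip empty lines before counting.
        counts = Counter(code.strip() for code in codes if code.strip())
        total = sum(counts.values())
        return {code: 100.0 * n / total for code, n in counts.items()}

Each value in the returned dictionary is the share of documents that
received that code, which is the kind of figure reported per column in
an accuracy table.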

The results are summarized in the following table:

.. table:: Summary of accuracy test for ``langdet``

   ============= ======= ======= ======= ===========
   Real language English Spanish French  Errors [1]_
   ============= ======= ======= ======= ===========
   English       98.78%  0%      0%      1.22%
   Spanish       0%      100%    0%      0%
   French        0%      0%      100%    0%
   Danish        1.22%   16.08%  82.7%   0%
   German        1.97%   0.15%   97.88%  0%
   Finnish       0.65%   5.9%    93.45%  0%
   Italian       0%      99.54%  0.46%   0%
   ============= ======= ======= ======= ===========

.. [1] Errors are generally produced when the detector cannot guess
   the encoding of the input text.

   In `Caveats`_ we propose a solution for this; however, its impact
   on the accuracy of detection is not clear.

The results show that for documents in the languages that ``langdet``
can detect, ``langdet`` behaves almost perfectly.

However, the results for documents in other languages show how
misleading ``langdet`` can be in such cases. We ran those tests for
illustration purposes only.

Nevertheless, these results also show that it would be very difficult
for this simple algorithm to distinguish Spanish from Italian, and
French from German.

Changelog
=========

1.0 - Unreleased
----------------

* Initial release
