oice.langdet | Python Package Manager Index (PyPM)

How to install oice.langdet

Download and install ActivePython
Open Command Prompt
Type pypm install oice.langdet

Python versions: 2.7, 3.2, 3.3

Builds of 1.0dev-r781 (all listed as Available):

* Windows (32-bit)
* Windows (64-bit)
* Mac OS X (10.5+)
* Linux (32-bit)
* Linux (64-bit)

Scientific/Engineering

Author

Universidad de las Ciencias Informáticas

License

GPL 3.0

Dependencies

Imports

oice.langdet
oice.langdet.languages

Latest release

version 1.0dev-r781 on Jan 5th, 2011

Language Detector
-----------------

This is a simple (yet powerful) automatic language detector. Currently
the only languages we can detect are:

* English
* Spanish
* French

Installation and Usage
----------------------

To install, just run the easy_install_ tool::

    easy_install oice.langdet

.. _easy_install: http://peak.telecommunity.com/DevCenter/EasyInstall

This will install a console script ``langdet``. Run ``langdet``
passing a plain text filename as the first parameter. Example::

    langdet simple.txt

This will return the two-letter `ISO 639-1`_ code of
the detected language.

.. _ISO 639-1: http://en.wikipedia.org/wiki/ISO_639

You may also use ``oice.langdet`` in Python scripts like this::

    #!/usr/bin/env python2.5
    # -*- coding: utf-8 -*-
    from StringIO import StringIO

    from oice.langdet import langdet
    from oice.langdet import streams
    from oice.langdet import languages

    # The detector needs a Stream wrapper around a file-like object.
    text = streams.Stream(StringIO(u"Must be a Python Unicode text"))
    lang = langdet.LanguageDetector.detect(text)
    if lang == languages.spanish:
        print u'Texto en español'
    elif lang == languages.english:
        print u'English text'
    else:
        print u'France' # I don't speak/write French

Caveats
~~~~~~~

Currently there are some restrictions:

* ``langdet`` does not work properly with standard input or
  pipelines.

* You cannot use a file-like object directly with
  ``LanguageDetector``; you must use the ``Stream`` wrapper.

  This is because we try to guess the text encoding and normalize
  it to a Python Unicode string. However, we plan to remove this
  normalization step and count the frequency of octets and pairs of
  octets instead.

* If the piece of text is not written in any of the languages we can
  detect, the best match (see `How it works`_) is still selected.

Work in progress
~~~~~~~~~~~~~~~~

In a sentence: we are trying to solve the first two caveats and
considering support for Python 2.6 and Python 3.0.


How it works
------------

Language detection is based on statistics about the frequency of
letters and pairs of letters in the input text.
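As a rough illustration of this kind of counting, here is a minimal sketch in Python 3. It is not the package's own code: ``build_profile`` is a hypothetical helper, and lowercasing the text and counting pairs only inside words are assumptions made for the example.

```python
from collections import Counter


def build_profile(text):
    """Count letters and adjacent letter pairs in `text` (hypothetical helper)."""
    # Assumption: case is ignored and non-letters only separate words.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    profile = Counter()
    for word in cleaned.split():
        profile.update(word)  # single letters
        profile.update(a + b for a, b in zip(word, word[1:]))  # pairs inside a word
    return profile


profile = build_profile("La casa")
```

Here ``profile["a"]`` is 3 and ``profile["la"]`` is 1; no pair spans the space between the two words.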

The modules in the package ``oice.langdet.languages`` each contain a
"footprint" of text in one of those languages.

The texts used in the generation of the footprints were:

* El ingenioso hidalgo Don Quijote de la Mancha

* The Holy Bible

* La Folle Journée, ou Le Mariage de Figaro

When trying to detect the language of some piece of text, we first
count the frequencies of letters and pairs of letters in the text and
then compare the results with the footprint of each language; the
best match is selected.

We use the simple `cosine similarity`__ equation to compare the text
with the footprints of those texts.

__ `Cosine Similarity Wikipedia`_

.. _Cosine Similarity Wikipedia: http://en.wikipedia.org/wiki/Cosine_similarity
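The comparison step can be sketched as follows in Python 3. This is an illustration under stated assumptions, not the package's implementation: ``cosine_similarity`` and ``best_match`` are hypothetical names, footprints are assumed to be plain mappings from letters or letter pairs to counts, and the numbers below are made up.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def best_match(profile, footprints):
    """Return the key of the footprint most similar to `profile`."""
    return max(footprints, key=lambda code: cosine_similarity(profile, footprints[code]))


# Toy footprints with made-up counts, for illustration only.
footprints = {
    "en": Counter({"t": 9, "h": 6, "e": 12, "th": 4, "he": 3}),
    "es": Counter({"e": 13, "a": 12, "o": 9, "de": 3, "la": 2}),
}
sample = Counter({"t": 2, "h": 2, "e": 2, "th": 2, "he": 2})
detected = best_match(sample, footprints)
```

Because ``max`` always returns some key, this sketch also shows why a text in an unsupported language still gets a "best match".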


Accuracy of the detection
-------------------------

To test the accuracy of this implementation we downloaded the full
`European Parliament Proceedings Parallel Corpus 1996-2006`__ and ran
the ``langdet`` script on the sets of English, Spanish and French
documents.

__ http://www.statmt.org/europarl/

For each language we counted the times the correct `ISO 639-1`_ code
was returned by ``langdet``, like this (counting documents detected
as written in Spanish)::

    find -type f -exec langdet {} \; | grep es | wc -l
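Turning such counts into percentages can be sketched in Python 3 as below. ``summarize`` is a hypothetical helper and the input list is made up; it assumes one detected code per document, as printed by the script. Comparing whole codes also avoids counting any output line that merely contains the letters ``es``.

```python
from collections import Counter


def summarize(codes):
    """Map each detected code to its share of all results, in percent."""
    counts = Counter(code.strip() for code in codes if code.strip())
    total = sum(counts.values())
    return {code: round(100.0 * n / total, 2) for code, n in counts.items()}


# Made-up detector output, one code per document.
shares = summarize(["es", "es", "es", "fr"])
```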

The results are summarized in the following table:

.. table:: Summary of accuracy test for ``langdet``

   =============  =======  =======  ======  ===========
   Real language  English  Spanish  French  Errors [1]_
   =============  =======  =======  ======  ===========
   English        98.78%   0%       0%      1.22%
   Spanish        0%       100%     0%      0%
   French         0%       0%       100%    0%
   Danish         1.22%    16.08%   82.7%   0%
   German         1.97%    0.15%    97.88%  0%
   Finnish        0.65%    5.9%     93.45%  0%
   Italian        0%       99.54%   0.46%   0%
   =============  =======  =======  ======  ===========

.. [1] Errors are generally produced when the detector cannot guess
   the encoding of the input text.

   In `Caveats`_ we propose a solution for this; however, its impact
   on the accuracy of detection is not clear.

The results show that for documents in the languages that ``langdet``
can detect, ``langdet`` behaves almost perfectly.

However, the results for documents in other languages show how
misleading ``langdet`` could be in such cases. We ran those tests for
illustration purposes only.

Nevertheless, these results also show that it would be very difficult
for this simple algorithm to distinguish Spanish from Italian, and
French from German.

Changelog
---------

1.0 - Unreleased
~~~~~~~~~~~~~~~~

* Initial release


oice.langdet 1.0dev-r781 (experimental)

Automatic Language Detector
