How to install marisa-trie
- Download and install ActivePython
- Open Command Prompt
- Type
pypm install marisa-trie
Lastest release
marisa-trie

Static memory-efficient Trie-like structures for Python (2.x and 3.x).
String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.
Based on marisa-trie C++ library.
Note
There are official SWIG-based Python bindings included in C++ library distribution; this package provides an alternative Cython-based pip-installable Python bindings.
Installation
pip install marisa-trie
Usage
There are several Trie classes in this package:
- marisa_trie.Trie - read-only trie-based data structure that maps unicode keys to auto-generated unique IDs;
- marisa_trie.RecordTrie - read-only trie-based data structure that maps unicode keys to lists of data tuples. All tuples must be of the same format (the data is packed using python struct module).
- marisa_trie.BytesTrie - read-only Trie that maps unicode keys to lists of bytes objects.
marisa_trie.Trie
Create a new trie:
>>> import marisa_trie >>> trie = marisa_trie.Trie([u'key1', u'key2', u'key12'])
Check if key is in trie:
>>> u'key1' in trie True >>> u'key20' in trie False
Each key is assigned an unique ID from 0 to (n - 1), where n is the number of keys; you can use this ID to store a value in a separate structure (e.g. python list):
>>> trie.key_id(u'key2') 1
Key can be reconstructed from the ID:
>>> trie.restore_key(1) u'key2'
Find all prefixes of a given key:
>>> trie.prefixes(u'key12') [u'key1', u'key12']
There is also a generator version of .prefixes method called .iter_prefixes.
Find all keys from this trie that starts with a given prefix:
>> trie.keys(u'key1') [u'key1', u'key12']
(iterator version .iterkeys(prefix) is also available).
marisa_trie.RecordTrie
Create a new trie:
>>> keys = [u'foo', u'bar', u'foobar', u'foo'] >>> values = [(1, 2), (2, 1), (3, 3), (2, 1)] >>> fmt = "<HH" # a tuple with 2 short integers >>> trie = marisa_trie.RecordTrie(fmt, zip(keys, values))
Trie initial data must be an iterable of tuples (unicode_key, data_tuple). Data tuples will be converted to bytes with struct.pack(fmt, *data_tuple).
Take a look at http://docs.python.org/library/struct.html#format-strings for the format string specification.
Duplicate keys are allowed.
Check if key is in trie:
>>> u'foo' in trie True >>> u'spam' in trie False
Get a values list:
>>> trie[u'bar'] [(2, 1)] >>> trie[u'foo'] [(1, 2), (2, 1)] >>> trie.get(u'bar', 123) [(2, 1)] >>> trie.get(u'BAAR', 123) # default value 123
Find all prefixes of a given key:
>>> trie.prefixes(u'foobarz') [u'foo', u'foobar']
Find all keys from this trie that starts with a given prefix:
>> trie.keys(u'fo') [u'foo', u'foo', u'foobar']
Find all items from this trie that starts with a given prefix:
>> trie.items(u'fo') [(u'foo', (1, 2)), (u'foo', (2, 1), (u'foobar', (3, 3))]
Note
Iterator version of .keys() and .items() are not implemented yet.
marisa_trie.BytesTrie
BytesTrie is similar to RecordTrie, but the values are raw bytes, not tuples:
>>> keys = [u'foo', u'bar', u'foobar', u'foo'] >>> values = [b'foo-value', b'bar-value', b'foobar-value', b'foo-value2'] >>> trie = marisa_trie.BytesTrie(zip(keys, values)) >>> trie[u'bar'] [b'bar-value']
Persistence
Trie objects supports saving/loading, pickling/unpickling and memory mapped I/O.
Write trie to a stream:
>>> with open('my_trie.marisa', 'w') as f: ... trie.write(f)
Save trie to a file:
>>> trie.save('my_trie_copy.marisa')
Read trie from stream:
>>> trie2 = marisa_trie.Trie() >>> with open('my_trie.marisa', 'r') as f: ... trie.read(f)
Load trie from file:
>>> trie2.load('my_trie.marisa')
Trie objects are picklable:
>>> import pickle >>> data = pickle.dumps(trie) >>> trie3 = pickle.loads(data)
You may also build a trie using marisa-build command-line utility (provided by underlying C++ library; it should be downloaded and compiled separately) and then load the trie from the resulting file using .load() method.
Memory mapped I/O
It is possible to use memory mapped file as data source:
>>> trie = marisa_trie.RecordTrie(fmt).mmap('my_record_trie.marisa')
This way the whole dictionary won't be loaded to memory; memory mapped I/O is an easy way to share dictionary data among processes.
Warning
Memory mapped trie might cause a lot of random disk accesses which considerably increase the search time.
Trie storage options
marisa-trie C++ library provides some configuration options for trie storage; check http://marisa-trie.googlecode.com/svn/trunk/docs/readme.en.html page (scroll down to "Enumeration Constants" section) to get an idea.
These options are exposed as order, num_tries, cache_size and binary keyword arguments for trie constructors.
For example, set order to marisa_trie.LABEL_ORDER in order to make trie functions return results in alphabetical oder:
>>> trie = marisa_trie.RecordTrie(fmt, data, order=marisa_trie.LABEL_ORDER)
Benchmarks
My quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):
- dict(unicode words -> word lenghts): about 600M
- list(unicode words) : about 300M
- BaseTrie from datrie library: about 70M
- marisa_trie.RecordTrie : 11M
- marisa_trie.Trie: 7M
Note
Lengths of words were stored as values in datrie.BaseTrie and marisa_trie.RecordTrie. RecordTrie compresses similar values and the key compression is better so it uses much less memory than datrie.BaseTrie.
marisa_trie.Trie provides auto-assigned IDs. It is not possible to store arbitrary values in marisa_trie.Trie so it uses less memory than RecordTrie.
Benchmark results (100k unicode words, integer values (lenghts of the words), Python 3.2, macbook air i5 1.8 Ghz):
dict building 2.919M words/sec Trie building 0.394M words/sec BytesTrie building 0.355M words/sec RecordTrie building 0.354M words/sec dict __getitem__ (hits) 8.239M ops/sec Trie __getitem__ (hits) not supported BytesTrie __getitem__ (hits) 0.498M ops/sec RecordTrie __getitem__ (hits) 0.404M ops/sec dict get() (hits) 4.410M ops/sec Trie get() (hits) not supported BytesTrie get() (hits) 0.458M ops/sec RecordTrie get() (hits) 0.364M ops/sec dict get() (misses) 4.869M ops/sec Trie get() (misses) not supported BytesTrie get() (misses) 0.849M ops/sec RecordTrie get() (misses) 0.816M ops/sec dict __contains__ (hits) 8.053M ops/sec Trie __contains__ (hits) 1.018M ops/sec BytesTrie __contains__ (hits) 0.605M ops/sec RecordTrie __contains__ (hits) 0.618M ops/sec dict __contains__ (misses) 6.489M ops/sec Trie __contains__ (misses) 2.047M ops/sec BytesTrie __contains__ (misses) 1.079M ops/sec RecordTrie __contains__ (misses) 1.123M ops/sec dict items() 57.248 ops/sec Trie items() not supported BytesTrie items() 11.691 ops/sec RecordTrie items() 8.369 ops/sec dict keys() 217.920 ops/sec Trie keys() 19.589 ops/sec BytesTrie keys() 14.849 ops/sec RecordTrie keys() 15.369 ops/sec Trie.prefixes (hits) 0.594M ops/sec Trie.prefixes (mixed) 1.874M ops/sec Trie.prefixes (misses) 1.447M ops/sec RecordTrie.prefixes (hits) 0.103M ops/sec RecordTrie.prefixes (mixed) 0.458M ops/sec RecordTrie.prefixes (misses) 0.164M ops/sec Trie.iter_prefixes (hits) 0.588M ops/sec Trie.iter_prefixes (mixed) 1.470M ops/sec Trie.iter_prefixes (misses) 1.170M ops/sec Trie.keys(prefix="xxx"), avg_len(res)==415 5.044K ops/sec Trie.keys(prefix="xxxxx"), avg_len(res)==17 89.363K ops/sec Trie.keys(prefix="xxxxxxxx"), avg_len(res)==3 258.732K ops/sec Trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4 293.199K ops/sec Trie.keys(prefix="xxx"), NON_EXISTING 1169.524K ops/sec RecordTrie.keys(prefix="xxx"), avg_len(res)==415 3.836K ops/sec RecordTrie.keys(prefix="xxxxx"), avg_len(res)==17 73.591K ops/sec RecordTrie.keys(prefix="xxxxxxxx"), avg_len(res)==3 229.515K ops/sec RecordTrie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4 269.228K ops/sec RecordTrie.keys(prefix="xxx"), NON_EXISTING 1071.433K ops/sec
Tries from marisa_trie are static and uses less memory, tries from datrie are faster and can be updated.
You may also give DAWG a try - it is usually faster than marisa-trie and sometimes can use less memory (depending on data).
Please take this benchmark results with a grain of salt; this is a very simple benchmark on a single data set.
Current limitations
- The library is not tested with mingw32 compiler;
- .prefixes() method of BytesTrie and RecordTrie is quite slow and doesn't have iterator counterpart;
- read() and write() methods don't work with file-like objects (they work only with real files; pickling works fine for file-like objects);
- there are keys() and items() methods but no values() method.
Contributions are welcome!
Contributing
Development happens at github and bitbucket:
The main issue tracker is at github: https://github.com/kmike/marisa-trie/issues
Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.
If you found a bug in a C++ part please report it to the original bug tracker.
How is source code organized
There are 4 folders in repository:
- bench - benchmarks & benchmark data;
- lib - original unmodified marisa-trie C++ library which is bundled for easier distribution; if something is have to be fixed in this library consider fixing it in the original repo ;
- src - wrapper code; src/marisa_trie.pyx is a wrapper implementation; src/*.pxd files are Cython headers for corresponding C++ headers; src/*.cpp files are the pre-built extension code and shouldn't be modified directly (they should be updated via update_cpp.sh script).
- tests - the test suite.
Running tests and benchmarks
Make sure tox is installed and run
$ tox
from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.
In order to run benchmarks, type
$ tox -c bench.ini
Authors & Contributors
- Mikhail Korobov <kmike84@gmail.com>
This module is based on marisa-trie C++ library by Susumu Yata & contributors.
License
Wrapper code is licensed under MIT License. Bundled marisa-trie C++ library is dual-licensed under LGPL and BSD 2-clause license.
CHANGES
0.5 (2013-05-07)
- BytesTrie.iterkeys, BytesTrie.iteritems, RecordTrie.iterkeys and RecordTrie.iteritems methods;
- wrapper is rebuilt with Cython 0.19;
- value_separator parameter for BytesTrie and RecordTrie.
0.4 (2013-02-28)
- improved trie building: weights optional parameter;
- improved trie building: unnecessary input sorting is removed;
- wrapper is rebuilt with Cython 0.18;
- bundled marisa-trie C++ library is updated to svn r133.
0.3.8 (2013-01-03)
- Rebuild wrapper with Cython pre-0.18;
- update benchmarks.
0.3.7 (2012-09-21)
- Update bundled marisa-trie C++ library (this may fix more mingw issues);
- Python 3.3 support is back.
0.3.6 (2012-09-05)
- much faster (3x-7x) .items() and .keys() methods for all tries; faster (up to 3x) .prefixes() method for Trie.
0.3.5 (2012-08-30)
- Pickling of RecordTrie is fixed (thanks lazarou for the report);
- error messages should become more useful.
0.3.4 (2012-08-29)
- Issues with mingw32 should be resolved (thanks Susumu Yata).
0.3.3 (2012-08-27)
- .get(key, default=None) method for BytesTrie and RecordTrie;
- small README improvements.
0.3.2 (2012-08-26)
- Small code cleanup;
- load, read and mmap methods returns 'self';
- I can't run tests (via tox) under Python 3.3 so it is removed from supported versions for now.
0.3.1 (2012-08-23)
- .prefixes() support for RecordTrie and BytesTrie.
0.3 (2012-08-23)
- RecordTrie and BytesTrie are introduced;
- IntTrie class is removed (probably temporary?);
- dumps/loads methods are renamed to tobytes/frombytes;
- benchmark & tests improvements;
- support for MARISA-trie config options is added.
0.2 (2012-08-19)
- Pickling/unpickling support;
- dumps/loads methods;
- python 3.3 workaround;
- improved tests;
- benchmarks.
0.1 (2012-08-17)
Initial release.