At its core, setix provides a "set intersection index": an inverted index data structure designed for storing sets
of symbols and quickly finding the stored sets that intersect a given set, with results sorted by the number of
shared symbols or by a similarity measure.
Additionally, ``setix.trgm`` provides a wrapper for indexing strings; it implements a trigram index compatible
with the PostgreSQL extension pg_trgm.
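
The idea of an inverted index over sets can be illustrated with a few lines of plain Python. This is only a minimal sketch for intuition, not setix's actual implementation: the class ``TinySetIndex`` and its internals are hypothetical, and its ``find`` returns a plain list rather than the result object used in the examples.

```python
from collections import Counter, defaultdict


class TinySetIndex:
    """Toy inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        self._sets = []                    # set id -> stored tuple
        self._postings = defaultdict(set)  # symbol -> ids of sets containing it

    def add(self, symbols):
        set_id = len(self._sets)
        self._sets.append(tuple(symbols))
        for symbol in set(symbols):
            self._postings[symbol].add(set_id)

    def find(self, symbols, threshold=1):
        # Count, for every stored set, how many query symbols it shares.
        hits = Counter()
        for symbol in set(symbols):
            for set_id in self._postings.get(symbol, ()):
                hits[set_id] += 1
        # Keep sets that meet the threshold, highest score first.
        results = [(n, self._sets[i]) for i, n in hits.items() if n >= threshold]
        results.sort(key=lambda r: (-r[0], r[1]))
        return results


ix = TinySetIndex()
ix.add((1, 2, 3))
ix.add((1, 2, 4))
ix.add((2, 3, 4))
print(ix.find((1, 2), 1))
# prints [(2, (1, 2, 3)), (2, (1, 2, 4)), (1, (2, 3, 4))]
```

Only the posting lists of the query symbols are visited, so stored sets that share nothing with the query are never touched.
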
Examples
Using a set index:

.. code-block:: python

    import setix

    ix = setix.SetIntersectionIndex()
    ix.add((1, 2, 3))
    ix.add((1, 2, 4))
    ix.add((2, 3, 4))

    ix.find((1, 2), 1).get_list()
    # returns [(2, [(1, 2, 3)]),
    #          (2, [(1, 2, 4)]),
    #          (1, [(2, 3, 4)])]
    # (the order of the first two results can change as they have equal scores)

Using a trigram index:

.. code-block:: python

    import setix.trgm

    ix = setix.trgm.TrigramIndex()
    ix.add("strength")
    ix.add("strenght")
    ix.add("strength and honor")

    ix.find("stremgth", threshold=1).get_list()
    # returns [(6, ["strength and honor"]),
    #          (6, ["strength"]),
    #          (4, ["strenght"])]

    ix.find_similar("stremgth", threshold=0.1).get_list()
    # returns [(0.5, ["strength"]),             # 6 intersections / (9 total + 9 total - 6)
    #          (0.29, ["strenght"]),            # 4 intersections / (9 total + 9 total - 4)
    #          (0.27, ["strength and honor"])]  # 6 intersections / (9 total + 19 total - 6)

In general, to search for phrases containing a misspelt word, a threshold of ``-3*N`` can be given, where N is the
number of misspellings (a single substituted character changes at most three trigrams).

.. code-block:: python

    ix.find("stremgth", threshold=-3).get_list()
    # returns [(6, ["strength and honor"]),
    #          (6, ["strength"])]
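
The intersection counts and similarity values quoted in the comments above can be reproduced by hand. The sketch below assumes pg_trgm-style trigram extraction (lowercase each word, pad it with two leading spaces and one trailing space) and restricts words to ASCII letters and digits for simplicity. The helpers ``trigrams`` and ``similarity`` are hypothetical names written for this illustration; they are not part of setix.

```python
import re


def trigrams(text):
    """Return the set of pg_trgm-style trigrams of ``text`` (simplified sketch)."""
    result = set()
    # Split into lowercase alphanumeric words; pad each word with two
    # leading spaces and one trailing space before slicing.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by the size of the union of both trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


query = trigrams("stremgth")
for phrase in ("strength", "strenght", "strength and honor"):
    shared = len(query & trigrams(phrase))
    print(phrase, shared, round(similarity("stremgth", phrase), 2))
# prints:
# strength 6 0.5
# strenght 4 0.29
# strength and honor 6 0.27

# One substituted letter ("n" -> "m") changes exactly the three trigrams
# that contain it, which motivates allowing 3 misses per misspelling.
print(sorted(trigrams("strength") - query))
# prints ['eng', 'ngt', 'ren']
```

Note that the transposition in "strenght" disturbs more than three trigrams, which is why it shares only 4 trigrams with the query and falls outside the ``-3`` threshold.
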