diff --git a/README.md b/README.md index fbb38a9..8a1ae8b 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,6 @@ tantivy-py Python bindings for [Tantivy](https://github.com/quickwit-oss/tantivy) the full-text search engine library written in Rust. - # Installation The bindings can be installed using from pypi using pip: @@ -18,170 +17,6 @@ If no binary wheel is present for your operating system the bindings will be build from source, this means that Rust needs to be installed before building can succeed. -Note that the bindings are using [PyO3](https://github.com/PyO3/pyo3), which -only supports python3. +# Documentation -# Development - -For compiling Python module: - -```bash -# create virtual env -python -m venv .venv -source .venv/bin/activate - -# install maturin, the build tool for PyO3 -pip install maturin - -# compile and install python module in venv -maturin develop -``` - - -Setting up a development environment can be done in a virtual environment using -[`nox`](https://nox.thea.codes) or using local packages using the provided `Makefile`. - -For the `nox` setup install the virtual environment and build the bindings using: - - python3 -m pip install nox - nox - -For the `Makefile` based setup run: - - make - -Running the tests is done using: - - make test - -# Usage - -The Python bindings have a similar API to Tantivy. To create a index first a schema -needs to be built. After that documents can be added to the index and a reader -can be created to search the index. - -## Building an index and populating it - -```python -import tantivy - -# Declaring our schema. -schema_builder = tantivy.SchemaBuilder() -schema_builder.add_text_field("title", stored=True) -schema_builder.add_text_field("body", stored=True) -schema_builder.add_integer_field("doc_id", stored=True, indexed=True) -schema = schema_builder.build() - -# Creating our index (in memory) -index = tantivy.Index(schema) -``` - -To have a persistent index, use the path -parameter to store the index on the disk, e.g: - -```python -import os - -index_path = os.path.abspath("index") -os.makedirs(index_path) -index = tantivy.Index(schema, path=index_path) -``` - -By default, tantivy offers the following tokenizers -which can be used in tantivy-py: - - `default` -`default` is the tokenizer that will be used if you do not - assign a specific tokenizer to your text field. - It will chop your text on punctuation and whitespaces, - removes tokens that are longer than 40 chars, and lowercase your text. - -- `raw` - Does not actual tokenizer your text. It keeps it entirely unprocessed. - It can be useful to index uuids, or urls for instance. - -- `en_stem` - - In addition to what `default` does, the `en_stem` tokenizer also - apply stemming to your tokens. Stemming consists in trimming words to - remove their inflection. This tokenizer is slower than the default one, - but is recommended to improve recall. - -to use the above tokenizers, simply provide them as a parameter to `add_text_field`. e.g. -```python -schema_builder.add_text_field("body", stored=True, tokenizer_name='en_stem') -``` - -### Adding one document. - -```python -writer = index.writer() -writer.add_document(tantivy.Document( - doc_id=1, - title=["The Old Man and the Sea"], - body=["""He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish."""], -)) -# ... and committing -writer.commit() -``` - - -## Building and Executing Queries - -First you need to get a searcher for the index - -```python -# Reload the index to ensure it points to the last commit. -index.reload() -searcher = index.searcher() -``` - -Then you need to get a valid query object by parsing your query on the index. - -```python -query = index.parse_query("fish days", ["title", "body"]) -(best_score, best_doc_address) = searcher.search(query, 3).hits[0] -best_doc = searcher.doc(best_doc_address) -assert best_doc["title"] == ["The Old Man and the Sea"] -print(best_doc) -``` - -### Valid Query Formats - -tantivy-py supports the query language used in tantivy. -Some basic query Formats. - - - - AND and OR conjunctions. -```python -query = index.parse_query('(Old AND Man) OR Stream', ["title", "body"]) -(best_score, best_doc_address) = searcher.search(query, 3).hits[0] -best_doc = searcher.doc(best_doc_address) -print(best_doc) -``` - - - +(includes) and -(excludes) operators. -```python -query = index.parse_query('+Old +Man chef -fished', ["title", "body"]) -hits = searcher.search(query, 3).hits -print(len(hits)) -``` -Note: in a query like above, a word with no +/- acts like an OR. - - - phrase search. -```python -query = index.parse_query('"eighty-four days"', ["title", "body"]) -(best_score, best_doc_address) = searcher.search(query, 3).hits[0] -best_doc = searcher.doc(best_doc_address) -print(best_doc) -``` - -- integer search -```python -query = index.parse_query("1", ["doc_id"]) -(best_score, best_doc_address) = searcher.search(query, 3).hits[0] -best_doc = searcher.doc(best_doc_address) -print(best_doc) -``` -Note: for integer search, the integer field should be indexed. - -For more possible query formats and possible query options, see [Tantivy Query Parser Docs.](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html) +Please see [the documentation](https://tantivy-py.readthedocs.io/en/latest/) for more information.