diff --git a/docs/about.md b/docs/about.md
new file mode 100644
index 0000000..4b51d64
--- /dev/null
+++ b/docs/about.md
@@ -0,0 +1 @@
+# About
diff --git a/docs/explanation.md b/docs/explanation.md
new file mode 100644
index 0000000..b1303b4
--- /dev/null
+++ b/docs/explanation.md
@@ -0,0 +1 @@
+# Explanation
diff --git a/docs/howto.md b/docs/howto.md
new file mode 100644
index 0000000..c00256d
--- /dev/null
+++ b/docs/howto.md
@@ -0,0 +1,46 @@
+# How-to Guides
+
+## Installation
+
+tantivy-py can be installed from [PyPI](https://pypi.org) using pip:
+
+    pip install tantivy
+
+If no binary wheel is available for your operating system, the bindings will be
+built from source, which means that Rust needs to be installed before building
+can succeed.
+
+Note that the bindings use [PyO3](https://github.com/PyO3/pyo3), which
+only supports python3.
+
+## Set up a development environment to work on tantivy-py itself
+
+A development environment can be set up either in a virtual environment using
+[`nox`](https://nox.thea.codes) or with local packages using the provided `Makefile`.
+
+For the `nox` setup, install the virtual environment and build the bindings using:
+
+    python3 -m pip install nox
+    nox
+
+For the `Makefile` based setup run:
+
+    make
+
+Running the tests is done using:
+
+    make test
+
+## Working on tantivy-py documentation
+
+Please be aware that this documentation is structured using the [Diátaxis](https://diataxis.fr/) framework. In very simple terms, this framework suggests the correct location for different kinds of documentation. Please make sure you gain a basic understanding of the goals of the framework before making large pull requests with new documentation.
+
+This documentation uses the [MkDocs](https://mkdocs.readthedocs.io/en/stable/) framework. This package is specified as an optional dependency in the `pyproject.toml` file.
To install all optional dev dependencies into your virtual env, run the following command:
+
+    pip install .[dev]
+
+The [MkDocs](https://mkdocs.readthedocs.io/en/stable/) documentation itself is comprehensive. MkDocs provides some additional context and help around [writing with markdown](https://mkdocs.readthedocs.io/en/stable/user-guide/writing-your-docs/#writing-with-markdown).
+
+If all you want to do is make a few edits right away, the documentation content is in the `/docs` directory and consists of [Markdown](https://www.markdownguide.org/) files, which can be edited with any text editor.
+
+The most efficient way to work is to run a MkDocs livereload server in the background. This will launch a local web server on your dev machine, serve the docs (by default at `http://localhost:8000`), and automatically reload the page after you save any changes to the documentation files.
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..9c01531
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,22 @@
+# Welcome to tantivy-py
+
+tantivy-py is a wrapper for the [tantivy](https://github.com/quickwit-oss/tantivy) full-text search engine, which is inspired by Apache Lucene.
+
+tantivy-py is [licensed](https://github.com/quickwit-oss/tantivy-py/blob/master/LICENSE) under the [MIT License](https://www.tldrlegal.com/license/mit-license).
+
+## Important links
+
+- [tantivy-py code repository](https://github.com/quickwit-oss/tantivy-py)
+- [tantivy code repository](https://github.com/quickwit-oss/tantivy)
+- [tantivy Documentation](https://docs.rs/crate/tantivy/latest)
+- [tantivy query language](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html#method.parse_query)
+
+## How to use this documentation
+
+This documentation uses the [Diátaxis](https://diataxis.fr/) framework.
The following sections are clearly separated:
+
+- [Tutorials](tutorials.md): when you want to learn
+- [How-to Guides](howto.md): when you need to accomplish a task
+- [Explanation](explanation.md): when you need a broader understanding and the thinking behind why certain things are set up in a particular way.
+- [Reference](reference.md): when you need precise, detailed information
+
diff --git a/docs/reference.md b/docs/reference.md
new file mode 100644
index 0000000..8ca4294
--- /dev/null
+++ b/docs/reference.md
@@ -0,0 +1,38 @@
+# Reference
+
+## Valid Query Formats
+
+tantivy-py supports the [query language](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html#method.parse_query) used in tantivy.
+A few basic query formats are shown below:
+
+ - AND and OR conjunctions.
+```python
+query = index.parse_query('(Old AND Man) OR Stream', ["title", "body"])
+(best_score, best_doc_address) = searcher.search(query, 3).hits[0]
+best_doc = searcher.doc(best_doc_address)
+```
+
+ - +(includes) and -(excludes) operators.
+```python
+query = index.parse_query('+Old +Man chef -fished', ["title", "body"])
+(best_score, best_doc_address) = searcher.search(query, 3).hits[0]
+best_doc = searcher.doc(best_doc_address)
+```
+Note: in a query like the one above, a word with no +/- acts like an OR.
+
+ - phrase search.
+```python
+query = index.parse_query('"eighty-four days"', ["title", "body"])
+(best_score, best_doc_address) = searcher.search(query, 3).hits[0]
+best_doc = searcher.doc(best_doc_address)
+```
+
+- integer search
+```python
+query = index.parse_query('1', ["doc_id"])
+(best_score, best_doc_address) = searcher.search(query, 3).hits[0]
+best_doc = searcher.doc(best_doc_address)
+```
+Note: for integer search, the integer field should be indexed.
+
+For more query formats and query options, see the [Tantivy Query Parser Docs](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html).
diff --git a/docs/tutorials.md b/docs/tutorials.md
new file mode 100644
index 0000000..1e07a89
--- /dev/null
+++ b/docs/tutorials.md
@@ -0,0 +1,83 @@
+# Tutorials
+
+## Building an index and populating it
+
+```python
+import tantivy
+
+# Declaring our schema.
+schema_builder = tantivy.SchemaBuilder()
+schema_builder.add_text_field("title", stored=True)
+schema_builder.add_text_field("body", stored=True)
+# indexed=True makes doc_id searchable, not just retrievable.
+schema_builder.add_integer_field("doc_id", stored=True, indexed=True)
+schema = schema_builder.build()
+
+# Creating our index (in memory)
+index = tantivy.Index(schema)
+```
+
+To have a persistent index, use the path
+parameter to store the index on the disk, e.g.:
+
+```python
+import os
+index = tantivy.Index(schema, path=os.getcwd() + '/index')
+```
+
+By default, tantivy offers the following tokenizers
+which can be used in tantivy-py:
+ - `default`
+`default` is the tokenizer that will be used if you do not
+ assign a specific tokenizer to your text field.
+ It splits your text on punctuation and whitespace,
+ removes tokens that are longer than 40 chars, and lowercases your text.
+
+- `raw`
+ Does not actually tokenize your text. It keeps it entirely unprocessed.
+ It can be useful to index uuids or urls, for instance.
+
+- `en_stem`
+
+ In addition to what `default` does, the `en_stem` tokenizer also
+ applies stemming to your tokens. Stemming consists of trimming words to
+ remove their inflection. This tokenizer is slower than the default one,
+ but is recommended to improve recall.
+
+To use the above tokenizers, simply provide them as a parameter to `add_text_field`, e.g.
+```python
+schema_builder.add_text_field("body", stored=True, tokenizer_name='en_stem')
+```
+
+## Adding one document.
+
+```python
+writer = index.writer()
+writer.add_document(tantivy.Document(
+    doc_id=1,
+    title=["The Old Man and the Sea"],
+    body=["""He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish."""],
+))
+# ... and committing
+writer.commit()
+```
+
+## Building and Executing Queries
+
+First you need to get a searcher for the index
+
+```python
+# Reload the index to ensure it points to the last commit.
+index.reload()
+searcher = index.searcher()
+```
+
+Then you need to get a valid query object by parsing your query on the index.
+
+```python
+query = index.parse_query("fish days", ["title", "body"])
+(best_score, best_doc_address) = searcher.search(query, 3).hits[0]
+best_doc = searcher.doc(best_doc_address)
+assert best_doc["title"] == ["The Old Man and the Sea"]
+print(best_doc)
+```
+
diff --git a/mkdocs.yml b/mkdocs.yml
new file mode 100644
index 0000000..2640f63
--- /dev/null
+++ b/mkdocs.yml
@@ -0,0 +1,15 @@
+site_name: tantivy-py
+# site_url: https://example.com
+nav:
+  - Home: index.md
+  - Tutorials: tutorials.md
+  - How-to Guides: howto.md
+  - Explanation: explanation.md
+  - Reference: reference.md
+  - About: about.md
+theme: readthedocs
+
+# Can nest documents under above sections
+# - 'User Guide':
+#     - 'Writing your docs': 'writing-your-docs.md'
+#     - 'Styling your docs': 'styling-your-docs.md'
diff --git a/pyproject.toml b/pyproject.toml
index aebdf75..d8db0e5 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -6,5 +6,11 @@ build-backend = "maturin"
 name = "tantivy"
 requires-python = ">=3.7"
 
+[project.optional-dependencies]
+dev = [
+    "nox",
+    "mkdocs",
+]
+
 [tool.maturin]
 bindings = "pyo3"
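
The `default` tokenizer behaviour described in `docs/tutorials.md` above (split on punctuation and whitespace, drop tokens longer than 40 characters, lowercase) can be approximated in plain Python to see what ends up in the index. This is only a sketch under stated assumptions: `default_like_tokens` and `MAX_TOKEN_CHARS` are hypothetical names that are not part of tantivy-py, and tantivy's real tokenizer may differ in details such as the exact length boundary.

```python
import re
from typing import List

# Assumption: the limit follows the wording in the docs
# ("removes tokens that are longer than 40 chars").
MAX_TOKEN_CHARS = 40


def default_like_tokens(text: str) -> List[str]:
    """Hypothetical helper mimicking the documented `default` tokenizer."""
    # Keep only runs of letters and digits, which splits the text on
    # punctuation and whitespace.
    pieces = re.findall(r"[^\W_]+", text)
    # Drop over-long tokens, then lowercase what is left.
    return [p.lower() for p in pieces if len(p) <= MAX_TOKEN_CHARS]


print(default_like_tokens("The Old Man and the Sea: eighty-four days!"))
```

Under this approximation `eighty-four` is split into two tokens, which is why the reference page quotes it as a phrase when searching for it.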