doc: describe the merge policy (#227)

2024-03-22 12:21:35 +01:00 · 2024-03-22 12:21:35 +01:00 · def60143a2
parent e9363e71d8
commit def60143a2
1 changed files with 28 additions and 0 deletions
--- a/docs/explanation.md
+++ b/docs/explanation.md
@@ -1 +1,29 @@
 # Explanation
+
+## Merge policy
+
+When documents are added to a tantivy index, the indexed data is recorded in multiple
+sections called _segments_. For more detail, see [Life of a Segment](https://github.com/quickwit-oss/tantivy/wiki/Life-of-a-Segment)
+on the [tantivy wiki on GitHub](https://github.com/quickwit-oss/tantivy/wiki).
+
+Currently, tantivy-py does not offer a way to customize the merge policy. Fortunately,
+the default merge policy is the [`LogMergePolicy`](https://docs.rs/tantivy/latest/tantivy/merge_policy/struct.LogMergePolicy.html),
+which is a good choice for most use cases. It is aliased as the [`DefaultMergePolicy`](https://docs.rs/tantivy/latest/tantivy/merge_policy/type.DefaultMergePolicy.html).
+
+Segment merging is performed in background threads. After adding documents to an index,
+it is important to allow time for those threads to complete their merges. To do so, call
+`writer.wait_merging_threads()` as the final step after adding data. This method
+consumes the writer, so the `writer` object can no longer be used afterwards.
+
+Here is a short description of the steps in pseudocode:
+
+```
+schema = Schema(...)
+index = Index(schema)
+writer = index.writer()
+for ... in data:
+    document = Document(...)
+    writer.add_document(document)
+writer.commit()
+writer.wait_merging_threads()
+```
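
As a rough illustration of what a logarithmic merge policy such as the `LogMergePolicy` mentioned above does, the following Python sketch groups segments by the base-2 logarithm of their document count and proposes a merge for any group that has collected enough segments. It is a simplified model for intuition, not tantivy's implementation: the function names and the parameter values (`level_log_size`, `min_layer_size`, `min_num_segments`) are assumptions chosen for this example.

```python
import math


def group_into_levels(doc_counts, level_log_size=0.75, min_layer_size=10_000):
    """Group segment sizes into levels of similar log2 size, largest first.

    All parameter values here are illustrative assumptions.
    """
    levels = []
    current_max = None
    for count in sorted(doc_counts, reverse=True):
        # Tiny segments are clamped so that they all share the lowest level.
        log_size = math.log2(max(count, min_layer_size))
        if current_max is None or log_size < current_max - level_log_size:
            # Too small for the current level: start a new one.
            current_max = log_size
            levels.append([])
        levels[-1].append(count)
    return levels


def merge_candidates(doc_counts, min_num_segments=3, **kwargs):
    """Return the levels that hold enough segments to be worth merging."""
    return [
        level
        for level in group_into_levels(doc_counts, **kwargs)
        if len(level) >= min_num_segments
    ]


# Hypothetical document counts for seven segments.
segments = [1_000_000, 900_000, 20_000, 18_000, 15_000, 500, 300]
print(merge_candidates(segments))  # [[20000, 18000, 15000]]
```

In this model only the three mid-sized segments are merged. The two large segments and the two tiny ones stay as they are until more segments of a similar size accumulate, which is why a policy of this kind keeps the segment count bounded without repeatedly rewriting the largest segments.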