# https://github.com/xhluca/bm25s Project Manual

Generated at: 2026-06-28 11:33:48 UTC

## Table of Contents

- [Overview, Installation, and Corpus Format](#page-1)
- [Core BM25 API: Scoring, Variants, and Retrieval](#page-2)
- [Tokenization, Indexing, and Persistence](#page-3)
- [Backends, Integrations, and CLI/MCP](#page-4)

<a id='page-1'></a>

## Overview, Installation, and Corpus Format

### Related Pages

Related topics: [Core BM25 API: Scoring, Variants, and Retrieval](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/xhluca/bm25s/blob/main/README.md)
- [setup.py](https://github.com/xhluca/bm25s/blob/main/setup.py)
- [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py)
- [bm25s/high_level/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py)
- [bm25s/high_level/README.md](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/README.md)
- [bm25s/terminal/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/terminal/__init__.py)
- [bm25s/hf.py](https://github.com/xhluca/bm25s/blob/main/bm25s/hf.py)
- [examples/save_and_reload_end_to_end.py](https://github.com/xhluca/bm25s/blob/main/examples/save_and_reload_end_to_end.py)
- [examples/index_nq.py](https://github.com/xhluca/bm25s/blob/main/examples/index_nq.py)
</details>

# Overview, Installation, and Corpus Format

## What is bm25s?

`bm25s` is a Python library that implements the BM25 ranking function for text retrieval. BM25 is the same algorithm used by search engines such as Elasticsearch to score documents against a query. The library positions itself as a fast, dependency-light alternative to other Python BM25 implementations such as `rank-bm25`. According to the project description, `bm25s` is implemented in pure Python and uses SciPy sparse matrices to store eagerly computed scores, which the authors claim yields "orders of magnitude" speedup at query time. Source: [README.md]().

The library has no required dependencies beyond NumPy, and adds SciPy, stemming (`PyStemmer`), and `numba` only as optional extras. The README explicitly states that there are "no dependencies on Java or PyTorch". Source: [README.md]().

The high-level companion package, published on PyPI as `BM25`, wraps `bm25s` and exposes a 1-line API. The `bm25s/high_level` module is described as a "simple to use search interface, enabling 1-line indexing and 1-line searching", and by default enables numba compilation, stemming, and stopword removal. Source: [bm25s/high_level/README.md]().

### Intended Audience

The library targets Python developers who need lexical retrieval as part of a larger pipeline (RAG, search services, evaluation harnesses) and want a pure-Python implementation that is easy to install. The repository also ships an MCP server so an existing BM25 index can be exposed as a tool to LLM agents. Source: [README.md]().

## Installation

The minimal install requires only NumPy, as declared in `setup.py`:

```python
install_requires=['numpy'],
python_requires=">=3.8",
```

Source: [setup.py](https://github.com/xhluca/bm25s/blob/main/setup.py). The `bm25` console entry point is registered in the same file, which makes the CLI available after installation.

### Install Variants

The README documents several install profiles. The basic form is `pip install "bm25s=={version}"`, while richer installs use bracketed extras such as `bm25s[core]`, `bm25s[full]`, and `bm25s[cli]`. The `core` extra is required for examples that load BEIR-format datasets, such as `examples/index_nq.py`. Source: [README.md](), [examples/index_nq.py]().

For users who want the simplest on-ramp, the project recommends the higher-level `BM25` PyPI package, which pulls in `bm25s` plus `PyStemmer` and `rich` automatically. Source: [bm25s/high_level/README.md]().

For MCP server use, the README documents a dedicated extra:

```bash
uv pip install "bm25s[mcp]"
# or locally
uv pip install -e ".[mcp]"
```

Source: [README.md]().

### CLI Installation

The CLI is exposed as the `bm25` command. The terminal module's docstring shows the intended usage pattern: `bm25 index documents.csv -o my_index` and `bm25 search -i my_index "what is machine learning?"`. Source: [bm25s/terminal/__init__.py]().

## Corpus Format

The corpus is the collection of documents that `bm25s` will index. The library accepts several input shapes, which is a frequent source of confusion. Issue #158 ("Format of corpus and API reference") specifically calls out an inconsistency between the documentation and the quickstart regarding whether a corpus should be a list of dictionaries or a list of strings. Source: [bm25s/__init__.py](), [README.md]().

### Accepted Shapes

The input validation in `bm25s/__init__.py`, applied when a corpus is passed to `index()`, enumerates the supported corpus types: a list of lists of tokens, an object exposing `ids` and `vocab` attributes, or a two-element tuple holding the per-document token IDs and the vocabulary dictionary. When the input is a generic iterable, the inner type is inspected: if every element is an `int`, the corpus is interpreted as pre-tokenized IDs; otherwise it is treated as a list of token lists. Source: [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py).

In practice, the most common shapes are:

| Shape | Meaning |
|---|---|
| `List[str]` | One document per string. Strings can later be tokenized with `bm25s.tokenize` or a custom `Tokenizer`. |
| `List[Dict]` or `List[Tuple]` | Structured documents preserved through retrieval; useful for keeping `id`, `title`, and `text` together. |
| `List[List[str]]` | Pre-tokenized documents; the tokenizer is skipped. |
| Tuple `(token_ids, vocab)` | Compact pre-tokenized form: per-document integer token IDs plus the vocabulary dictionary. |

Source: [bm25s/__init__.py](), [examples/save_and_reload_end_to_end.py]().
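The type dispatch described at the start of this subsection can be sketched in a few lines. This is a minimal illustration rather than the library's code: the function name `infer_corpus_kind`, its return labels, and the stand-in `Tokenized` namedtuple are hypothetical; only the three-way check (object with `ids`/`vocab`, two-element tuple, generic iterable with an `int` test) follows the description.

```python
from collections import namedtuple
from collections.abc import Iterable

# Hypothetical stand-in for a tokenizer result that exposes `ids` and `vocab`.
Tokenized = namedtuple("Tokenized", ["ids", "vocab"])


def infer_corpus_kind(corpus):
    """Classify an index-time corpus argument (illustrative labels only)."""
    # Checked first, because a namedtuple is also a two-element tuple.
    if hasattr(corpus, "ids") and hasattr(corpus, "vocab"):
        return "object"
    if isinstance(corpus, tuple) and len(corpus) == 2:
        return "tuple"
    if isinstance(corpus, Iterable) and not isinstance(corpus, (str, bytes)):
        docs = list(corpus)
        # All-int documents are read as pre-tokenized IDs, anything else as token strings.
        if docs and all(isinstance(tok, int) for doc in docs for tok in doc):
            return "token_ids"
        return "tokens"
    raise ValueError("unsupported corpus type")
```

Checking for the `ids`/`vocab` attributes before the tuple test matters in this sketch: otherwise a namedtuple result would be misread as a plain tuple.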

### File-Based Loading

For users starting from a file, `bm25s.high_level` provides a `load` helper. The implementation branches on the file extension and supports `.txt`, `.json`, `.jsonl`, and `.csv`. For structured formats the caller can pass a `document_column` argument to select which key (JSON) or column (CSV) holds the document text; when omitted, the first key/column is used. Source: [bm25s/high_level/__init__.py]().

The high-level `index` function pairs with `load` to produce a ready-to-search `BM25Search` object:

```python
import bm25s.high_level as bm25
corpus = bm25.load("tests/data/dummy.csv", document_column="text")
retriever = bm25.index(corpus)
results = retriever.search(["your query here"], k=5)
```

Source: [README.md]().
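To make the extension-based branching concrete, here is a standard-library sketch of a loader with the same shape. It is not the implementation of `bm25s.high_level.load`: the name `load_documents` is hypothetical, and it assumes that a `.json` file holds a list of objects and that blank lines in `.txt` and `.jsonl` files are skipped.

```python
import csv
import json
from pathlib import Path


def load_documents(path, document_column=None):
    """Return a list of document strings from a .txt, .json, .jsonl, or .csv file."""
    path = Path(path)
    ext = path.suffix.lower()

    def pick(record):
        # Use the requested key/column, else fall back to the first one.
        key = document_column if document_column is not None else next(iter(record))
        return record[key]

    if ext == ".txt":
        with path.open(encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if line.strip()]
    if ext == ".jsonl":
        with path.open(encoding="utf-8") as f:
            return [pick(json.loads(line)) for line in f if line.strip()]
    if ext == ".json":
        with path.open(encoding="utf-8") as f:
            return [pick(record) for record in json.load(f)]
    if ext == ".csv":
        with path.open(encoding="utf-8", newline="") as f:
            return [pick(row) for row in csv.DictReader(f)]
    raise ValueError(f"unsupported file extension: {ext}")
```

Falling back to the first key or column mirrors the documented behaviour when `document_column` is omitted.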

### How the Corpus Is Saved

When a structured corpus is persisted to disk, the save path in `bm25s/__init__.py` normalizes each entry: strings become `{"id": i, "text": doc}` dictionaries; dictionaries, lists, and tuples are kept as-is; other types trigger a warning and are skipped. The serialized form is JSONL, and a companion memory-mapped index (`corpus.mmindex`) is written to enable fast random access on reload. Source: [bm25s/__init__.py]().
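The normalization rules above are easy to state in code. The following is a sketch under stated assumptions, not the library's save path: `normalize_corpus` and `write_jsonl` are hypothetical names, and the memory-mapped `corpus.mmindex` companion file is not reproduced.

```python
import json
import warnings


def normalize_corpus(corpus):
    """Yield JSON-serializable entries following the rules described above."""
    for i, doc in enumerate(corpus):
        if isinstance(doc, str):
            # Plain strings are wrapped so every saved entry carries an id.
            yield {"id": i, "text": doc}
        elif isinstance(doc, (dict, list, tuple)):
            yield doc
        else:
            warnings.warn(f"skipping entry {i}: unsupported type {type(doc).__name__}")


def write_jsonl(corpus, fileobj):
    """Write one JSON document per line."""
    for entry in normalize_corpus(corpus):
        fileobj.write(json.dumps(entry) + "\n")
```

One consequence worth noticing: a corpus of plain strings does not round-trip as strings, because each string is stored as a dictionary.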

```mermaid
flowchart LR
  A[Raw files<br/>csv / json / jsonl / txt] --> B[bm25s.high_level.load]
  C["List[str]"] --> D{bm25s.BM25 constructor}
  E["List[Dict]"] --> D
  F["List[List[str]]"] --> D
  G["(token_ids, vocab)"] --> D
  B --> D
  D --> E2[Retriever.index<br/>+ Tokenizer]
  E2 --> F2[Sparse CSC score matrix]
  F2 --> G2[retriever.retrieve / retriever.search]
```

### Loading a Saved Index

The end-to-end example demonstrates the reload path. After `retriever.save(...)` and `tokenizer.save_vocab(...)`, the index is restored with `bm25s.BM25.load(save_dir, load_corpus=True)`, after which the stored corpus entries come back through `retriever.retrieve(..., corpus=...)`. Documents that were plain strings come back as `{"id": i, "text": doc}` dictionaries, because of the normalization applied at save time. The example also shows the need to instantiate a new `Tokenizer` and call `load_vocab` before re-tokenizing queries. Source: [examples/save_and_reload_end_to_end.py](https://github.com/xhluca/bm25s/blob/main/examples/save_and_reload_end_to_end.py).

## Common Pitfalls

- **Tokenization is your responsibility for the low-level API.** `BM25.index` expects an already-tokenized corpus. Use `bm25s.tokenize` or a `Tokenizer` instance first, or use the high-level `bm25s.high_level.index` helper, which handles tokenization for you. Source: [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py), [bm25s/high_level/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py).
- **The Windows `resource` warning.** As reported in issues #178 and #186, importing `bm25s` on Windows used to log `resource module not available on Windows` to stdout, which broke MCP health checks. This was downgraded from `warning` to `debug` level in release 0.3.9. Source: [README.md]() (release notes section).
- **Hugging Face Hub usage.** For sharing or loading pre-built indexes, the `BM25HF` wrapper in `bm25s.hf` provides `load_from_hub` and `save_to_hub` methods; consult the module-level docstring for the canonical pair of API calls. Source: [bm25s/hf.py](https://github.com/xhluca/bm25s/blob/main/bm25s/hf.py).

## See Also

- Tokenization, scoring variants (Robertson, ATIRE, BM25L, BM25+, Lucene), and the eager scoring model — covered in the README's "Variants" section. Source: [README.md]().
- The CLI surface (`bm25 index`, `bm25 search`, interactive index picker) — see [bm25s/terminal/__init__.py]().
- High-level wrapper for 1-line indexing — see [bm25s/high_level/README.md]() and [bm25s/high_level/__init__.py]().
- End-to-end save/reload pattern — see [examples/save_and_reload_end_to_end.py]().
- Indexing large corpora (e.g. Natural Questions, 2M documents) — see [examples/index_nq.py]().

---

<a id='page-2'></a>

## Core BM25 API: Scoring, Variants, and Retrieval

### Related Pages

Related topics: [Tokenization, Indexing, and Persistence](#page-3), [Backends, Integrations, and CLI/MCP](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py)
- [README.md](https://github.com/xhluca/bm25s/blob/main/README.md)
- [bm25s/high_level/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py)
- [bm25s/high_level/README.md](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/README.md)
- [bm25s/terminal/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/terminal/__init__.py)
- [bm25s/hf.py](https://github.com/xhluca/bm25s/blob/main/bm25s/hf.py)
- [examples/index_with_metadata.py](https://github.com/xhluca/bm25s/blob/main/examples/index_with_metadata.py)
- [examples/save_and_reload_end_to_end.py](https://github.com/xhluca/bm25s/blob/main/examples/save_and_reload_end_to_end.py)
- [setup.py](https://github.com/xhluca/bm25s/blob/main/setup.py)
</details>

# Core BM25 API: Scoring, Variants, and Retrieval

## Overview and Scope

The `bm25s` library exposes its core retrieval capability through the `bm25s.BM25` class, located in [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py). Its design goal, as stated in the [README.md](https://github.com/xhluca/bm25s/blob/main/README.md), is to provide an "ultrafast implementation of BM25 in pure Python, powered by Numpy" that pre-computes per-token scores into a sparse matrix, allowing query-time scoring to be reduced to sparse matrix operations. This eager scoring strategy is the central performance claim of the library, distinguishing it from lazy implementations such as `rank-bm25`.

The API is intentionally small: a constructor that captures BM25 hyperparameters, an `index()` method that consumes pre-tokenized input, and a `retrieve()` method that returns ranked results. A separate high-level facade, `BM25Search` in [bm25s/high_level/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py), wraps this class with sensible defaults for one-line indexing.

## BM25 Scoring Variants

`bm25s` supports five scoring variants selectable via the `method` argument. The README documents these variants citing [Kamphuis et al. 2020](https://link.springer.com/chapter/10.1007/978-3-030-45442-5_4): Robertson (the original, with `idf>=0` enforced), ATIRE, BM25L, BM25+, and Lucene. The default is `method="lucene"`, matching Lucene's BM25 implementation exactly.

| Variant | `method` value | Extra parameter | Notes |
|---|---|---|---|
| Original Robertson | `"robertson"` | none | IDF clamped to non-negative; `k1`, `b` configurable |
| ATIRE | `"atire"` | none | Variant with modified IDF behavior |
| BM25L | `"bm25l"` | `delta` (default 0.5) | Designed for better handling of long documents |
| BM25+ | `"bm25+"` | `delta` (default 0.5) | Adds a lower bound (`delta`) to the term-frequency component so that matches in long documents are not over-penalized |
| Lucene | `"lucene"` | none | Default; matches Elasticsearch scoring |

The IDF scorer can be chosen independently of the term-frequency scorer. Source: [README.md:74-92](https://github.com/xhluca/bm25s/blob/main/README.md). The constructor also exposes `k1`, `b`, `delta`, `idf_method`, `backend` (`"numpy"` or `"numba"`), and `csc_backend` (`"scipy"` or `"numpy"`) parameters; see [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py) for the full signature.

The scoring pipeline begins with `_select_idf_scorer(self.idf_method)` to compute IDF values, followed by `_build_scores_and_indices_for_matrix` which iterates over each document-token pair and computes the per-token BM25 contribution. Source: [bm25s/__init__.py:243-275](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py).
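For orientation, the sketch below writes out two of the IDF components and two of the term-frequency components in the textbook forms given by Kamphuis et al. It is not a copy of the library's scorers: the function names are hypothetical, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen from the commonly recommended range.

```python
import math


def idf_robertson(n_docs, df):
    # The original IDF goes negative for very common terms, so clamp at zero.
    return max(0.0, math.log((n_docs - df + 0.5) / (df + 0.5)))


def idf_lucene(n_docs, df):
    # Adding 1 inside the log keeps the value strictly positive.
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))


def tf_lucene(tf, doc_len, avg_len, k1=1.5, b=0.75):
    # Saturating term frequency with document-length normalization.
    return tf / (tf + k1 * (1.0 - b + b * doc_len / avg_len))


def tf_bm25plus(tf, doc_len, avg_len, k1=1.5, b=0.75, delta=0.5):
    # For a term that occurs (tf > 0), the contribution is at least delta.
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return (k1 + 1.0) * tf / (norm + tf) + delta


def score_term(tf, doc_len, avg_len, n_docs, df):
    """Lucene-style BM25 contribution of one (document, term) pair."""
    return idf_lucene(n_docs, df) * tf_lucene(tf, doc_len, avg_len)
```

Because each contribution depends only on the document and the term, never on the query, all of them can be computed once at index time.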

## Indexing and Sparse Matrix Construction

The `index()` method accepts three corpus representations, dispatched via `_infer_corpus_object`: raw token lists, a `(unique_token_ids, vocab_dict)` tuple, or a `Tokenized` object exposing `ids` and `vocab` attributes. When `corpus_tokens` is provided without an explicit vocabulary, the vocabulary dictionary is built automatically from the tokens. Source: [bm25s/__init__.py:298-355](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py).

Eager score computation proceeds in two steps:

1. Per-token BM25 contributions are flattened into `(scores_flat, doc_idx, vocab_idx)` triples.
2. These triples are assembled into a CSC (Compressed Sparse Column) sparse matrix of shape `(n_docs, n_vocab)`, where each column holds the BM25 score of one vocabulary term across all documents.
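The two steps above can be sketched with plain Python lists. This is an illustration of the CSC layout and of query-time summation only; `build_csc` and `score_query` are hypothetical names, and the real implementation works on NumPy or SciPy arrays rather than lists.

```python
def build_csc(triples, n_vocab):
    """Assemble (score, doc_idx, vocab_idx) triples into CSC arrays."""
    # Sort by column (vocabulary index), then by row (document index).
    ordered = sorted(triples, key=lambda t: (t[2], t[1]))
    data = [score for score, _, _ in ordered]
    indices = [doc for _, doc, _ in ordered]
    # indptr[j]:indptr[j + 1] delimits the entries of column j.
    indptr = [0] * (n_vocab + 1)
    for _, _, col in ordered:
        indptr[col + 1] += 1
    for j in range(n_vocab):
        indptr[j + 1] += indptr[j]
    return data, indices, indptr


def score_query(query_token_ids, data, indices, indptr, n_docs):
    """Sum the precomputed column of each query token into per-document scores."""
    scores = [0.0] * n_docs
    for tok in query_token_ids:
        for pos in range(indptr[tok], indptr[tok + 1]):
            scores[indices[pos]] += data[pos]
    return scores
```

Storing one column per vocabulary term means a query only touches the columns of its own tokens, which is why the column-oriented format suits this access pattern.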

```mermaid
flowchart LR
    A[corpus_token_ids] --> B[_build_scores_and_indices_for_matrix]
    B --> C[scores_flat, doc_idx, vocab_idx]
    C --> D{csc_backend}
    D -->|scipy| E[sp.csc_matrix]
    D -->|numpy| F[np_csc assembly]
    E --> G[scores dict: data, indices, indptr, num_docs, vocab]
    F --> G
    G --> H[retrieve]
```

The `csc_backend` parameter controls whether SciPy or a custom NumPy CSC builder is used; both produce the same `{"data", "indices", "indptr", "num_docs", "vocab"}` structure consumed by `retrieve()`. Source: [bm25s/__init__.py:282-296](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py).

## Retrieval API and Corpus Format

The `retrieve()` method accepts a tokenized query, an optional `corpus` parameter for displaying results, and a `k` parameter for the top-k cutoff. The `return_as` argument controls output shape: `"tuple"` returns `(documents, scores)`, while `"documents"` returns only the documents as a numpy array of shape `(n_queries, k)`. Source: [README.md:106-128](https://github.com/xhluca/bm25s/blob/main/README.md).

A frequent point of confusion, raised in community issue [#158](https://github.com/xhluca/bm25s/issues/158), is that the BM25 constructor accepts either a list of strings or a list of dictionaries as its `corpus` argument. The `corpus` is purely for display purposes — indexing is always done positionally by `retriever.index(corpus_tokens)`. As demonstrated in [examples/index_with_metadata.py](https://github.com/xhluca/bm25s/blob/main/examples/index_with_metadata.py), you can pass a list of `{"text": ..., "metadata": ...}` dictionaries as the `corpus` parameter to the BM25 constructor while indexing the tokenized text separately, then retrieve dictionaries as results.
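The positional nature of the `corpus` argument is easiest to see in a toy top-k routine. The sketch below is an assumption-laden illustration, not the `retrieve()` implementation: `top_k` is a hypothetical helper that works on one query's score list and uses `heapq` for selection.

```python
import heapq


def top_k(scores, k, corpus=None):
    """Return (results, top_scores) for one query, best first."""
    k = min(k, len(scores))
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    top_scores = [scores[i] for i in best]
    # Without a corpus the results are document indices; with one, the
    # entries at those positions are returned unchanged (strings, dicts, ...).
    results = best if corpus is None else [corpus[i] for i in best]
    return results, top_scores
```

Since the corpus is only indexed by position, any list whose order matches the indexed documents can be used for display, including a list of dictionaries that carry metadata.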

The full end-to-end save and reload flow, including tokenizer state, is shown in [examples/save_and_reload_end_to_end.py](https://github.com/xhluca/bm25s/blob/main/examples/save_and_reload_end_to_end.py).

## Higher-Level Wrappers

Three convenience layers sit above the core API:

- `BM25Search` in [bm25s/high_level/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py) applies English stemming via PyStemmer, default stopword removal, and lowercasing, and uses `backend="numba"` with `auto_compile=False`. Its README describes numba acceleration, stemming, and stopword removal as enabled by default. Source: [bm25s/high_level/__init__.py:8-46](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py).
- The CLI entry point, registered as `console_scripts` in [setup.py](https://github.com/xhluca/bm25s/blob/main/setup.py), exposes `bm25 index` and `bm25 search` subcommands backed by [bm25s/terminal/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/terminal/__init__.py), which supports `.csv`, `.txt`, `.json`, and `.jsonl` input files and, when `-u` is passed, stores indices under `~/.bm25s/indices/`.
- `BM25HF` in [bm25s/hf.py](https://github.com/xhluca/bm25s/blob/main/bm25s/hf.py) extends the core class with `load_from_hub` and `save_to_hub` methods for Hugging Face Hub persistence.

## See Also

- [Tokenization, Indexing, and Persistence](#page-3): tokenization pipeline and vocabulary management
- Eager Sparse Scoring: Internals and Performance
- BEIR Benchmarking and Evaluation Utilities

---

<a id='page-3'></a>

## Tokenization, Indexing, and Persistence

### Related Pages

Related topics: [Core BM25 API: Scoring, Variants, and Retrieval](#page-2), [Backends, Integrations, and CLI/MCP](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/xhluca/bm25s/blob/main/README.md)
- [examples/save_and_reload_end_to_end.py](https://github.com/xhluca/bm25s/blob/main/examples/save_and_reload_end_to_end.py)
- [bm25s/high_level/README.md](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/README.md)
- [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py)
- [bm25s/high_level/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py)
- [bm25s/terminal/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/terminal/__init__.py)
- [examples/index_nq.py](https://github.com/xhluca/bm25s/blob/main/examples/index_nq.py)
- [bm25s/scoring.py](https://github.com/xhluca/bm25s/blob/main/bm25s/scoring.py)
</details>

# Tokenization, Indexing, and Persistence

The `bm25s` library turns raw text into a searchable BM25 index through three cooperating stages: **tokenization** (text to token ids), **indexing** (token ids to a sparse scoring matrix), and **persistence** (saving and reloading that matrix along with vocabulary and corpus metadata). This page documents the supported input formats, the configurable knobs in each stage, and the on-disk layout produced by `BM25.save` / `BM25.load`.

## Tokenization

### Corpus Format

The `BM25` constructor and `bm25s.high_level.index` accept multiple corpus shapes, which is a frequent source of confusion (see issue #158). According to the docstring of the corpus-detection helper in `bm25s/__init__.py`, a corpus can be:

- A **list of dictionaries** (e.g. `{"id": i, "text": doc}`); string entries are converted into `{"id": i, "text": doc}` when saved.
- An object exposing `ids` and `vocab` attributes (the `Tokenized` namedtuple).
- A tuple `(token_ids, vocab_dict)` of per-document token IDs and the vocabulary dictionary.
- A list of token id lists (ints) or a list of token string lists.

Source: [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py)

The high-level wrapper also loads plain files: `bm25s.high_level.load` supports `.txt` (one doc per line), `.json`, `.jsonl`, and `.csv`, with an optional `document_column` argument that selects which key/column becomes the document text. Source: [bm25s/high_level/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py)

### Tokenizer Configuration

The `Tokenizer` class in `bm25s.tokenization` accepts a `splitter`, a `stemmer` callable, a stopword list (or the string `"english"`), a `lower` flag, and a vocabulary dictionary. The high-level API instantiates an English stemmer by default and applies both stopword removal and lowercasing:

```python
tokenizer_kwargs_default = dict(stemmer=stemmer, stopwords="english", lower=True)
```

Source: [bm25s/high_level/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py)
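How these options combine can be shown with a toy tokenizer. This is a sketch, not the `Tokenizer` class: `make_tokenizer` is a hypothetical factory, and the order of operations (lowercase, split, remove stopwords, stem, assign IDs) is an assumption made for illustration.

```python
def make_tokenizer(splitter=str.split, stemmer=None, stopwords=(), lower=True):
    """Build a tokenize(texts) function that returns (token_ids, vocab)."""
    stop = set(stopwords)

    def tokenize(texts):
        vocab = {}
        all_ids = []
        for text in texts:
            if lower:
                text = text.lower()
            tokens = [t for t in splitter(text) if t not in stop]
            if stemmer is not None:
                tokens = [stemmer(t) for t in tokens]
            # Assign the next free integer ID to each unseen token.
            all_ids.append([vocab.setdefault(t, len(vocab)) for t in tokens])
        return all_ids, vocab

    return tokenize
```

For example, `make_tokenizer(stopwords={"the"}, stemmer=lambda t: t.rstrip("s"))` maps "The cats sat" and "the cat" onto a shared ID for "cat", which is the effect stemming and lowercasing are meant to have on recall.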

For very large corpora, the README recommends using the tokenizer in **generator mode**, which yields one document at a time so that memory stays bounded; the resulting token IDs and vocabulary are then passed to `BM25.index`. Source: [README.md](https://github.com/xhluca/bm25s/blob/main/README.md)
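The idea behind generator-mode tokenization can be sketched independently of the library. In the example below, `stream_tokenize` is a hypothetical function (not the tokenizer's own streaming method) that consumes any iterable of lines lazily and updates a shared vocabulary in place.

```python
def stream_tokenize(lines, vocab, splitter=str.split):
    """Yield one list of token IDs per document, growing `vocab` in place."""
    for line in lines:
        # Only the current line is held in memory while it is converted.
        yield [vocab.setdefault(tok, len(vocab)) for tok in splitter(line.lower())]
```

Because both the input and the output are lazy, `lines` can be an open file handle, and nothing is tokenized until the consumer asks for the next document.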

## Indexing

### Index Construction

Once documents are tokenized, `BM25.index(corpus_tokens)` builds the BM25 statistics. Internally, `bm25s/scoring.py` walks the corpus once to count document frequencies per unique token, then assembles an **eager sparse CSC matrix** of precomputed per-token BM25 scores. This design is what makes query-time scoring fast: answering a query only requires summing the stored scores of its tokens.

Source: [bm25s/scoring.py](https://github.com/xhluca/bm25s/blob/main/bm25s/scoring.py)

### Variants and Parameters

The `BM25` constructor exposes `k1`, `b`, `delta`, `method`, `idf_method`, `dtype`, `int_dtype`, `backend`, `csc_backend`, and `auto_compile`. Supported scoring variants are `robertson`, `atire`, `bm25l`, `bm25+`, and `lucene` (the default). The `delta` parameter only affects `bm25l` and `bm25+`; the README notes Lucene's exact BM25 implementation is the default and recommends `k1` between 1.2 and 2.0 with `b=0.75`.

Source: [README.md](https://github.com/xhluca/bm25s/blob/main/README.md) and [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py)

### Backends

Two retrieval backends are available: `numpy` (the default, dependency-light) and `numba` (JIT-compiled, ~2× faster on larger datasets per the v0.2.0 release notes). Setting `backend="auto"` selects numba when installed. The `csc_backend` parameter chooses between `numpy` and `scipy` for building the sparse matrix; the README notes that once numba is active, the CSC-backend choice is negligible. Source: [README.md](https://github.com/xhluca/bm25s/blob/main/README.md)
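A selection rule of this kind can be sketched with the standard library alone. The function below is hypothetical and only illustrates the "use numba when installed" idea; how the library itself reacts when `numba` is requested but missing is not stated here, so the `ImportError` branch is an assumption.

```python
import importlib.util


def resolve_backend(backend="auto"):
    """Map a requested backend name to the one that would actually run."""
    has_numba = importlib.util.find_spec("numba") is not None
    if backend == "auto":
        return "numba" if has_numba else "numpy"
    if backend not in ("numpy", "numba"):
        raise ValueError(f"unknown backend: {backend!r}")
    if backend == "numba" and not has_numba:
        # Assumed behaviour for this sketch: fail loudly instead of falling back.
        raise ImportError("backend='numba' was requested but numba is not installed")
    return backend
```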

## Persistence

### Saving an Index

`BM25.save(save_dir)` writes a directory containing, at minimum, `params.index.json` (used by the CLI to recognize a valid index, see `bm25s/terminal/__init__.py`) and the sparse matrix files. The companion `Tokenizer.save_vocab(save_dir)` writes the vocabulary mapping; `Tokenizer.save_stopwords(save_dir)` writes the stopword list. The end-to-end example demonstrates the full sequence:

```python
retriever.save("bm25s_index_readme")
tokenizer.save_vocab(save_dir="bm25s_index_readme")
```

Source: [examples/save_and_reload_end_to_end.py](https://github.com/xhluca/bm25s/blob/main/examples/save_and_reload_end_to_end.py)

For larger workloads such as Natural Questions, the same pattern is used and `bm25s.utils.benchmark.get_max_memory_usage()` reports peak memory. Source: [examples/index_nq.py](https://github.com/xhluca/bm25s/blob/main/examples/index_nq.py)

### Reloading

```python
reloaded_retriever = bm25s.BM25.load("bm25s_index_readme", load_corpus=True)
reloaded_tokenizer = Tokenizer(splitter=lambda x: x.split())
reloaded_tokenizer.load_vocab("bm25s_index_readme")
```

The tokenizer must be reconstructed with the **same splitter** used at index time; otherwise the tokens produced for a query will not match the entries in the saved vocabulary, and their IDs will not line up with the index. Source: [examples/save_and_reload_end_to_end.py](https://github.com/xhluca/bm25s/blob/main/examples/save_and_reload_end_to_end.py)

### CLI and the User Directory

The CLI (entry point: `bm25`) exposes indexing, searching, and MCP launch. Without `-o`, indices default to `<filename>_index` next to the source file; the `-u` flag instead saves to `~/.bm25s/indices/<name>`. The CLI validates an index by checking for the presence of `params.index.json`. Source: [bm25s/terminal/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/terminal/__init__.py)
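The output-location rules can be summarized in a small function. This is a hypothetical sketch, not the CLI's code: it assumes that `<filename>` and `<name>` both mean the source file's stem and that an explicit `-o` takes precedence over `-u`.

```python
from pathlib import Path


def resolve_index_dir(source_file, output=None, user=False, home=None):
    """Pick where an index is written, following the rules described above."""
    source = Path(source_file)
    if output is not None:
        # Assumption: an explicit output path wins over the user directory.
        return Path(output)
    if user:
        base = Path(home) if home is not None else Path.home()
        return base / ".bm25s" / "indices" / source.stem
    # Default: a sibling directory named after the source file.
    return source.with_name(source.stem + "_index")
```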

```mermaid
flowchart LR
  A[Raw text files] --> B[bm25 index / Tokenizer.tokenize]
  B --> C[Tokenized ids + vocab]
  C --> D[BM25.index → sparse CSC]
  D --> E[save: params + matrix + vocab]
  E --> F[BM25.load + Tokenizer.load_vocab]
  F --> G[retrieve / search]
```

## Common Failure Modes

- **Corpus shape mismatch**: passing a list of plain strings to `BM25(corpus=...)` works at retrieval time but is serialized as `{"id": i, "text": doc}` when saved; consumers expecting the original string will see a dict instead (community issue #158).
- **Splitter mismatch on reload**: a reloaded tokenizer must reuse the exact splitter, or token ids will not round-trip.
- **Stale vocab on query tokenization**: tokenize queries with `tokenize(..., update_vocab=False)` so that unknown query tokens are not silently added to the vocabulary, where they would receive IDs that the index has never seen. Source: [examples/save_and_reload_end_to_end.py](https://github.com/xhluca/bm25s/blob/main/examples/save_and_reload_end_to_end.py)
- **Windows / MCP health checks**: the `resource` module is Unix-only; importing `bm25s` on Windows previously emitted a `logger.warning` that broke MCP health probes. This was downgraded to `logger.debug` in release 0.3.9 (issues #178, #186; PR #187).
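The splitter and vocabulary pitfalls in the list above can be reproduced with a toy tokenizer. The function below is a hypothetical stand-in for the library's tokenizer; only the `update_vocab` flag name is taken from the text, and dropping unknown tokens when the vocabulary is frozen is an assumption of this sketch.

```python
def tokenize(texts, vocab, splitter=str.split, update_vocab=True):
    """Map texts to token IDs; unknown tokens are dropped when the vocab is frozen."""
    out = []
    for text in texts:
        ids = []
        for tok in splitter(text.lower()):
            if tok not in vocab:
                if not update_vocab:
                    continue  # unseen query token: it cannot match any document
                vocab[tok] = len(vocab)
            ids.append(vocab[tok])
        out.append(ids)
    return out
```

With a vocabulary built from "a cat" and "a dog", the query "a bird cat" tokenized with `update_vocab=False` yields only the IDs of "a" and "cat", and the vocabulary keeps its original size. Tokenizing "a,cat" with the default whitespace splitter yields nothing at all, which is the splitter mismatch in miniature.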

## See Also

- README — quickstart, variants, and CLI examples. Source: [README.md](https://github.com/xhluca/bm25s/blob/main/README.md)
- High-level wrapper (`BM25Search`) — one-line indexing and searching. Source: [bm25s/high_level/README.md](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/README.md) and [bm25s/high_level/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py)
- Scoring internals — sparse matrix construction. Source: [bm25s/scoring.py](https://github.com/xhluca/bm25s/blob/main/bm25s/scoring.py)
- MCP server — exposing an index as an LLM tool. Source: [README.md](https://github.com/xhluca/bm25s/blob/main/README.md)

---

<a id='page-4'></a>

## Backends, Integrations, and CLI/MCP

### Related Pages

Related topics: [Core BM25 API: Scoring, Variants, and Retrieval](#page-2), [Tokenization, Indexing, and Persistence](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py)
- [bm25s/cli.py](https://github.com/xhluca/bm25s/blob/main/bm25s/cli.py)
- [bm25s/hf.py](https://github.com/xhluca/bm25s/blob/main/bm25s/hf.py)
- [bm25s/high_level/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py)
- [bm25s/terminal/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/terminal/__init__.py)
- [bm25s/high_level/README.md](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/README.md)
- [README.md](https://github.com/xhluca/bm25s/blob/main/README.md)
- [setup.py](https://github.com/xhluca/bm25s/blob/main/setup.py)
- [examples/save_and_reload_end_to_end.py](https://github.com/xhluca/bm25s/blob/main/examples/save_and_reload_end_to_end.py)
- [examples/index_nq.py](https://github.com/xhluca/bm25s/blob/main/examples/index_nq.py)
</details>

# Backends, Integrations, and CLI/MCP

This page documents the backends, external integrations, command-line interface, and Model Context Protocol (MCP) server that surround the core `bm25s` retrieval library. While the `BM25` class itself focuses on scoring, the layers covered here determine *where* scores are computed, *how* indices are persisted or shared, and *which* user-facing surfaces (CLI, MCP, Hugging Face Hub) are exposed. Together they turn a fast scoring kernel into a tool that can be driven from a terminal, served to an LLM agent, or distributed via the Hub.

## 1. Computation Backends

The `BM25` constructor and the high-level wrapper accept a `backend` parameter that selects the implementation used for score computation. The high-level helper in [`bm25s/high_level/__init__.py`](https://github.com/xhluca/bm25s/blob/main/bm25s/high_level/__init__.py) applies the following defaults when building a retriever:

```python
bm25_kwargs_default = dict(
    backend="numba", csc_backend="numpy", auto_compile=False
)
```

Source: [bm25s/high_level/__init__.py:24-26]()

The `backend="numba"` selection enables the JIT-compiled scoring path that provides roughly 2x speedup on larger datasets, as noted in the project README. The `csc_backend="numpy"` parameter controls the CSC (compressed sparse column) matrix representation used to store eagerly precomputed term scores; switching it changes how sparse matrix operations are dispatched but does not change the BM25 formula. An `auto_compile=False` flag prevents silent recompilation between runs, which keeps startup behavior predictable when an index is loaded repeatedly (for example, inside the CLI or the MCP server). Source: [README.md](https://github.com/xhluca/bm25s/blob/main/README.md) and [bm25s/high_level/__init__.py:24-26]()

The high-level API also forces `Stemmer.Stemmer("english")` and English stopwords by default, and currently raises `NotImplementedError` for any other language. Source: [bm25s/high_level/__init__.py:18-21]()

## 2. Hugging Face Hub Integration

`bm25s` ships a thin wrapper class `BM25HF` for loading and saving indices to the Hugging Face Hub. Source: [bm25s/hf.py](https://github.com/xhluca/bm25s/blob/main/bm25s/hf.py)

The integration exposes two main entry points:

- `BM25HF.load_from_hub("{username}/{repo_name}")` — loads a previously published index.
- `BM25HF.save_to_hub(...)` — publishes a local index.

After loading, retrieval proceeds via the standard `retriever.retrieve(bm25s.tokenize(query), k=3)` interface, so the Hub integration is a thin I/O shim around the core `BM25` class. Source: [bm25s/hf.py](https://github.com/xhluca/bm25s/blob/main/bm25s/hf.py).

Users who want to ship a prebuilt retrieval index alongside a model card on the Hub can use this path to avoid re-indexing on the consumer side. The `examples/retrieve_from_hf.py` example referenced in the README demonstrates loading an index alongside its corpus. Source: [README.md](https://github.com/xhluca/bm25s/blob/main/README.md)

## 3. Command-Line Interface

The CLI is registered as a console-script entry point named `bm25` in [`setup.py`](https://github.com/xhluca/bm25s/blob/main/setup.py):

```python
entry_points={
    "console_scripts": [
        "bm25=bm25s.cli:main",
    ],
},
```

Source: [setup.py:7-9]()

The entry point resolves to `bm25s.cli.main`, which uses `argparse` to dispatch to subcommands. Source: [bm25s/cli.py](https://github.com/xhluca/bm25s/blob/main/bm25s/cli.py). The currently registered top-level subcommands are:

| Subcommand | Purpose |
|------------|---------|
| `index` | Build an index from a CSV, TXT, JSON, or JSONL file |
| `search` | Query a previously built index |
| `mcp` | Launch the MCP server (see Section 4) |

The `index` subcommand accepts a positional `file`, optional `-o/--output`, `-c/--column` (for CSV/JSON/JSONL), and `-u/--user` (which stores the index under `~/.bm25s/indices/`). Source: [bm25s/cli.py](https://github.com/xhluca/bm25s/blob/main/bm25s/cli.py). The interactive `bm25 search -u` mode calls `select_index_interactive()` from [`bm25s/terminal/__init__.py`](https://github.com/xhluca/bm25s/blob/main/bm25s/terminal/__init__.py), which treats any directory under `~/.bm25s/indices/` that contains a `params.index.json` file as a valid index and lists it. Source: [bm25s/terminal/__init__.py:18-30]()
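The validity check used for the index picker amounts to a directory scan. The sketch below shows one way to write it with `pathlib`; `list_valid_indices` is a hypothetical name, and the interactive selection itself is not reproduced.

```python
from pathlib import Path


def list_valid_indices(indices_root):
    """Return sorted names of subdirectories that contain params.index.json."""
    root = Path(indices_root)
    if not root.is_dir():
        return []
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "params.index.json").is_file()
    )
```

Calling it with `Path.home() / ".bm25s" / "indices"` would list the candidates for the user directory described above.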

```mermaid
flowchart LR
    A[Corpus file or list] --> B[bm25s.index or BM25]
    B -- backend selection --> C[Numba / NumPy / SciPy]
    B --> D[Index on disk]
    D --> E[BM25.load or save]
    D --> F[bm25 CLI: index or search]
    D --> G[BM25HF Hub I/O]
    D --> H[bm25 mcp launch]
    H --> I[MCP clients / LLM agents]
```

## 4. MCP Server and Common Pitfalls

`bm25s` ships a built-in Model Context Protocol (MCP) server so an index can be exposed to LLM agents and other MCP-compatible clients. The server is launched through the CLI:

```bash
uv pip install "bm25s[mcp]"
bm25 mcp launch --port 8000 --index-dir /path/to/your/index
```

Source: [README.md](https://github.com/xhluca/bm25s/blob/main/README.md) and [bm25s/cli.py](https://github.com/xhluca/bm25s/blob/main/bm25s/cli.py)

The `mcp launch` subcommand accepts two flags: `-p/--port` (optional, default `8000`) and `-d/--index-dir` (required). Source: [bm25s/cli.py](https://github.com/xhluca/bm25s/blob/main/bm25s/cli.py)
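
A nested command such as `mcp launch` with these two flags could be declared in `argparse` roughly as follows. This is a sketch under assumptions, not the contents of `bm25s/cli.py`: the `build_parser` function, the `dest` names, and the help strings are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring only the `bm25 mcp launch` command."""
    parser = argparse.ArgumentParser(prog="bm25")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # `mcp` is itself a command group with its own sub-subcommands.
    mcp = subcommands.add_parser("mcp", help="MCP server commands")
    mcp_actions = mcp.add_subparsers(dest="action", required=True)

    launch = mcp_actions.add_parser("launch", help="Launch the MCP server")
    launch.add_argument("-p", "--port", type=int, default=8000)
    launch.add_argument("-d", "--index-dir", required=True)
    return parser


# Omitting --port falls back to the default; omitting --index-dir is an error.
args = build_parser().parse_args(
    ["mcp", "launch", "--index-dir", "/path/to/your/index"]
)
print(args.command, args.action, args.port, args.index_dir)
```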

**Community-reported pitfalls:**

- **Windows resource module warning.** Prior to v0.3.9, importing `bm25s` on Windows emitted `resource module not available on Windows` to stdout (not just the log stream), which broke MCP health checks on Windows. The v0.3.9 release downgraded this message to debug level. Source: community issues #178 and #186, and the v0.3.9 changelog in [README.md](https://github.com/xhluca/bm25s/blob/main/README.md)
- **Corpus format ambiguity.** Issue #158 notes that the `__init__.py` docstring and the quickstart disagree on whether the corpus is a list of dicts or a list of strings; the high-level `bm25s.index` accepts both shapes and the underlying `_get_corpus_type` helper disambiguates based on the first element. Source: [bm25s/__init__.py](https://github.com/xhluca/bm25s/blob/main/bm25s/__init__.py)
- **Attribution.** The scoring approach and API shape originate from `bm25_pt` (issue #80); when describing internals, this lineage is worth acknowledging.
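
The Windows pitfall above comes down to how a missing optional module is reported. The sketch below shows the general pattern of logging at debug level instead of printing, so stdout stays clean for tools that parse it. It is an illustration only, not the code used by `bm25s`; the helper `import_optional` and the logger name are hypothetical.

```python
import importlib
import io
import logging
from contextlib import redirect_stdout

logger = logging.getLogger("optional_import_demo")


def import_optional(module_name: str):
    """Import a module if available; otherwise log at DEBUG and return None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # DEBUG messages are hidden unless the application opts in to them.
        logger.debug("%s module not available on this platform", module_name)
        return None


# Capture stdout to confirm that a failed optional import prints nothing.
buffer = io.StringIO()
with redirect_stdout(buffer):
    missing = import_optional("module_that_does_not_exist_demo")

printed = buffer.getvalue()
```

With default logging configuration the debug message is suppressed, so a client that reads the process's stdout sees no unexpected text.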

## See Also

- Tokenization and the `Tokenizer` class
- Memory-mapped retrieval (`mmap` mode) and batched reloading
- BM25 variants (`robertson`, `atire`, `bm25l`, `bm25+`, `lucene`)
- High-level `BM25` package on PyPI for a 1-line API

---

## Pitfall Log

Project: xhluca/bm25s

Summary: Found 11 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/xhluca/bm25s/issues/184

## 2. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/xhluca/bm25s/issues/178

## 3. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/xhluca/bm25s/issues/173

## 4. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/xhluca/bm25s/issues/186

## 5. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/xhluca/bm25s

## 6. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/xhluca/bm25s/issues/158

## 7. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/xhluca/bm25s

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/xhluca/bm25s

## 9. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/xhluca/bm25s

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/xhluca/bm25s

## 11. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/xhluca/bm25s

