bm25s Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

bm25s

Fast BM25 search in Python, powered by Numpy and Numba

Overview, Installation, and Corpus Format

Related topics: Core BM25 API: Scoring, Variants, and Retrieval

What is bm25s?

bm25s is a Python library that implements the BM25 ranking function for text retrieval. BM25 is the same algorithm used by search engines such as Elasticsearch to score documents against a query. The library positions itself as a fast, dependency-light alternative to other Python BM25 implementations such as rank-bm25. According to the project description, bm25s is implemented in pure Python and uses SciPy sparse matrices to store eagerly computed scores, which the authors claim yields "orders of magnitude" speedup at query time. Source: README.md.

The library has no required dependencies beyond NumPy, and adds SciPy, stemming (PyStemmer), and numba only as optional extras. The README explicitly states that there are "no dependencies on Java or PyTorch". Source: README.md.

The high-level companion package, published on PyPI as BM25, wraps bm25s and exposes a 1-line API. The bm25s/high_level module is described as a "simple to use search interface, enabling 1-line indexing and 1-line searching", and by default enables numba compilation, stemming, and stopword removal. Source: bm25s/high_level/README.md.

Intended Audience

The library targets Python developers who need lexical retrieval as part of a larger pipeline (RAG, search services, evaluation harnesses) and want a pure-Python implementation that is easy to install. The repository also ships an MCP server so an existing BM25 index can be exposed as a tool to LLM agents. Source: README.md.

Installation

The minimal install requires only NumPy, as declared in setup.py:

install_requires=['numpy'],
python_requires=">=3.8",

Source: setup.py. The bm25 console entry point is registered in the same file, which is what makes the CLI work after installation. Source: setup.py.

Install Variants

The README documents several install profiles. The basic form is pip install "bm25s=={version}", while richer installs use bracketed extras such as bm25s[core], bm25s[full], and bm25s[cli]. The core extra is required for examples that load BEIR-format datasets, such as examples/index_nq.py. Source: README.md, examples/index_nq.py.

For users who want the simplest on-ramp, the project recommends the higher-level BM25 PyPI package, which pulls in bm25s plus PyStemmer and rich automatically. Source: bm25s/high_level/README.md.

For MCP server use, the README documents a dedicated extra:

uv pip install "bm25s[mcp]"
# or locally
uv pip install -e ".[mcp]"

Source: README.md.

CLI Installation

The CLI is exposed as the bm25 command. The terminal module's docstring shows the intended usage pattern: bm25 index documents.csv -o my_index and bm25 search -i my_index "what is machine learning?". Source: bm25s/terminal/__init__.py.

Corpus Format

The corpus is the collection of documents that bm25s will index. The library accepts several input shapes, which is a frequent source of confusion. Issue #158 ("Format of corpus and API reference") specifically calls out an inconsistency between the documentation and the quickstart regarding whether a corpus should be a list of dictionaries or a list of strings. Source: bm25s/__init__.py, README.md.

Accepted Shapes

The constructor-level validation in bm25s/__init__.py enumerates the supported corpus types: a list of lists of tokens, an object exposing ids and vocab attributes, or a tuple of two lists (unique token IDs and a per-document list of token IDs). When the input is a generic iterable, the constructor inspects the inner type: if every element is an int, the corpus is interpreted as pre-tokenized IDs; otherwise it is treated as a list of token lists. Source: bm25s/__init__.py.

In practice, the most common shapes are:

Shape	Meaning
`List[str]`	One document per string. Strings can later be tokenized with `bm25s.tokenize` or a custom `Tokenizer`.
`List[Dict]` or `List[Tuple]`	Structured documents preserved through retrieval; useful for keeping `id`, `title`, and `text` together.
`List[List[str]]`	Pre-tokenized documents; the tokenizer is skipped.
Tuple `(unique_ids, per_doc_ids)`	Compact pre-tokenized form using integer IDs.

Source: bm25s/__init__.py, examples/save_and_reload_end_to_end.py.

File-Based Loading

For users starting from a file, bm25s.high_level provides a load helper. The implementation branches on the file extension and supports .txt, .json, .jsonl, and .csv. For structured formats the caller can pass a document_column argument to select which key (JSON) or column (CSV) holds the document text; when omitted, the first key/column is used. Source: bm25s/high_level/__init__.py.
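The extension-based branching described above can be sketched with the standard library alone. This is a minimal illustration, not the bm25s implementation: the function name load_documents is hypothetical, the .txt branch assumes one document per line, and the .json branch assumes the file holds a list of objects.

```python
import csv
import json
from pathlib import Path


def load_documents(path, document_column=None):
    """Hypothetical stand-in for a file loader; not the bm25s implementation."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".txt":
        # Assumption: one document per non-empty line.
        with path.open(encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if line.strip()]

    if suffix == ".csv":
        with path.open(encoding="utf-8", newline="") as f:
            rows = list(csv.DictReader(f))
    elif suffix == ".jsonl":
        with path.open(encoding="utf-8") as f:
            rows = [json.loads(line) for line in f if line.strip()]
    elif suffix == ".json":
        with path.open(encoding="utf-8") as f:
            rows = json.load(f)  # Assumption: a list of objects.
    else:
        raise ValueError(f"Unsupported file extension: {suffix}")

    if not rows:
        return []
    # Fall back to the first key/column when no column is given.
    column = document_column if document_column is not None else next(iter(rows[0]))
    return [row[column] for row in rows]
```

Keeping the column fallback in one place means the CSV and JSON branches share the same selection rule.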

The high-level index function pairs with load to produce a ready-to-search BM25Search object:

import bm25s.high_level as bm25
corpus = bm25.load("tests/data/dummy.csv", document_column="text")
retriever = bm25.index(corpus)
results = retriever.search(["your query here"], k=5)

Source: README.md.

How the Corpus Is Saved

When a structured corpus is persisted to disk, the save path in bm25s/__init__.py normalizes each entry: strings become {"id": i, "text": doc} dictionaries; dictionaries, lists, and tuples are kept as-is; other types trigger a warning and are skipped. The serialized form is JSONL, and a companion memory-mapped index (corpus.mmindex) is written to enable fast random access on reload. Source: bm25s/__init__.py.
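The normalization rules above can be illustrated with a short, self-contained sketch. The helper names normalize_corpus_for_save and to_jsonl are hypothetical and exist only for this example; the real save path also writes the memory-mapped index, which is not modeled here.

```python
import json
import warnings


def normalize_corpus_for_save(corpus):
    """Hypothetical helper mirroring the normalization described above."""
    normalized = []
    for i, doc in enumerate(corpus):
        if isinstance(doc, str):
            # Plain strings are wrapped so every line has an id and a text.
            normalized.append({"id": i, "text": doc})
        elif isinstance(doc, (dict, list, tuple)):
            normalized.append(doc)
        else:
            warnings.warn(
                f"Skipping corpus entry {i}: unsupported type {type(doc).__name__}"
            )
    return normalized


def to_jsonl(entries):
    # One JSON value per line; tuples serialize as JSON arrays.
    return "\n".join(json.dumps(entry) for entry in entries)


corpus = ["a cat", {"id": "x", "text": "a dog"}, 42]
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # the integer entry triggers a warning
    entries = normalize_corpus_for_save(corpus)
jsonl_text = to_jsonl(entries)
```

Note that a string entry does not round-trip as a string: after saving and reloading, a consumer sees the wrapped dictionary.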

flowchart LR
  A[Raw files<br/>csv / json / jsonl / txt] --> B[bm25s.high_level.load]
  C["List[str]"] --> D{bm25s.BM25 constructor}
  E["List[Dict]"] --> D
  F["List[List[str]]"] --> D
  G["(unique_ids, per_doc_ids)"] --> D
  B --> D
  D --> E2[Retriever.index<br/>+ Tokenizer]
  E2 --> F2[Sparse CSC score matrix]
  F2 --> G2[retriever.retrieve / retriever.search]

Loading a Saved Index

The end-to-end example demonstrates the reload path. After retriever.save(...) and tokenizer.save_vocab(...), the index is restored with bm25s.BM25.load(save_dir, load_corpus=True), after which the original strings (or dictionaries) come back through retriever.retrieve(..., corpus=...). The example also shows the need to instantiate a new Tokenizer and call load_vocab before re-tokenizing queries. Source: examples/save_and_reload_end_to_end.py.

Common Pitfalls

Tokenization is your responsibility for the low-level API. The BM25 constructor expects an already-tokenized corpus in most modes. Use bm25s.tokenize or a Tokenizer instance first, or use the index helper in bm25s.high_level, which handles tokenization for you. Source: bm25s/__init__.py, bm25s/high_level/__init__.py.
The Windows resource warning. As reported in issues #178 and #186, importing bm25s on Windows used to log resource module not available on Windows to stdout, which broke MCP health checks. This was downgraded from warning to debug level in release 0.3.9. Source: README.md (release notes section).
Hugging Face Hub usage. For sharing or loading pre-built indexes, the BM25HF wrapper in bm25s.hf provides load_from_hub and save_to_hub methods; consult the module-level docstring for the canonical pair of API calls. Source: bm25s/hf.py.

Core BM25 API: Scoring, Variants, and Retrieval

Related topics: Tokenization, Indexing, and Persistence, Backends, Integrations, and CLI/MCP

Overview and Scope

The bm25s library exposes its core retrieval capability through the bm25s.BM25 class, located in bm25s/__init__.py. Its design goal, as stated in the README.md, is to provide an "ultrafast implementation of BM25 in pure Python, powered by Numpy" that pre-computes per-token scores into a sparse matrix, allowing query-time scoring to be reduced to sparse matrix operations. This eager scoring strategy is the central performance claim of the library, distinguishing it from lazy implementations such as rank-bm25.

The API is intentionally small: a constructor that captures BM25 hyperparameters, an index() method that consumes pre-tokenized input, and a retrieve() method that returns ranked results. A separate high-level facade, BM25Search in bm25s/high_level/__init__.py, wraps this class with sensible defaults for one-line indexing.

BM25 Scoring Variants

bm25s supports five scoring variants selectable via the method argument. The README documents these variants citing Kamphuis et al. 2020: Robertson (the original, with idf>=0 enforced), ATIRE, BM25L, BM25+, and Lucene. The default is method="lucene", matching Lucene's BM25 implementation exactly.

Variant	`method` value	Extra parameter	Notes
Original Robertson	`"robertson"`	none	IDF clamped to non-negative; `k1`, `b` configurable
ATIRE	`"atire"`	none	Variant with modified IDF behavior
BM25L	`"bm25l"`	`delta` (default 0.5)	Designed for better handling of long documents
BM25+	`"bm25+"`	`delta` (default 0.5)	Adds a lower bound `delta` to the term-frequency component so long documents that contain a query term are not over-penalized
Lucene	`"lucene"`	none	Default; matches Elasticsearch scoring

The IDF scorer can be chosen independently of the term-frequency scorer. Source: README.md:74-92. The constructor also exposes k1, b, delta, idf_method, backend ("numpy" or "numba"), and csc_backend ("scipy" or "numpy") parameters; see bm25s/__init__.py for the full signature.

The scoring pipeline begins with _select_idf_scorer(self.idf_method) to compute IDF values, followed by _build_scores_and_indices_for_matrix which iterates over each document-token pair and computes the per-token BM25 contribution. Source: bm25s/__init__.py:243-275.
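To make the two stages concrete, here is a minimal pure-Python sketch of a per-token BM25 contribution. It assumes the Lucene formulation as commonly stated in the literature (for example, Kamphuis et al. 2020); the function names are hypothetical, the k1 and b defaults are chosen for illustration, and the library's internal scorers may differ in detail.

```python
import math


def idf_lucene(df, n_docs):
    # Lucene-style IDF; always positive because of the "1 +" inside the log.
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


def tf_component(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    # Saturating term-frequency component with document-length normalization.
    return tf / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))


def score_term(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    # Contribution of one (document, token) pair to the final score.
    return idf_lucene(df, n_docs) * tf_component(tf, doc_len, avg_doc_len, k1, b)


# A token that appears twice in an average-length document and occurs
# in 1 of 3 documents overall.
example = score_term(tf=2, df=1, n_docs=3, doc_len=10, avg_doc_len=10)
```

Because each contribution depends only on the document and the token, not on the query, it can be computed once at index time.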

Indexing and Sparse Matrix Construction

The index() method accepts three corpus representations, dispatched via _infer_corpus_object: raw token lists, a (unique_token_ids, vocab_dict) tuple, or a Tokenized object exposing ids and vocab attributes. When corpus_tokens is provided without an explicit vocabulary, the vocabulary dictionary is built automatically from the tokens. Source: bm25s/__init__.py:298-355.

Eager score computation proceeds in two steps:

Per-token BM25 contributions are flattened into (scores_flat, doc_idx, vocab_idx) triples.
These triples are assembled into a CSC (Compressed Sparse Column) sparse matrix of shape (n_docs, n_vocab), where each column holds the BM25 score of one vocabulary term across all documents.
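The second step can be sketched without NumPy or SciPy. The function build_csc below is a hypothetical illustration of how (score, document, vocabulary) triples map onto the three CSC arrays; it is not the library's builder.

```python
def build_csc(scores_flat, doc_idx, vocab_idx, n_vocab):
    """Assemble (score, row, column) triples into CSC arrays (illustrative)."""
    # Count how many entries fall in each column (vocabulary term).
    counts = [0] * n_vocab
    for col in vocab_idx:
        counts[col] += 1

    # indptr[c] .. indptr[c + 1] delimits the entries of column c.
    indptr = [0] * (n_vocab + 1)
    for col in range(n_vocab):
        indptr[col + 1] = indptr[col] + counts[col]

    data = [0.0] * len(scores_flat)
    indices = [0] * len(scores_flat)
    next_slot = indptr[:-1].copy()  # next free position in each column
    for score, row, col in zip(scores_flat, doc_idx, vocab_idx):
        slot = next_slot[col]
        data[slot] = score
        indices[slot] = row
        next_slot[col] += 1
    return {"data": data, "indices": indices, "indptr": indptr}


# Three documents, two vocabulary terms.
csc = build_csc(
    scores_flat=[0.4, 0.9, 0.2],
    doc_idx=[0, 2, 1],
    vocab_idx=[0, 0, 1],
    n_vocab=2,
)
```

Storing scores column by column means that all documents containing a given term sit in one contiguous slice, which is the access pattern a query needs.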

flowchart LR
    A[corpus_token_ids] --> B[_build_scores_and_indices_for_matrix]
    B --> C[scores_flat, doc_idx, vocab_idx]
    C --> D{backend}
    D -->|scipy| E[sp.csc_matrix]
    D -->|numpy| F[np_csc assembly]
    E --> G[scores dict: data, indices, indptr, num_docs, vocab]
    F --> G
    G --> H[retrieve]

The csc_backend parameter controls whether SciPy or a custom NumPy CSC builder is used; both produce the same {"data", "indices", "indptr", "num_docs", "vocab"} structure consumed by retrieve(). Source: bm25s/__init__.py:282-296.

Retrieval API and Corpus Format

The retrieve() method accepts a tokenized query, an optional corpus parameter for displaying results, and a k parameter for the top-k cutoff. The return_as argument controls output shape: "tuple" returns (documents, scores), while "documents" returns only the documents as a numpy array of shape (n_queries, k). Source: README.md:106-128.

A frequent point of confusion, raised in community issue #158, is that the BM25 constructor accepts either a list of strings or a list of dictionaries as its corpus argument. The corpus is purely for display purposes — indexing is always done positionally by retriever.index(corpus_tokens). As demonstrated in examples/index_with_metadata.py, you can pass a list of {"text": ..., "metadata": ...} dictionaries as the corpus parameter to the BM25 constructor while indexing the tokenized text separately, then retrieve dictionaries as results.

The full end-to-end save and reload flow, including tokenizer state, is shown in examples/save_and_reload_end_to_end.py.

Higher-Level Wrappers

Three convenience layers sit above the core API:

BM25Search in bm25s/high_level/__init__.py enforces English stemming via PyStemmer, applies default stopword removal and lowercasing, and uses backend="numba" with auto_compile=False. It is documented as requiring numba for speedup, stemming, and stopword removal. Source: bm25s/high_level/__init__.py:8-46.
The CLI entry point, registered as console_scripts in setup.py, exposes bm25 index and bm25 search subcommands backed by bm25s/terminal/__init__.py, which supports .csv, .txt, .json, and .jsonl input files and stores indices under ~/.bm25s/indices/.
BM25HF in bm25s/hf.py extends the core class with load_from_hub and save_to_hub methods for Hugging Face Hub persistence.

Tokenization, Indexing, and Persistence

Related topics: Core BM25 API: Scoring, Variants, and Retrieval, Backends, Integrations, and CLI/MCP

The bm25s library turns raw text into a searchable BM25 index through three cooperating stages: tokenization (text to token ids), indexing (token ids to a sparse scoring matrix), and persistence (saving and reloading that matrix along with vocabulary and corpus metadata). This page documents the supported input formats, the configurable knobs in each stage, and the on-disk layout produced by BM25.save / BM25.load.

Tokenization

Corpus Format

The BM25 constructor and bm25s.index accept multiple corpus shapes, which is a frequent source of confusion (see community discussion #158). According to the docstring of the corpus-detection helper in bm25s/__init__.py, a corpus can be:

A list of dictionaries (e.g. {"id": i, "text": doc}); string entries are converted into {"id": i, "text": doc} when saved.
An object exposing ids and vocab attributes (the Tokenized namedtuple).
A tuple (unique_token_ids, per_doc_token_ids).
A list of token id lists (ints) or a list of token string lists.

Source: bm25s/__init__.py

The high-level wrapper also loads plain files: bm25s.high_level.load supports .txt (one doc per line), .json, .jsonl, and .csv, with an optional document_column argument that selects which key/column becomes the document text. Source: bm25s/high_level/__init__.py

Tokenizer Configuration

The Tokenizer class in bm25s.tokenization accepts a splitter, a stemmer callable, a stopword list (or the string "english"), a lower flag, and a vocabulary dictionary. The high-level API instantiates an English stemmer by default and applies both stopword removal and lowercasing:

tokenizer_kwargs_default = dict(stemmer=stemmer, stopwords="english", lower=True)

Source: bm25s/high_level/__init__.py
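The roles of these options can be illustrated with a small stand-in class. SimpleTokenizer is hypothetical and is not the bm25s Tokenizer; the toy stemmer and stopword set are assumptions chosen for the example, and dropping unknown tokens when update_vocab is False is one plausible policy rather than a statement about the library.

```python
class SimpleTokenizer:
    """Illustrative tokenizer; not the bm25s Tokenizer class."""

    def __init__(self, splitter=str.split, stemmer=None, stopwords=(), lower=True):
        self.splitter = splitter
        self.stemmer = stemmer
        self.stopwords = set(stopwords)
        self.lower = lower
        self.vocab = {}  # token string -> integer id

    def tokenize(self, texts, update_vocab=True):
        all_ids = []
        for text in texts:
            if self.lower:
                text = text.lower()
            tokens = [t for t in self.splitter(text) if t not in self.stopwords]
            if self.stemmer is not None:
                tokens = [self.stemmer(t) for t in tokens]
            ids = []
            for token in tokens:
                if token not in self.vocab:
                    if not update_vocab:
                        continue  # drop unknown tokens instead of growing the vocab
                    self.vocab[token] = len(self.vocab)
                ids.append(self.vocab[token])
            all_ids.append(ids)
        return all_ids


# Toy stemmer (assumption): strip trailing "s" characters.
tok = SimpleTokenizer(stemmer=lambda t: t.rstrip("s"), stopwords={"the"})
corpus_ids = tok.tokenize(["The cats sleep", "the dog sleeps"])
query_ids = tok.tokenize(["sleepy cats"], update_vocab=False)
```

Tokenizing queries with update_vocab=False keeps the vocabulary, and therefore the token ids, fixed after indexing.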

For very large corpora, the README recommends using the tokenizer in generator mode so memory is bounded by yielding one document at a time, then feeding the resulting token-id tuples into BM25.retrieve. Source: README.md

Indexing

Index Construction

Once documents are tokenized, BM25.index(corpus_tokens) builds the BM25 statistics. Internally, bm25s/scoring.py walks the corpus once to count document frequencies per unique token, then assembles an eager sparse CSC matrix of precomputed term-frequency contributions. This design is what makes query-time scoring extremely fast — only an aggregation over query tokens is needed.

Source: bm25s/scoring.py
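The query-time aggregation can be sketched as a sum over the columns that correspond to the query tokens, followed by a top-k selection. The function retrieve_top_k and the small CSC dictionary are hypothetical and use plain lists rather than the library's arrays.

```python
import heapq


def retrieve_top_k(csc, query_token_ids, n_docs, k):
    """Sum precomputed per-term scores for a query, then take the top k."""
    doc_scores = [0.0] * n_docs
    for token_id in query_token_ids:
        start, end = csc["indptr"][token_id], csc["indptr"][token_id + 1]
        for slot in range(start, end):
            doc_scores[csc["indices"][slot]] += csc["data"][slot]
    # Highest-scoring documents first.
    top = heapq.nlargest(k, range(n_docs), key=doc_scores.__getitem__)
    return [(doc_id, doc_scores[doc_id]) for doc_id in top]


# Column 0 scores documents 0 and 2; column 1 scores documents 1 and 2.
csc = {
    "data": [0.4, 0.9, 0.2, 0.5],
    "indices": [0, 2, 1, 2],
    "indptr": [0, 2, 4],
}
results = retrieve_top_k(csc, query_token_ids=[0, 1], n_docs=3, k=2)
```

No BM25 arithmetic happens here: the per-term scores were fixed at index time, so the query only selects and adds them.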

Variants and Parameters

The BM25 constructor exposes k1, b, delta, method, idf_method, dtype, int_dtype, backend, csc_backend, and auto_compile. Supported scoring variants are robertson, atire, bm25l, bm25+, and lucene (the default). The delta parameter only affects bm25l and bm25+; the README notes Lucene's exact BM25 implementation is the default and recommends k1 between 1.2 and 2.0 with b=0.75.

Source: README.md and bm25s/__init__.py

Backends

Two retrieval backends are available: numpy (the default, dependency-light) and numba (JIT-compiled, ~2× faster on larger datasets per the v0.2.0 release notes). Setting backend="auto" selects numba when installed. The csc_backend parameter chooses between numpy and scipy for building the sparse matrix; the README notes that once numba is active, the CSC-backend choice is negligible. Source: README.md
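The "auto" behavior can be approximated with a standard-library availability check. The helper resolve_backend is hypothetical and only illustrates the selection rule described above; it does not reproduce the library's own logic.

```python
import importlib.util


def resolve_backend(backend="auto"):
    """Hypothetical helper: pick a backend name without importing numba."""
    if backend == "auto":
        has_numba = importlib.util.find_spec("numba") is not None
        return "numba" if has_numba else "numpy"
    if backend not in ("numpy", "numba"):
        raise ValueError(f"Unknown backend: {backend!r}")
    return backend
```

Checking with find_spec avoids paying the import cost of numba just to decide which path to take.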

Persistence

Saving an Index

BM25.save(save_dir) writes a directory containing, at minimum, params.index.json (used by the CLI to recognize a valid index, see bm25s/terminal/__init__.py) and the sparse matrix files. The companion Tokenizer.save_vocab(save_dir) writes the vocabulary mapping; Tokenizer.save_stopwords(save_dir) writes the stopword list. The end-to-end example demonstrates the full sequence:

retriever.save("bm25s_index_readme")
tokenizer.save_vocab(save_dir="bm25s_index_readme")

Source: examples/save_and_reload_end_to_end.py

For larger workloads such as Natural Questions, the same pattern is used and bm25s.utils.benchmark.get_max_memory_usage() reports peak memory. Source: examples/index_nq.py

Reloading

import bm25s
from bm25s.tokenization import Tokenizer

reloaded_retriever = bm25s.BM25.load("bm25s_index_readme", load_corpus=True)
reloaded_tokenizer = Tokenizer(splitter=lambda x: x.split())
reloaded_tokenizer.load_vocab("bm25s_index_readme")

The tokenizer must be reconstructed with the same splitter used at index time; otherwise the tokens it produces for a query will not match the entries in the saved vocabulary. Source: examples/save_and_reload_end_to_end.py

CLI and the User Directory

The CLI (entry point: bm25) exposes indexing, searching, and MCP launch. Without -o, indices default to <filename>_index next to the source file; the -u flag instead saves to ~/.bm25s/indices/<name>. The CLI validates an index by checking for the presence of params.index.json. Source: bm25s/terminal/__init__.py
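The output-location and validation rules can be sketched as two small functions. Both names are hypothetical, and two details are assumptions made for the example: that the index name is the source file's stem, and that an explicit output path takes precedence over the user-directory flag.

```python
from pathlib import Path

USER_INDEX_ROOT = Path("~/.bm25s/indices").expanduser()


def resolve_index_dir(source_file, output=None, user=False):
    """Hypothetical sketch of the output-location rules described above."""
    source = Path(source_file)
    if output is not None:
        # Assumption: an explicit output path wins over the user flag.
        return Path(output)
    if user:
        # Assumption: the index name is the source file's stem.
        return USER_INDEX_ROOT / source.stem
    # Default: a sibling directory next to the source file.
    return source.with_name(f"{source.stem}_index")


def is_valid_index(index_dir):
    # An index directory is recognized by the presence of params.index.json.
    return (Path(index_dir) / "params.index.json").is_file()
```

Using a marker file for validation lets a tool list candidate indices by scanning a directory without loading any matrix data.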

flowchart LR
  A[Raw text files] --> B[bm25 index / Tokenizer.tokenize]
  B --> C[Tokenized ids + vocab]
  C --> D[BM25.index → sparse CSC]
  D --> E[save: params + matrix + vocab]
  E --> F[BM25.load + Tokenizer.load_vocab]
  F --> G[retrieve / search]

Common Failure Modes

Corpus shape mismatch: passing a list of plain strings to BM25(corpus=...) works at retrieval time but is serialized as {"id": i, "text": doc} when saved; consumers expecting the original string will see a dict instead (community issue #158).
Splitter mismatch on reload: a reloaded tokenizer must reuse the exact splitter, or token ids will not round-trip.
Stale vocab on query tokenization: queries must be tokenized with tokenize(..., update_vocab=False) so that unknown query tokens are not silently added to the vocabulary, which would shift future ids. Source: examples/save_and_reload_end_to_end.py
Windows / MCP health checks: the resource module is Unix-only; importing bm25s on Windows previously emitted a logger.warning that broke MCP health probes. This was downgraded to logger.debug in release 0.3.9 (issues #178, #186; PR #187).

Backends, Integrations, and CLI/MCP

Related topics: Core BM25 API: Scoring, Variants, and Retrieval, Tokenization, Indexing, and Persistence

This page documents the backends, external integrations, command-line interface, and Model Context Protocol (MCP) server that surround the core bm25s retrieval library. While the BM25 class itself focuses on scoring, the layers covered here determine *where* scores are computed, *how* indices are persisted or shared, and *which* user-facing surfaces (CLI, MCP, Hugging Face Hub) are exposed. Together they turn a fast scoring kernel into a tool that can be driven from a terminal, served to an LLM agent, or distributed via the Hub.

1. Computation Backends

The BM25 constructor and the high-level wrapper accept a backend parameter that selects the implementation used for score computation. The high-level helper in bm25s/high_level/__init__.py applies the following defaults when building a retriever:

bm25_kwargs_default = dict(
    backend="numba", csc_backend="numpy", auto_compile=False
)

Source: bm25s/high_level/__init__.py:24-26

The backend="numba" selection enables the JIT-compiled scoring path that provides roughly 2x speedup on larger datasets, as noted in the project README. The csc_backend="numpy" parameter controls the CSC (compressed sparse column) matrix representation used to store eagerly precomputed term scores; switching it changes how sparse matrix operations are dispatched but does not change the BM25 formula. An auto_compile=False flag prevents silent recompilation between runs, which keeps startup behavior predictable when an index is loaded repeatedly (for example, inside the CLI or the MCP server). Source: README.md and bm25s/high_level/__init__.py:24-26

The high-level API also forces Stemmer.Stemmer("english") and English stopwords by default, and currently raises NotImplementedError for any other language. Source: bm25s/high_level/__init__.py:18-21

2. Hugging Face Hub Integration

bm25s ships a thin wrapper class BM25HF for loading and saving indices to the Hugging Face Hub. Source: bm25s/hf.py

The integration exposes two main entry points:

BM25HF.load_from_hub("{username}/{repo_name}") — loads a previously published index.
BM25HF.save_to_hub(...) — publishes a local index.

After loading, retrieval proceeds via the standard retriever.retrieve(bm25s.tokenize(query), k=3) interface, so the Hub integration is a thin I/O shim around the core BM25 class. Source: bm25s/hf.py

Users who want to ship a prebuilt retrieval index alongside a model card on the Hub can use this path to avoid re-indexing on the consumer side, and the examples/retrieve_from_hf.py example referenced in the README demonstrates loading an index alongside its corpus. Source: README.md

3. Command-Line Interface

The CLI is registered as a console-script entry point named bm25 in setup.py:

entry_points={
    "console_scripts": [
        "bm25=bm25s.cli:main",
    ],
},

Source: setup.py:7-9

The entry point resolves to bm25s.cli.main, which uses argparse to dispatch to subcommands. Source: bm25s/cli.py

The currently registered top-level subcommands are:

Subcommand	Purpose
`index`	Build an index from a CSV, TXT, JSON, or JSONL file
`search`	Query a previously built index
`mcp`	Launch the MCP server (see Section 4)

The index subcommand accepts a positional file, optional -o/--output, -c/--column (for CSV/JSON/JSONL), and -u/--user (which stores the index under ~/.bm25s/indices/). Source: bm25s/cli.py

The interactive bm25 search -u mode calls select_index_interactive() from bm25s/terminal/__init__.py, which lists any directory under ~/.bm25s/indices/ that contains a params.index.json file as a valid index. Source: bm25s/terminal/__init__.py:18-30

flowchart LR
    A[Corpus file or list] --> B[bm25s.index or BM25]
    B -- backend selection --> C[Numba / NumPy / SciPy]
    B --> D[Index on disk]
    D --> E[BM25.load or save]
    D --> F[bm25 CLI: index or search]
    D --> G[BM25HF Hub I/O]
    D --> H[bm25 mcp launch]
    H --> I[MCP clients / LLM agents]

4. MCP Server and Common Pitfalls

bm25s ships a built-in Model Context Protocol (MCP) server so an index can be exposed to LLM agents and other MCP-compatible clients. The server is launched through the CLI:

uv pip install "bm25s[mcp]"
bm25 mcp launch --port 8000 --index-dir /path/to/your/index

Source: README.md and bm25s/cli.py

The mcp launch subcommand accepts two flags: -p/--port (optional, default 8000) and -d/--index-dir (required). Source: bm25s/cli.py
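The documented flags can be mirrored with a small argparse sketch. This is a hypothetical parser written for illustration, not the code in bm25s/cli.py; it covers only the index and mcp launch subcommands, whose flags are listed above, and does not run any indexing or server logic.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented subcommands and flags."""
    parser = argparse.ArgumentParser(prog="bm25")
    sub = parser.add_subparsers(dest="command", required=True)

    index = sub.add_parser("index")
    index.add_argument("file")
    index.add_argument("-o", "--output")
    index.add_argument("-c", "--column")
    index.add_argument("-u", "--user", action="store_true")

    mcp = sub.add_parser("mcp")
    mcp_sub = mcp.add_subparsers(dest="mcp_command", required=True)
    launch = mcp_sub.add_parser("launch")
    launch.add_argument("-p", "--port", type=int, default=8000)
    launch.add_argument("-d", "--index-dir", required=True)
    return parser


# The port falls back to its default; the index directory must be given.
args = build_parser().parse_args(
    ["mcp", "launch", "--index-dir", "/path/to/your/index"]
)
```

Marking only --index-dir as required matches the description above: the port has a usable default, while the server cannot start without knowing which index to serve.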

Community-reported pitfalls:

Windows resource module warning. Prior to v0.3.9, importing bm25s on Windows emitted resource module not available on Windows to stdout (not just the log stream), which broke MCP health checks on Windows. The v0.3.9 release downgraded this message to debug level. Source: community issues #178 and #186, and the v0.3.9 changelog in README.md
Corpus format ambiguity. Issue #158 notes that the __init__.py docstring and the quickstart disagree on whether the corpus is a list of dicts or a list of strings; the high-level bm25s.index accepts both shapes and the underlying _get_corpus_type helper disambiguates based on the first element. Source: bm25s/__init__.py
Attribution. The scoring approach and API shape originate from bm25_pt (issue #80); when describing internals, this lineage is worth acknowledging.

Doramagic Pitfall Log

Found 11 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/xhluca/bm25s/issues/184

2. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/xhluca/bm25s/issues/178

3. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/xhluca/bm25s/issues/173

4. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/xhluca/bm25s/issues/186

5. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/xhluca/bm25s

6. Runtime risk: Runtime risk requires verification

Severity: medium
Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/xhluca/bm25s/issues/158

7. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/xhluca/bm25s

8. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/xhluca/bm25s

9. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/xhluca/bm25s

10. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/xhluca/bm25s

11. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/xhluca/bm25s

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using bm25s with real data or production workflows.

logger.warning on every import breaks MCP health checks on Windows - github / github_issue
Version mismatch in version.py: Recommend adaptive version strategy - github / github_issue
Format of corpus and API reference - github / github_issue
resource module not available on Windows printed to stdout - github / github_issue
possibly adding Korean stopwords kindly - github / github_issue
jax import guard should also check for RuntimeError - github / github_issue
Usage: When and Where to use this library? - github / github_issue
0.3.9 - github / github_release
0.3.8 - github / github_release
0.3.8.pre1 - github / github_release
0.3.7 - github / github_release
0.3.6 - github / github_release

Source: Project Pack community evidence and pitfall evidence
