fastembed Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

fastembed

Fast, Accurate, Lightweight Python library to make State of the Art Embedding

Overview & Dense Text Embeddings

Related topics: Sparse Embeddings (SPLADE, BM25, MiniCOIL, BM42), Late Interaction, Multimodal & Image Embeddings, Custom Models, GPU Support, Rerankers & Known Issues

What is FastEmbed?

FastEmbed is a lightweight, fast Python library for embedding generation, distributed as the package fastembed and developed primarily by Qdrant. Its design pillars are stated in the project README:

Light — minimal external dependencies, no PyTorch download, no GPU requirement; it runs on ONNX Runtime and is suitable for serverless runtimes such as AWS Lambda.
Fast — ONNX Runtime is faster than PyTorch for inference, and data parallelism is used for encoding large datasets.
Accurate — competitive with closed-source embedders, and the library continuously adds new open models.
Supported models — dense text, sparse text, late-interaction text/multimodal, image, and rerankers. Source: README.md.

The documentation site, configured in mkdocs.yml, is built with the Material theme and uses mkdocstrings to auto-render API references from Python docstrings. Contributions follow the workflow in CONTRIBUTING.md — bug reports should include the exact FastEmbed version (obtainable via python -c "import fastembed; print(fastembed.__version__)"), the OS, and a minimal reproducer.

High-Level Architecture

FastEmbed is organized as a small set of embedding families, each implemented as a class that derives from a per-family base class. All families share a common ModelManagement machinery that handles model discovery, download, and caching.

graph TD
    A[fastembed API] --> B[TextEmbedding / SparseTextEmbedding / LateInteractionTextEmbedding / ImageEmbedding / TextCrossEncoder]
    B --> C[ModelManagement base]
    C --> D[HuggingFace Hub]
    C --> E[GCS / custom URL tar.gz]
    C --> F[ONNX Runtime session]
    F --> G[Dense / Sparse / Multi-vector output]

The dense text path is anchored by TextEmbeddingBase in fastembed/text/text_embedding_base.py, which exposes the three public entry points: embed, passage_embed, and query_embed. The base class stores model_name, cache_dir, threads, and an internal _embedding_size, and defines the contract that all dense backends must fulfill.
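The contract described above can be pictured with a small sketch. The class names `DenseEmbedderSketch` and `ToyEmbedder` are hypothetical, and the delegation shown (`passage_embed` and `query_embed` falling back to `embed`) is an assumption about how a backend might fulfil the contract, not a copy of `TextEmbeddingBase`.

```python
from typing import Iterable, Iterator, List, Optional, Union


class DenseEmbedderSketch:
    """Hypothetical stand-in for a dense base class (not fastembed's code)."""

    def __init__(self, model_name: str, cache_dir: Optional[str] = None,
                 threads: Optional[int] = None) -> None:
        self.model_name = model_name
        self.cache_dir = cache_dir
        self.threads = threads
        self._embedding_size: Optional[int] = None

    def embed(self, documents: Union[str, Iterable[str]]) -> Iterator[List[float]]:
        # Concrete backends must provide the actual inference.
        raise NotImplementedError

    def passage_embed(self, texts: Iterable[str]) -> Iterator[List[float]]:
        # Assumed default: passages are embedded like plain documents.
        yield from self.embed(texts)

    def query_embed(self, query: Union[str, Iterable[str]]) -> Iterator[List[float]]:
        # Accept either a single string or an iterable of strings.
        yield from self.embed([query] if isinstance(query, str) else query)


class ToyEmbedder(DenseEmbedderSketch):
    """Toy backend: a 2-d 'embedding' of (character count, word count)."""

    def embed(self, documents):
        if isinstance(documents, str):
            documents = [documents]
        for text in documents:
            yield [float(len(text)), float(len(text.split()))]


toy = ToyEmbedder(model_name="toy/example")
query_vectors = list(toy.query_embed("hello world"))
passage_vectors = list(toy.passage_embed(["a b c", "de"]))
```

A backend that needs query-specific prefixes or markers would override `query_embed` instead of relying on the fallback.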

Dense Text Embedding Variants

Dense text embeddings are split into three concrete implementations. Each ships a supported_*_models registry and an OnnxTextModel worker that runs the actual ONNX inference.

OnnxTextEmbedding (Flag / BGE family)

OnnxTextEmbedding in fastembed/text/onnx_embedding.py is the default dense backend and powers the well-known BGE line (BAAI/bge-small-en-v1.5, BAAI/bge-base-en-v1.5, BAAI/bge-large-en-v1.5, BAAI/bge-small-zh-v1.5, snowflake/snowflake-arctic-embed-xs, mixedbread-ai/mxbai-embed-large-v1, etc.). The constructor signature is the canonical one for the library:

model_name (default "BAAI/bge-small-en-v1.5")
cache_dir, threads, providers
cuda / device_ids / device_id for GPU selection
lazy_load, specific_model_path, local_files_only

Source: fastembed/text/onnx_embedding.py.

Pooled and Pooled-Normalized Embeddings

PooledEmbedding in fastembed/text/pooled_embedding.py hosts Sentence-Transformers-style models that require mean pooling, including sentence-transformers/paraphrase-multilingual-mpnet-base-v2 and intfloat/multilingual-e5-large. Mean pooling and (optionally) L2 normalization are applied post-ONNX.

PooledNormalizedEmbedding in fastembed/text/pooled_normalized_embedding.py covers long-context Jina v2 models and GTE models where pooling is followed by L2 normalization so that inner product equals cosine similarity. Models include jinaai/jina-embeddings-v2-small-en/-base-en (8192-token context), the multilingual Jina variants (-base-de, -base-zh, -base-es, -base-code), and thenlper/gte-base / gte-large.
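To see why L2 normalization makes the inner product equal to cosine similarity, the following dependency-free sketch applies mask-aware mean pooling and then normalizes. The helper names (`mean_pool`, `l2_normalize`, `dot`) and the toy vectors are illustrative assumptions, not fastembed functions.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    kept = max(sum(attention_mask), 1)  # guard against an all-zero mask
    return [
        sum(vec[d] * m for vec, m in zip(token_vectors, attention_mask)) / kept
        for d in range(dim)
    ]


def l2_normalize(vector: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit length; eps avoids division by zero."""
    norm = max(math.sqrt(sum(x * x for x in vector)), eps)
    return [x / norm for x in vector]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Two toy "sentences" with three token slots each; the last slot is padding.
tokens_a = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
tokens_b = [[2.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
mask = [1, 1, 0]

pooled_a = mean_pool(tokens_a, mask)
pooled_b = mean_pool(tokens_b, mask)
unit_a, unit_b = l2_normalize(pooled_a), l2_normalize(pooled_b)

# Cosine similarity of the raw pooled vectors ...
cosine = dot(pooled_a, pooled_b) / (
    math.sqrt(dot(pooled_a, pooled_a)) * math.sqrt(dot(pooled_b, pooled_b))
)
# ... matches the plain inner product of the normalized vectors.
inner = dot(unit_a, unit_b)
```

Because the padding slot is masked out, its values do not affect the pooled result.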

Late-Interaction (Multi-Vector) Embeddings

Late-interaction text models (ColBERT, Jina-ColBERT) live under fastembed/late_interaction/; the multimodal ones (ColPali, ModernVBERT) live under fastembed/late_interaction_multimodal/. LateInteractionTextEmbeddingBase in fastembed/late_interaction/late_interaction_embedding_base.py is the shared contract: it provides query_embed and passage_embed and yields per-token vectors. For documents, the ColBERT implementation in fastembed/late_interaction/colbert.py post-processes ONNX output by zeroing the attention mask at skip-list and padding positions, then L2-normalizing each token vector; the v0.7.2 release made this step batch-correct (see v0.7.2 release notes). TokensEmbeddingWorker in fastembed/late_interaction/token_embeddings.py yields the per-token slice selected by the attention mask.

Selected Supported Dense Models

Model	Dim	Notes
`BAAI/bge-small-en-v1.5`	384	Default; `model_optimized.onnx` from `qdrant/bge-small-en-v1.5-onnx-q`
`BAAI/bge-base-en-v1.5`	768	Optimized artifact on GCS
`BAAI/bge-large-en-v1.5`	1024	Larger, unoptimized `model.onnx`
`snowflake/snowflake-arctic-embed-xs`	384	Apache-2.0, prefixes recommended
`mixedbread-ai/mxbai-embed-large-v1`	1024	Apache-2.0, prefixes required
`jinaai/jina-embeddings-v2-small-en`	512	8192-token context, L2-normalized
`jinaai/jina-embeddings-v2-base-en`	768	8192-token context, L2-normalized
`thenlper/gte-base` / `gte-large`	768 / 1024	English-only, no prefixes needed
`intfloat/multilingual-e5-large`	1024	Mean pooling; needs query prefix
`colbert-ir/colbertv2.0`	128	Late-interaction token vectors

Sources: fastembed/text/onnx_embedding.py, fastembed/text/pooled_embedding.py, fastembed/text/pooled_normalized_embedding.py, fastembed/late_interaction/colbert.py.

Usage and Community Notes

The canonical "hello world" from the README embeds a list of strings with TextEmbedding(model_name="BAAI/bge-small-en-v1.5") and iterates model.embed(documents). Custom models can be registered with TextEmbedding.add_custom_model(...) passing a ModelSource(hf=..., url=...), dim, and model_file — this is the same path used to load private tarballs or to re-quantize an already-supported model.

Community requests highlight a few realities: support for BAAI/bge-m3 (dense + sparse + ColBERT) is a long-standing ask (issue #107, issue #348), Google’s EmbeddingGemma has been requested (issue #559), and the intfloat/multilingual-e5-small model — which is not in the default registry — can still be added via the custom-model API (issue #123). Operating-system-specific issues, such as segmentation faults when loading Qdrant/bm25 on macOS Python 3.14 (issue #630) and SparseTextEmbedding crashes on Linux Python 3.14.2 (issue #618), trace back to upstream ONNX Runtime support and were partially addressed in v0.8.0 (release notes).

A common failure mode to be aware of: a model with pooling set incorrectly raises an exception starting in v0.7.1, instead of silently producing wrong vectors — another example of the library moving toward explicit, model-aware error reporting. Source: v0.7.1 release notes.

Sparse Embeddings (SPLADE, BM25, MiniCOIL, BM42)

Related topics: Overview & Dense Text Embeddings, Late Interaction, Multimodal & Image Embeddings

Overview and Purpose

FastEmbed's sparse embedding subsystem produces high-dimensional vectors with mostly zero entries, where each non-zero coordinate corresponds to a vocabulary token weighted by its importance to the input text. Unlike dense embeddings (fixed-length float vectors), sparse embeddings expose interpretable term-weight pairs that are well-suited for hybrid lexical–semantic retrieval with vector databases such as Qdrant. The module is exposed through the top-level entry point SparseTextEmbedding and returns values typed as SparseEmbedding, which carry parallel indices and values arrays.

Four families of sparse encoders are supported:

Family	Source file	Representative model	Approach
SPLADE	fastembed/sparse/splade_pp.py	`prithivida/Splade_PP_en_v1`	ONNX transformer with ReLU + log + max pooling
BM25	fastembed/sparse/bm25.py	`Qdrant/bm25`	Classical statistical scorer with stemming and stopword filtering
BM42	fastembed/sparse/bm42.py	`Qdrant/bm42-all-minilm-l6-v2-attentions`	Learned BM25-style extension backed by a MiniLM attention head
MiniCOIL	fastembed/sparse/minicoil.py	`Qdrant/minicoil-v1`	4-d token projection combined with BM25 IDF weighting

Architecture and Data Flow

The sparse subsystem uses a dispatch class — SparseTextEmbedding — that selects the appropriate backend implementation at construction time based on model_name. Each backend inherits from SparseTextEmbeddingBase and either wraps an ONNX model via OnnxTextModel or implements a tokenizer-driven algorithm (BM25). The flow below illustrates how an embed() call is routed from the public API to a backend-specific encoder.

flowchart LR
    A[User code:<br/>SparseTextEmbedding] --> B[__init__ selects backend<br/>by model_name]
    B --> C{Backend type}
    C -->|SPLADE / BM42| D[OnnxTextModel<br/>tokenize → ONNX session → post-process]
    C -->|MiniCOIL| E[OnnxTextModel + IDF<br/>computed on Qdrant side]
    C -->|BM25| F[Bm25 token pipeline<br/>stem + stopword filter]
    D --> G["Iterable[SparseEmbedding]"]
    E --> G
    F --> G

A typical SparseEmbedding carries indices (token IDs or vocabulary positions) and values (weights). The requires_idf=True flag in model descriptions signals that the downstream vector index must apply the IDF modifier — this is the case for both Qdrant/bm25 (fastembed/sparse/bm25.py) and Qdrant/minicoil-v1 (fastembed/sparse/minicoil.py).
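As a rough illustration of how the parallel indices and values arrays are consumed, the sketch below scores a query against a document with a sparse dot product and optionally applies an IDF weight per index. `SparseVec` is a hypothetical stand-in for the library's type, and the IDF formula is the standard BM25 one, assumed here; the index in use may apply a variant.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SparseVec:
    """Hypothetical stand-in mirroring the indices/values layout."""
    indices: List[int]
    values: List[float]


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """Standard BM25 IDF (an assumption; the index may use a variant)."""
    return math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0)


def sparse_dot(query: SparseVec, doc: SparseVec,
               idf: Optional[Dict[int, float]] = None) -> float:
    """Sum of products over shared indices, optionally scaled by IDF."""
    doc_weights = dict(zip(doc.indices, doc.values))
    score = 0.0
    for idx, q_val in zip(query.indices, query.values):
        if idx in doc_weights:
            weight = idf.get(idx, 1.0) if idf else 1.0
            score += weight * q_val * doc_weights[idx]
    return score


query = SparseVec(indices=[7, 42], values=[1.0, 1.0])
doc = SparseVec(indices=[42, 99], values=[0.5, 2.0])

plain_score = sparse_dot(query, doc)  # only index 42 overlaps
idf_table = {42: bm25_idf(total_docs=3, docs_with_term=1)}
weighted_score = sparse_dot(query, doc, idf_table)
```

For models flagged with requires_idf=True, the second form is the one that matters: the encoder emits term weights and the index supplies the corpus statistics.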

Backend Implementations

SPLADE (splade_pp.py)

The SpladePP class runs a symmetric ONNX encoder for both queries and documents. Its post-processing applies log(1 + ReLU(x)), multiplies by the attention mask, and takes the per-vocab max across tokens before yielding sparse vectors:

# Source: fastembed/sparse/splade_pp.py
relu_log = np.log(1 + np.maximum(output.model_output, 0))
weighted_log = relu_log * np.expand_dims(output.attention_mask, axis=-1)
scores = np.max(weighted_log, axis=1)

The supported model list (exposed via _list_supported_models) currently contains only prithivida/Splade_PP_en_v1, a symmetric variant. Community issue #648 highlights a gap: inference-free SPLADE (IF-SPLADE), where document vectors are precomputed offline and only the tokenizer is invoked at query time, is not yet supported.
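The three post-processing steps can be traced on a tiny example. This is a dependency-free restatement of the same arithmetic, assuming a `[tokens][vocab]` layout for a single text; `splade_pool` and the toy logits are illustrative and not part of fastembed.

```python
import math
from typing import Dict, List


def splade_pool(logits: List[List[float]], attention_mask: List[int]) -> Dict[int, float]:
    """Return {vocab_index: weight} after log(1 + ReLU(x)), masking, and max pooling."""
    vocab_size = len(logits[0])
    scores = [0.0] * vocab_size  # log(1 + ReLU(x)) is never negative
    for token_logits, mask in zip(logits, attention_mask):
        for v, x in enumerate(token_logits):
            activated = math.log(1.0 + max(x, 0.0)) * mask
            scores[v] = max(scores[v], activated)
    # Keep only the non-zero coordinates, as a sparse vector would.
    return {v: s for v, s in enumerate(scores) if s > 0.0}


logits = [
    [1.0, -2.0, 0.0, 0.0],  # token 1
    [0.0, -1.0, 3.0, 0.0],  # token 2
    [9.0, 9.0, 9.0, 9.0],   # padding position, masked out below
]
mask = [1, 1, 0]

sparse = splade_pool(logits, mask)
```

Negative logits are clipped by the ReLU and the padding row is removed by the mask, so only vocabulary positions 0 and 2 survive.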

BM25 (bm25.py)

Bm25 does not load an ONNX graph; its model description uses the placeholder model_file="mock.file" only to keep the loader interface uniform. Instead, it tokenizes input with optional language-specific stemmers and stopword lists. The requires_idf=True attribute indicates that the IDF component must be computed by the vector index. The docstring spells out the BM25 formula and the role of k, b, and avg_len. The BM25 model also ships per-language text files (f"{lang}.txt") that are loaded from Hugging Face.
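The roles of k, b, and avg_len are easiest to see in the term-frequency part of the standard BM25 formula. The sketch below is that textbook formula, not the library's code; the function name is hypothetical and the default values (k=1.2, b=0.75, avg_len=256) are assumptions that should be checked against the Bm25 docstring.

```python
def bm25_term_weight(term_count: int, doc_len: int, k: float = 1.2,
                     b: float = 0.75, avg_len: float = 256.0) -> float:
    """Term-frequency part of BM25; the IDF part is left to the index."""
    # b controls how strongly documents longer than avg_len are penalized.
    length_norm = 1.0 - b + b * doc_len / avg_len
    # k controls how quickly repeated occurrences saturate.
    return term_count * (k + 1.0) / (term_count + k * length_norm)


average_doc = bm25_term_weight(term_count=1, doc_len=256)
long_doc = bm25_term_weight(term_count=1, doc_len=512)
repeated_term = bm25_term_weight(term_count=3, doc_len=256)
```

A single occurrence in an average-length document gets weight 1, the same occurrence in a document twice as long gets less, and repeated occurrences increase the weight but never beyond k + 1.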

BM42 (bm42.py)

BM42 extends BM25 with a learned attention head sourced from Qdrant/bm42-all-minilm-l6-v2-attentions. It is constructed similarly to other ONNX backends, and its worker is initialised through _get_worker_class. BM42 sits between the purely statistical BM25 and the fully neural SPLADE family: it improves term weighting without committing to a full transformer vocabulary expansion.

MiniCOIL (minicoil.py)

MiniCOIL is the newest entry, introduced in v0.7.0 (release notes). Each vocabulary token is projected to a 4-d component and then re-weighted by corpus token frequency, combining BM25's exact-match behaviour with semantic awareness inherited from jinaai/jina-embeddings-v2-small-en-tokens. The implementation loads three additional files beyond the ONNX graph: STOPWORDS_FILE, MINICOIL_MODEL_FILE, and MINICOIL_VOCAB_FILE.

Usage Pattern

The public API is consistent across backends:

from fastembed import SparseTextEmbedding

documents = ["FastEmbed runs models on ONNX Runtime.", "Sparse vectors store term weights."]
model = SparseTextEmbedding(model_name="Qdrant/bm42-all-minilm-l6-v2-attentions")
embeddings = list(model.embed(documents))
# Each element is a SparseEmbedding(indices=[...], values=[...])

Construction accepts the standard arguments — cache_dir, threads, providers, cuda/device_ids, lazy_load, and specific_model_path — inherited from the ONNX base class (README.md). The parallel keyword enables data-parallel encoding for offline batch jobs. Custom models can be registered with SparseTextEmbedding.add_custom_model(...), mirroring the dense-text workflow.

Configuration Reference

Parameter	Purpose	Notes
`model_name`	Selects backend implementation	Must match an entry in the backend's `supported_*_models`
`cache_dir`	ONNX / tokenizer cache location	Falls back to `FASTEMBED_CACHE_PATH` env var
`cuda`	GPU execution	Per release v0.8.0, defaults to auto-detect
`providers`	Explicit ONNX provider list	Triggers a user warning when combined with `cuda=True`
`lazy_load`	Defer model materialisation	Useful in serverless environments
`parallel`	Worker count for offline jobs	`0` = all cores, `None` = no data parallelism
`requires_idf`	Hints that IDF must be applied by the index	Set on BM25 and MiniCOIL descriptions

Known Limitations and Community-Reported Issues

Inference-free SPLADE: Symmetric SPLADE is the only supported variant; IF-SPLADE with asymmetric doc-side encoding is tracked in #648.
Python 3.14 crashes: Initializing SparseTextEmbedding on Python 3.14.2 (Linux/macOS) can trigger segmentation faults — see #618 and #630. The v0.8.0 release (release notes) adjusted onnxruntime and pillow versions to improve 3.14 compatibility, and #576 tracks broader 3.14 support.
BGE-M3 hybrid outputs: Long-standing requests for dense + sparse + ColBERT in a single model (#107, #348) remain open; BGE-M3 is not yet wired into the sparse backend.
pillow CVE: Downstream pin conflicts affecting sparse-bearing deployments are tracked in #606.
Offline mode: As of v0.7.4, the HF_HUB_OFFLINE environment variable is respected, preventing network calls when models are already cached (v0.7.4 release notes).

Late Interaction, Multimodal & Image Embeddings

Related topics: Overview & Dense Text Embeddings, Sparse Embeddings (SPLADE, BM25, MiniCOIL, BM42), Custom Models, GPU Support, Rerankers & Known Issues

Overview

FastEmbed is a lightweight embedding library that runs models via ONNX Runtime, with no PyTorch dependency. Alongside standard dense text embeddings, FastEmbed ships three specialized embedding families: late-interaction text (ColBERT-style token-level scoring), late-interaction multimodal (image+text in a shared latent space), and image-only embeddings. All three are exposed at the top level of the package, as listed in fastembed/__init__.py:

from fastembed import TextEmbedding, SparseTextEmbedding, ImageEmbedding
from fastembed import LateInteractionTextEmbedding, LateInteractionMultimodalEmbedding

Late-interaction and multimodal models produce multi-vector outputs (one embedding per token or per image patch) rather than a single fixed-size vector. This enables late-interaction retrieval via MaxSim scoring, in which each query vector is matched against its most similar document vector and the per-query maxima are summed. Qdrant uses this technique to re-rank efficiently, as described in README.md under the "Why FastEmbed?" section.
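A minimal sketch of MaxSim scoring over multi-vector outputs follows. It uses plain Python lists and toy 2-d vectors; the function name `maxsim` and the data are illustrative assumptions, and a vector database would compute the same quantity with optimized kernels.

```python
from typing import List


def maxsim(query_vectors: List[List[float]], doc_vectors: List[List[float]]) -> float:
    """Sum, over query vectors, of the best dot product with any document vector."""
    total = 0.0
    for q in query_vectors:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_vectors)
    return total


# Two query tokens and two candidate documents with two tokens each.
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]
doc_b = [[0.0, -1.0], [0.6, 0.8]]

score_a = maxsim(query_vectors, doc_a)
score_b = maxsim(query_vectors, doc_b)
```

Here `doc_a` outranks `doc_b` because it contains an exact match for the first query token, while both documents share the same best match for the second.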

Late Interaction Text Embeddings

Architecture and Registry

The LateInteractionTextEmbedding class is a thin registry that dispatches to a concrete backend model. Source: fastembed/late_interaction/late_interaction_text_embedding.py shows it maintains an EMBEDDINGS_REGISTRY of backends and instantiates the right one based on the requested model_name.

from fastembed import LateInteractionTextEmbedding

model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
# query_embed applies query-side preprocessing and yields one (tokens x dim) array per query
query_embeddings = list(model.query_embed(["what is the capital of france?"]))

ColBERT-style Backends

The most prominent backend is Colbert, defined in fastembed/late_interaction/colbert.py. The class declares two marker token IDs (QUERY_MARKER_TOKEN_ID = 1, DOCUMENT_MARKER_TOKEN_ID = 2) and a MIN_QUERY_LENGTH of 31, reflecting the model's training convention of padding short queries with [MASK].

Post-processing differs by side. For queries the raw ONNX output is yielded as-is. For documents, the implementation zeroes out the attention mask at skip/pad positions, multiplies the model output by the mask, and divides each token vector by its L2 norm (computed along the last axis and clamped to avoid division by zero). Source: fastembed/late_interaction/colbert.py, _post_process_onnx_output.
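The document-side steps can be sketched without any dependencies. This version drops masked tokens directly instead of multiplying by the mask and filtering afterwards, which should give the same surviving vectors; the function name, token IDs, and skip list are hypothetical values chosen for illustration.

```python
import math
from typing import List, Set


def postprocess_document(token_vectors: List[List[float]], input_ids: List[int],
                         attention_mask: List[int], skip_ids: Set[int],
                         pad_id: int, eps: float = 1e-12) -> List[List[float]]:
    """Drop skip-list and padding tokens, then L2-normalize what remains."""
    kept = []
    for vec, token_id, mask in zip(token_vectors, input_ids, attention_mask):
        if token_id in skip_ids or token_id == pad_id:
            mask = 0  # same effect as zeroing the attention mask
        if mask == 0:
            continue
        norm = max(math.sqrt(sum(x * x for x in vec)), eps)  # clamp the norm
        kept.append([x / norm for x in vec])
    return kept


token_vectors = [[3.0, 4.0], [0.0, 2.0], [1.0, 1.0], [5.0, 5.0]]
input_ids = [101, 2054, 1012, 0]   # hypothetical IDs; 1012 plays the role of punctuation
attention_mask = [1, 1, 1, 0]

doc_embedding = postprocess_document(
    token_vectors, input_ids, attention_mask, skip_ids={1012}, pad_id=0
)
```

Only the first two tokens survive, each scaled to unit length, which is what MaxSim scoring expects on the document side.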

Jina ColBERT v2

The newer JinaColBERT backend in fastembed/late_interaction/jina_colbert.py uses bidirectional attention, which the implementation notes is empirically better for retrieval. A community request for additional ColBERT-family models such as LateOn (#641) reflects the community interest in extending this backend surface.

Late Interaction Multimodal Embeddings

Registry and Supported Models

LateInteractionMultimodalEmbedding is the multimodal counterpart. Source: fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py shows its EMBEDDINGS_REGISTRY contains two backends: ColPali and ColModernVBERT. The constructor walks this registry, finds the backend whose supported model list matches the requested name, and delegates initialization.
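The registry walk can be illustrated with a small sketch. All class names and model names below are hypothetical placeholders, and the case-insensitive match is an assumption; the real constructor also forwards cache, thread, and provider arguments that are omitted here.

```python
from typing import List, Type


class BackendSketch:
    """Hypothetical backend base: each subclass lists the models it supports."""
    supported_models: List[str] = []

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name


class ColPaliSketch(BackendSketch):
    supported_models = ["example/colpali-like"]


class ColModernVBERTSketch(BackendSketch):
    supported_models = ["example/modernvbert-like"]


class MultimodalDispatcherSketch:
    """Thin front class that delegates to the first backend supporting the model."""
    EMBEDDINGS_REGISTRY: List[Type[BackendSketch]] = [ColPaliSketch, ColModernVBERTSketch]

    def __init__(self, model_name: str) -> None:
        for backend_cls in self.EMBEDDINGS_REGISTRY:
            names = [name.lower() for name in backend_cls.supported_models]
            if model_name.lower() in names:
                self.model = backend_cls(model_name)
                return
        raise ValueError(f"Model {model_name} is not supported")


dispatcher = MultimodalDispatcherSketch("example/modernvbert-like")
selected_backend = type(dispatcher.model).__name__
```

The same pattern applies to the text and sparse front classes: adding a backend means appending a class to the registry rather than touching the dispatch logic.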

ColPali

ColPali produces ColBERT-compatible multi-vector embeddings aligned to an image's latent space, supporting document-image retrieval. It was introduced in v0.6.0 along with the new LateInteractionMultimodalEmbedding class (per the v0.6.0 changelog).

ColModernVBERT

ColModernVBERT was added in v0.8.0. Source: fastembed/late_interaction_multimodal/colmodernvbert.py describes it as "the late-interaction version of ModernVBERT, CPU friendly, English, 2025" with dim=128 and size_in_GB=1.0. It prepends a visual prompt prefix:

VISUAL_PROMPT_PREFIX = (
    "<|begin_of_text|>User:<image>Describe the image.<end_of_utterance>\nAssistant:"
)

flowchart LR
    A[Text/Image Input] --> B[Multimodal ONNX Encoder]
    B --> C[Per-token / Per-patch Vectors]
    C --> D[Qdrant MaxSim Scoring]

Image Embeddings

ImageEmbedding (defined in fastembed/image/onnx_embedding.py) is the dense, single-vector family for images. The supported backends visible in the source include:

Model	Dim	Size (GB)	Source
`Qdrant/resnet50-onnx`	2048	—	`Qdrant/resnet50-onnx`
`Qdrant/Unicom-ViT-B-16`	768	0.82	`Qdrant/Unicom-ViT-B-16`
`Qdrant/Unicom-ViT-B-32`	512	0.48	`Qdrant/Unicom-ViT-B-32`
`jinaai/jina-clip-v1`	768	0.34	`jinaai/jina-clip-v1`

The jina-clip-v1 entry is notable because it is multimodal at the model level (text+image) but is exposed via both TextEmbedding and ImageEmbedding paths using the appropriate ONNX subgraph (onnx/text_model.onnx or onnx/vision_model.onnx). Source: fastembed/image/onnx_embedding.py.

Configuration and Known Limitations

All three embedding families accept a common set of constructor kwargs: model_name, cache_dir, threads, providers, cuda, device_ids, lazy_load, and specific_model_path. The cuda parameter defaults to Device.AUTO, meaning FastEmbed will use CUDA automatically when available (changed in v0.8.0; previously required an explicit cuda=True).

Python 3.14 compatibility is currently broken at the ONNX Runtime level, causing segmentation faults during initialization of SparseTextEmbedding (e.g. Qdrant/bm25) and likely other models that load the runtime. This is tracked in issues #576, #618, and #630. v0.8.0 partially addresses this by adjusting onnxruntime and pillow pins, but full support is gated on upstream ONNX Runtime releases.

Offline / cached usage is supported via the HF_HUB_OFFLINE environment variable, added in v0.8.0 (#614). When a model is already in the local cache, no network calls are made (#577).

Custom Models, GPU Support, Rerankers & Known Issues

Related topics: Overview & Dense Text Embeddings, Sparse Embeddings (SPLADE, BM25, MiniCOIL, BM42), Late Interaction, Multimodal & Image Embeddings

Overview

FastEmbed is a lightweight Python library for generating embeddings using ONNX Runtime, designed to avoid heavy PyTorch dependencies while remaining fast and accurate. Beyond its large catalog of pre-supported models, FastEmbed provides extension points for users to plug in their own ONNX-exported models, run inference on GPUs, and use cross-encoder rerankers. This page documents these extension points and the known runtime issues reported by the community.

The high-level architecture for extending FastEmbed follows a consistent pattern: a model is described by a DenseModelDescription containing its source location, dimensions, model file path, and metadata; that description is registered with the embedding class, which exposes a uniform embed() interface regardless of modality. Source: fastembed/text/text_embedding.py:55-110.

flowchart LR
    A[User Code] --> B[TextEmbedding / SparseTextEmbedding / LateInteractionTextEmbedding]
    B --> C{Model registered?}
    C -- yes, built-in --> D[Load from HF Hub or URL]
    C -- yes, custom --> E[Load from local path or tar.gz]
    D --> F[ONNX Runtime Session]
    E --> F
    F --> G[CPU / CUDA / TensorRT Provider]
    G --> H[Post-processing]
    H --> I[NumpyArray or SparseEmbedding]

Custom Models

FastEmbed supports registering custom ONNX text models at runtime through TextEmbedding.add_custom_model(). The method validates that the model name is not already registered, then constructs a DenseModelDescription with required fields including sources, dim, pooling, and normalization, and registers it via CustomTextEmbedding.add_model(). Source: fastembed/text/text_embedding.py:65-110.
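The validate-then-register flow can be sketched as follows. `DescriptionSketch` and `CustomRegistrySketch` are hypothetical stand-ins that keep only a few of the fields named above; they are meant to show the duplicate-name check, not to reproduce `DenseModelDescription` or `CustomTextEmbedding`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class DescriptionSketch:
    """Hypothetical, reduced model description."""
    model: str
    dim: int
    pooling: str
    normalization: bool
    source: str


class CustomRegistrySketch:
    def __init__(self, built_in: Iterable[DescriptionSketch]) -> None:
        self._models: Dict[str, DescriptionSketch] = {d.model.lower(): d for d in built_in}

    def add_custom_model(self, model: str, dim: int, pooling: str,
                         normalization: bool, source: str) -> DescriptionSketch:
        key = model.lower()
        # Refuse names that are already registered, built-in or custom.
        if key in self._models:
            raise ValueError(f"Model {model} is already registered")
        description = DescriptionSketch(model, dim, pooling, normalization, source)
        self._models[key] = description
        return description


registry = CustomRegistrySketch(
    [DescriptionSketch("example/built-in", 384, "cls", True, "hf:example/built-in")]
)
added = registry.add_custom_model(
    "my-org/my-onnx-model", dim=768, pooling="mean", normalization=True,
    source="hf:my-org/my-onnx-model",
)
```

Rejecting duplicates up front keeps a custom registration from silently shadowing a built-in model with different pooling or normalization settings.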

For multimodal late-interaction scenarios, the same DenseModelDescription schema is used to describe vision-language models. Source: fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py:75-90.

Custom Model Sources

The ModelSource object can point to either a Hugging Face Hub repository (hf=) or a downloadable tarball (url=). As of v0.6.1 the older archive layout was deprecated in favor of model_name.tar.gz to simplify adding custom models. Source: v0.6.1 release notes.

from fastembed import TextEmbedding
from fastembed.common.model_description import ModelSource, PoolingType

TextEmbedding.add_custom_model(
    model="my-org/my-onnx-model",
    pooling=PoolingType.MEAN,
    normalization=True,
    sources=ModelSource(hf="my-org/my-onnx-model"),
    dim=768,
    model_file="onnx/model.onnx",
    description="Custom fine-tuned embedding model",
)

Source: fastembed/text/text_embedding.py:65-95.

GPU and Provider Configuration

FastEmbed supports hardware acceleration through ONNX Runtime execution providers. The __init__ signature accepts providers: Sequence[OnnxProvider] for explicit provider selection and cuda: bool | Device = Device.AUTO for CUDA auto-detection. Source: fastembed/text/onnx_embedding.py:100-130.

Starting with v0.8.0, FastEmbed automatically uses CUDA when an available GPU is detected, rather than requiring users to explicitly pass cuda=True. Source: v0.8.0 release notes.

ONNX Session Options

The v0.7.4 release exposed the enable_cpu_mem_arena ONNX session option to control onnxruntime memory allocation, which is useful for managing memory in constrained environments. Source: v0.7.4 release notes.

When both providers and cuda are specified simultaneously, FastEmbed emits a warning to prevent ambiguous configurations. Source: v0.5.0 release notes.
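One way to picture the interaction between the two arguments is the sketch below. The function `resolve_providers`, its precedence rule (an explicit `providers` list wins), and the fallback to CPU are assumptions made for illustration; only the provider names `CUDAExecutionProvider` and `CPUExecutionProvider` come from ONNX Runtime.

```python
import warnings
from typing import List, Optional, Sequence


def resolve_providers(cuda: bool, providers: Optional[Sequence[str]],
                      available: Sequence[str]) -> List[str]:
    """Pick execution providers; warn when the request is ambiguous."""
    if providers is not None:
        if cuda:
            # Assumed behaviour: warn, then honour the explicit list.
            warnings.warn(
                "`cuda` and `providers` were both given; using `providers`.",
                UserWarning,
            )
        return list(providers)
    if cuda and "CUDAExecutionProvider" in available:
        return ["CUDAExecutionProvider"]
    return ["CPUExecutionProvider"]


installed = ["CUDAExecutionProvider", "CPUExecutionProvider"]
gpu_choice = resolve_providers(cuda=True, providers=None, available=installed)
cpu_choice = resolve_providers(cuda=False, providers=None, available=installed)
no_gpu_choice = resolve_providers(cuda=True, providers=None,
                                  available=["CPUExecutionProvider"])
```

Passing only one of the two arguments avoids the warning and makes the intended device explicit.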

Multi-GPU Usage

For systems with multiple GPUs, the device_ids parameter can be passed to distribute inference across devices. Source: fastembed/text/onnx_embedding.py:110-115.

Rerankers

Cross-encoder rerankers are supported as a separate model class (TextCrossEncoder), with the ability to add custom rerankers introduced in v0.6.1. The rerank module mirrors the embedding module's structure: a base class manages model lifecycle and workers, while concrete ONNX implementations handle inference. The ColBERT late-interaction encoder, by comparison, passes token-level embeddings through post-processing that masks padding tokens and normalizes vectors. Source: fastembed/late_interaction/colbert.py:60-90.

Late-interaction rerankers like ColBERT yield per-token embeddings rather than pooled vectors, enabling MaxSim scoring at query time. The base class exposes passage_embed() and query_embed() as model-specific entry points. Source: fastembed/late_interaction/late_interaction_embedding_base.py:40-75.

Known Issues and Limitations

Python 3.14 Segmentation Faults

Users on Python 3.14.2 have reported segmentation faults (exit code 139) when initializing SparseTextEmbedding, particularly with Qdrant/bm25 on macOS. Source: issue #618, issue #630. The root cause is upstream: onnxruntime did not support Python 3.14 until a later release. Source: issue #576.

The v0.8.0 release addressed this by pinning compatible onnxruntime and pillow versions to better support Python 3.14. Source: v0.8.0 release notes.

Pillow CVE Constraint

The v0.7.4 release pinned pillow<12.0 to avoid security issues, but this constraint blocked downstream consumers needing Pillow 12.x for CVE-2026-25990 fixes. The constraint was relaxed on main to >=10.3.0,<13.0.0. Source: issue #606, v0.8.0 release notes.

Unsupported Model Variants

Inference-free SPLADE (IF-SPLADE) is not yet supported. The current SPLADE model (prithivida/Splade_PP_en_v1) is symmetric, running the ONNX encoder on both documents and queries. Asymmetric doc-side encoding with tokenizer-only queries is not implemented. Source: issue #648.

The MiniCOIL sparse embedding model requires IDF weighting; without it, the model degrades to BM25-like behavior. Source: fastembed/sparse/minicoil.py:30-60.

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 10 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

Severity: high
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/qdrant/fastembed/issues/618

2. Installation risk: Installation risk requires verification

Severity: high
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/qdrant/fastembed/issues/630

3. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/qdrant/fastembed/issues/641

4. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/qdrant/fastembed

5. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/qdrant/fastembed

6. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/qdrant/fastembed

7. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/qdrant/fastembed

8. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/qdrant/fastembed/issues/648

9. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/qdrant/fastembed

10. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/qdrant/fastembed

Source: Doramagic discovery, validation, and Project Pack records
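
The recommended check repeated in the entries above (reproduce the install and quickstart path in an isolated environment) can be sketched roughly as follows. This is a minimal sketch, not an official Doramagic or FastEmbed procedure: the helper names `env_python`, `build_check_plan`, `run_check`, and the `QUICKSTART` string are hypothetical, and it assumes the package installs with `pip install fastembed` and that `TextEmbedding().embed(...)` yields one vector per input document. The quickstart step needs network access, because the default model is downloaded on first use.

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def env_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


# One-liner run inside the fresh environment: prints the Python version,
# the installed FastEmbed version, and the length of one embedding vector.
QUICKSTART = (
    "import sys, fastembed; "
    "from fastembed import TextEmbedding; "
    "vec = next(iter(TextEmbedding().embed(['hello world']))); "
    "print(sys.version_info[:2], fastembed.__version__, len(vec))"
)


def build_check_plan(env_dir: Path) -> list[list[str]]:
    """Build the commands for the check without running anything."""
    py = str(env_python(env_dir))
    return [
        [py, "-m", "pip", "install", "fastembed"],  # install step
        [py, "-c", QUICKSTART],                     # quickstart step
    ]


def run_check() -> None:
    """Create a throwaway virtual environment and run the plan in it."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "fastembed-check"
        venv.create(env_dir, with_pip=True)
        for cmd in build_check_plan(env_dir):
            # check=True stops at the first failing step
            subprocess.run(cmd, check=True)
```

Calling `run_check()` performs the install and quickstart in a temporary environment that is deleted afterwards. Printing the interpreter version alongside `fastembed.__version__` also gives the details a bug report would need if a step fails.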

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12

Count of project-level external discussion links exposed on this manual page.

Use: Review before install

Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using fastembed with real data or production workflows.

Support inference-free SPLADE models (asymmetric doc-side encoding, toke - github / github_issue
[[Model]: LateOn](https://github.com/qdrant/fastembed/issues/641) - github / github_issue
[[Bug]: Segmentation Fault or AssertionError during initialization on Pyt](https://github.com/qdrant/fastembed/issues/618) - github / github_issue
[[Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14](https://github.com/qdrant/fastembed/issues/630) - github / github_issue
[[Feature]: Add python3.14 support](https://github.com/qdrant/fastembed/issues/576) - github / github_issue
v0.8.0 - github / github_release
v0.7.4 - github / github_release
v0.7.2 - github / github_release
v0.7.1 - github / github_release
v0.7.0 - github / github_release
v0.6.1 - github / github_release
Capability evidence risk requires verification - GitHub / issue

Source: Project Pack community evidence and pitfall evidence
