Doramagic Project Pack · Human Manual
fastembed
Introduction to FastEmbed
Related topics: Installation Guide, System Architecture, Quick Start Guide
FastEmbed is a lightweight, high-performance Python library for generating text and image embeddings with ONNX-based models. It provides a unified API for dense, sparse, and late-interaction embeddings, plus cross-encoder re-ranking models.
Overview
FastEmbed serves as an embedding generation engine optimized for production use cases, particularly in vector search applications. The library emphasizes:
- Performance: Leverages ONNX Runtime for efficient CPU inference
- Lightweight: Minimal dependencies and small model sizes
- Flexibility: Supports dense, sparse, and multimodal embeddings
- Ease of use: Simple Python API for common embedding workflows
Sources: README.md
Architecture
FastEmbed follows a modular architecture with separate components for different embedding types and processing stages.
graph TD
A[FastEmbed API] --> B[Text Embedding]
A --> C[Image Embedding]
A --> D[Sparse Embedding]
A --> E[Cross Encoder]
B --> B1[OnnxEmbedding]
B --> B2[PooledEmbedding]
B --> B3[PooledNormalizedEmbedding]
C --> C1[OnnxImageModel]
D --> D1[BM25]
D --> D2[SPLADE++]
D --> D3[MiniCOIL]
E --> E1[OnnxCrossEncoderModel]
B1 & B2 & B3 --> F[ONNX Runtime]
C1 --> F
E1 --> F
D1 & D2 & D3 --> G[Tokenization Engine]
Core Components
| Component | Purpose | File Location |
|---|---|---|
| TextEmbedding | Dense text embeddings | fastembed/text/ |
| ImageEmbedding | Image embeddings | fastembed/image/ |
| SparseTextEmbedding | Sparse embeddings (BM25, SPLADE) | fastembed/sparse/ |
| TextCrossEncoder | Re-ranking models | fastembed/rerank/ |
| LateInteractionTextEmbedding | Late interaction embeddings | fastembed/late_interaction/ |
Sources: fastembed/text/onnx_embedding.py
Supported Models
FastEmbed supports an extensive collection of pre-converted ONNX models across multiple categories.
Text Embedding Models
Text embeddings are categorized by their pooling strategy and normalization approach.
#### Dense Models (Unimodal)
| Model | Dimension | Languages | Max Tokens | License |
|---|---|---|---|---|
| BAAI/bge-base-en-v1.5 | 768 | English | 512 | MIT |
| BAAI/bge-large-en-v1.5 | 1024 | English | 512 | MIT |
| BAAI/bge-small-en-v1.5 | 384 | English | 512 | MIT |
| thenlper/gte-base | 768 | English | 512 | MIT |
| thenlper/gte-large | 1024 | English | 512 | MIT |
| snowflake/snowflake-arctic-embed-m | 768 | English | 512 | Apache-2.0 |
| snowflake/snowflake-arctic-embed-l | 1024 | English | 512 | Apache-2.0 |
Sources: fastembed/text/onnx_embedding.py:1-50
#### Multilingual Models
| Model | Dimension | Languages | Max Tokens |
|---|---|---|---|
| intfloat/multilingual-e5-small | 384 | ~100 | 512 |
| intfloat/multilingual-e5-large | 1024 | ~100 | 512 |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 768 | ~50 | 384 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 | ~50 | 512 |
Sources: fastembed/text/pooled_embedding.py
#### Jina AI Models
| Model | Dimension | Languages | Max Tokens |
|---|---|---|---|
| jinaai/jina-embeddings-v2-base-en | 768 | English | 8192 |
| jinaai/jina-embeddings-v2-small-en | 512 | English | 8192 |
| jinaai/jina-embeddings-v2-base-zh | 768 | Chinese/English | 8192 |
| jinaai/jina-embeddings-v2-base-de | 768 | German/English | 8192 |
| jinaai/jina-embeddings-v2-base-code | 768 | 30 languages | 8192 |
Sources: fastembed/text/pooled_normalized_embedding.py
Sparse Embedding Models
Sparse embeddings provide interpretable vectors with non-zero values only at specific token positions.
| Model | Type | Language | Description |
|---|---|---|---|
| prithivida/Splade_PP_en_v1 | SPLADE++ | English | Sparse lexical + semantic |
| Qdrant/minicoil-v1 | MiniCOIL | English | Semantic + keyword match |
| Qdrant/bm25 | BM25 | Multilingual | Traditional BM25 ranking |
Sources: fastembed/sparse/bm25.py
The MiniCOIL model combines semantic understanding with exact keyword matching:
class MiniCOIL(SparseTextEmbeddingBase, OnnxTextModel[SparseEmbedding]):
"""
MiniCOIL is a sparse embedding model, that resolves semantic meaning of the words,
while keeping exact keyword match behavior.
Each vocabulary token is converted into 4d component of a sparse vector,
which is then weighted by the token frequency in the corpus.
If the token is not found in the corpus, it is treated exactly like in BM25.
"""
Sources: fastembed/sparse/minicoil.py
Image Embedding Models
| Model | Dimension | Type | License |
|---|---|---|---|
| Qdrant/resnet50-onnx | 2048 | Image | Apache-2.0 |
| Qdrant/Unicom-ViT-B-16 | 768 | Multimodal | Apache-2.0 |
| Qdrant/Unicom-ViT-B-32 | 512 | Multimodal | Apache-2.0 |
| jinaai/jina-clip-v1 | 768 | Multimodal | Apache-2.0 |
Sources: fastembed/image/onnx_embedding.py
Cross Encoder Reranking Models
| Model | Description | License | Size |
|---|---|---|---|
| Xenova/ms-marco-MiniLM-L-12-v2 | MS MARCO passage ranking | Apache-2.0 | 0.12 GB |
| BAAI/bge-reranker-base | BGE reranker base | MIT | 1.04 GB |
| jinaai/jina-reranker-v1-tiny-en | Fast reranking, 8K context | Apache-2.0 | 0.13 GB |
| jinaai/jina-reranker-v1-turbo-en | Fast reranking, 8K context | Apache-2.0 | 0.15 GB |
| jinaai/jina-reranker-v2-base-multilingual | Multilingual, 1K context | CC-BY-NC-4.0 | 1.11 GB |
Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py
Usage Patterns
Text Embedding Generation
from fastembed import TextEmbedding
# Initialize with default model
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Generate embeddings
documents = [
"passage: The capital of France is Paris",
"query: What is the capital of France?"
]
embeddings = list(model.embed(documents))
# Returns list of numpy arrays
Sources: README.md
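Once embeddings are generated, a common next step is to rank passages against a query by cosine similarity. The following is a minimal sketch of that step, not part of FastEmbed: cosine_similarity is a hypothetical helper, and the 3-dimensional vectors are made-up stand-ins for real model output (BAAI/bge-small-en-v1.5 produces 384-dimensional vectors).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for model.embed(...) output (illustrative values only).
query_vec = [1.0, 0.0, 0.0]
passage_vecs = {
    "paris": [0.9, 0.1, 0.0],
    "berlin": [0.0, 1.0, 0.0],
}

# Rank passages from most to least similar to the query.
ranked = sorted(
    passage_vecs,
    key=lambda name: cosine_similarity(query_vec, passage_vecs[name]),
    reverse=True,
)
```

With real embeddings the same ranking logic applies; only the vector source changes.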
Sparse Embedding with SPLADE++
from fastembed import SparseTextEmbedding
model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
embeddings = list(model.embed(documents))
# Returns:
# [
# SparseEmbedding(indices=[ 17, 123, 919, ... ], values=[0.71, 0.22, 0.39, ...]),
# SparseEmbedding(indices=[ 38, 12, 91, ... ], values=[0.11, 0.22, 0.39, ...])
# ]
Sources: README.md
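To see how two sparse vectors of this indices/values shape can be compared, here is a small sketch of a sparse dot product. It is an illustration under stated assumptions, not FastEmbed code: sparse_dot is a hypothetical helper, and the index and weight values are invented.

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    weights_a = dict(zip(indices_a, values_a))
    # Only indices present in both vectors contribute to the score.
    return sum(weights_a.get(i, 0.0) * v for i, v in zip(indices_b, values_b))


# Illustrative vectors; index 123 is the only one they share.
score = sparse_dot([17, 123, 919], [0.5, 0.25, 0.75], [38, 123], [0.5, 2.0])
```

Because only overlapping indices matter, the cost scales with the number of non-zero entries rather than with the vocabulary size.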
Custom Model Configuration
from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource
# Register the custom configuration, then load the model by name
TextEmbedding.add_custom_model(
    model="intfloat/multilingual-e5-small",
    pooling=PoolingType.MEAN,
    normalization=True,
    sources=ModelSource(hf="intfloat/multilingual-e5-small"),
    dim=384,
    model_file="onnx/model.onnx",
)
model = TextEmbedding(model_name="intfloat/multilingual-e5-small")
embeddings = list(model.embed(documents))
Sources: README.md
Cross Encoder Reranking
from fastembed.rerank.cross_encoder import TextCrossEncoder
model = TextCrossEncoder(model_name="BAAI/bge-reranker-base")
# Score query-document pairs
query = "What is the capital of France?"
documents = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
scores = list(model.rerank(query, documents))
Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py
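The rerank call above produces one score per document, so turning scores into a ranking is left to the caller. A minimal sketch of that step follows, assuming scores arrive in the same order as the input documents; rank_by_score is a hypothetical helper and the score values are invented for illustration.

```python
def rank_by_score(documents, scores):
    """Pair each document with its score and sort highest-scoring first."""
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


documents = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
# Made-up values standing in for list(model.rerank(query, documents)).
scores = [4.2, -7.5]

ranked = rank_by_score(documents, scores)
best_document = ranked[0][0]
```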
Post-Processing: MUVERA
FastEmbed includes MUVERA (Multi-Vector Retrieval Algorithm) for converting late-interaction embeddings (such as those from ColBERT) to fixed-dimensional encodings.
import numpy as np
from fastembed import LateInteractionTextEmbedding
from fastembed.postprocess import Muvera
model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
muvera = Muvera.from_multivector_model(
model=model,
k_sim=6,
dim_proj=32
)
# Convert late-interaction embeddings to fixed dimension
embeddings = np.array(list(model.embed(["sample text"])))
fde = muvera.process_document(embeddings[0])
Sources: fastembed/postprocess/muvera.py
Model Caching
FastEmbed automatically caches downloaded models to avoid repeated downloads.
| Setting | Environment Variable | Default Location |
|---|---|---|
| Cache Path | FASTEMBED_CACHE_PATH | System temp directory |
Models are stored in ONNX format after download and converted to optimized formats on first use.
Prefix Requirements
Some models require specific text prefixes for query and document inputs:
| Prefix Requirement | Models | Example |
|---|---|---|
| Necessary | E5, Nomic, Snowflake Arctic | Query: query: ..., Document: passage: ... |
| Not necessary | Jina, BGE (small/base/large), GTE | Plain text input |
Sources: fastembed/text/onnx_embedding.py
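For models that expect prefixes, the prefix is simply prepended to each input string before embedding. A minimal sketch, assuming the query: and passage: prefixes shown in the table; add_prefix is a hypothetical helper, not a FastEmbed function.

```python
def add_prefix(texts, kind):
    """Prepend 'query: ' or 'passage: ' to each text."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text}" for text in texts]


queries = add_prefix(["What is the capital of France?"], "query")
passages = add_prefix(["The capital of France is Paris"], "passage")
```

The resulting lists can be passed to the embed method in place of the raw strings.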
Quick Reference
Model Selection Guide
| Use Case | Recommended Model |
|---|---|
| English semantic search | BAAI/bge-small-en-v1.5 |
| High-quality English | BAAI/bge-large-en-v1.5 |
| Multilingual (100+ languages) | intfloat/multilingual-e5-large |
| Long documents (8K tokens) | jinaai/jina-embeddings-v2-base-en |
| Fast re-ranking | jinaai/jina-reranker-v1-tiny-en |
| Sparse lexical + semantic | prithivida/Splade_PP_en_v1 |
API Quick Reference
| Class | Import | Primary Method |
|---|---|---|
| Dense Text | from fastembed import TextEmbedding | .embed(documents) |
| Sparse Text | from fastembed import SparseTextEmbedding | .embed(documents) |
| Image | from fastembed import ImageEmbedding | .embed(images) |
| Cross Encoder | from fastembed.rerank.cross_encoder import TextCrossEncoder | .rerank(query, documents) |
| Late Interaction | from fastembed import LateInteractionTextEmbedding | .embed(documents) |
Further Documentation
Sources: README.md
Installation Guide
Related topics: Introduction to FastEmbed, GPU Support and Acceleration
FastEmbed is a lightweight, fast, and accurate embedding library developed by Qdrant. This guide covers all aspects of installing and configuring FastEmbed for various use cases including CPU inference, GPU acceleration, and integration with Qdrant vector database.
Prerequisites
System Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| Python Version | 3.9+ | 3.10+ |
| RAM | 4 GB | 8 GB+ |
| Disk Space | 2 GB | 5 GB+ |
| GPU (Optional) | CUDA 11.8+ | CUDA 12.x |
Verify your Python version before installation:
python --version
Sources: README.md:1-100
Package Variants
FastEmbed is distributed in two package variants:
| Package | Description | Use Case |
|---|---|---|
| fastembed | CPU-only version | General purpose embedding generation |
| fastembed-gpu | GPU-accelerated version | High-throughput production workloads |
CPU Installation
Install the standard CPU version using pip:
pip install fastembed
Sources: README.md, RELEASE.md:1-15
GPU Installation
For CUDA-enabled GPU acceleration:
pip install fastembed-gpu
The GPU package automatically includes the CUDA Execution Provider for ONNX Runtime, enabling significantly faster inference on NVIDIA GPUs.
Sources: RELEASE.md:1-15, README.md
Installation Verification
Verify successful installation:
from fastembed import TextEmbedding
# List available models
print(TextEmbedding.list_supported_models())
# Initialize a model
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
print("FastEmbed installed successfully!")
Supported Embedding Models
FastEmbed supports multiple embedding modalities organized into the following categories:
Dense Text Embeddings
Dense text embeddings provide fixed-dimensional vector representations for text. Supported models include:
| Model | Dimension | Languages | Token Limit | License |
|---|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | English | 512 | MIT |
| BAAI/bge-base-en-v1.5 | 768 | English | 512 | MIT |
| BAAI/bge-large-en-v1.5 | 1024 | English | 512 | MIT |
| jinaai/jina-embeddings-v2-base-en | 768 | English | 8192 | Apache 2.0 |
| jinaai/jina-embeddings-v2-small-en | 512 | English | 8192 | Apache 2.0 |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | English | 256 | Apache 2.0 |
| mixedbread-ai/mxbai-embed-large-v1 | 1024 | English | 512 | Apache 2.0 |
Sources: fastembed/text/onnx_embedding.py
Multilingual Models
| Model | Dimension | Languages | Token Limit | License |
|---|---|---|---|---|
| intfloat/multilingual-e5-small | 384 | ~100 | 512 | MIT |
| intfloat/multilingual-e5-large | 1024 | ~100 | 512 | MIT |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 768 | ~50 | 384 | Apache 2.0 |
| jinaai/jina-embeddings-v2-base-de | 768 | German, English | 8192 | Apache 2.0 |
| jinaai/jina-embeddings-v2-base-zh | 768 | Chinese, English | 8192 | Apache 2.0 |
| jinaai/jina-embeddings-v2-base-es | 768 | Spanish, English | 8192 | Apache 2.0 |
Sources: fastembed/text/pooled_normalized_embedding.py, fastembed/text/pooled_embedding.py
Sparse Embeddings
Sparse embeddings represent text using high-dimensional sparse vectors, useful for keyword-based retrieval.
| Model | Type | License |
|---|---|---|
| prithivida/Splade_PP_en_v1 | SPLADE++ | Apache 2.0 |
| Qdrant/bm25 | BM25 | Apache 2.0 |
Sources: fastembed/sparse/bm25.py, README.md
Image Embeddings
| Model | Dimension | Type | License |
|---|---|---|---|
| Qdrant/resnet50-onnx | 2048 | Image | Apache 2.0 |
| Qdrant/Unicom-ViT-B-16 | 768 | Multimodal | Apache 2.0 |
| Qdrant/Unicom-ViT-B-32 | 512 | Multimodal | Apache 2.0 |
| jinaai/jina-clip-v1 | 768 | Multimodal (text&image) | Apache 2.0 |
Sources: fastembed/image/onnx_embedding.py
Reranking Models
Cross-encoder models for re-ranking search results:
| Model | Context Length | License |
|---|---|---|
| Xenova/ms-marco-MiniLM-L-12-v2 | - | Apache 2.0 |
| BAAI/bge-reranker-base | - | MIT |
| jinaai/jina-reranker-v1-tiny-en | 8K | Apache 2.0 |
| jinaai/jina-reranker-v1-turbo-en | 8K | Apache 2.0 |
| jinaai/jina-reranker-v2-base-multilingual | 1K | CC-BY-NC-4.0 |
Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py
GPU Configuration
CUDA Provider Setup
Enable GPU acceleration by specifying the CUDA execution provider:
from fastembed import TextEmbedding
model = TextEmbedding(
model_name="BAAI/bge-small-en-v1.5",
providers=["CUDAExecutionProvider"]
)
print("The model BAAI/bge-small-en-v1.5 is ready to use on a GPU.")
Sources: README.md
GPU Installation Workflow
graph TD
A[Install fastembed-gpu] --> B{Check CUDA Version}
B -->|CUDA 11.8+| C[Install Compatible Driver]
B -->|CUDA < 11.8| D[Upgrade CUDA]
C --> E[Verify ONNX Runtime GPU Support]
E --> F[Import TextEmbedding]
F --> G[Configure Providers]
G --> H[GPU Inference Ready]
Qdrant Integration
FastEmbed integrates seamlessly with the Qdrant vector database for production deployments.
Installation with Qdrant Client
# Standard Qdrant with FastEmbed
pip install qdrant-client[fastembed]
# GPU-accelerated FastEmbed with Qdrant
pip install qdrant-client[fastembed-gpu]
On zsh shells, use quotes:
pip install 'qdrant-client[fastembed]'
Sources: README.md
Complete Qdrant Integration Example
from qdrant_client import QdrantClient, models
# Initialize the client
client = QdrantClient("localhost", port=6333) # For production
# client = QdrantClient(":memory:") # For experimentation
model_name = "sentence-transformers/all-MiniLM-L6-v2"
payload = [
{"document": "Qdrant has Langchain integrations", "source": "Langchain-docs"},
{"document": "Qdrant also has Llama Index integrations", "source": "LlamaIndex-docs"},
]
docs = [models.Document(text=data["document"], model=model_name) for data in payload]
ids = [42, 2]
client.create_collection(
"demo_collection",
vectors_config=models.VectorParams(
size=client.get_embedding_size(model_name), distance=models.Distance.COSINE
)
)
client.upload_collection(
collection_name="demo_collection",
vectors=docs,
ids=ids,
payload=payload,
)
search_result = client.query_points(
collection_name="demo_collection",
query=docs[0],
limit=5,
)
Sources: README.md
Cache Configuration
Default Cache Location
Models are cached in fastembed_cache within the system's temp directory by default. This location can be customized using environment variables.
FASTEMBED_CACHE_PATH
Set a custom cache directory:
export FASTEMBED_CACHE_PATH=/path/to/custom/cache
The cache directory structure follows Hugging Face's conventions, with models organized by their source repository.
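The lookup described above can be sketched as a small function. This is an illustration of the documented behavior, not FastEmbed's actual implementation: resolve_cache_dir is a hypothetical name, and the rule that an explicit argument takes precedence over the environment variable is an assumption.

```python
import os
import tempfile


def resolve_cache_dir(cache_dir=None):
    """Pick a cache directory: explicit argument, then env var, then temp dir."""
    if cache_dir is not None:
        return cache_dir
    # Default described in this guide: fastembed_cache inside the system temp directory.
    default = os.path.join(tempfile.gettempdir(), "fastembed_cache")
    return os.environ.get("FASTEMBED_CACHE_PATH", default)
```

Calling resolve_cache_dir() with no argument and no environment variable set returns the temp-directory default.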
Model Loading Behavior
graph TD
A[Import FastEmbed Module] --> B{Model in Cache?}
B -->|Yes| C[Load from Cache]
B -->|No| D[Download from Source]
D --> E{HuggingFace Available?}
E -->|Yes| F[Download from HF Hub]
E -->|No| G[Use URL Source]
C --> H[Initialize ONNX Session]
F --> H
G --> H
H --> I[Model Ready for Inference]
Usage Examples
Dense Text Embedding
from fastembed import TextEmbedding
documents = [
"passage: FastEmbed is a fast embedding library",
"query: What is FastEmbed?",
"passage: Qdrant is a vector database",
]
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = list(model.embed(documents))
print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimension: {embeddings[0].shape}")
Sparse Text Embedding (SPLADE++)
from fastembed import SparseTextEmbedding
model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
embeddings = list(model.embed(documents))
# Output format:
# [
# SparseEmbedding(indices=[ 17, 123, 919, ... ], values=[0.71, 0.22, 0.39, ...]),
# SparseEmbedding(indices=[ 38, 12, 91, ... ], values=[0.11, 0.22, 0.39, ...])
# ]
Sources: README.md
Image Embedding
from fastembed import ImageEmbedding
from PIL import Image
model = ImageEmbedding(model_name="Qdrant/resnet50-onnx")
images = [Image.open("path/to/image.jpg")]
embeddings = list(model.embed(images))
Custom Model Source
Load a supported model with custom configuration:
from fastembed import TextEmbedding
from fastembed.common.model_description import ModelSource, PoolingType
# Register the configuration under a custom name, then load it by that name
TextEmbedding.add_custom_model(
    model="custom-model",
    pooling=PoolingType.MEAN,
    normalization=True,
    sources=ModelSource(hf="intfloat/multilingual-e5-small"),
    dim=384,
    model_file="onnx/model.onnx",
)
model = TextEmbedding(model_name="custom-model")
embeddings = list(model.embed(documents))
Sources: README.md
Late Interaction Models
Late interaction models like ColBERT enable more sophisticated similarity matching:
from fastembed import LateInteractionTextEmbedding
model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
embeddings = list(model.embed(documents))
Post-Processing with Muvera
The Muvera post-processor enables Fixed Dimensional Encoding (FDE) for multi-vector models:
from fastembed import LateInteractionTextEmbedding
from fastembed.postprocess import Muvera
import numpy as np
model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
muvera = Muvera.from_multivector_model(
model=model,
k_sim=6,
dim_proj=32
)
embeddings = np.array(list(model.embed(["sample text"])))
fde = muvera.process_document(embeddings[0])
Sources: fastembed/postprocess/muvera.py
Troubleshooting
Common Installation Issues
| Issue | Solution |
|---|---|
| ModuleNotFoundError: No module named 'fastembed' | Run pip install fastembed |
| CUDA not available | Install fastembed-gpu and verify NVIDIA driver |
| Model download fails | Check network connectivity and HuggingFace access |
| Out of memory | Reduce batch size or use smaller model variant |
Checking Installed Version
import fastembed
print(fastembed.__version__)
Verifying GPU Availability
import onnxruntime as ort
print(f"Available providers: {ort.get_available_providers()}")
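The provider list printed above can drive the choice of execution providers. A minimal sketch, assuming the standard ONNX Runtime provider names CUDAExecutionProvider and CPUExecutionProvider; pick_providers is a hypothetical helper and takes the list as an argument so it can be tried without a GPU.

```python
def pick_providers(available):
    """Prefer CUDA when present, keeping CPU as a fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    # Preserve the preference order while dropping providers that are not installed.
    return [name for name in preferred if name in available]


cpu_only = pick_providers(["CPUExecutionProvider"])
with_gpu = pick_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
```

The returned list has the shape expected by the providers argument shown in the CUDA Provider Setup section.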
Advanced Configuration
Lazy Loading
Enable lazy model loading for memory-efficient initialization:
from fastembed import TextEmbedding
model = TextEmbedding(
model_name="BAAI/bge-small-en-v1.5",
lazy_load=True # Model loads on first inference call
)
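The idea behind lazy loading can be shown with a generic pattern: defer the expensive load until the first call that needs it. The sketch below is not FastEmbed's implementation; LazyModel and fake_loader are hypothetical names, and the stand-in "model" just returns text lengths.

```python
class LazyModel:
    """Wraps a loader function and calls it only when the model is first needed."""

    def __init__(self, loader, lazy_load=False):
        self._loader = loader
        self._model = None
        if not lazy_load:
            self._ensure_loaded()

    def _ensure_loaded(self):
        if self._model is None:
            self._model = self._loader()
        return self._model

    def embed(self, texts):
        model = self._ensure_loaded()
        return [model(text) for text in texts]


load_calls = []


def fake_loader():
    load_calls.append("loaded")
    return len  # stand-in "model": maps a text to its length


lazy = LazyModel(fake_loader, lazy_load=True)
calls_before_embed = len(load_calls)  # nothing has been loaded yet
lengths = lazy.embed(["ab", "abc"])   # first use triggers the load
```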
Thread Configuration
Optimize CPU thread usage:
from fastembed import TextEmbedding
model = TextEmbedding(
model_name="BAAI/bge-small-en-v1.5",
threads=4 # Limit to 4 threads
)
Device Selection
from fastembed import TextEmbedding
from fastembed.common import Device
model = TextEmbedding(
model_name="BAAI/bge-small-en-v1.5",
cuda=True # or Device.AUTO for automatic detection
)
Documentation Resources
For more information, refer to the official FastEmbed documentation.
Sources: mkdocs.yml
Sources: README.md:1-100
Quick Start Guide
Related topics: Introduction to FastEmbed, Text Embedding Module, Image Embedding Module
FastEmbed is a lightweight, fast, and accurate embedding library developed by Qdrant. It provides text and image embeddings using ONNX-based models optimized for production deployment. This guide covers the essential steps to get started with FastEmbed for your embedding needs.
Overview
FastEmbed enables developers to generate high-quality vector embeddings for text and images with minimal configuration. The library supports multiple embedding types including dense embeddings, sparse embeddings, and cross-encoder reranking models.
graph TD
A[FastEmbed Library] --> B[Text Embeddings]
A --> C[Image Embeddings]
A --> D[Sparse Embeddings]
A --> E[Reranking Models]
B --> B1[Dense Models]
B --> B2[Pooled Models]
D --> D1[SPLADE++]
D --> D2[BM25]
D --> D3[MiniCOIL]
Installation
Install FastEmbed using pip:
pip install fastembed
For CUDA acceleration support:
pip install fastembed-gpu
Basic Text Embedding
Loading a Model
Import and initialize a text embedding model:
from fastembed import TextEmbedding
# Use the default model (BAAI/bge-small-en-v1.5)
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
Generating Embeddings
documents = [
"passage: A man is eating food.",
"passage: A man is eating a piece of broccoli.",
"passage: A man is eating pasta.",
"passage: A woman is cutting vegetables.",
]
embeddings = list(model.embed(documents))
Some models, such as snowflake/snowflake-arctic-embed-xs, expect a prefix like "passage:" to indicate the text type; for BAAI/bge-small-en-v1.5 the prefix is optional.
Sources: README.md
Supported Models
Text Embedding Models
| Model Name | Dimension | Languages | Max Tokens | License | Size (GB) |
|---|---|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | English | 512 | MIT | 0.067 |
| BAAI/bge-base-en-v1.5 | 768 | English | 512 | MIT | 0.21 |
| BAAI/bge-large-en-v1.5 | 1024 | English | 512 | MIT | 1.20 |
| intfloat/multilingual-e5-small | 384 | Multilingual (~100) | 512 | MIT | 0.09 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 | Multilingual (~50) | 512 | Apache 2.0 | 0.22 |
| jinaai/jina-embeddings-v2-base-en | 768 | English | 8192 | Apache 2.0 | 0.52 |
| snowflake/snowflake-arctic-embed-xs | 384 | English | 512 | Apache 2.0 | 0.09 |
| snowflake/snowflake-arctic-embed-m-long | 768 | English | 2048 | Apache 2.0 | 0.54 |
Sources: fastembed/text/onnx_embedding.py
Pooled Normalized Embeddings
The PooledNormalizedEmbedding class applies mean pooling to the model output and normalizes the result:
from fastembed.text.pooled_normalized_embedding import PooledNormalizedEmbedding
model = PooledNormalizedEmbedding(model_name="jinaai/jina-embeddings-v2-base-en")
The post-processing combines mean pooling with L2 normalization:
def _post_process_onnx_output(self, output: OnnxOutputContext, **kwargs: Any) -> Iterable[NumpyArray]:
embeddings = output.model_output
attn_mask = output.attention_mask
return normalize(self.mean_pooling(embeddings, attn_mask))
Sources: fastembed/text/pooled_normalized_embedding.py:78-84
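To make the two steps concrete, here is a pure-Python sketch of masked mean pooling followed by L2 normalization for a single sequence. It is an illustration only: the library works on batched NumPy arrays, and the function names and toy values below are assumptions chosen for readability.

```python
import math


def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against an all-zero mask
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Two real tokens and one padding token (mask 0) that must not affect the result.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pooling(tokens, mask)
unit = l2_normalize(pooled)
```

After normalization, the dot product of two such vectors equals their cosine similarity.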
Image Embeddings
FastEmbed supports multimodal models for image embeddings:
from fastembed import ImageEmbedding
model = ImageEmbedding(model_name="Qdrant/Unicom-ViT-B-16")
Supported Image Models
| Model Name | Dimension | Type | License | Size (GB) |
|---|---|---|---|---|
| Qdrant/Unicom-ViT-B-16 | 768 | Multimodal (text&image) | Apache 2.0 | 0.82 |
| Qdrant/Unicom-ViT-B-32 | 512 | Multimodal (text&image) | Apache 2.0 | 0.48 |
| jinaai/jina-clip-v1 | 768 | Multimodal (text&image) | Apache 2.0 | 0.34 |
Sources: fastembed/image/onnx_embedding.py
Sparse Embeddings
Sparse embeddings provide interpretable, non-dense vector representations useful for keyword-aware semantic search.
SPLADE++
SPLADE++ is a sparse embedding model that resolves semantic meaning while preserving keyword match behavior:
from fastembed import SparseTextEmbedding
model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
embeddings = list(model.embed(documents))
Returns sparse vectors with indices and values:
[
SparseEmbedding(indices=[17, 123, 919, ...], values=[0.71, 0.22, 0.39, ...]),
SparseEmbedding(indices=[38, 12, 91, ...], values=[0.11, 0.22, 0.39, ...])
]
BM25
Traditional BM25 implemented as sparse embeddings:
from fastembed import SparseTextEmbedding
model = SparseTextEmbedding(model_name="Qdrant/bm25")
BM25 formula:
score(q, d) = SUM[ IDF(q_i) * (f(q_i, d) * (k + 1)) / (f(q_i, d) + k * (1 - b + b * (|d| / avg_len))) ]
Sources: fastembed/sparse/bm25.py:47-52
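The formula can be evaluated directly. The sketch below is a plain-Python transcription of it under stated assumptions, not FastEmbed's implementation: bm25_score is a hypothetical function, IDF values are supplied by the caller as a dictionary, and the defaults k=1.2 and b=0.75 are commonly used values assumed here.

```python
def bm25_score(query_terms, doc_terms, idf, avg_len, k=1.2, b=0.75):
    """Sum the per-term BM25 contributions of query_terms for one document."""
    doc_len = len(doc_terms)
    # Length normalization: longer-than-average documents are penalized.
    length_norm = 1 - b + b * (doc_len / avg_len)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # f(q_i, d): term frequency in the document
        if tf == 0:
            continue
        score += idf.get(term, 0.0) * (tf * (k + 1)) / (tf + k * length_norm)
    return score


# Toy example: the document has average length, so length_norm is 1.
doc = ["fast", "embedding", "library"]
score = bm25_score(["fast", "missing"], doc, idf={"fast": 2.0}, avg_len=3.0, k=1.0)
```

With k=1.0 and a single occurrence of "fast", the term contributes 2.0 * (1 * 2) / (1 + 1) = 2.0, and the absent term contributes nothing.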
MiniCOIL
MiniCOIL combines semantic embeddings with exact keyword matching:
from fastembed import SparseTextEmbedding
model = SparseTextEmbedding(model_name="Qdrant/minicoil-v1")
Reranking Models
Cross-encoder reranking improves search results by re-scoring candidate documents:
from fastembed.rerank.cross_encoder import TextCrossEncoder
model = TextCrossEncoder(model_name="jinaai/jina-reranker-v1-turbo-en")
Supported Reranker Models
| Model Name | License | Size (GB) | Context Length |
|---|---|---|---|
| jinaai/jina-reranker-v1-turbo-en | Apache 2.0 | 0.15 | 8K |
| jinaai/jina-reranker-v2-base-multilingual | CC BY-NC 4.0 | 1.11 | 1K (sliding window) |
Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py
Common Configuration Options
Initialization Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | "BAAI/bge-small-en-v1.5" | Name of the model to use |
| cache_dir | str or None | None | Cache directory path |
| threads | int or None | None | Number of threads for ONNX execution |
| providers | Sequence[OnnxProvider] | None | ONNX execution providers (CPU, CUDA, etc.) |
| cuda | bool or Device | Device.AUTO | Enable CUDA acceleration |
| device_ids | list[int] | None | Specific GPU device IDs |
| lazy_load | bool | False | Defer model loading until first use |
Sources: fastembed/text/onnx_embedding.py:123-136
Complete Example Workflow
graph LR
A[Input Documents] --> B[TextEmbedding]
B --> C[Embedding Vectors]
C --> D[Vector Database]
D --> E[Similarity Search]
E --> F[Candidate Results]
F --> G[TextCrossEncoder]
G --> H[Reranked Results]
from fastembed import TextEmbedding, SparseTextEmbedding
from fastembed.rerank.cross_encoder import TextCrossEncoder
# 1. Load models
text_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
sparse_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
reranker = TextCrossEncoder(model_name="jinaai/jina-reranker-v1-turbo-en")
# 2. Generate dense embeddings
documents = ["query: fastembed is fast", "passage: FastEmbed provides efficient embeddings"]
dense_embeddings = list(text_model.embed(documents))
# 3. Generate sparse embeddings
sparse_embeddings = list(sparse_model.embed(documents))
# 4. Rerank results
query = "What is FastEmbed?"
# rerank yields one relevance score per document, in input order
scores = list(reranker.rerank(query, documents))
Language-Specific Models
For non-English content, use multilingual models:
| Language | Recommended Model |
|---|---|
| Chinese | BAAI/bge-small-zh-v1.5, jinaai/jina-embeddings-v2-base-zh |
| German | jinaai/jina-embeddings-v2-base-de |
| Spanish | jinaai/jina-embeddings-v2-base-es |
| Code | jinaai/jina-embeddings-v2-base-code |
| Multilingual | intfloat/multilingual-e5-small, sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |
Prefix Requirements
Some models require specific prefixes to distinguish query and passage text:
| Model | Prefix Required | Example |
|---|---|---|
| BAAI/bge-small-en-v1.5 | Optional | "query: ...", "passage: ..." |
| intfloat/multilingual-e5-small | Yes | "query: ...", "passage: ..." |
| snowflake/snowflake-arctic-embed-xs | Yes | "query: ...", "passage: ..." |
| jinaai/jina-embeddings-v2-base-en | Not necessary | Plain text |
Sources: fastembed/text/onnx_embedding.py
Next Steps
- Explore the API Reference for detailed method documentation
- Learn about Model Selection to choose the right embedding model
- Read Advanced Usage for performance optimization
- Check Examples for real-world use cases
Sources: fastembed/text/onnx_embedding.py
System Architecture
Related topics: Introduction to FastEmbed, ONNX Model Infrastructure
FastEmbed is a lightweight, fast text and image embedding library built on ONNX Runtime. The architecture is designed around modularity, enabling multiple embedding types (dense, sparse, late-interaction) through a unified interface while leveraging ONNX for cross-platform inference optimization.
Architecture Overview
FastEmbed follows a layered architecture with clear separation of concerns:
graph TD
subgraph "Public API Layer"
A["TextEmbedding<br/>ImageEmbedding<br/>SparseTextEmbedding<br/>LateInteractionTextEmbedding"]
end
subgraph "Embedding Base Classes"
B["TextEmbeddingBase"]
C["ImageEmbeddingBase"]
D["SparseTextEmbeddingBase"]
E["LateInteractionTextEmbeddingBase"]
end
subgraph "ONNX Abstraction Layer"
F["OnnxTextModel"]
G["OnnxImageModel"]
end
subgraph "Model Management"
H["ModelManagement"]
I["OnnxModel"]
end
subgraph "Parallel Processing"
J["ParallelProcessor"]
K["EmbeddingWorker<br/>OnnxTextEmbeddingWorker"]
end
subgraph "ONNX Runtime"
L["InferenceSession"]
end
A --> B
A --> C
A --> D
A --> E
B --> F
C --> G
F --> I
G --> I
I --> L
H --> I
J --> K
K --> F
Core Components
Base Class Hierarchy
FastEmbed implements embedding models through inheritance hierarchies that separate concerns between the embedding interface and the ONNX runtime integration.
#### Text Embedding Base
The TextEmbeddingBase class provides the foundation for all text embedding models. It defines the abstract interface that concrete implementations must follow.
| Method | Purpose | Source |
|---|---|---|
| embed() | Generate embeddings for input documents | text_embedding_base.py |
| _list_supported_models() | Return list of supported model descriptions | text_embedding_base.py |
| _post_process_onnx_output() | Transform raw ONNX output to final embeddings | text_embedding_base.py |
#### Image Embedding Base
The ImageEmbeddingBase class mirrors the text embedding architecture for image inputs, supporting multimodal models like Jina CLIP.
| Property | Type | Description |
|---|---|---|
| model_name | str | HuggingFace model identifier |
| cache_dir | str or None | Local cache directory for model files |
| lazy_load | bool | Defer model loading until first use |
ONNX Model Abstraction
The OnnxModel class serves as the bridge between the embedding interface and ONNX Runtime execution.
classDiagram
class OnnxModel~T~ {
+str model_name
+str cache_dir
+InferenceSession inference_session
+load_model() Any
+run(input_feed)~T~
}
class OnnxTextModel~T~ {
+mean_pooling(output, attention_mask)
+encode(input_data) Iterable~T~
}
class OnnxImageModel~T~ {
+preprocess_image(image) Tensor
+encode(images) Iterable~T~
}
OnnxModel <|-- OnnxTextModel
OnnxModel <|-- OnnxImageModel
The ONNX model abstraction provides:
- Model Loading: Downloads and caches ONNX models from HuggingFace or custom URLs
- Session Management: Creates and configures ONNX Runtime inference sessions with specified providers (CPU, CUDA, TensorRT)
- Input/Output Handling: Manages input preprocessing and output postprocessing
Embedding Types
FastEmbed supports multiple embedding paradigms through specialized classes.
Dense Embeddings
Dense embeddings convert inputs into fixed-dimensional continuous vectors. The OnnxTextEmbedding class provides dense text embeddings using models like BGE, GTE, and Jina embeddings.
graph LR
A["Input Text"] --> B["Tokenization"]
B --> C["ONNX Inference"]
C --> D["mean_pooling"]
D --> E["L2 Normalization"]
E --> F["Dense Vector"]
Key supported models:
| Model | Dimension | Context Length | Prefix Required |
|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | 512 | No |
| BAAI/bge-base-en-v1.5 | 768 | 512 | No |
| BAAI/bge-large-en-v1.5 | 1024 | 512 | No |
| jinaai/jina-embeddings-v2-base-en | 768 | 8192 | No |
| intfloat/multilingual-e5-small | 384 | 512 | Yes |
Pooled Normalized Embeddings
The PooledNormalizedEmbedding class extends PooledEmbedding with built-in mean pooling and L2 normalization.
# From pooled_normalized_embedding.py
class PooledNormalizedEmbedding(PooledEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        if output.attention_mask is None:
            raise ValueError("attention_mask must be provided for document post-processing")
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return normalize(self.mean_pooling(embeddings, attn_mask))
Sparse Embeddings
Sparse embeddings represent documents as sparse vectors with non-zero values only at relevant token positions. Two of FastEmbed's sparse approaches are summarized here:
#### BM25
BM25 is implemented as a traditional sparse embedding model with IDF weighting. The formula used:
score(q, d) = Σ[ IDF(q_i) * (f(q_i, d) * (k + 1)) / (f(q_i, d) + k * (1 - b + b * (|d| / avg_len))) ]
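Where:
- IDF(q_i) is the inverse document frequency of query term q_i
- f(q_i, d) is the term frequency of q_i in document d
- |d| is the document length and avg_len is the average document length
- k, b are hyperparameters controlling saturation and length normalization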
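The formula can be sketched directly in Python. This is a minimal, dependency-free illustration and not FastEmbed's implementation: `idf`, `bm25_score`, the toy corpus, and the smoothed IDF variant are all assumptions chosen for the example.

```python
import math

def idf(term: str, corpus: list[list[str]]) -> float:
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(1 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k: float = 1.5, b: float = 0.75) -> float:
    """Sum the per-term contributions from the formula above."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query:
        tf = doc.count(term)  # f(q_i, d)
        denom = tf + k * (1 - b + b * len(doc) / avg_len)
        score += idf(term, corpus) * tf * (k + 1) / denom
    return score

# Toy corpus of pre-tokenized documents (illustrative only)
corpus = [
    "the quick brown fox".split(),
    "the lazy dog sleeps".split(),
    "a fox and a dog".split(),
]
scores = [bm25_score(["fox"], doc, corpus) for doc in corpus]
```

Documents without the query term score zero, and of two documents with the same term frequency the shorter one scores higher because of the length normalization controlled by `b`.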
#### SPLADE++
SPLADE++ models use neural networks to generate sparse embeddings that combine semantic understanding with exact keyword matching.
Late Interaction Embeddings
Late interaction models like ColBERT generate multiple token-level embeddings that can be compared efficiently during retrieval. These are often combined with postprocessors like MUVERA for fixed-dimensional encoding.
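To make the token-level comparison concrete, here is a small sketch of ColBERT-style MaxSim scoring in plain Python. The function names and the two-dimensional toy vectors are hypothetical; real models produce one much larger vector per token.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def max_sim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """For each query token take its best-matching document token,
    then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query tokens, scored against two candidate documents
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.1, 0.1], [0.3, 0.2]]

score_a = max_sim(query_vecs, doc_a)
score_b = max_sim(query_vecs, doc_b)
```

Because every document keeps one vector per token, storage grows with document length, which is the cost that fixed-dimensional post-processing is meant to reduce.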
Model Management
Model Source Configuration
Models can be loaded from multiple sources defined through the ModelSource class:
sources=ModelSource(
    hf="jinaai/jina-embeddings-v2-base-en",  # HuggingFace Hub
    url="https://storage.googleapis.com/...",  # Direct URL download
    _deprecated_tar_struct=True  # Legacy tar format
)
Cache Strategy
The ModelManagement class handles model caching and lazy loading:
- Cache Directory: Configurable via the `FASTEMBED_CACHE_PATH` environment variable or the `cache_dir` parameter
- Lazy Loading: Models are not loaded until first inference when `lazy_load=True`
- Provider Selection: Automatic provider selection (CUDA > CPU) with fallback
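The cache directory lookup might be resolved along these lines. This is a sketch under stated assumptions: `resolve_cache_dir` is a hypothetical helper, and both the precedence (explicit argument before environment variable) and the temp-directory default are assumed rather than taken from the source.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional

def resolve_cache_dir(cache_dir: Optional[str] = None) -> Path:
    """Hypothetical helper: explicit argument first, then the environment
    variable, then a temp-directory default (assumed path)."""
    if cache_dir is not None:
        return Path(cache_dir)
    env_path = os.environ.get("FASTEMBED_CACHE_PATH")
    if env_path:
        return Path(env_path)
    return Path(tempfile.gettempdir()) / "fastembed_cache"
```

Setting `FASTEMBED_CACHE_PATH` once per environment keeps model downloads in a shared location without passing `cache_dir` at every call site.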
graph TD
A["Model Request"] --> B{"Cache Hit?"}
B -->|Yes| C["Load from Cache"]
B -->|No| D["Download Model"]
D --> E["Store in Cache"]
C --> F["Initialize ONNX Session"]
E --> F
F --> G["Ready for Inference"]
Parallel Processing
FastEmbed uses a worker-based parallel processing architecture for efficient batch inference.
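The queue-and-worker pattern can be illustrated with the standard library alone. The sketch below uses threads for brevity and a stand-in "embedding" (text length), so it shows only the flow of batches through an input queue, a worker pool, and a result queue. All names here are hypothetical and the real workers run model inference instead.

```python
import queue
import threading

def worker(in_q: queue.Queue, out_q: queue.Queue) -> None:
    """Pull (batch_index, batch) items until a None sentinel arrives."""
    while True:
        item = in_q.get()
        if item is None:
            break
        idx, batch = item
        # Stand-in for model inference: "embed" each text as its length.
        out_q.put((idx, [len(text) for text in batch]))

def process_in_parallel(batches: list[list[str]], num_workers: int = 2) -> list[list[int]]:
    in_q: queue.Queue = queue.Queue()
    out_q: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(in_q, out_q))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for idx, batch in enumerate(batches):
        in_q.put((idx, batch))
    for _ in threads:
        in_q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    # Workers finish in arbitrary order, so restore input order by index.
    ordered = sorted(out_q.get() for _ in batches)
    return [vectors for _, vectors in ordered]

results = process_in_parallel([["a", "bb"], ["ccc"], ["dddd", "e"]])
```

Tagging each batch with its index is what lets the caller return results in input order even though workers complete at different times.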
ParallelProcessor
The ParallelProcessor class manages a pool of workers for parallel embedding generation:
graph TD
subgraph "Main Process"
A["ParallelProcessor"]
B["Input Queue"]
end
subgraph "Worker Pool"
C["Worker 1"]
D["Worker 2"]
E["Worker N"]
end
B --> C
B --> D
B --> E
C --> F["Result Queue"]
D --> F
E --> F
Worker Classes
Workers inherit from EmbeddingWorker or specialized variants:
| Worker Class | Purpose |
|---|---|
| `EmbeddingWorker` | Base worker for generic embeddings |
| `OnnxTextEmbeddingWorker` | Text embedding with ONNX inference |
| `PooledNormalizedEmbeddingWorker` | Worker with pooling and normalization |
Workers implement the init_embedding() method to initialize the embedding model:
def init_embedding(
    self,
    model_name: str,
    cache_dir: str | None = None,
    threads: int | None = None,
    providers: Sequence[OnnxProvider] | None = None,
    cuda: bool | Device = Device.AUTO,
    device_ids: list[int] | None = None,
    lazy_load: bool = False,
    device_id: int | None = None,
    specific_model_path: str | None = None,
    **kwargs: Any,
):
    ...  # body omitted; each worker class builds its own embedding model here
Cross-Encoder Reranking
The cross-encoder reranking system follows a separate architectural path:
graph LR
A["Query"] --> D["Cross-Encoder"]
B["Candidate Doc"] --> D
D --> E["Relevance Scores"]
The OnnxTextCrossEncoder class provides reranking through:
class OnnxTextCrossEncoder(TextCrossEncoderBase, OnnxCrossEncoderModel):
    @classmethod
    def _list_supported_models(cls) -> list[BaseModelDescription]:
        return supported_onnx_models
Supported reranker models:
| Model | Context Length | Languages |
|---|---|---|
| jinaai/jina-reranker-v1-turbo-en | 1K | English |
| jinaai/jina-reranker-v2-base-multilingual | 1K (sliding window) | Multilingual |
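The reranking step itself reduces to scoring each (query, candidate) pair and sorting. The sketch below shows that flow with a toy word-overlap scorer standing in for the cross-encoder. `rerank` and `overlap_score` are hypothetical names, and a real reranker would replace `overlap_score` with model inference.

```python
from typing import Callable

def rerank(query: str, docs: list[str],
           score_fn: Callable[[str, str], float]) -> list[tuple[str, float]]:
    """Score every (query, document) pair and sort by descending relevance."""
    scored = [(doc, score_fn(query, doc)) for doc in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

docs = [
    "Cooking pasta at home",
    "Fast CPU execution",
    "ONNX Runtime speeds up inference",
]
ranked = rerank("fast onnx inference", docs, overlap_score)
```

Unlike embedding models, the scorer sees the query and the document together, which is why reranking is usually applied only to a short list of candidates.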
Device and Provider Management
FastEmbed supports multiple execution providers with automatic device selection:
# Device selection logic (simplified)
if cuda and Device.CUDA available:
use CUDA provider
elif cuda and Device.ONNX_CPU fallback:
use CPU provider
else:
use default provider
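As a runnable counterpart to the pseudocode, the fallback can be sketched as a pure function over provider names. `select_providers` is a hypothetical helper written for illustration and is not the library's own selection code.

```python
def select_providers(cuda: bool, available: list[str]) -> list[str]:
    """Prefer CUDA when requested and available, otherwise fall back to CPU."""
    if cuda and "CUDAExecutionProvider" in available:
        # Keep CPU as a secondary provider for operators CUDA cannot run.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

gpu_host = ["CUDAExecutionProvider", "CPUExecutionProvider"]
cpu_host = ["CPUExecutionProvider"]

on_gpu = select_providers(cuda=True, available=gpu_host)
fallback = select_providers(cuda=True, available=cpu_host)
cpu_only = select_providers(cuda=False, available=gpu_host)
```

On a real system the `available` list could come from `onnxruntime.get_available_providers()`.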
Supported Providers
| Provider | Device | Use Case |
|---|---|---|
| CPUExecutionProvider | CPU | General purpose, no GPU |
| CUDAExecutionProvider | NVIDIA GPU | Fast inference with CUDA |
| TensorRTExecutionProvider | NVIDIA GPU | Optimized batch inference |
Configuration Options
Model Initialization Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"BAAI/bge-small-en-v1.5"` | Model identifier |
| `cache_dir` | `str \| None` | `None` | Cache directory path |
| `threads` | `int \| None` | `None` | Thread count for CPU execution |
| `providers` | `Sequence[OnnxProvider] \| None` | `None` | ONNX execution providers |
| `cuda` | `bool \| Device` | `Device.AUTO` | CUDA device selection |
| `device_ids` | `list[int] \| None` | `None` | Specific GPU device IDs |
| `lazy_load` | `bool` | `False` | Defer loading until first use |
| `device_id` | `int \| None` | `None` | Single device ID assignment |
| `specific_model_path` | `str \| None` | `None` | Override model file path |
Data Flow
Text Embedding Pipeline
sequenceDiagram
participant Client
participant TextEmbedding
participant OnnxTextModel
participant ONNXRuntime
Client->>TextEmbedding: embed(documents)
TextEmbedding->>TextEmbedding: preprocess(documents)
TextEmbedding->>OnnxTextModel: encode(preprocessed)
OnnxTextModel->>ONNXRuntime: run(input_feed)
ONNXRuntime-->>OnnxTextModel: raw_output
OnnxTextModel->>OnnxTextModel: mean_pooling + normalize
OnnxTextModel-->>TextEmbedding: embeddings
TextEmbedding-->>Client: Iterable[NDArray]
Model Description Schema
Each model is described by a DenseModelDescription or BaseModelDescription:
DenseModelDescription(
    model="BAAI/bge-small-en-v1.5",
    dim=384,
    description="Text embeddings, Unimodal (text), English, 512 input tokens truncation",
    license="mit",
    size_in_GB=0.067,
    sources=ModelSource(hf="qdrant/bge-small-en-v1.5-onnx-q"),
    model_file="model_optimized.onnx",
)
| Field | Type | Description |
|---|---|---|
| `model` | `str` | HuggingFace model identifier |
| `dim` | `int` | Embedding dimension |
| `description` | `str` | Human-readable model description |
| `license` | `str` | Model license |
| `size_in_GB` | `float` | Model file size |
| `sources` | `ModelSource` | Download sources configuration |
| `model_file` | `str` | ONNX model file name |
| `additional_files` | `list[str]` | Extra files (vocab, configs) |
| `requires_idf` | `bool` | IDF file requirement for sparse models |
Post-Processing
MUVERA Post-Processor
The Muvera post-processor converts late-interaction (multi-vector) embeddings to fixed-dimensional representations:
graph TD
A["Multi-Vector Embeddings"] --> B["Muvera Processing"]
B --> C["Fixed-Dimensional Encoding"]
C --> D["L2 Normalized FDE"]
subgraph "Parameters"
E["k_sim: number of partitions"]
F["dim_proj: projection dimension"]
G["r_reps: repetitions"]
end
Output dimension calculation: r_reps * 2^k_sim * dim_proj
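A quick worked example of the dimension formula, using illustrative parameter values rather than documented defaults. `fde_dimension` is a hypothetical helper name.

```python
def fde_dimension(k_sim: int, dim_proj: int, r_reps: int) -> int:
    """Size of the fixed-dimensional encoding: r_reps * 2^k_sim * dim_proj."""
    return r_reps * (2 ** k_sim) * dim_proj

# 20 repetitions * 2^5 partitions * 16 projected dimensions
dim = fde_dimension(k_sim=5, dim_proj=16, r_reps=20)
```

The output size doubles with every increment of `k_sim`, so that parameter dominates the storage cost of the resulting vectors.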
Summary
In summary, the FastEmbed architecture provides:
- Modularity: Clear separation between embedding types, ONNX abstraction, and model management
- Performance: ONNX Runtime integration with automatic provider selection
- Flexibility: Support for dense, sparse, and late-interaction embeddings
- Extensibility: Worker-based parallel processing with lazy loading support
- Portability: Cross-platform ONNX execution with CUDA and TensorRT acceleration
Source: https://github.com/qdrant/fastembed / Human Manual
Text Embedding Module
Related topics: System Architecture, GPU Support and Acceleration
The Text Embedding Module in FastEmbed provides high-performance text vectorization capabilities using ONNX runtime for efficient inference. It supports dense embeddings, pooled embeddings, normalized embeddings, and multimodal (CLIP) text embeddings.
Architecture Overview
The module follows a layered architecture with base classes providing common functionality and specialized implementations for different embedding strategies.
graph TD
Base[TextEmbeddingBase] --> OnnxTextEmbedding
PooledEmbedding --> PooledNormalizedEmbedding
OnnxTextEmbedding --> CLIPOnnxEmbedding
OnnxTextEmbedding --> PooledEmbedding
Worker[OnnxTextEmbeddingWorker] -.->|init_embedding| OnnxTextEmbedding
CLIPWorker[CLIPEmbeddingWorker] -.->|init_embedding| CLIPOnnxEmbedding
Models[Supported Models] -->|BGE, Jina, GTE, etc| OnnxTextEmbedding
Core Components
| Component | File | Purpose |
|---|---|---|
| `TextEmbeddingBase` | text_embedding_base.py | Abstract base class defining the embedding interface |
| `OnnxTextEmbedding` | onnx_embedding.py | Main ONNX-based text embedding implementation |
| `PooledEmbedding` | pooled_embedding.py | Mean pooling variant for document embeddings |
| `PooledNormalizedEmbedding` | pooled_normalized_embedding.py | L2-normalized pooled embeddings |
| `CLIPOnnxEmbedding` | clip_embedding.py | CLIP-based multimodal text embeddings |
Supported Models
FastEmbed's Text Embedding Module supports numerous pre-trained models across different languages and use cases.
Model Categories
| Category | Models | Dim | License | Description |
|---|---|---|---|---|
| BGE (Base) | BAAI/bge-base-en-v1.5 | 768 | MIT | English text, 512 tokens |
| BGE Large | BAAI/bge-large-en-v1.5 | 1024 | MIT | High-quality English embeddings |
| BGE Small | BAAI/bge-small-en-v1.5 | 384 | MIT | Lightweight English embeddings |
| Jina | jinaai/jina-embeddings-v2-base-en | 768 | Apache 2.0 | English, 8192 tokens |
| Jina Code | jinaai/jina-embeddings-v2-base-code | 768 | Apache 2.0 | 30 programming languages |
| GTE | thenlper/gte-base | 768 | MIT | General text embeddings |
| Snowflake Arctic | snowflake/snowflake-arctic-embed-m | 768 | Apache 2.0 | Query-optimized embeddings |
| Multilingual E5 | intfloat/multilingual-e5-large | 1024 | MIT | 100 languages support |
Model Selection Criteria
Different models have varying requirements for query/document prefixes:
# Models requiring prefixes
prefix_required = ["intfloat/multilingual-e5-small", "intfloat/multilingual-e5-large"]
# Models not requiring prefixes
prefix_not_required = ["BAAI/bge-base-en-v1.5", "jinaai/jina-embeddings-v2-base-en"]
Sources: fastembed/text/onnx_embedding.py:1-200
Class Hierarchy
TextEmbeddingBase
The abstract base class defining the contract for all text embedding implementations.
class TextEmbeddingBase(EmbeddingModel[list[float]]):
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        ...
    def embed(self, documents: Iterable[str]) -> Iterable[list[float]]:
        ...
Sources: fastembed/text/text_embedding.py
OnnxTextEmbedding
The primary implementation class that leverages ONNX runtime for inference.
class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[NumpyArray]):
    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    ): ...
#### Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"BAAI/bge-small-en-v1.5"` | Name of the model to use |
| `cache_dir` | `str \| None` | `None` | Cache directory path |
| `threads` | `int \| None` | `None` | Number of threads for ONNX |
| `providers` | `Sequence[OnnxProvider] \| None` | `None` | ONNX execution providers |
| `cuda` | `bool \| Device` | `Device.AUTO` | CUDA acceleration |
| `device_ids` | `list[int] \| None` | `None` | GPU device IDs |
| `lazy_load` | `bool` | `False` | Load model on first use |
Sources: fastembed/text/onnx_embedding.py:200-250
Pooling Strategies
The module implements multiple pooling strategies to aggregate token-level embeddings into sentence-level embeddings.
Mean Pooling
Mean pooling computes the average of all token embeddings weighted by the attention mask.
def mean_pooling(self, embeddings: NumpyArray, attention_mask: NumpyArray) -> NumpyArray:
    # Expand attention mask to broadcast
    attention_mask_expanded = np.expand_dims(attention_mask, -1)
    # Sum embeddings where mask is active
    sum_embeddings = np.sum(embeddings * attention_mask_expanded, axis=1)
    # Count valid tokens
    counts = np.sum(attention_mask_expanded, axis=1)
    # Return mean
    return sum_embeddings / counts
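A small worked example shows why the mask matters. This is a dependency-free restatement for a single sequence, with the hypothetical name `mean_pool`; the method above operates on whole batches with NumPy.

```python
def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            summed = [s + v for s, v in zip(summed, vec)]
    count = sum(attention_mask)
    return [s / count for s in summed]

# Three token vectors; the last one is padding and must not affect the mean.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
```

Without the mask the padding row would pull the average far from the two real tokens, so padded batches would yield different embeddings for the same text.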
PooledEmbedding
Applies mean pooling after ONNX inference to generate document embeddings.
class PooledEmbedding(OnnxTextEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return self.mean_pooling(embeddings, attn_mask)
PooledNormalizedEmbedding
Extends pooling with L2 normalization for cosine similarity optimization.
class PooledNormalizedEmbedding(PooledEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return normalize(self.mean_pooling(embeddings, attn_mask))
Sources: fastembed/text/pooled_embedding.py, fastembed/text/pooled_normalized_embedding.py
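The reason L2 normalization helps with cosine similarity is that, for unit-length vectors, the plain dot product equals the cosine. A short sketch in plain Python, with `l2_normalize` as a hypothetical stand-in for the library's `normalize`:

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
a, b = l2_normalize(raw_a), l2_normalize(raw_b)

# Cosine computed the long way on the raw vectors...
cosine_raw = dot(raw_a, raw_b) / (math.sqrt(dot(raw_a, raw_a)) * math.sqrt(dot(raw_b, raw_b)))
# ...matches the plain dot product of the normalized vectors.
dot_normalized = dot(a, b)
```

Storing normalized vectors therefore lets a vector index use the cheaper dot product while still ranking by cosine similarity.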
CLIP Text Embeddings
The CLIP embedding implementation provides multimodal text/image embedding capabilities.
Supported CLIP Models
| Model | Dimension | License | Description |
|---|---|---|---|
| `Qdrant/clip-ViT-B-32-text` | 512 | MIT | CLIP ViT-B/32 text encoder |
| `jinaai/jina-clip-v1` | 768 | Apache 2.0 | Jina CLIP multimodal |
CLIPOnnxEmbedding
class CLIPOnnxEmbedding(OnnxTextEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        return output.model_output  # Direct passthrough, no pooling
Sources: fastembed/text/clip_embedding.py
Inference Workflow
graph LR
A[Input Text] --> B[Tokenization]
B --> C[ONNX Inference]
C --> D{Embedding Type?}
D -->|Standard| E[Direct Output]
D -->|Pooled| F[Mean Pooling]
D -->|Pooled Norm| G[Mean Pooling + Normalize]
E --> H[Final Embeddings]
F --> H
G --> H
Usage Examples
Basic Dense Embedding
from fastembed import TextEmbedding
model = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")
documents = [
"The quick brown fox jumps over the lazy dog",
"A journey of a thousand miles begins with a single step"
]
embeddings = list(model.embed(documents))
# Returns: list of 768-dimensional embedding vectors
Pooled Normalized Embedding
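from fastembed.text.pooled_normalized_embedding import PooledNormalizedEmbedding
# The model must appear in PooledNormalizedEmbedding's own supported-model list
model = PooledNormalizedEmbedding(model_name="BAAI/bge-base-en-v1.5")
documents = ["The quick brown fox jumps over the lazy dog"]
embeddings = list(model.embed(documents))
# Returns: L2-normalized pooled embeddings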
With Custom ONNX Providers
from fastembed import TextEmbedding
model = TextEmbedding(
    model_name="BAAI/bge-large-en-v1.5",
    providers=["CPUExecutionProvider"],  # providers are passed by name
    threads=8,
)
Multilingual E5 with Prefix
from fastembed import TextEmbedding
model = TextEmbedding(model_name="intfloat/multilingual-e5-small")
# E5 models expect a "query: " prefix on queries and "passage: " on documents
user_query = "How do I generate embeddings?"
document = "FastEmbed generates embeddings with ONNX Runtime."
query_text = "query: " + user_query
document_text = "passage: " + document
query_embedding = list(model.embed([query_text]))
doc_embedding = list(model.embed([document_text]))
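When many texts need prefixing, a small helper keeps the two roles from being mixed up. `with_prefix` is a hypothetical convenience function, not part of FastEmbed, and the prefix strings are the ones used in the example above.

```python
PREFIXES = {"query": "query: ", "passage": "passage: "}

def with_prefix(texts: list[str], kind: str) -> list[str]:
    """Prepend the E5-style prefix for the given role to every text."""
    if kind not in PREFIXES:
        raise ValueError(f"kind must be one of {sorted(PREFIXES)}, got {kind!r}")
    return [PREFIXES[kind] + text for text in texts]

queries = with_prefix(["what is an embedding?"], "query")
passages = with_prefix(["An embedding is a vector.", "ONNX is a model format."], "passage")
```

The prefixed lists can then be passed to `model.embed(...)` in place of the raw texts.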
Model Sources Configuration
Models can be loaded from multiple sources:
from fastembed.common import ModelSource
sources=ModelSource(
hf="xenova/jina-embeddings-v2-base-en", # HuggingFace Hub
url="https://storage.googleapis.com/...", # Direct URL
_deprecated_tar_struct=True # Legacy format
)
| Source Type | Priority | Description |
|---|---|---|
| `hf` | Primary | HuggingFace Hub repository |
| `url` | Fallback | Direct download URL |
| Local cache | Cached | Previously downloaded files |
Post-Processing Pipeline
graph TD
subgraph "ONNX Output"
A[model_output] --> B[attention_mask]
end
subgraph "Post-Processing"
B --> C{Masking Required?}
A --> C
C -->|Yes| D[Apply Attention Mask]
D --> E{Normalization?}
C -->|No| E
E -->|Yes| F[L2 Normalize]
E -->|No| G[Return Raw]
F --> H[Final Output]
G --> H
end
Configuration Constants
| Parameter | Default | Description |
|---|---|---|
| `default_model` | `"BAAI/bge-small-en-v1.5"` | Fallback model |
| `pooling_type` | `PoolingType.MEAN` | Pooling strategy |
| `normalization` | `True` | L2 normalization flag |
Sources: fastembed/text/onnx_text_model.py
Integration with Qdrant
FastEmbed text embeddings can be uploaded directly to the Qdrant vector database:
from fastembed import TextEmbedding
import qdrant_client
model = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")
embeddings = list(model.embed(documents))
# Upload to Qdrant
client = qdrant_client.QdrantClient()
client.upsert(
collection_name="text_embeddings",
points=[...]
)
Performance Considerations
Token Truncation Limits
| Model Type | Max Tokens | Notes |
|---|---|---|
| BGE Small/Large | 512 | Standard context |
| Jina v2 | 8192 | Extended context |
| Multilingual E5 | 512 | Query-optimized |
| Arctic Embed | 512-2048 | Variable by model |
Hardware Acceleration
The module automatically detects and utilizes available hardware:
- CUDA - GPU acceleration via CUDAExecutionProvider
- CPU - Multi-threaded via CPUExecutionProvider
- CoreML - Apple Silicon support via CoreMLExecutionProvider
Sources: fastembed/text/onnx_embedding.py:250-300
Error Handling
Common Error Cases
# ValueError: attention_mask must be provided for pooled embeddings
model = PooledNormalizedEmbedding(...)
# Must ensure model outputs attention_mask
# ModelNotSupportedError: Unknown model
model = TextEmbedding(model_name="unknown/model")
# Falls back to default model or raises error
Validation Requirements
| Check | Condition | Error Type |
|---|---|---|
| `attention_mask` | Required for pooled | ValueError |
| `model_name` | Must be in supported list | ModelNotSupportedError |
| `cache_dir` | Valid path | OSError |
See Also
- Sparse Text Embeddings - SPLADE++ and BM25
- Image Embeddings - Vision model embeddings
- Reranking Models - Cross-encoder reranking
- Post-Processors - MUVERA and other processors
Image Embedding Module
Related topics: System Architecture, Late Interaction Models
The Image Embedding Module in FastEmbed provides functionality for generating vector representations (embeddings) from images using ONNX-based models. This module enables efficient image similarity search, image clustering, and multimodal retrieval applications.
Architecture Overview
The image embedding system follows a layered architecture pattern, separating the model definitions, ONNX inference logic, and embedding base classes.
graph TD
A[Image Input] --> B[ImageEmbeddingBase]
B --> C[OnnxImageModel]
C --> D[OnnxImageEmbedding]
D --> E[ONNX Runtime]
E --> F[Embedding Vectors]
G[Supported Models] --> D
G -.-> H[ResNet50]
G -.-> I[Unicom-ViT-B-16]
G -.-> J[Unicom-ViT-B-32]
G -.-> K[jinaai/jina-clip-v1]
Supported Image Models
The module supports multiple image embedding models with varying dimensions and capabilities.
| Model | Dimension | Type | License | Size (GB) | HF Source |
|---|---|---|---|---|---|
| `Qdrant/resnet50-onnx` | 2048 | Image only | apache-2.0 | 0.10 | link |
| `Qdrant/Unicom-ViT-B-16` | 768 | Multimodal (text&image) | apache-2.0 | 0.82 | link |
| `Qdrant/Unicom-ViT-B-32` | 512 | Multimodal (text&image) | apache-2.0 | 0.48 | link |
| `jinaai/jina-clip-v1` | 768 | Multimodal (text&image) | apache-2.0 | 0.34 | link |
Sources: fastembed/image/onnx_embedding.py:1-40
Core Classes
OnnxImageEmbedding
The main class for generating image embeddings extends ImageEmbeddingBase and OnnxImageModel[NumpyArray].
class OnnxImageEmbedding(ImageEmbeddingBase, OnnxImageModel[NumpyArray]):
Class Hierarchy:
graph LR
A[TextEmbeddingBase] -->|inheritance| B[ImageEmbeddingBase]
C[OnnxTextModel] -->|generic| D[OnnxImageModel]
E[OnnxTextEmbedding] -->|reuse pattern| F[OnnxImageEmbedding]
The class follows the same architectural pattern as OnnxTextEmbedding, sharing the ONNX inference infrastructure with text embedding models.
Sources: fastembed/image/onnx_embedding.py:44-50
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"Qdrant/clip-ViT-B-32"` | Name of the model to use |
| `cache_dir` | `str \| None` | `None` | Path to cache directory |
| `threads` | `int \| None` | `None` | Number of threads for inference |
| `providers` | `Sequence[OnnxProvider] \| None` | `None` | ONNX execution providers |
| `cuda` | `bool \| Device` | `Device.AUTO` | CUDA device configuration |
| `device_ids` | `list[int] \| None` | `None` | Specific device IDs |
| `lazy_load` | `bool` | `False` | Load model lazily |
| `device_id` | `int \| None` | `None` | Specific device ID |
Multimodal Image Models
CLIP-based Models
The Unicom-ViT-B-16 and Unicom-ViT-B-32 models support both image and text inputs, enabling cross-modal retrieval scenarios.
| Model | Vision Dimension | Description |
|---|---|---|
| Unicom-ViT-B-16 | 768 | More detailed embeddings (16x16 patches) |
| Unicom-ViT-B-32 | 512 | Faster processing (32x32 patches) |
jina-clip-v1
The jinaai/jina-clip-v1 model is a 2024 multimodal model supporting both text and image inputs:
DenseModelDescription(
    model="jinaai/jina-clip-v1",
    dim=768,
    description="Image embeddings, Multimodal (text&image), 2024 year",
    license="apache-2.0",
    size_in_GB=0.34,
    sources=ModelSource(hf="jinaai/jina-clip-v1"),
    model_file="onnx/vision_model.onnx",
),
Sources: fastembed/image/onnx_embedding.py:20-28
Model Loading Workflow
sequenceDiagram
participant User
participant OnnxImageEmbedding
participant OnnxImageModel
participant ONNX Runtime
participant HuggingFace
User->>OnnxImageEmbedding: __init__(model_name)
OnnxImageEmbedding->>OnnxImageModel: Load model from cache/HF
OnnxImageModel->>HuggingFace: Download if needed
OnnxImageModel->>ONNX Runtime: Initialize session
ONNX Runtime-->>OnnxImageModel: Session ready
OnnxImageModel-->>OnnxImageEmbedding: Model loaded
User->>OnnxImageEmbedding: embed(image)
OnnxImageEmbedding->>ONNX Runtime: Run inference
ONNX Runtime-->>User: Embedding vector
Usage Examples
Basic Image Embedding
from fastembed import ImageEmbedding
model = ImageEmbedding(model_name="Qdrant/Unicom-ViT-B-32")
embeddings = list(model.embed(["path/to/image.jpg"]))
CLIP Text-Image Retrieval
For multimodal models like jina-clip-v1, you can perform cross-modal retrieval:
from fastembed import ImageEmbedding, TextEmbedding
image_model = ImageEmbedding(model_name="jinaai/jina-clip-v1")
text_model = TextEmbedding(model_name="jinaai/jina-clip-v1")
# Generate embeddings for both modalities
image_emb = list(image_model.embed(["image_path.jpg"]))
text_emb = list(text_model.embed(["search query"]))
# Compute similarity
from numpy import dot
similarity = dot(image_emb[0], text_emb[0])
Integration with Qdrant
Image embeddings generated by this module can be stored in the Qdrant vector database:
# Creating a Qdrant collection with image embeddings
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="images",
    vectors_config={
        "image": VectorParams(
            size=768,  # For Unicom-ViT-B-16 or jina-clip-v1
            distance=Distance.COSINE,
        )
    },
)
Supported Model Files
Each model specifies its ONNX model file location:
| Model | Model File | Additional Files |
|---|---|---|
| ResNet50 | model.onnx | None |
| Unicom-ViT-B-16 | model.onnx | None |
| Unicom-ViT-B-32 | model.onnx | None |
| jina-clip-v1 | onnx/vision_model.onnx | onnx/text_model.onnx |
Sources: fastembed/image/onnx_embedding.py:10-28
Relationship with Text Embedding Module
The Image Embedding Module shares significant implementation with the Text Embedding Module:
graph TD
A[OnnxTextModel] -->|shared base| B[OnnxImageModel]
A -->|shared base| C[OnnxTextEmbedding]
B -->|shared base| D[OnnxImageEmbedding]
E[supported_onnx_models] -->|text models| C
F[supported_image_models] -->|image models| D
G[CLIPEmbeddingWorker] -.->|reused| H[OnnxTextEmbeddingWorker]
This design ensures consistent ONNX inference behavior and reduces code duplication across embedding types.
Sources: fastembed/text/onnx_embedding.py
Performance Considerations
| Model | Size (GB) | Embedding Dim | Use Case |
|---|---|---|---|
| ResNet50 | 0.10 | 2048 | Fast, small model file |
| Unicom-ViT-B-32 | 0.48 | 512 | Balanced speed/quality |
| Unicom-ViT-B-16 | 0.82 | 768 | Higher quality, slower |
| jina-clip-v1 | 0.34 | 768 | Multimodal, 2024 model |
Configuration Options
The module inherits configuration capabilities from the base classes:
# Example with full configuration
from fastembed.image.onnx_embedding import OnnxImageEmbedding
model = OnnxImageEmbedding(
    model_name="Qdrant/Unicom-ViT-B-16",
    cache_dir="~/.cache/fastembed",
    providers=["CPUExecutionProvider"],  # or "CUDAExecutionProvider"
    threads=4,
    lazy_load=True,
)
Summary
The Image Embedding Module provides:
- 4 supported image models ranging from lightweight to high-quality
- Multimodal support via CLIP-based models for text-image retrieval
- ONNX runtime optimization for efficient inference
- Qdrant integration for vector storage and similarity search
- Consistent API with the text embedding module
Sparse Embedding Models
Related topics: Text Embedding Module, System Architecture
Overview
Sparse embedding models in FastEmbed represent text as high-dimensional sparse vectors where most dimensions have zero values. Unlike dense embeddings where every dimension contributes to meaning, sparse embeddings only store non-zero values for specific tokens or features. This approach combines semantic understanding with exact keyword matching capabilities.
The sparse representation consists of:
- Indices: Token identifiers in the vocabulary
- Values: Weight/scores representing token importance
# Example sparse embedding output
SparseEmbedding(indices=[17, 123, 919, ...], values=[0.71, 0.22, 0.39, ...])
Sources: fastembed/sparse/sparse_text_embedding.py:1-50
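Because only non-zero entries are stored, comparing two sparse vectors means matching their shared indices. A minimal sketch, assuming the indices/values layout shown above; `sparse_dot` and the sample numbers are illustrative.

```python
def sparse_dot(indices_a: list[int], values_a: list[float],
               indices_b: list[int], values_b: list[float]) -> float:
    """Dot product of two sparse vectors: only shared indices contribute."""
    weights_b = dict(zip(indices_b, values_b))
    return sum(v * weights_b.get(i, 0.0) for i, v in zip(indices_a, values_a))

query_indices, query_values = [17, 919], [1.0, 0.5]
doc_indices, doc_values = [17, 123, 919], [0.71, 0.22, 0.39]

# Index 123 appears only in the document, so it adds nothing:
# 1.0 * 0.71 + 0.5 * 0.39
score = sparse_dot(query_indices, query_values, doc_indices, doc_values)
```

The cost depends on the number of non-zero entries rather than on the vocabulary size, which is what makes very high-dimensional sparse vectors practical.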
Architecture
Class Hierarchy
graph TD
A[SparseTextEmbeddingBase] --> B[MiniCOIL]
A --> C[Bm25]
A --> D[BM42]
A --> E[SPLADE++]
F[OnnxTextModel<br/>SparseEmbedding] --> A
G[OnnxTextEmbeddingWorker] --> F
The sparse embedding system is built on a base class SparseTextEmbeddingBase that extends OnnxTextModel[SparseEmbedding], providing a unified interface for all sparse embedding implementations.
Sources: fastembed/sparse/minicoil.py:30-50
Supported Models
| Model | Type | Language | Size | License | Requires IDF |
|---|---|---|---|---|---|
| SPLADE++ | Sparse/SPLADE | English | 0.22 GB | apache-2.0 | Yes |
| BM25 | Traditional BM25 | Multi-language | 0.01 GB | apache-2.0 | Yes |
| BM42 | Hybrid BM25+Attention | English | 0.04 GB | apache-2.0 | Yes |
| MiniCOIL | Semantic + Keyword | English | 0.09 GB | apache-2.0 | Yes |
Sources: fastembed/sparse/bm25.py:1-30
Core Components
SparseTextEmbeddingBase
The base class for all sparse embedding models defines the common interface and behavior:
class SparseTextEmbeddingBase(OnnxTextModel[SparseEmbedding]):
    """Base class for sparse text embedding models"""
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        """Returns list of supported sparse models"""
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[SparseEmbedding]:
        """Post-process ONNX model output to sparse format"""
Sources: fastembed/sparse/sparse_text_embedding.py:1-100
SparseEmbedding Data Model
The SparseEmbedding class represents sparse vectors with two primary attributes:
| Attribute | Type | Description |
|---|---|---|
| `indices` | `list[int]` | Vocabulary token IDs with non-zero values |
| `values` | `list[float]` | Corresponding importance weights for each index |
Sources: fastembed/sparse/sparse_text_embedding.py:100-150
Implementation Details
BM25
BM25 (Best Matching 25) is a traditional sparse embedding model that evaluates token importance based on term frequency and inverse document frequency.
Formula:
score(q, d) = SUM[ IDF(q_i) * (f(q_i, d) * (k + 1)) / (f(q_i, d) + k * (1 - b + b * (|d| / avg_len))) ]
| Parameter | Default | Description |
|---|---|---|
| `k` | 1.5 | Term frequency saturation parameter |
| `b` | 0.75 | Length normalization parameter |
| `avg_len` | Computed | Average document length |
WARNING: BM25 is expected to be used with modifier="idf" in the sparse vector index of Qdrant.
Sources: fastembed/sparse/bm25.py:30-80
MiniCOIL
MiniCOIL is a sparse embedding model that combines semantic meaning resolution with exact keyword matching behavior.
Key Characteristics:
- Converts vocabulary tokens into 4-dimensional components of sparse vectors
- Weights tokens by their frequency in the corpus
- Falls back to BM25-like behavior for out-of-vocabulary tokens
class MiniCOIL(SparseTextEmbeddingBase, OnnxTextModel[SparseEmbedding]):
    """
    MiniCOIL resolves semantic meaning while keeping exact keyword match behavior.
    Each vocabulary token is converted into 4d component of a sparse vector.
    """
Sources: fastembed/sparse/minicoil.py:30-55
BM42
BM42 extends traditional BM25 by incorporating attention weights from transformer models, creating a hybrid sparse representation.
Sources: fastembed/sparse/bm42.py:1-50
SPLADE++
SPLADE++ (SParse Lexical AnD expRessive model) uses a sparse expansion approach where each token can expand to related terms in the vocabulary, enabling semantic matching while maintaining interpretability.
Sources: fastembed/sparse/splade_pp.py:1-50
Data Flow
graph LR
A[Input Text] --> B[Tokenization]
B --> C[ONNX Model Inference]
C --> D[ONNX Output Context]
D --> E[Post-Processing]
E --> F[SparseEmbedding]
G[Vocabulary] --> C
G --> H[SparseVectorsConverter]
H --> E
SparseVectorsConverter Utility
The SparseVectorsConverter class handles conversion between different sparse vector formats, particularly for MiniCOIL's word embeddings.
Key Operations:
- Converts sentence embeddings to Qdrant sparse vector format
- Handles out-of-vocabulary (OOV) words with fallback to BM25
- Manages vocabulary word embeddings with 4-dimensional components
# Example input structure
{
"vector": WordEmbedding({
"word": "vector",
"forms": ["vector", "vectors"],
"count": 2,
"word_id": 1231,
"embedding": [0.1, 0.2, 0.3, 0.4]
}),
"axiotic": WordEmbedding({ # OOV word
"word": "axiotic",
"forms": ["axiotics"],
"count": 1,
"word_id": -1,
})
}
Sources: fastembed/sparse/utils/sparse_vectors_converter.py:50-100
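The conversion described above can be sketched as follows. Everything here is an assumption made for illustration: the index layout (a block of 4 consecutive indices per vocabulary word, one hashed index per OOV word), the plain count-based weighting, and the names `word_to_sparse`, `vocab_size`, and `oov_buckets` are hypothetical and not taken from `SparseVectorsConverter`.

```python
import zlib
from typing import Optional

EMBEDDING_SIZE = 4  # each vocabulary word contributes 4 sparse dimensions


def word_to_sparse(
    word: str,
    word_id: int,
    count: int,
    embedding: Optional[list[float]],
    vocab_size: int,
    oov_buckets: int = 1 << 20,
) -> dict[int, float]:
    """Hypothetical layout: vocabulary words get a 4-slot block, OOV words one hashed slot."""
    if word_id >= 0 and embedding is not None:
        base = word_id * EMBEDDING_SIZE
        # Scale each component by the word count (a simplification of frequency weighting).
        return {base + i: count * x for i, x in enumerate(embedding)}
    # OOV fallback: a single keyword-style dimension placed after all vocabulary blocks.
    oov_start = vocab_size * EMBEDDING_SIZE
    slot = zlib.crc32(word.encode("utf-8")) % oov_buckets
    return {oov_start + slot: float(count)}


in_vocab = word_to_sparse("vector", 1231, 2, [0.1, 0.2, 0.3, 0.4], vocab_size=30_000)
oov = word_to_sparse("axiotic", -1, 1, None, vocab_size=30_000)
```

Keeping OOV indices in a range beyond the vocabulary blocks means an unknown word can never collide with a learned component, which is one way to get the BM25-like fallback the text mentions.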
Usage Examples
Basic Usage
from fastembed import SparseTextEmbedding
# Initialize with default SPLADE++ model
model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
# Generate sparse embeddings
documents = ["Example text for embedding", "Another document"]
embeddings = list(model.embed(documents))
# Output: [SparseEmbedding(indices=[17, 123, 919, ...], values=[0.71, 0.22, 0.39, ...])]
BM25 Usage
from fastembed import SparseTextEmbedding
model = SparseTextEmbedding(model_name="Qdrant/bm25")
documents = ["Example text for embedding", "Another document"]
embeddings = list(model.embed(documents))
With Qdrant
Sparse embeddings map onto Qdrant sparse vectors as index/value pairs. For models that rely on IDF weighting (BM25, BM42, MiniCOIL), configure the sparse vector index with modifier="idf":
from qdrant_client import QdrantClient
from fastembed import SparseTextEmbedding
client = QdrantClient()
model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
# Embedding will have indices and values compatible with Qdrant sparse vectors
Sources: README.md:1-100
Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"prithivida/Splade_PP_en_v1"` | Name of the sparse embedding model |
| `cache_dir` | `str \| None` | `None` | Cache directory path for model files |
| `threads` | `int \| None` | `None` | Number of threads for inference |
| `providers` | `Sequence[OnnxProvider] \| None` | `None` | ONNX execution providers |
| `lazy_load` | `bool` | `False` | Whether to load model lazily |
| `device_id` | `int \| None` | `None` | Specific device ID for execution |
Language Support
| Model | Languages |
|---|---|
| SPLADE++ | English only |
| BM25 | 15+ languages |
| BM42 | English |
| MiniCOIL | English |
Sources: fastembed/sparse/bm25.py:1-25
Requirements
BM25, BM42, and MiniCOIL require IDF (Inverse Document Frequency) weighting to work as intended; SPLADE++ produces learned term weights and does not. IDF is typically applied by the vector database (e.g., Qdrant) during indexing and search.
Source: https://github.com/qdrant/fastembed / Human Manual
Late Interaction Models
Related topics: Image Embedding Module, System Architecture, Text Embedding Module
Overview
Late interaction models depart from single-vector dense representations. Instead of compressing an entire document into one embedding vector, they preserve token-level embeddings and defer the similarity computation until query time. This enables token-to-token interactions between queries and documents, which can improve retrieval precision on complex semantic matching tasks.
The FastEmbed library implements two categories of late interaction models:
| Category | Scope | Use Case |
|---|---|---|
| Text-based | Query-document text matching | Semantic search, question answering |
| Multimodal | Text + image joint embedding | Visual document retrieval, image search |
Architecture
Late Interaction vs Traditional Dense Retrieval
graph TD
subgraph "Traditional Dense Retrieval"
A1[Query Text] --> B1[Encoder]
D1[Document] --> E1[Encoder]
B1 --> C1[Single Query Vector]
E1 --> F1[Single Document Vector]
C1 --> G1[Dot Product / Cosine Similarity]
F1 --> G1
end
subgraph "Late Interaction Retrieval"
A2[Query Text] --> B2[Encoder]
D2[Document] --> E2[Encoder]
B2 --> C2[Query Token Embeddings]
E2 --> F2[Document Token Embeddings]
C2 --> G2[Late Interaction Module]
F2 --> G2
G2 --> H2[Max-Sum Similarity]
end
style C1 fill:#ffcccc
style F1 fill:#ffcccc
style C2 fill:#ccffcc
style F2 fill:#ccffcc
ColBERT Late Interaction Mechanism
The ColBERT model uses a max-similarity strategy: each query token independently finds its most similar document token, and relevance is the sum of these maximum similarities:
graph LR
Q1[Query: "q1 q2 q3"] --> QE[Query Encoder]
D1[Doc: "d1 d2"] --> DE[Document Encoder]
QE --> QT["Q_emb: [v₁, v₂, v₃]"]
DE --> DT["D_emb: [u₁, u₂]"]
QT --> MM[Similarity Matrix]
DT --> MM
MM --> MS[Max Similarity<br/>sim(qᵢ) = maxⱼ S(vᵢ, uⱼ)]
MS --> SR[Score = Σᵢ sim(qᵢ)]
Supported Models
Text-based Late Interaction Models
| Model | Dimension | Context Length | Multilingual | License | Size |
|---|---|---|---|---|---|
| `jinaai/jina-colbert-v2` | 128 | 8192 | Yes | cc-by-nc-4.0 | 2.24 GB |
Multimodal Late Interaction Models
| Model | Modality | Description |
|---|---|---|
| `vidore/colpali` | Text + Image | Vision-Language late interaction |
| `vidore/colqwen2` | Text + Image | Qwen2-based multimodal |
| `vidore/colmodernvbert` | Text + Image | Modern vision backbone |
Base Classes
LateInteractionTextEmbedding
Abstract base class for text-based late interaction models.
class LateInteractionTextEmbedding(TextEmbeddingBase):
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        ...

    @classmethod
    def _get_worker_class(cls) -> Type[OnnxTextEmbeddingWorker]:
        ...
Sources: late_interaction_text_embedding.py:1-50
Colbert Base Class
The Colbert class implements the core late interaction logic:
class Colbert(LateInteractionTextEmbedding):
    QUERY_MARKER_TOKEN_ID: int = 1  # Marker token inserted into queries
    DOCUMENT_MARKER_TOKEN_ID: int = 2  # Marker token inserted into documents
    MIN_QUERY_LENGTH: int = 32
    MASK_TOKEN: str = "[MASK]"
Sources: colbert.py:20-60
Key methods:
| Method | Purpose |
|---|---|
| `query_embed()` | Encode queries, yielding one token-embedding matrix per query |
| `embed()` / `passage_embed()` | Encode documents, yielding one token-embedding matrix per document |
The late interaction similarity itself is computed by the caller (or by the vector database) from these token-embedding matrices.
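The max-similarity scoring used by late interaction models can be sketched in a few lines of plain Python. The names `dot` and `maxsim_score` are hypothetical helpers written for this illustration; a real pipeline would normally use NumPy arrays or let the vector database compute the score.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.5, 0.5], [1.0, 0.0]]
score = maxsim_score(query_tokens, doc_tokens)
```

Here the first query token matches the second document token best (similarity 1.0) and the second query token matches the first document token best (similarity 0.5), so the score is 1.5.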
Jina Colbert Implementation
The JinaColbert class extends the base Colbert with model-specific configuration:
class JinaColbert(Colbert):
    QUERY_MARKER_TOKEN_ID = 250002
    DOCUMENT_MARKER_TOKEN_ID = 250003
    MIN_QUERY_LENGTH = 31  # 32 minus 1 for special token
    MASK_TOKEN = "<mask>"
Sources: jina_colbert.py:15-19
#### Model Configuration
| Parameter | Value | Description |
|---|---|---|
| `model` | `jinaai/jina-colbert-v2` | HuggingFace model identifier |
| `dim` | 128 | Token embedding dimension |
| `size_in_GB` | 2.24 | Model size |
| `context_length` | 8192 | Maximum input tokens |
| `license` | cc-by-nc-4.0 | Model license |
Multimodal Late Interaction
ColPali Model
ColPali extends the late interaction paradigm to visual documents by treating images as sequences of patches:
class ColPali(MultimodalTextImageBase, OnnxMultimodalModel[MultivectorEmbedding]):
@classmethod
def _list_supported_models(cls) -> list[DenseModelDescription]:
return supported_colpali_models
Sources: colpali.py:1-50
ColModernVBert Model
ColModernVBert provides an alternative vision-language architecture with a modern backbone:
class ColModernVBert(MultimodalTextImageBase, OnnxMultimodalModel[MultivectorEmbedding]):
QUERY_MARKER_TOKEN_ID = 0
DOCUMENT_MARKER_TOKEN_ID = 1
MIN_QUERY_LENGTH = 32
MASK_TOKEN = "<mask>"
Sources: colmodernvbert.py:1-50
Architecture: Multimodal Late Interaction
graph TD
subgraph "Query Processing"
QT[Text Query] --> QE[Text Encoder]
QV[Query Image] --> QP[Patch Extraction]
QP --> QI[Query Image Embeddings]
QE --> QT_emb[Query Token Embeddings]
end
subgraph "Document Processing"
DT[Document Text] --> DE[Text Encoder]
DV[Document Image] --> DP[Patch Extraction]
DP --> DI[Doc Image Embeddings]
DE --> DT_emb[Document Token Embeddings]
end
subgraph "Late Interaction"
QT_emb --> LI[Interaction Module]
DT_emb --> LI
QI --> LI
DI --> LI
LI --> SM[Similarity Matrix]
SM --> MS[Max-Sum Pooling]
end
MS --> SC[Relevance Score]
Usage Examples
Text-based Late Interaction
import numpy as np
from fastembed import LateInteractionTextEmbedding
# Initialize the model
model = LateInteractionTextEmbedding(
    model_name="jinaai/jina-colbert-v2"
)
# Encode query and document separately; both calls yield (num_tokens, dim) arrays
query_embedding = next(iter(model.query_embed("What is machine learning?")))
doc_embedding = next(iter(model.embed(["Machine learning is a subset of AI..."])))
# Compute the late interaction (MaxSim) score: best document token per query token, summed
score = float(np.sum(np.max(query_embedding @ doc_embedding.T, axis=1)))
Multimodal Late Interaction
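from fastembed import LateInteractionMultimodalEmbedding
# Model identifier as published in the FastEmbed README; check list_supported_models() for your version
model = LateInteractionMultimodalEmbedding(
    model_name="Qdrant/colpali-v1.3-fp16"
)
# Encode document page images (file paths) and a text query separately
image_embeddings = list(model.embed_image(["page_1.png"]))
query_embeddings = list(model.embed_text(["Find charts about revenue"]))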
Configuration Options
Common Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | Required | Model identifier |
| `cache_dir` | `str` | `None` | Local cache directory |
| `threads` | `int` | `None` | CPU threads for inference |
| `providers` | `Sequence[OnnxProvider]` | `None` | ONNX execution providers |
| `lazy_load` | `bool` | `False` | Defer model loading until first use |
| `device_id` | `int` | `None` | Specific device index |
ONNX Providers
| Provider | Description | Priority |
|---|---|---|
| `CPUExecutionProvider` | CPU inference | Default fallback |
| `CUDAExecutionProvider` | NVIDIA GPU | Preferred for speed |
| `CoreMLExecutionProvider` | Apple Silicon | Mobile/iOS |
Post-processing: Muvera
Muvera is a post-processing technique that converts late interaction embeddings (multi-vector) into fixed-dimensional representations:
from fastembed.postprocess import Muvera
# Convert from multi-vector to fixed-dim
muvera = Muvera.from_multivector_model(
model=late_interaction_model,
k_sim=6,
dim_proj=32
)
# Process document
fde = muvera.process_document(multivector_embedding)
Sources: postprocess/muvera.py:1-100
Muvera Configuration
| Parameter | Description | Impact |
|---|---|---|
| `k_sim` | Log₂ of number of buckets | Memory vs precision |
| `dim_proj` | Projection dimension | Output size |
| `r_reps` | Number of repetitions | Robustness |
| `random_seed` | Random seed | Reproducibility |
Output dimension formula: r_reps × 2^k_sim × dim_proj
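As a quick check of the output dimension formula, the sketch below evaluates it for the `k_sim=6`, `dim_proj=32` configuration shown earlier. The helper name `fde_dimension` is hypothetical, and `r_reps=20` is an assumed example value rather than a documented default.

```python
def fde_dimension(k_sim: int, dim_proj: int, r_reps: int) -> int:
    """Output size of the fixed-dimensional encoding: r_reps * 2^k_sim * dim_proj."""
    return r_reps * (2 ** k_sim) * dim_proj


# 2^6 = 64 buckets, each holding a 32-dimensional projection, repeated 20 times.
dim = fde_dimension(k_sim=6, dim_proj=32, r_reps=20)
```

Each increment of `k_sim` doubles the output size, so it is the parameter to watch when storage is tight.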
Performance Considerations
Token Length Limits
| Model | Minimum Query Length | Document Limit |
|---|---|---|
| jina-colbert-v2 | 31 tokens (shorter queries are padded with the mask token) | 8192 tokens |
| colpali | Variable | Variable |
Memory Usage
Late interaction models store per-token embeddings rather than single vectors:
- Traditional dense: N × D memory (N documents, D dimension)
- Late interaction: N × T × D memory (T = avg token count)
This trade-off enables better precision at the cost of increased memory footprint.
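The difference is easiest to see with numbers. The sketch below compares raw vector storage for the two approaches; the corpus size, token count, dimensions, and the 4-byte float assumption are illustrative values, not measurements, and the function names are hypothetical.

```python
def dense_bytes(n_docs: int, dim: int, bytes_per_value: int = 4) -> int:
    """Storage for one vector per document (N x D)."""
    return n_docs * dim * bytes_per_value


def late_interaction_bytes(n_docs: int, avg_tokens: int, dim: int, bytes_per_value: int = 4) -> int:
    """Storage for one vector per token (N x T x D)."""
    return n_docs * avg_tokens * dim * bytes_per_value


n_docs = 1_000_000
dense = dense_bytes(n_docs, dim=768)                              # one 768-dim vector per document
late = late_interaction_bytes(n_docs, avg_tokens=200, dim=128)    # 200 tokens x 128 dims per document
ratio = late / dense
```

Under these assumptions the dense index needs about 3.07 GB and the late interaction index about 102.4 GB, roughly 33 times more, even though each token vector is much smaller than the dense vector.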
Comparison with Dense Retrieval
| Aspect | Dense Retrieval | Late Interaction |
|---|---|---|
| Embedding type | Single vector | Token-level vectors |
| Query speed | One vector comparison per document | O(Q × T) token comparisons per document (Q query tokens, T document tokens) |
| Precision | Good for semantic similarity | Excellent for term matching |
| Memory | Lower | Higher |
| Interpretability | Limited | Token-level attribution |
Related Components
| Component | File | Purpose |
|---|---|---|
| `ColbertEmbeddingWorker` | colbert.py | Parallel embedding worker |
| `OnnxMultimodalModel` | onnx_multimodal_model.py | Base for ONNX multimodal |
| `MultivectorEmbedding` | Types | Output type for late interaction |
| `Muvera` | muvera.py | Dimensionality reduction post-process |
References
- Original ColBERT paper: Khattab & Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction"
- Model source: jinaai/jina-colbert-v2
ONNX Model Infrastructure
Related topics: System Architecture, GPU Support and Acceleration
Overview
The ONNX Model Infrastructure is the core runtime layer in FastEmbed that enables efficient execution of embedding models through the ONNX (Open Neural Network Exchange) format. This infrastructure provides a unified abstraction for loading, executing, and post-processing ONNX models across different embedding modalities including text, images, and sparse embeddings.
FastEmbed uses ONNX Runtime for cross-platform, optimized inference without requiring PyTorch or TensorFlow. The architecture keeps model execution in the ONNX runtime layer and embedding-specific logic in the higher-level embedding classes.
Sources: fastembed/text/onnx_embedding.py:1-50
Architecture Overview
The ONNX infrastructure follows a class hierarchy pattern where base classes define the runtime contract and concrete implementations provide modality-specific behavior. The architecture is designed around the following core components:
graph TD
A[ONNX Runtime] --> B[OnnxModel Base]
B --> C[OnnxTextModel]
B --> D[OnnxImageModel]
B --> E[OnnxCrossEncoderModel]
C --> F[OnnxTextEmbedding]
C --> G[PooledEmbedding]
C --> H[CLIPOnnxEmbedding]
C --> I[MiniCOIL]
D --> J[OnnxImageEmbedding]
E --> K[OnnxTextCrossEncoder]
L[TextEmbeddingBase] --> F
M[ImageEmbeddingBase] --> J
N[SparseTextEmbeddingBase] --> I
Core Base Classes
The infrastructure defines three primary base classes that orchestrate ONNX model execution:
| Class | File | Purpose |
|---|---|---|
| `OnnxModel` | fastembed/common/onnx_model.py | Core runtime for ONNX session management |
| `OnnxTextModel` | fastembed/text/onnx_text_model.py | Text-specific ONNX execution |
| `OnnxImageModel` | fastembed/image/onnx_image_model.py | Image-specific ONNX execution |
Sources: fastembed/common/onnx_model.py:1-100
ONNX Session Management
Model Loading
The OnnxModel base class handles the lifecycle of ONNX model loading and execution. When a model is initialized, it performs the following operations:
- Resolves the model file path from cache or downloads from source
- Configures ONNX Runtime session options (threads, providers)
- Creates an inference session with the specified execution providers
- Validates model inputs and outputs
class OnnxModel(Generic[T]):
def __init__(
self,
model_dir: Path,
model_file: str,
threads: int | None = None,
providers: Sequence[OnnxProvider] | None = None,
cuda: bool | Device = Device.AUTO,
device_id: int | None = None,
**kwargs: Any,
):
self._load_onnx_model(
model_dir=model_dir,
model_file=model_file,
threads=threads,
providers=providers,
cuda=cuda,
device_id=device_id,
)
Sources: fastembed/common/onnx_model.py:50-80
Execution Providers
The infrastructure supports multiple ONNX Runtime execution providers for hardware acceleration:
| Provider | Priority | Use Case |
|---|---|---|
| `CUDAExecutionProvider` | GPU acceleration | NVIDIA GPUs |
| `CPUExecutionProvider` | Fallback | CPU inference |
The Device enum provides automatic device selection:
class Device(Enum):
    CPU = "cpu"
    CUDA = "cuda"
    AUTO = "auto"
Sources: fastembed/common/types.py:1-50
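One way to picture automatic device selection is as a mapping from the `cuda` setting to an ordered provider list. The sketch below is an assumption about how such a resolution could work, not FastEmbed's actual logic: `resolve_providers` is a hypothetical helper, the `Device` enum is redefined locally so the example is self-contained, and raising on a missing CUDA provider is a design choice made here for clarity. In a real setup the `available` list would come from `onnxruntime.get_available_providers()`.

```python
from enum import Enum
from typing import Sequence, Union


class Device(Enum):
    CPU = "cpu"
    CUDA = "cuda"
    AUTO = "auto"


def resolve_providers(cuda: Union[bool, Device], available: Sequence[str]) -> list[str]:
    """Hypothetical helper: turn a device setting into an ordered provider list."""
    has_cuda = "CUDAExecutionProvider" in available
    if cuda is True or cuda is Device.CUDA:
        if not has_cuda:
            raise RuntimeError("CUDA was requested but CUDAExecutionProvider is not available")
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    if cuda is Device.AUTO and has_cuda:
        # AUTO prefers the GPU when present and keeps the CPU as a fallback.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]


cpu_only = resolve_providers(Device.AUTO, ["CPUExecutionProvider"])
with_gpu = resolve_providers(Device.AUTO, ["CUDAExecutionProvider", "CPUExecutionProvider"])
```

Listing the CPU provider after the CUDA provider mirrors how ONNX Runtime treats a provider list as an order of preference.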
Text Embedding Infrastructure
OnnxTextEmbedding
The OnnxTextEmbedding class is the primary implementation for text embedding generation. It inherits from both TextEmbeddingBase and OnnxTextModel, combining the ONNX runtime with embedding-specific logic.
class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[NumpyArray]):
"""Implementation of the Flag Embedding model."""
@classmethod
def _list_supported_models(cls) -> list[DenseModelDescription]:
return supported_onnx_models
Sources: fastembed/text/onnx_embedding.py:60-85
#### Supported Models
The text embedding infrastructure includes a comprehensive list of supported models:
| Model | Dimension | License | Size (GB) | Token Limit |
|---|---|---|---|---|
| `BAAI/bge-base-en` | 768 | mit | 0.42 | 512 |
| `BAAI/bge-base-en-v1.5` | 768 | mit | 0.21 | 512 |
| `BAAI/bge-large-en-v1.5` | 1024 | mit | 1.20 | 512 |
| `BAAI/bge-small-en-v1.5` | 384 | mit | 0.067 | 512 |
| `snowflake/snowflake-arctic-embed-m` | 768 | apache-2.0 | 0.43 | 512 |
| `snowflake/snowflake-arctic-embed-m-long` | 768 | apache-2.0 | 0.54 | 2048 |
| `jinaai/jina-clip-v1` | 768 | apache-2.0 | 0.55 | multimodal |
| `mixedbread-ai/mxbai-embed-large-v1` | 1024 | apache-2.0 | 0.64 | 512 |
Sources: fastembed/text/onnx_embedding.py:30-150
Pooled Embedding Variants
The infrastructure provides specialized pooling strategies through inheritance:
#### PooledNormalizedEmbedding
Applies mean pooling over token embeddings followed by L2 normalization:
class PooledNormalizedEmbedding(PooledEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return normalize(self.mean_pooling(embeddings, attn_mask))
Supported models for pooled normalized embeddings include:
- jinaai/jina-embeddings-v2-base-en (768 dim, 8192 tokens)
- jinaai/jina-embeddings-v2-small-en (512 dim, 8192 tokens)
- thenlper/gte-base (768 dim, 512 tokens)
- thenlper/gte-large (1024 dim, 512 tokens)
Sources: fastembed/text/pooled_normalized_embedding.py:50-100
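Mean pooling followed by L2 normalization can be sketched in plain Python for a single sequence. This is an illustration of the technique only: `mean_pool` and `l2_normalize` are hypothetical names, and the library operates on batched NumPy arrays rather than nested lists.

```python
import math


def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average token vectors, ignoring positions where the attention mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    kept = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, x in enumerate(vec):
                total[i] += x
    return [t / max(kept, 1) for t in total]


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]  # the last row is padding
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
embedding = l2_normalize(pooled)
```

The padded third token is excluded by the mask, so the pooled vector is `[3.0, 4.0]` and the normalized embedding has unit length.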
#### PooledEmbedding
Standard pooled embedding with mean pooling over token representations:
class PooledEmbedding(OnnxTextEmbedding):
@classmethod
def _get_worker_class(cls) -> Type[OnnxTextEmbeddingWorker]:
return PooledEmbeddingWorker
Supported multilingual and specialized models:
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (384 dim, ~50 languages)
- sentence-transformers/paraphrase-multilingual-mpnet-base-v2 (768 dim, ~50 languages)
- intfloat/multilingual-e5-large (1024 dim, ~100 languages)
Sources: fastembed/text/pooled_embedding.py:50-150
Image Embedding Infrastructure
OnnxImageEmbedding
Image embedding models inherit from OnnxImageModel and ImageEmbeddingBase:
class OnnxImageEmbedding(ImageEmbeddingBase, OnnxImageModel[NumpyArray]):
def __init__(self, model_name: str, cache_dir: str | None = None, ...):
...
Supported image models:
| Model | Dimension | License | Size (GB) |
|---|---|---|---|
| `Qdrant/resnet50-onnx` | 2048 | apache-2.0 | - |
| `Qdrant/Unicom-ViT-B-16` | 768 | apache-2.0 | 0.82 |
| `Qdrant/Unicom-ViT-B-32` | 512 | apache-2.0 | 0.48 |
| `jinaai/jina-clip-v1` | 768 | apache-2.0 | 0.34 |
Sources: fastembed/image/onnx_embedding.py:30-80
Multimodal Embedding
CLIP Embeddings
The CLIPOnnxEmbedding class provides multimodal (text and image) embedding capabilities:
class CLIPOnnxEmbedding(OnnxTextEmbedding):
@classmethod
def _list_supported_models(cls) -> list[DenseModelDescription]:
return supported_clip_models
def _post_process_onnx_output(
self, output: OnnxOutputContext, **kwargs: Any
) -> Iterable[NumpyArray]:
return output.model_output
Currently supported CLIP model:
- Qdrant/clip-ViT-B-32-text (512 dim, 77 tokens)
Sources: fastembed/text/clip_embedding.py:20-50
Late Interaction Multimodal
The ColModernVBert class implements late interaction models with image processing capabilities:
def load_onnx_model(self) -> None:
self._load_onnx_model(...)
# Load image processing configuration
processor_config_path = self._model_dir / "processor_config.json"
self.image_seq_len = processor_config.get("image_seq_len", 64)
self.max_image_size = preprocessor_config.get("max_image_size", {}).get("longest_edge", 512)
Sources: fastembed/late_interaction_multimodal/colmodernvbert.py:50-100
Sparse Embedding Infrastructure
MiniCOIL
The MiniCOIL class implements sparse embedding with semantic resolution:
class MiniCOIL(SparseTextEmbeddingBase, OnnxTextModel[SparseEmbedding]):
"""
MiniCOIL is a sparse embedding model, that resolves semantic meaning of the words,
while keeping exact keyword match behavior.
"""
Each vocabulary token is converted into a 4-dimensional component of a sparse vector, weighted by token frequency in the corpus. A token that is not found in the vocabulary is treated exactly as it would be in BM25.
Supported sparse models:
- Qdrant/minicoil-v1 (0.09 GB, requires IDF weighting)
Sources: fastembed/sparse/minicoil.py:40-80
Worker Architecture
The infrastructure uses a worker-based pattern for parallel embedding generation:
graph LR
A[Main Thread] --> B[OnnxTextEmbeddingWorker]
B --> C[ONNX Session]
C --> D[Tokenization]
D --> E[Model Inference]
E --> F[Post-processing]
F --> G[Normalized Embeddings]
Worker Classes
| Worker Class | Parent | Purpose |
|---|---|---|
| `OnnxTextEmbeddingWorker` | Base | Standard text embedding generation |
| `PooledEmbeddingWorker` | `OnnxTextEmbeddingWorker` | Mean pooling after inference |
| `PooledNormalizedEmbeddingWorker` | `OnnxTextEmbeddingWorker` | Pooling + L2 normalization |
| `CLIPEmbeddingWorker` | `OnnxTextEmbeddingWorker` | CLIP-specific processing |
Sources: fastembed/text/onnx_text_model.py:1-50
Reranking Infrastructure
OnnxTextCrossEncoder
The cross-encoder reranking uses a specialized ONNX model class:
class OnnxTextCrossEncoder(TextCrossEncoderBase, OnnxCrossEncoderModel):
@classmethod
def _list_supported_models(cls) -> list[BaseModelDescription]:
return supported_onnx_models
Supported reranker models:
| Model | License | Size (GB) | Context |
|---|---|---|---|
| `jinaai/jina-reranker-v1-turbo-en` | apache-2.0 | 0.15 | 1K context |
| `jinaai/jina-reranker-v2-base-multilingual` | cc-by-nc-4.0 | 1.11 | 1K context, sliding window |
Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py:30-80
Model Source Configuration
ModelSource
Models can be loaded from multiple sources:
@dataclass
class ModelSource:
hf: str | None = None # HuggingFace Hub
url: str | None = None # Direct URL download
_deprecated_tar_struct: bool = False # Legacy tar format
ModelDescription
The base model description structure:
@dataclass
class DenseModelDescription:
model: str
dim: int
description: str
license: str
size_in_GB: float
sources: ModelSource
model_file: str
additional_files: list[str] | None = None
Sources: fastembed/common/model_description.py:1-80
Inference Workflow
sequenceDiagram
participant User
participant EmbeddingClass
participant OnnxModel
participant ONNXRuntime
User->>EmbeddingClass: embed(texts)
EmbeddingClass->>OnnxModel: preprocess(texts)
OnnxModel->>OnnxModel: tokenize()
OnnxModel->>ONNXRuntime: run(session)
ONNXRuntime-->>OnnxModel: model_output
OnnxModel->>EmbeddingClass: _post_process_onnx_output()
EmbeddingClass->>EmbeddingClass: normalize/pool()
EmbeddingClass-->>User: numpy arrays
Configuration Parameters
Common Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | model-specific | Name of the model to use |
| `cache_dir` | `str \| None` | `None` | Cache directory for model files |
| `threads` | `int \| None` | `None` | Number of threads for ONNX |
| `providers` | `Sequence[OnnxProvider] \| None` | `None` | Execution providers |
| `cuda` | `bool \| Device` | `Device.AUTO` | CUDA device selection |
| `device_ids` | `list[int] \| None` | `None` | Multiple GPU device IDs |
| `lazy_load` | `bool` | `False` | Defer model loading |
| `device_id` | `int \| None` | `None` | Specific device ID |
| `specific_model_path` | `str \| None` | `None` | Custom model file path |
Sources: fastembed/text/onnx_embedding.py:85-120
Lazy Loading
The infrastructure supports lazy loading for memory-efficient initialization:
def __init__(
    self,
    lazy_load: bool = False,
    ...
):
    if not lazy_load:
        self.load_onnx_model()
When lazy_load=True, the ONNX model is not loaded until the first inference call, reducing startup memory footprint.
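The lazy-loading pattern can be illustrated with a small stand-in class. `LazyModel`, its `load_count` attribute, and the placeholder "session" object are hypothetical and exist only to show when loading happens; they are not part of FastEmbed.

```python
class LazyModel:
    """Hypothetical stand-in showing eager vs. deferred model loading."""

    def __init__(self, lazy_load: bool = False) -> None:
        self._session = None
        self.load_count = 0
        if not lazy_load:
            self.load()  # eager: pay the loading cost at construction time

    def load(self) -> None:
        # Idempotent: the (pretend) session is created at most once.
        if self._session is None:
            self._session = object()  # placeholder for an inference session
            self.load_count += 1

    def embed(self, texts: list[str]) -> list[int]:
        self.load()  # lazy path: the first call triggers loading
        return [len(t) for t in texts]  # dummy "embedding" for illustration


eager = LazyModel()
lazy = LazyModel(lazy_load=True)
loaded_before_use = lazy.load_count
lazy.embed(["hello"])
lazy.embed(["again"])
```

Deferring the load keeps construction cheap, which matters when an instance is created in one process and only used in worker processes.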
Type System
Core Types
| Type | Definition | Usage |
|---|---|---|
| `NumpyArray` | `np.ndarray[Any, np.dtype[Any]]` | Dense embedding arrays |
| `SparseEmbedding` | Custom sparse representation | Sparse embedding vectors |
| `OnnxProvider` | Execution provider type | CPU, CUDA providers |
| `Device` | Enum | Device selection (CPU/CUDA/AUTO) |
Sources: fastembed/common/types.py:1-100
Summary
The ONNX Model Infrastructure provides a robust, extensible foundation for embedding generation in FastEmbed. Key characteristics include:
- Unified Runtime: Single ONNX execution layer across all embedding modalities
- Hardware Acceleration: Support for CUDA and CPU execution providers
- Model Flexibility: Dynamic model loading from HuggingFace, URLs, or local cache
- Extensible Architecture: Clean inheritance hierarchy for adding new embedding types
- Memory Efficiency: Lazy loading and optimized session management
- Cross-Modal Support: Text, image, sparse, and multimodal embeddings
This infrastructure enables FastEmbed to deliver high-performance embedding generation without external ML framework dependencies, making it suitable for production deployments with varying hardware constraints.
Sources: fastembed/common/onnx_model.py:1-100
GPU Support and Acceleration
Related topics: Installation Guide, ONNX Model Infrastructure
FastEmbed provides comprehensive GPU acceleration support through ONNX Runtime's execution providers, enabling high-performance inference on NVIDIA GPUs. The library offers flexible device management with automatic detection, multi-GPU support for parallel processing, and lazy loading capabilities for efficient resource utilization.
Architecture Overview
FastEmbed's GPU acceleration is built on top of ONNX Runtime, which provides hardware-accelerated inference for ONNX models. All embedding model classes inherit from base ONNX model classes that handle device initialization and session management.
graph TD
A[User Code] --> B[TextEmbedding / ImageEmbedding / CrossEncoder]
B --> C[ONNX Model Base Classes]
C --> D[ONNX Runtime Session]
D --> E{Hardware Acceleration}
E --> F[CUDA Execution Provider]
E --> G[CPU Execution Provider]
E --> H[TensorRT Provider]
F --> H1[NVIDIA GPU]
G --> H2[CPU Fallback]
style F fill:#4CAF50,color:#fff
style H1 fill:#2196F3,color:#fff
Supported Model Types
FastEmbed supports GPU acceleration across multiple embedding modalities and processing types.
| Model Type | Class | GPU Support | Description |
|---|---|---|---|
| Text Embeddings | OnnxTextEmbedding | ✅ | Dense text embeddings via ONNX |
| Pooled Embeddings | PooledEmbedding | ✅ | Pooled representation embeddings |
| Normalized Pooled | PooledNormalizedEmbedding | ✅ | L2-normalized pooled embeddings |
| Image Embeddings | OnnxImageEmbedding | ✅ | Vision model embeddings |
| Cross Encoders | OnnxTextCrossEncoder | ✅ | Reranking and relevance scoring |
| Sparse Embeddings | SPLADE models | ✅ | Lexical sparse embeddings |
Sources: fastembed/text/onnx_embedding.py:1-50
Device Configuration
Device Enum
The Device enum defines available compute devices with automatic selection capability.
class Device(Enum):
    AUTO = "auto"  # Automatically select best available device
    CPU = "cpu"  # Force CPU execution
    CUDA = "cuda"  # NVIDIA GPU acceleration
Initialization Parameters
All ONNX embedding classes accept the following GPU-related parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `cuda` | `bool \| Device` | `Device.AUTO` | Enable CUDA or specify device type |
| `providers` | `Sequence[OnnxProvider]` | `None` | ONNX Runtime providers (mutually exclusive with cuda) |
| `device_ids` | `list[int]` | `None` | GPU device IDs for multi-GPU data parallelism |
| `device_id` | `int` | `None` | Specific device ID for single-process loading |
| `lazy_load` | `bool` | `False` | Defer model loading until first use |
Sources: fastembed/text/onnx_embedding.py:47-57
Constructor Signature
def __init__(
self,
model_name: str = "BAAI/bge-small-en-v1.5",
cache_dir: str | None = None,
threads: int | None = None,
providers: Sequence[OnnxProvider] | None = None,
cuda: bool | Device = Device.AUTO,
device_ids: list[int] | None = None,
lazy_load: bool = False,
device_id: int | None = None,
specific_model_path: str | None = None,
**kwargs: Any,
):
Sources: fastembed/text/onnx_embedding.py:44-66
GPU Initialization Workflow
sequenceDiagram
participant User
participant Embedding as Embedding Class
participant Base as ONNX Model Base
participant Runtime as ONNX Runtime
participant Device as Compute Device
User->>Embedding: Initialize(cuda=True)
Embedding->>Base: super().__init__()
Base->>Device: Auto-detect device
Device-->>Base: Available devices
Base->>Runtime: Create InferenceSession
Runtime->>Device: Load model to GPU
Device-->>Runtime: Model loaded
Runtime-->>Base: Session ready
Base-->>Embedding: Return session
Embedding-->>User: Instance ready
Multi-GPU Configuration
Data Parallel Processing
For scenarios requiring distribution across multiple GPUs, FastEmbed supports device ID specification for data-parallel workloads.
from fastembed import TextEmbedding
# Initialize for multi-GPU data parallelism
embedding_model = TextEmbedding(
model_name="BAAI/bge-base-en-v1.5",
cuda=True,
device_ids=[0, 1, 2, 3], # Use 4 GPUs
lazy_load=True # Required for multi-GPU setup
)
Sources: fastembed/text/onnx_embedding.py:52-55
Lazy Loading for Multi-GPU
When using multiple GPUs, lazy_load=True defers model loading until first inference, which is essential for avoiding resource conflicts in multi-process scenarios.
embedding_model = TextEmbedding(
model_name="BAAI/bge-small-en-v1.5",
cuda=True,
device_ids=[0, 1],
lazy_load=True # Load on-demand in worker processes
)
Sources: fastembed/text/onnx_embedding.py:54
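To show what data parallelism over `device_ids` means in practice, the sketch below splits a corpus into batches and assigns each batch to a device in round-robin order. Both helper names are hypothetical, and round-robin assignment is an assumption made for illustration; how FastEmbed actually maps worker processes to devices is not stated in this section.

```python
def split_batches(items: list[str], batch_size: int) -> list[list[str]]:
    """Cut the input into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def assign_devices(num_workers: int, device_ids: list[int]) -> list[int]:
    """Round-robin mapping from worker index to GPU device ID (an assumed policy)."""
    if not device_ids:
        raise ValueError("device_ids must not be empty")
    return [device_ids[i % len(device_ids)] for i in range(num_workers)]


docs = ["a", "b", "c", "d", "e"]
batches = split_batches(docs, batch_size=2)
devices = assign_devices(len(batches), device_ids=[0, 1])
plan = list(zip(devices, batches))  # which device handles which batch
```

Each worker would load its own copy of the model on its assigned device, which is why deferring the load until the worker starts avoids loading the model in the parent process first.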
ONNX Runtime Providers
Provider Selection
ONNX Runtime supports multiple execution providers. FastEmbed allows explicit provider specification via the providers parameter, which is mutually exclusive with the cuda parameter.
from fastembed import TextEmbedding
# Using explicit provider specification
model = TextEmbedding(
model_name="BAAI/bge-small-en-v1.5",
providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
Provider Priority
When multiple providers are specified, ONNX Runtime attempts to use them in order of preference, falling back to subsequent providers if the preferred one is unavailable.
graph LR
A[Query] --> B{CUDA Available?}
B -->|Yes| C[CUDAExecutionProvider]
B -->|No| D{CPU Provider Available?}
D -->|Yes| E[CPUExecutionProvider]
D -->|No| F[Error]
C --> G[GPU Inference]
E --> H[CPU Inference]
style C fill:#4CAF50,color:#fff
style E fill:#FF9800,color:#fff
GPU Installation
Package Variants
FastEmbed offers separate packages for CPU and GPU operation.
| Package | Command | Use Case |
|---|---|---|
| CPU (default) | pip install fastembed | Standard installations |
| GPU | pip install fastembed-gpu | NVIDIA GPU acceleration |
Sources: README.md:1-20
Qdrant Integration
For vector database workflows with GPU acceleration:
pip install qdrant-client[fastembed-gpu]
from fastembed import TextEmbedding
# GPU-accelerated embedding for Qdrant
model = TextEmbedding(
model_name="BAAI/bge-small-en-v1.5",
providers=["CUDAExecutionProvider"]
)
print("The model BAAI/bge-small-en-v1.5 is ready to use on a GPU.")
Sources: README.md:1-30
Implementation Across Model Classes
OnnxTextEmbedding
The primary text embedding class with full GPU support:
class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[NumpyArray]):
    """Implementation of the Flag Embedding model with ONNX acceleration."""

    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    ):
Sources: fastembed/text/onnx_embedding.py:48-70
OnnxImageEmbedding
Image embeddings also inherit the same GPU acceleration framework:
class OnnxImageEmbedding(ImageEmbeddingBase, OnnxImageModel[NumpyArray]):
    def __init__(
        self,
        model_name: str,
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    ):
Sources: fastembed/image/onnx_embedding.py:1-30
OnnxTextCrossEncoder
Reranking models support GPU acceleration for cross-encoder inference:
class OnnxTextCrossEncoder(TextCrossEncoderBase, OnnxCrossEncoderModel):
    def __init__(
        self,
        model_name: str,
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    ):
Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py:1-50
Unified TextEmbedding Entry Point
The TextEmbedding class provides a unified interface that automatically selects the appropriate embedding type:
class TextEmbedding(TextEmbeddingBase):
    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        **kwargs: Any,
    ):
        super().__init__(model_name, cache_dir, threads, **kwargs)
        # Automatically routes to appropriate embedding type
        for EMBEDDING_MODEL_TYPE in self.EMBEDDINGS_REGISTRY:
            supported_models = EMBEDDING_MODEL_TYPE._list_supported_models()
            if any(model_name.lower() == model.model.lower()
                   for model in supported_models):
                self.model = EMBEDDING_MODEL_TYPE(
                    model_name=model_name,
                    cache_dir=cache_dir,
                    threads=threads,
                    providers=providers,
                    cuda=cuda,
                    device_ids=device_ids,
                    lazy_load=lazy_load,
                )
Sources: fastembed/text/text_embedding.py:1-100
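The registry lookup in the excerpt above can be reduced to a self-contained sketch. Everything here is illustrative: `BackendA`, `BackendB`, `ModelSpec`, and `resolve_backend` are hypothetical names, and the assignment of model names to backends is made up for the example rather than taken from the library.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    model: str  # canonical model name as listed by a backend


class BackendA:
    """Hypothetical backend standing in for one registry entry."""

    @classmethod
    def _list_supported_models(cls) -> list:
        return [ModelSpec("BAAI/bge-small-en-v1.5")]


class BackendB:
    """Second hypothetical backend with a different model list."""

    @classmethod
    def _list_supported_models(cls) -> list:
        return [ModelSpec("sentence-transformers/all-MiniLM-L6-v2")]


REGISTRY = [BackendA, BackendB]


def resolve_backend(model_name: str) -> type:
    """Return the first backend whose model list contains model_name (case-insensitive)."""
    wanted = model_name.lower()
    for backend in REGISTRY:
        if any(wanted == spec.model.lower() for spec in backend._list_supported_models()):
            return backend
    raise ValueError(f"Model {model_name} is not supported by any registered backend")


chosen = resolve_backend("baai/BGE-small-en-v1.5")  # casing differs on purpose
```

The case-insensitive comparison mirrors the `model_name.lower() == model.model.lower()` check in the excerpt, so callers do not have to match the canonical casing exactly.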
Supported Models with GPU Acceleration
Text Embedding Models
| Model | Dimension | License | Size (GB) | Token Limit |
|---|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | MIT | 0.067 | 512 |
| BAAI/bge-base-en-v1.5 | 768 | MIT | 0.21 | 512 |
| BAAI/bge-large-en-v1.5 | 1024 | MIT | 1.20 | 512 |
| jinaai/jina-embeddings-v2-base-en | 768 | Apache 2.0 | 0.52 | 8192 |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Apache 2.0 | 0.09 | 256 |
| mixedbread-ai/mxbai-embed-large-v1 | 1024 | Apache 2.0 | 0.64 | 512 |
| nomic-ai/nomic-embed-text-v1.5 | 768 | Apache 2.0 | 0.13 | 8192 |
Sources: fastembed/text/onnx_embedding.py:1-150, fastembed/text/pooled_embedding.py:1-80
Image Embedding Models
| Model | Dimension | License | Size (GB) |
|---|---|---|---|
| Qdrant/Unicom-ViT-B-16 | 768 | Apache 2.0 | 0.82 |
| Qdrant/Unicom-ViT-B-32 | 512 | Apache 2.0 | 0.48 |
| jinaai/jina-clip-v1 | 768 | Apache 2.0 | 0.55 |
Sources: fastembed/image/onnx_embedding.py:1-50
Reranking Models
| Model | License | Size (GB) |
|---|---|---|
| jinaai/jina-reranker-v1-turbo-en | Apache 2.0 | 0.15 |
| jinaai/jina-reranker-v2-base-multilingual | CC BY-NC 4.0 | 1.11 |
Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py:1-40
Best Practices
Device Selection
from fastembed import TextEmbedding
from fastembed.common.types import Device

# Recommended: Automatic detection
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cuda=Device.AUTO)
# Explicit CUDA
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cuda=True)
# Force CPU (for debugging)
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cuda=False)
Multi-GPU Inference
# For batch processing across multiple GPUs
model = TextEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    cuda=True,
    device_ids=[0, 1],  # Parallel GPU usage
    lazy_load=True,
)
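One plausible way to spread data-parallel workers over a device list is round-robin assignment. The sketch below assumes that scheme purely for illustration; the source does not state how FastEmbed maps workers to GPUs, and `assign_devices` is a hypothetical helper.

```python
def assign_devices(num_workers: int, device_ids: list) -> dict:
    """Map worker index -> GPU id by cycling through device_ids."""
    if not device_ids:
        raise ValueError("device_ids must contain at least one GPU id")
    return {
        worker: device_ids[worker % len(device_ids)]
        for worker in range(num_workers)
    }


# Four workers sharing two GPUs: workers 0 and 2 use GPU 0, workers 1 and 3 use GPU 1.
placement = assign_devices(num_workers=4, device_ids=[0, 1])
```

Under this assumption, running more workers than GPUs means several workers share a device, so batch size and memory per worker should be chosen with that in mind.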
Provider Fallback
# Explicit provider chain with fallback
model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    providers=[
        "CUDAExecutionProvider",  # Preferred
        "CPUExecutionProvider",   # Fallback
    ],
)
Limitations and Considerations
| Aspect | Description |
|---|---|
| Mutual Exclusivity | providers and cuda parameters cannot be used together |
| Device ID Scope | device_ids is for data parallelism; device_id is for single-process loading |
| Lazy Loading | Required for multi-GPU setups to avoid resource conflicts |
| Model Support | All ONNX-exported models support GPU; not all models have ONNX exports |
Sources: fastembed/text/onnx_embedding.py:47-57
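The first and third rows of the table can be turned into a pre-flight check that runs before a model is constructed. This is a hypothetical helper, not part of FastEmbed, and it does not claim that the library itself raises on these combinations. Here `cuda=None` means "not set by the caller", which is an assumption of this sketch.

```python
from typing import Optional, Sequence


def check_gpu_args(
    providers: Optional[Sequence[str]] = None,
    cuda: Optional[bool] = None,
    device_ids: Optional[Sequence[int]] = None,
    lazy_load: bool = False,
) -> list:
    """Return human-readable problems; an empty list means the combination looks fine."""
    problems = []
    if providers is not None and cuda is not None:
        problems.append("pass either providers or cuda, not both")
    if device_ids is not None and len(device_ids) > 1 and not lazy_load:
        problems.append("multiple device_ids expect lazy_load=True")
    return problems


ok = check_gpu_args(cuda=True, device_ids=[0, 1], lazy_load=True)
conflict = check_gpu_args(providers=["CUDAExecutionProvider"], cuda=True)
eager_multi_gpu = check_gpu_args(cuda=True, device_ids=[0, 1])
```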
Summary
FastEmbed's GPU acceleration framework provides:
- Automatic device detection via the Device.AUTO enum value
- Flexible provider configuration through ONNX Runtime's provider system
- Multi-GPU support with device ID lists for data-parallel workloads
- Lazy loading for efficient multi-process GPU utilization
- Consistent API across text, image, and reranking models
- Seamless fallback to CPU when CUDA is unavailable
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
First-time setup may fail or require extra isolation and rollback planning.
Developers may expose sensitive permissions or credentials: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory
Developers may fail before the first successful local run: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.
Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.
1. Installation risk: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2
- Severity: high
- Finding: Installation risk is backed by a source signal: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/qdrant/fastembed/issues/618
2. Installation risk: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14
- Severity: high
- Finding: Installation risk is backed by a source signal: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/qdrant/fastembed/issues/630
3. Security or permission risk: Developers should check this security_permissions risk before relying on the project: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory
- Severity: high
- Finding: Developers should check this security_permissions risk before relying on the project: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory
- User impact: Developers may expose sensitive permissions or credentials: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory. Context: Observed when using python
- Evidence: failure_mode_cluster:github_issue | fmev_d3890c2b3360ccb937839f70fd4aa584 | https://github.com/qdrant/fastembed/issues/626 | [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory
4. Installation risk: Developers should check this installation risk before relying on the project: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.
- User impact: Developers may fail before the first successful local run: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment. Context: Observed when using python, docker
- Evidence: failure_mode_cluster:github_issue | fmev_16e50a8626aff1576adeb1c0baab4785 | https://github.com/qdrant/fastembed/issues/466 | The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.
5. Installation risk: Developers should check this installation risk before relying on the project: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2
- User impact: Developers may fail before the first successful local run: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2. Context: Observed when using python, windows, linux
- Evidence: failure_mode_cluster:github_issue | fmev_04529bc774f1c961d4adeb7190edecd7 | https://github.com/qdrant/fastembed/issues/618 | [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2
6. Installation risk: Developers should check this installation risk before relying on the project: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14
- User impact: Developers may fail before the first successful local run: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14. Context: Observed when using python, macos, cuda
- Evidence: failure_mode_cluster:github_issue | fmev_79a43347d96beb6d05eb6bfec2503fb5 | https://github.com/qdrant/fastembed/issues/630 | [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14
7. Installation risk: Developers should check this installation risk before relying on the project: v0.5.1
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: v0.5.1
- User impact: Upgrade or migration may change expected behavior: v0.5.1
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v0.5.1. Context: Observed when using python
- Evidence: failure_mode_cluster:github_release | fmev_8b37c58c613005c0182d0325aaf032f7 | https://github.com/qdrant/fastembed/releases/tag/v0.5.1 | v0.5.1
8. Installation risk: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.
- Severity: medium
- Finding: Installation risk is backed by a source signal: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/qdrant/fastembed/issues/466
9. Configuration risk: [Bug]: No timeout on model download — requests.get() can hang indefinitely
- Severity: medium
- Finding: Configuration risk is backed by a source signal: [Bug]: No timeout on model download — requests.get() can hang indefinitely. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/qdrant/fastembed/issues/627
10. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | github_repo:666260877 | https://github.com/qdrant/fastembed | README/documentation is current enough for a first validation pass.
11. Project risk: Developers should check this runtime risk before relying on the project: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1
- User impact: Developers may hit a documented source-backed failure mode: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1. Context: Observed when using python, linux
- Evidence: failure_mode_cluster:github_issue | fmev_17b849ae47ffaf5d18cabbd577f373ca | https://github.com/qdrant/fastembed/issues/603 | [Bug]: Loading models with additional files fails with onnxruntime 1.24.1
12. Project risk: Developers should check this runtime risk before relying on the project: v0.4.2
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: v0.4.2
- User impact: Upgrade or migration may change expected behavior: v0.4.2
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v0.4.2. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_release | fmev_607c8ff157108b2b5fb78f55129b60f6 | https://github.com/qdrant/fastembed/releases/tag/v0.4.2 | v0.4.2
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using fastembed with real data or production workflows.
- [[Bug]: Segmentation Fault or AssertionError during initialization on Pyt](https://github.com/qdrant/fastembed/issues/618) - github / github_issue
- [[Bug]: No timeout on model download — requests.get() can hang indefinite](https://github.com/qdrant/fastembed/issues/627) - github / github_issue
- [[Bug]: license error in pypi metadata](https://github.com/qdrant/fastembed/issues/620) - github / github_issue
- [[Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14](https://github.com/qdrant/fastembed/issues/630) - github / github_issue
- [[Bug]: Loading models with additional files fails with onnxruntime 1.24.](https://github.com/qdrant/fastembed/issues/603) - github / github_issue
- The dependency `py-rust-stemmers` cannot be downloaded in a pure Python - github / github_issue
- [[Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary](https://github.com/qdrant/fastembed/issues/626) - github / github_issue
- v0.8.0 - github / github_release
- v0.7.4 - github / github_release
- v0.7.2 - github / github_release
- v0.7.1 - github / github_release
- v0.7.0 - github / github_release
Source: Project Pack community evidence and pitfall evidence