Doramagic Project Pack · Human Manual

fastembed

Introduction to FastEmbed

Related topics: Installation Guide, System Architecture, Quick Start Guide

FastEmbed is a lightweight, high-performance Python library for generating text and image embeddings with ONNX-based models. It provides a unified API for dense, sparse, and late-interaction embeddings, as well as cross-encoder re-ranking models.

Overview

FastEmbed serves as an embedding generation engine optimized for production use cases, particularly in vector search applications. The library emphasizes:

  • Performance: Leverages ONNX Runtime for efficient CPU inference
  • Lightweight: Minimal dependencies and small model sizes
  • Flexibility: Supports dense, sparse, and multimodal embeddings
  • Ease of use: Simple Python API for common embedding workflows

Sources: README.md

Architecture

FastEmbed follows a modular architecture with separate components for different embedding types and processing stages.

graph TD
    A[FastEmbed API] --> B[Text Embedding]
    A --> C[Image Embedding]
    A --> D[Sparse Embedding]
    A --> E[Cross Encoder]
    
    B --> B1[OnnxEmbedding]
    B --> B2[PooledEmbedding]
    B --> B3[PooledNormalizedEmbedding]
    
    C --> C1[OnnxImageModel]
    
    D --> D1[BM25]
    D --> D2[SPLADE++]
    D --> D3[MiniCOIL]
    
    E --> E1[OnnxCrossEncoderModel]
    
    B1 & B2 & B3 --> F[ONNX Runtime]
    C1 --> F
    E1 --> F
    D1 & D2 & D3 --> G[Tokenization Engine]

Core Components

| Component | Purpose | File Location |
| --- | --- | --- |
| TextEmbedding | Dense text embeddings | fastembed/text/ |
| ImageEmbedding | Image embeddings | fastembed/image/ |
| SparseTextEmbedding | Sparse embeddings (BM25, SPLADE) | fastembed/sparse/ |
| TextCrossEncoder | Re-ranking models | fastembed/rerank/ |
| LateInteractionTextEmbedding | Late interaction embeddings | fastembed/late_interaction/ |

Sources: fastembed/text/onnx_embedding.py

Supported Models

FastEmbed supports an extensive collection of pre-converted ONNX models across multiple categories.

Text Embedding Models

Text embeddings are categorized by their pooling strategy and normalization approach.

#### Dense Models (Unimodal)

| Model | Dimension | Languages | Max Tokens | License |
| --- | --- | --- | --- | --- |
| BAAI/bge-base-en-v1.5 | 768 | English | 512 | MIT |
| BAAI/bge-large-en-v1.5 | 1024 | English | 512 | MIT |
| BAAI/bge-small-en-v1.5 | 384 | English | 512 | MIT |
| thenlper/gte-base | 768 | English | 512 | MIT |
| thenlper/gte-large | 1024 | English | 512 | MIT |
| snowflake/snowflake-arctic-embed-m | 768 | English | 512 | Apache-2.0 |
| snowflake/snowflake-arctic-embed-l | 1024 | English | 512 | Apache-2.0 |

Sources: fastembed/text/onnx_embedding.py:1-50

#### Multilingual Models

| Model | Dimension | Languages | Max Tokens |
| --- | --- | --- | --- |
| intfloat/multilingual-e5-small | 384 | ~100 | 512 |
| intfloat/multilingual-e5-large | 1024 | ~100 | 512 |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 768 | ~50 | 384 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 | ~50 | 512 |

Sources: fastembed/text/pooled_embedding.py

#### Jina AI Models

| Model | Dimension | Languages | Max Tokens |
| --- | --- | --- | --- |
| jinaai/jina-embeddings-v2-base-en | 768 | English | 8192 |
| jinaai/jina-embeddings-v2-small-en | 512 | English | 8192 |
| jinaai/jina-embeddings-v2-base-zh | 768 | Chinese/English | 8192 |
| jinaai/jina-embeddings-v2-base-de | 768 | German/English | 8192 |
| jinaai/jina-embeddings-v2-base-code | 768 | 30 languages | 8192 |

Sources: fastembed/text/pooled_normalized_embedding.py

Sparse Embedding Models

Sparse embeddings provide interpretable vectors with non-zero values only at specific token positions.

| Model | Type | Language | Description |
| --- | --- | --- | --- |
| prithivida/Splade_PP_en_v1 | SPLADE++ | English | Sparse lexical + semantic |
| Qdrant/minicoil-v1 | MiniCOIL | English | Semantic + keyword match |
| Qdrant/bm25 | BM25 | Multilingual | Traditional BM25 ranking |

Sources: fastembed/sparse/bm25.py

The MiniCOIL model combines semantic understanding with exact keyword matching:

class MiniCOIL(SparseTextEmbeddingBase, OnnxTextModel[SparseEmbedding]):
    """
    MiniCOIL is a sparse embedding model, that resolves semantic meaning of the words,
    while keeping exact keyword match behavior.
    
    Each vocabulary token is converted into 4d component of a sparse vector, 
    which is then weighted by the token frequency in the corpus.
    If the token is not found in the corpus, it is treated exactly like in BM25.
    """

Sources: fastembed/sparse/minicoil.py

Image Embedding Models

| Model | Dimension | Type | License |
| --- | --- | --- | --- |
| Qdrant/resnet50-onnx | 2048 | Image | Apache-2.0 |
| Qdrant/Unicom-ViT-B-16 | 768 | Multimodal | Apache-2.0 |
| Qdrant/Unicom-ViT-B-32 | 512 | Multimodal | Apache-2.0 |
| jinaai/jina-clip-v1 | 768 | Multimodal | Apache-2.0 |

Sources: fastembed/image/onnx_embedding.py

Cross Encoder Reranking Models

| Model | Description | License | Size |
| --- | --- | --- | --- |
| Xenova/ms-marco-MiniLM-L-12-v2 | MS MARCO passage ranking | Apache-2.0 | 0.12 GB |
| BAAI/bge-reranker-base | BGE reranker base | MIT | 1.04 GB |
| jinaai/jina-reranker-v1-tiny-en | Fast reranking, 8K context | Apache-2.0 | 0.13 GB |
| jinaai/jina-reranker-v1-turbo-en | Fast reranking, 8K context | Apache-2.0 | 0.15 GB |
| jinaai/jina-reranker-v2-base-multilingual | Multilingual, 1K context | CC-BY-NC-4.0 | 1.11 GB |

Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py

Usage Patterns

Text Embedding Generation

from fastembed import TextEmbedding

# Initialize with default model
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Generate embeddings
documents = [
    "passage: The capital of France is Paris",
    "query: What is the capital of France?"
]

embeddings = list(model.embed(documents))
# Returns list of numpy arrays

Sources: README.md
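Once embeddings exist, a common next step is to rank passages against a query by cosine similarity. The following is a minimal sketch in plain Python; the three-dimensional vectors and the names `query_vec` and `passage_vecs` are illustrative assumptions standing in for real model output, not values produced by FastEmbed.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (illustrative values only).
query_vec = [1.0, 0.0, 1.0]
passage_vecs = {
    "paris": [0.9, 0.1, 0.8],
    "berlin": [0.0, 1.0, 0.1],
}

# Sort passage names from most to least similar to the query.
ranked = sorted(
    passage_vecs,
    key=lambda name: cosine_similarity(query_vec, passage_vecs[name]),
    reverse=True,
)
print(ranked)
```

With real embeddings, the same function can be applied to the arrays returned by `model.embed(...)` after converting them to lists, or replaced by a vectorized equivalent.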

Sparse Embedding with SPLADE++

from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
embeddings = list(model.embed(documents))

# Returns:
# [
#   SparseEmbedding(indices=[ 17, 123, 919, ... ], values=[0.71, 0.22, 0.39, ...]),
#   SparseEmbedding(indices=[ 38,  12,  91, ... ], values=[0.11, 0.22, 0.39, ...])
# ]

Sources: README.md
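To see how two sparse vectors of this shape can be compared, the sketch below computes their dot product from parallel index and value lists. It is plain Python with made-up numbers; `sparse_dot` is a hypothetical helper, not part of FastEmbed.

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    weights_b = dict(zip(indices_b, values_b))
    # Only indices present in both vectors contribute to the score.
    return sum(v * weights_b.get(i, 0.0) for i, v in zip(indices_a, values_a))

# Illustrative (indices, values) pairs, not real model output.
query = ([17, 123, 919], [0.5, 1.0, 2.0])
doc = ([17, 919, 2048], [2.0, 0.5, 3.0])

score = sparse_dot(*query, *doc)
print(score)
```

Only indices 17 and 919 appear in both vectors, so only those two terms contribute to the score.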

Custom Model Configuration

from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource

# Register a custom model configuration, then load it by name
TextEmbedding.add_custom_model(
    model="intfloat/multilingual-e5-small",
    pooling=PoolingType.MEAN,
    normalization=True,
    sources=ModelSource(hf="intfloat/multilingual-e5-small"),
    dim=384,
    model_file="onnx/model.onnx",
)
model = TextEmbedding(model_name="intfloat/multilingual-e5-small")

embeddings = list(model.embed(documents))

Sources: README.md

Cross Encoder Reranking

from fastembed import TextCrossEncoder

model = TextCrossEncoder(model_name="BAAI/bge-reranker-base")

# Score query-document pairs
query = "What is the capital of France?"
documents = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

scores = list(model.rerank(query, documents))

Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py
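The scores come back in the same order as the input documents, so producing a ranked list means pairing each score with its document and sorting. A small sketch, using made-up score values and a hypothetical helper name `rank_by_score`:

```python
def rank_by_score(documents, scores):
    """Return (document, score) pairs sorted from most to least relevant."""
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)

documents = [
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
]
scores = [-3.2, 7.5]  # illustrative values, not real reranker output

ranked = rank_by_score(documents, scores)
for doc, score in ranked:
    print(f"{score:6.2f}  {doc}")
```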

Post-Processing: MUVERA

FastEmbed includes MUVERA (Multi-Vector Retrieval Algorithm), which converts late-interaction (multi-vector) embeddings, such as those produced by ColBERT, into fixed-dimensional encodings.

import numpy as np
from fastembed import LateInteractionTextEmbedding
from fastembed.postprocess import Muvera

model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
muvera = Muvera.from_multivector_model(
    model=model,
    k_sim=6,
    dim_proj=32
)

# Convert late-interaction embeddings to fixed dimension
embeddings = np.array(list(model.embed(["sample text"])))
fde = muvera.process_document(embeddings[0])

Sources: fastembed/postprocess/muvera.py

Model Caching

FastEmbed automatically caches downloaded models to avoid repeated downloads.

| Setting | Environment Variable | Default Location |
| --- | --- | --- |
| Cache Path | FASTEMBED_CACHE_PATH | System temp directory |

Downloaded models are stored in ONNX format and loaded from the cache on subsequent runs.

Prefix Requirements

Some models require specific text prefixes for query and document inputs:

| Prefix Requirement | Models | Example |
| --- | --- | --- |
| Necessary | E5, Nomic, Snowflake Arctic | Query: query: ..., Document: passage: ... |
| Not necessary | Jina, BGE, GTE | Plain text input |

Sources: fastembed/text/onnx_embedding.py
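For models that need prefixes, it can help to apply them in one place rather than hand-editing each string. The helper below is a hypothetical sketch (the name `add_prefix` is not part of FastEmbed) and assumes the "query: " and "passage: " convention shown in the table above.

```python
def add_prefix(texts, kind):
    """Prepend 'query: ' or 'passage: ' to each text unless it is already there."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    prefix = f"{kind}: "
    return [t if t.startswith(prefix) else prefix + t for t in texts]

queries = add_prefix(["What is the capital of France?"], "query")
passages = add_prefix(["The capital of France is Paris"], "passage")
print(queries, passages)
```

The check for an existing prefix makes the helper safe to call twice on the same input.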

Quick Reference

Model Selection Guide

| Use Case | Recommended Model |
| --- | --- |
| English semantic search | BAAI/bge-small-en-v1.5 |
| High-quality English | BAAI/bge-large-en-v1.5 |
| Multilingual (100+ languages) | intfloat/multilingual-e5-large |
| Long documents (8K tokens) | jinaai/jina-embeddings-v2-base-en |
| Fast re-ranking | jinaai/jina-reranker-v1-tiny-en |
| Sparse lexical + semantic | prithivida/Splade_PP_en_v1 |

API Quick Reference

| Class | Import | Primary Method |
| --- | --- | --- |
| Dense Text | from fastembed import TextEmbedding | .embed(documents) |
| Sparse Text | from fastembed import SparseTextEmbedding | .embed(documents) |
| Image | from fastembed import ImageEmbedding | .embed(images) |
| Cross Encoder | from fastembed import TextCrossEncoder | .rerank(query, documents) |
| Late Interaction | from fastembed import LateInteractionTextEmbedding | .embed(documents) |

Further Documentation

Sources: README.md

Installation Guide

Related topics: Introduction to FastEmbed, GPU Support and Acceleration

FastEmbed is a lightweight, fast, and accurate embedding library developed by Qdrant. This guide covers all aspects of installing and configuring FastEmbed for various use cases including CPU inference, GPU acceleration, and integration with Qdrant vector database.

Prerequisites

System Requirements

| Requirement | Minimum | Recommended |
| --- | --- | --- |
| Python Version | 3.9+ | 3.10+ |
| RAM | 4 GB | 8 GB+ |
| Disk Space | 2 GB | 5 GB+ |
| GPU (Optional) | CUDA 11.8+ | CUDA 12.x |

Verify your Python version before installation:

python --version

Sources: README.md:1-100

Package Variants

FastEmbed is distributed in two package variants:

| Package | Description | Use Case |
| --- | --- | --- |
| fastembed | CPU-only version | General purpose embedding generation |
| fastembed-gpu | GPU-accelerated version | High-throughput production workloads |

CPU Installation

Install the standard CPU version using pip:

pip install fastembed

Sources: README.md, RELEASE.md:1-15

GPU Installation

For CUDA-enabled GPU acceleration:

pip install fastembed-gpu

The GPU package automatically includes the CUDA Execution Provider for ONNX Runtime, enabling significantly faster inference on NVIDIA GPUs.

Sources: RELEASE.md:1-15, README.md

Installation Verification

Verify successful installation:

from fastembed import TextEmbedding

# List available models
print(TextEmbedding.list_supported_models())

# Initialize a model
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
print("FastEmbed installed successfully!")

Supported Embedding Models

FastEmbed supports multiple embedding modalities organized into the following categories:

Dense Text Embeddings

Dense text embeddings provide fixed-dimensional vector representations for text. Supported models include:

| Model | Dimension | Languages | Token Limit | License |
| --- | --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | 384 | English | 512 | MIT |
| BAAI/bge-base-en-v1.5 | 768 | English | 512 | MIT |
| BAAI/bge-large-en-v1.5 | 1024 | English | 512 | MIT |
| jinaai/jina-embeddings-v2-base-en | 768 | English | 8192 | Apache 2.0 |
| jinaai/jina-embeddings-v2-small-en | 512 | English | 8192 | Apache 2.0 |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | English | 256 | Apache 2.0 |
| mixedbread-ai/mxbai-embed-large-v1 | 1024 | English | 512 | Apache 2.0 |

Sources: fastembed/text/onnx_embedding.py

Multilingual Models

| Model | Dimension | Languages | Token Limit | License |
| --- | --- | --- | --- | --- |
| intfloat/multilingual-e5-small | 384 | ~100 | 512 | MIT |
| intfloat/multilingual-e5-large | 1024 | ~100 | 512 | MIT |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 768 | ~50 | 384 | Apache 2.0 |
| jinaai/jina-embeddings-v2-base-de | 768 | German, English | 8192 | Apache 2.0 |
| jinaai/jina-embeddings-v2-base-zh | 768 | Chinese, English | 8192 | Apache 2.0 |
| jinaai/jina-embeddings-v2-base-es | 768 | Spanish, English | 8192 | Apache 2.0 |

Sources: fastembed/text/pooled_normalized_embedding.py, fastembed/text/pooled_embedding.py

Sparse Embeddings

Sparse embeddings represent text using high-dimensional sparse vectors, useful for keyword-based retrieval.

| Model | Type | License |
| --- | --- | --- |
| prithivida/Splade_PP_en_v1 | SPLADE++ | Apache 2.0 |
| Qdrant/bm25 | BM25 | Apache 2.0 |

Sources: fastembed/sparse/bm25.py, README.md

Image Embeddings

| Model | Dimension | Type | License |
| --- | --- | --- | --- |
| Qdrant/resnet50-onnx | 2048 | Image | Apache 2.0 |
| Qdrant/Unicom-ViT-B-16 | 768 | Multimodal | Apache 2.0 |
| Qdrant/Unicom-ViT-B-32 | 512 | Multimodal | Apache 2.0 |
| jinaai/jina-clip-v1 | 768 | Multimodal (text&image) | Apache 2.0 |

Sources: fastembed/image/onnx_embedding.py

Reranking Models

Cross-encoder models for re-ranking search results:

| Model | Context Length | License |
| --- | --- | --- |
| Xenova/ms-marco-MiniLM-L-12-v2 | - | Apache 2.0 |
| BAAI/bge-reranker-base | - | MIT |
| jinaai/jina-reranker-v1-tiny-en | 8K | Apache 2.0 |
| jinaai/jina-reranker-v1-turbo-en | 8K | Apache 2.0 |
| jinaai/jina-reranker-v2-base-multilingual | 1K | CC-BY-NC-4.0 |

Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py

GPU Configuration

CUDA Provider Setup

Enable GPU acceleration by specifying the CUDA execution provider:

from fastembed import TextEmbedding

model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    providers=["CUDAExecutionProvider"]
)
print("The model BAAI/bge-small-en-v1.5 is ready to use on a GPU.")

Sources: README.md

GPU Installation Workflow

graph TD
    A[Install fastembed-gpu] --> B{Check CUDA Version}
    B -->|CUDA 11.8+| C[Install Compatible Driver]
    B -->|CUDA < 11.8| D[Upgrade CUDA]
    C --> E[Verify ONNX Runtime GPU Support]
    E --> F[Import TextEmbedding]
    F --> G[Configure Providers]
    G --> H[GPU Inference Ready]

Qdrant Integration

FastEmbed integrates seamlessly with the Qdrant vector database for production deployments.

Installation with Qdrant Client

# Standard Qdrant with FastEmbed
pip install qdrant-client[fastembed]

# GPU-accelerated FastEmbed with Qdrant
pip install qdrant-client[fastembed-gpu]

On zsh shells, use quotes:

pip install 'qdrant-client[fastembed]'

Sources: README.md

Complete Qdrant Integration Example

from qdrant_client import QdrantClient, models

# Initialize the client
client = QdrantClient("localhost", port=6333)  # For production
# client = QdrantClient(":memory:")  # For experimentation

model_name = "sentence-transformers/all-MiniLM-L6-v2"
payload = [
    {"document": "Qdrant has Langchain integrations", "source": "Langchain-docs"},
    {"document": "Qdrant also has Llama Index integrations", "source": "LlamaIndex-docs"},
]
docs = [models.Document(text=data["document"], model=model_name) for data in payload]
ids = [42, 2]

client.create_collection(
    "demo_collection",
    vectors_config=models.VectorParams(
        size=client.get_embedding_size(model_name), distance=models.Distance.COSINE
    )
)

client.upload_collection(
    collection_name="demo_collection",
    vectors=docs,
    ids=ids,
    payload=payload,
)

search_result = client.query_points(
    collection_name="demo_collection",
    query=docs[0],
    limit=5,
)

Sources: README.md

Cache Configuration

Default Cache Location

Models are cached in fastembed_cache within the system's temp directory by default. This location can be customized using environment variables.

FASTEMBED_CACHE_PATH

Set a custom cache directory:

export FASTEMBED_CACHE_PATH=/path/to/custom/cache

The cache directory structure follows Hugging Face's conventions, with models organized by their source repository.
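The lookup described above can be sketched as a small function: use the environment variable if it is set, otherwise fall back to `fastembed_cache` under the system temp directory. This is an illustration of the documented behavior, not the library's own code; the helper name `resolve_cache_dir` and the rule that an explicit argument takes precedence over the environment variable are assumptions.

```python
import os
import tempfile
from pathlib import Path

def resolve_cache_dir(cache_dir=None, env=os.environ):
    """Pick a cache directory: explicit argument, then env var, then temp dir."""
    if cache_dir is not None:  # assumed precedence: explicit argument wins
        return Path(cache_dir)
    env_path = env.get("FASTEMBED_CACHE_PATH")
    if env_path:
        return Path(env_path)
    return Path(tempfile.gettempdir()) / "fastembed_cache"

print(resolve_cache_dir())
```

Passing `env` as a parameter keeps the function easy to test without modifying the real process environment.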

Model Loading Behavior

graph TD
    A[Import FastEmbed Module] --> B{Model in Cache?}
    B -->|Yes| C[Load from Cache]
    B -->|No| D[Download from Source]
    D --> E{HuggingFace Available?}
    E -->|Yes| F[Download from HF Hub]
    E -->|No| G[Use URL Source]
    C --> H[Initialize ONNX Session]
    F --> H
    G --> H
    H --> I[Model Ready for Inference]
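
The flow in the diagram can be sketched as a function that checks a cache first and only then tries the download sources in order. Everything here is hypothetical: `load_model`, the dictionary used as a cache, and the two downloader callables are stand-ins for illustration, and treating `ConnectionError` as the signal that Hugging Face is unavailable is an assumption.

```python
def load_model(name, cache, download_from_hf, download_from_url):
    """Return (model, origin): cache hit first, then HF Hub, then URL fallback."""
    if name in cache:
        return cache[name], "cache"
    try:
        model = download_from_hf(name)
        origin = "hf"
    except ConnectionError:
        # Hugging Face is unreachable, so fall back to the URL source.
        model = download_from_url(name)
        origin = "url"
    cache[name] = model  # later loads are served from the cache
    return model, origin

def failing_hf(name):
    raise ConnectionError("hub unreachable")

cache = {}
first = load_model("demo", cache, failing_hf, lambda n: f"{n}-from-url")
second = load_model("demo", cache, failing_hf, lambda n: f"{n}-from-url")
print(first, second)
```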

Usage Examples

Dense Text Embedding

from fastembed import TextEmbedding

documents = [
    "passage: FastEmbed is a fast embedding library",
    "query: What is FastEmbed?",
    "passage: Qdrant is a vector database",
]

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = list(model.embed(documents))

print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimension: {embeddings[0].shape}")

Sparse Text Embedding (SPLADE++)

from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
embeddings = list(model.embed(documents))

# Output format:
# [
#   SparseEmbedding(indices=[ 17, 123, 919, ... ], values=[0.71, 0.22, 0.39, ...]),
#   SparseEmbedding(indices=[ 38,  12,  91, ... ], values=[0.11, 0.22, 0.39, ...])
# ]

Sources: README.md

Image Embedding

from fastembed import ImageEmbedding
from PIL import Image

model = ImageEmbedding(model_name="Qdrant/resnet50-onnx")
images = [Image.open("path/to/image.jpg")]
embeddings = list(model.embed(images))

Custom Model Source

Load a supported model with custom configuration:

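from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource

# Register the custom configuration, then load the model by name
TextEmbedding.add_custom_model(
    model="intfloat/multilingual-e5-small",
    pooling=PoolingType.MEAN,
    normalization=True,
    sources=ModelSource(hf="intfloat/multilingual-e5-small"),
    dim=384,
    model_file="onnx/model.onnx",
)
model = TextEmbedding(model_name="intfloat/multilingual-e5-small")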

embeddings = list(model.embed(documents))

Sources: README.md

Late Interaction Models

Late interaction models like ColBERT enable more sophisticated similarity matching:

from fastembed import LateInteractionTextEmbedding

model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
embeddings = list(model.embed(documents))

Post-Processing with Muvera

The Muvera post-processor enables Fixed Dimensional Encoding (FDE) for multi-vector models:

from fastembed import LateInteractionTextEmbedding
from fastembed.postprocess import Muvera
import numpy as np

model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
muvera = Muvera.from_multivector_model(
    model=model,
    k_sim=6,
    dim_proj=32
)

embeddings = np.array(list(model.embed(["sample text"])))
fde = muvera.process_document(embeddings[0])

Sources: fastembed/postprocess/muvera.py

Troubleshooting

Common Installation Issues

| Issue | Solution |
| --- | --- |
| ModuleNotFoundError: No module named 'fastembed' | Run pip install fastembed |
| CUDA not available | Install fastembed-gpu and verify NVIDIA driver |
| Model download fails | Check network connectivity and HuggingFace access |
| Out of memory | Reduce batch size or use smaller model variant |

Checking Installed Version

import fastembed
print(fastembed.__version__)

Verifying GPU Availability

import onnxruntime as ort
print(f"Available providers: {ort.get_available_providers()}")
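
Given that list, one way to choose providers is to prefer CUDA when it is present and otherwise fall back to CPU. The sketch below takes the list as a plain argument so it runs without ONNX Runtime installed; `pick_providers` is a hypothetical helper, while the two provider names are the standard ONNX Runtime identifiers.

```python
def pick_providers(available):
    """Keep the preferred providers that are actually available, in priority order."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

# Example inputs standing in for ort.get_available_providers() output.
gpu_machine = pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"])
cpu_machine = pick_providers(["CPUExecutionProvider"])
print(gpu_machine, cpu_machine)
```

The result can then be passed as the `providers` argument when constructing a model.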

Advanced Configuration

Lazy Loading

Enable lazy model loading for memory-efficient initialization:

from fastembed import TextEmbedding

model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    lazy_load=True  # Model loads on first inference call
)

Thread Configuration

Optimize CPU thread usage:

from fastembed import TextEmbedding

model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    threads=4  # Limit to 4 threads
)

Device Selection

from fastembed import TextEmbedding
from fastembed.common.types import Device

model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    cuda=True  # or cuda=Device.AUTO for automatic detection
)

Documentation Resources

For more information, refer to the official FastEmbed documentation.

Sources: mkdocs.yml

Sources: README.md:1-100

Quick Start Guide

Related topics: Introduction to FastEmbed, Text Embedding Module, Image Embedding Module

FastEmbed is a lightweight, fast, and accurate embedding library developed by Qdrant. It provides text and image embeddings using ONNX-based models optimized for production deployment. This guide covers the essential steps to get started with FastEmbed for your embedding needs.

Overview

FastEmbed enables developers to generate high-quality vector embeddings for text and images with minimal configuration. The library supports multiple embedding types including dense embeddings, sparse embeddings, and cross-encoder reranking models.

graph TD
    A[FastEmbed Library] --> B[Text Embeddings]
    A --> C[Image Embeddings]
    A --> D[Sparse Embeddings]
    A --> E[Reranking Models]
    
    B --> B1[Dense Models]
    B --> B2[Pooled Models]
    
    D --> D1[SPLADE++]
    D --> D2[BM25]
    D --> D3[MiniCOIL]

Installation

Install FastEmbed using pip:

pip install fastembed

For CUDA acceleration support:

pip install fastembed-gpu

Basic Text Embedding

Loading a Model

Import and initialize a text embedding model:

from fastembed import TextEmbedding

# Use the default model (BAAI/bge-small-en-v1.5)
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

Generating Embeddings

documents = [
    "passage: A man is eating food.",
    "passage: A man is eating a piece of broccoli.",
    "passage: A man is eating pasta.",
    "passage: A woman is cutting vegetables.",
]

embeddings = list(model.embed(documents))

The "query:" and "passage:" prefixes tell the model what kind of text it is embedding. Some models, such as intfloat/multilingual-e5-small and snowflake/snowflake-arctic-embed-xs, require them; for BAAI/bge-small-en-v1.5 they are optional.

Sources: README.md

Supported Models

Text Embedding Models

| Model Name | Dimension | Languages | Max Tokens | License | Size (GB) |
| --- | --- | --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | 384 | English | 512 | MIT | 0.067 |
| BAAI/bge-base-en-v1.5 | 768 | English | 512 | MIT | 0.21 |
| BAAI/bge-large-en-v1.5 | 1024 | English | 512 | MIT | 1.20 |
| intfloat/multilingual-e5-small | 384 | Multilingual (~100) | 512 | MIT | 0.09 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 | Multilingual (~50) | 512 | Apache 2.0 | 0.22 |
| jinaai/jina-embeddings-v2-base-en | 768 | English | 8192 | Apache 2.0 | 0.52 |
| snowflake/snowflake-arctic-embed-xs | 384 | English | 512 | Apache 2.0 | 0.09 |
| snowflake/snowflake-arctic-embed-m-long | 768 | English | 2048 | Apache 2.0 | 0.54 |

Sources: fastembed/text/onnx_embedding.py

Pooled Normalized Embeddings

The PooledNormalizedEmbedding class applies mean pooling to the model output and normalizes the result:

from fastembed.text.pooled_normalized_embedding import PooledNormalizedEmbedding

model = PooledNormalizedEmbedding(model_name="jinaai/jina-embeddings-v2-base-en")

The post-processing combines mean pooling with L2 normalization:

def _post_process_onnx_output(self, output: OnnxOutputContext, **kwargs: Any) -> Iterable[NumpyArray]:
    embeddings = output.model_output
    attn_mask = output.attention_mask
    return normalize(self.mean_pooling(embeddings, attn_mask))

Sources: fastembed/text/pooled_normalized_embedding.py:78-84
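
To make the two steps concrete, here is a plain-Python sketch of masked mean pooling followed by L2 normalization for a single sequence. It is an illustration under simplifying assumptions (lists instead of NumPy arrays, one sequence instead of a batch), and the function names are chosen for this example rather than taken from the library.

```python
import math

def mean_pooling(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                summed[i] += x
    return [s / max(count, 1) for s in summed]

def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Two real tokens plus one padding token that the mask excludes.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pooling(tokens, mask)
unit = l2_normalize(pooled)
print(pooled, unit)
```

The padding token has large values on purpose: because its mask entry is 0, it has no effect on the pooled vector.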

Image Embeddings

FastEmbed supports multimodal models for image embeddings:

from fastembed import ImageEmbedding

model = ImageEmbedding(model_name="Qdrant/Unicom-ViT-B-16")

Supported Image Models

| Model Name | Dimension | Type | License | Size (GB) |
| --- | --- | --- | --- | --- |
| Qdrant/Unicom-ViT-B-16 | 768 | Multimodal (text&image) | Apache 2.0 | 0.82 |
| Qdrant/Unicom-ViT-B-32 | 512 | Multimodal (text&image) | Apache 2.0 | 0.48 |
| jinaai/jina-clip-v1 | 768 | Multimodal (text&image) | Apache 2.0 | 0.34 |

Sources: fastembed/image/onnx_embedding.py

Sparse Embeddings

Sparse embeddings provide interpretable, non-dense vector representations useful for keyword-aware semantic search.

SPLADE++

SPLADE++ uses a neural network to generate sparse embeddings that combine semantic understanding with keyword matching:

from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
embeddings = list(model.embed(documents))

Returns sparse vectors with indices and values:

[
  SparseEmbedding(indices=[17, 123, 919, ...], values=[0.71, 0.22, 0.39, ...]),
  SparseEmbedding(indices=[38, 12, 91, ...], values=[0.11, 0.22, 0.39, ...])
]

BM25

Traditional BM25 implemented as sparse embeddings:

from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="Qdrant/bm25")

BM25 formula:

score(q, d) = SUM[ IDF(q_i) * (f(q_i, d) * (k + 1)) / (f(q_i, d) + k * (1 - b + b * (|d| / avg_len))) ]

Sources: fastembed/sparse/bm25.py:47-52

MiniCOIL

MiniCOIL combines semantic embeddings with exact keyword matching:

from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="Qdrant/minicoil-v1")

Reranking Models

Cross-encoder reranking improves search results by re-scoring candidate documents:

from fastembed import TextCrossEncoder

model = TextCrossEncoder(model_name="jinaai/jina-reranker-v1-turbo-en")

Supported Reranker Models

| Model Name | License | Size (GB) | Context Length |
| --- | --- | --- | --- |
| jinaai/jina-reranker-v1-turbo-en | Apache 2.0 | 0.15 | 8K |
| jinaai/jina-reranker-v2-base-multilingual | CC BY-NC 4.0 | 1.11 | 1K (sliding window) |

Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py

Common Configuration Options

Initialization Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | "BAAI/bge-small-en-v1.5" | Name of the model to use |
| cache_dir | str or None | None | Cache directory path |
| threads | int or None | None | Number of threads for ONNX execution |
| providers | Sequence[OnnxProvider] | None | ONNX execution providers (CPU, CUDA, etc.) |
| cuda | bool or Device | Device.AUTO | Enable CUDA acceleration |
| device_ids | list[int] | None | Specific GPU device IDs |
| lazy_load | bool | False | Defer model loading until first use |

Sources: fastembed/text/onnx_embedding.py:123-136

Complete Example Workflow

graph LR
    A[Input Documents] --> B[TextEmbedding]
    B --> C[Embedding Vectors]
    C --> D[Vector Database]
    D --> E[Similarity Search]
    E --> F[Candidate Results]
    F --> G[TextCrossEncoder]
    G --> H[Reranked Results]

from fastembed import TextEmbedding, SparseTextEmbedding, TextCrossEncoder

# 1. Load models
text_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
sparse_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
reranker = TextCrossEncoder(model_name="jinaai/jina-reranker-v1-turbo-en")

# 2. Generate dense embeddings
documents = ["query: fastembed is fast", "passage: FastEmbed provides efficient embeddings"]
dense_embeddings = list(text_model.embed(documents))

# 3. Generate sparse embeddings
sparse_embeddings = list(sparse_model.embed(documents))

# 4. Rerank results: rerank() yields one relevance score per document
query = "What is FastEmbed?"
scores = list(reranker.rerank(query, documents))

Language-Specific Models

For non-English content, use multilingual models:

| Language | Recommended Model |
| --- | --- |
| Chinese | BAAI/bge-small-zh-v1.5, jinaai/jina-embeddings-v2-base-zh |
| German | jinaai/jina-embeddings-v2-base-de |
| Spanish | jinaai/jina-embeddings-v2-base-es |
| Code | jinaai/jina-embeddings-v2-base-code |
| Multilingual | intfloat/multilingual-e5-small, sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |

Prefix Requirements

Some models require specific prefixes to distinguish query and passage text:

| Model | Prefix Required | Example |
| --- | --- | --- |
| BAAI/bge-small-en-v1.5 | Optional | "query: ...", "passage: ..." |
| intfloat/multilingual-e5-small | Yes | "query: ...", "passage: ..." |
| snowflake/snowflake-arctic-embed-xs | Yes | "query: ...", "passage: ..." |
| jinaai/jina-embeddings-v2-base-en | Not necessary | Plain text |

Sources: fastembed/text/onnx_embedding.py

Next Steps

Sources: fastembed/text/onnx_embedding.py

System Architecture

Related topics: Introduction to FastEmbed, ONNX Model Infrastructure

FastEmbed is a lightweight, fast text and image embedding library built on ONNX Runtime. The architecture is designed around modularity, enabling multiple embedding types (dense, sparse, late-interaction) through a unified interface while leveraging ONNX for cross-platform inference optimization.

Architecture Overview

FastEmbed follows a layered architecture with clear separation of concerns:

graph TD
    subgraph "Public API Layer"
        A["TextEmbedding<br/>ImageEmbedding<br/>SparseTextEmbedding<br/>LateInteractionTextEmbedding"]
    end
    
    subgraph "Embedding Base Classes"
        B["TextEmbeddingBase"]
        C["ImageEmbeddingBase"]
        D["SparseTextEmbeddingBase"]
        E["LateInteractionTextEmbeddingBase"]
    end
    
    subgraph "ONNX Abstraction Layer"
        F["OnnxTextModel"]
        G["OnnxImageModel"]
    end
    
    subgraph "Model Management"
        H["ModelManagement"]
        I["OnnxModel"]
    end
    
    subgraph "Parallel Processing"
        J["ParallelProcessor"]
        K["EmbeddingWorker<br/>OnnxTextEmbeddingWorker"]
    end
    
    subgraph "ONNX Runtime"
        L["InferenceSession"]
    end
    
    A --> B
    A --> C
    A --> D
    A --> E
    B --> F
    C --> G
    F --> I
    G --> I
    I --> L
    H --> I
    J --> K
    K --> F

Core Components

Base Class Hierarchy

FastEmbed implements embedding models through inheritance hierarchies that separate concerns between the embedding interface and the ONNX runtime integration.

#### Text Embedding Base

The TextEmbeddingBase class provides the foundation for all text embedding models. It defines the abstract interface that concrete implementations must follow.

| Method | Purpose | Source |
| --- | --- | --- |
| embed() | Generate embeddings for input documents | text_embedding_base.py |
| _list_supported_models() | Return list of supported model descriptions | text_embedding_base.py |
| _post_process_onnx_output() | Transform raw ONNX output to final embeddings | text_embedding_base.py |

#### Image Embedding Base

The ImageEmbeddingBase class mirrors the text embedding architecture for image inputs, supporting multimodal models like Jina CLIP.

| Property | Type | Description |
| --- | --- | --- |
| model_name | str | HuggingFace model identifier |
| cache_dir | str or None | Local cache directory for model files |
| lazy_load | bool | Defer model loading until first use |

ONNX Model Abstraction

The OnnxModel class serves as the bridge between the embedding interface and ONNX Runtime execution.

classDiagram
    class OnnxModel~T~ {
        +str model_name
        +str cache_dir
        +InferenceSession inference_session
        +load_model() Any
        +run(input_feed)~T~
    }
    
    class OnnxTextModel~T~ {
        +mean_pooling(output, attention_mask)
        +encode(input_data) Iterable~T~
    }
    
    class OnnxImageModel~T~ {
        +preprocess_image(image) Tensor
        +encode(images) Iterable~T~
    }
    
    OnnxModel <|-- OnnxTextModel
    OnnxModel <|-- OnnxImageModel

The ONNX model abstraction provides:

  1. Model Loading: Downloads and caches ONNX models from HuggingFace or custom URLs
  2. Session Management: Creates and configures ONNX Runtime inference sessions with specified providers (CPU, CUDA, TensorRT)
  3. Input/Output Handling: Manages input preprocessing and output postprocessing

Embedding Types

FastEmbed supports multiple embedding paradigms through specialized classes.

Dense Embeddings

Dense embeddings convert inputs into fixed-dimensional continuous vectors. The OnnxTextEmbedding class provides dense text embeddings using models like BGE, GTE, and Jina embeddings.

graph LR
    A["Input Text"] --> B["Tokenization"]
    B --> C["ONNX Inference"]
    C --> D["mean_pooling"]
    D --> E["L2 Normalization"]
    E --> F["Dense Vector"]

Key supported models:

| Model | Dimension | Context Length | Prefix Required |
| --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | 384 | 512 | No |
| BAAI/bge-base-en-v1.5 | 768 | 512 | No |
| BAAI/bge-large-en-v1.5 | 1024 | 512 | No |
| jinaai/jina-embeddings-v2-base-en | 768 | 8192 | No |
| intfloat/multilingual-e5-small | 384 | 512 | Yes |

Pooled Normalized Embeddings

The PooledNormalizedEmbedding class extends PooledEmbedding with built-in mean pooling and L2 normalization.

# From pooled_normalized_embedding.py
class PooledNormalizedEmbedding(PooledEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        if output.attention_mask is None:
            raise ValueError("attention_mask must be provided for document post-processing")
        
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return normalize(self.mean_pooling(embeddings, attn_mask))

Sparse Embeddings

Sparse embeddings represent documents as sparse vectors with non-zero values only at relevant token positions. FastEmbed provides two sparse embedding approaches:

#### BM25

BM25 is implemented as a traditional sparse embedding model with IDF weighting. The formula used:

score(q, d) = Σ[ IDF(q_i) * (f(q_i, d) * (k + 1)) / (f(q_i, d) + k * (1 - b + b * (|d| / avg_len))) ]

Where:

  • IDF(q_i) is the inverse document frequency
  • f(q_i, d) is the term frequency
  • k, b are hyperparameters controlling saturation and length normalization
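To make the formula concrete, here is a minimal sketch in plain Python. The function name `bm25_score`, the `idf` mapping, and the toy numbers are assumptions for illustration only; this is not FastEmbed's implementation or API.

```python
def bm25_score(query_terms, doc_terms, idf, avg_len, k, b):
    """Score one document against a query with the BM25 formula above.

    `idf` is a hypothetical mapping term -> IDF weight supplied by the caller.
    """
    doc_len = len(doc_terms)  # |d|
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # f(q_i, d)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        denominator = tf + k * (1 - b + b * doc_len / avg_len)
        score += idf.get(term, 0.0) * tf * (k + 1) / denominator
    return score


# Toy example: the document length equals the average length.
example = bm25_score(["a"], ["a", "b", "a"], idf={"a": 2.0}, avg_len=3, k=1.5, b=0.75)
```

With these inputs the term frequency is 2 and the denominator is 2 + 1.5 = 3.5, so the score is 2.0 * 2 * 2.5 / 3.5.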

#### SPLADE++

SPLADE++ models use neural networks to generate sparse embeddings that combine semantic understanding with exact keyword matching.

Late Interaction Embeddings

Late interaction models like ColBERT generate multiple token-level embeddings that can be compared efficiently during retrieval. These are often combined with postprocessors like MUVERA for fixed-dimensional encoding.
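The comparison step for token-level embeddings is commonly done with a MaxSim score: for each query token, take the best-matching document token and sum those maxima. The sketch below shows that scoring rule in plain Python; the function name `maxsim` and the toy vectors are illustrative assumptions, and the source does not confirm how FastEmbed or Qdrant compute it internally.

```python
def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Two query tokens, two document tokens, 2-dimensional toy embeddings.
score = maxsim([[1, 0], [0, 1]], [[1, 0], [0.5, 0.5]])
```

Here the first query token matches the first document token exactly (1.0) and the second query token best matches the second document token (0.5).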

Model Management

Model Source Configuration

Models can be loaded from multiple sources defined through the ModelSource class:

sources=ModelSource(
    hf="jinaai/jina-embeddings-v2-base-en",      # HuggingFace Hub
    url="https://storage.googleapis.com/...",      # Direct URL download
    _deprecated_tar_struct=True                    # Legacy tar format
)

Cache Strategy

The ModelManagement class handles model caching and lazy loading:

  1. Cache Directory: Configurable via FASTEMBED_CACHE_PATH environment variable or cache_dir parameter
  2. Lazy Loading: Models are not loaded until first inference when lazy_load=True
  3. Provider Selection: Automatic provider selection (CUDA > CPU) with fallback

graph TD
    A["Model Request"] --> B{"Cache Hit?"}
    B -->|Yes| C["Load from Cache"]
    B -->|No| D["Download Model"]
    D --> E["Store in Cache"]
    C --> F["Initialize ONNX Session"]
    E --> F
    F --> G["Ready for Inference"]
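The cache-hit-or-download flow can be sketched in a few lines of standard-library Python. Everything here is hypothetical: `resolve_cache_dir`, `get_model_file`, the `download` callable, the directory naming, and the precedence of the `cache_dir` argument over `FASTEMBED_CACHE_PATH` are assumptions for illustration, not FastEmbed's actual code.

```python
import os
import tempfile
from pathlib import Path


def resolve_cache_dir(cache_dir=None):
    """Assumed precedence: explicit argument, then FASTEMBED_CACHE_PATH, then a temp dir."""
    if cache_dir is not None:
        return Path(cache_dir)
    env = os.environ.get("FASTEMBED_CACHE_PATH")
    if env:
        return Path(env)
    return Path(tempfile.gettempdir()) / "fastembed_cache"


def get_model_file(model_name, filename, download, cache_dir=None):
    """Return the cached file path, calling `download` only on a cache miss."""
    target = resolve_cache_dir(cache_dir) / model_name.replace("/", "--") / filename
    if not target.exists():  # cache miss
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(download())  # store in cache
    return target  # cache hit, or freshly stored file
```

A second call with the same arguments returns the same path without invoking `download` again, which is the "Cache Hit" branch of the diagram.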

Parallel Processing

FastEmbed uses a worker-based parallel processing architecture for efficient batch inference.

ParallelProcessor

The ParallelProcessor class manages a pool of workers for parallel embedding generation:

graph TD
    subgraph "Main Process"
        A["ParallelProcessor"]
        B["Input Queue"]
    end
    
    subgraph "Worker Pool"
        C["Worker 1"]
        D["Worker 2"]
        E["Worker N"]
    end
    
    B --> C
    B --> D
    B --> E
    
    C --> F["Result Queue"]
    D --> F
    E --> F

Worker Classes

Workers inherit from EmbeddingWorker or specialized variants:

| Worker Class | Purpose |
| --- | --- |
| EmbeddingWorker | Base worker for generic embeddings |
| OnnxTextEmbeddingWorker | Text embedding with ONNX inference |
| PooledNormalizedEmbeddingWorker | Worker with pooling and normalization |

Workers implement the init_embedding() method to initialize the embedding model:

def init_embedding(
    self,
    model_name: str,
    cache_dir: str | None = None,
    threads: int | None = None,
    providers: Sequence[OnnxProvider] | None = None,
    cuda: bool | Device = Device.AUTO,
    device_ids: list[int] | None = None,
    lazy_load: bool = False,
    device_id: int | None = None,
    specific_model_path: str | None = None,
    **kwargs: Any,
):

Cross-Encoder Reranking

The cross-encoder reranking system follows a separate architectural path:

graph LR
    A["Query"] --> D["Cross-Encoder"]
    B["Candidate Doc"] --> D
    D --> E["Relevance Scores"]

The OnnxTextCrossEncoder class provides reranking through:

class OnnxTextCrossEncoder(TextCrossEncoderBase, OnnxCrossEncoderModel):
    @classmethod
    def _list_supported_models(cls) -> list[BaseModelDescription]:
        return supported_onnx_models

Supported reranker models:

| Model | Context Length | Languages |
| --- | --- | --- |
| jinaai/jina-reranker-v1-turbo-en | 1K | English |
| jinaai/jina-reranker-v2-base-multilingual | 1K (sliding window) | Multilingual |

Device and Provider Management

FastEmbed supports multiple execution providers with automatic device selection:

# Device selection logic (simplified pseudocode)
if CUDA is requested and the CUDA provider is available:
    use CUDA provider
elif CUDA is requested but unavailable:
    fall back to CPU provider
else:
    use default provider

Supported Providers

| Provider | Device | Use Case |
| --- | --- | --- |
| CPUExecutionProvider | CPU | General purpose, no GPU |
| CUDAExecutionProvider | NVIDIA GPU | Fast inference with CUDA |
| TensorRTExecutionProvider | NVIDIA GPU | Optimized batch inference |
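A provider preference with CPU fallback can be expressed as a small pure function. This is a sketch under assumptions: `select_providers` is a hypothetical helper, not part of FastEmbed, and the list of available providers is passed in explicitly so the logic can be read and tested without a GPU.

```python
def select_providers(cuda_requested, available):
    """Pick ONNX Runtime provider names, preferring CUDA and always keeping CPU as fallback."""
    providers = []
    if cuda_requested and "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    providers.append("CPUExecutionProvider")  # CPU is always the last resort
    return providers
```

In practice the `available` argument could be obtained from `onnxruntime.get_available_providers()`, which reports the providers compiled into the installed ONNX Runtime build.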

Configuration Options

Model Initialization Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | "BAAI/bge-small-en-v1.5" | Model identifier |
| cache_dir | str \| None | None | Cache directory path |
| threads | int \| None | None | Thread count for CPU execution |
| providers | Sequence[OnnxProvider] \| None | None | ONNX execution providers |
| cuda | bool \| Device | Device.AUTO | CUDA device selection |
| device_ids | list[int] \| None | None | Specific GPU device IDs |
| lazy_load | bool | False | Defer loading until first use |
| device_id | int \| None | None | Single device ID assignment |
| specific_model_path | str \| None | None | Override model file path |

Data Flow

Text Embedding Pipeline

sequenceDiagram
    participant Client
    participant TextEmbedding
    participant OnnxTextModel
    participant ONNXRuntime
    
    Client->>TextEmbedding: embed(documents)
    TextEmbedding->>TextEmbedding: preprocess(documents)
    TextEmbedding->>OnnxTextModel: encode(preprocessed)
    OnnxTextModel->>ONNXRuntime: run(input_feed)
    ONNXRuntime-->>OnnxTextModel: raw_output
    OnnxTextModel->>OnnxTextModel: mean_pooling + normalize
    OnnxTextModel-->>TextEmbedding: embeddings
    TextEmbedding-->>Client: Iterable[NDArray]

Model Description Schema

Each model is described by a DenseModelDescription or BaseModelDescription:

DenseModelDescription(
    model="BAAI/bge-small-en-v1.5",
    dim=384,
    description="Text embeddings, Unimodal (text), English, 512 input tokens truncation",
    license="mit",
    size_in_GB=0.067,
    sources=ModelSource(hf="qdrant/bge-small-en-v1.5-onnx-q"),
    model_file="model_optimized.onnx",
)

| Field | Type | Description |
| --- | --- | --- |
| model | str | HuggingFace model identifier |
| dim | int | Embedding dimension |
| description | str | Human-readable model description |
| license | str | Model license |
| size_in_GB | float | Model file size in GB |
| sources | ModelSource | Download sources configuration |
| model_file | str | ONNX model file name |
| additional_files | list[str] | Extra files (vocab, configs) |
| requires_idf | bool | IDF file requirement for sparse models |

Post-Processing

MUVERA Post-Processor

The Muvera post-processor converts late-interaction (multi-vector) embeddings to fixed-dimensional representations:

graph TD
    A["Multi-Vector Embeddings"] --> B["Muvera Processing"]
    B --> C["Fixed-Dimensional Encoding"]
    C --> D["L2 Normalized FDE"]
    
    subgraph "Parameters"
        E["k_sim: number of partitions"]
        F["dim_proj: projection dimension"]
        G["r_reps: repetitions"]
    end

Output dimension calculation: r_reps * 2^k_sim * dim_proj
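As a worked example of the output dimension formula, the snippet below evaluates it for one set of parameter values. The helper name `muvera_output_dim` and the chosen values are illustrative assumptions, not defaults taken from the library.

```python
def muvera_output_dim(k_sim, dim_proj, r_reps):
    """Fixed-dimensional encoding size: r_reps * 2**k_sim * dim_proj."""
    return r_reps * (2 ** k_sim) * dim_proj


print(muvera_output_dim(k_sim=5, dim_proj=16, r_reps=20))  # 20 * 32 * 16 = 10240
```

Because the size doubles with every increment of k_sim, that parameter dominates the memory cost of the resulting vectors.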

Summary

The FastEmbed architecture demonstrates a well-structured approach to embedding generation:

  • Modularity: Clear separation between embedding types, ONNX abstraction, and model management
  • Performance: ONNX Runtime integration with automatic provider selection
  • Flexibility: Support for dense, sparse, and late-interaction embeddings
  • Extensibility: Worker-based parallel processing with lazy loading support
  • Portability: Cross-platform ONNX execution with CUDA and TensorRT acceleration

Source: https://github.com/qdrant/fastembed / Human Manual

Text Embedding Module

Related topics: System Architecture, GPU Support and Acceleration

The Text Embedding Module in FastEmbed provides high-performance text vectorization using ONNX Runtime for efficient inference. It supports dense embeddings, pooled embeddings, normalized embeddings, and multimodal (CLIP) text embeddings.

Architecture Overview

The module follows a layered architecture with base classes providing common functionality and specialized implementations for different embedding strategies.

graph TD
    Base[TextEmbeddingBase] --> OnnxTextEmbedding
    OnnxTextEmbedding --> PooledEmbedding
    PooledEmbedding --> PooledNormalizedEmbedding
    OnnxTextEmbedding --> CLIPOnnxEmbedding
    
    Worker[OnnxTextEmbeddingWorker] -.->|init_embedding| OnnxTextEmbedding
    CLIPWorker[CLIPEmbeddingWorker] -.->|init_embedding| CLIPOnnxEmbedding
    
    Models[Supported Models] -->|BGE, Jina, GTE, etc| OnnxTextEmbedding

Core Components

| Component | File | Purpose |
| --- | --- | --- |
| TextEmbeddingBase | text_embedding_base.py | Abstract base class defining the embedding interface |
| OnnxTextEmbedding | onnx_embedding.py | Main ONNX-based text embedding implementation |
| PooledEmbedding | pooled_embedding.py | Mean pooling variant for document embeddings |
| PooledNormalizedEmbedding | pooled_normalized_embedding.py | L2-normalized pooled embeddings |
| CLIPOnnxEmbedding | clip_embedding.py | CLIP-based multimodal text embeddings |

Supported Models

FastEmbed's Text Embedding Module supports numerous pre-trained models across different languages and use cases.

Model Categories

| Category | Models | Dim | License | Description |
| --- | --- | --- | --- | --- |
| BGE (Base) | BAAI/bge-base-en-v1.5 | 768 | MIT | English text, 512 tokens |
| BGE Large | BAAI/bge-large-en-v1.5 | 1024 | MIT | High-quality English embeddings |
| BGE Small | BAAI/bge-small-en-v1.5 | 384 | MIT | Lightweight English embeddings |
| Jina | jinaai/jina-embeddings-v2-base-en | 768 | Apache 2.0 | English, 8192 tokens |
| Jina Code | jinaai/jina-embeddings-v2-base-code | 768 | Apache 2.0 | 30 programming languages |
| GTE | thenlper/gte-base | 768 | MIT | General text embeddings |
| Snowflake Arctic | snowflake/snowflake-arctic-embed-m | 768 | Apache 2.0 | Query-optimized embeddings |
| Multilingual E5 | intfloat/multilingual-e5-large | 1024 | MIT | 100 languages support |

Model Selection Criteria

Different models have varying requirements for query/document prefixes:

# Models requiring prefixes
prefix_required = ["intfloat/multilingual-e5-small", "intfloat/multilingual-e5-large"]

# Models not requiring prefixes  
prefix_not_required = ["BAAI/bge-base-en-v1.5", "jinaai/jina-embeddings-v2-base-en"]

Sources: fastembed/text/onnx_embedding.py:1-200
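One way to keep prefix handling in a single place is a small helper that adds the prefix only for models that need it. The helper below is a hypothetical sketch, not a FastEmbed function; the model set is copied from the list above, and the "query" and "passage" prefix strings follow the E5 convention used in this manual's usage example.

```python
PREFIX_REQUIRED = {"intfloat/multilingual-e5-small", "intfloat/multilingual-e5-large"}


def with_prefix(model_name, text, kind):
    """Prepend 'query: ' or 'passage: ' for models that expect it; pass through otherwise."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    if model_name in PREFIX_REQUIRED:
        return f"{kind}: {text}"
    return text
```

Calling `with_prefix("intfloat/multilingual-e5-small", "what is ONNX?", "query")` yields a prefixed string, while the same call with a BGE model name returns the text unchanged.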

Class Hierarchy

TextEmbeddingBase

The abstract base class defining the contract for all text embedding implementations.

class TextEmbeddingBase(EmbeddingModel[list[float]]):
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        ...
    
    def embed(self, documents: Iterable[str]) -> Iterable[list[float]]:
        ...

Sources: fastembed/text/text_embedding.py

OnnxTextEmbedding

The primary implementation class that leverages ONNX runtime for inference.

class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[NumpyArray]):
    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    )

#### Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | "BAAI/bge-small-en-v1.5" | Name of the model to use |
| cache_dir | str \| None | None | Cache directory path |
| threads | int \| None | None | Number of threads for ONNX |
| providers | Sequence[OnnxProvider] \| None | None | ONNX execution providers |
| cuda | bool \| Device | Device.AUTO | CUDA acceleration |
| device_ids | list[int] \| None | None | GPU device IDs |
| lazy_load | bool | False | Load model on first use |

Sources: fastembed/text/onnx_embedding.py:200-250

Pooling Strategies

The module implements multiple pooling strategies to aggregate token-level embeddings into sentence-level embeddings.

Mean Pooling

Mean pooling computes the average of all token embeddings weighted by the attention mask.

def mean_pooling(self, embeddings: NumpyArray, attention_mask: NumpyArray) -> NumpyArray:
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dimension
    attention_mask_expanded = np.expand_dims(attention_mask, -1)
    # Sum embeddings where the mask is active
    sum_embeddings = np.sum(embeddings * attention_mask_expanded, axis=1)
    # Count valid tokens per sequence
    counts = np.sum(attention_mask_expanded, axis=1)
    # Return the mean, guarding against division by zero for fully masked rows
    return sum_embeddings / np.maximum(counts, 1e-9)
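To see what masked averaging does on concrete numbers, here is a dependency-free sketch for a single sequence. The function name `mean_pool` and the toy values are illustrative assumptions; the library version operates on batched NumPy arrays.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (single sequence)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    return [t / max(count, 1) for t in total]  # max() avoids division by zero


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The padding row is ignored entirely, so padded and unpadded versions of the same text pool to the same vector.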

PooledEmbedding

Applies mean pooling after ONNX inference to generate document embeddings.

class PooledEmbedding(OnnxTextEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return self.mean_pooling(embeddings, attn_mask)

PooledNormalizedEmbedding

Extends pooling with L2 normalization for cosine similarity optimization.

class PooledNormalizedEmbedding(PooledEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        if output.attention_mask is None:
            raise ValueError("attention_mask must be provided for document post-processing")

        embeddings = output.model_output
        attn_mask = output.attention_mask
        return normalize(self.mean_pooling(embeddings, attn_mask))

Sources: fastembed/text/pooled_embedding.py, fastembed/text/pooled_normalized_embedding.py
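L2 normalization matters for cosine similarity because, once vectors have unit length, a plain dot product equals their cosine similarity. The sketch below demonstrates this with standard-library Python; `l2_normalize` and `dot` are illustrative helpers, not the library's `normalize` function.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


a = l2_normalize([3.0, 4.0])  # [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # [0.8, 0.6]
cosine = dot(a, b)            # equals the cosine similarity of the raw vectors
```

This is why normalized embeddings can be searched with the cheaper dot-product metric without changing the ranking.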

CLIP Text Embeddings

The CLIP embedding implementation provides multimodal text/image embedding capabilities.

Supported CLIP Models

| Model | Dimension | License | Description |
| --- | --- | --- | --- |
| Qdrant/clip-ViT-B-32-text | 512 | MIT | CLIP ViT-B/32 text encoder |
| jinaai/jina-clip-v1 | 768 | Apache 2.0 | Jina CLIP multimodal |

CLIPOnnxEmbedding

class CLIPOnnxEmbedding(OnnxTextEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        return output.model_output  # Direct passthrough, no pooling

Sources: fastembed/text/clip_embedding.py

Inference Workflow

graph LR
    A[Input Text] --> B[Tokenization]
    B --> C[ONNX Inference]
    C --> D{Embedding Type?}
    D -->|Standard| E[Direct Output]
    D -->|Pooled| F[Mean Pooling]
    D -->|Pooled Norm| G[Mean Pooling + Normalize]
    E --> H[Final Embeddings]
    F --> H
    G --> H

Usage Examples

Basic Dense Embedding

from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")
documents = [
    "The quick brown fox jumps over the lazy dog",
    "A journey of a thousand miles begins with a single step"
]

embeddings = list(model.embed(documents))
# Returns: list of 768-dimensional embedding vectors

Pooled Normalized Embedding

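from fastembed.text.pooled_normalized_embedding import PooledNormalizedEmbedding

documents = ["The quick brown fox jumps over the lazy dog"]

# model_name must be one that PooledNormalizedEmbedding lists as supported
model = PooledNormalizedEmbedding(model_name="BAAI/bge-base-en-v1.5")
embeddings = list(model.embed(documents))
# Returns: L2-normalized pooled embeddings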

With Custom ONNX Providers

from fastembed import TextEmbedding

model = TextEmbedding(
    model_name="BAAI/bge-large-en-v1.5",
    providers=["CPUExecutionProvider"],  # provider names are passed as strings
    threads=8,
)

Multilingual E5 with Prefix

from fastembed import TextEmbedding

model = TextEmbedding(model_name="intfloat/multilingual-e5-small")

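# E5 models expect "query: " and "passage: " prefixes
user_query = "How do I generate embeddings?"
document = "FastEmbed generates embeddings with ONNX Runtime."

query_text = "query: " + user_query
document_text = "passage: " + document

query_embedding = list(model.embed([query_text]))
doc_embedding = list(model.embed([document_text]))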

Model Sources Configuration

Models can be loaded from multiple sources:

from fastembed.common.model_description import ModelSource

sources=ModelSource(
    hf="xenova/jina-embeddings-v2-base-en",      # HuggingFace Hub
    url="https://storage.googleapis.com/...",     # Direct URL
    _deprecated_tar_struct=True                   # Legacy format
)

| Source Type | Priority | Description |
| --- | --- | --- |
| hf | Primary | HuggingFace Hub repository |
| url | Fallback | Direct download URL |
| Local cache | Cached | Previously downloaded files |

Post-Processing Pipeline

graph TD
    subgraph "ONNX Output"
        A[model_output] --> B[attention_mask]
    end
    
    subgraph "Post-Processing"
        B --> C{Masking Required?}
        A --> C
        C -->|Yes| D[Apply Attention Mask]
        D --> E{Normalization?}
        C -->|No| E
        E -->|Yes| F[L2 Normalize]
        E -->|No| G[Return Raw]
        F --> H[Final Output]
        G --> H
    end

Configuration Constants

| Parameter | Default | Description |
| --- | --- | --- |
| default_model | "BAAI/bge-small-en-v1.5" | Fallback model |
| pooling_type | PoolingType.MEAN | Pooling strategy |
| normalization | True | L2 normalization flag |

Sources: fastembed/text/onnx_text_model.py

Integration with Qdrant

FastEmbed text embeddings are designed for seamless integration with Qdrant vector database:

from fastembed import TextEmbedding
import qdrant_client

model = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")
embeddings = list(model.embed(documents))

# Upload to Qdrant
client = qdrant_client.QdrantClient()
client.upsert(
    collection_name="text_embeddings",
    points=[...]
)

Performance Considerations

Token Truncation Limits

| Model Type | Max Tokens | Notes |
| --- | --- | --- |
| BGE Small/Large | 512 | Standard context |
| Jina v2 | 8192 | Extended context |
| Multilingual E5 | 512 | Query-optimized |
| Arctic Embed | 512-2048 | Variable by model |

Hardware Acceleration

The module automatically detects and utilizes available hardware:

  1. CUDA - GPU acceleration via CUDAExecutionProvider
  2. CPU - Multi-threaded via CPUExecutionProvider
  3. CoreML - Apple Silicon support via CoreMLExecutionProvider

Sources: fastembed/text/onnx_embedding.py:250-300

Error Handling

Common Error Cases

# ValueError: attention_mask must be provided for pooled embeddings
model = PooledNormalizedEmbedding(...)
# Must ensure model outputs attention_mask

# ModelNotSupportedError: Unknown model
model = TextEmbedding(model_name="unknown/model")
# Falls back to default model or raises error

Validation Requirements

| Check | Condition | Error Type |
| --- | --- | --- |
| attention_mask | Required for pooled | ValueError |
| model_name | Must be in supported list | ModelNotSupportedError |
| cache_dir | Valid path | OSError |

Sources: fastembed/text/onnx_embedding.py:1-200

Image Embedding Module

Related topics: System Architecture, Late Interaction Models

The Image Embedding Module in FastEmbed provides functionality for generating vector representations (embeddings) from images using ONNX-based models. This module enables efficient image similarity search, image clustering, and multimodal retrieval applications.

Architecture Overview

The image embedding system follows a layered architecture pattern, separating the model definitions, ONNX inference logic, and embedding base classes.

graph TD
    A[Image Input] --> B[ImageEmbeddingBase]
    B --> C[OnnxImageModel]
    C --> D[OnnxImageEmbedding]
    D --> E[ONNX Runtime]
    E --> F[Embedding Vectors]
    
    G[Supported Models] --> D
    G -.-> H[ResNet50]
    G -.-> I[Unicom-ViT-B-16]
    G -.-> J[Unicom-ViT-B-32]
    G -.-> K[jinaai/jina-clip-v1]

Supported Image Models

The module supports multiple image embedding models with varying dimensions and capabilities.

| Model | Dimension | Type | License | Size (GB) | HF Source |
| --- | --- | --- | --- | --- | --- |
| Qdrant/resnet50-onnx | 512 | Image only | apache-2.0 | 0.10 | link |
| Qdrant/Unicom-ViT-B-16 | 768 | Multimodal (text&image) | apache-2.0 | 0.82 | link |
| Qdrant/Unicom-ViT-B-32 | 512 | Multimodal (text&image) | apache-2.0 | 0.48 | link |
| jinaai/jina-clip-v1 | 768 | Multimodal (text&image) | apache-2.0 | 0.34 | link |

Sources: fastembed/image/onnx_embedding.py:1-40

Core Classes

OnnxImageEmbedding

The main class for generating image embeddings extends ImageEmbeddingBase and OnnxImageModel[NumpyArray].

class OnnxImageEmbedding(ImageEmbeddingBase, OnnxImageModel[NumpyArray]):

Class Hierarchy:

graph LR
    A[TextEmbeddingBase] -->|mirrors| B[ImageEmbeddingBase]
    C[OnnxTextModel] -->|generic| D[OnnxImageModel]
    E[OnnxTextEmbedding] -->|reuse pattern| F[OnnxImageEmbedding]

The class follows the same architectural pattern as OnnxTextEmbedding, sharing the ONNX inference infrastructure with text embedding models.

Sources: fastembed/image/onnx_embedding.py:44-50

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | "Qdrant/clip-ViT-B-32" | Name of the model to use |
| cache_dir | str \| None | None | Path to cache directory |
| threads | int \| None | None | Number of threads for inference |
| providers | Sequence[OnnxProvider] \| None | None | ONNX execution providers |
| cuda | bool \| Device | Device.AUTO | CUDA device configuration |
| device_ids | list[int] \| None | None | Specific device IDs |
| lazy_load | bool | False | Load model lazily |
| device_id | int \| None | None | Specific device ID |

Multimodal Image Models

CLIP-based Models

The Unicom-ViT-B-16 and Unicom-ViT-B-32 models support both image and text inputs, enabling cross-modal retrieval scenarios.

| Model | Vision Dimension | Description |
| --- | --- | --- |
| Unicom-ViT-B-16 | 768 | More detailed embeddings (16x16 patches) |
| Unicom-ViT-B-32 | 512 | Faster processing (32x32 patches) |

jina-clip-v1

The jinaai/jina-clip-v1 model is a 2024 multimodal model supporting both text and image inputs:

DenseModelDescription(
    model="jinaai/jina-clip-v1",
    dim=768,
    description="Image embeddings, Multimodal (text&image), 2024 year",
    license="apache-2.0",
    size_in_GB=0.34,
    sources=ModelSource(hf="jinaai/jina-clip-v1"),
    model_file="onnx/vision_model.onnx",
),

Sources: fastembed/image/onnx_embedding.py:20-28

Model Loading Workflow

sequenceDiagram
    participant User
    participant OnnxImageEmbedding
    participant OnnxImageModel
    participant ONNX Runtime
    participant HuggingFace

    User->>OnnxImageEmbedding: __init__(model_name)
    OnnxImageEmbedding->>OnnxImageModel: Load model from cache/HF
    OnnxImageModel->>HuggingFace: Download if needed
    OnnxImageModel->>ONNX Runtime: Initialize session
    ONNX Runtime-->>OnnxImageModel: Session ready
    OnnxImageModel-->>OnnxImageEmbedding: Model loaded
    User->>OnnxImageEmbedding: embed(image)
    OnnxImageEmbedding->>ONNX Runtime: Run inference
    ONNX Runtime-->>User: Embedding vector

Usage Examples

Basic Image Embedding

from fastembed import ImageEmbedding

model = ImageEmbedding(model_name="Qdrant/Unicom-ViT-B-32")
embeddings = list(model.embed(["path/to/image.jpg"]))

CLIP Text-Image Retrieval

For multimodal models like jina-clip-v1, you can perform cross-modal retrieval:

from fastembed import ImageEmbedding, TextEmbedding

image_model = ImageEmbedding(model_name="jinaai/jina-clip-v1")
text_model = TextEmbedding(model_name="jinaai/jina-clip-v1")

# Generate embeddings for both modalities
image_emb = list(image_model.embed(["image_path.jpg"]))
text_emb = list(text_model.embed(["search query"]))

# Compute similarity
from numpy import dot

similarity = dot(image_emb[0], text_emb[0])

Integration with Qdrant

Image embeddings generated by this module are designed for use with Qdrant vector database:

# Creating a Qdrant collection with image embeddings
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="images",
    vectors_config={
        "image": VectorParams(
            size=768,  # For Unicom-ViT-B-16 or jina-clip-v1
            distance=Distance.COSINE
        )
    }
)

Supported Model Files

Each model specifies its ONNX model file location:

| Model | Model File | Additional Files |
| --- | --- | --- |
| ResNet50 | model.onnx | None |
| Unicom-ViT-B-16 | model.onnx | None |
| Unicom-ViT-B-32 | model.onnx | None |
| jina-clip-v1 | onnx/vision_model.onnx | onnx/text_model.onnx |

Sources: fastembed/image/onnx_embedding.py:10-28

Relationship with Text Embedding Module

The Image Embedding Module shares significant implementation with the Text Embedding Module:

graph TD
    A[OnnxTextModel] -->|shared base| B[OnnxImageModel]
    A -->|shared base| C[OnnxTextEmbedding]
    B -->|shared base| D[OnnxImageEmbedding]
    
    E[supported_onnx_models] -->|text models| C
    F[supported_image_models] -->|image models| D
    
    G[CLIPEmbeddingWorker] -.->|reused| H[OnnxTextEmbeddingWorker]

This design ensures consistent ONNX inference behavior and reduces code duplication across embedding types.

Sources: fastembed/text/onnx_embedding.py

Performance Considerations

| Model | Size (GB) | Embedding Dim | Use Case |
| --- | --- | --- | --- |
| ResNet50 | 0.10 | 512 | Fast, lightweight embeddings |
| Unicom-ViT-B-32 | 0.48 | 512 | Balanced speed/quality |
| Unicom-ViT-B-16 | 0.82 | 768 | Higher quality, slower |
| jina-clip-v1 | 0.34 | 768 | Multimodal, 2024 model |

Configuration Options

The module inherits configuration capabilities from the base classes:

from fastembed.image.onnx_embedding import OnnxImageEmbedding

# Example with full configuration
model = OnnxImageEmbedding(
    model_name="Qdrant/Unicom-ViT-B-16",
    cache_dir="~/.cache/fastembed",
    providers=["CPUExecutionProvider"],  # or "CUDAExecutionProvider"
    threads=4,
    lazy_load=True
)

Summary

The Image Embedding Module provides:

  • 4 supported image models ranging from lightweight to high-quality
  • Multimodal support via CLIP-based models for text-image retrieval
  • ONNX runtime optimization for efficient inference
  • Qdrant integration for vector storage and similarity search
  • Consistent API with the text embedding module

Sources: fastembed/image/onnx_embedding.py:1-40

Sparse Embedding Models

Related topics: Text Embedding Module, System Architecture

Overview

Sparse embedding models in FastEmbed represent text as high-dimensional sparse vectors where most dimensions have zero values. Unlike dense embeddings where every dimension contributes to meaning, sparse embeddings only store non-zero values for specific tokens or features. This approach combines semantic understanding with exact keyword matching capabilities.

The sparse representation consists of:

  • Indices: Token identifiers in the vocabulary
  • Values: Weights (scores) representing token importance

# Example sparse embedding output
SparseEmbedding(indices=[17, 123, 919, ...], values=[0.71, 0.22, 0.39, ...])
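Because only non-zero entries are stored, similarity between two sparse vectors reduces to summing products over the indices they share. The sketch below shows that computation on parallel index and value lists; `sparse_dot` and the toy numbers are illustrative assumptions, not a FastEmbed or Qdrant API.

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    weights_b = dict(zip(indices_b, values_b))
    return sum(v * weights_b[i] for i, v in zip(indices_a, values_a) if i in weights_b)


score = sparse_dot([17, 123, 919], [0.5, 0.25, 1.0], [123, 919, 2048], [2.0, 0.5, 4.0])
# Overlap on indices 123 and 919: 0.25 * 2.0 + 1.0 * 0.5 = 1.0
```

Indices present in only one vector (17 and 2048 here) contribute nothing, which is what gives sparse retrieval its exact-match behavior.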

Sources: fastembed/sparse/sparse_text_embedding.py:1-50

Architecture

Class Hierarchy

graph TD
    A[SparseTextEmbeddingBase] --> B[MiniCOIL]
    A --> C[Bm25]
    A --> D[BM42]
    A --> E[SPLADE++]
    
    F[OnnxTextModel<br/>SparseEmbedding] --> A
    
    G[OnnxTextEmbeddingWorker] --> F

The sparse embedding system is built on a base class SparseTextEmbeddingBase that extends OnnxTextModel[SparseEmbedding], providing a unified interface for all sparse embedding implementations.

Sources: fastembed/sparse/minicoil.py:30-50

Supported Models

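| Model | Type | Language | Size | License | Requires IDF |
| --- | --- | --- | --- | --- | --- |
| SPLADE++ | Sparse/SPLADE | English | 0.22 GB | apache-2.0 | Yes |
| BM25 | Traditional BM25 | Multi-language | 0.01 GB | apache-2.0 | Yes |
| BM42 | Hybrid BM25+Attention | English | 0.04 GB | apache-2.0 | Yes |
| MiniCOIL | Semantic + Keyword | English | 0.09 GB | apache-2.0 | Yes |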

Sources: fastembed/sparse/bm25.py:1-30

Core Components

SparseTextEmbeddingBase

The base class for all sparse embedding models defines the common interface and behavior:

class SparseTextEmbeddingBase(OnnxTextModel[SparseEmbedding]):
    """Base class for sparse text embedding models"""
    
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        """Returns list of supported sparse models"""
        
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[SparseEmbedding]:
        """Post-process ONNX model output to sparse format"""

Sources: fastembed/sparse/sparse_text_embedding.py:1-100

SparseEmbedding Data Model

The SparseEmbedding class represents sparse vectors with two primary attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| indices | list[int] | Vocabulary token IDs with non-zero values |
| values | list[float] | Corresponding importance weights for each index |

Sources: fastembed/sparse/sparse_text_embedding.py:100-150

Implementation Details

BM25

BM25 (Best Matching 25) is a traditional sparse embedding model that evaluates token importance based on term frequency and inverse document frequency.

Formula:

score(q, d) = Σ[ IDF(q_i) * (f(q_i, d) * (k + 1)) / (f(q_i, d) + k * (1 - b + b * (|d| / avg_len))) ]

| Parameter | Default | Description |
| --- | --- | --- |
| k | 1.5 | Term frequency saturation parameter |
| b | 0.75 | Length normalization parameter |
| avg_len | Computed | Average document length |

WARNING: BM25 is expected to be used with modifier="idf" in the sparse vector index of Qdrant.

Sources: fastembed/sparse/bm25.py:30-80

MiniCOIL

MiniCOIL is a sparse embedding model that combines semantic meaning resolution with exact keyword matching behavior.

Key Characteristics:

  • Converts vocabulary tokens into 4-dimensional components of sparse vectors
  • Weights tokens by their frequency in the corpus
  • Falls back to BM25-like behavior for out-of-vocabulary tokens

class MiniCOIL(SparseTextEmbeddingBase, OnnxTextModel[SparseEmbedding]):
    """
    MiniCOIL resolves semantic meaning while keeping exact keyword match behavior.
    Each vocabulary token is converted into 4d component of a sparse vector.
    """

Sources: fastembed/sparse/minicoil.py:30-55

BM42

BM42 extends traditional BM25 by incorporating attention weights from transformer models, creating a hybrid sparse representation.

Sources: fastembed/sparse/bm42.py:1-50

SPLADE++

SPLADE++ (SParse Lexical AnD expRessive model) uses a sparse expansion approach where each token can expand to related terms in the vocabulary, enabling semantic matching while maintaining interpretability.

Sources: fastembed/sparse/splade_pp.py:1-50
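The term-expansion idea can be sketched with the pooling commonly described for SPLADE-style models: apply log(1 + ReLU(x)) to each token's vocabulary logits, then take the maximum over tokens for every vocabulary entry. The function name `splade_weights` and the toy logits are assumptions, and the source does not confirm that FastEmbed's post-processing matches this exactly.

```python
import math


def splade_weights(token_logits):
    """Max over tokens of log(1 + ReLU(logit)) per vocabulary entry; zeros are dropped."""
    vocab_size = len(token_logits[0])
    indices, values = [], []
    for j in range(vocab_size):
        w = max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        if w > 0:  # keep only non-zero weights, giving a sparse result
            indices.append(j)
            values.append(w)
    return indices, values


# Two tokens, vocabulary of three entries.
indices, values = splade_weights([[-1.0, 0.0, 1.0], [2.0, -3.0, 0.5]])
```

Vocabulary entry 1 never receives a positive logit, so it is absent from the output, while entries 0 and 2 keep the largest activation seen across tokens.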

Data Flow

graph LR
    A[Input Text] --> B[Tokenization]
    B --> C[ONNX Model Inference]
    C --> D[ONNX Output Context]
    D --> E[Post-Processing]
    E --> F[SparseEmbedding]
    
    G[Vocabulary] --> C
    G --> H[SparseVectorsConverter]
    H --> E

SparseVectorsConverter Utility

The SparseVectorsConverter class handles conversion between different sparse vector formats, particularly for MiniCOIL's word embeddings.

Key Operations:

  • Converts sentence embeddings to Qdrant sparse vector format
  • Handles out-of-vocabulary (OOV) words with fallback to BM25
  • Manages vocabulary word embeddings with 4-dimensional components

# Example input structure
{
    "vector": WordEmbedding({
        "word": "vector",
        "forms": ["vector", "vectors"],
        "count": 2,
        "word_id": 1231,
        "embedding": [0.1, 0.2, 0.3, 0.4]
    }),
    "axiotic": WordEmbedding({  # OOV word
        "word": "axiotic",
        "forms": ["axiotics"],
        "count": 1,
        "word_id": -1,
    })
}

Sources: fastembed/sparse/utils/sparse_vectors_converter.py:50-100
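One way to picture how a 4-dimensional word embedding becomes part of a sparse vector is to give every vocabulary word its own block of four consecutive indices. The sketch below assumes that layout (`word_id * 4 + i`) purely for illustration; the real index arithmetic in SparseVectorsConverter may differ, and `word_to_sparse` and the `weight` argument are hypothetical names.

```python
DIM = 4  # components per vocabulary word, as described above


def word_to_sparse(word_id: int, embedding: list[float],
                   weight: float) -> tuple[list[int], list[float]]:
    """Map one in-vocabulary word to (indices, values) of a sparse vector."""
    # Assumed layout: each word owns DIM consecutive sparse indices
    indices = [word_id * DIM + i for i in range(DIM)]
    # Scale the components by a frequency-based weight (e.g. a BM25-style term weight)
    values = [weight * component for component in embedding]
    return indices, values


indices, values = word_to_sparse(1231, [0.1, 0.2, 0.3, 0.4], weight=2.0)
```

Because different words occupy disjoint index blocks, two sparse vectors only interact on words they share, which preserves exact keyword matching.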

Usage Examples

Basic Usage

from fastembed import SparseTextEmbedding

# Initialize with default SPLADE++ model
model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")

# Generate sparse embeddings
documents = ["Example text for embedding", "Another document"]
embeddings = list(model.embed(documents))

# Output: [SparseEmbedding(indices=[17, 123, 919, ...], values=[0.71, 0.22, 0.39, ...])]
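For intuition about how two such embeddings are compared, the following sketch computes a dot product over matching indices. It works on plain lists rather than SparseEmbedding objects, and `sparse_dot` is a hypothetical helper, not part of FastEmbed.

```python
def sparse_dot(indices_a: list[int], values_a: list[float],
               indices_b: list[int], values_b: list[float]) -> float:
    """Dot product of two sparse vectors given as parallel index/value lists."""
    lookup_b = dict(zip(indices_b, values_b))
    # Only indices present in both vectors contribute to the score
    return sum(value * lookup_b.get(index, 0.0)
               for index, value in zip(indices_a, values_a))


similarity = sparse_dot([17, 123, 919], [0.5, 0.25, 1.0],
                        [123, 500, 919], [2.0, 1.0, 0.5])
```

In practice the vector database performs this computation; the sketch only shows why non-overlapping indices cost nothing.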

BM25 Usage

from fastembed import SparseTextEmbedding

documents = ["Example text for embedding", "Another document"]

model = SparseTextEmbedding(model_name="Qdrant/bm25")
embeddings = list(model.embed(documents))

With Qdrant

Sparse embeddings map directly onto Qdrant's sparse vectors. Models that depend on IDF weighting, such as BM25, additionally need the sparse vector index configured with modifier="idf":

from qdrant_client import QdrantClient
from fastembed import SparseTextEmbedding

client = QdrantClient()
model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")

# Embedding will have indices and values compatible with Qdrant sparse vectors

Sources: README.md:1-100

Configuration Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | "prithivida/Splade_PP_en_v1" | Name of the sparse embedding model |
| cache_dir | str \| None | None | Cache directory path for model files |
| threads | int \| None | None | Number of threads for inference |
| providers | Sequence[OnnxProvider] \| None | None | ONNX execution providers |
| lazy_load | bool | False | Whether to load model lazily |
| device_id | int \| None | None | Specific device ID for execution |

Language Support

| Model | Languages |
| --- | --- |
| SPLADE++ | English only |
| BM25 | 15+ languages |
| BM42 | English |
| MiniCOIL | English |

Sources: fastembed/sparse/bm25.py:1-25

Requirements

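BM25, BM42, and MiniCOIL rely on IDF (Inverse Document Frequency) weighting for good retrieval quality. This weighting is typically applied by the vector database (e.g., Qdrant with modifier="idf") during indexing and search rather than by the embedding model. SPLADE++ produces learned term weights and does not need the modifier.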

Source: https://github.com/qdrant/fastembed / Human Manual

Late Interaction Models

Related topics: Image Embedding Module, System Architecture, Text Embedding Module

Overview

Late Interaction Models represent an advanced embedding paradigm that departs from traditional single-vector dense representations. Unlike conventional dense embedding models that compress entire documents into a single embedding vector, late interaction models preserve token-level embeddings and defer the similarity computation until query time. This approach enables granular token-to-token interactions between queries and documents, significantly improving retrieval precision for complex semantic matching tasks.

The FastEmbed library implements two categories of late interaction models:

| Category | Scope | Use Case |
| --- | --- | --- |
| Text-based | Query-document text matching | Semantic search, question answering |
| Multimodal | Text + image joint embedding | Visual document retrieval, image search |

Architecture

Late Interaction vs Traditional Dense Retrieval

graph TD
    subgraph "Traditional Dense Retrieval"
        A1[Query Text] --> B1[Encoder]
        D1[Document] --> E1[Encoder]
        B1 --> C1[Single Query Vector]
        E1 --> F1[Single Document Vector]
        C1 --> G1[Dot Product / Cosine Similarity]
        F1 --> G1
    end
    
    subgraph "Late Interaction Retrieval"
        A2[Query Text] --> B2[Encoder]
        D2[Document] --> E2[Encoder]
        B2 --> C2[Query Token Embeddings]
        E2 --> F2[Document Token Embeddings]
        C2 --> G2[Late Interaction Module]
        F2 --> G2
        G2 --> H2[Max-Sum Similarity]
    end
    
    style C1 fill:#ffcccc
    style F1 fill:#ffcccc
    style C2 fill:#ccffcc
    style F2 fill:#ccffcc

ColBERT Late Interaction Mechanism

ColBERT uses a max-similarity (MaxSim) strategy: each query token independently finds its most similar document token, and the relevance score is the sum of these per-token maxima:

graph LR
    Q1["Query: q1 q2 q3"] --> QE[Query Encoder]
    D1["Doc: d1 d2"] --> DE[Document Encoder]
    QE --> QT["Q_emb: [v₁, v₂, v₃]"]
    DE --> DT["D_emb: [u₁, u₂]"]
    QT --> MM[Similarity Matrix]
    DT --> MM
    MM --> MS["Max Similarity<br/>sim(qᵢ) = maxⱼ S(vᵢ, uⱼ)"]
    MS --> SR["Score = Σᵢ sim(qᵢ)"]
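
The scoring rule in the diagram can be sketched in a few lines of plain Python. The vectors below are made-up two-dimensional token embeddings and `maxsim_score` is a hypothetical helper, used only to show the arithmetic with the dot product as the similarity S.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: list[list[float]],
                 doc_tokens: list[list[float]]) -> float:
    """Sum over query tokens of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three query token embeddings
doc = [[1.0, 0.0], [0.5, 0.5]]                # two document token embeddings
score = maxsim_score(query, doc)
```

Each query token contributes exactly one term to the score, which is why the result can be attributed back to individual tokens.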

Supported Models

Text-based Late Interaction Models

| Model | Dimension | Context Length | Multilingual | License | Size |
| --- | --- | --- | --- | --- | --- |
| jinaai/jina-colbert-v2 | 128 | 8192 | Yes | cc-by-nc-4.0 | 2.24 GB |

Multimodal Late Interaction Models

| Model | Modality | Description |
| --- | --- | --- |
| vidore/colpali | Text + Image | Vision-Language late interaction |
| vidore/colqwen2 | Text + Image | Qwen2-based multimodal |
| vidore/colmodernvbert | Text + Image | Modern vision backbone |

Base Classes

LateInteractionTextEmbedding

Abstract base class for text-based late interaction models.

class LateInteractionTextEmbedding(TextEmbeddingBase):
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        ...
    
    @classmethod
    def _get_worker_class(cls) -> Type[OnnxTextEmbeddingWorker]:
        ...

Sources: late_interaction_text_embedding.py:1-50

Colbert Base Class

The Colbert class implements the core late interaction logic:

class Colbert(LateInteractionTextEmbedding):
    QUERY_MARKER_TOKEN_ID: int = 1  # Default CLS token
    DOCUMENT_MARKER_TOKEN_ID: int = 1  # Default CLS token
    MIN_QUERY_LENGTH: int = 32
    MASK_TOKEN: str = "[MASK]"

Sources: colbert.py:20-60

Key methods:

| Method | Purpose |
| --- | --- |
| embed() | Encode documents, yielding one matrix of token embeddings per document |
| query_embed() | Encode queries, yielding one matrix of token embeddings per query |
| passage_embed() | Encode passages; the document-side counterpart to query_embed() |

Jina Colbert Implementation

The JinaColbert class extends the base Colbert with model-specific configuration:

class JinaColbert(Colbert):
    QUERY_MARKER_TOKEN_ID = 250002
    DOCUMENT_MARKER_TOKEN_ID = 250003
    MIN_QUERY_LENGTH = 31  # 32 minus 1 for special token
    MASK_TOKEN = "<mask>"

Sources: jina_colbert.py:15-19
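The constants above suggest how a query is prepared before inference: it is padded with mask tokens up to MIN_QUERY_LENGTH, and a marker token is inserted after the first token to bring the total to 32. The sketch below illustrates that idea under stated assumptions; `prepare_query`, the example token IDs, and the mask token ID 250001 are hypothetical, and FastEmbed's actual preprocessing may differ in detail.

```python
def prepare_query(token_ids: list[int], marker_id: int,
                  mask_id: int, min_length: int) -> list[int]:
    """Pad a tokenized query with mask tokens, then insert the query marker."""
    # Pad short queries so the model always sees at least min_length tokens
    padded = token_ids + [mask_id] * max(0, min_length - len(token_ids))
    # Insert the marker right after the first (start-of-sequence) token
    return padded[:1] + [marker_id] + padded[1:]


tokens = [0, 10, 11, 2]  # hypothetical IDs: start token, two words, end token
prepared = prepare_query(tokens, marker_id=250002, mask_id=250001, min_length=31)
```

Padding with mask tokens is the query augmentation technique from the ColBERT paper: the extra positions let the model produce additional query-side embeddings.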

#### Model Configuration

| Parameter | Value | Description |
| --- | --- | --- |
| model | jinaai/jina-colbert-v2 | HuggingFace model identifier |
| dim | 128 | Token embedding dimension |
| size_in_GB | 2.24 | Model size |
| context_length | 8192 | Maximum input tokens |
| license | cc-by-nc-4.0 | Model license |

Multimodal Late Interaction

ColPali Model

ColPali extends the late interaction paradigm to visual documents by treating images as sequences of patches:

class ColPali(MultimodalTextImageBase, OnnxMultimodalModel[MultivectorEmbedding]):
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        return supported_colpali_models

Sources: colpali.py:1-50

ColModernVBert Model

ColModernVBert provides an alternative vision-language architecture with a modern backbone:

class ColModernVBert(MultimodalTextImageBase, OnnxMultimodalModel[MultivectorEmbedding]):
    QUERY_MARKER_TOKEN_ID = 0
    DOCUMENT_MARKER_TOKEN_ID = 1
    MIN_QUERY_LENGTH = 32
    MASK_TOKEN = "<mask>"

Sources: colmodernvbert.py:1-50

Architecture: Multimodal Late Interaction

graph TD
    subgraph "Query Processing"
        QT[Text Query] --> QE[Text Encoder]
        QV[Query Image] --> QP[Patch Extraction]
        QP --> QI[Query Image Embeddings]
        QE --> QT_emb[Query Token Embeddings]
    end
    
    subgraph "Document Processing"
        DT[Document Text] --> DE[Text Encoder]
        DV[Document Image] --> DP[Patch Extraction]
        DP --> DI[Doc Image Embeddings]
        DE --> DT_emb[Document Token Embeddings]
    end
    
    subgraph "Late Interaction"
        QT_emb --> LI[Interaction Module]
        DT_emb --> LI
        QI --> LI
        DI --> LI
        LI --> SM[Similarity Matrix]
        SM --> MS[Max-Sum Pooling]
    end
    
    MS --> SC[Relevance Score]

Usage Examples

Text-based Late Interaction

from fastembed import LateInteractionTextEmbedding

# Initialize the model
model = LateInteractionTextEmbedding(
    model_name="jinaai/jina-colbert-v2"
)

# Encode query and document separately; each result is a
# (num_tokens, dim) array of token embeddings
query_embedding = next(iter(model.query_embed("What is machine learning?")))
doc_embedding = next(iter(model.embed(["Machine learning is a subset of AI..."])))

# Compute the late interaction (MaxSim) score: best-matching document
# token per query token, summed over query tokens
score = (query_embedding @ doc_embedding.T).max(axis=1).sum()

Multimodal Late Interaction

from fastembed import LateInteractionMultimodalEmbedding

model = LateInteractionMultimodalEmbedding(
    model_name="vidore/colpali"
)

# Encode document images and text queries separately
# ("page.png" is a placeholder path)
image_embedding = next(iter(model.embed_image(["page.png"])))
query_embedding = next(iter(model.embed_text(["Find charts about revenue"])))

Configuration Options

Common Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | Required | Model identifier |
| cache_dir | str | None | Local cache directory |
| threads | int | None | CPU threads for inference |
| providers | Sequence[OnnxProvider] | None | ONNX execution providers |
| lazy_load | bool | False | Defer model loading until first use |
| device_id | int | None | Specific device index |

ONNX Providers

| Provider | Description | Priority |
| --- | --- | --- |
| CPUExecutionProvider | CPU inference | Default fallback |
| CUDAExecutionProvider | NVIDIA GPU | Preferred for speed |
| CoreMLExecutionProvider | Apple Silicon | Mobile/iOS |

Post-processing: Muvera

Muvera is a post-processing technique that converts late interaction embeddings (multi-vector) into fixed-dimensional representations:

from fastembed.postprocess import Muvera

# Convert from multi-vector to fixed-dim
muvera = Muvera.from_multivector_model(
    model=late_interaction_model,
    k_sim=6,
    dim_proj=32
)

# Process document
fde = muvera.process_document(multivector_embedding)

Sources: postprocess/muvera.py:1-100

Muvera Configuration

| Parameter | Description | Impact |
| --- | --- | --- |
| k_sim | Log₂ of number of buckets | Memory vs precision |
| dim_proj | Projection dimension | Output size |
| r_reps | Number of repetitions | Robustness |
| random_seed | Random seed | Reproducibility |

Output dimension formula: r_reps × 2^k_sim × dim_proj
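
As a quick worked example of the formula, the sketch below computes the output size for the k_sim and dim_proj values used in the snippet above. The value r_reps=20 is chosen only for illustration, and `muvera_output_dim` is a hypothetical helper rather than part of the Muvera class.

```python
def muvera_output_dim(k_sim: int, dim_proj: int, r_reps: int) -> int:
    """Size of the fixed-dimensional encoding: r_reps x 2^k_sim x dim_proj."""
    num_buckets = 2 ** k_sim  # k_sim is the log2 of the bucket count
    return r_reps * num_buckets * dim_proj


fde_dim = muvera_output_dim(k_sim=6, dim_proj=32, r_reps=20)
smaller_dim = muvera_output_dim(k_sim=5, dim_proj=32, r_reps=20)
```

Because the bucket count is exponential in k_sim, lowering k_sim by one halves the output dimension, while dim_proj and r_reps scale it linearly.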

Performance Considerations

Token Length Limits

| Model | Query Limit | Document Limit |
| --- | --- | --- |
| jina-colbert-v2 | 31 tokens | 8192 tokens |
| colpali | Variable | Variable |

Memory Usage

Late interaction models store per-token embeddings rather than single vectors:

  • Traditional dense: N × D memory (N documents, D dimension)
  • Late interaction: N × T × D memory (T = avg token count)

This trade-off enables better precision at the cost of increased memory footprint.
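
A back-of-the-envelope estimate shows the scale of that trade-off. The corpus size, average token count, and float32 storage (4 bytes per value) below are illustrative assumptions, not measurements, and the helper names are hypothetical.

```python
def dense_bytes(n_docs: int, dim: int, bytes_per_value: int = 4) -> int:
    """Storage for one vector per document (N x D)."""
    return n_docs * dim * bytes_per_value


def late_interaction_bytes(n_docs: int, avg_tokens: int, dim: int,
                           bytes_per_value: int = 4) -> int:
    """Storage for one vector per token per document (N x T x D)."""
    return n_docs * avg_tokens * dim * bytes_per_value


# One million documents: 768-dim dense vs. 128-dim token vectors, 200 tokens each
dense_total = dense_bytes(1_000_000, 768)
late_total = late_interaction_bytes(1_000_000, 200, 128)
```

Even with a much smaller per-token dimension, the token count dominates, which is the motivation for post-processing such as Muvera.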

Comparison with Dense Retrieval

| Aspect | Dense Retrieval | Late Interaction |
| --- | --- | --- |
| Embedding type | Single vector | Token-level vectors |
| Query speed | O(1) comparison | O(Q × D) interaction |
| Precision | Good for semantic similarity | Excellent for term matching |
| Memory | Lower | Higher |
| Interpretability | Limited | Token-level attribution |

| Component | File | Purpose |
| --- | --- | --- |
| ColbertEmbeddingWorker | colbert.py | Parallel embedding worker |
| OnnxMultimodalModel | onnx_multimodal_model.py | Base for ONNX multimodal |
| MultivectorEmbedding | Types | Output type for late interaction |
| Muvera | muvera.py | Dimensionality reduction post-process |

References

  • Original ColBERT paper: Khattab & Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT"
  • Model source: jinaai/jina-colbert-v2

Sources: late_interaction_text_embedding.py:1-50

ONNX Model Infrastructure

Related topics: System Architecture, GPU Support and Acceleration

Overview

The ONNX Model Infrastructure is the core runtime layer in FastEmbed that enables efficient execution of embedding models through the ONNX (Open Neural Network Exchange) format. This infrastructure provides a unified abstraction for loading, executing, and post-processing ONNX models across different embedding modalities including text, images, and sparse embeddings.

FastEmbed uses ONNX Runtime for cross-platform compatibility and optimized inference without requiring PyTorch or TensorFlow. The ONNX runtime layer handles model execution, while the higher-level embedding classes handle embedding-specific logic such as pooling and normalization.

Sources: fastembed/text/onnx_embedding.py:1-50

Architecture Overview

The ONNX infrastructure follows a class hierarchy pattern where base classes define the runtime contract and concrete implementations provide modality-specific behavior. The architecture is designed around the following core components:

graph TD
    A[ONNX Runtime] --> B[OnnxModel Base]
    B --> C[OnnxTextModel]
    B --> D[OnnxImageModel]
    B --> E[OnnxCrossEncoderModel]
    C --> F[OnnxTextEmbedding]
    C --> G[PooledEmbedding]
    C --> H[CLIPOnnxEmbedding]
    C --> I[MiniCOIL]
    D --> J[OnnxImageEmbedding]
    E --> K[OnnxTextCrossEncoder]
    
    L[TextEmbeddingBase] --> F
    M[ImageEmbeddingBase] --> J
    N[SparseTextEmbeddingBase] --> I

Core Base Classes

The infrastructure defines three primary base classes that orchestrate ONNX model execution:

| Class | File | Purpose |
| --- | --- | --- |
| OnnxModel | fastembed/common/onnx_model.py | Core runtime for ONNX session management |
| OnnxTextModel | fastembed/text/onnx_text_model.py | Text-specific ONNX execution |
| OnnxImageModel | fastembed/image/onnx_embedding.py | Image-specific ONNX execution |

Sources: fastembed/common/onnx_model.py:1-100

ONNX Session Management

Model Loading

The OnnxModel base class handles the lifecycle of ONNX model loading and execution. When a model is initialized, it performs the following operations:

  1. Resolves the model file path from cache or downloads from source
  2. Configures ONNX Runtime session options (threads, providers)
  3. Creates an inference session with the specified execution providers
  4. Validates model inputs and outputs

class OnnxModel(Generic[T]):
    def __init__(
        self,
        model_dir: Path,
        model_file: str,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_id: int | None = None,
        **kwargs: Any,
    ):
        self._load_onnx_model(
            model_dir=model_dir,
            model_file=model_file,
            threads=threads,
            providers=providers,
            cuda=cuda,
            device_id=device_id,
        )

Sources: fastembed/common/onnx_model.py:50-80

Execution Providers

The infrastructure supports multiple ONNX Runtime execution providers for hardware acceleration:

| Provider | Priority | Use Case |
| --- | --- | --- |
| CUDAExecutionProvider | GPU acceleration | NVIDIA GPUs |
| CPUExecutionProvider | Fallback | CPU inference |

The Device enum provides automatic device selection:

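from enum import Enum

class Device(Enum):
    CPU = "cpu"    # Force CPU execution
    CUDA = "cuda"  # NVIDIA GPU acceleration
    AUTO = "auto"  # Automatically select the best available device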

Sources: fastembed/common/types.py:1-50

Text Embedding Infrastructure

OnnxTextEmbedding

The OnnxTextEmbedding class is the primary implementation for text embedding generation. It inherits from both TextEmbeddingBase and OnnxTextModel, combining the ONNX runtime with embedding-specific logic.

class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[NumpyArray]):
    """Implementation of the Flag Embedding model."""
    
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        return supported_onnx_models

Sources: fastembed/text/onnx_embedding.py:60-85

#### Supported Models

The text embedding infrastructure includes a comprehensive list of supported models:

| Model | Dimension | License | Size (GB) | Token Limit |
| --- | --- | --- | --- | --- |
| BAAI/bge-base-en | 768 | mit | 0.42 | 512 |
| BAAI/bge-base-en-v1.5 | 768 | mit | 0.21 | 512 |
| BAAI/bge-large-en-v1.5 | 1024 | mit | 1.20 | 512 |
| BAAI/bge-small-en-v1.5 | 384 | mit | 0.067 | 512 |
| snowflake/snowflake-arctic-embed-m | 768 | apache-2.0 | 0.43 | 512 |
| snowflake/snowflake-arctic-embed-m-long | 768 | apache-2.0 | 0.54 | 2048 |
| jinaai/jina-clip-v1 | 768 | apache-2.0 | 0.55 | multimodal |
| mixedbread-ai/mxbai-embed-large-v1 | 1024 | apache-2.0 | 0.64 | 512 |

Sources: fastembed/text/onnx_embedding.py:30-150

Pooled Embedding Variants

The infrastructure provides specialized pooling strategies through inheritance:

#### PooledNormalizedEmbedding

Applies mean pooling over token embeddings followed by L2 normalization:

class PooledNormalizedEmbedding(PooledEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return normalize(self.mean_pooling(embeddings, attn_mask))

Supported models for pooled normalized embeddings include:

  • jinaai/jina-embeddings-v2-base-en (768 dim, 8192 tokens)
  • jinaai/jina-embeddings-v2-small-en (512 dim, 8192 tokens)
  • thenlper/gte-base (768 dim, 512 tokens)
  • thenlper/gte-large (1024 dim, 512 tokens)

Sources: fastembed/text/pooled_normalized_embedding.py:50-100
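
To show what mean pooling followed by L2 normalization does numerically, here is a dependency-free sketch for a single sequence. It mirrors the idea of the post-processing step above but is not FastEmbed's code; `mean_pool` and `l2_normalize` are hypothetical helpers operating on plain lists.

```python
import math


def mean_pool(token_embeddings: list[list[float]],
              attention_mask: list[int]) -> list[float]:
    """Average token vectors, ignoring positions masked out as padding."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against an all-padding sequence
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in kept) / count for i in range(dim)]


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
embedding = l2_normalize(pooled)
```

The padded position has no influence on the result, which is the reason the attention mask is passed alongside the model output.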

#### PooledEmbedding

Standard pooled embedding with mean pooling over token representations:

class PooledEmbedding(OnnxTextEmbedding):
    @classmethod
    def _get_worker_class(cls) -> Type[OnnxTextEmbeddingWorker]:
        return PooledEmbeddingWorker

Supported multilingual and specialized models:

  • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (384 dim, ~50 languages)
  • sentence-transformers/paraphrase-multilingual-mpnet-base-v2 (768 dim, ~50 languages)
  • intfloat/multilingual-e5-large (1024 dim, ~100 languages)

Sources: fastembed/text/pooled_embedding.py:50-150

Image Embedding Infrastructure

OnnxImageEmbedding

Image embedding models inherit from OnnxImageModel and ImageEmbeddingBase:

class OnnxImageEmbedding(ImageEmbeddingBase, OnnxImageModel[NumpyArray]):
    def __init__(self, model_name: str, cache_dir: str | None = None, **kwargs: Any):
        ...

Supported image models:

| Model | Dimension | License | Size (GB) |
| --- | --- | --- | --- |
| Qdrant/resnet50-onnx | 2048 | apache-2.0 | - |
| Qdrant/Unicom-ViT-B-16 | 768 | apache-2.0 | 0.82 |
| Qdrant/Unicom-ViT-B-32 | 512 | apache-2.0 | 0.48 |
| jinaai/jina-clip-v1 | 768 | apache-2.0 | 0.34 |

Sources: fastembed/image/onnx_embedding.py:30-80

Multimodal Embedding

CLIP Embeddings

The CLIPOnnxEmbedding class provides multimodal (text and image) embedding capabilities:

class CLIPOnnxEmbedding(OnnxTextEmbedding):
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        return supported_clip_models
    
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        return output.model_output

Currently supported CLIP model:

  • Qdrant/clip-ViT-B-32-text (512 dim, 77 tokens)

Sources: fastembed/text/clip_embedding.py:20-50

Late Interaction Multimodal

The ColModernVBert class implements late interaction models with image processing capabilities:

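def load_onnx_model(self) -> None:
    self._load_onnx_model(...)

    # Load image processing configuration (requires `import json`)
    processor_config_path = self._model_dir / "processor_config.json"
    with open(processor_config_path) as f:
        processor_config = json.load(f)
    self.image_seq_len = processor_config.get("image_seq_len", 64)
    # preprocessor_config is a second config dict; its loading is not shown in this excerpt
    self.max_image_size = preprocessor_config.get("max_image_size", {}).get("longest_edge", 512)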

Sources: fastembed/late_interaction_multimodal/colmodernvbert.py:50-100

Sparse Embedding Infrastructure

MiniCOIL

The MiniCOIL class implements sparse embedding with semantic resolution:

class MiniCOIL(SparseTextEmbeddingBase, OnnxTextModel[SparseEmbedding]):
    """
    MiniCOIL is a sparse embedding model, that resolves semantic meaning of the words,
    while keeping exact keyword match behavior.
    """

Each vocabulary token is converted into a 4-dimensional component of a sparse vector, weighted by token frequency in the corpus. If a token is not found in the corpus, it is treated exactly like in BM25.

Supported sparse models:

  • Qdrant/minicoil-v1 (0.09 GB, requires IDF weighting)

Sources: fastembed/sparse/minicoil.py:40-80

Worker Architecture

The infrastructure uses a worker-based pattern for parallel embedding generation:

graph LR
    A[Main Thread] --> B[OnnxTextEmbeddingWorker]
    B --> C[ONNX Session]
    C --> D[Tokenization]
    D --> E[Model Inference]
    E --> F[Post-processing]
    F --> G[Normalized Embeddings]

Worker Classes

| Worker Class | Parent | Purpose |
| --- | --- | --- |
| OnnxTextEmbeddingWorker | Base | Standard text embedding generation |
| PooledEmbeddingWorker | OnnxTextEmbeddingWorker | Mean pooling after inference |
| PooledNormalizedEmbeddingWorker | OnnxTextEmbeddingWorker | Pooling + L2 normalization |
| CLIPEmbeddingWorker | OnnxTextEmbeddingWorker | CLIP-specific processing |

Sources: fastembed/text/onnx_text_model.py:1-50

Reranking Infrastructure

OnnxTextCrossEncoder

The cross-encoder reranking uses a specialized ONNX model class:

class OnnxTextCrossEncoder(TextCrossEncoderBase, OnnxCrossEncoderModel):
    @classmethod
    def _list_supported_models(cls) -> list[BaseModelDescription]:
        return supported_onnx_models

Supported reranker models:

| Model | License | Size (GB) | Context |
| --- | --- | --- | --- |
| jinaai/jina-reranker-v1-turbo-en | apache-2.0 | 0.15 | 1K context |
| jinaai/jina-reranker-v2-base-multilingual | cc-by-nc-4.0 | 1.11 | 1K context, sliding window |

Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py:30-80

Model Source Configuration

ModelSource

Models can be loaded from multiple sources:

from dataclasses import dataclass

@dataclass
class ModelSource:
    hf: str | None = None                 # HuggingFace Hub
    url: str | None = None                # Direct URL download
    _deprecated_tar_struct: bool = False  # Legacy tar format

ModelDescription

The base model description structure:

@dataclass
class DenseModelDescription:
    model: str
    dim: int
    description: str
    license: str
    size_in_GB: float
    sources: ModelSource
    model_file: str
    additional_files: list[str] | None = None

Sources: fastembed/common/model_description.py:1-80

Inference Workflow

sequenceDiagram
    participant User
    participant EmbeddingClass
    participant OnnxModel
    participant ONNXRuntime
    
    User->>EmbeddingClass: embed(texts)
    EmbeddingClass->>OnnxModel: preprocess(texts)
    OnnxModel->>OnnxModel: tokenize()
    OnnxModel->>ONNXRuntime: run(session)
    ONNXRuntime-->>OnnxModel: model_output
    OnnxModel->>EmbeddingClass: _post_process_onnx_output()
    EmbeddingClass->>EmbeddingClass: normalize/pool()
    EmbeddingClass-->>User: numpy arrays

Configuration Parameters

Common Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | model-specific | Name of the model to use |
| cache_dir | `str \| None` | None | Cache directory for model files |
| threads | `int \| None` | None | Number of threads for ONNX |
| providers | `Sequence[OnnxProvider] \| None` | None | Execution providers |
| cuda | `bool \| Device` | Device.AUTO | CUDA device selection |
| device_ids | `list[int] \| None` | None | Multiple GPU device IDs |
| lazy_load | bool | False | Defer model loading |
| device_id | `int \| None` | None | Specific device ID |
| specific_model_path | `str \| None` | None | Custom model file path |

Sources: fastembed/text/onnx_embedding.py:85-120

Lazy Loading

The infrastructure supports lazy loading for memory-efficient initialization:

def __init__(
    self,
    lazy_load: bool = False,
    **kwargs: Any,
):
    if not lazy_load:
        self.load_onnx_model()

When lazy_load=True, the ONNX model is not loaded until the first inference call, reducing startup memory footprint.
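
The pattern can be demonstrated without ONNX Runtime at all. In the sketch below, `LazyModel` is a hypothetical stand-in class: a plain object plays the role of the inference session, and `embed` returns placeholder values. It only illustrates when loading happens, not how FastEmbed loads a model.

```python
class LazyModel:
    """Toy model that loads its 'session' eagerly or on first use."""

    def __init__(self, lazy_load: bool = False) -> None:
        self.session = None
        self.load_count = 0
        if not lazy_load:
            self.load_onnx_model()

    def load_onnx_model(self) -> None:
        # Stand-in for creating an ONNX Runtime inference session
        self.session = object()
        self.load_count += 1

    def embed(self, texts: list[str]) -> list[int]:
        # Load on demand so a lazily constructed model still works
        if self.session is None:
            self.load_onnx_model()
        return [len(text) for text in texts]  # placeholder "embeddings"


eager = LazyModel()
lazy = LazyModel(lazy_load=True)
loaded_before_use = lazy.session is not None
lazy.embed(["hello"])
lazy.embed(["again"])
```

The guard in `embed` ensures the model is loaded exactly once, no matter how many inference calls follow.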

Type System

Core Types

| Type | Definition | Usage |
| --- | --- | --- |
| NumpyArray | np.ndarray[Any, np.dtype[Any]] | Dense embedding arrays |
| SparseEmbedding | Custom sparse representation | Sparse embedding vectors |
| OnnxProvider | Execution provider type | CPU, CUDA providers |
| Device | Enum | Device selection (CPU/CUDA/AUTO) |

Sources: fastembed/common/types.py:1-100

Summary

The ONNX Model Infrastructure provides a robust, extensible foundation for embedding generation in FastEmbed. Key characteristics include:

  • Unified Runtime: Single ONNX execution layer across all embedding modalities
  • Hardware Acceleration: Support for CUDA and CPU execution providers
  • Model Flexibility: Dynamic model loading from HuggingFace, URLs, or local cache
  • Extensible Architecture: Clean inheritance hierarchy for adding new embedding types
  • Memory Efficiency: Lazy loading and optimized session management
  • Cross-Modal Support: Text, image, sparse, and multimodal embeddings

This infrastructure enables FastEmbed to deliver high-performance embedding generation without external ML framework dependencies, making it suitable for production deployments with varying hardware constraints.

Sources: fastembed/common/onnx_model.py:1-100

GPU Support and Acceleration

Related topics: Installation Guide, ONNX Model Infrastructure

FastEmbed provides comprehensive GPU acceleration support through ONNX Runtime's execution providers, enabling high-performance inference on NVIDIA GPUs. The library offers flexible device management with automatic detection, multi-GPU support for parallel processing, and lazy loading capabilities for efficient resource utilization.

Architecture Overview

FastEmbed's GPU acceleration is built on top of ONNX Runtime, which provides hardware-accelerated inference for ONNX models. All embedding model classes inherit from base ONNX model classes that handle device initialization and session management.

graph TD
    A[User Code] --> B[TextEmbedding / ImageEmbedding / CrossEncoder]
    B --> C[ONNX Model Base Classes]
    C --> D[ONNX Runtime Session]
    D --> E{Hardware Acceleration}
    E --> F[CUDA Execution Provider]
    E --> G[CPU Execution Provider]
    E --> H[TensorRT Provider]
    
    F --> H1[NVIDIA GPU]
    G --> H2[CPU Fallback]
    
    style F fill:#4CAF50,color:#fff
    style H1 fill:#2196F3,color:#fff

Supported Model Types

FastEmbed supports GPU acceleration across multiple embedding modalities and processing types.

| Model Type | Class | GPU Support | Description |
| --- | --- | --- | --- |
| Text Embeddings | OnnxTextEmbedding | Yes | Dense text embeddings via ONNX |
| Pooled Embeddings | PooledEmbedding | Yes | Pooled representation embeddings |
| Normalized Pooled | PooledNormalizedEmbedding | Yes | L2-normalized pooled embeddings |
| Image Embeddings | OnnxImageEmbedding | Yes | Vision model embeddings |
| Cross Encoders | OnnxTextCrossEncoder | Yes | Reranking and relevance scoring |
| Sparse Embeddings | SPLADE models | Yes | Lexical sparse embeddings |

Sources: fastembed/text/onnx_embedding.py:1-50

Device Configuration

Device Enum

The Device enum defines available compute devices with automatic selection capability.

class Device(Enum):
    AUTO = "auto"  # Automatically select best available device
    CPU = "cpu"    # Force CPU execution
    CUDA = "cuda"  # NVIDIA GPU acceleration

Initialization Parameters

All ONNX embedding classes accept the following GPU-related parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| cuda | `bool \| Device` | Device.AUTO | Enable CUDA or specify device type |
| providers | Sequence[OnnxProvider] | None | ONNX Runtime providers (mutually exclusive with cuda) |
| device_ids | list[int] | None | GPU device IDs for multi-GPU data parallelism |
| device_id | int | None | Specific device ID for single-process loading |
| lazy_load | bool | False | Defer model loading until first use |

Sources: fastembed/text/onnx_embedding.py:47-57

Constructor Signature

def __init__(
    self,
    model_name: str = "BAAI/bge-small-en-v1.5",
    cache_dir: str | None = None,
    threads: int | None = None,
    providers: Sequence[OnnxProvider] | None = None,
    cuda: bool | Device = Device.AUTO,
    device_ids: list[int] | None = None,
    lazy_load: bool = False,
    device_id: int | None = None,
    specific_model_path: str | None = None,
    **kwargs: Any,
):
    ...  # body omitted

Sources: fastembed/text/onnx_embedding.py:44-66

GPU Initialization Workflow

sequenceDiagram
    participant User
    participant Embedding as Embedding Class
    participant Base as ONNX Model Base
    participant Runtime as ONNX Runtime
    participant Device as Compute Device
    
    User->>Embedding: Initialize(cuda=True)
    Embedding->>Base: super().__init__()
    Base->>Device: Auto-detect device
    Device-->>Base: Available devices
    Base->>Runtime: Create InferenceSession
    Runtime->>Device: Load model to GPU
    Device-->>Runtime: Model loaded
    Runtime-->>Base: Session ready
    Base-->>Embedding: Return session
    Embedding-->>User: Instance ready

Multi-GPU Configuration

Data Parallel Processing

For scenarios requiring distribution across multiple GPUs, FastEmbed supports device ID specification for data-parallel workloads.

from fastembed import TextEmbedding

# Initialize for multi-GPU data parallelism
embedding_model = TextEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    cuda=True,
    device_ids=[0, 1, 2, 3],  # Use 4 GPUs
    lazy_load=True  # Required for multi-GPU setup
)

Sources: fastembed/text/onnx_embedding.py:52-55
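
One simple way to think about data parallelism across device_ids is a round-robin assignment of worker processes to GPUs. The sketch below illustrates that scheme only; `assign_devices` is a hypothetical helper, and the way FastEmbed actually maps workers to devices is an assumption here, not something confirmed by the source.

```python
def assign_devices(num_workers: int, device_ids: list[int]) -> list[int]:
    """Give each worker a GPU, cycling through device_ids when workers outnumber GPUs."""
    if not device_ids:
        raise ValueError("device_ids must contain at least one device")
    return [device_ids[worker % len(device_ids)] for worker in range(num_workers)]


# Six workers spread over four GPUs
assignment = assign_devices(6, [0, 1, 2, 3])
```

Each worker then loads its own copy of the model on its assigned device, which is why deferring the load with lazy_load=True matters in this setup.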

Lazy Loading for Multi-GPU

When using multiple GPUs, lazy_load=True defers model loading until first inference, which is essential for avoiding resource conflicts in multi-process scenarios.

embedding_model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    cuda=True,
    device_ids=[0, 1],
    lazy_load=True  # Load on-demand in worker processes
)

Sources: fastembed/text/onnx_embedding.py:54

ONNX Runtime Providers

Provider Selection

ONNX Runtime supports multiple execution providers. FastEmbed allows explicit provider specification via the providers parameter, which is mutually exclusive with the cuda parameter.

from fastembed import TextEmbedding

# Using explicit provider specification
model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

Provider Priority

When multiple providers are specified, ONNX Runtime attempts to use them in order of preference, falling back to subsequent providers if the preferred one is unavailable.

graph LR
    A[Query] --> B{CUDA Available?}
    B -->|Yes| C[CUDAExecutionProvider]
    B -->|No| D{CPU Provider Available?}
    D -->|Yes| E[CPUExecutionProvider]
    D -->|No| F[Error]
    
    C --> G[GPU Inference]
    E --> H[CPU Inference]
    
    style C fill:#4CAF50,color:#fff
    style E fill:#FF9800,color:#fff
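
The fallback logic in the diagram amounts to picking the first preferred provider that is actually available. The sketch below expresses that as a pure function; `select_provider` is a hypothetical helper, and in a real environment the list of available providers would come from `onnxruntime.get_available_providers()` rather than being hard-coded.

```python
def select_provider(preferred: list[str], available: list[str]) -> str:
    """Return the first preferred provider that the runtime reports as available."""
    for provider in preferred:
        if provider in available:
            return provider
    raise RuntimeError(f"None of the requested providers are available: {preferred}")


preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
on_gpu_machine = select_provider(
    preferred, ["CUDAExecutionProvider", "CPUExecutionProvider"]
)
on_cpu_machine = select_provider(preferred, ["CPUExecutionProvider"])
```

Listing CPUExecutionProvider last keeps the same code working on machines without a GPU.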

GPU Installation

Package Variants

FastEmbed offers separate packages for CPU and GPU operation.

| Package | Command | Use Case |
| --- | --- | --- |
| CPU (default) | pip install fastembed | Standard installations |
| GPU | pip install fastembed-gpu | NVIDIA GPU acceleration |

Sources: README.md:1-20

Qdrant Integration

For vector database workflows with GPU acceleration:

pip install qdrant-client[fastembed-gpu]

from fastembed import TextEmbedding

# GPU-accelerated embedding for Qdrant
model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    providers=["CUDAExecutionProvider"]
)
print("The model BAAI/bge-small-en-v1.5 is ready to use on a GPU.")

Sources: README.md:1-30

Implementation Across Model Classes

OnnxTextEmbedding

The primary text embedding class with full GPU support:

class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[NumpyArray]):
    """Implementation of the Flag Embedding model with ONNX acceleration."""
    
    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    ):

Sources: fastembed/text/onnx_embedding.py:48-70

OnnxImageEmbedding

The image embedding class accepts the same GPU-related parameters:

class OnnxImageEmbedding(ImageEmbeddingBase, OnnxImageModel[NumpyArray]):
    def __init__(
        self,
        model_name: str,
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    ):

Sources: fastembed/image/onnx_embedding.py:1-30

OnnxTextCrossEncoder

Reranking models support GPU acceleration for cross-encoder inference:

class OnnxTextCrossEncoder(TextCrossEncoderBase, OnnxCrossEncoderModel):
    def __init__(
        self,
        model_name: str,
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    ):

Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py:1-50

Unified TextEmbedding Entry Point

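The TextEmbedding class provides a unified entry point. It walks EMBEDDINGS_REGISTRY, compares model_name case-insensitively against each implementation's supported models, and delegates to the matching implementation: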

class TextEmbedding(TextEmbeddingBase):
    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        **kwargs: Any,
    ):
        super().__init__(model_name, cache_dir, threads, **kwargs)
        # Automatically routes to appropriate embedding type
        for EMBEDDING_MODEL_TYPE in self.EMBEDDINGS_REGISTRY:
            supported_models = EMBEDDING_MODEL_TYPE._list_supported_models()
            if any(model_name.lower() == model.model.lower() 
                   for model in supported_models):
                self.model = EMBEDDING_MODEL_TYPE(
                    model_name=model_name,
                    cache_dir=cache_dir,
                    threads=threads,
                    providers=providers,
                    cuda=cuda,
                    device_ids=device_ids,
                    lazy_load=lazy_load,
                )

Sources: fastembed/text/text_embedding.py:1-100
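
The registry lookup can be tried in isolation with stand-in classes. Everything in this sketch is hypothetical: DenseBackend, PooledBackend, the SUPPORTED lists, the example/... model names, and the route function are illustrations of the pattern, not FastEmbed's classes. Returning on the first match and raising on no match are assumptions of the sketch.

```python
from typing import List, Type


class DenseBackend:
    """Stand-in for one registered embedding implementation."""
    SUPPORTED = ["example/dense-small"]

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name


class PooledBackend:
    """Stand-in for a second registered implementation."""
    SUPPORTED = ["example/pooled-base"]

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name


REGISTRY: List[Type] = [DenseBackend, PooledBackend]


def route(model_name: str, registry: List[Type]) -> object:
    """Instantiate the first backend whose supported list contains model_name."""
    for backend in registry:
        # Case-insensitive comparison, as in the excerpt above.
        if any(model_name.lower() == m.lower() for m in backend.SUPPORTED):
            return backend(model_name)
    raise ValueError(f"Model {model_name!r} is not supported by any registered backend")


print(type(route("Example/Pooled-Base", REGISTRY)).__name__)
```

Keeping the registry as an ordered list means the first matching implementation wins, so the order of entries matters if two backends ever list the same model.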

Supported Models with GPU Acceleration

Text Embedding Models

| Model | Dimension | License | Size (GB) | Token Limit |
| --- | --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | 384 | MIT | 0.067 | 512 |
| BAAI/bge-base-en-v1.5 | 768 | MIT | 0.21 | 512 |
| BAAI/bge-large-en-v1.5 | 1024 | MIT | 1.20 | 512 |
| jinaai/jina-embeddings-v2-base-en | 768 | Apache 2.0 | 0.52 | 8192 |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Apache 2.0 | 0.09 | 256 |
| mixedbread-ai/mxbai-embed-large-v1 | 1024 | Apache 2.0 | 0.64 | 512 |
| nomic-ai/nomic-embed-text-v1.5 | 768 | Apache 2.0 | 0.13 | 8192 |

Sources: fastembed/text/onnx_embedding.py:1-150, fastembed/text/pooled_embedding.py:1-80
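
One way to use the table above is to filter it by requirements and take the smallest model that qualifies. The sketch below hard-codes the table values as listed on this page; ModelInfo and smallest_model are hypothetical names, and the figures should be re-checked against the installed library version before relying on them.

```python
from typing import List, NamedTuple, Optional


class ModelInfo(NamedTuple):
    name: str
    dim: int
    size_gb: float
    token_limit: int


# Values copied from the text embedding table on this page.
MODELS: List[ModelInfo] = [
    ModelInfo("BAAI/bge-small-en-v1.5", 384, 0.067, 512),
    ModelInfo("BAAI/bge-base-en-v1.5", 768, 0.21, 512),
    ModelInfo("BAAI/bge-large-en-v1.5", 1024, 1.20, 512),
    ModelInfo("jinaai/jina-embeddings-v2-base-en", 768, 0.52, 8192),
    ModelInfo("sentence-transformers/all-MiniLM-L6-v2", 384, 0.09, 256),
    ModelInfo("mixedbread-ai/mxbai-embed-large-v1", 1024, 0.64, 512),
    ModelInfo("nomic-ai/nomic-embed-text-v1.5", 768, 0.13, 8192),
]


def smallest_model(min_tokens: int, min_dim: int = 0) -> Optional[ModelInfo]:
    """Return the smallest model meeting both thresholds, or None."""
    candidates = [m for m in MODELS if m.token_limit >= min_tokens and m.dim >= min_dim]
    return min(candidates, key=lambda m: m.size_gb) if candidates else None


# Long-document use case: at least 4096 tokens of context.
print(smallest_model(min_tokens=4096).name)
```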

Image Embedding Models

| Model | Dimension | License | Size (GB) |
| --- | --- | --- | --- |
| Qdrant/Unicom-ViT-B-16 | 768 | Apache 2.0 | 0.82 |
| Qdrant/Unicom-ViT-B-32 | 512 | Apache 2.0 | 0.48 |
| jinaai/jina-clip-v1 | 768 | Apache 2.0 | 0.55 |

Sources: fastembed/image/onnx_embedding.py:1-50

Reranking Models

| Model | License | Size (GB) |
| --- | --- | --- |
| jinaai/jina-reranker-v1-turbo-en | Apache 2.0 | 0.15 |
| jinaai/jina-reranker-v2-base-multilingual | CC BY-NC 4.0 | 1.11 |

Sources: fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py:1-40

Best Practices

Device Selection

from fastembed import TextEmbedding
from fastembed.common.types import Device

# Recommended: Automatic detection
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cuda=Device.AUTO)

# Explicit CUDA
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cuda=True)

# Force CPU (for debugging)
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cuda=False)
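
The three settings above amount to a tri-state choice: force GPU, force CPU, or decide from what the machine offers. The sketch below models that decision with a stand-in enum. DeviceChoice and wants_gpu are hypothetical; only the AUTO member is confirmed by this page, and the CPU and CUDA member names are assumptions.

```python
from enum import Enum
from typing import Union


class DeviceChoice(str, Enum):
    """Stand-in for the library's Device enum (CPU and CUDA names are assumed)."""
    CPU = "cpu"
    CUDA = "cuda"
    AUTO = "auto"


def wants_gpu(cuda: Union[bool, DeviceChoice], cuda_available: bool) -> bool:
    """Decide whether GPU inference should be attempted."""
    if isinstance(cuda, bool):
        # Explicit True/False is honored as given.
        return cuda
    if cuda is DeviceChoice.AUTO:
        # AUTO defers to what the machine actually offers.
        return cuda_available
    return cuda is DeviceChoice.CUDA


print(wants_gpu(DeviceChoice.AUTO, cuda_available=False))
print(wants_gpu(True, cuda_available=False))
```

In this sketch an explicit True is honored even when CUDA is absent, so a missing GPU would show up as an error at load time rather than as a silent switch to CPU.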

Multi-GPU Inference

from fastembed import TextEmbedding

# For batch processing across multiple GPUs
model = TextEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    cuda=True,
    device_ids=[0, 1],  # Parallel GPU usage
    lazy_load=True
)
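
For data-parallel work, each worker process needs to end up on one of the listed devices. A common scheme is round-robin assignment, sketched below. This is an assumption about how such a mapping could look, not a description of FastEmbed's internals; assign_devices is a hypothetical helper.

```python
from typing import Dict, Sequence


def assign_devices(num_workers: int, device_ids: Sequence[int]) -> Dict[int, int]:
    """Map worker index -> device id, cycling through device_ids."""
    if not device_ids:
        raise ValueError("device_ids must contain at least one GPU id")
    return {worker: device_ids[worker % len(device_ids)] for worker in range(num_workers)}


# Four workers spread over two GPUs.
print(assign_devices(4, [0, 1]))
# {0: 0, 1: 1, 2: 0, 3: 1}
```

Round-robin keeps the per-GPU worker count within one of each other, which is usually enough when batches are similar in size.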

Provider Fallback

from fastembed import TextEmbedding

# Explicit provider chain with fallback
model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    providers=[
        "CUDAExecutionProvider",  # Preferred
        "CPUExecutionProvider"    # Fallback
    ]
)

Limitations and Considerations

| Aspect | Description |
| --- | --- |
| Mutual Exclusivity | providers and cuda parameters cannot be used together |
| Device ID Scope | device_ids is for data parallelism; device_id is for single-process loading |
| Lazy Loading | Required for multi-GPU setups to avoid resource conflicts |
| Model Support | All ONNX-exported models support GPU; not all models have ONNX exports |

Sources: fastembed/text/onnx_embedding.py:47-57
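
The lazy-loading requirement is easier to see with a generic example: the expensive resource is created on first use rather than in the constructor, so a parent process can build the object cheaply and each worker loads its own copy. LazyResource and fake_loader below are hypothetical and use no FastEmbed code.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Defer an expensive load until the first call to get()."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False

    @property
    def loaded(self) -> bool:
        return self._loaded

    def get(self) -> T:
        if not self._loaded:
            # First access: run the loader exactly once and cache the result.
            self._value = self._loader()
            self._loaded = True
        return self._value  # type: ignore[return-value]


load_calls: List[str] = []


def fake_loader() -> str:
    """Stand-in for creating an inference session on a specific device."""
    load_calls.append("load")
    return "fake-session"


session = LazyResource(fake_loader)
print(session.loaded)   # False: nothing has been loaded yet
print(session.get())    # triggers the load
print(len(load_calls))  # the loader ran once
```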

Summary

FastEmbed's GPU acceleration framework provides:

  1. Automatic device detection via the Device.AUTO enum value
  2. Flexible provider configuration through ONNX Runtime's provider system
  3. Multi-GPU support with device ID lists for data-parallel workloads
  4. Lazy loading for efficient multi-process GPU utilization
  5. Consistent API across text, image, and reranking models
  6. Seamless fallback to CPU when CUDA is unavailable

Sources: fastembed/text/onnx_embedding.py:1-50

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2

  • Severity: high
  • Finding: Installation risk is backed by a source signal: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/qdrant/fastembed/issues/618

2. Installation risk: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14

  • Severity: high
  • Finding: Installation risk is backed by a source signal: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/qdrant/fastembed/issues/630

3. Security or permission risk: Developers should check this security_permissions risk before relying on the project: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory

  • Severity: high
  • Finding: Developers should check this security_permissions risk before relying on the project: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory
  • User impact: Developers may expose sensitive permissions or credentials: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory. Context: Observed when using python
  • Evidence: failure_mode_cluster:github_issue | fmev_d3890c2b3360ccb937839f70fd4aa584 | https://github.com/qdrant/fastembed/issues/626 | [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory

4. Installation risk: Developers should check this installation risk before relying on the project: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.

  • Severity: medium
  • Finding: Developers should check this installation risk before relying on the project: The dependency py-rust-stemmers cannot be downloaded in a pure Python environment.
  • User impact: Developers may fail before the first successful local run: The dependency py-rust-stemmers cannot be downloaded in a pure Python environment.
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: The dependency py-rust-stemmers cannot be downloaded in a pure Python environment. Context: Observed when using python, docker
  • Evidence: failure_mode_cluster:github_issue | fmev_16e50a8626aff1576adeb1c0baab4785 | https://github.com/qdrant/fastembed/issues/466 | The dependency py-rust-stemmers cannot be downloaded in a pure Python environment.

5. Installation risk: Developers should check this installation risk before relying on the project: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2

  • Severity: medium
  • Finding: Developers should check this installation risk before relying on the project: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2
  • User impact: Developers may fail before the first successful local run: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2. Context: Observed when using python, windows, linux
  • Evidence: failure_mode_cluster:github_issue | fmev_04529bc774f1c961d4adeb7190edecd7 | https://github.com/qdrant/fastembed/issues/618 | [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2

6. Installation risk: Developers should check this installation risk before relying on the project: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14

  • Severity: medium
  • Finding: Developers should check this installation risk before relying on the project: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14
  • User impact: Developers may fail before the first successful local run: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14. Context: Observed when using python, macos, cuda
  • Evidence: failure_mode_cluster:github_issue | fmev_79a43347d96beb6d05eb6bfec2503fb5 | https://github.com/qdrant/fastembed/issues/630 | [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14

7. Installation risk: Developers should check this installation risk before relying on the project: v0.5.1

  • Severity: medium
  • Finding: Developers should check this installation risk before relying on the project: v0.5.1
  • User impact: Upgrade or migration may change expected behavior: v0.5.1
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v0.5.1. Context: Observed when using python
  • Evidence: failure_mode_cluster:github_release | fmev_8b37c58c613005c0182d0325aaf032f7 | https://github.com/qdrant/fastembed/releases/tag/v0.5.1 | v0.5.1

8. Installation risk: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: The dependency py-rust-stemmers cannot be downloaded in a pure Python environment. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/qdrant/fastembed/issues/466

9. Configuration risk: [Bug]: No timeout on model download — requests.get() can hang indefinitely

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: [Bug]: No timeout on model download — requests.get() can hang indefinitely. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/qdrant/fastembed/issues/627

10. Capability assumption: README/documentation is current enough for a first validation pass.

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: capability.assumptions | github_repo:666260877 | https://github.com/qdrant/fastembed | README/documentation is current enough for a first validation pass.

11. Project risk: Developers should check this runtime risk before relying on the project: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1

  • Severity: medium
  • Finding: Developers should check this runtime risk before relying on the project: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1
  • User impact: Developers may hit a documented source-backed failure mode: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1. Context: Observed when using python, linux
  • Evidence: failure_mode_cluster:github_issue | fmev_17b849ae47ffaf5d18cabbd577f373ca | https://github.com/qdrant/fastembed/issues/603 | [Bug]: Loading models with additional files fails with onnxruntime 1.24.1

12. Project risk: Developers should check this runtime risk before relying on the project: v0.4.2

  • Severity: medium
  • Finding: Developers should check this runtime risk before relying on the project: v0.4.2
  • User impact: Upgrade or migration may change expected behavior: v0.4.2
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v0.4.2. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_release | fmev_607c8ff157108b2b5fb78f55129b60f6 | https://github.com/qdrant/fastembed/releases/tag/v0.4.2 | v0.4.2

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using fastembed with real data or production workflows.

  • [[Bug]: Segmentation Fault or AssertionError during initialization on Pyt](https://github.com/qdrant/fastembed/issues/618) - github / github_issue
  • [[Bug]: No timeout on model download — requests.get() can hang indefinite](https://github.com/qdrant/fastembed/issues/627) - github / github_issue
  • [[Bug]: license error in pypi metadata](https://github.com/qdrant/fastembed/issues/620) - github / github_issue
  • [[Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14](https://github.com/qdrant/fastembed/issues/630) - github / github_issue
  • [[Bug]: Loading models with additional files fails with onnxruntime 1.24.](https://github.com/qdrant/fastembed/issues/603) - github / github_issue
  • [The dependency py-rust-stemmers cannot be downloaded in a pure Python](https://github.com/qdrant/fastembed/issues/466) - github / github_issue
  • [[Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary](https://github.com/qdrant/fastembed/issues/626) - github / github_issue
  • v0.8.0 - github / github_release
  • v0.7.4 - github / github_release
  • v0.7.2 - github / github_release
  • v0.7.1 - github / github_release
  • v0.7.0 - github / github_release

Source: Project Pack community evidence and pitfall evidence