# https://github.com/qdrant/fastembed Project Documentation

Generated: 2026-05-21 05:37:43 UTC

## Table of Contents

- [Introduction to FastEmbed](#page-introduction)
- [Installation Guide](#page-installation)
- [Quick Start Guide](#page-quickstart)
- [System Architecture](#page-architecture)
- [Text Embedding Module](#page-text-embedding)
- [Image Embedding Module](#page-image-embedding)
- [Sparse Embedding Models](#page-sparse-embedding)
- [Late Interaction Models](#page-late-interaction)
- [ONNX Model Infrastructure](#page-onnx-model)
- [GPU Support and Acceleration](#page-gpu-support)

<a id='page-introduction'></a>

## Introduction to FastEmbed

### Related Pages

Related topics: [Installation Guide](#page-installation), [System Architecture](#page-architecture), [Quick Start Guide](#page-quickstart)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)
- [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)
- [fastembed/text/pooled_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_embedding.py)
- [fastembed/text/pooled_normalized_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_normalized_embedding.py)
- [fastembed/image/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)
- [fastembed/sparse/bm25.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm25.py)
- [fastembed/sparse/minicoil.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/minicoil.py)
</details>

# Introduction to FastEmbed

FastEmbed is a lightweight, high-performance Python library designed for generating text and image embeddings using ONNX-based models. It provides a unified API for dense, sparse, and late-interaction embeddings with support for multiple embedding types and cross-encoder re-ranking models.

## Overview

FastEmbed serves as an embedding generation engine optimized for production use cases, particularly in vector search applications. The library emphasizes:

- **Performance**: Leverages ONNX Runtime for efficient CPU inference
- **Lightweight**: Minimal dependencies and small model sizes
- **Flexibility**: Supports dense, sparse, and multimodal embeddings
- **Ease of use**: Simple Python API for common embedding workflows

Source: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)

## Architecture

FastEmbed follows a modular architecture with separate components for different embedding types and processing stages.

```mermaid
graph TD
    A[FastEmbed API] --> B[Text Embedding]
    A --> C[Image Embedding]
    A --> D[Sparse Embedding]
    A --> E[Cross Encoder]
    
    B --> B1[OnnxEmbedding]
    B --> B2[PooledEmbedding]
    B --> B3[PooledNormalizedEmbedding]
    
    C --> C1[OnnxImageModel]
    
    D --> D1[BM25]
    D --> D2[SPLADE++]
    D --> D3[MiniCOIL]
    
    E --> E1[OnnxCrossEncoderModel]
    
    B1 & B2 & B3 --> F[ONNX Runtime]
    C1 --> F
    E1 --> F
    D1 & D2 & D3 --> G[Tokenization Engine]
```

### Core Components

| Component | Purpose | File Location |
|-----------|---------|---------------|
| `TextEmbedding` | Dense text embeddings | `fastembed/text/` |
| `ImageEmbedding` | Image embeddings | `fastembed/image/` |
| `SparseTextEmbedding` | Sparse embeddings (BM25, SPLADE) | `fastembed/sparse/` |
| `TextCrossEncoder` | Re-ranking models | `fastembed/rerank/` |
| `LateInteractionTextEmbedding` | Late interaction embeddings | `fastembed/late_interaction/` |

Source: [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)

## Supported Models

FastEmbed supports an extensive collection of pre-converted ONNX models across multiple categories.

### Text Embedding Models

Text embeddings are categorized by their pooling strategy and normalization approach.

#### Dense Models (Unimodal)

| Model | Dimension | Languages | Max Tokens | License |
|-------|-----------|-----------|------------|---------|
| `BAAI/bge-base-en-v1.5` | 768 | English | 512 | MIT |
| `BAAI/bge-large-en-v1.5` | 1024 | English | 512 | MIT |
| `BAAI/bge-small-en-v1.5` | 384 | English | 512 | MIT |
| `thenlper/gte-base` | 768 | English | 512 | MIT |
| `thenlper/gte-large` | 1024 | English | 512 | MIT |
| `snowflake/snowflake-arctic-embed-m` | 768 | English | 512 | Apache-2.0 |
| `snowflake/snowflake-arctic-embed-l` | 1024 | English | 512 | Apache-2.0 |

Source: [fastembed/text/onnx_embedding.py:1-50](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)

#### Multilingual Models

| Model | Dimension | Languages | Max Tokens |
|-------|-----------|-----------|------------|
| `intfloat/multilingual-e5-small` | 384 | ~100 | 512 |
| `intfloat/multilingual-e5-large` | 1024 | ~100 | 512 |
| `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` | 768 | ~50 | 384 |
| `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 384 | ~50 | 512 |

Source: [fastembed/text/pooled_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_embedding.py)

#### Jina AI Models

| Model | Dimension | Languages | Max Tokens |
|-------|-----------|-----------|------------|
| `jinaai/jina-embeddings-v2-base-en` | 768 | English | 8192 |
| `jinaai/jina-embeddings-v2-small-en` | 512 | English | 8192 |
| `jinaai/jina-embeddings-v2-base-zh` | 768 | Chinese/English | 8192 |
| `jinaai/jina-embeddings-v2-base-de` | 768 | German/English | 8192 |
| `jinaai/jina-embeddings-v2-base-code` | 768 | 30 languages | 8192 |

Source: [fastembed/text/pooled_normalized_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_normalized_embedding.py)

### Sparse Embedding Models

Sparse embeddings provide interpretable vectors with non-zero values only at specific token positions.

| Model | Type | Language | Description |
|-------|------|----------|-------------|
| `prithivida/Splade_PP_en_v1` | SPLADE++ | English | Sparse lexical + semantic |
| `Qdrant/minicoil-v1` | MiniCOIL | English | Semantic + keyword match |
| `Qdrant/bm25` | BM25 | Multilingual | Traditional BM25 ranking |

Source: [fastembed/sparse/bm25.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm25.py)

The MiniCOIL model combines semantic understanding with exact keyword matching:

```python
class MiniCOIL(SparseTextEmbeddingBase, OnnxTextModel[SparseEmbedding]):
    """
    MiniCOIL is a sparse embedding model, that resolves semantic meaning of the words,
    while keeping exact keyword match behavior.
    
    Each vocabulary token is converted into 4d component of a sparse vector, 
    which is then weighted by the token frequency in the corpus.
    If the token is not found in the corpus, it is treated exactly like in BM25.
    """
```

Source: [fastembed/sparse/minicoil.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/minicoil.py)

### Image Embedding Models

| Model | Dimension | Type | License |
|-------|-----------|------|---------|
| `Qdrant/resnet50-onnx` | 2048 | Image | Apache-2.0 |
| `Qdrant/Unicom-ViT-B-16` | 768 | Multimodal | Apache-2.0 |
| `Qdrant/Unicom-ViT-B-32` | 512 | Multimodal | Apache-2.0 |
| `jinaai/jina-clip-v1` | 768 | Multimodal | Apache-2.0 |

Source: [fastembed/image/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)

### Cross Encoder Reranking Models

| Model | Description | License | Size |
|-------|-------------|---------|------|
| `Xenova/ms-marco-MiniLM-L-12-v2` | MS MARCO passage ranking | Apache-2.0 | 0.12 GB |
| `BAAI/bge-reranker-base` | BGE reranker base | MIT | 1.04 GB |
| `jinaai/jina-reranker-v1-tiny-en` | Fast reranking, 8K context | Apache-2.0 | 0.13 GB |
| `jinaai/jina-reranker-v1-turbo-en` | Fast reranking, 8K context | Apache-2.0 | 0.15 GB |
| `jinaai/jina-reranker-v2-base-multilingual` | Multilingual, 1K context | CC-BY-NC-4.0 | 1.11 GB |

Source: [fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py](https://github.com/qdrant/fastembed/blob/main/fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py)

## Usage Patterns

### Text Embedding Generation

```python
from fastembed import TextEmbedding

# Initialize with default model
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Generate embeddings
documents = [
    "passage: The capital of France is Paris",
    "query: What is the capital of France?"
]

embeddings = list(model.embed(documents))
# Returns list of numpy arrays
```

Source: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)
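
Once vectors have been generated, a common next step is to compare them. The following is a minimal sketch in plain Python, assuming each embedding has been converted to a list of floats. The `cosine_similarity` helper and the toy vectors are illustrative only and are not part of FastEmbed.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "paris": [1.0, 0.0, 1.0],
    "berlin": [0.0, 1.0, 0.0],
}

# Order document names from most to least similar to the query.
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
```

With real embeddings, the same ranking logic would typically be delegated to a vector database rather than computed by hand.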

### Sparse Embedding with SPLADE++

```python
from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
embeddings = list(model.embed(documents))

# Returns:
# [
#   SparseEmbedding(indices=[ 17, 123, 919, ... ], values=[0.71, 0.22, 0.39, ...]),
#   SparseEmbedding(indices=[ 38,  12,  91, ... ], values=[0.11, 0.22, 0.39, ...])
# ]
```

Source: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)
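
The `indices` and `values` fields shown above are parallel arrays: each index identifies a token dimension and the matching value is its weight. As a rough sketch of how two such vectors could be compared, the hypothetical `sparse_dot` helper below computes their dot product over shared indices. It is not a FastEmbed function, and the numbers are made up.

```python
def sparse_dot(
    indices_a: list[int],
    values_a: list[float],
    indices_b: list[int],
    values_b: list[float],
) -> float:
    """Dot product of two sparse vectors given as parallel index/value lists."""
    weights_b = dict(zip(indices_b, values_b))
    # Dimensions missing from the second vector contribute zero.
    return sum(v * weights_b.get(i, 0.0) for i, v in zip(indices_a, values_a))


# Hypothetical sparse vectors; only index 123 appears in both.
score = sparse_dot([17, 123, 919], [0.5, 0.25, 1.0], [123, 38], [2.0, 4.0])
```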

### Custom Model Configuration

```python
from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource

# Register the custom configuration first, then load the model by name
TextEmbedding.add_custom_model(
    model="intfloat/multilingual-e5-small",
    pooling=PoolingType.MEAN,
    normalization=True,
    sources=ModelSource(hf="intfloat/multilingual-e5-small"),
    dim=384,
    model_file="onnx/model.onnx",
)

model = TextEmbedding(model_name="intfloat/multilingual-e5-small")
documents = ["query: What is the capital of France?"]
embeddings = list(model.embed(documents))
```

Source: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)

### Cross Encoder Reranking

```python
from fastembed import TextCrossEncoder

model = TextCrossEncoder(model_name="BAAI/bge-reranker-base")

# Score query-document pairs
query = "What is the capital of France?"
documents = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

scores = list(model.rerank(query, documents))
```

Source: [fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py](https://github.com/qdrant/fastembed/blob/main/fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py)
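
The scores only become useful once they are paired with their documents and sorted. The sketch below assumes that the reranker yields one score per document, in input order; the `rank_by_score` helper and the score values are hypothetical.

```python
def rank_by_score(documents: list[str], scores: list[float]) -> list[tuple[str, float]]:
    """Pair each document with its score and sort from most to least relevant."""
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


documents = ["Berlin is the capital of Germany.", "Paris is the capital of France."]
scores = [-3.2, 7.5]  # hypothetical relevance scores, one per document

ranking = rank_by_score(documents, scores)
```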

## Post-Processing: MUVERA

FastEmbed includes MUVERA (Multi-Vector Retrieval Algorithm), which converts late-interaction multi-vector embeddings (such as ColBERT's) into fixed dimensional encodings (FDEs).

```python
import numpy as np

from fastembed import LateInteractionTextEmbedding
from fastembed.postprocess import Muvera

model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
muvera = Muvera.from_multivector_model(
    model=model,
    k_sim=6,
    dim_proj=32
)

# Convert late-interaction embeddings to fixed dimension
embeddings = np.array(list(model.embed(["sample text"])))
fde = muvera.process_document(embeddings[0])
```

Source: [fastembed/postprocess/muvera.py](https://github.com/qdrant/fastembed/blob/main/fastembed/postprocess/muvera.py)
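
To give a feel for how `k_sim` and `dim_proj` affect the output size: in the MUVERA formulation, `k_sim` random hyperplanes split the space into `2**k_sim` buckets, each bucket is projected to `dim_proj` dimensions, and the whole procedure is repeated several times. The sketch below computes the resulting length under those assumptions. The `r_reps` parameter name and its value of 20 are assumptions for illustration, not values taken from this page.

```python
def fde_dimension(k_sim: int, dim_proj: int, r_reps: int) -> int:
    """Length of a fixed dimensional encoding: r_reps * 2**k_sim * dim_proj."""
    return r_reps * (2 ** k_sim) * dim_proj


# k_sim and dim_proj match the example above; r_reps=20 is an assumed value.
size = fde_dimension(k_sim=6, dim_proj=32, r_reps=20)
```

Each extra unit of `k_sim` doubles the encoding length, so it is worth choosing these parameters with the storage budget in mind.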

## Model Caching

FastEmbed automatically caches downloaded models to avoid repeated downloads.

| Setting | Environment Variable | Default Location |
|---------|---------------------|-------------------|
| Cache Path | `FASTEMBED_CACHE_PATH` | System temp directory |

Models are downloaded as pre-converted ONNX files and loaded from the cache on subsequent runs.

## Prefix Requirements

Some models require specific text prefixes for query and document inputs:

| Prefix Requirement | Models | Example |
|-------------------|--------|---------|
| **Necessary** | E5, Nomic, Snowflake Arctic, BGE-small | Query: `query: ...`, Document: `passage: ...` |
| **Not Necessary** | Jina, BGE (base/large), GTE | Plain text input |

Source: [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)
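
For models that need prefixes, it can help to add them in one place rather than by hand. The `with_prefix` helper below is a hypothetical convenience function, not part of FastEmbed, that prepends the `query: ` or `passage: ` prefix from the table above.

```python
def with_prefix(texts: list[str], kind: str) -> list[str]:
    """Prepend 'query: ' or 'passage: ' to every text."""
    prefixes = {"query": "query: ", "passage": "passage: "}
    if kind not in prefixes:
        raise ValueError(f"kind must be one of {sorted(prefixes)}")
    return [prefixes[kind] + text for text in texts]


queries = with_prefix(["What is the capital of France?"], "query")
passages = with_prefix(["The capital of France is Paris"], "passage")
```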

## Quick Reference

### Model Selection Guide

| Use Case | Recommended Model |
|----------|-------------------|
| English semantic search | `BAAI/bge-small-en-v1.5` |
| High-quality English | `BAAI/bge-large-en-v1.5` |
| Multilingual (100+ languages) | `intfloat/multilingual-e5-large` |
| Long documents (8K tokens) | `jinaai/jina-embeddings-v2-base-en` |
| Fast re-ranking | `jinaai/jina-reranker-v1-tiny-en` |
| Sparse lexical + semantic | `prithivida/Splade_PP_en_v1` |

### API Quick Reference

| Class | Import | Primary Method |
|-------|--------|----------------|
| Dense Text | `from fastembed import TextEmbedding` | `.embed(documents)` |
| Sparse Text | `from fastembed import SparseTextEmbedding` | `.embed(documents)` |
| Image | `from fastembed import ImageEmbedding` | `.embed(images)` |
| Cross Encoder | `from fastembed import TextCrossEncoder` | `.rerank(query, documents)` |
| Late Interaction | `from fastembed import LateInteractionTextEmbedding` | `.embed(documents)` |

## Further Documentation

- [FastEmbed GitHub Repository](https://github.com/qdrant/fastembed)
- [Qdrant Documentation](https://qdrant.tech/documentation/)
- [ONNX Runtime Documentation](https://onnxruntime.ai/docs/)

---

<a id='page-installation'></a>

## Installation Guide

### Related Pages

Related topics: [Introduction to FastEmbed](#page-introduction), [GPU Support and Acceleration](#page-gpu-support)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)
- [RELEASE.md](https://github.com/qdrant/fastembed/blob/main/RELEASE.md)
- [pyproject.toml](https://github.com/qdrant/fastembed/blob/main/pyproject.toml)
- [fastembed/image/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)
- [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)
- [fastembed/sparse/bm25.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm25.py)
- [mkdocs.yml](https://github.com/qdrant/fastembed/blob/main/mkdocs.yml)
</details>

# Installation Guide

FastEmbed is a lightweight, fast, and accurate embedding library developed by Qdrant. This guide covers all aspects of installing and configuring FastEmbed for various use cases including CPU inference, GPU acceleration, and integration with Qdrant vector database.

## Prerequisites

### System Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| Python Version | 3.9+ | 3.10+ |
| RAM | 4 GB | 8 GB+ |
| Disk Space | 2 GB | 5 GB+ |
| GPU (Optional) | CUDA 11.8+ | CUDA 12.x |

Verify your Python version before installation:

```bash
python --version
```

Source: [README.md:1-100](https://github.com/qdrant/fastembed/blob/main/README.md)

## Package Variants

FastEmbed is distributed in two package variants:

| Package | Description | Use Case |
|---------|-------------|----------|
| `fastembed` | CPU-only version | General purpose embedding generation |
| `fastembed-gpu` | GPU-accelerated version | High-throughput production workloads |

### CPU Installation

Install the standard CPU version using pip:

```bash
pip install fastembed
```

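Sources: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md), [RELEASE.md:1-15](https://github.com/qdrant/fastembed/blob/main/RELEASE.md)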

### GPU Installation

For CUDA-enabled GPU acceleration:

```bash
pip install fastembed-gpu
```

The GPU package automatically includes the CUDA Execution Provider for ONNX Runtime, enabling significantly faster inference on NVIDIA GPUs.

Sources: [RELEASE.md:1-15](https://github.com/qdrant/fastembed/blob/main/RELEASE.md), [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)

### Installation Verification

Verify successful installation:

```python
from fastembed import TextEmbedding

# List available models
print(TextEmbedding.list_supported_models())

# Initialize a model
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
print("FastEmbed installed successfully!")
```

## Supported Embedding Models

FastEmbed supports multiple embedding modalities organized into the following categories:

### Dense Text Embeddings

Dense text embeddings provide fixed-dimensional vector representations for text. Supported models include:

| Model | Dimension | Languages | Token Limit | License |
|-------|-----------|-----------|-------------|---------|
| `BAAI/bge-small-en-v1.5` | 384 | English | 512 | MIT |
| `BAAI/bge-base-en-v1.5` | 768 | English | 512 | MIT |
| `BAAI/bge-large-en-v1.5` | 1024 | English | 512 | MIT |
| `jinaai/jina-embeddings-v2-base-en` | 768 | English | 8192 | Apache 2.0 |
| `jinaai/jina-embeddings-v2-small-en` | 512 | English | 8192 | Apache 2.0 |
| `sentence-transformers/all-MiniLM-L6-v2` | 384 | English | 256 | Apache 2.0 |
| `mixedbread-ai/mxbai-embed-large-v1` | 1024 | English | 512 | Apache 2.0 |

Source: [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)

### Multilingual Models

| Model | Dimension | Languages | Token Limit | License |
|-------|-----------|-----------|-------------|---------|
| `intfloat/multilingual-e5-small` | 384 | ~100 | 512 | MIT |
| `intfloat/multilingual-e5-large` | 1024 | ~100 | 512 | MIT |
| `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` | 768 | ~50 | 384 | Apache 2.0 |
| `jinaai/jina-embeddings-v2-base-de` | 768 | German, English | 8192 | Apache 2.0 |
| `jinaai/jina-embeddings-v2-base-zh` | 768 | Chinese, English | 8192 | Apache 2.0 |
| `jinaai/jina-embeddings-v2-base-es` | 768 | Spanish, English | 8192 | Apache 2.0 |

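Sources: [fastembed/text/pooled_normalized_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_normalized_embedding.py), [fastembed/text/pooled_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_embedding.py)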

### Sparse Embeddings

Sparse embeddings represent text using high-dimensional sparse vectors, useful for keyword-based retrieval.

| Model | Type | License |
|-------|------|---------|
| `prithivida/Splade_PP_en_v1` | SPLADE++ | Apache 2.0 |
| `Qdrant/bm25` | BM25 | Apache 2.0 |

Sources: [fastembed/sparse/bm25.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm25.py), [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)

### Image Embeddings

| Model | Dimension | Type | License |
|-------|-----------|------|---------|
| `Qdrant/resnet50-onnx` | 2048 | Image | Apache 2.0 |
| `Qdrant/Unicom-ViT-B-16` | 768 | Multimodal | Apache 2.0 |
| `Qdrant/Unicom-ViT-B-32` | 512 | Multimodal | Apache 2.0 |
| `jinaai/jina-clip-v1` | 768 | Multimodal (text&image) | Apache 2.0 |

Source: [fastembed/image/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)

### Reranking Models

Cross-encoder models for re-ranking search results:

| Model | Context Length | License |
|-------|----------------|---------|
| `Xenova/ms-marco-MiniLM-L-12-v2` | - | Apache 2.0 |
| `BAAI/bge-reranker-base` | - | MIT |
| `jinaai/jina-reranker-v1-tiny-en` | 8K | Apache 2.0 |
| `jinaai/jina-reranker-v1-turbo-en` | 8K | Apache 2.0 |
| `jinaai/jina-reranker-v2-base-multilingual` | 1K | CC-BY-NC-4.0 |

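Source: [fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py](https://github.com/qdrant/fastembed/blob/main/fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py)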

## GPU Configuration

### CUDA Provider Setup

Enable GPU acceleration by specifying the CUDA execution provider:

```python
from fastembed import TextEmbedding
from fastembed.common import OnnxProvider

model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    providers=["CUDAExecutionProvider"]
)
print("The model BAAI/bge-small-en-v1.5 is ready to use on a GPU.")
```

Source: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)

### GPU Installation Workflow

```mermaid
graph TD
    A[Install fastembed-gpu] --> B{Check CUDA Version}
    B -->|CUDA 11.8+| C[Install Compatible Driver]
    B -->|CUDA < 11.8| D[Upgrade CUDA]
    C --> E[Verify ONNX Runtime GPU Support]
    E --> F[Import TextEmbedding]
    F --> G[Configure Providers]
    G --> H[GPU Inference Ready]
```

## Qdrant Integration

FastEmbed integrates seamlessly with the Qdrant vector database for production deployments.

### Installation with Qdrant Client

```bash
# Standard Qdrant with FastEmbed
pip install qdrant-client[fastembed]

# GPU-accelerated FastEmbed with Qdrant
pip install qdrant-client[fastembed-gpu]
```

On zsh shells, use quotes:

```bash
pip install 'qdrant-client[fastembed]'
```

Source: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)

### Complete Qdrant Integration Example

```python
from qdrant_client import QdrantClient, models

# Initialize the client
client = QdrantClient("localhost", port=6333)  # For production
# client = QdrantClient(":memory:")  # For experimentation

model_name = "sentence-transformers/all-MiniLM-L6-v2"
payload = [
    {"document": "Qdrant has Langchain integrations", "source": "Langchain-docs"},
    {"document": "Qdrant also has Llama Index integrations", "source": "LlamaIndex-docs"},
]
docs = [models.Document(text=data["document"], model=model_name) for data in payload]
ids = [42, 2]

client.create_collection(
    "demo_collection",
    vectors_config=models.VectorParams(
        size=client.get_embedding_size(model_name), distance=models.Distance.COSINE
    )
)

client.upload_collection(
    collection_name="demo_collection",
    vectors=docs,
    ids=ids,
    payload=payload,
)

search_result = client.query_points(
    collection_name="demo_collection",
    query=docs[0],
    limit=5,
)
```

Source: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)

## Cache Configuration

### Default Cache Location

Models are cached in `fastembed_cache` within the system's temp directory by default. This location can be customized using environment variables.

### FASTEMBED_CACHE_PATH

Set a custom cache directory:

```bash
export FASTEMBED_CACHE_PATH=/path/to/custom/cache
```

The cache directory structure follows Hugging Face's conventions, with models organized by their source repository.
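
The lookup order described above (environment variable first, then a `fastembed_cache` folder in the system temp directory) can be sketched as follows. This mirrors the documented behavior only; `resolve_cache_dir` is a hypothetical helper and not the library's own code.

```python
import os
import tempfile
from pathlib import Path


def resolve_cache_dir() -> Path:
    """Return FASTEMBED_CACHE_PATH if it is set, else <tempdir>/fastembed_cache."""
    custom = os.environ.get("FASTEMBED_CACHE_PATH")
    if custom:
        return Path(custom)
    return Path(tempfile.gettempdir()) / "fastembed_cache"


cache_dir = resolve_cache_dir()
```

Because the system temp directory may be cleared between reboots, setting `FASTEMBED_CACHE_PATH` to a persistent location can avoid repeated downloads.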

### Model Loading Behavior

```mermaid
graph TD
    A[Import FastEmbed Module] --> B{Model in Cache?}
    B -->|Yes| C[Load from Cache]
    B -->|No| D[Download from Source]
    D --> E{HuggingFace Available?}
    E -->|Yes| F[Download from HF Hub]
    E -->|No| G[Use URL Source]
    C --> H[Initialize ONNX Session]
    F --> H
    G --> H
    H --> I[Model Ready for Inference]
```

## Usage Examples

### Dense Text Embedding

```python
from fastembed import TextEmbedding

documents = [
    "passage: FastEmbed is a fast embedding library",
    "query: What is FastEmbed?",
    "passage: Qdrant is a vector database",
]

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = list(model.embed(documents))

print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimension: {embeddings[0].shape}")
```

### Sparse Text Embedding (SPLADE++)

```python
from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
embeddings = list(model.embed(documents))

# Output format:
# [
#   SparseEmbedding(indices=[ 17, 123, 919, ... ], values=[0.71, 0.22, 0.39, ...]),
#   SparseEmbedding(indices=[ 38,  12,  91, ... ], values=[0.11, 0.22, 0.39, ...])
# ]
```

Source: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)

### Image Embedding

```python
from fastembed import ImageEmbedding
from PIL import Image

model = ImageEmbedding(model_name="Qdrant/resnet50-onnx")
images = [Image.open("path/to/image.jpg")]
embeddings = list(model.embed(images))
```

### Custom Model Source

Load a supported model with custom configuration:

```python
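from fastembed import TextEmbedding
from fastembed.common.model_description import ModelSource, PoolingType

# Register the custom configuration first, then load the model by name
TextEmbedding.add_custom_model(
    model="intfloat/multilingual-e5-small",
    pooling=PoolingType.MEAN,
    normalization=True,
    sources=ModelSource(hf="intfloat/multilingual-e5-small"),
    dim=384,
    model_file="onnx/model.onnx",
)

model = TextEmbedding(model_name="intfloat/multilingual-e5-small")
embeddings = list(model.embed(documents))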
```

Source: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)

## Late Interaction Models

Late interaction models like ColBERT enable more sophisticated similarity matching:

```python
from fastembed import LateInteractionTextEmbedding

model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
embeddings = list(model.embed(documents))
```
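
Unlike dense models, a late interaction model returns one vector per token, and relevance is usually computed with ColBERT-style MaxSim: for each query token, take its best match among the document tokens, then sum those maxima. The following is a minimal plain-Python sketch of that scoring rule; the `maxsim` helper and the toy values are illustrative and not part of FastEmbed.

```python
def maxsim(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""

    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

score = maxsim(query, doc)  # best matches are 1.0 and 2.0, so the score is 3.0
```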

## Post-Processing with Muvera

The Muvera post-processor enables Fixed Dimensional Encoding (FDE) for multi-vector models:

```python
from fastembed import LateInteractionTextEmbedding
from fastembed.postprocess import Muvera
import numpy as np

model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
muvera = Muvera.from_multivector_model(
    model=model,
    k_sim=6,
    dim_proj=32
)

embeddings = np.array(list(model.embed(["sample text"])))
fde = muvera.process_document(embeddings[0])
```

Source: [fastembed/postprocess/muvera.py](https://github.com/qdrant/fastembed/blob/main/fastembed/postprocess/muvera.py)

## Troubleshooting

### Common Installation Issues

| Issue | Solution |
|-------|----------|
| `ModuleNotFoundError: No module named 'fastembed'` | Run `pip install fastembed` |
| CUDA not available | Install `fastembed-gpu` and verify NVIDIA driver |
| Model download fails | Check network connectivity and HuggingFace access |
| Out of memory | Reduce batch size or use smaller model variant |

### Checking Installed Version

```python
import fastembed
print(fastembed.__version__)
```

### Verifying GPU Availability

```python
import onnxruntime as ort
print(f"Available providers: {ort.get_available_providers()}")
```
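
The provider list printed above can also drive the choice of execution providers. The sketch below assumes the list has already been obtained from `ort.get_available_providers()` and is passed in as plain strings; `pick_providers` is a hypothetical helper, not a FastEmbed or ONNX Runtime function.

```python
def pick_providers(available: list[str]) -> list[str]:
    """Keep CUDA (if reported) ahead of CPU and drop providers that are unavailable."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [name for name in preferred if name in available]


# In practice `available` would come from ort.get_available_providers().
gpu_choice = pick_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
cpu_choice = pick_providers(["CPUExecutionProvider"])
```

The resulting list could then be passed as the `providers` argument shown in the GPU configuration example.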

## Advanced Configuration

### Lazy Loading

Enable lazy model loading for memory-efficient initialization:

```python
from fastembed import TextEmbedding

model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    lazy_load=True  # Model loads on first inference call
)
```

### Thread Configuration

Optimize CPU thread usage:

```python
from fastembed import TextEmbedding

model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    threads=4  # Limit to 4 threads
)
```

### Device Selection

```python
from fastembed import TextEmbedding
from fastembed.common import Device

model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    cuda=True  # or Device.AUTO for automatic detection
)
```

## Documentation Resources

For more information, refer to the official documentation:

- [FastEmbed Documentation](https://qdrant.github.io/fastembed/)
- [FastEmbed GitHub Repository](https://github.com/qdrant/fastembed/)
- [HuggingFace Model Hub](https://huggingface.co/Qdrant/fastembed)

Source: [mkdocs.yml](https://github.com/qdrant/fastembed/blob/main/mkdocs.yml)

---

<a id='page-quickstart'></a>

## Quick Start Guide

### Related Pages

Related topics: [Introduction to FastEmbed](#page-introduction), [Text Embedding Module](#page-text-embedding), [Image Embedding Module](#page-image-embedding)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)
- [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)
- [fastembed/text/pooled_normalized_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_normalized_embedding.py)
- [fastembed/image/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)
- [fastembed/sparse/minicoil.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/minicoil.py)
- [fastembed/sparse/bm25.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm25.py)
- [fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py](https://github.com/qdrant/fastembed/blob/main/fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py)
</details>

# Quick Start Guide

FastEmbed is a lightweight, fast, and accurate embedding library developed by Qdrant. It provides text and image embeddings using ONNX-based models optimized for production deployment. This guide covers the essential steps to get started with FastEmbed for your embedding needs.

## Overview

FastEmbed enables developers to generate high-quality vector embeddings for text and images with minimal configuration. The library supports multiple embedding types including dense embeddings, sparse embeddings, and cross-encoder reranking models.

```mermaid
graph TD
    A[FastEmbed Library] --> B[Text Embeddings]
    A --> C[Image Embeddings]
    A --> D[Sparse Embeddings]
    A --> E[Reranking Models]
    
    B --> B1[Dense Models]
    B --> B2[Pooled Models]
    
    D --> D1[SPLADE++]
    D --> D2[BM25]
    D --> D3[MiniCOIL]
```

## Installation

Install FastEmbed using pip:

```bash
pip install fastembed
```

For CUDA acceleration support:

```bash
pip install fastembed-gpu
```

## Basic Text Embedding

### Loading a Model

Import and initialize a text embedding model:

```python
from fastembed import TextEmbedding

# Use the default model (BAAI/bge-small-en-v1.5)
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
```

### Generating Embeddings

```python
documents = [
    "passage: A man is eating food.",
    "passage: A man is eating a piece of broccoli.",
    "passage: A man is eating pasta.",
    "passage: A woman is cutting vegetables.",
]

embeddings = list(model.embed(documents))
```

The `"passage:"` prefix is required for some models, such as `BAAI/bge-small-en-v1.5` and `snowflake/snowflake-arctic-embed-xs`, to indicate the text type.

Source: [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)

## Supported Models

### Text Embedding Models

| Model Name | Dimension | Languages | Max Tokens | License | Size (GB) |
|------------|-----------|-----------|------------|---------|-----------|
| BAAI/bge-small-en-v1.5 | 384 | English | 512 | MIT | 0.067 |
| BAAI/bge-base-en-v1.5 | 768 | English | 512 | MIT | 0.21 |
| BAAI/bge-large-en-v1.5 | 1024 | English | 512 | MIT | 1.20 |
| intfloat/multilingual-e5-small | 384 | Multilingual (~100) | 512 | MIT | 0.09 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 | Multilingual (~50) | 512 | Apache 2.0 | 0.22 |
| jinaai/jina-embeddings-v2-base-en | 768 | English | 8192 | Apache 2.0 | 0.52 |
| snowflake/snowflake-arctic-embed-xs | 384 | English | 512 | Apache 2.0 | 0.09 |
| snowflake/snowflake-arctic-embed-m-long | 768 | English | 2048 | Apache 2.0 | 0.54 |

Source: [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)

### Pooled Normalized Embeddings

The `PooledNormalizedEmbedding` class applies mean pooling to the model output and normalizes the result:

```python
from fastembed.text.pooled_normalized_embedding import PooledNormalizedEmbedding

model = PooledNormalizedEmbedding(model_name="jinaai/jina-embeddings-v2-base-en")
```

The post-processing combines mean pooling with L2 normalization:

```python
def _post_process_onnx_output(self, output: OnnxOutputContext, **kwargs: Any) -> Iterable[NumpyArray]:
    embeddings = output.model_output
    attn_mask = output.attention_mask
    return normalize(self.mean_pooling(embeddings, attn_mask))
```

Source: [fastembed/text/pooled_normalized_embedding.py:78-84](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_normalized_embedding.py)
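
To make the two post-processing steps concrete, here is a small plain-Python sketch of mean pooling followed by L2 normalization for a single sequence. The library itself operates on NumPy arrays; the `mean_pool` and `l2_normalize` helpers and the toy values below are illustrative stand-ins, not the library's functions.

```python
import math


def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors whose attention mask is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    return [sum(column) / len(kept) for column in zip(*kept)]


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Two real tokens plus one padding token (hypothetical values).
tokens = [[2.0, 0.0], [4.0, 8.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)    # [3.0, 4.0]
embedding = l2_normalize(pooled)    # [0.6, 0.8]
```

Normalizing to unit length means that a plain dot product between two embeddings equals their cosine similarity.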

## Image Embeddings

FastEmbed supports multimodal models for image embeddings:

```python
from fastembed import ImageEmbedding

model = ImageEmbedding(model_name="Qdrant/Unicom-ViT-B-16")
```

### Supported Image Models

| Model Name | Dimension | Type | License | Size (GB) |
|------------|-----------|------|---------|-----------|
| Qdrant/Unicom-ViT-B-16 | 768 | Multimodal (text&image) | Apache 2.0 | 0.82 |
| Qdrant/Unicom-ViT-B-32 | 512 | Multimodal (text&image) | Apache 2.0 | 0.48 |
| jinaai/jina-clip-v1 | 768 | Multimodal (text&image) | Apache 2.0 | 0.34 |

Source: [fastembed/image/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)

## Sparse Embeddings

Sparse embeddings provide interpretable, non-dense vector representations useful for keyword-aware semantic search.

### SPLADE++

SPLADE++ is a learned sparse embedding model that captures both lexical and semantic signals:

```python
from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
embeddings = list(model.embed(documents))
```

Returns sparse vectors with indices and values:

```
[
  SparseEmbedding(indices=[17, 123, 919, ...], values=[0.71, 0.22, 0.39, ...]),
  SparseEmbedding(indices=[38, 12, 91, ...], values=[0.11, 0.22, 0.39, ...])
]
```

### BM25

Traditional BM25 implemented as sparse embeddings:

```python
from fastembed.sparse.bm25 import Bm25

model = Bm25(model_name="Qdrant/bm25", language="english")
```

BM25 formula:

```
score(q, d) = SUM[ IDF(q_i) * (f(q_i, d) * (k + 1)) / (f(q_i, d) + k * (1 - b + b * (|d| / avg_len))) ]
```

Source: [fastembed/sparse/bm25.py:47-52](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm25.py)

### MiniCOIL

MiniCOIL combines semantic embeddings with exact keyword matching:

```python
from fastembed.sparse.minicoil import MiniCOIL

model = MiniCOIL(model_name="Qdrant/minicoil-v1")
```

## Reranking Models

Cross-encoder reranking improves search results by re-scoring candidate documents:

```python
from fastembed import TextCrossEncoder

model = TextCrossEncoder(model_name="jinaai/jina-reranker-v1-turbo-en")
```

### Supported Reranker Models

| Model Name | License | Size (GB) | Context Length |
|------------|---------|-----------|----------------|
| jinaai/jina-reranker-v1-turbo-en | Apache 2.0 | 0.15 | 1K |
| jinaai/jina-reranker-v2-base-multilingual | CC BY-NC 4.0 | 1.11 | 1K (sliding window) |

Source: [fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py](https://github.com/qdrant/fastembed/blob/main/fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py)
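The re-scoring step can be sketched without loading a model. The example below assumes the reranker produces one relevance score per candidate, in input order; the helper `rank_by_score` and the scores are hypothetical.

```python
from typing import Optional

def rank_by_score(documents: list[str], scores: list[float], top_k: Optional[int] = None):
    """Pair each document with its score and sort by descending relevance."""
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

docs = ["doc about cats", "doc about FastEmbed", "doc about cooking"]
scores = [-2.1, 4.7, -5.3]  # hypothetical cross-encoder outputs
top = rank_by_score(docs, scores, top_k=2)
```

Keeping the truncation (`top_k`) outside the model call makes it easy to change how many candidates are passed downstream.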

## Common Configuration Options

### Initialization Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_name` | str | "BAAI/bge-small-en-v1.5" | Name of the model to use |
| `cache_dir` | str or None | None | Cache directory path |
| `threads` | int or None | None | Number of threads for ONNX execution |
| `providers` | Sequence[OnnxProvider] | None | ONNX execution providers (CPU, CUDA, etc.) |
| `cuda` | bool or Device | Device.AUTO | Enable CUDA acceleration |
| `device_ids` | list[int] | None | Specific GPU device IDs |
| `lazy_load` | bool | False | Defer model loading until first use |

Source: [fastembed/text/onnx_embedding.py:123-136](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)

## Complete Example Workflow

```mermaid
graph LR
    A[Input Documents] --> B[TextEmbedding]
    B --> C[Embedding Vectors]
    C --> D[Vector Database]
    D --> E[Similarity Search]
    E --> F[Candidate Results]
    F --> G[TextCrossEncoder]
    G --> H[Reranked Results]
```

```python
from fastembed import TextEmbedding, SparseTextEmbedding, TextCrossEncoder

# 1. Load models
text_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
sparse_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
reranker = TextCrossEncoder(model_name="jinaai/jina-reranker-v1-turbo-en")

# 2. Generate dense embeddings
documents = ["query: fastembed is fast", "passage: FastEmbed provides efficient embeddings"]
dense_embeddings = list(text_model.embed(documents))

# 3. Generate sparse embeddings
sparse_embeddings = list(sparse_model.embed(documents))

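# 4. Rerank: rerank() yields one relevance score per document, in input order
query = "What is FastEmbed?"
scores = list(reranker.rerank(query, documents))
results = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)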
```

## Language-Specific Models

For non-English content or source code, use a language-specific or multilingual model:

| Language | Recommended Model |
|----------|-------------------|
| Chinese | `BAAI/bge-small-zh-v1.5`, `jinaai/jina-embeddings-v2-base-zh` |
| German | `jinaai/jina-embeddings-v2-base-de` |
| Spanish | `jinaai/jina-embeddings-v2-base-es` |
| Code | `jinaai/jina-embeddings-v2-base-code` |
| Multilingual | `intfloat/multilingual-e5-small`, `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` |

## Prefix Requirements

Some models require specific prefixes to distinguish query and passage text:

| Model | Prefix Required | Example |
|-------|-----------------|---------|
| BAAI/bge-small-en-v1.5 | Not necessary | Plain text |
| intfloat/multilingual-e5-small | Yes | `"query: ..."`, `"passage: ..."` |
| snowflake/snowflake-arctic-embed-xs | Yes | `"query: ..."`, `"passage: ..."` |
| jinaai/jina-embeddings-v2-base-en | Not necessary | Plain text |

Source: [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)
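One way to apply these prefixes consistently is a small helper that knows which models need them. This is a sketch, not part of FastEmbed: `PREFIX_REQUIRED` and `with_prefix` are hypothetical names, and the set holds only two of the models the table lists as requiring prefixes.

```python
# Two of the models listed as requiring prefixes (illustrative subset)
PREFIX_REQUIRED = {
    "intfloat/multilingual-e5-small",
    "snowflake/snowflake-arctic-embed-xs",
}

def with_prefix(model_name: str, text: str, kind: str) -> str:
    """Prepend "query: " or "passage: " only for models that need it."""
    if kind not in ("query", "passage"):
        raise ValueError(f"kind must be 'query' or 'passage', got {kind!r}")
    if model_name in PREFIX_REQUIRED:
        return f"{kind}: {text}"
    return text

e5_query = with_prefix("intfloat/multilingual-e5-small", "What is FastEmbed?", "query")
jina_doc = with_prefix("jinaai/jina-embeddings-v2-base-en", "FastEmbed is a library.", "passage")
```

Centralizing the rule avoids mixing prefixed and unprefixed text in the same index.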

## Next Steps

- Explore the [API Reference](../api_reference/) for detailed method documentation
- Learn about [Model Selection](../models/) to choose the right embedding model
- Read [Advanced Usage](../guides/advanced_usage/) for performance optimization
- Check [Examples](../examples/) for real-world use cases

---

<a id='page-architecture'></a>

## System Architecture

### Related Pages

Related topics: [Introduction to FastEmbed](#page-introduction), [ONNX Model Infrastructure](#page-onnx-model)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [fastembed/text/text_embedding_base.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/text_embedding_base.py)
- [fastembed/image/image_embedding_base.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/image_embedding_base.py)
- [fastembed/common/onnx_model.py](https://github.com/qdrant/fastembed/blob/main/fastembed/common/onnx_model.py)
- [fastembed/common/model_management.py](https://github.com/qdrant/fastembed/blob/main/fastembed/common/model_management.py)
- [fastembed/parallel_processor.py](https://github.com/qdrant/fastembed/blob/main/fastembed/parallel_processor.py)
</details>

# System Architecture

FastEmbed is a lightweight, fast text and image embedding library built on ONNX Runtime. The architecture is designed around modularity, enabling multiple embedding types (dense, sparse, late-interaction) through a unified interface while leveraging ONNX for cross-platform inference optimization.

## Architecture Overview

FastEmbed follows a layered architecture with clear separation of concerns:

```mermaid
graph TD
    subgraph "Public API Layer"
        A["TextEmbedding<br/>ImageEmbedding<br/>SparseTextEmbedding<br/>LateInteractionTextEmbedding"]
    end
    
    subgraph "Embedding Base Classes"
        B["TextEmbeddingBase"]
        C["ImageEmbeddingBase"]
        D["SparseTextEmbeddingBase"]
        E["LateInteractionTextEmbeddingBase"]
    end
    
    subgraph "ONNX Abstraction Layer"
        F["OnnxTextModel"]
        G["OnnxImageModel"]
    end
    
    subgraph "Model Management"
        H["ModelManagement"]
        I["OnnxModel"]
    end
    
    subgraph "Parallel Processing"
        J["ParallelProcessor"]
        K["EmbeddingWorker<br/>OnnxTextEmbeddingWorker"]
    end
    
    subgraph "ONNX Runtime"
        L["InferenceSession"]
    end
    
    A --> B
    A --> C
    A --> D
    A --> E
    B --> F
    C --> G
    F --> I
    G --> I
    I --> L
    H --> I
    J --> K
    K --> F
```

## Core Components

### Base Class Hierarchy

FastEmbed implements embedding models through inheritance hierarchies that separate concerns between the embedding interface and the ONNX runtime integration.

#### Text Embedding Base

The `TextEmbeddingBase` class provides the foundation for all text embedding models. It defines the abstract interface that concrete implementations must follow.

| Method | Purpose | Source |
|--------|---------|--------|
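| `embed()` | Generate embeddings for input documents | [text_embedding_base.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/text_embedding_base.py) |
| `_list_supported_models()` | Return list of supported model descriptions | [text_embedding_base.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/text_embedding_base.py) |
| `_post_process_onnx_output()` | Transform raw ONNX output to final embeddings | [text_embedding_base.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/text_embedding_base.py) |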

#### Image Embedding Base

The `ImageEmbeddingBase` class mirrors the text embedding architecture for image inputs, supporting multimodal models like Jina CLIP.

| Property | Type | Description |
|----------|------|-------------|
| `model_name` | str | HuggingFace model identifier |
| `cache_dir` | str \| None | Local cache directory for model files |
| `lazy_load` | bool | Defer model loading until first use |

### ONNX Model Abstraction

The `OnnxModel` class serves as the bridge between the embedding interface and ONNX Runtime execution.

```mermaid
classDiagram
    class OnnxModel~T~ {
        +str model_name
        +str cache_dir
        +InferenceSession inference_session
        +load_model() Any
        +run(input_feed)~T~
    }
    
    class OnnxTextModel~T~ {
        +mean_pooling(output, attention_mask)
        +encode(input_data) Iterable~T~
    }
    
    class OnnxImageModel~T~ {
        +preprocess_image(image) Tensor
        +encode(images) Iterable~T~
    }
    
    OnnxModel <|-- OnnxTextModel
    OnnxModel <|-- OnnxImageModel
```

The ONNX model abstraction provides:

1. **Model Loading**: Downloads and caches ONNX models from HuggingFace or custom URLs
2. **Session Management**: Creates and configures ONNX Runtime inference sessions with specified providers (CPU, CUDA, TensorRT)
3. **Input/Output Handling**: Manages input preprocessing and output postprocessing

## Embedding Types

FastEmbed supports multiple embedding paradigms through specialized classes.

### Dense Embeddings

Dense embeddings convert inputs into fixed-dimensional continuous vectors. The `OnnxTextEmbedding` class provides dense text embeddings using models like BGE, GTE, and Jina embeddings.

```mermaid
graph LR
    A["Input Text"] --> B["Tokenization"]
    B --> C["ONNX Inference"]
    C --> D["mean_pooling"]
    D --> E["L2 Normalization"]
    E --> F["Dense Vector"]
```

Key supported models:

| Model | Dimension | Context Length | Prefix Required |
|-------|-----------|----------------|------------------|
| BAAI/bge-small-en-v1.5 | 384 | 512 | No |
| BAAI/bge-base-en-v1.5 | 768 | 512 | No |
| BAAI/bge-large-en-v1.5 | 1024 | 512 | No |
| jinaai/jina-embeddings-v2-base-en | 768 | 8192 | No |
| intfloat/multilingual-e5-small | 384 | 512 | Yes |

### Pooled Normalized Embeddings

The `PooledNormalizedEmbedding` class extends `PooledEmbedding` with built-in mean pooling and L2 normalization.

```python
# From pooled_normalized_embedding.py
class PooledNormalizedEmbedding(PooledEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        if output.attention_mask is None:
            raise ValueError("attention_mask must be provided for document post-processing")
        
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return normalize(self.mean_pooling(embeddings, attn_mask))
```

### Sparse Embeddings

Sparse embeddings represent documents as sparse vectors with non-zero values only at the indices of relevant vocabulary terms. This page covers two of FastEmbed's sparse approaches:

#### BM25

BM25 is implemented as a traditional sparse embedding model with IDF weighting. The formula used:

```
score(q, d) = Σ[ IDF(q_i) * (f(q_i, d) * (k + 1)) / (f(q_i, d) + k * (1 - b + b * (|d| / avg_len))) ]
```

Where:
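- `IDF(q_i)` is the inverse document frequency of query term `q_i`
- `f(q_i, d)` is the frequency of term `q_i` in document `d`
- `|d|` is the length of document `d`, and `avg_len` is the average document length in the corpus
- `k` and `b` are hyperparameters: `k` controls term-frequency saturation and `b` controls length normalization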
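The formula can be checked by hand with a small pure-Python sketch. The function `bm25_score`, the toy document, and the IDF values are hypothetical; the defaults `k=1.2` and `b=0.75` are conventional choices assumed here, not values confirmed by this page.

```python
def bm25_score(query_terms, doc_terms, idf, k=1.2, b=0.75, avg_len=256.0):
    """Score one tokenized document against a query with the BM25 formula."""
    doc_len = len(doc_terms)  # |d|
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # f(q_i, d)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        length_norm = 1 - b + b * (doc_len / avg_len)
        score += idf.get(term, 0.0) * (tf * (k + 1)) / (tf + k * length_norm)
    return score

doc = ["fast", "embedding", "library", "fast"]
idf = {"fast": 1.0, "library": 2.0}  # hypothetical IDF values
score = bm25_score(["fast", "library"], doc, idf, avg_len=4.0)
```

With `|d|` equal to `avg_len`, the length term is 1, so "fast" (two occurrences) contributes 4.4 / 3.2 = 1.375 and "library" contributes 2.0, for a total of 3.375. Doubling the term frequency does not double the contribution, which is the saturation effect controlled by `k`.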

#### SPLADE++

SPLADE++ models use neural networks to generate sparse embeddings that combine semantic understanding with exact keyword matching.

### Late Interaction Embeddings

Late interaction models like ColBERT generate multiple token-level embeddings that can be compared efficiently during retrieval. These are often combined with postprocessors like MUVERA for fixed-dimensional encoding.
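A common way to compare token-level embeddings is the MaxSim operator associated with ColBERT: each query token is matched with its most similar document token, and the best matches are summed. The sketch below assumes that scoring rule and uses a hypothetical function name and toy vectors.

```python
def max_sim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]              # two query token vectors
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]]    # three document token vectors
score = max_sim(query, doc)
```

Here the first query token matches the second document token exactly (1.0) and the second query token's best match is 0.5, giving a score of 1.5.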

## Model Management

### Model Source Configuration

Models can be loaded from multiple sources defined through the `ModelSource` class:

```python
sources=ModelSource(
    hf="jinaai/jina-embeddings-v2-base-en",      # HuggingFace Hub
    url="https://storage.googleapis.com/...",      # Direct URL download
    _deprecated_tar_struct=True                    # Legacy tar format
)
```

### Cache Strategy

The `ModelManagement` class handles model caching and lazy loading:

1. **Cache Directory**: Configurable via `FASTEMBED_CACHE_PATH` environment variable or `cache_dir` parameter
2. **Lazy Loading**: Models are not loaded until first inference when `lazy_load=True`
3. **Provider Selection**: Automatic provider selection (CUDA > CPU) with fallback

```mermaid
graph TD
    A["Model Request"] --> B{"Cache Hit?"}
    B -->|Yes| C["Load from Cache"]
    B -->|No| D["Download Model"]
    D --> E["Store in Cache"]
    C --> F["Initialize ONNX Session"]
    E --> F
    F --> G["Ready for Inference"]
```
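The cache flow in the diagram can be sketched in a few lines. This is an illustrative stand-in, not FastEmbed's implementation: `ensure_model`, the directory layout, and the `fake_download` stub are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable

def ensure_model(cache_dir: Path, model_name: str, download: Callable[[Path], None]) -> Path:
    """Return the cached model path, downloading only on a cache miss."""
    target = cache_dir / model_name.replace("/", "--") / "model.onnx"
    if not target.exists():  # cache miss
        target.parent.mkdir(parents=True, exist_ok=True)
        download(target)     # store in cache
    return target            # ready to be loaded into a session

calls = []

def fake_download(path: Path) -> None:
    calls.append(path)
    path.write_bytes(b"stub")  # stands in for a real ONNX file

with tempfile.TemporaryDirectory() as tmp:
    first = ensure_model(Path(tmp), "BAAI/bge-small-en-v1.5", fake_download)
    second = ensure_model(Path(tmp), "BAAI/bge-small-en-v1.5", fake_download)
    download_count = len(calls)
```

The second call finds the file on disk and skips the download, which corresponds to the "Cache Hit" branch.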

## Parallel Processing

FastEmbed uses a worker-based parallel processing architecture for efficient batch inference.

### ParallelProcessor

The `ParallelProcessor` class manages a pool of workers for parallel embedding generation:

```mermaid
graph TD
    subgraph "Main Process"
        A["ParallelProcessor"]
        B["Input Queue"]
    end
    
    subgraph "Worker Pool"
        C["Worker 1"]
        D["Worker 2"]
        E["Worker N"]
    end
    
    B --> C
    B --> D
    B --> E
    
    C --> F["Result Queue"]
    D --> F
    E --> F
```
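The queue-based fan-out shown in the diagram can be approximated as follows. This sketch uses threads in place of worker processes to stay short, and `run_pool` is a hypothetical helper rather than the `ParallelProcessor` API.

```python
import queue
import threading

def run_pool(batches, process, num_workers=2):
    """Fan batches out to workers and collect results in input order."""
    input_q, result_q = queue.Queue(), queue.Queue()

    def worker():
        while True:
            item = input_q.get()
            if item is None:  # sentinel: no more work
                break
            idx, batch = item
            result_q.put((idx, process(batch)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for idx, batch in enumerate(batches):
        input_q.put((idx, batch))
    for _ in threads:
        input_q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    results = [result_q.get() for _ in range(len(batches))]
    # Workers finish in arbitrary order, so restore the input order by index
    return [out for _, out in sorted(results)]

lengths = run_pool([["a", "bb"], ["ccc"], ["dddd", "e"]], lambda b: [len(s) for s in b])
```

Tagging each batch with its index is what allows results to be reassembled in the original order regardless of which worker finishes first.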

### Worker Classes

Workers inherit from `EmbeddingWorker` or specialized variants:

| Worker Class | Purpose |
|--------------|---------|
| `EmbeddingWorker` | Base worker for generic embeddings |
| `OnnxTextEmbeddingWorker` | Text embedding with ONNX inference |
| `PooledNormalizedEmbeddingWorker` | Worker with pooling and normalization |

Workers implement the `init_embedding()` method to initialize the embedding model:

```python
def init_embedding(
    self,
    model_name: str,
    cache_dir: str | None = None,
    threads: int | None = None,
    providers: Sequence[OnnxProvider] | None = None,
    cuda: bool | Device = Device.AUTO,
    device_ids: list[int] | None = None,
    lazy_load: bool = False,
    device_id: int | None = None,
    specific_model_path: str | None = None,
    **kwargs: Any,
):
    ...  # body omitted in this excerpt
```

## Cross-Encoder Reranking

The cross-encoder reranking system follows a separate architectural path:

```mermaid
graph LR
    A["Query"] --> D["Cross-Encoder"]
    B["Candidate Doc"] --> D
    D --> E["Relevance Scores"]
```

The `OnnxTextCrossEncoder` class provides reranking through:

```python
class OnnxTextCrossEncoder(TextCrossEncoderBase, OnnxCrossEncoderModel):
    @classmethod
    def _list_supported_models(cls) -> list[BaseModelDescription]:
        return supported_onnx_models
```

Supported reranker models:

| Model | Context Length | Languages |
|-------|----------------|-----------|
| jinaai/jina-reranker-v1-turbo-en | 1K | English |
| jinaai/jina-reranker-v2-base-multilingual | 1K (sliding window) | Multilingual |

## Device and Provider Management

FastEmbed supports multiple execution providers with automatic device selection:

```python
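# Device selection logic (simplified sketch, not the library's exact code)
import onnxruntime as ort

def select_providers(cuda: bool) -> list[str]:  # hypothetical helper
    available = ort.get_available_providers()
    if cuda and "CUDAExecutionProvider" in available:
        return ["CUDAExecutionProvider"]  # GPU requested and usable
    # CUDA not requested, or requested but unavailable: fall back to CPU
    return ["CPUExecutionProvider"]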
```

### Supported Providers

| Provider | Device | Use Case |
|----------|--------|----------|
| CPUExecutionProvider | CPU | General purpose, no GPU |
| CUDAExecutionProvider | NVIDIA GPU | Fast inference with CUDA |
| TensorrtExecutionProvider | NVIDIA GPU | Optimized batch inference |

## Configuration Options

### Model Initialization Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_name` | str | "BAAI/bge-small-en-v1.5" | Model identifier |
| `cache_dir` | str \| None | None | Cache directory path |
| `threads` | int \| None | None | Thread count for CPU execution |
| `providers` | Sequence[OnnxProvider] \| None | None | ONNX execution providers |
| `cuda` | bool \| Device | Device.AUTO | CUDA device selection |
| `device_ids` | list[int] \| None | None | Specific GPU device IDs |
| `lazy_load` | bool | False | Defer loading until first use |
| `device_id` | int \| None | None | Single device ID assignment |
| `specific_model_path` | str \| None | None | Override model file path |

## Data Flow

### Text Embedding Pipeline

```mermaid
sequenceDiagram
    participant Client
    participant TextEmbedding
    participant OnnxTextModel
    participant ONNXRuntime
    
    Client->>TextEmbedding: embed(documents)
    TextEmbedding->>TextEmbedding: preprocess(documents)
    TextEmbedding->>OnnxTextModel: encode(preprocessed)
    OnnxTextModel->>ONNXRuntime: run(input_feed)
    ONNXRuntime-->>OnnxTextModel: raw_output
    OnnxTextModel->>OnnxTextModel: mean_pooling + normalize
    OnnxTextModel-->>TextEmbedding: embeddings
    TextEmbedding-->>Client: Iterable[NDArray]
```

### Model Description Schema

Each model is described by a `DenseModelDescription` or `BaseModelDescription`:

```python
DenseModelDescription(
    model="BAAI/bge-small-en-v1.5",
    dim=384,
    description="Text embeddings, Unimodal (text), English, 512 input tokens truncation",
    license="mit",
    size_in_GB=0.067,
    sources=ModelSource(hf="qdrant/bge-small-en-v1.5-onnx-q"),
    model_file="model_optimized.onnx",
)
```

| Field | Type | Description |
|-------|------|-------------|
| `model` | str | HuggingFace model identifier |
| `dim` | int | Embedding dimension |
| `description` | str | Human-readable model description |
| `license` | str | Model license |
| `size_in_GB` | float | Model file size |
| `sources` | ModelSource | Download sources configuration |
| `model_file` | str | ONNX model file name |
| `additional_files` | list[str] | Extra files (vocab, configs) |
| `requires_idf` | bool | IDF file requirement for sparse models |
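To show how such descriptions can drive model lookup, here is a minimal sketch of a registry. `ModelInfo`, `REGISTRY`, and `find_model` are hypothetical stand-ins that mirror a subset of the fields above; they are not the library's classes.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelInfo:  # hypothetical stand-in for the library's description classes
    model: str
    dim: int
    license: str
    size_in_GB: float
    model_file: str
    additional_files: list[str] = field(default_factory=list)

REGISTRY = [
    ModelInfo("BAAI/bge-small-en-v1.5", 384, "mit", 0.067, "model_optimized.onnx"),
]

def find_model(name: str) -> ModelInfo:
    """Case-insensitive lookup that fails loudly for unknown models."""
    for info in REGISTRY:
        if info.model.lower() == name.lower():
            return info
    raise ValueError(f"Model {name} is not supported")

info = find_model("baai/BGE-small-en-v1.5")
```

Keeping metadata such as `dim` next to the download information lets callers size vector collections before any model file is loaded.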

## Post-Processing

### MUVERA Post-Processor

The `Muvera` post-processor converts late-interaction (multi-vector) embeddings to fixed-dimensional representations:

```mermaid
graph TD
    A["Multi-Vector Embeddings"] --> B["Muvera Processing"]
    B --> C["Fixed-Dimensional Encoding"]
    C --> D["L2 Normalized FDE"]
    
    subgraph "Parameters"
        E["k_sim: number of partitions"]
        F["dim_proj: projection dimension"]
        G["r_reps: repetitions"]
    end
```

Output dimension calculation: `r_reps * 2^k_sim * dim_proj`
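As a quick worked example of that formula, the sketch below evaluates it for one set of parameter values. The function name and the chosen values are illustrative only.

```python
def muvera_output_dim(k_sim: int, dim_proj: int, r_reps: int) -> int:
    """Fixed-dimensional encoding size: r_reps * 2**k_sim * dim_proj."""
    return r_reps * (2 ** k_sim) * dim_proj

# 20 repetitions * 2**5 partitions * 16 projected dimensions
fde_dim = muvera_output_dim(k_sim=5, dim_proj=16, r_reps=20)
```

Because the size grows exponentially in `k_sim`, increasing it by one doubles the output dimension.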

## Summary

In summary, the FastEmbed architecture provides:

- **Modularity**: Clear separation between embedding types, ONNX abstraction, and model management
- **Performance**: ONNX Runtime integration with automatic provider selection
- **Flexibility**: Support for dense, sparse, and late-interaction embeddings
- **Extensibility**: Worker-based parallel processing with lazy loading support
- **Portability**: Cross-platform ONNX execution with CUDA and TensorRT acceleration

---

<a id='page-text-embedding'></a>

## Text Embedding Module

### Related Pages

Related topics: [System Architecture](#page-architecture), [GPU Support and Acceleration](#page-gpu-support)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)
- [fastembed/text/text_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/text_embedding.py)
- [fastembed/text/onnx_text_model.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_text_model.py)
- [fastembed/text/pooled_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_embedding.py)
- [fastembed/text/clip_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/clip_embedding.py)
- [fastembed/text/pooled_normalized_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_normalized_embedding.py)
</details>

# Text Embedding Module

The Text Embedding Module in FastEmbed vectorizes text using ONNX Runtime for inference. It supports dense embeddings, pooled embeddings, pooled and normalized embeddings, and multimodal (CLIP) text embeddings.

## Architecture Overview

The module follows a layered architecture with base classes providing common functionality and specialized implementations for different embedding strategies.

```mermaid
graph TD
    Base[TextEmbeddingBase] --> OnnxTextEmbedding
    OnnxTextEmbedding --> PooledEmbedding
    PooledEmbedding --> PooledNormalizedEmbedding
    OnnxTextEmbedding --> CLIPOnnxEmbedding
    
    Worker[OnnxTextEmbeddingWorker] -.->|init_embedding| OnnxTextEmbedding
    CLIPWorker[CLIPEmbeddingWorker] -.->|init_embedding| CLIPOnnxEmbedding
    
    Models[Supported Models] -->|BGE, Jina, GTE, etc| OnnxTextEmbedding
```

### Core Components

| Component | File | Purpose |
|-----------|------|---------|
| `TextEmbeddingBase` | `text_embedding_base.py` | Abstract base class defining the embedding interface |
| `OnnxTextEmbedding` | `onnx_embedding.py` | Main ONNX-based text embedding implementation |
| `PooledEmbedding` | `pooled_embedding.py` | Mean pooling variant for document embeddings |
| `PooledNormalizedEmbedding` | `pooled_normalized_embedding.py` | L2-normalized pooled embeddings |
| `CLIPOnnxEmbedding` | `clip_embedding.py` | CLIP-based multimodal text embeddings |

## Supported Models

FastEmbed's Text Embedding Module supports numerous pre-trained models across different languages and use cases.

### Model Categories

| Category | Models | Dim | License | Description |
|----------|--------|-----|---------|-------------|
| BGE (Base) | `BAAI/bge-base-en-v1.5` | 768 | MIT | English text, 512 tokens |
| BGE Large | `BAAI/bge-large-en-v1.5` | 1024 | MIT | High-quality English embeddings |
| BGE Small | `BAAI/bge-small-en-v1.5` | 384 | MIT | Lightweight English embeddings |
| Jina | `jinaai/jina-embeddings-v2-base-en` | 768 | Apache 2.0 | English, 8192 tokens |
| Jina Code | `jinaai/jina-embeddings-v2-base-code` | 768 | Apache 2.0 | 30 programming languages |
| GTE | `thenlper/gte-base` | 768 | MIT | General text embeddings |
| Snowflake Arctic | `snowflake/snowflake-arctic-embed-m` | 768 | Apache 2.0 | Query-optimized embeddings |
| Multilingual E5 | `intfloat/multilingual-e5-large` | 1024 | MIT | 100 languages support |

### Model Selection Criteria

Different models have varying requirements for query/document prefixes:

```python
# Models requiring prefixes
prefix_required = ["intfloat/multilingual-e5-small", "intfloat/multilingual-e5-large"]

# Models not requiring prefixes  
prefix_not_required = ["BAAI/bge-base-en-v1.5", "jinaai/jina-embeddings-v2-base-en"]
```

Source: [fastembed/text/onnx_embedding.py:1-200](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)

## Class Hierarchy

### TextEmbeddingBase

The abstract base class defining the contract for all text embedding implementations.

```python
class TextEmbeddingBase(EmbeddingModel[list[float]]):
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        ...
    
    def embed(self, documents: Iterable[str]) -> Iterable[list[float]]:
        ...
```

Source: [fastembed/text/text_embedding_base.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/text_embedding_base.py)

### OnnxTextEmbedding

The primary implementation class, which runs inference through ONNX Runtime.

```python
class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[NumpyArray]):
    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    )
```

#### Constructor Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_name` | `str` | `"BAAI/bge-small-en-v1.5"` | Name of the model to use |
| `cache_dir` | `str \| None` | `None` | Cache directory path |
| `threads` | `int \| None` | `None` | Number of threads for ONNX |
| `providers` | `Sequence[OnnxProvider]` | `None` | ONNX execution providers |
| `cuda` | `bool \| Device` | `Device.AUTO` | CUDA acceleration |
| `device_ids` | `list[int]` | `None` | GPU device IDs |
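| `lazy_load` | `bool` | `False` | Load model on first use |
| `device_id` | `int \| None` | `None` | Single device ID assignment |
| `specific_model_path` | `str \| None` | `None` | Override model file path |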

Source: [fastembed/text/onnx_embedding.py:200-250](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)

## Pooling Strategies

The module implements multiple pooling strategies to aggregate token-level embeddings into sentence-level embeddings.

### Mean Pooling

Mean pooling computes the average of all token embeddings weighted by the attention mask.

```python
def mean_pooling(self, embeddings: NumpyArray, attention_mask: NumpyArray) -> NumpyArray:
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dimension
    attention_mask_expanded = np.expand_dims(attention_mask, -1)
    # Sum token embeddings, zeroing out padding positions
    sum_embeddings = np.sum(embeddings * attention_mask_expanded, axis=1)
    # Count valid tokens per sequence; clamp to avoid division by zero
    counts = np.maximum(np.sum(attention_mask_expanded, axis=1), 1e-9)
    # Return the masked mean
    return sum_embeddings / counts
```
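A tiny worked example makes the role of the attention mask concrete. The pure-Python sketch below handles a single sequence; `masked_mean` and the numbers are illustrative and independent of the library code.

```python
def masked_mean(token_vectors: list[list[float]], mask: list[int]) -> list[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    count = max(sum(mask), 1)  # avoid division by zero for an all-padding row
    return [
        sum(vec[d] * m for vec, m in zip(token_vectors, mask)) / count
        for d in range(dim)
    ]

tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row is padding
pooled = masked_mean(tokens, mask=[1, 1, 0])
```

Only the first two rows are averaged, so the padding row `[9.0, 9.0]` does not distort the result `[2.0, 3.0]`.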

### PooledEmbedding

Applies mean pooling after ONNX inference to generate document embeddings.

```python
class PooledEmbedding(OnnxTextEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return self.mean_pooling(embeddings, attn_mask)
```

### PooledNormalizedEmbedding

Extends pooling with L2 normalization for cosine similarity optimization.

```python
class PooledNormalizedEmbedding(PooledEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return normalize(self.mean_pooling(embeddings, attn_mask))
```

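Sources: [fastembed/text/pooled_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_embedding.py), [fastembed/text/pooled_normalized_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_normalized_embedding.py)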

## CLIP Text Embeddings

The CLIP embedding implementation provides multimodal text/image embedding capabilities.

### Supported CLIP Models

| Model | Dimension | License | Description |
|-------|-----------|---------|-------------|
| `Qdrant/clip-ViT-B-32-text` | 512 | MIT | CLIP ViT-B/32 text encoder |
| `jinaai/jina-clip-v1` | 768 | Apache 2.0 | Jina CLIP multimodal |

### CLIPOnnxEmbedding

```python
class CLIPOnnxEmbedding(OnnxTextEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        return output.model_output  # Direct passthrough, no pooling
```

Source: [fastembed/text/clip_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/clip_embedding.py)

## Inference Workflow

```mermaid
graph LR
    A[Input Text] --> B[Tokenization]
    B --> C[ONNX Inference]
    C --> D{Embedding Type?}
    D -->|Standard| E[Direct Output]
    D -->|Pooled| F[Mean Pooling]
    D -->|Pooled Norm| G[Mean Pooling + Normalize]
    E --> H[Final Embeddings]
    F --> H
    G --> H
```

## Usage Examples

### Basic Dense Embedding

```python
from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")
documents = [
    "The quick brown fox jumps over the lazy dog",
    "A journey of a thousand miles begins with a single step"
]

embeddings = list(model.embed(documents))
# Returns: list of 768-dimensional embedding vectors
```

### Pooled Normalized Embedding

```python
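from fastembed.text.pooled_normalized_embedding import PooledNormalizedEmbedding

documents = ["The quick brown fox jumps over the lazy dog"]
model = PooledNormalizedEmbedding(model_name="jinaai/jina-embeddings-v2-base-en")
embeddings = list(model.embed(documents))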
# Returns: L2-normalized pooled embeddings
```

### With Custom ONNX Providers

```python
from fastembed import TextEmbedding

# Providers are passed as ONNX Runtime provider names
model = TextEmbedding(
    model_name="BAAI/bge-large-en-v1.5",
    providers=["CPUExecutionProvider"],
    threads=8
)
```

### Multilingual E5 with Prefix

```python
from fastembed import TextEmbedding

model = TextEmbedding(model_name="intfloat/multilingual-e5-small")

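user_query = "What is FastEmbed?"
document = "FastEmbed is a lightweight embedding library."

# E5 models expect "query: " and "passage: " prefixes
query_text = "query: " + user_query
document_text = "passage: " + document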

query_embedding = list(model.embed([query_text]))
doc_embedding = list(model.embed([document_text]))
```

## Model Sources Configuration

Models can be loaded from multiple sources:

```python
from fastembed.common.model_description import ModelSource

sources=ModelSource(
    hf="xenova/jina-embeddings-v2-base-en",      # HuggingFace Hub
    url="https://storage.googleapis.com/...",     # Direct URL
    _deprecated_tar_struct=True                   # Legacy format
)
```

| Source Type | Priority | Description |
|-------------|----------|-------------|
| `hf` | Primary | HuggingFace Hub repository |
| `url` | Fallback | Direct download URL |
| Local cache | Cached | Previously downloaded files |

## Post-Processing Pipeline

```mermaid
graph TD
    subgraph "ONNX Output"
        A[model_output] --> B[attention_mask]
    end
    
    subgraph "Post-Processing"
        B --> C{Masking Required?}
        A --> C
        C -->|Yes| D[Apply Attention Mask]
        D --> E{Normalization?}
        C -->|No| E
        E -->|Yes| F[L2 Normalize]
        E -->|No| G[Return Raw]
        F --> H[Final Output]
        G --> H
    end
```

## Configuration Constants

| Parameter | Default | Description |
|-----------|---------|-------------|
| `default_model` | `"BAAI/bge-small-en-v1.5"` | Fallback model |
| `pooling_type` | `PoolingType.MEAN` | Pooling strategy |
| `normalization` | `True` | L2 normalization flag |

Source: [fastembed/text/onnx_text_model.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_text_model.py)

## Integration with Qdrant

FastEmbed text embeddings can be stored directly in the Qdrant vector database:

```python
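from fastembed import TextEmbedding
from qdrant_client import QdrantClient, models

documents = ["FastEmbed is lightweight", "Qdrant is a vector database"]
model = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")
embeddings = list(model.embed(documents))

# In-memory Qdrant instance for local experiments
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="text_embeddings",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
)
# Upload one point per document, keeping the text as payload
client.upsert(
    collection_name="text_embeddings",
    points=[
        models.PointStruct(id=i, vector=emb.tolist(), payload={"text": doc})
        for i, (doc, emb) in enumerate(zip(documents, embeddings))
    ],
)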
```

## Performance Considerations

### Token Truncation Limits

| Model Type | Max Tokens | Notes |
|------------|------------|-------|
| BGE Small/Large | 512 | Standard context |
| Jina v2 | 8192 | Extended context |
| Multilingual E5 | 512 | Query-optimized |
| Arctic Embed | 512-2048 | Variable by model |

### Hardware Acceleration

The module automatically detects and utilizes available hardware:

1. **CUDA** - GPU acceleration via CUDAExecutionProvider
2. **CPU** - Multi-threaded via CPUExecutionProvider
3. **CoreML** - Apple Silicon support via CoreMLExecutionProvider

Source: [fastembed/text/onnx_embedding.py:250-300](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)

## Error Handling

### Common Error Cases

```python
from fastembed import TextEmbedding

# ValueError: pooled post-processing requires an attention_mask
# in the ONNX output context

# ValueError: unknown model names are rejected at construction time
try:
    model = TextEmbedding(model_name="unknown/model")
except ValueError as err:
    print(f"Unsupported model: {err}")
```

### Validation Requirements

| Check | Condition | Error Type |
|-------|-----------|------------|
| `attention_mask` | Required for pooled | `ValueError` |
| `model_name` | Must be in supported list | `ValueError` |
| `cache_dir` | Valid path | `OSError` |

## See Also

- [Sparse Text Embeddings](fastembed/sparse/) - SPLADE++ and BM25
- [Image Embeddings](fastembed/image/) - Vision model embeddings
- [Reranking Models](fastembed/rerank/) - Cross-encoder reranking
- [Post-Processors](fastembed/postprocess/) - MUVERA and other processors

---

<a id='page-image-embedding'></a>

## Image Embedding Module

### Related Pages

Related topics: [System Architecture](#page-architecture), [Late Interaction Models](#page-late-interaction)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [fastembed/image/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)
- [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)
- [fastembed/text/clip_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/clip_embedding.py)
- [fastembed/text/pooled_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_embedding.py)
- [fastembed/image/image_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/image_embedding.py)
</details>

# Image Embedding Module

The Image Embedding Module in FastEmbed provides functionality for generating vector representations (embeddings) from images using ONNX-based models. This module enables efficient image similarity search, image clustering, and multimodal retrieval applications.

## Architecture Overview

The image embedding system follows a layered architecture pattern, separating the model definitions, ONNX inference logic, and embedding base classes.

```mermaid
graph TD
    A[Image Input] --> B[ImageEmbeddingBase]
    B --> C[OnnxImageModel]
    C --> D[OnnxImageEmbedding]
    D --> E[ONNX Runtime]
    E --> F[Embedding Vectors]
    
    G[Supported Models] --> D
    G -.-> H[ResNet50]
    G -.-> I[Unicom-ViT-B-16]
    G -.-> J[Unicom-ViT-B-32]
    G -.-> K[jinaai/jina-clip-v1]
```

## Supported Image Models

The module supports multiple image embedding models with varying dimensions and capabilities.

| Model | Dimension | Type | License | Size (GB) | HF Source |
|-------|-----------|------|---------|-----------|-----------|
| `Qdrant/resnet50-onnx` | 2048 | Image only | apache-2.0 | 0.10 | [link](https://huggingface.co/Qdrant/resnet50-onnx) |
| `Qdrant/Unicom-ViT-B-16` | 768 | Multimodal (text&image) | apache-2.0 | 0.82 | [link](https://huggingface.co/Qdrant/Unicom-ViT-B-16) |
| `Qdrant/Unicom-ViT-B-32` | 512 | Multimodal (text&image) | apache-2.0 | 0.48 | [link](https://huggingface.co/Qdrant/Unicom-ViT-B-32) |
| `jinaai/jina-clip-v1` | 768 | Multimodal (text&image) | apache-2.0 | 0.34 | [link](https://huggingface.co/jinaai/jina-clip-v1) |

Source: [fastembed/image/onnx_embedding.py:1-40](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)

## Core Classes

### OnnxImageEmbedding

The main class for generating image embeddings extends `ImageEmbeddingBase` and `OnnxImageModel[NumpyArray]`.

```python
class OnnxImageEmbedding(ImageEmbeddingBase, OnnxImageModel[NumpyArray]):
```

**Class Hierarchy:**

```mermaid
graph LR
    A[ImageEmbeddingBase] -->|inheritance| C[OnnxImageEmbedding]
    B["OnnxImageModel[NumpyArray]"] -->|inheritance| C
```

The class follows the same architectural pattern as `OnnxTextEmbedding`, sharing the ONNX inference infrastructure with text embedding models. Source: [fastembed/image/onnx_embedding.py:44-50](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)

### Constructor Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_name` | `str` | `"Qdrant/clip-ViT-B-32"` | Name of the model to use |
| `cache_dir` | `str \| None` | `None` | Path to cache directory |
| `threads` | `int \| None` | `None` | Number of threads for inference |
| `providers` | `Sequence[OnnxProvider] \| None` | `None` | ONNX execution providers |
| `cuda` | `bool \| Device` | `Device.AUTO` | CUDA device configuration |
| `device_ids` | `list[int] \| None` | `None` | Specific device IDs |
| `lazy_load` | `bool` | `False` | Load model lazily |
| `device_id` | `int \| None` | `None` | Specific device ID |

## Multimodal Image Models

### CLIP-based Models

The `Unicom-ViT-B-16` and `Unicom-ViT-B-32` models support both image and text inputs, enabling cross-modal retrieval scenarios.

| Model | Vision Dimension | Description |
|-------|------------------|-------------|
| Unicom-ViT-B-16 | 768 | More detailed embeddings (16x16 patches) |
| Unicom-ViT-B-32 | 512 | Faster processing (32x32 patches) |
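The speed difference between the two patch sizes follows from how many patches the vision transformer must process. The sketch below assumes a 224×224 input, which is a common ViT resolution but is not stated in the source; `patch_count` is a hypothetical helper.

```python
def patch_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


# Assumed input resolution of 224 pixels per side
for patch_size in (16, 32):
    print(f"{patch_size}x{patch_size} patches: {patch_count(224, patch_size)}")
```

Under that assumption, the 16x16 variant processes four times as many patches as the 32x32 variant, which is consistent with it being slower but more detailed.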

### jina-clip-v1

The `jinaai/jina-clip-v1` model is a 2024 multimodal model supporting both text and image inputs:

```python
DenseModelDescription(
    model="jinaai/jina-clip-v1",
    dim=768,
    description="Image embeddings, Multimodal (text&image), 2024 year",
    license="apache-2.0",
    size_in_GB=0.34,
    sources=ModelSource(hf="jinaai/jina-clip-v1"),
    model_file="onnx/vision_model.onnx",
),
```

Source: [fastembed/image/onnx_embedding.py:20-28](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)

## Model Loading Workflow

```mermaid
sequenceDiagram
    participant User
    participant OnnxImageEmbedding
    participant OnnxImageModel
    participant ONNX Runtime
    participant HuggingFace

    User->>OnnxImageEmbedding: __init__(model_name)
    OnnxImageEmbedding->>OnnxImageModel: Load model from cache/HF
    OnnxImageModel->>HuggingFace: Download if needed
    OnnxImageModel->>ONNX Runtime: Initialize session
    ONNX Runtime-->>OnnxImageModel: Session ready
    OnnxImageModel-->>OnnxImageEmbedding: Model loaded
    User->>OnnxImageEmbedding: embed(image)
    OnnxImageEmbedding->>ONNX Runtime: Run inference
    ONNX Runtime-->>User: Embedding vector
```

## Usage Examples

### Basic Image Embedding

```python
from fastembed import ImageEmbedding

model = ImageEmbedding(model_name="Qdrant/Unicom-ViT-B-32")
embeddings = list(model.embed(["path/to/image.jpg"]))
```

### CLIP Text-Image Retrieval

For multimodal models like jina-clip-v1, you can perform cross-modal retrieval:

```python
import numpy as np
from fastembed import ImageEmbedding, TextEmbedding

image_model = ImageEmbedding(model_name="jinaai/jina-clip-v1")
text_model = TextEmbedding(model_name="jinaai/jina-clip-v1")

# Generate embeddings for both modalities
image_emb = list(image_model.embed(["image_path.jpg"]))
text_emb = list(text_model.embed(["search query"]))

# Dot product of the two vectors; this equals cosine similarity
# only when both vectors have unit length
similarity = np.dot(image_emb[0], text_emb[0])
```
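The example above uses a raw dot product. When it is not known whether the vectors are normalized, cosine similarity is the safer comparison. A minimal pure-Python sketch follows; `cosine_similarity` is a hypothetical helper written for illustration, not part of FastEmbed.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction but with different lengths
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))
```

Dividing by both norms removes the effect of vector length, so only direction contributes to the score.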

## Integration with Qdrant

Image embeddings generated by this module are designed for use with Qdrant vector database:

```python
# Creating a Qdrant collection with image embeddings
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="images",
    vectors_config={
        "image": VectorParams(
            size=768,  # For Unicom-ViT-B-16 or jina-clip-v1
            distance=Distance.COSINE
        )
    }
)
```

## Supported Model Files

Each model specifies its ONNX model file location:

| Model | Model File | Additional Files |
|-------|------------|------------------|
| ResNet50 | `model.onnx` | None |
| Unicom-ViT-B-16 | `model.onnx` | None |
| Unicom-ViT-B-32 | `model.onnx` | None |
| jina-clip-v1 | `onnx/vision_model.onnx` | `onnx/text_model.onnx` |

Source: [fastembed/image/onnx_embedding.py:10-28](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)

## Relationship with Text Embedding Module

The Image Embedding Module shares significant implementation with the Text Embedding Module:

```mermaid
graph TD
    O[OnnxModel] -->|shared base| A[OnnxTextModel]
    O -->|shared base| B[OnnxImageModel]
    A --> C[OnnxTextEmbedding]
    B --> D[OnnxImageEmbedding]
    
    E[supported_onnx_models] -->|text models| C
    F[supported_image_models] -->|image models| D
    
    G[CLIPEmbeddingWorker] -.->|reused| H[OnnxTextEmbeddingWorker]
```

This design ensures consistent ONNX inference behavior and reduces code duplication across embedding types. Source: [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)

## Performance Considerations

| Model | Size (GB) | Embedding Dim | Use Case |
|-------|-----------|---------------|----------|
| ResNet50 | 0.10 | 2048 | Fast, lightweight embeddings |
| Unicom-ViT-B-32 | 0.48 | 512 | Balanced speed/quality |
| Unicom-ViT-B-16 | 0.82 | 768 | Higher quality, slower |
| jina-clip-v1 | 0.34 | 768 | Multimodal, 2024 model |

## Configuration Options

The module inherits configuration capabilities from the base classes:

```python
# Example with full configuration
from fastembed.image.onnx_embedding import OnnxImageEmbedding

model = OnnxImageEmbedding(
    model_name="Qdrant/Unicom-ViT-B-16",
    cache_dir="~/.cache/fastembed",
    providers=["CPUExecutionProvider"],  # or "CUDAExecutionProvider"
    threads=4,
    lazy_load=True
)
```

## Summary

The Image Embedding Module provides:

- **4 supported image models** ranging from lightweight to high-quality
- **Multimodal support** via CLIP-based models for text-image retrieval
- **ONNX runtime optimization** for efficient inference
- **Qdrant integration** for vector storage and similarity search
- **Consistent API** with the text embedding module

---

<a id='page-sparse-embedding'></a>

## Sparse Embedding Models

### Related Pages

Related topics: [Text Embedding Module](#page-text-embedding), [System Architecture](#page-architecture)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [fastembed/sparse/splade_pp.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/splade_pp.py)
- [fastembed/sparse/bm25.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm25.py)
- [fastembed/sparse/bm42.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/bm42.py)
- [fastembed/sparse/minicoil.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/minicoil.py)
- [fastembed/sparse/sparse_text_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/sparse_text_embedding.py)
- [fastembed/sparse/utils/sparse_vectors_converter.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/utils/sparse_vectors_converter.py)
</details>

# Sparse Embedding Models

## Overview

Sparse embedding models in FastEmbed represent text as high-dimensional sparse vectors where most dimensions have zero values. Unlike dense embeddings where every dimension contributes to meaning, sparse embeddings only store non-zero values for specific tokens or features. This approach combines semantic understanding with exact keyword matching capabilities.

The sparse representation consists of:
- **Indices**: Token identifiers in the vocabulary
- **Values**: Weight/scores representing token importance

```python
# Example sparse embedding output
SparseEmbedding(indices=[17, 123, 919, ...], values=[0.71, 0.22, 0.39, ...])
```

**Source:** [fastembed/sparse/sparse_text_embedding.py:1-50]()
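To make the indices/values representation concrete, the sketch below scores two sparse vectors by summing the products of weights at shared indices. `sparse_dot` is a hypothetical helper for illustration; in practice a vector database such as Qdrant performs this computation.

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    weights_b = dict(zip(indices_b, values_b))
    return sum(
        value * weights_b[index]
        for index, value in zip(indices_a, values_a)
        if index in weights_b
    )


# Illustrative query and document vectors (made-up weights)
query = ([17, 919], [0.5, 2.0])
document = ([17, 123, 919], [0.71, 0.22, 0.39])
print(sparse_dot(*query, *document))
```

Only indices present in both vectors contribute, which is why sparse vectors behave like weighted keyword matching.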

## Architecture

### Class Hierarchy

```mermaid
graph TD
    A[SparseTextEmbeddingBase] --> B[MiniCOIL]
    A --> C[Bm25]
    A --> D[BM42]
    A --> E[SPLADE++]
    
    F[OnnxTextModel<br/>SparseEmbedding] --> A
    
    G[OnnxTextEmbeddingWorker] --> F
```

The sparse embedding system is built on a base class `SparseTextEmbeddingBase` that extends `OnnxTextModel[SparseEmbedding]`, providing a unified interface for all sparse embedding implementations.

**Source:** [fastembed/sparse/minicoil.py:30-50]()

### Supported Models

| Model | Type | Language | Size | License | Requires IDF |
|-------|------|----------|------|---------|--------------|
| SPLADE++ | Sparse/SPLADE | English | 0.22 GB | apache-2.0 | Yes |
| BM25 | Traditional BM25 | Multi-language | 0.01 GB | apache-2.0 | Yes |
| BM42 | Hybrid BM25+Attention | English | 0.04 GB | apache-2.0 | Yes |
| MiniCOIL | Semantic + Keyword | English | 0.09 GB | apache-2.0 | Yes |

**Source:** [fastembed/sparse/bm25.py:1-30]()

## Core Components

### SparseTextEmbeddingBase

The base class for all sparse embedding models defines the common interface and behavior:

```python
class SparseTextEmbeddingBase(OnnxTextModel[SparseEmbedding]):
    """Base class for sparse text embedding models"""
    
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        """Returns list of supported sparse models"""
        
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[SparseEmbedding]:
        """Post-process ONNX model output to sparse format"""
```

**Source:** [fastembed/sparse/sparse_text_embedding.py:1-100]()

### SparseEmbedding Data Model

The `SparseEmbedding` class represents sparse vectors with two primary attributes:

| Attribute | Type | Description |
|-----------|------|-------------|
| `indices` | `np.ndarray` (int) | Vocabulary token IDs with non-zero values |
| `values` | `np.ndarray` (float) | Corresponding importance weights for each index |

**Source:** [fastembed/sparse/sparse_text_embedding.py:100-150]()

## Implementation Details

### BM25

BM25 (Best Matching 25) is a traditional sparse embedding model that evaluates token importance based on term frequency and inverse document frequency.

**Formula:**
```
score(q, d) = SUM[ IDF(q_i) * (f(q_i, d) * (k + 1)) / (f(q_i, d) + k * (1 - b + b * (|d| / avg_len))) ]
```

| Parameter | Default | Description |
|-----------|---------|-------------|
| `k` | 1.2 | Term frequency saturation parameter |
| `b` | 0.75 | Length normalization parameter |
| `avg_len` | 256.0 | Average document length used for normalization |

**WARNING:** BM25 is expected to be used with `modifier="idf"` in the sparse vector index of Qdrant.

**Source:** [fastembed/sparse/bm25.py:30-80]()
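The term-frequency part of the formula above can be sketched as follows. The IDF factor is left out on purpose, since the warning above indicates that Qdrant applies it through `modifier="idf"`. The function name `bm25_tf_weights` and the whitespace-style token list are assumptions for illustration, and the parameter values are chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def bm25_tf_weights(tokens, k, b, avg_len):
    """Per-token BM25 weight without the IDF factor."""
    doc_len = len(tokens)
    length_norm = 1 - b + b * doc_len / avg_len
    return {
        token: tf * (k + 1) / (tf + k * length_norm)
        for token, tf in Counter(tokens).items()
    }


# The document length equals avg_len here, so the length normalization is 1
tokens = ["fast", "embed", "fast", "model"]
weights = bm25_tf_weights(tokens, k=1.0, b=0.75, avg_len=4.0)
print(weights)
```

A token that appears twice gets a weight of 4/3 rather than 2, which shows the saturation effect controlled by `k`.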

### MiniCOIL

MiniCOIL is a sparse embedding model that combines semantic meaning resolution with exact keyword matching behavior.

**Key Characteristics:**
- Converts vocabulary tokens into 4-dimensional components of sparse vectors
- Weights tokens by their frequency in the corpus
- Falls back to BM25-like behavior for out-of-vocabulary tokens

```python
class MiniCOIL(SparseTextEmbeddingBase, OnnxTextModel[SparseEmbedding]):
    """
    MiniCOIL resolves semantic meaning while keeping exact keyword match behavior.
    Each vocabulary token is converted into 4d component of a sparse vector.
    """
```

**Source:** [fastembed/sparse/minicoil.py:30-55]()

### BM42

BM42 extends traditional BM25 by incorporating attention weights from transformer models, creating a hybrid sparse representation.

**Source:** [fastembed/sparse/bm42.py:1-50]()

### SPLADE++

SPLADE++ builds on SPLADE (SParse Lexical AnD Expansion model). It uses a sparse expansion approach in which each input token can expand to related terms in the vocabulary, enabling semantic matching while keeping the representation interpretable.

**Source:** [fastembed/sparse/splade_pp.py:1-50]()

## Data Flow

```mermaid
graph LR
    A[Input Text] --> B[Tokenization]
    B --> C[ONNX Model Inference]
    C --> D[ONNX Output Context]
    D --> E[Post-Processing]
    E --> F[SparseEmbedding]
    
    G[Vocabulary] --> C
    G --> H[SparseVectorsConverter]
    H --> E
```

## SparseVectorsConverter Utility

The `SparseVectorsConverter` class handles conversion between different sparse vector formats, particularly for MiniCOIL's word embeddings.

**Key Operations:**
- Converts sentence embeddings to Qdrant sparse vector format
- Handles out-of-vocabulary (OOV) words with fallback to BM25
- Manages vocabulary word embeddings with 4-dimensional components

```python
# Example input structure
{
    "vector": WordEmbedding({
        "word": "vector",
        "forms": ["vector", "vectors"],
        "count": 2,
        "word_id": 1231,
        "embedding": [0.1, 0.2, 0.3, 0.4]
    }),
    "axiotic": WordEmbedding({  # OOV word
        "word": "axiotic",
        "forms": ["axiotics"],
        "count": 1,
        "word_id": -1,
    })
}
```

**Source:** [fastembed/sparse/utils/sparse_vectors_converter.py:50-100]()
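One plausible way to place a 4-dimensional word embedding into a sparse vector is to reserve four consecutive indices per vocabulary word. The layout below (`word_id * dim + offset`) and the helper name `component_indices` are assumptions for illustration; the converter's actual index scheme, including how OOV words are hashed, should be checked in the source file.

```python
def component_indices(word_id: int, dim: int = 4) -> list[int]:
    """Sparse indices reserved for one vocabulary word (assumed layout)."""
    return [word_id * dim + offset for offset in range(dim)]


# Using the word_id from the example input structure above
indices = component_indices(1231)
values = [0.1, 0.2, 0.3, 0.4]
print(list(zip(indices, values)))
```

With this layout, two different words never share an index, so matching remains exact at the word level while the four components carry the semantic signal.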

## Usage Examples

### Basic Usage

```python
from fastembed import SparseTextEmbedding

# Initialize with default SPLADE++ model
model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")

# Generate sparse embeddings
documents = ["Example text for embedding", "Another document"]
embeddings = list(model.embed(documents))

# Output: [SparseEmbedding(indices=[17, 123, 919, ...], values=[0.71, 0.22, 0.39, ...])]
```

### BM25 Usage

```python
from fastembed import SparseTextEmbedding

documents = ["Example text for embedding", "Another document"]

model = SparseTextEmbedding(model_name="Qdrant/bm25")
embeddings = list(model.embed(documents))
```

### With Qdrant

Sparse embeddings are designed for use with Qdrant's sparse vector index with `modifier="idf"`:

```python
from qdrant_client import QdrantClient
from fastembed import SparseTextEmbedding

client = QdrantClient()
model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")

# Embedding will have indices and values compatible with Qdrant sparse vectors
```

**Source:** [README.md:1-100]()

## Configuration Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_name` | str | "prithivida/Splade_PP_en_v1" | Name of the sparse embedding model |
| `cache_dir` | str \| None | None | Cache directory path for model files |
| `threads` | int \| None | None | Number of threads for inference |
| `providers` | Sequence[OnnxProvider] \| None | None | ONNX execution providers |
| `lazy_load` | bool | False | Whether to load model lazily |
| `device_id` | int \| None | None | Specific device ID for execution |

## Language Support

| Model | Languages |
|-------|-----------|
| SPLADE++ | English only |
| BM25 | 15+ languages |
| BM42 | English |
| MiniCOIL | English |

**Source:** [fastembed/sparse/bm25.py:1-25]()

## Requirements

All sparse embedding models require IDF (Inverse Document Frequency) weighting for optimal performance. This is typically handled by the vector database (e.g., Qdrant) during indexing and search operations.

---

<a id='page-late-interaction'></a>

## Late Interaction Models

### Related Pages

Related topics: [Image Embedding Module](#page-image-embedding), [System Architecture](#page-architecture), [Text Embedding Module](#page-text-embedding)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [fastembed/late_interaction/colbert.py](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction/colbert.py)
- [fastembed/late_interaction/jina_colbert.py](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction/jina_colbert.py)
- [fastembed/late_interaction/late_interaction_text_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction/late_interaction_text_embedding.py)
- [fastembed/late_interaction_multimodal/colpali.py](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction_multimodal/colpali.py)
- [fastembed/late_interaction_multimodal/colmodernvbert.py](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction_multimodal/colmodernvbert.py)
- [fastembed/late_interaction_multimodal/onnx_multimodal_model.py](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction_multimodal/onnx_multimodal_model.py)
</details>

# Late Interaction Models

## Overview

Late Interaction Models represent an advanced embedding paradigm that departs from traditional single-vector dense representations. Unlike conventional dense embedding models that compress entire documents into a single embedding vector, late interaction models preserve token-level embeddings and defer the similarity computation until query time. This approach enables granular token-to-token interactions between queries and documents, significantly improving retrieval precision for complex semantic matching tasks.

The FastEmbed library implements two categories of late interaction models:

| Category | Scope | Use Case |
|----------|-------|----------|
| **Text-based** | Query-document text matching | Semantic search, question answering |
| **Multimodal** | Text + image joint embedding | Visual document retrieval, image search |

## Architecture

### Late Interaction vs Traditional Dense Retrieval

```mermaid
graph TD
    subgraph "Traditional Dense Retrieval"
        A1[Query Text] --> B1[Encoder]
        D1[Document] --> E1[Encoder]
        B1 --> C1[Single Query Vector]
        E1 --> F1[Single Document Vector]
        C1 --> G1[Dot Product / Cosine Similarity]
        F1 --> G1
    end
    
    subgraph "Late Interaction Retrieval"
        A2[Query Text] --> B2[Encoder]
        D2[Document] --> E2[Encoder]
        B2 --> C2[Query Token Embeddings]
        E2 --> F2[Document Token Embeddings]
        C2 --> G2[Late Interaction Module]
        F2 --> G2
        G2 --> H2[Max-Sum Similarity]
    end
    
    style C1 fill:#ffcccc
    style F1 fill:#ffcccc
    style C2 fill:#ccffcc
    style F2 fill:#ccffcc
```

### ColBERT Late Interaction Mechanism

ColBERT employs a **max-similarity** (MaxSim) strategy: each query token independently finds its most similar document token, and the relevance score is the sum of these maximum similarities:

```mermaid
graph LR
    Q1["Query: q1 q2 q3"] --> QE[Query Encoder]
    D1["Doc: d1 d2"] --> DE[Document Encoder]
    QE --> QT["Q_emb: [v₁, v₂, v₃]"]
    DE --> DT["D_emb: [u₁, u₂]"]
    QT --> MM[Similarity Matrix]
    DT --> MM
    MM --> MS["Max Similarity<br/>sim(qᵢ) = maxⱼ S(vᵢ, uⱼ)"]
    MS --> SR["Score = Σᵢ sim(qᵢ)"]
```
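The max-similarity scoring described above can be sketched in plain Python. This is an illustrative implementation using dot products on small hand-written vectors; `maxsim_score` is a hypothetical helper, and production code would normally use NumPy or delegate scoring to the vector database.

```python
def maxsim_score(query_vectors, doc_vectors):
    """Sum, over query tokens, of the best dot product with any document token."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Three query token vectors and two document token vectors (made-up values)
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
document = [[0.5, 0.5], [1.0, 0.0]]
print(maxsim_score(query, document))
```

Each query token contributes exactly one term to the score, so longer documents do not automatically score higher.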

## Supported Models

### Text-based Late Interaction Models

| Model | Dimension | Context Length | Multilingual | License | Size |
|-------|-----------|----------------|---------------|---------|------|
| `jinaai/jina-colbert-v2` | 128 | 8192 | Yes | cc-by-nc-4.0 | 2.24 GB |

### Multimodal Late Interaction Models

| Model | Modality | Description |
|-------|----------|-------------|
| `vidore/colpali` | Text + Image | Vision-Language late interaction |
| `vidore/colqwen2` | Text + Image | Qwen2-based multimodal |
| `vidore/colmodernvbert` | Text + Image | Modern vision backbone |

## Base Classes

### LateInteractionTextEmbedding

Abstract base class for text-based late interaction models.

```python
class LateInteractionTextEmbedding(TextEmbeddingBase):
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        ...
    
    @classmethod
    def _get_worker_class(cls) -> Type[OnnxTextEmbeddingWorker]:
        ...
```

Source: [late_interaction_text_embedding.py:1-50](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction/late_interaction_text_embedding.py)

### Colbert Base Class

The `Colbert` class implements the core late interaction logic:

```python
class Colbert(LateInteractionTextEmbedding):
    QUERY_MARKER_TOKEN_ID: int = 1  # [unused0], used as the query marker
    DOCUMENT_MARKER_TOKEN_ID: int = 2  # [unused1], used as the document marker
    MIN_QUERY_LENGTH: int = 32
    MASK_TOKEN: str = "[MASK]"
```

Source: [colbert.py:20-60](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction/colbert.py)

Key methods:

| Method | Purpose |
|--------|---------|
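| `embed()` | Encode documents, yielding one matrix of token embeddings per document |
| `query_embed()` | Encode queries, yielding one matrix of token embeddings per query |
| `passage_embed()` | Encode passages for indexing; equivalent to `embed()` |

The class returns token embeddings only; the late interaction score is computed by the caller or by the vector database.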

## Jina Colbert Implementation

The `JinaColbert` class extends the base `Colbert` with model-specific configuration:

```python
class JinaColbert(Colbert):
    QUERY_MARKER_TOKEN_ID = 250002
    DOCUMENT_MARKER_TOKEN_ID = 250003
    MIN_QUERY_LENGTH = 31  # 32 minus 1 for special token
    MASK_TOKEN = "<mask>"
```

Source: [jina_colbert.py:15-19](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction/jina_colbert.py)

### Model Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| `model` | `jinaai/jina-colbert-v2` | HuggingFace model identifier |
| `dim` | 128 | Token embedding dimension |
| `size_in_GB` | 2.24 | Model size |
| `context_length` | 8192 | Maximum input tokens |
| `license` | cc-by-nc-4.0 | Model license |

## Multimodal Late Interaction

### ColPali Model

ColPali extends the late interaction paradigm to visual documents by treating images as sequences of patches:

```python
class ColPali(MultimodalTextImageBase, OnnxMultimodalModel[MultivectorEmbedding]):
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        return supported_colpali_models
```

Source: [colpali.py:1-50](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction_multimodal/colpali.py)

### ColModernVBert Model

ColModernVBert provides an alternative vision-language architecture with a modern backbone:

```python
class ColModernVBert(MultimodalTextImageBase, OnnxMultimodalModel[MultivectorEmbedding]):
    QUERY_MARKER_TOKEN_ID = 0
    DOCUMENT_MARKER_TOKEN_ID = 1
    MIN_QUERY_LENGTH = 32
    MASK_TOKEN = "<mask>"
```

Source: [colmodernvbert.py:1-50](https://github.com/qdrant/fastembed/blob/main/fastembed/late_interaction_multimodal/colmodernvbert.py)

### Architecture: Multimodal Late Interaction

```mermaid
graph TD
    subgraph "Query Processing"
        QT[Text Query] --> QE[Text Encoder]
        QV[Query Image] --> QP[Patch Extraction]
        QP --> QI[Query Image Embeddings]
        QE --> QT_emb[Query Token Embeddings]
    end
    
    subgraph "Document Processing"
        DT[Document Text] --> DE[Text Encoder]
        DV[Document Image] --> DP[Patch Extraction]
        DP --> DI[Doc Image Embeddings]
        DE --> DT_emb[Document Token Embeddings]
    end
    
    subgraph "Late Interaction"
        QT_emb --> LI[Interaction Module]
        DT_emb --> LI
        QI --> LI
        DI --> LI
        LI --> SM[Similarity Matrix]
        SM --> MS[Max-Sum Pooling]
    end
    
    MS --> SC[Relevance Score]
```

## Usage Examples

### Text-based Late Interaction

```python
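import numpy as np
from fastembed import LateInteractionTextEmbedding

# Initialize the model
model = LateInteractionTextEmbedding(
    model_name="jinaai/jina-colbert-v2"
)

# Encode query and document separately; both methods return generators
# that yield one (tokens x dim) matrix per input text
query_embedding = next(iter(model.query_embed("What is machine learning?")))
doc_embedding = next(iter(model.embed(["Machine learning is a subset of AI..."])))

# Compute the late interaction (MaxSim) score:
# best document match per query token, summed over query tokens
score = (query_embedding @ doc_embedding.T).max(axis=1).sum()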
```

### Multimodal Late Interaction

```python
from fastembed import LateInteractionMultimodalEmbedding

model = LateInteractionMultimodalEmbedding(
    model_name="Qdrant/colpali-v1.3-fp16"
)

# Encode document images and text queries separately
image_embeddings = list(model.embed_image(["path/to/page.png"]))
query_embeddings = list(model.embed_text(["Find charts about revenue"]))
```

## Configuration Options

### Common Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_name` | str | Required | Model identifier |
| `cache_dir` | str | None | Local cache directory |
| `threads` | int | None | CPU threads for inference |
| `providers` | Sequence[OnnxProvider] | None | ONNX execution providers |
| `lazy_load` | bool | False | Defer model loading until first use |
| `device_id` | int | None | Specific device index |

### ONNX Providers

| Provider | Description | Priority |
|----------|-------------|----------|
| `CPUExecutionProvider` | CPU inference | Default fallback |
| `CUDAExecutionProvider` | NVIDIA GPU | Preferred for speed |
| `CoreMLExecutionProvider` | Apple Silicon | Mobile/iOS |

## Post-processing: Muvera

Muvera is a post-processing technique that converts late interaction embeddings (multi-vector) into fixed-dimensional representations:

```python
from fastembed.postprocess import Muvera

# Convert from multi-vector to fixed-dim
muvera = Muvera.from_multivector_model(
    model=late_interaction_model,
    k_sim=6,
    dim_proj=32
)

# Process document
fde = muvera.process_document(multivector_embedding)
```

Source: [postprocess/muvera.py:1-100](https://github.com/qdrant/fastembed/blob/main/fastembed/postprocess/muvera.py)

### Muvera Configuration

| Parameter | Description | Impact |
|-----------|-------------|--------|
| `k_sim` | Log₂ of number of buckets | Memory vs precision |
| `dim_proj` | Projection dimension | Output size |
| `r_reps` | Number of repetitions | Robustness |
| `random_seed` | Random seed | Reproducibility |

Output dimension formula: `r_reps × 2^k_sim × dim_proj`
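As a worked instance of the formula, the sketch below computes the output size for the `k_sim` and `dim_proj` values used in the example above. The value `r_reps=20` is an assumed sample value, and `fde_dimension` is a hypothetical helper.

```python
def fde_dimension(k_sim: int, dim_proj: int, r_reps: int) -> int:
    """Output size: repetitions x number of buckets (2^k_sim) x projection dim."""
    return r_reps * (2 ** k_sim) * dim_proj


print(fde_dimension(k_sim=6, dim_proj=32, r_reps=20))  # 20 * 64 * 32 = 40960
```

Because the bucket count grows as `2^k_sim`, increasing `k_sim` by one doubles the output size.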

## Performance Considerations

### Token Length Limits

| Model | Query Limit | Document Limit |
|-------|-------------|----------------|
| jina-colbert-v2 | 31 tokens | 8192 tokens |
| colpali | Variable | Variable |

### Memory Usage

Late interaction models store per-token embeddings rather than single vectors:

- **Traditional dense**: N × D memory (N documents, D dimension)
- **Late interaction**: N × T × D memory (T = avg token count)

This trade-off enables better precision at the cost of increased memory footprint.
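A rough estimate makes the difference tangible. The sketch below assumes float32 storage (4 bytes per value), one million documents, and an average of 200 tokens per document; all of these are sample figures chosen for illustration, not measurements.

```python
BYTES_PER_FLOAT32 = 4


def dense_bytes(n_docs: int, dim: int) -> int:
    """Storage for one vector per document."""
    return n_docs * dim * BYTES_PER_FLOAT32


def late_interaction_bytes(n_docs: int, avg_tokens: int, dim: int) -> int:
    """Storage for one vector per token per document."""
    return n_docs * avg_tokens * dim * BYTES_PER_FLOAT32


n_docs = 1_000_000
print(f"dense (768-dim): {dense_bytes(n_docs, 768) / 1e9:.2f} GB")
print(f"late interaction (200 tokens x 128-dim): {late_interaction_bytes(n_docs, 200, 128) / 1e9:.2f} GB")
```

Even with a smaller per-token dimension, the token count multiplies storage, which is one motivation for the Muvera post-processing described earlier.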

## Comparison with Dense Retrieval

| Aspect | Dense Retrieval | Late Interaction |
|--------|-----------------|------------------|
| Embedding type | Single vector | Token-level vectors |
| Query speed | O(1) comparison | O(Q × D) interaction |
| Precision | Good for semantic similarity | Excellent for term matching |
| Memory | Lower | Higher |
| Interpretability | Limited | Token-level attribution |

## Related Components

| Component | File | Purpose |
|-----------|------|---------|
| `ColbertEmbeddingWorker` | `colbert.py` | Parallel embedding worker |
| `OnnxMultimodalModel` | `onnx_multimodal_model.py` | Base for ONNX multimodal |
| `MultivectorEmbedding` | Types | Output type for late interaction |
| `Muvera` | `muvera.py` | Dimensionality reduction post-process |

## References

- Original ColBERT paper: Khattab & Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction"
- Model source: [jinaai/jina-colbert-v2](https://huggingface.co/jinaai/jina-colbert-v2)

---

<a id='page-onnx-model'></a>

## ONNX Model Infrastructure

### Related Pages

Related topics: [System Architecture](#page-architecture), [GPU Support and Acceleration](#page-gpu-support)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [fastembed/common/onnx_model.py](https://github.com/qdrant/fastembed/blob/main/fastembed/common/onnx_model.py)
- [fastembed/common/types.py](https://github.com/qdrant/fastembed/blob/main/fastembed/common/types.py)
- [fastembed/common/preprocessor_utils.py](https://github.com/qdrant/fastembed/blob/main/fastembed/common/preprocessor_utils.py)
- [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)
- [fastembed/image/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)
- [fastembed/text/clip_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/clip_embedding.py)
- [fastembed/sparse/minicoil.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/minicoil.py)
</details>

# ONNX Model Infrastructure

## Overview

The ONNX Model Infrastructure is the core runtime layer in FastEmbed that enables efficient execution of embedding models through the ONNX (Open Neural Network Exchange) format. This infrastructure provides a unified abstraction for loading, executing, and post-processing ONNX models across different embedding modalities including text, images, and sparse embeddings.

FastEmbed uses ONNX Runtime for cross-platform compatibility and optimized inference without requiring PyTorch or TensorFlow. The architecture keeps model execution in the ONNX runtime layer and embedding-specific logic in the higher-level embedding classes. Source: [fastembed/text/onnx_embedding.py:1-50]()

## Architecture Overview

The ONNX infrastructure follows a class hierarchy pattern where base classes define the runtime contract and concrete implementations provide modality-specific behavior. The architecture is designed around the following core components:

```mermaid
graph TD
    A[ONNX Runtime] --> B[OnnxModel Base]
    B --> C[OnnxTextModel]
    B --> D[OnnxImageModel]
    B --> E[OnnxCrossEncoderModel]
    C --> F[OnnxTextEmbedding]
    C --> G[PooledEmbedding]
    C --> H[CLIPOnnxEmbedding]
    C --> I[MiniCOIL]
    D --> J[OnnxImageEmbedding]
    E --> K[OnnxTextCrossEncoder]
    
    L[TextEmbeddingBase] --> F
    M[ImageEmbeddingBase] --> J
    N[SparseTextEmbeddingBase] --> I
```

### Core Base Classes

The infrastructure defines three primary base classes that orchestrate ONNX model execution:

| Class | File | Purpose |
|-------|------|---------|
| `OnnxModel` | `fastembed/common/onnx_model.py` | Core runtime for ONNX session management |
| `OnnxTextModel` | `fastembed/text/onnx_text_model.py` | Text-specific ONNX execution |
| `OnnxImageModel` | `fastembed/image/onnx_image_model.py` | Image-specific ONNX execution |

Source: [fastembed/common/onnx_model.py:1-100]()

## ONNX Session Management

### Model Loading

The `OnnxModel` base class handles the lifecycle of ONNX model loading and execution. When a model is initialized, it performs the following operations:

1. Resolves the model file path from cache or downloads from source
2. Configures ONNX Runtime session options (threads, providers)
3. Creates an inference session with the specified execution providers
4. Validates model inputs and outputs

```python
class OnnxModel(Generic[T]):
    def __init__(
        self,
        model_dir: Path,
        model_file: str,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_id: int | None = None,
        **kwargs: Any,
    ):
        self._load_onnx_model(
            model_dir=model_dir,
            model_file=model_file,
            threads=threads,
            providers=providers,
            cuda=cuda,
            device_id=device_id,
        )
```

Source: [fastembed/common/onnx_model.py:50-80]()

### Execution Providers

The infrastructure supports multiple ONNX Runtime execution providers for hardware acceleration:

| Provider | Role | Target hardware |
|----------|------|-----------------|
| `CUDAExecutionProvider` | GPU acceleration | NVIDIA GPUs |
| `CPUExecutionProvider` | Fallback | CPU |

The `Device` enum provides automatic device selection:

```python
class Device(Enum):
    CPU = "cpu"
    CUDA = "cuda"
    AUTO = "auto"
```

Source: [fastembed/common/types.py:1-50]()
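
How a `cuda` setting might translate into an ordered provider list can be sketched as follows. This is a minimal illustration under stated assumptions, not FastEmbed's own logic: `resolve_providers` and its `available` argument are hypothetical, and the local `Device` enum only mirrors the one shown above so the snippet runs on its own.

```python
from __future__ import annotations

from enum import Enum


class Device(Enum):
    # Local copy of the enum shown above, so the sketch is self-contained.
    CPU = "cpu"
    CUDA = "cuda"
    AUTO = "auto"


def resolve_providers(cuda: bool | Device, available: list[str]) -> list[str]:
    """Hypothetical helper: map a `cuda` setting to an ordered provider list."""
    gpu, cpu = "CUDAExecutionProvider", "CPUExecutionProvider"
    if cuda is True or cuda is Device.CUDA:
        # Explicit request: ask for CUDA only and let the runtime report problems.
        return [gpu]
    if cuda is Device.AUTO and gpu in available:
        # Auto mode with a usable GPU: prefer CUDA, keep CPU as a fallback.
        return [gpu, cpu]
    # cuda=False, Device.CPU, or AUTO without a CUDA provider.
    return [cpu]
```

In real code the `available` list would typically come from `onnxruntime.get_available_providers()`.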

## Text Embedding Infrastructure

### OnnxTextEmbedding

The `OnnxTextEmbedding` class is the primary implementation for text embedding generation. It inherits from both `TextEmbeddingBase` and `OnnxTextModel`, combining the ONNX runtime with embedding-specific logic.

```python
class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[NumpyArray]):
    """Implementation of the Flag Embedding model."""
    
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        return supported_onnx_models
```

Source: [fastembed/text/onnx_embedding.py:60-85]()

#### Supported Models

The text embedding infrastructure includes a comprehensive list of supported models:

| Model | Dimension | License | Size (GB) | Token Limit |
|-------|-----------|---------|-----------|-------------|
| `BAAI/bge-base-en` | 768 | mit | 0.42 | 512 |
| `BAAI/bge-base-en-v1.5` | 768 | mit | 0.21 | 512 |
| `BAAI/bge-large-en-v1.5` | 1024 | mit | 1.20 | 512 |
| `BAAI/bge-small-en-v1.5` | 384 | mit | 0.067 | 512 |
| `snowflake/snowflake-arctic-embed-m` | 768 | apache-2.0 | 0.43 | 512 |
| `snowflake/snowflake-arctic-embed-m-long` | 768 | apache-2.0 | 0.54 | 2048 |
| `jinaai/jina-clip-v1` | 768 | apache-2.0 | 0.55 | multimodal |
| `mixedbread-ai/mxbai-embed-large-v1` | 1024 | apache-2.0 | 0.64 | 512 |

Source: [fastembed/text/onnx_embedding.py:30-150]()

### Pooled Embedding Variants

The infrastructure provides specialized pooling strategies through inheritance:

#### PooledNormalizedEmbedding

Applies mean pooling over token embeddings followed by L2 normalization:

```python
class PooledNormalizedEmbedding(PooledEmbedding):
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        embeddings = output.model_output
        attn_mask = output.attention_mask
        return normalize(self.mean_pooling(embeddings, attn_mask))
```

Supported models for pooled normalized embeddings include:
- `jinaai/jina-embeddings-v2-base-en` (768 dim, 8192 tokens)
- `jinaai/jina-embeddings-v2-small-en` (512 dim, 8192 tokens)
- `thenlper/gte-base` (768 dim, 512 tokens)
- `thenlper/gte-large` (1024 dim, 512 tokens)

Source: [fastembed/text/pooled_normalized_embedding.py:50-100]()
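
The two steps that `PooledNormalizedEmbedding` applies, masked mean pooling followed by L2 normalization, can be sketched in plain Python. The library itself works on NumPy arrays; this version uses nested lists only so that it runs without dependencies, and the function names and the `eps` guard are illustrative assumptions rather than the library's implementation.

```python
import math


def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors per sequence, skipping padded positions (mask == 0)."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        dim = len(tokens[0])
        count = max(sum(mask), 1)  # guard against an all-padding row
        sums = [0.0] * dim
        for vector, keep in zip(tokens, mask):
            if keep:
                for i, value in enumerate(vector):
                    sums[i] += value
        pooled.append([s / count for s in sums])
    return pooled


def l2_normalize(vectors, eps=1e-12):
    """Scale each vector to unit Euclidean length."""
    out = []
    for v in vectors:
        norm = max(math.sqrt(sum(x * x for x in v)), eps)
        out.append([x / norm for x in v])
    return out


# One sequence with three token slots; the last slot is padding and is ignored.
tokens = [[[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]]
mask = [[1, 1, 0]]
embedding = l2_normalize(mean_pooling(tokens, mask))[0]
```

Here the mean of the two unmasked tokens is `[1.5, 2.0]`, which has length 2.5, so the normalized embedding is `[0.6, 0.8]`.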

#### PooledEmbedding

Standard pooled embedding with mean pooling over token representations:

```python
class PooledEmbedding(OnnxTextEmbedding):
    @classmethod
    def _get_worker_class(cls) -> Type[OnnxTextEmbeddingWorker]:
        return PooledEmbeddingWorker
```

Supported multilingual and specialized models:
- `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` (384 dim, ~50 languages)
- `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` (768 dim, ~50 languages)
- `intfloat/multilingual-e5-large` (1024 dim, ~100 languages)

Source: [fastembed/text/pooled_embedding.py:50-150]()

## Image Embedding Infrastructure

### OnnxImageEmbedding

Image embedding models inherit from `OnnxImageModel` and `ImageEmbeddingBase`:

```python
class OnnxImageEmbedding(ImageEmbeddingBase, OnnxImageModel[NumpyArray]):
    def __init__(self, model_name: str, cache_dir: str | None = None, **kwargs: Any):  # other parameters omitted
        ...
```

Supported image models:

| Model | Dimension | License | Size (GB) |
|-------|-----------|---------|-----------|
| `Qdrant/resnet50-onnx` | 2048 | apache-2.0 | - |
| `Qdrant/Unicom-ViT-B-16` | 768 | apache-2.0 | 0.82 |
| `Qdrant/Unicom-ViT-B-32` | 512 | apache-2.0 | 0.48 |
| `jinaai/jina-clip-v1` | 768 | apache-2.0 | 0.34 |

Source: [fastembed/image/onnx_embedding.py:30-80]()

## Multimodal Embedding

### CLIP Embeddings

The `CLIPOnnxEmbedding` class provides the text side of CLIP-style multimodal embeddings: it encodes text into the vector space that a matching CLIP image encoder also targets.

```python
class CLIPOnnxEmbedding(OnnxTextEmbedding):
    @classmethod
    def _list_supported_models(cls) -> list[DenseModelDescription]:
        return supported_clip_models
    
    def _post_process_onnx_output(
        self, output: OnnxOutputContext, **kwargs: Any
    ) -> Iterable[NumpyArray]:
        return output.model_output
```

Currently supported CLIP model:
- `Qdrant/clip-ViT-B-32-text` (512 dim, 77 tokens)

Source: [fastembed/text/clip_embedding.py:20-50]()

### Late Interaction Multimodal

The `ColModernVbert` class implements late interaction models with image processing capabilities:

```python
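def load_onnx_model(self) -> None:
    self._load_onnx_model(...)  # arguments elided in this excerpt

    # Load image processing configuration
    processor_config_path = self._model_dir / "processor_config.json"
    with open(processor_config_path) as f:
        processor_config = json.load(f)  # requires `import json`
    self.image_seq_len = processor_config.get("image_seq_len", 64)
    # `preprocessor_config` is a second config dict; its loading is not shown in this excerpt
    self.max_image_size = preprocessor_config.get("max_image_size", {}).get("longest_edge", 512)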
```

Source: [fastembed/late_interaction_multimodal/colmodernvbert.py:50-100]()

## Sparse Embedding Infrastructure

### MiniCOIL

The `MiniCOIL` class implements sparse embedding with semantic resolution:

```python
class MiniCOIL(SparseTextEmbeddingBase, OnnxTextModel[SparseEmbedding]):
    """
    MiniCOIL is a sparse embedding model, that resolves semantic meaning of the words,
    while keeping exact keyword match behavior.
    """
```

Each vocabulary token is converted into a 4-dimensional component of a sparse vector, weighted by token frequency in the corpus. If a token is not found in the corpus, it is treated exactly like in BM25.

Supported sparse models:
- `Qdrant/minicoil-v1` (0.09 GB, requires IDF weighting)

Source: [fastembed/sparse/minicoil.py:40-80]()
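
To make the "4-dimensional component per token" idea concrete, the sketch below lays out a sparse vector in which every known token owns a block of four consecutive indices. The index layout (token id times four, plus an offset), the `SparseVector` class, and `build_sparse` are assumptions chosen for illustration; they are not taken from the MiniCOIL source.

```python
from __future__ import annotations

from dataclasses import dataclass

COMPONENT_DIM = 4  # each known token contributes a 4-dimensional block


@dataclass
class SparseVector:
    indices: list[int]
    values: list[float]


def build_sparse(token_ids, components, weights):
    """Hypothetical layout: token t occupies indices 4*t .. 4*t + 3."""
    indices, values = [], []
    for token_id, component, weight in zip(token_ids, components, weights):
        for offset, value in enumerate(component):
            indices.append(token_id * COMPONENT_DIM + offset)
            # The frequency-based weight scales the whole 4-d block.
            values.append(weight * value)
    return SparseVector(indices, values)


# One token with vocabulary id 2, a 4-d semantic component, and weight 2.0.
vector = build_sparse([2], [[1.0, 0.0, 0.5, 0.0]], [2.0])
```

Two documents share non-zero indices only when they share a token, which preserves keyword-match behavior, while the four values per block let the score also depend on the token's meaning in context.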

## Worker Architecture

The infrastructure uses a worker-based pattern for parallel embedding generation:

```mermaid
graph LR
    A[Main Thread] --> B[OnnxTextEmbeddingWorker]
    B --> C[ONNX Session]
    C --> D[Tokenization]
    D --> E[Model Inference]
    E --> F[Post-processing]
    F --> G[Normalized Embeddings]
```

### Worker Classes

| Worker Class | Parent | Purpose |
|--------------|--------|---------|
| `OnnxTextEmbeddingWorker` | Base | Standard text embedding generation |
| `PooledEmbeddingWorker` | `OnnxTextEmbeddingWorker` | Mean pooling after inference |
| `PooledNormalizedEmbeddingWorker` | `OnnxTextEmbeddingWorker` | Pooling + L2 normalization |
| `CLIPEmbeddingWorker` | `OnnxTextEmbeddingWorker` | CLIP-specific processing |

Source: [fastembed/text/onnx_text_model.py:1-50]()

## Reranking Infrastructure

### OnnxTextCrossEncoder

The cross-encoder reranking uses a specialized ONNX model class:

```python
class OnnxTextCrossEncoder(TextCrossEncoderBase, OnnxCrossEncoderModel):
    @classmethod
    def _list_supported_models(cls) -> list[BaseModelDescription]:
        return supported_onnx_models
```

Supported reranker models:

| Model | License | Size (GB) | Context |
|-------|---------|-----------|---------|
| `jinaai/jina-reranker-v1-turbo-en` | apache-2.0 | 0.15 | 1K context |
| `jinaai/jina-reranker-v2-base-multilingual` | cc-by-nc-4.0 | 1.11 | 1K context, sliding window |

Source: [fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py:30-80]()
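
A cross-encoder produces one relevance score per (query, document) pair, and the caller usually turns those scores into an ordering. The helper below is a minimal sketch of that last step, assuming the scores arrive in the same order as the input documents; `rank_documents` is a hypothetical name and the scores in the example are made up.

```python
def rank_documents(documents, scores):
    """Return (document, score) pairs ordered from most to least relevant."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


# Illustrative scores only; a real reranker would compute them from the query.
ranked = rank_documents(["doc a", "doc b", "doc c"], [0.1, 2.5, -1.0])
```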

## Model Source Configuration

### ModelSource

Models can be loaded from multiple sources:

```python
@dataclass
class ModelSource:
    hf: str | None = None           # HuggingFace Hub
    url: str | None = None         # Direct URL download
    _deprecated_tar_struct: bool = False  # Legacy tar format
```

### ModelDescription

`DenseModelDescription` holds the metadata for a dense embedding model:

```python
@dataclass
class DenseModelDescription:
    model: str
    dim: int
    description: str
    license: str
    size_in_GB: float
    sources: ModelSource
    model_file: str
    additional_files: list[str] | None = None
```

Source: [fastembed/common/model_description.py:1-80]()
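
The way a `ModelSource` could drive download attempts is sketched below. The dataclass is a simplified local copy of the one above (without the legacy tar flag), and both the validation in `__post_init__` and the `download_candidates` helper, including its "Hugging Face first, URL second" order, are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSource:
    hf: str | None = None   # Hugging Face Hub repository id
    url: str | None = None  # direct download URL

    def __post_init__(self) -> None:
        # Assumed rule: a source with no location is not usable.
        if self.hf is None and self.url is None:
            raise ValueError("at least one of hf or url must be set")


def download_candidates(source: ModelSource) -> list[tuple[str, str]]:
    """Hypothetical helper: (kind, location) pairs in the order to try them."""
    candidates = []
    if source.hf is not None:
        candidates.append(("hf", source.hf))
    if source.url is not None:
        candidates.append(("url", source.url))
    return candidates


# Placeholder locations, not real model coordinates.
source = ModelSource(hf="example-org/example-model", url="https://example.com/model.tar.gz")
```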

## Inference Workflow

```mermaid
sequenceDiagram
    participant User
    participant EmbeddingClass
    participant OnnxModel
    participant ONNXRuntime
    
    User->>EmbeddingClass: embed(texts)
    EmbeddingClass->>OnnxModel: preprocess(texts)
    OnnxModel->>OnnxModel: tokenize()
    OnnxModel->>ONNXRuntime: run(session)
    ONNXRuntime-->>OnnxModel: model_output
    OnnxModel->>EmbeddingClass: _post_process_onnx_output()
    EmbeddingClass->>EmbeddingClass: normalize/pool()
    EmbeddingClass-->>User: numpy arrays
```

## Configuration Parameters

### Common Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_name` | `str` | model-specific | Name of the model to use |
| `cache_dir` | `str \| None` | `None` | Cache directory for model files |
| `threads` | `int \| None` | `None` | Number of threads for ONNX |
| `providers` | `Sequence[OnnxProvider] \| None` | `None` | Execution providers |
| `cuda` | `bool \| Device` | `Device.AUTO` | CUDA device selection |
| `device_ids` | `list[int] \| None` | `None` | Multiple GPU device IDs |
| `lazy_load` | `bool` | `False` | Defer model loading |
| `device_id` | `int \| None` | `None` | Specific device ID |
| `specific_model_path` | `str \| None` | `None` | Custom model file path |

Source: [fastembed/text/onnx_embedding.py:85-120]()

## Lazy Loading

The infrastructure supports lazy loading for memory-efficient initialization:

```python
def __init__(
    self,
    lazy_load: bool = False,
    **kwargs: Any,  # remaining constructor parameters omitted
):
    if not lazy_load:
        self.load_onnx_model()
```

When `lazy_load=True`, the ONNX model is not loaded until the first inference call, reducing startup memory footprint.
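
The lazy-loading pattern itself can be shown with a toy class. `LazyModel`, `fake_loader`, and the "model" that returns text lengths are all hypothetical stand-ins; the only point is when the expensive load happens.

```python
class LazyModel:
    """Toy stand-in for an embedding class; `loader` replaces real session creation."""

    def __init__(self, loader, lazy_load=False):
        self._loader = loader
        self.model = None
        if not lazy_load:
            self.load_onnx_model()

    def load_onnx_model(self):
        self.model = self._loader()

    def embed(self, texts):
        if self.model is None:  # first call after lazy construction
            self.load_onnx_model()
        return [self.model(text) for text in texts]


load_calls = []


def fake_loader():
    load_calls.append("loaded")
    return len  # pretend the "model" maps a text to its length


lazy = LazyModel(fake_loader, lazy_load=True)
calls_before_embed = len(load_calls)  # nothing loaded at construction time
result = lazy.embed(["ab", "cde"])    # the load happens here, exactly once
```

The same structure explains why lazy loading suits worker processes: the parent can construct the object cheaply, and each worker pays the loading cost only when it first runs inference.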

## Type System

### Core Types

| Type | Definition | Usage |
|------|------------|-------|
| `NumpyArray` | `np.ndarray[Any, np.dtype[Any]]` | Dense embedding arrays |
| `SparseEmbedding` | Custom sparse representation | Sparse embedding vectors |
| `OnnxProvider` | Execution provider type | CPU, CUDA providers |
| `Device` | Enum | Device selection (CPU/CUDA/AUTO) |

Source: [fastembed/common/types.py:1-100]()

## Summary

The ONNX Model Infrastructure provides a robust, extensible foundation for embedding generation in FastEmbed. Key characteristics include:

- **Unified Runtime**: Single ONNX execution layer across all embedding modalities
- **Hardware Acceleration**: Support for CUDA and CPU execution providers
- **Model Flexibility**: Dynamic model loading from HuggingFace, URLs, or local cache
- **Extensible Architecture**: Clean inheritance hierarchy for adding new embedding types
- **Memory Efficiency**: Lazy loading and optimized session management
- **Cross-Modal Support**: Text, image, sparse, and multimodal embeddings

This infrastructure enables FastEmbed to deliver high-performance embedding generation without external ML framework dependencies, making it suitable for production deployments with varying hardware constraints.

---

<a id='page-gpu-support'></a>

## GPU Support and Acceleration

### Related pages

Related topics: [Installation Guide](#page-installation), [ONNX Model Infrastructure](#page-onnx-model)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [fastembed/text/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/onnx_embedding.py)
- [fastembed/image/onnx_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/image/onnx_embedding.py)
- [fastembed/text/pooled_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_embedding.py)
- [fastembed/text/pooled_normalized_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/pooled_normalized_embedding.py)
- [fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py](https://github.com/qdrant/fastembed/blob/main/fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py)
- [fastembed/sparse/splade_pp.py](https://github.com/qdrant/fastembed/blob/main/fastembed/sparse/splade_pp.py)
- [fastembed/text/text_embedding.py](https://github.com/qdrant/fastembed/blob/main/fastembed/text/text_embedding.py)
- [README.md](https://github.com/qdrant/fastembed/blob/main/README.md)
</details>

# GPU Support and Acceleration

FastEmbed provides comprehensive GPU acceleration support through ONNX Runtime's execution providers, enabling high-performance inference on NVIDIA GPUs. The library offers flexible device management with automatic detection, multi-GPU support for parallel processing, and lazy loading capabilities for efficient resource utilization.

## Architecture Overview

FastEmbed's GPU acceleration is built on top of ONNX Runtime, which provides hardware-accelerated inference for ONNX models. All embedding model classes inherit from base ONNX model classes that handle device initialization and session management.

```mermaid
graph TD
    A[User Code] --> B[TextEmbedding / ImageEmbedding / CrossEncoder]
    B --> C[ONNX Model Base Classes]
    C --> D[ONNX Runtime Session]
    D --> E{Hardware Acceleration}
    E --> F[CUDA Execution Provider]
    E --> G[CPU Execution Provider]
    E --> H[TensorRT Provider]
    
    F --> H1[NVIDIA GPU]
    G --> H2[CPU Fallback]
    
    style F fill:#4CAF50,color:#fff
    style H1 fill:#2196F3,color:#fff
```

## Supported Model Types

FastEmbed supports GPU acceleration across multiple embedding modalities and processing types.

| Model Type | Class | GPU Support | Description |
|------------|-------|-------------|-------------|
| Text Embeddings | `OnnxTextEmbedding` | ✅ | Dense text embeddings via ONNX |
| Pooled Embeddings | `PooledEmbedding` | ✅ | Pooled representation embeddings |
| Normalized Pooled | `PooledNormalizedEmbedding` | ✅ | L2-normalized pooled embeddings |
| Image Embeddings | `OnnxImageEmbedding` | ✅ | Vision model embeddings |
| Cross Encoders | `OnnxTextCrossEncoder` | ✅ | Reranking and relevance scoring |
| Sparse Embeddings | SPLADE models | ✅ | Lexical sparse embeddings |

Source: [fastembed/text/onnx_embedding.py:1-50]()

## Device Configuration

### Device Enum

The `Device` enum defines available compute devices with automatic selection capability.

```python
class Device(Enum):
    AUTO = "auto"  # Automatically select best available device
    CPU = "cpu"    # Force CPU execution
    CUDA = "cuda"  # NVIDIA GPU acceleration
```

### Initialization Parameters

All ONNX embedding classes accept the following GPU-related parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cuda` | `bool \| Device` | `Device.AUTO` | Enable CUDA or specify device type |
| `providers` | `Sequence[OnnxProvider] \| None` | `None` | ONNX Runtime providers (mutually exclusive with `cuda`) |
| `device_ids` | `list[int] \| None` | `None` | GPU device IDs for multi-GPU data parallelism |
| `device_id` | `int \| None` | `None` | Specific device ID for single-process loading |
| `lazy_load` | `bool` | `False` | Defer model loading until first use |

Source: [fastembed/text/onnx_embedding.py:47-57]()

### Constructor Signature

```python
def __init__(
    self,
    model_name: str = "BAAI/bge-small-en-v1.5",
    cache_dir: str | None = None,
    threads: int | None = None,
    providers: Sequence[OnnxProvider] | None = None,
    cuda: bool | Device = Device.AUTO,
    device_ids: list[int] | None = None,
    lazy_load: bool = False,
    device_id: int | None = None,
    specific_model_path: str | None = None,
    **kwargs: Any,
):
```

Source: [fastembed/text/onnx_embedding.py:44-66]()

## GPU Initialization Workflow

```mermaid
sequenceDiagram
    participant User
    participant Embedding as Embedding Class
    participant Base as ONNX Model Base
    participant Runtime as ONNX Runtime
    participant Device as Compute Device
    
    User->>Embedding: Initialize(cuda=True)
    Embedding->>Base: super().__init__()
    Base->>Device: Auto-detect device
    Device-->>Base: Available devices
    Base->>Runtime: Create InferenceSession
    Runtime->>Device: Load model to GPU
    Device-->>Runtime: Model loaded
    Runtime-->>Base: Session ready
    Base-->>Embedding: Return session
    Embedding-->>User: Instance ready
```

## Multi-GPU Configuration

### Data Parallel Processing

For scenarios requiring distribution across multiple GPUs, FastEmbed supports device ID specification for data-parallel workloads.

```python
from fastembed import TextEmbedding

# Initialize for multi-GPU data parallelism
embedding_model = TextEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    cuda=True,
    device_ids=[0, 1, 2, 3],  # Use 4 GPUs
    lazy_load=True  # Required for multi-GPU setup
)
```

Source: [fastembed/text/onnx_embedding.py:52-55]()

### Lazy Loading for Multi-GPU

When using multiple GPUs, `lazy_load=True` defers model loading until first inference, which is essential for avoiding resource conflicts in multi-process scenarios.

```python
from fastembed import TextEmbedding

embedding_model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    cuda=True,
    device_ids=[0, 1],
    lazy_load=True  # Load on-demand in worker processes
)
```

Source: [fastembed/text/onnx_embedding.py:54]()
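
One plausible way to spread worker processes over the listed GPUs is a round-robin assignment, sketched below. The helper `assign_devices` and the round-robin policy are assumptions for illustration; the actual scheduling inside FastEmbed may differ.

```python
def assign_devices(num_workers, device_ids):
    """Hypothetical round-robin mapping of worker index to GPU id."""
    if not device_ids:
        raise ValueError("device_ids must not be empty")
    return {worker: device_ids[worker % len(device_ids)] for worker in range(num_workers)}


# Four workers sharing two GPUs: workers 0 and 2 use GPU 0, workers 1 and 3 use GPU 1.
mapping = assign_devices(4, [0, 1])
```

Under such a scheme each worker would load its own copy of the model on its assigned device, which is where `lazy_load=True` helps: the parent process never has to hold a session of its own. Whether and how worker processes are started (for example through the `parallel` argument of `embed`) should be confirmed against the current FastEmbed API.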

## ONNX Runtime Providers

### Provider Selection

ONNX Runtime supports multiple execution providers. FastEmbed allows explicit provider specification via the `providers` parameter, which is mutually exclusive with the `cuda` parameter.

```python
from fastembed import TextEmbedding

# Using explicit provider specification
model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
```

### Provider Priority

When multiple providers are specified, ONNX Runtime attempts to use them in order of preference, falling back to subsequent providers if the preferred one is unavailable.

```mermaid
graph LR
    A[Query] --> B{CUDA Available?}
    B -->|Yes| C[CUDAExecutionProvider]
    B -->|No| D{CPU Provider Available?}
    D -->|Yes| E[CPUExecutionProvider]
    D -->|No| F[Error]
    
    C --> G[GPU Inference]
    E --> H[CPU Inference]
    
    style C fill:#4CAF50,color:#fff
    style E fill:#FF9800,color:#fff
```
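
The fallback order in the diagram amounts to filtering the requested providers against what the runtime reports as available, while keeping the caller's preference order. The sketch below assumes a hypothetical helper, `select_providers`; in real code the `available` list would typically come from `onnxruntime.get_available_providers()`.

```python
def select_providers(requested, available):
    """Keep requested providers that are available, preserving preference order."""
    usable = [name for name in requested if name in available]
    if not usable:
        raise RuntimeError(f"none of {requested} is available (have: {available})")
    return usable


# On a machine without CUDA, only the CPU provider survives the filter.
chosen = select_providers(
    ["CUDAExecutionProvider", "CPUExecutionProvider"],
    ["CPUExecutionProvider"],
)
```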

## GPU Installation

### Package Variants

FastEmbed offers separate packages for CPU and GPU operation.

| Package | Command | Use Case |
|---------|---------|----------|
| CPU (default) | `pip install fastembed` | Standard installations |
| GPU | `pip install fastembed-gpu` | NVIDIA GPU acceleration |

Source: [README.md:1-20]()

### Qdrant Integration

For vector database workflows with GPU acceleration:

```bash
pip install "qdrant-client[fastembed-gpu]"
```

```python
from fastembed import TextEmbedding

# GPU-accelerated embedding for Qdrant
model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    providers=["CUDAExecutionProvider"]
)
print("The model BAAI/bge-small-en-v1.5 is ready to use on a GPU.")
```

Source: [README.md:1-30]()

## Implementation Across Model Classes

### OnnxTextEmbedding

The primary text embedding class with full GPU support:

```python
class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[NumpyArray]):
    """Implementation of the Flag Embedding model with ONNX acceleration."""
    
    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    ):
```

Source: [fastembed/text/onnx_embedding.py:48-70]()

### OnnxImageEmbedding

Image embeddings also inherit the same GPU acceleration framework:

```python
class OnnxImageEmbedding(ImageEmbeddingBase, OnnxImageModel[NumpyArray]):
    def __init__(
        self,
        model_name: str,
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    ):
```

Source: [fastembed/image/onnx_embedding.py:1-30]()

### OnnxTextCrossEncoder

Reranking models support GPU acceleration for cross-encoder inference:

```python
class OnnxTextCrossEncoder(TextCrossEncoderBase, OnnxCrossEncoderModel):
    def __init__(
        self,
        model_name: str,
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        device_id: int | None = None,
        specific_model_path: str | None = None,
        **kwargs: Any,
    ):
```

Source: [fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py:1-50]()

## Unified TextEmbedding Entry Point

The `TextEmbedding` class provides a unified interface that automatically selects the appropriate embedding type:

```python
class TextEmbedding(TextEmbeddingBase):
    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",
        cache_dir: str | None = None,
        threads: int | None = None,
        providers: Sequence[OnnxProvider] | None = None,
        cuda: bool | Device = Device.AUTO,
        device_ids: list[int] | None = None,
        lazy_load: bool = False,
        **kwargs: Any,
    ):
        super().__init__(model_name, cache_dir, threads, **kwargs)
        # Automatically routes to appropriate embedding type
        for EMBEDDING_MODEL_TYPE in self.EMBEDDINGS_REGISTRY:
            supported_models = EMBEDDING_MODEL_TYPE._list_supported_models()
            if any(model_name.lower() == model.model.lower() 
                   for model in supported_models):
                self.model = EMBEDDING_MODEL_TYPE(
                    model_name=model_name,
                    cache_dir=cache_dir,
                    threads=threads,
                    providers=providers,
                    cuda=cuda,
                    device_ids=device_ids,
                    lazy_load=lazy_load,
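                )
                return  # stop at the first registry entry that lists the model
        # No registry entry matched: fail instead of leaving self.model unset
        raise ValueError(f"Model {model_name} is not supported in TextEmbedding.")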
```

Source: [fastembed/text/text_embedding.py:1-100]()
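
The registry lookup can be reduced to a few lines. In this sketch the stub classes, the `REGISTRY` list, and `find_embedding_class` are hypothetical; only the case-insensitive name comparison follows the excerpt above, and the two model names are taken from the supported-model lists elsewhere in this document.

```python
class DenseStub:
    """Stand-in for an embedding class with its own list of supported models."""
    supported = ["BAAI/bge-small-en-v1.5"]


class PooledStub:
    """Second stand-in, representing a pooled embedding class."""
    supported = ["intfloat/multilingual-e5-large"]


REGISTRY = [DenseStub, PooledStub]


def find_embedding_class(model_name):
    """Return the first registered class that lists the model (case-insensitive)."""
    wanted = model_name.lower()
    for candidate in REGISTRY:
        if any(wanted == name.lower() for name in candidate.supported):
            return candidate
    raise ValueError(f"Model {model_name} is not supported")


# Lookup ignores letter case, so this still resolves to DenseStub.
chosen_class = find_embedding_class("baai/BGE-small-en-v1.5")
```

Because the first match wins, the order of the registry matters whenever two classes list the same model name.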

## Supported Models with GPU Acceleration

### Text Embedding Models

| Model | Dimension | License | Size (GB) | Token Limit |
|-------|-----------|---------|-----------|-------------|
| `BAAI/bge-small-en-v1.5` | 384 | MIT | 0.067 | 512 |
| `BAAI/bge-base-en-v1.5` | 768 | MIT | 0.21 | 512 |
| `BAAI/bge-large-en-v1.5` | 1024 | MIT | 1.20 | 512 |
| `jinaai/jina-embeddings-v2-base-en` | 768 | Apache 2.0 | 0.52 | 8192 |
| `sentence-transformers/all-MiniLM-L6-v2` | 384 | Apache 2.0 | 0.09 | 256 |
| `mixedbread-ai/mxbai-embed-large-v1` | 1024 | Apache 2.0 | 0.64 | 512 |
| `nomic-ai/nomic-embed-text-v1.5` | 768 | Apache 2.0 | 0.52 | 8192 |

Source: [fastembed/text/onnx_embedding.py:1-150](), [fastembed/text/pooled_embedding.py:1-80]()

### Image Embedding Models

| Model | Dimension | License | Size (GB) |
|-------|-----------|---------|-----------|
| `Qdrant/Unicom-ViT-B-16` | 768 | Apache 2.0 | 0.82 |
| `Qdrant/Unicom-ViT-B-32` | 512 | Apache 2.0 | 0.48 |
| `jinaai/jina-clip-v1` | 768 | Apache 2.0 | 0.34 |

Source: [fastembed/image/onnx_embedding.py:1-50]()

### Reranking Models

| Model | License | Size (GB) |
|-------|---------|-----------|
| `jinaai/jina-reranker-v1-turbo-en` | Apache 2.0 | 0.15 |
| `jinaai/jina-reranker-v2-base-multilingual` | CC BY-NC 4.0 | 1.11 |

Source: [fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py:1-40]()

## Best Practices

### Device Selection

```python
from fastembed import TextEmbedding
from fastembed.common.types import Device

# Recommended: Automatic detection
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cuda=Device.AUTO)

# Explicit CUDA
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cuda=True)

# Force CPU (for debugging)
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cuda=False)
```

### Multi-GPU Inference

```python
from fastembed import TextEmbedding

# For batch processing across multiple GPUs
model = TextEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    cuda=True,
    device_ids=[0, 1],  # Parallel GPU usage
    lazy_load=True
)
```

### Provider Fallback

```python
from fastembed import TextEmbedding

# Explicit provider chain with fallback
model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    providers=[
        "CUDAExecutionProvider",  # Preferred
        "CPUExecutionProvider"    # Fallback
    ]
)
```

## Limitations and Considerations

| Aspect | Description |
|--------|-------------|
| Mutual Exclusivity | `providers` and `cuda` parameters cannot be used together |
| Device ID Scope | `device_ids` is for data parallelism; `device_id` is for single-process loading |
| Lazy Loading | Required for multi-GPU setups to avoid resource conflicts |
| Model Support | All ONNX-exported models support GPU; not all models have ONNX exports |

Source: [fastembed/text/onnx_embedding.py:47-57]()

## Summary

FastEmbed's GPU acceleration framework provides:

1. **Automatic device detection** via the `Device.AUTO` enum value
2. **Flexible provider configuration** through ONNX Runtime's provider system
3. **Multi-GPU support** with device ID lists for data-parallel workloads
4. **Lazy loading** for efficient multi-process GPU utilization
5. **Consistent API** across text, image, and reranking models
6. **Seamless fallback** to CPU when CUDA is unavailable

---


## Doramagic Pitfall Log

Project: qdrant/fastembed

Summary: Found 29 potential pitfall items; 3 are high/blocking. Highest priority: installation - Source evidence: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2.

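## 1. installation · Source evidence: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2
- User impact: May raise the cost of trial and production adoption for new users.
- Suggested check: The source issue is still open; the Pack Agent needs to re-check whether it still affects the current version.
- Guardrail action: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and the re-check status.
- Evidence: community_evidence:github | cevd_201d4035515846df8830ca0dad6960c5 | https://github.com/qdrant/fastembed/issues/618 | The source discussion mentions Python-related conditions; re-check before installation or trial.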

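## 2. installation · Source evidence: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14
- User impact: May block installation or the first run.
- Suggested check: The source issue is still open; the Pack Agent needs to re-check whether it still affects the current version.
- Guardrail action: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and the re-check status.
- Evidence: community_evidence:github | cevd_8147621574b345d7955e79ad98f4ba6f | https://github.com/qdrant/fastembed/issues/630 | The source discussion mentions Python-related conditions; re-check before installation or trial.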

## 3. security_permissions · Failure mode: security_permissions: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside ca...

- Severity: high
- Evidence strength: source_linked
- Finding: Developers should check this security_permissions risk before relying on the project: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory
- User impact: Developers may expose sensitive permissions or credentials: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory. Context: Observed when using python
- Guardrail action: Do not recommend enabling privileged or credential-bearing paths until the source-backed risk is reviewed: https://github.com/qdrant/fastembed/issues/626
- Evidence: failure_mode_cluster:github_issue | fmev_d3890c2b3360ccb937839f70fd4aa584 | https://github.com/qdrant/fastembed/issues/626 | [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory

## 4. installation · Failure mode: installation: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this installation risk before relying on the project: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.
- User impact: Developers may fail before the first successful local run: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment. Context: Observed when using python, docker
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_16e50a8626aff1576adeb1c0baab4785 | https://github.com/qdrant/fastembed/issues/466 | The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.

## 5. installation · Failure mode: installation: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this installation risk before relying on the project: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2
- User impact: Developers may fail before the first successful local run: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2. Context: Observed when using python, windows, linux
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_04529bc774f1c961d4adeb7190edecd7 | https://github.com/qdrant/fastembed/issues/618 | [Bug]: Segmentation Fault or AssertionError during initialization on Python 3.14.2

## 6. installation · Failure mode: installation: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this installation risk before relying on the project: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14
- User impact: Developers may fail before the first successful local run: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14. Context: Observed when using python, macos, cuda
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_79a43347d96beb6d05eb6bfec2503fb5 | https://github.com/qdrant/fastembed/issues/630 | [Bug]: Unable to load 'Qdrant/bm25' on macOS python3.14

## 7. installation · Failure mode: installation: v0.5.1

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this installation risk before relying on the project: v0.5.1
- User impact: Upgrade or migration may change expected behavior: v0.5.1
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.5.1. Context: Observed when using python
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_8b37c58c613005c0182d0325aaf032f7 | https://github.com/qdrant/fastembed/releases/tag/v0.5.1 | v0.5.1

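## 8. installation · Source evidence: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified installation-related issue in this project: The dependency `py-rust-stemmers` cannot be downloaded in a pure Python environment.
- User impact: May raise the cost of trial and production adoption for new users.
- Suggested check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable versions.
- Guardrail action: Do not escalate this into a definitive conclusion detached from the source link; state the applicable versions and the review status.
- Evidence: community_evidence:github | cevd_0f507b37e33e456ea259e82966cecdc5 | https://github.com/qdrant/fastembed/issues/466 | The source discussion mentions Python-related conditions; re-check before installation or trial.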

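## 9. configuration · Source evidence: [Bug]: No timeout on model download — requests.get() can hang indefinitely

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified configuration-related issue in this project: [Bug]: No timeout on model download — requests.get() can hang indefinitely
- User impact: May block installation or the first run.
- Suggested check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable versions.
- Guardrail action: Do not escalate this into a definitive conclusion detached from the source link; state the applicable versions and the review status.
- Evidence: community_evidence:github | cevd_6d570ba91cfd414f970a3a8da522be04 | https://github.com/qdrant/fastembed/issues/627 | The source discussion mentions Python-related conditions; re-check before installation or trial.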
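Until the applicable versions are confirmed, a caller can bound a potentially hanging step from the outside. The sketch below is a generic standard-library pattern, not part of FastEmbed: `run_with_deadline` is a hypothetical helper that runs any blocking callable in a daemon thread and gives up after a deadline.

```python
import threading


def run_with_deadline(fn, timeout_s, *args, **kwargs):
    """Run fn in a daemon thread; raise TimeoutError if it exceeds timeout_s."""
    result = {}

    def worker():
        try:
            result["value"] = fn(*args, **kwargs)
        except BaseException as exc:  # hand any failure back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        # The worker cannot be killed; as a daemon thread it keeps running
        # in the background but does not block interpreter exit.
        raise TimeoutError(f"call did not finish within {timeout_s} s")
    if "error" in result:
        raise result["error"]
    return result["value"]
```

A caller would wrap the first model initialization in `run_with_deadline` with a generous limit. The wrapper only stops waiting; it does not cancel the underlying download, so a retry may find a partially written cache.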

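## 10. capability · Capability assessment depends on assumptions

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: If the assumption does not hold, users will not get the promised capability.
- Suggested check: Turn the assumption into a downstream verification checklist.
- Guardrail action: Assumptions must be turned into verification items; without verification results they must not be stated as fact.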
- Evidence: capability.assumptions | github_repo:666260877 | https://github.com/qdrant/fastembed | README/documentation is current enough for a first validation pass.

## 11. runtime · Failure mode: runtime: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1
- User impact: Developers may hit a documented source-backed failure mode: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1. Context: Observed when using python, linux
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_17b849ae47ffaf5d18cabbd577f373ca | https://github.com/qdrant/fastembed/issues/603 | [Bug]: Loading models with additional files fails with onnxruntime 1.24.1
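The check suggested above can start by detecting whether the environment has the exact version named in the issue. This is a minimal sketch using `importlib.metadata`; the helper names are hypothetical, and the exact-match comparison is an assumption, since the issue title does not say whether neighboring versions are affected.

```python
from importlib import metadata

FLAGGED = "1.24.1"  # version named in the title of issue #603


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_flagged(version, flagged=FLAGGED):
    """True only when the installed version exactly matches the flagged one."""
    return version is not None and version == flagged


found = installed_version("onnxruntime")
if is_flagged(found):
    print(f"onnxruntime {found} matches the version reported in issue #603")
```

GPU installs ship under a separate distribution name (`onnxruntime-gpu`), so that name would need to be queried as well.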

## 12. runtime · Failure mode: runtime: v0.4.2

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v0.4.2
- User impact: Upgrade or migration may change expected behavior: v0.4.2
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.4.2. Context: Source discussion did not expose a precise runtime context.
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_607c8ff157108b2b5fb78f55129b60f6 | https://github.com/qdrant/fastembed/releases/tag/v0.4.2 | v0.4.2

## 13. runtime · Failure mode: runtime: v0.7.1

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v0.7.1
- User impact: Upgrade or migration may change expected behavior: v0.7.1
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.7.1. Context: Source discussion did not expose a precise runtime context.
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_07583dafa20e640a1720d7a77188475e | https://github.com/qdrant/fastembed/releases/tag/v0.7.1 | v0.7.1

## 14. runtime · Failure mode: runtime: v0.8.0

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v0.8.0
- User impact: Upgrade or migration may change expected behavior: v0.8.0
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.8.0. Context: Observed when using python, cuda
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_054bae0c9c1667cf5568de7ecb7087c9 | https://github.com/qdrant/fastembed/releases/tag/v0.8.0 | v0.8.0

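## 15. maintenance · Source evidence: [Bug]: license error in pypi metadata

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified maintenance/version-related issue in this project: [Bug]: license error in pypi metadata
- User impact: May raise the cost of trial and production adoption for new users.
- Suggested check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable versions.
- Guardrail action: Do not escalate this into a definitive conclusion detached from the source link; state the applicable versions and the review status.
- Evidence: community_evidence:github | cevd_017eda744ea84e679aeb5d77d41f7541 | https://github.com/qdrant/fastembed/issues/620 | The source discussion mentions Python-related conditions; re-check before installation or trial.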

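## 16. maintenance · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: last_activity_observed was not recorded.
- User impact: New, abandoned, and active projects get lumped together, which lowers confidence in recommendations.
- Suggested check: Add signals for recent GitHub commits, releases, and issue/PR responsiveness.
- Guardrail action: While maintenance activity is unknown, the recommendation strength must not be marked as high trust.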
- Evidence: evidence.maintainer_signals | github_repo:666260877 | https://github.com/qdrant/fastembed | last_activity_observed missing

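## 17. security_permissions · Downstream validation found risk items

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: Downstream has already requested a review; this must not be played down on the page.
- Suggested check: Route to the security/permissions governance review queue.
- Guardrail action: While a downstream risk exists, the review/recommendation downgrade must stay in place.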
- Evidence: downstream_validation.risk_items | github_repo:666260877 | https://github.com/qdrant/fastembed | no_demo; severity=medium

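## 18. security_permissions · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: The risk affects whether the project is suitable for ordinary users to install.
- Suggested check: Write the risk into the boundary card and confirm whether manual review is needed.
- Guardrail action: Scoring risks must go into the boundary card, not remain only as an internal score.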
- Evidence: risks.scoring_risks | github_repo:666260877 | https://github.com/qdrant/fastembed | no_demo; severity=medium

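## 19. security_permissions · Source evidence: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified security/permissions-related issue in this project: [Bug]: Loading models with additional files fails with onnxruntime 1.24.1
- User impact: May raise the cost of trial and production adoption for new users.
- Suggested check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable versions.
- Guardrail action: Do not escalate this into a definitive conclusion detached from the source link; state the applicable versions and the review status.
- Evidence: community_evidence:github | cevd_cfcd79475b1344a0bba508ac4346edf4 | https://github.com/qdrant/fastembed/issues/603 | The source discussion mentions Python-related conditions; re-check before installation or trial.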

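## 20. security_permissions · Source evidence: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified security/permissions-related issue in this project: [Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory
- User impact: May raise the cost of trial and production adoption for new users.
- Suggested check: The source issue is still open; the Pack Agent needs to re-check whether it still affects the current version.
- Guardrail action: Do not escalate this into a definitive conclusion detached from the source link; state the applicable versions and the review status.
- Evidence: community_evidence:github | cevd_7603e6da390349c9aff6eed6cc5072a9 | https://github.com/qdrant/fastembed/issues/626 | The source discussion mentions Python-related conditions; re-check before installation or trial.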
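For reviewers unfamiliar with this class of bug, the sketch below shows a generic guard against tar path traversal. It is not FastEmbed's `decompress_to_cache` and makes no claim about how the project does or will handle archives; `safe_members` and `safe_extract` are hypothetical helpers, and rejecting every link member is a deliberately conservative assumption.

```python
import os
import tarfile


def safe_members(tar, dest_dir):
    """Yield members that resolve inside dest_dir; raise ValueError otherwise."""
    root = os.path.realpath(dest_dir)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(root, member.name))
        # A member such as "../evil.txt" resolves outside the root directory.
        if os.path.commonpath([root, target]) != root:
            raise ValueError(f"blocked path traversal: {member.name!r}")
        # Conservative choice: refuse symlinks and hard links altogether.
        if member.issym() or member.islnk():
            raise ValueError(f"blocked link member: {member.name!r}")
        yield member


def safe_extract(archive_path, dest_dir):
    """Extract the archive only after every member has passed validation."""
    with tarfile.open(archive_path) as tar:
        members = list(safe_members(tar, dest_dir))
        tar.extractall(dest_dir, members=members)
```

Validating the full member list before extracting anything means a malicious archive leaves the destination untouched. On Python 3.12 and later, passing `filter="data"` to `extractall` provides a built-in guard of the same kind.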

## 21. capability · Failure mode: capability: [Bug]: license error in pypi metadata

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: [Bug]: license error in pypi metadata
- User impact: Developers may hit a documented source-backed failure mode: [Bug]: license error in pypi metadata
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: license error in pypi metadata. Context: Observed when using python
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_9e209e0c78e000b92930bccb67db1440 | https://github.com/qdrant/fastembed/issues/620 | [Bug]: license error in pypi metadata

## 22. runtime · Failure mode: performance: [Bug]: No timeout on model download — requests.get() can hang indefinitely

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: [Bug]: No timeout on model download — requests.get() can hang indefinitely
- User impact: Developers may hit a documented source-backed failure mode: [Bug]: No timeout on model download — requests.get() can hang indefinitely
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: [Bug]: No timeout on model download — requests.get() can hang indefinitely. Context: Observed when using python, docker
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_issue | fmev_69d0e997d6514a5d87d46c30a69795b3 | https://github.com/qdrant/fastembed/issues/627 | [Bug]: No timeout on model download — requests.get() can hang indefinitely

## 23. runtime · Failure mode: performance: v0.7.4

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: v0.7.4
- User impact: Upgrade or migration may change expected behavior: v0.7.4
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.7.4. Context: Source discussion did not expose a precise runtime context.
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_df69e175afaf53788c9f212e22680b87 | https://github.com/qdrant/fastembed/releases/tag/v0.7.4 | v0.7.4

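## 24. maintenance · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: Users cannot tell whether anyone will maintain the project when they hit a problem.
- Suggested check: Sample recent issues/PRs to judge whether they go unhandled for long periods.
- Guardrail action: While issue/PR responsiveness is unknown, a maintenance risk must be flagged.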
- Evidence: evidence.maintainer_signals | github_repo:666260877 | https://github.com/qdrant/fastembed | issue_or_pr_quality=unknown

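## 25. maintenance · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: Install commands and documentation may lag behind the code, raising the chance that users hit pitfalls.
- Suggested check: Confirm that the latest release/tag matches the README install commands.
- Guardrail action: When the release cadence is unknown or stale, the installation instructions must note possible drift.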
- Evidence: evidence.maintainer_signals | github_repo:666260877 | https://github.com/qdrant/fastembed | release_recency=unknown

## 26. maintenance · Failure mode: maintenance: v0.6.0

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v0.6.0
- User impact: Upgrade or migration may change expected behavior: v0.6.0
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.6.0. Context: Observed when using python
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_abc84a0766f60148974f2e932e8dc4f4 | https://github.com/qdrant/fastembed/releases/tag/v0.6.0 | v0.6.0

## 27. maintenance · Failure mode: maintenance: v0.6.1

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v0.6.1
- User impact: Upgrade or migration may change expected behavior: v0.6.1
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.6.1. Context: Source discussion did not expose a precise runtime context.
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_e909fbcb26efacd532e8194ede2ff4f0 | https://github.com/qdrant/fastembed/releases/tag/v0.6.1 | v0.6.1

## 28. maintenance · Failure mode: maintenance: v0.7.0

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v0.7.0
- User impact: Upgrade or migration may change expected behavior: v0.7.0
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.7.0. Context: Source discussion did not expose a precise runtime context.
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_0ccd78cc792f8d6a79212fc6cfa512e4 | https://github.com/qdrant/fastembed/releases/tag/v0.7.0 | v0.7.0

## 29. maintenance · Failure mode: maintenance: v0.7.2

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v0.7.2
- User impact: Upgrade or migration may change expected behavior: v0.7.2
- Suggested check: Before packaging this project, run the relevant install/config/quickstart check for: v0.7.2. Context: Source discussion did not expose a precise runtime context.
- Guardrail action: State this as source-backed community evidence, not as Doramagic reproduction.
- Evidence: failure_mode_cluster:github_release | fmev_28a8ed6e38ab5a351c6a64532998a7df | https://github.com/qdrant/fastembed/releases/tag/v0.7.2 | v0.7.2

<!-- canonical_name: qdrant/fastembed; human_manual_source: deepwiki_human_wiki -->