# txtai Project Documentation (https://github.com/neuml/txtai)

Generated: 2026-05-16 22:47:03 UTC

## Table of Contents

- [Introduction to txtai](#introduction)
- [Getting Started](#getting-started)
- [System Architecture](#architecture)
- [Embeddings Database](#embeddings-core)
- [Pipelines](#pipelines)
- [Workflows](#workflows)
- [Agents](#agents)
- [Database Integration](#database-integration)
- [Graph Networks](#graph-networks)
- [Scoring and Retrieval Algorithms](#scoring-algorithms)

<a id='introduction'></a>

## Introduction to txtai

### Related Pages

Related topics: [System Architecture](#architecture), [Getting Started](#getting-started)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/neuml/txtai/blob/main/README.md)
- [setup.py](https://github.com/neuml/txtai/blob/main/setup.py)
- [src/python/txtai/pipeline/data/textractor.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/data/textractor.py)
- [src/python/txtai/pipeline/llm/llm.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/llm/llm.py)
- [src/python/txtai/workflow/task/template.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/workflow/task/template.py)
- [examples/rag_quickstart.py](https://github.com/neuml/txtai/blob/main/examples/rag_quickstart.py)
- [src/python/txtai/agent/tool/read.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/tool/read.py)
</details>

# Introduction to txtai

txtai is an **all-in-one AI framework** for semantic search, large language model (LLM) orchestration, and language model workflows. It provides a unified platform that combines vector search capabilities with traditional database features, enabling developers to build sophisticated AI-powered applications without managing multiple disparate systems.

## Overview

The central innovation of txtai is its **embeddings database** - a hybrid data store that unifies:

- **Sparse vector indexes** for keyword-based search
- **Dense vector indexes** for semantic similarity
- **Graph networks** for relationship modeling
- **Relational databases** for structured data storage

This architecture enables txtai to function both as a standalone vector search engine and as a powerful knowledge source for RAG (Retrieval Augmented Generation) applications. Source: [README.md:1-20]()
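
The payoff of unifying sparse and dense indexes is hybrid retrieval: a keyword score and a semantic score can be blended into a single ranking. A minimal illustrative sketch of the idea (not txtai's actual implementation, which handles this internally via configuration):

```python
# Illustrative sketch only: blend a sparse (keyword) score and a dense
# (semantic) score into one hybrid ranking with a convex weight.

def hybrid_score(sparse, dense, weight=0.5):
    # weight=1.0 is pure semantic search, weight=0.0 is pure keyword search
    return weight * dense + (1 - weight) * sparse

scores = [
    ("doc0", hybrid_score(sparse=0.9, dense=0.2)),  # strong keyword match
    ("doc1", hybrid_score(sparse=0.1, dense=0.8)),  # strong semantic match
]
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])  # doc0 edges out doc1 at weight=0.5
```

Tuning the weight shifts results between exact keyword matching and semantic similarity without changing the query.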

## Architecture

The txtai architecture follows a layered design that separates concerns while maintaining tight integration between components.

```mermaid
graph TD
    subgraph "txtai Architecture"
        A[Application Layer] --> B[Workflow Engine]
        B --> C[Pipeline System]
        C --> D[Embeddings Database]
        D --> E[(Vector Index)]
        D --> F[(Graph Network)]
        D --> G[(Relational DB)]
    end
    
    subgraph "Pipelines"
        H[LLM Pipeline] 
        I[Data Pipeline]
        J[Text Pipeline]
        K[Audio Pipeline]
    end
    
    C --> H
    C --> I
    C --> J
    C --> K
```

### Core Components

| Component | Purpose | Key Features |
|-----------|---------|--------------|
| **Embeddings** | Vector database | Content storage, semantic search, hybrid queries |
| **Workflows** | Task orchestration | Parallel/sequential execution, function routing |
| **Pipelines** | Model execution | LLM, data processing, text operations |
| **Agents** | Autonomous execution | Tool-based reasoning, MCP integration |

## Key Features

### Vector Search Capabilities

txtai provides comprehensive vector search functionality including:

- Semantic/similarity/vector/neural search (Source: [README.md:45-50]())
- SQL integration for hybrid queries
- Object storage support
- Topic modeling
- Graph analysis
- Multimodal indexing (text, audio, images, video)

### Pipeline System

The pipeline system enables flexible model execution:

| Pipeline Type | Capabilities |
|---------------|--------------|
| **LLM** | Text generation, chat, vision support (Source: [src/python/txtai/pipeline/llm/llm.py:1-50]()) |
| **Data** | Text extraction, document processing, chunking (Source: [src/python/txtai/pipeline/data/textractor.py:1-50]()) |
| **Text** | NER, transcription, translation |
| **Audio** | Speech recognition, text-to-speech |
| **Train** | Model fine-tuning, ONNX conversion |

### Workflow Engine

Workflows orchestrate complex AI tasks using a template-based system:

```mermaid
graph LR
    A[Input] --> B[Template Task]
    B --> C[LLM Pipeline]
    C --> D[Output]
    
    B -.->|config| E[Template Rules]
    E -->|format| B
```

The template system supports named parameters, positional arguments, and strict/conservative parsing modes. Source: [src/python/txtai/workflow/task/template.py:1-40]()
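
The strict/conservative distinction can be illustrated with Python's own `string.Formatter` (a sketch of the behavior only; txtai implements its own `TemplateFormatter`):

```python
from string import Formatter

class ConservativeFormatter(Formatter):
    """Leave unknown placeholders intact instead of raising KeyError."""

    def get_value(self, key, args, kwargs):
        if isinstance(key, str) and key not in kwargs:
            return "{%s}" % key
        return super().get_value(key, args, kwargs)

template = "Context: {context} Question: {question}"

# Strict mode: a missing parameter is an error
try:
    template.format(question="What is txtai?")
except KeyError:
    print("strict mode rejected the missing {context}")

# Conservative mode: unknown placeholders pass through unchanged
print(ConservativeFormatter().format(template, question="What is txtai?"))
# Context: {context} Question: What is txtai?
```

Conservative parsing is useful when a template is filled in multiple passes, with later tasks supplying the remaining parameters.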

## Installation

### Core Dependencies

```python
# Default installation includes
default = [
    "faiss-cpu>=1.7.1.post2",
    "huggingface-hub>=0.34.0",
    "msgpack>=1.0.7",
    "numpy>=1.18.4",
    "regex>=2022.8.17",
    "pyyaml>=5.3",
    "safetensors>=0.4.5",
    "torch>=2.4",
    "transformers>=4.56.2",
]
```

Source: [setup.py:18-30]()

### Optional Installations

| Extra | Purpose | Command |
|-------|---------|---------|
| `pipeline` | All pipeline dependencies | `pip install txtai[pipeline]` |
| `vectors` | Embedding models | `pip install txtai[vectors]` |
| `api` | REST API server | `pip install txtai[api]` |
| `agent` | Autonomous agents | `pip install txtai[agent]` |
| `all` | Complete installation | `pip install txtai[all]` |

Source: [setup.py:55-150]()

## Basic Usage

### Semantic Search

```python
import txtai

embeddings = txtai.Embeddings()
embeddings.index(["Correct", "Not what we hoped"])
result = embeddings.search("positive", 1)
# Returns: [(0, 0.29862046241760254)]
```

Source: [README.md:35-42]()

### Text Extraction

```python
from txtai.pipeline import Textractor

textractor = Textractor(backend="docling", sections=True)
for chunk in textractor("document.pdf"):
    process(chunk)
```

Source: [examples/rag_quickstart.py:1-50]()

### RAG Pipeline

```python
from txtai import Embeddings, RAG
from txtai.pipeline import Textractor

# Build embeddings database
embeddings = Embeddings(content=True, path="Qwen/Qwen3-Embedding-0.6B")
embeddings.index(chunks)

# Create RAG pipeline
template = """
  Answer the following question using the provided context.
  Question: {question}
  Context: {context}
"""

rag = RAG(
    embeddings,
    "Qwen/Qwen3-0.6B",
    system="You are a friendly assistant",
    template=template,
)

result = rag("Summarize the main advancements made by BERT")
```

Source: [examples/rag_quickstart.py:50-80]()

## Use Cases

### Semantic Search

Traditional keyword-based search systems match exact terms. Semantic search understands natural language and identifies results with similar meaning, regardless of keyword overlap. Source: [README.md:45-55]()

### RAG Applications

The embeddings database serves as a knowledge source for LLM applications:

```mermaid
graph TD
    A[User Query] --> B[Embeddings Search]
    B --> C[Retrieve Context]
    C --> D[LLM Generation]
    D --> E[Response]
    
    F[(Embeddings DB)] -.->|retrieval| C
```

### Autonomous Agents

Agents utilize tools to perform complex tasks:

```python
from txtai.agent.tool.read import ReadTool

tool = ReadTool(maxlength=40000)
content = tool.forward("path/to/file.txt")
```

Source: [src/python/txtai/agent/tool/read.py:1-60]()

## API Server

txtai includes a built-in REST API for cross-language compatibility:

```yaml
# app.yml
embeddings:
    path: sentence-transformers/all-MiniLM-L6-v2
```

```bash
CONFIG=app.yml uvicorn "txtai.api:app"
curl -X GET "http://localhost:8000/search?query=positive"
```

Source: [README.md:50-60]()

## Design Principles

1. **Low footprint** - Install additional dependencies and scale up when needed (Source: [README.md:60-70]())
2. **Local execution** - No need to ship data to remote services
3. **Model flexibility** - Work with micromodels up to large language models
4. **Modular design** - Use individual components or the full stack

## Requirements

- Python 3.10+
- PyTorch 2.4+
- Transformers 4.56.2+

The framework supports Hugging Face models, llama.cpp, Ollama, vLLM, and more for embeddings and LLM operations. Source: [setup.py:18-30]()

---

<a id='getting-started'></a>

## Getting Started

### Related Pages

Related topics: [Introduction to txtai](#introduction)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [setup.py](https://github.com/neuml/txtai/blob/main/setup.py)
- [README.md](https://github.com/neuml/txtai/blob/main/README.md)
- [examples/rag_quickstart.py](https://github.com/neuml/txtai/blob/main/examples/rag_quickstart.py)
- [examples/wiki.py](https://github.com/neuml/txtai/blob/main/examples/wiki.py)
- [examples/workflows.py](https://github.com/neuml/txtai/blob/main/examples/workflows.py)
- [src/python/txtai/pipeline/llm/llm.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/llm/llm.py)
</details>

# Getting Started

txtai is an all-in-one AI framework for semantic search, LLM orchestration and language model workflows. The Getting Started guide provides the essential knowledge needed to install, configure, and begin using txtai effectively.

## Overview

txtai provides a unified interface for building AI applications with:

- **Vector search** with SQL, object storage, topic modeling, graph analysis and multimodal indexing
- **Embeddings** for text, documents, audio, images and video
- **Pipelines** powered by language models for prompts, question-answering, labeling, transcription, and translation
- **Workflows** that orchestrate multi-step processes
- **Agents** for autonomous decision-making

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)

## Installation

### Requirements

| Requirement | Minimum Version | Notes |
|-------------|-----------------|-------|
| Python | 3.10+ | Required runtime |
| PyTorch | 2.4+ | Core deep learning framework |
| Transformers | 4.56.2+ | Model loading and inference |
| NumPy | 1.18.4+ | Numerical operations |

Source: [setup.py:27-37](https://github.com/neuml/txtai/blob/main/setup.py)

### Installation Methods

#### Standard Installation

Standard installation includes all default dependencies:

```bash
pip install txtai
```

This installs the core package with default dependencies including FAISS, HuggingFace Hub, msgpack, numpy, regex, PyYAML, safetensors, PyTorch, and transformers.

#### Minimal Installation

For environments with limited resources, install a minimal version:

```bash
MINIMAL=1 pip install txtai
```

Source: [setup.py:18-25](https://github.com/neuml/txtai/blob/main/setup.py)

### Optional Extras

txtai provides modular extras to scale functionality based on needs:

| Extra | Purpose | Key Dependencies |
|-------|---------|------------------|
| `api` | REST API hosting | fastapi, uvicorn, aiohttp |
| `pipeline-audio` | Audio processing | scipy, soundfile, webrtcvad |
| `pipeline-data` | Data extraction | beautifulsoup4, docling, tika |
| `pipeline-image` | Image processing | pillow, timm, imagehash |
| `pipeline-llm` | LLM inference | litellm, llama-cpp-python |
| `pipeline-text` | Text processing | gliner, sentencepiece |
| `pipeline-train` | Model training | accelerate, peft, onnx |
| `vectors` | Vector embeddings | sentence-transformers |
| `ann` | ANN indexes | faiss, hnswlib, annoy |
| `workflow` | Workflow tasks | requests, pandas, openpyxl |
| `agent` | Agent framework | smolagents, mcpadapt |
| `all` | Everything | All above extras |

Source: [setup.py:38-98](https://github.com/neuml/txtai/blob/main/setup.py)

Install with extras:

```bash
pip install txtai[api,workflow]
pip install txtai[all]
```

## Core Concepts

### Embeddings Database

The central component of txtai is the **embeddings database**, which combines:

- Vector indexes (sparse and dense)
- Graph networks
- Relational databases

This foundation enables vector search and serves as a knowledge source for LLM applications.

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)

```mermaid
graph TD
    A[Embeddings Database] --> B[Vector Indexes]
    A --> C[Graph Networks]
    A --> D[Relational Database]
    B --> E[Dense Vectors]
    B --> F[Sparse Vectors]
```

### Pipeline Architecture

Pipelines are language model powered components that perform specific tasks:

```mermaid
graph LR
    A[Input Data] --> B[Pipeline]
    B --> C[LLM Pipeline]
    B --> D[Data Pipeline]
    B --> E[Text Pipeline]
    C --> F[Generated Content]
    D --> G[Extracted Data]
    E --> H[Processed Text]
```

Key pipeline types:

| Pipeline | Function |
|----------|----------|
| `Summary` | Text summarization |
| `Textractor` | Text extraction from files |
| `RAG` | Retrieval augmented generation |
| `Transcription` | Audio to text |
| `Translation` | Language translation |

Source: [src/python/txtai/pipeline/llm/llm.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/llm/llm.py)
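
All of these pipelines share the same convention: a pipeline is a callable object that accepts either a single item or a batch. A toy sketch of that contract (real pipelines wrap language models; the class below is illustrative only):

```python
class UppercasePipeline:
    """Toy stand-in for a txtai pipeline: callable on one item or a batch."""

    def __call__(self, inputs):
        # A single string is processed directly; lists are mapped item by item
        if isinstance(inputs, str):
            return inputs.upper()
        return [self(item) for item in inputs]

pipeline = UppercasePipeline()
print(pipeline("hello"))             # HELLO
print(pipeline(["hello", "world"]))  # ['HELLO', 'WORLD']
```

This uniform interface is what lets workflows chain arbitrary pipelines together: each task only needs to know its input is callable-compatible.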

### Workflow System

Workflows orchestrate multi-step processes by connecting tasks:

```mermaid
graph TD
    A[Workflow Input] --> B[Task 1]
    B --> C[Task 2]
    C --> D[Task N]
    D --> E[Workflow Output]
    
    F[Template Task] -->|formats| B
    G[URL Task] -->|fetches| B
```

Source: [examples/workflows.py](https://github.com/neuml/txtai/blob/main/examples/workflows.py)

## Quick Start

### Basic Semantic Search

The simplest way to get started with txtai is semantic search:

```python
import txtai

embeddings = txtai.Embeddings()
embeddings.index(["Correct", "Not what we hoped"])
embeddings.search("positive", 1)
# Returns: [(0, 0.29862046241760254)]
```

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)

### Building a RAG Application

Retrieval Augmented Generation combines embeddings search with LLM inference:

```python
from txtai import Embeddings, RAG
from txtai.pipeline import Textractor

# Step 1: Extract text from documents
textractor = Textractor()
chunks = []
for f in ["document1.pdf", "document2.pdf"]:
    for chunk in textractor(f):
        chunks.append((f, chunk))

# Step 2: Build embeddings database
embeddings = Embeddings(content=True, path="Qwen/Qwen3-Embedding-0.6B", maxlength=2048)
embeddings.index(chunks)

# Step 3: Create RAG pipeline
template = """
  Answer the following question using the provided context.

  Question:
  {question}

  Context:
  {context}
"""

rag = RAG(
    embeddings,
    "Qwen/Qwen3-0.6B",
    system="You are a friendly assistant",
    template=template,
    output="flatten",
)

# Step 4: Query
question = "Summarize the main advancements made by BERT"
print(rag(question, maxlength=2048, stripthink=True))
```

Source: [examples/rag_quickstart.py](https://github.com/neuml/txtai/blob/main/examples/rag_quickstart.py)

### Wikipedia Summarization

Example combining external API with summarization:

```python
import requests
from txtai.pipeline import Summary

class Application:
    SEARCH_TEMPLATE = "https://en.wikipedia.org/w/api.php?action=opensearch&search=%s&limit=1&namespace=0&format=json"
    CONTENT_TEMPLATE = "https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=%s"

    def __init__(self):
        self.summary = Summary("sshleifer/distilbart-cnn-12-6")

    def query(self, query):
        # Search Wikipedia
        data = requests.get(self.SEARCH_TEMPLATE % query).json()
        if data and data[1]:
            page = data[1][0]
            content = requests.get(self.CONTENT_TEMPLATE % page).json()
            content = list(content["query"]["pages"].values())[0]["extract"]
            return self.summary(content)
        return None
```

Source: [examples/wiki.py](https://github.com/neuml/txtai/blob/main/examples/wiki.py)

## Configuration

### Embeddings Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | str | None | Model path (HuggingFace, llama.cpp, Ollama, vLLM) |
| `content` | bool | False | Enable content storage |
| `maxlength` | int | None | Maximum sequence length |

Source: [examples/rag_quickstart.py](https://github.com/neuml/txtai/blob/main/examples/rag_quickstart.py)

### LLM Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `maxlength` | int | None | Maximum sequence length |
| `stream` | bool | False | Enable streaming response |
| `stop` | list | None | List of stop strings |
| `stripthink` | bool | varies | Strip thinking tags from output |
| `defaultrole` | str | "auto" | Default role for text inputs |

Source: [src/python/txtai/pipeline/llm/llm.py:18-34](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/llm/llm.py)
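
For example, `stripthink` removes the reasoning block that "thinking" models emit before the final answer. A simplified sketch of what that option does (the exact tags and edge cases txtai handles may differ):

```python
import re

def strip_think(text):
    # Drop <think>...</think> blocks, including trailing whitespace
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

raw = "<think>First, recall what BERT is...</think>BERT introduced bidirectional pretraining."
print(strip_think(raw))  # BERT introduced bidirectional pretraining.
```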

### RAG Configuration

| Parameter | Type | Description |
|-----------|------|-------------|
| `embeddings` | Embeddings | Embeddings database instance |
| `path` | str | LLM model path |
| `system` | str | System prompt for the LLM |
| `template` | str | Prompt template with `{question}` and `{context}` |
| `output` | str | Output format ("flatten", etc.) |

## API Service

txtai includes a built-in API for language-agnostic applications:

```yaml
# app.yml
embeddings:
    path: sentence-transformers/all-MiniLM-L6-v2
```

```bash
CONFIG=app.yml uvicorn "txtai.api:app"
curl -X GET "http://localhost:8000/search?query=positive"
```

Benefits of the API service:
- Work with your programming language of choice
- Run local - no need to ship data to remote services
- Supports models from micromodels to large language models
- Low footprint - scale up when needed

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)
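
Because the API is plain HTTP, a client in any language only needs to build a URL like the one in the `curl` example. A Python sketch of constructing that request (the `limit` parameter is an assumption about the search endpoint; verify against the API docs):

```python
from urllib.parse import urlencode

# Build the same request the curl example sends
base = "http://localhost:8000/search"
params = {"query": "positive", "limit": 1}
url = f"{base}?{urlencode(params)}"
print(url)  # http://localhost:8000/search?query=positive&limit=1
```

`urlencode` handles percent-escaping, so queries with spaces or punctuation stay valid.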

## Next Steps

After completing the Getting Started guide, explore:

1. **Example Notebooks** - Over 70 example notebooks covering all functionality
2. **Workflows** - Build complex multi-step pipelines
3. **Agents** - Create autonomous AI agents with tool use
4. **Graph Analysis** - Network and relationship analysis
5. **Model Training** - Fine-tune models for your use case

All examples are available at: https://neuml.github.io/txtai/examples

---

<a id='architecture'></a>

## System Architecture

### Related Pages

Related topics: [Embeddings Database](#embeddings-core), [Database Integration](#database-integration)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/neuml/txtai/blob/main/README.md)
- [setup.py](https://github.com/neuml/txtai/blob/main/setup.py)
- [src/python/txtai/pipeline/data/textractor.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/data/textractor.py)
- [src/python/txtai/pipeline/llm/llm.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/llm/llm.py)
- [src/python/txtai/agent/tool/function.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/tool/function.py)
- [src/python/txtai/agent/tool/factory.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/tool/factory.py)
- [src/python/txtai/workflow/task/template.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/workflow/task/template.py)
- [src/python/txtai/cloud/hub.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/cloud/hub.py)
</details>

# System Architecture

txtai is an **all-in-one AI framework** designed for semantic search, LLM orchestration, and language model workflows. The architecture is built around a modular, extensible design that allows developers to scale from micromodels to large language models with minimal footprint.

## Architecture Overview

The txtai architecture consists of several interconnected layers that work together to provide comprehensive AI capabilities.

```mermaid
graph TB
    subgraph "Client Layer"
        API["REST API"]
        Console["Console"]
        Library["Python Library"]
    end
    
    subgraph "Core Engine"
        Embeddings["Embeddings Database"]
        Workflow["Workflow Engine"]
        Agent["Agent System"]
    end
    
    subgraph "Pipeline Layer"
        LLMPipeline["LLM Pipeline"]
        DataPipeline["Data Pipeline"]
        TextPipeline["Text Pipeline"]
        TrainPipeline["Train Pipeline"]
        AudioPipeline["Audio Pipeline"]
        ImagePipeline["Image Pipeline"]
    end
    
    subgraph "Index Layer"
        ANN["ANN Index"]
        Documents["Document Store"]
        Graph["Graph Network"]
        Database["Relational DB"]
    end
    
    API --> Embeddings
    API --> Workflow
    API --> Agent
    Embeddings --> ANN
    Embeddings --> Documents
    Embeddings --> Graph
    Embeddings --> Database
```

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)

## Core Components

### Embeddings Database

The **embeddings database** is the foundational component of txtai. It serves as a union of multiple index types:

| Index Type | Purpose | Description |
|------------|---------|-------------|
| Vector Index (Sparse) | Sparse embeddings | Traditional BM25-style scoring |
| Vector Index (Dense) | Dense embeddings | Neural network-based semantic vectors |
| Graph Networks | Relationship mapping | NetworkX-based graph structures |
| Relational Databases | Structured data | SQLAlchemy-based data storage |

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)

The embeddings database enables vector search capabilities and serves as a powerful knowledge source for large language model (LLM) applications.

### Pipeline System

Pipelines are the processing units of txtai, each designed for specific tasks:

```mermaid
graph LR
    Input["Input Data"] --> Pipeline["Pipeline"]
    
    subgraph "Pipeline Types"
        Audio["Audio Pipeline"]
        Data["Data Pipeline"]
        Image["Image Pipeline"]
        LLM["LLM Pipeline"]
        Text["Text Pipeline"]
        Train["Train Pipeline"]
    end
    
    Pipeline --> Output["Output Data"]
```

#### Data Pipeline

The Data Pipeline handles text extraction and processing:

| Component | Function |
|-----------|----------|
| `Textractor` | Extracts text from files and URLs |
| `FileToHTML` | Converts files to HTML format |
| `HTMLToMarkdown` | Transforms HTML to Markdown |
| `Segmentation` | Handles text segmentation (sentences, lines, paragraphs) |

The `Textractor` class supports multiple backends for file extraction:

```python
def __init__(
    self,
    sentences=False,
    lines=False,
    paragraphs=False,
    minlength=None,
    join=False,
    sections=False,
    cleantext=True,
    chunker=None,
    headers=None,
    backend="available",
    safeopen=False,
    **kwargs,
):
```

Source: [src/python/txtai/pipeline/data/textractor.py:19-37](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/data/textractor.py)
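
The paragraph mode in `Segmentation` boils down to splitting text on blank lines and filtering short fragments. A simplified sketch of that idea (illustrative; the real implementation handles many more cases):

```python
def paragraphs(text, minlength=10):
    # Split on blank lines, trim whitespace, drop fragments under minlength
    chunks = (chunk.strip() for chunk in text.split("\n\n"))
    return [chunk for chunk in chunks if len(chunk) >= minlength]

document = "First paragraph of text.\n\nok\n\nSecond paragraph of text."
print(paragraphs(document))
# ['First paragraph of text.', 'Second paragraph of text.']
```

The `minlength` filter is what keeps page numbers and stray headers out of the index.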

#### LLM Pipeline

The LLM Pipeline provides LLM generation capabilities:

```python
def __call__(
    self,
    text,
    maxlength=None,
    stream=False,
    stop=None,
    defaultrole="auto",
    stripthink=None,
    **kwargs,
):
```

Key features:
- Supports streaming responses
- Configurable stop sequences
- Built-in thinking tag stripping
- Chat and vision model support

Source: [src/python/txtai/pipeline/llm/llm.py:1-50](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/llm/llm.py)
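
With `stream=True`, the pipeline yields tokens as they are generated instead of returning the full string. Consuming a stream looks like iterating a generator; the sketch below substitutes a toy generator for the model:

```python
def generate(prompt):
    """Toy token stream standing in for a streaming LLM response."""
    for token in ["Hello", ", ", "world", "!"]:
        yield token

# Tokens arrive incrementally; join them to reconstruct the full response
response = "".join(generate("Say hello"))
print(response)  # Hello, world!
```

Streaming matters for interactive applications, where first-token latency is far more visible than total generation time.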

### Agent System

The Agent System enables autonomous AI agents with tool use capabilities:

```mermaid
graph TD
    User["User Input"] --> Agent["Agent"]
    Agent --> Tools["Tool Collection"]
    Tools --> EmbeddingsTool["Embeddings Tool"]
    Tools --> FunctionTool["Function Tool"]
    Tools --> DefaultTools["Default Tools"]
    
    subgraph "Tool Factory"
        create["create()"]
        createtool["createtool()"]
    end
    
    Tools --> Result["Result"]
    Result --> Agent
```

#### FunctionTool

A `FunctionTool` pairs a target function with descriptive configuration (name, description, inputs) so an LLM can invoke it as a tool:

```python
class FunctionTool(Tool):
    def __init__(self, config):
        self.name = config["name"]
        self.description = config["description"]
        self.inputs = config["inputs"]
        self.output_type = config.get("output", config.get("output_type", "any"))
        self.target = config["target"]
```

Source: [src/python/txtai/agent/tool/function.py:1-50](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/tool/function.py)

#### Tool Factory

The `ToolFactory` creates tools from various sources:

| Input Type | Creation Method |
|------------|------------------|
| Tool instance | Direct pass-through |
| Function/Method | Auto-wrapped with `createtool()` |
| Dictionary | Config-based creation |
| String alias | Lookup in `DEFAULTS` registry |
| HTTP URL | MCP tool collection import |

Source: [src/python/txtai/agent/tool/factory.py:1-100](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/tool/factory.py)
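
The dispatch in the table above can be sketched as a type-check cascade (illustrative only; the `resolve` function and its return labels are hypothetical, not the factory's real API):

```python
def resolve(definition, defaults=None):
    # Decide how a tool definition should be created, mirroring the table
    defaults = defaults or {}
    if callable(definition):
        return "function"              # wrap with createtool()
    if isinstance(definition, dict):
        return "config"                # config-based creation
    if isinstance(definition, str):
        if definition.startswith("http"):
            return "mcp"               # MCP tool collection import
        return "alias" if definition in defaults else "unknown"
    return "tool"                      # assume a Tool instance passed through

print(resolve(len))                            # function
print(resolve({"name": "read"}))               # config
print(resolve("http://host/mcp"))              # mcp
print(resolve("websearch", {"websearch": 1}))  # alias
```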

### Workflow Engine

The Workflow Engine processes tasks through a series of steps:

```mermaid
graph LR
    Task["Task Input"] --> Template["Template Task"]
    Template --> Process["Process"]
    Process --> Output["Task Output"]
```

#### Template Task

The `TemplateTask` generates text from templates, supporting LLM prompt preparation:

```python
class TemplateTask(Task):
    def register(self, template=None, rules=None, strict=True):
        self.template = template if template else self.defaulttemplate()
        self.rules = rules if rules else self.defaultrules()
        self.formatter = TemplateFormatter() if strict else Formatter()
```

Template processing supports three input types:

| Input Type | Processing Method |
|------------|-------------------|
| Dictionary | Named parameters from keys |
| Tuple | Numbered parameters (arg0, arg1, ...) |
| Other | Default `{text}` parameter |

Source: [src/python/txtai/workflow/task/template.py:1-50](https://github.com/neuml/txtai/blob/main/src/python/txtai/workflow/task/template.py)
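
The input-type mapping above can be sketched directly (illustrative; the function below follows the table, not the source verbatim):

```python
def parameters(inputs):
    # Resolve a task input into the arguments passed to the template
    if isinstance(inputs, dict):
        return inputs
    if isinstance(inputs, tuple):
        return {f"arg{i}": value for i, value in enumerate(inputs)}
    return {"text": inputs}

print(parameters({"question": "Q1"}))  # {'question': 'Q1'}
print(parameters(("a", "b")))          # {'arg0': 'a', 'arg1': 'b'}
print(parameters("plain"))             # {'text': 'plain'}
```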

## Module Organization

### Package Structure

```
txtai/
├── agent/           # Agent system and tools
├── api/             # REST API implementation
├── cloud/           # Cloud integration (Hub)
├── embeddings/      # Embeddings database
├── scoring/         # Scoring/ranking system
├── pipeline/        # Pipeline components
│   ├── audio/       # Audio processing
│   ├── data/        # Data extraction
│   ├── image/       # Image processing
│   ├── llm/         # LLM generation
│   ├── text/        # Text processing
│   └── train/       # Model training
└── workflow/        # Workflow engine
```

### Dependency Groups

The project uses optional dependency groups for modular installation:

| Extra | Purpose | Key Dependencies |
|-------|---------|------------------|
| `default` | Core functionality | faiss-cpu, numpy, torch, transformers |
| `api` | REST API | fastapi, uvicorn, aiohttp |
| `pipeline` | All pipelines | Combination of all pipeline extras |
| `agent` | Agent system | smolagents, mcpadapt, jinja2 |
| `vectors` | Vector models | sentence-transformers, litellm |

Source: [setup.py](https://github.com/neuml/txtai/blob/main/setup.py)

## Data Flow Architecture

### Semantic Search Flow

```mermaid
sequenceDiagram
    participant User
    participant Embeddings
    participant ANN
    participant Documents
    
    User->>Embeddings: index(documents)
    Embeddings->>ANN: build(vector)
    Embeddings->>Documents: store(data)
    
    User->>Embeddings: search(query)
    Embeddings->>ANN: query(vector)
    ANN-->>Embeddings: top-k results
    Embeddings->>Documents: retrieve(ids)
    Documents-->>Embeddings: document content
    Embeddings-->>User: ranked results
```

### RAG Pipeline Flow

```mermaid
graph TD
    Question["User Question"] --> Embeddings["Embeddings DB"]
    Embeddings --> Context["Retrieved Context"]
    Context --> LLM["LLM Pipeline"]
    LLM --> Answer["Generated Answer"]
    
    subgraph "Indexing Phase"
        Documents["Documents"] --> Textractor["Textractor"]
        Textractor --> Chunks["Text Chunks"]
        Chunks --> Embeddings
    end
```

## Cloud Integration

### Hub Module

The Hub module enables model and data sharing via Hugging Face:

```python
def upload(self, path):
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    
    if "embeddings " not in content:
        content += "documents filter=lfs diff=lfs merge=lfs -text\n"
        content += "embeddings filter=lfs diff=lfs merge=lfs -text\n"
    
    huggingface_hub.upload_file(...)
```

Features:
- Automatic LFS tracking for embeddings
- Token-based authentication
- Remote index management

Source: [src/python/txtai/cloud/hub.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/cloud/hub.py)

## Use Cases Built on Architecture

| Use Case | Architecture Components |
|----------|------------------------|
| Semantic Search | Embeddings + ANN Index + Workflow |
| RAG | Embeddings + LLM Pipeline + Template Task |
| LLM Orchestration | Agent System + Tool Factory + LLM Pipeline |
| Document Processing | Data Pipeline + Embeddings + Workflow |
| Multi-model Workflows | Pipeline Composition + Workflow Engine |

## Summary

The txtai architecture provides a flexible, extensible foundation for AI applications:

1. **Embeddings Database** - Unified vector/graph/relational indexing
2. **Pipeline System** - Modular processing for diverse data types
3. **Agent System** - Tool-augmented autonomous AI
4. **Workflow Engine** - Task orchestration and automation
5. **Cloud Integration** - Model sharing and deployment

This modular design enables developers to start with minimal installations and scale up capabilities as needed, supporting everything from simple semantic search to complex multi-model RAG systems with autonomous agents.

---

<a id='embeddings-core'></a>

## Embeddings Database

### Related Pages

Related topics: [System Architecture](#architecture), [Scoring and Retrieval Algorithms](#scoring-algorithms)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [src/python/txtai/embeddings/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/embeddings/__init__.py)
- [src/python/txtai/embeddings/base.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/embeddings/base.py)
- [src/python/txtai/embeddings/index/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/embeddings/index/__init__.py)
- [src/python/txtai/embeddings/search/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/embeddings/search/__init__.py)
- [src/python/txtai/vectors/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/vectors/__init__.py)
- [docs/embeddings/index.md](https://github.com/neuml/txtai/blob/main/docs/embeddings/index.md)
</details>

# Embeddings Database

The Embeddings Database is the foundational component of txtai, providing unified vector search capabilities that combine sparse and dense vector indexing, graph networks, and relational database functionality.

## Overview

The txtai embeddings database serves as a comprehensive solution for semantic search and vector-based data retrieval. It extends traditional database capabilities by incorporating neural network-based embedding generation and similarity search.

**Core Architecture:**

```mermaid
graph TD
    A[Embeddings Database] --> B[Vector Indexes]
    A --> C[Graph Networks]
    A --> D[Relational Database]
    B --> B1[Dense Vectors]
    B --> B2[Sparse Vectors]
```

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)

## Key Features

| Feature | Description |
|---------|-------------|
| Vector Search | Semantic similarity search using dense and sparse embeddings |
| SQL Support | Traditional database queries combined with vector search |
| Content Storage | Built-in document content storage |
| Graph Analysis | Integration with graph networks for relationship-based queries |
| Multimodal Indexing | Support for text, documents, audio, images, and video |
| Topic Modeling | Built-in topic modeling capabilities |

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)

## Installation and Setup

### Basic Installation

```python
import txtai

embeddings = txtai.Embeddings()
```

### Configuration Options

The embeddings database can be configured with the following key parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | string | None | Model path (Hugging Face, llama.cpp, Ollama, vLLM) |
| `content` | bool | False | Enable content storage |
| `maxlength` | int | None | Maximum sequence length |
| `quantize` | bool | False | Enable model quantization |
| `gpu` | bool | True | Enable GPU inference |

Source: [examples/rag_quickstart.py](https://github.com/neuml/txtai/blob/main/examples/rag_quickstart.py)

## Core Operations

### Indexing Documents

```python
embeddings = txtai.Embeddings()
embeddings.index(["Correct", "Not what we hoped"])
```

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)

### Semantic Search

```python
results = embeddings.search("positive", 1)
# Returns: [(0, 0.29862046241760254)]
```

The search method returns a list of `(id, score)` tuples ordered by descending similarity. When content storage is enabled, results are instead returned as dictionaries that include the matching text.

## Architecture Components

### Vector Indexes

The embeddings database supports multiple vector index backends:

| Backend | Package | Configuration |
|---------|---------|---------------|
| FAISS | `faiss-cpu` | Default backend |
| Annoy | `annoy` | Via `ann` extra |
| HNSWLib | `hnswlib` | Via `ann` extra |
| pgvector | `pgvector` | Via `ann` extra |
| sqlite-vec | `sqlite-vec` | Via `ann` extra |

Source: [setup.py](https://github.com/neuml/txtai/blob/main/setup.py)

### Vector Models

Supported embedding model providers:

- **Sentence Transformers**: `sentence-transformers>=5.0.0`
- **ONNX Models**: `onnx>=1.11.0`, `onnxruntime>=1.11.0`
- **llama.cpp**: `llama-cpp-python>=0.2.75`
- **LiteLLM**: `litellm>=1.37.16`
- **Model2Vec**: `model2vec>=0.3.0`
- **Static Vectors**: `staticvectors>=0.2.0`

Source: [setup.py](https://github.com/neuml/txtai/blob/main/setup.py)

## Data Flow

```mermaid
graph LR
    A[Input Text] --> B[Tokenizer]
    B --> C[Embedding Model]
    C --> D[Vector Index]
    D --> E[Search Query]
    E --> F[Similarity Score]
    F --> G[Results]
    
    H[(Content Store)] --> G
```
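
The stages above can be illustrated with a toy bag-of-words model. This is purely an illustration of the index/search flow — txtai uses neural embedding models, not word counts, and the names below are hypothetical:

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a sparse word-count vector
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse vectors
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Index": embed each input text
documents = ["txtai builds semantic search applications", "the weather is sunny today"]
index = [embed(text) for text in documents]

# "Search": embed the query, score against the index, return (id, score) pairs
query = embed("semantic search")
results = sorted(((uid, cosine(query, vector)) for uid, vector in enumerate(index)),
                 key=lambda x: x[1], reverse=True)
```

The first document scores highest because it shares the query terms; real embedding models rank by meaning rather than exact word overlap.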

## RAG Integration

The embeddings database serves as the knowledge source for Retrieval Augmented Generation (RAG) workflows:

```python
from txtai import Embeddings, RAG

# Build embeddings database
embeddings = Embeddings(content=True, path="Qwen/Qwen3-Embedding-0.6B", maxlength=2048)
embeddings.index(chunks)

# Create RAG pipeline
rag = RAG(
    embeddings,
    "Qwen/Qwen3-0.6B",
    system="You are a friendly assistant",
    template=template,
    output="flatten",
)
```

Source: [examples/rag_quickstart.py](https://github.com/neuml/txtai/blob/main/examples/rag_quickstart.py)

### RAG Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `similarity` | Embeddings/Similarity | Knowledge source database |
| `path` | string | LLM model path |
| `quantize` | bool | Enable model quantization |
| `gpu` | bool | Enable GPU inference |
| `template` | string | Prompt template |
| `context` | int | Maximum context length |
| `minscore` | float | Minimum similarity score |
| `mintokens` | int | Minimum token count |

Source: [src/python/txtai/pipeline/llm/rag.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/llm/rag.py)

## API Deployment

The embeddings database can be exposed as a REST API:

```yaml
# app.yml
embeddings:
    path: sentence-transformers/all-MiniLM-L6-v2
```

```bash
CONFIG=app.yml uvicorn "txtai.api:app"
curl -X GET "http://localhost:8000/search?query=positive"
```

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)

## Workflow Integration

The embeddings database integrates with txtai workflows for processing pipelines:

```python
from txtai import Embeddings
from txtai.pipeline import Textractor

# Text extraction: split each file into section-level chunks
textractor = Textractor(sections=True)
chunks = []
for f in files:
    chunks.extend(textractor(f))

# Index extracted content
embeddings = Embeddings(content=True, path="Qwen/Qwen3-Embedding-0.6B")
embeddings.index(chunks)
```

Source: [examples/rag_quickstart.py](https://github.com/neuml/txtai/blob/main/examples/rag_quickstart.py)

## Dependencies

### Core Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| `faiss-cpu` | >=1.7.1.post2 | Default vector index |
| `torch` | >=2.4 | Tensor operations |
| `transformers` | >=4.56.2 | Model inference |
| `huggingface-hub` | >=0.34.0 | Model management |
| `numpy` | >=1.18.4 | Numerical operations |
| `safetensors` | >=0.4.5 | Model serialization |

Source: [setup.py](https://github.com/neuml/txtai/blob/main/setup.py)

### Optional Dependencies

| Extra | Packages | Use Case |
|-------|----------|----------|
| `ann` | annoy, hnswlib, pgvector | Alternative indexes |
| `vectors` | sentence-transformers, litellm | Vector models |
| `similarity` | ann + vectors | Full search capabilities |

## Use Cases

### Semantic Search

Traditional search systems use keywords to find data. Semantic search understands natural language and identifies results with the same meaning, not necessarily the same keywords.

### Knowledge Base for LLMs

The embeddings database serves as a powerful knowledge source for LLM applications, enabling retrieval augmented generation (RAG) processes.

### Multimodal Applications

- **Text**: Document embeddings for text classification and clustering
- **Images**: Visual similarity search
- **Audio**: Speech-to-text with semantic search
- **Video**: Frame-level semantic indexing

Source: [README.md](https://github.com/neuml/txtai/blob/main/README.md)

## Performance Considerations

- **Index Size**: Vector indexes scale with document count and embedding dimensionality
- **GPU Acceleration**: Enable `gpu=True` for faster inference on compatible hardware
- **Quantization**: Use `quantize=True` to reduce memory footprint with minimal accuracy loss
- **Batch Processing**: Index documents in batches for optimal throughput
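
The batch-processing recommendation can be implemented with a simple chunking helper. This is a generic sketch (the helper name is hypothetical); `Embeddings.index` accepts any iterable:

```python
def batches(data, size=512):
    # Yield successive fixed-size slices for batched indexing
    for i in range(0, len(data), size):
        yield data[i:i + size]

# Illustrative usage: index/upsert one batch at a time rather than all documents at once
# for batch in batches(documents, size=1024):
#     embeddings.upsert(batch)
```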

## See Also

- [RAG Pipeline](https://neuml.github.io/txtai/pipeline/text/rag) - Retrieval Augmented Generation
- [Similarity Pipeline](https://neuml.github.io/txtai/pipeline/text/similarity) - Text similarity scoring
- [Workflows](https://neuml.github.io/txtai/workflows) - Pipeline orchestration
- [Agent Framework](https://neuml.github.io/txtai/agent) - Autonomous AI agents

---

<a id='pipelines'></a>

## Pipelines

### Related pages

Related topics: [Workflows](#workflows), [Agents](#agents)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [src/python/txtai/pipeline/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/__init__.py)
- [src/python/txtai/pipeline/factory.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/factory.py)
- [src/python/txtai/pipeline/base.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/base.py)
- [src/python/txtai/pipeline/text/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/text/__init__.py)
- [src/python/txtai/pipeline/audio/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/audio/__init__.py)
- [src/python/txtai/pipeline/llm/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/llm/__init__.py)
- [docs/pipeline/index.md](https://github.com/neuml/txtai/blob/main/docs/pipeline/index.md)
</details>

# Pipelines

## Overview

Pipelines are the core processing components in txtai that power language model workflows. They provide a unified interface for various text processing, transformation, and generation tasks including text extraction, summarization, transcription, translation, and LLM inference.

The pipeline system enables txtai to work with text, documents, audio, images and video by providing modular, composable processing units that can be chained together to build complex workflows. Source: [setup.py](setup.py)

## Architecture

```mermaid
graph TD
    subgraph "Pipeline Categories"
        Text[Text Pipelines]
        Audio[Audio Pipelines]
        LLM[LLM Pipelines]
        Data[Data Pipelines]
    end
    
    subgraph "Base Infrastructure"
        Base[Pipeline Base Class]
        Factory[Pipeline Factory]
    end
    
    Text --> Base
    Audio --> Base
    LLM --> Base
    Data --> Base
    Base --> Factory
```

## Pipeline Categories

### Text Pipelines

Text pipelines handle natural language processing tasks such as entity recognition, text classification, and language detection.

| Pipeline | Description | Key Features |
|----------|-------------|--------------|
| Labels | Text classification and labeling | Supports GLiNER-based models for NER and classification |
| Translation | Language translation | Powered by transformers |
| Segmentation | Text splitting | Sentence, line, paragraph, and section segmentation |

Source: [src/python/txtai/pipeline/text/__init__.py](src/python/txtai/pipeline/text/__init__.py)

### Audio Pipelines

Audio pipelines enable speech-to-text transcription and processing of audio content.

| Component | Description |
|-----------|-------------|
| Transcription | Convert audio to text using webrtcvad for voice activity detection |
| Audio processing | Support for sounddevice and soundfile libraries |

These pipelines require the `scipy`, `sounddevice`, `soundfile`, and `webrtcvad-wheels` dependencies. Source: [setup.py](setup.py)

### LLM Pipelines

LLM (Large Language Model) pipelines provide interfaces for text generation, chat completion, and vision-capable model interactions.

```mermaid
graph LR
    A[Input Text/List/Dict] --> B[Generator]
    B --> C[stream]
    B --> D[batch]
    C --> E[Streaming Response]
    D --> F[Complete Response]
```

The LLM pipeline supports multiple backends including HuggingFace transformers and liteLLM for unified LLM access. Key capabilities include:

- **Chat completion**: Supports both raw prompts and structured chat templates
- **Vision support**: Can process image inputs with vision-capable models
- **Streaming**: Real-time token-by-token response streaming
- **Stop sequences**: Configurable stop tokens for controlled generation

Source: [src/python/txtai/pipeline/llm/llm.py](src/python/txtai/pipeline/llm/llm.py), [src/python/txtai/pipeline/llm/huggingface.py](src/python/txtai/pipeline/llm/huggingface.py)
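
The stream/batch split can be illustrated with a toy generator. This is an illustration of the calling convention only — a real LLM pipeline produces model tokens, and the function below is hypothetical:

```python
def generate(prompt, stream=False):
    # Stand-in for model output tokens
    tokens = ["txtai ", "is ", "an ", "AI ", "framework"]
    if stream:
        return iter(tokens)        # streaming: caller consumes tokens as they arrive
    return "".join(tokens)         # batch: caller gets the complete response
```

A streaming caller can render tokens incrementally, while a batch caller receives the same text as one string.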

### Data Pipelines

Data pipelines handle document parsing, text extraction, and content conversion.

#### Textractor

The Textractor class extracts text from files using various backends:

```python
Textractor(
    sentences=False,      # Split into sentences
    lines=False,          # Split into lines
    paragraphs=False,     # Split into paragraphs
    minlength=None,       # Minimum text length filter
    join=False,           # Join segments
    sections=False,       # Extract sections
    cleantext=True,       # Clean extracted text
    chunker=None,         # Custom text chunker
    headers=None,         # HTTP headers for URLs
    backend="available",  # "tika", "docling", or "available"
    safeopen=False        # Restrict to safe URLs only
)
```

Supported backends:
- **Tika**: Apache Tika for comprehensive document parsing
- **Docling**: Alternative backend using docling library

Source: [src/python/txtai/pipeline/data/textractor.py](src/python/txtai/pipeline/data/textractor.py)

#### FileToHTML

Converts various file formats to HTML for further processing.

```mermaid
graph LR
    A[File Input] --> B[Backend Selection]
    B --> C{Tika Available?}
    C -->|Yes| D[Tika Backend]
    C -->|No| E{Docling Available?}
    E -->|Yes| F[Docling Backend]
    E -->|No| G[Return None]
    D --> H[HTML Output]
    F --> H
```

The backend detection follows this priority:
1. Tika (if `tika` package is installed)
2. Docling (if `docling` package is installed)
3. Returns `None` if no backend available

Source: [src/python/txtai/pipeline/data/filetohtml.py](src/python/txtai/pipeline/data/filetohtml.py)
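
This priority order can be sketched with the standard library (a sketch of the documented fallback behavior, not the actual txtai code):

```python
from importlib.util import find_spec

def detect_backend():
    # Mirror the documented priority: Tika first, then Docling, else no backend
    if find_spec("tika") is not None:
        return "tika"
    if find_spec("docling") is not None:
        return "docling"
    return None
```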

#### HTMLToMarkdown

Converts HTML content to Markdown format, supporting paragraph and section-level processing.

## Pipeline Factory

The pipeline factory pattern enables dynamic pipeline instantiation based on configuration:

```python
# Via factory
pipeline = PipelineFactory.create(config)

# Direct instantiation
pipeline = Summary("model-name")
pipeline = Textractor()
```

This allows pipelines to be defined in YAML configuration files and instantiated programmatically. Source: [src/python/txtai/pipeline/factory.py](src/python/txtai/pipeline/factory.py)

## Common Patterns

### Text Processing Chain

```mermaid
graph TD
    A[File/URL] --> B[Textractor]
    B --> C[Segmentation]
    C --> D[Text Clean]
    D --> E[Ready for Embeddings]
```

### LLM Generation Flow

```python
from txtai.pipeline import LLM

# Direct model loading
llm = LLM("model-name")

# Generate with various input formats
result = llm("What is artificial intelligence?")
result = llm([{"role": "user", "content": "Hello"}])
result = llm("Generate a summary", stream=True)
```

Parameters supported by LLM pipelines:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| text | str/list/dict | required | Input text or messages |
| maxlength | int | 512 | Maximum sequence length |
| stream | bool | False | Enable streaming response |
| stop | list | None | Stop sequences |
| defaultrole | str | "auto" | Default role for text inputs |
| stripthink | bool | None | Strip thinking tags |

Source: [src/python/txtai/pipeline/llm/llm.py](src/python/txtai/pipeline/llm/llm.py)

## Dependencies

Pipelines are organized into optional dependency groups:

```python
extras["pipeline-audio"]   # Audio processing
extras["pipeline-data"]     # Document parsing
extras["pipeline-image"]    # Image processing
extras["pipeline-llm"]      # LLM inference
extras["pipeline-text"]     # NLP tasks
extras["pipeline-train"]    # Model training
```

Full pipeline installation:
```bash
pip install txtai[pipeline]
```

Source: [setup.py](setup.py)

## Integration with Workflows

Pipelines integrate seamlessly with the workflow system:

```python
from txtai.workflow import Workflow, Task
from txtai.pipeline import Summary, Textractor

# Create a pipeline-based workflow
workflow = Workflow([
    Task(Textractor(paragraphs=True)),
    Task(Summary())
])

# Process documents
results = workflow(["path/to/document.pdf"])
```

The TemplateTask enables prompt template processing within workflows, supporting string formatting with named or positional parameters. Source: [src/python/txtai/workflow/task/template.py](src/python/txtai/workflow/task/template.py)

## See Also

- [Workflows](workflow.md) - Chaining pipelines together
- [Embeddings](embeddings.md) - Vector indexing and search
- [Agents](agent.md) - LLM-powered autonomous agents

---

<a id='workflows'></a>

## Workflows

### Related pages

Related topics: [Pipelines](#pipelines)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [src/python/txtai/workflow/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/workflow/__init__.py)
- [src/python/txtai/workflow/base.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/workflow/base.py)
- [src/python/txtai/workflow/factory.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/workflow/factory.py)
- [src/python/txtai/workflow/task/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/workflow/task/__init__.py)
- [src/python/txtai/workflow/task/base.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/workflow/task/base.py)
- [src/python/txtai/workflow/task/template.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/workflow/task/template.py)
- [examples/workflow_quickstart.py](https://github.com/neuml/txtai/blob/main/examples/workflow_quickstart.py)
</details>

# Workflows

## Overview

Workflows in txtai provide a declarative, pipeline-based approach to processing data through a series of connected tasks. They enable developers to chain together multiple processing components—including text extraction, summarization, translation, and large language model (LLM) calls—into cohesive data processing pipelines.

The workflow system supports:

- **Deterministic pipelines**: Define processing steps that execute in sequence
- **LLM-powered workflows**: Integrate language model prompts for intelligent processing
- **Parallel execution**: Configure thread or process-based concurrency
- **Flexible data transformation**: Handle one-to-many and many-to-one data transformations
- **Template-based prompts**: Generate prompts dynamically using template formatting

Source: `examples/workflow_quickstart.py:1-25`

## Architecture

```mermaid
graph TD
    A[Input Data] --> B[Workflow]
    B --> C[Task 1: Textractor]
    C --> D[Task 2: Summary]
    D --> E[Task N: Translation/LLM]
    E --> F[Output Results]
    
    G[Task Configuration] --> B
    H[Initialize Action] --> C
    I[Finalize Action] --> E
```

### Core Components

| Component | File | Purpose |
|-----------|------|---------|
| `Workflow` | `workflow/base.py` | Main workflow orchestration class |
| `Task` | `workflow/task/base.py` | Base class for all workflow tasks |
| `TemplateTask` | `workflow/task/template.py` | Task with template prompt support |
| `WorkflowFactory` | `workflow/factory.py` | Factory for workflow instantiation |

Source: `src/python/txtai/workflow/__init__.py:1-50`

## Workflow Configuration

### Python API

```python
from txtai import Workflow
from txtai.workflow import Task

workflow = Workflow([
    Task(textractor),
    Task(summary),
    Task(translate)
])

# Execute workflow
results = list(workflow(input_data))
```

### YAML Configuration

```yaml
workflow:
  tasks:
    - action: textractor
      task: textractor
    - action: summary
      task: summary
```

Source: `src/python/txtai/workflow/factory.py:1-30`

## Task System

### Task Base Class

The `Task` class is the fundamental building block of workflows. Each task defines:

- **Action**: Callable function or list of functions to execute
- **Select**: Filter functions to select specific data
- **Concurrency**: Thread or process-based execution
- **Data handling**: Unpacking, column selection, and merging strategies

```mermaid
graph LR
    A[Input Element] --> B{Selection Filter}
    B -->|Pass| C[Action Execution]
    B -->|Fail| D[Skip Element]
    C --> E[One-to-Many Transform]
    E --> F[Output Elements]
```

#### Task Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `action` | callable/list | `None` | Action(s) to execute on each data element |
| `select` | callable/list | `None` | Filter(s) to select data to process |
| `unpack` | bool | `True` | Unwrap data from (id, data, tag) tuples |
| `column` | int | `None` | Column index to select from tuple elements |
| `merge` | str | `"hstack"` | Merge mode for multi-action outputs |
| `initialize` | callable | `None` | Action executed before processing |
| `finalize` | callable | `None` | Action executed after processing |
| `concurrency` | str | `None` | `"thread"` or `"process"` for parallel execution |
| `onetomany` | bool | `True` | Enable one-to-many data transformations |

Source: `src/python/txtai/workflow/task/base.py:15-45`
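
The interaction of `select`, `action` and `onetomany` can be approximated with a small sketch (hypothetical code illustrating the filter → action → expand loop, not the real `Task` class):

```python
def run_task(elements, action, select=None, onetomany=True):
    # Filter with select, apply the action, then optionally flatten list outputs
    outputs = []
    for element in elements:
        if select and not select(element):
            continue                             # skipped by the selection filter
        result = action(element)
        if onetomany and isinstance(result, list):
            outputs.extend(result)               # one input expands to many outputs
        else:
            outputs.append(result)
    return outputs

# One input sentence expands into multiple chunk outputs
chunks = run_task(["a b c d"], action=lambda x: x.split(), select=lambda x: len(x) > 3)
```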

### TemplateTask

`TemplateTask` extends the base `Task` class with template processing capabilities for generating LLM prompts.

```python
from txtai.workflow import Workflow, Task
from txtai.workflow.task import TemplateTask

workflow = Workflow([
    Task(textractor),
    # Each element is formatted into the template as {text}, then passed to the LLM
    TemplateTask(
        action=llm,
        template="Summarize the following text in 40 words or less.\n\n{text}"
    ),
    TemplateTask(
        action=llm,
        template="Translate the following text to French.\n\n{text}"
    )
])
```

#### Template Processing

| Element Type | Template Behavior |
|--------------|-------------------|
| `dict` | Pass as named template parameters |
| `tuple` | Pass as `arg0`...`argN` template parameters |
| `str` | Pass as `{text}` parameter |

Source: `src/python/txtai/workflow/task/template.py:1-40`
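
The dispatch table above can be sketched as follows (a hypothetical helper assuming standard `str.format` semantics, not the actual TemplateTask code):

```python
def apply_template(template, element):
    # dict -> named parameters, tuple -> arg0..argN, str -> {text}
    if isinstance(element, dict):
        return template.format(**element)
    if isinstance(element, tuple):
        return template.format(**{f"arg{i}": value for i, value in enumerate(element)})
    return template.format(text=element)
```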

## Workflow Execution

### Basic Execution Flow

```mermaid
sequenceDiagram
    participant User
    participant Workflow
    participant Task1
    participant TaskN
    
    User->>Workflow: workflow(data)
    Workflow->>Task1: process(input)
    Task1->>Task1: select()
    Task1->>Task1: action()
    Task1->>Task1: onetomany transform
    Task1-->>TaskN: intermediate results
    TaskN->>TaskN: select()
    TaskN->>TaskN: action()
    TaskN-->>User: final results
```

### Concurrency Modes

| Mode | Use Case | Configuration |
|------|----------|---------------|
| `thread` | I/O-bound tasks | `concurrency="thread"` |
| `process` | CPU-bound tasks | `concurrency="process"` |
| `None` | Sequential execution | Default |

Source: `src/python/txtai/workflow/base.py:1-60`
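
Thread-based concurrency maps naturally onto `concurrent.futures`. The sketch below illustrates the execution model only (the function name is hypothetical, not txtai's internals):

```python
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(action, elements, concurrency=None):
    # Sequential by default; fan out across threads when requested
    if concurrency == "thread":
        with ThreadPoolExecutor() as executor:
            return list(executor.map(action, elements))
    return [action(element) for element in elements]
```

A process-based variant would swap in `ProcessPoolExecutor`, which suits CPU-bound actions at the cost of pickling inputs between processes.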

## Pipeline Components

### Available Pipeline Tasks

| Pipeline | Purpose | Example Usage |
|----------|---------|---------------|
| `Textractor` | Extract text from URLs/files | `Task(textractor)` |
| `Summary` | Generate text summaries | `Task(summary)` |
| `Translation` | Translate between languages | `Task(translate)` |
| `LLM` | Execute LLM prompts | `Task(llm)` |
| `Segmentation` | Split text into segments | `Task(segmentation)` |
| `Transcription` | Audio to text | `Task(transcription)` |

### Quick Start Example

```python
from txtai import LLM, Workflow
from txtai.pipeline import Summary, Textractor, Translation
from txtai.workflow import Task

# Step 1: Define available pipelines
textractor = Textractor(backend="docling", headers={"user-agent": "Mozilla/5.0"})
summary = Summary()
translate = Translation()

# Step 2: Define workflow tasks
workflow = Workflow([
    Task(textractor),           # Extract text from URL
    Task(summary),              # Generate summary
    Task(lambda inputs: [translate(x, "fr") for x in inputs])  # Translate to French
])

# Step 3: Run the workflow
results = list(workflow(["https://neuml.com"]))
```

Source: `examples/workflow_quickstart.py:1-30`

### LLM-Powered Workflow Example

```python
from txtai import LLM, Workflow
from txtai.pipeline import Textractor
from txtai.workflow import Task

textractor = Textractor(backend="docling")
llm = LLM("Qwen/Qwen3-4B-Instruct-2507")

def summary(text):
    return f"""Summarize the following text in 40 words or less.
{text}
"""

def translate(text, language):
    return f"""Translate the following text to {language}.
{text}
"""

workflow = Workflow([
    Task(textractor),
    Task(lambda inputs: llm([summary(x) for x in inputs], maxlength=25000)),
    Task(lambda inputs: llm([translate(x, "fr") for x in inputs], maxlength=25000))
])

results = list(workflow(["https://neuml.com"]))
```

Source: `examples/workflow_quickstart.py:35-60`

## Data Flow

```mermaid
graph LR
    A[(Input Data)] --> B[Workflow]
    B --> C[Task 1: Textractor]
    C --> D[Intermediate Data]
    D --> E[Task 2: Summary]
    E --> F[Intermediate Data]
    F --> G[Task N: Custom Action]
    G --> H[(Output Results)]
    
    I[URL Input] -->|UrlTask| C
    J[File Input] -->|FileTask| C
    K[Stream Input] -->|StreamTask| C
```

### Input/Output Handling

- **URL input**: Processed via `UrlTask` wrapper
- **File input**: Processed via `FileTask` wrapper
- **Stream input**: Processed via `StreamTask` wrapper
- **Tuple unpacking**: Supports `(id, data, tag)` format for labeled data

Source: `src/python/txtai/workflow/base.py:60-100`
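
The `(id, data, tag)` convention can be sketched as a hypothetical unpack helper mirroring the behavior described above:

```python
def unpack(element):
    # Labeled data arrives as (id, data, tag); bare values get null id/tag
    if isinstance(element, tuple) and len(element) == 3:
        return element
    return (None, element, None)
```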

## Advanced Configuration

### Custom Task Actions

```python
from txtai.workflow import Task

def custom_filter(element):
    """Filter elements before processing"""
    return len(element) > 10

def custom_action(element):
    """Process single element"""
    return element.upper()

def custom_finalize(results):
    """Aggregate results after processing"""
    return sorted(results, key=lambda x: x[1], reverse=True)

workflow = Workflow([
    Task(
        action=custom_action,
        select=custom_filter,
        finalize=custom_finalize,
        concurrency="thread"
    )
])
```

### Workflow Factory

```python
from txtai.workflow import WorkflowFactory

# Create workflow from configuration
workflow = WorkflowFactory.create(config)

# Run with data
results = list(workflow(data))
```

Source: `src/python/txtai/workflow/factory.py:1-50`

## Best Practices

1. **Use appropriate concurrency**: Set `concurrency="thread"` for I/O-bound tasks (API calls, file operations) and `concurrency="process"` for CPU-intensive tasks.

2. **Leverage one-to-many transforms**: Tasks can produce multiple outputs from a single input; this behavior is enabled by default via `onetomany=True`.

3. **Template formatting**: Use `TemplateTask` for consistent prompt formatting across LLM calls.

4. **Error handling**: Implement `select` filters to skip invalid data before reaching action processing.

5. **Pipeline selection**: Only install necessary pipeline extras—e.g., `pip install txtai[pipeline-data]` for text extraction workflows.

Source: `examples/workflow_quickstart.py:1-25`

## See Also

- [Embeddings](embeddings.md) - Vector search and similarity indexing
- [Pipelines](pipelines.md) - Language model pipelines
- [LLM Configuration](llm.md) - Large language model setup
- [API Documentation](api.md) - REST API reference

---

<a id='agents'></a>

## Agents

### Related pages

Related topics: [Pipelines](#pipelines), [Workflows](#workflows)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [src/python/txtai/agent/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/__init__.py)
- [src/python/txtai/agent/base.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/base.py)
- [src/python/txtai/agent/factory.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/factory.py)
- [src/python/txtai/agent/tool/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/tool/__init__.py)
- [src/python/txtai/agent/tool/factory.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/tool/factory.py)
- [src/python/txtai/agent/tool/function.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/agent/tool/function.py)
- [docs/agent/index.md](https://github.com/neuml/txtai/blob/main/docs/agent/index.md)
- [setup.py](https://github.com/neuml/txtai/blob/main/setup.py) (agent dependencies)
</details>

# Agents

Agents in txtai enable autonomous task execution by combining Large Language Models (LLMs) with a flexible tool system. The agent framework allows building AI-powered workflows that can reason, plan, and execute actions through configurable tools.

## Overview

txtai agents are built on top of [smolagents](https://github.com/huggingface/smolagents) and provide:

- **Tool-based execution**: Agents use tools to interact with external systems and data sources
- **LLM orchestration**: Seamless integration with various LLM backends including local models, APIs, and managed services
- **MCP support**: Integration with Model Context Protocol for extended tool collections
- **Configurable behavior**: Flexible configuration for tool selection, prompt templates, and execution parameters

The agent architecture separates concerns between tool definition, tool management, and agent execution, enabling modular and extensible agent systems.

## Architecture

```mermaid
graph TD
    A[User Input] --> B[Agent Core]
    B --> C[LLM Engine]
    C --> D[Tool Selection]
    D --> E[Tool Execution]
    E --> F[Results]
    F --> C
    C --> G[Final Response]
    
    H[ToolFactory] --> I[FunctionTool]
    H --> J[EmbeddingsTool]
    H --> K[MCPTools]
    H --> L[DefaultTools]
    
    subgraph Tool System
        I
        J
        K
        L
    end
```

### Core Components

| Component | File | Purpose |
|-----------|------|---------|
| Agent Base | `agent/base.py` | Core agent logic and execution loop |
| Agent Factory | `agent/factory.py` | Factory for creating configured agents |
| Tool Factory | `agent/tool/factory.py` | Creates and manages tool instances |
| Function Tool | `agent/tool/function.py` | Wraps Python functions as LLM tools |
| Tool Init | `agent/tool/__init__.py` | Tool abstractions and shared interfaces |

## Tool System

### Tool Types

The tool system supports multiple tool creation patterns:

| Type | Description | Source |
|------|-------------|--------|
| `FunctionTool` | Wraps a Python function with config | `agent/tool/function.py:1` |
| `EmbeddingsTool` | Embeddings-based search and retrieval | `agent/tool/factory.py` |
| `MCP Tools` | External MCP tool collections | `agent/tool/factory.py` |
| Default Tools | Built-in system tools | `agent/tool/factory.py` |

### FunctionTool

The `FunctionTool` class wraps descriptive configuration and injects it along with a target function into an LLM prompt.

```python
from smolagents import Tool

class FunctionTool(Tool):
    """
    A FunctionTool takes descriptive configuration and injects it along with a target function into an LLM prompt.
    """
```

**Configuration Schema:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | Yes | Tool identifier |
| `description` | string | Yes | Human-readable tool description for LLM |
| `inputs` | dict | Yes | Input parameter definitions |
| `output` | string | No | Output type (default: `"any"`) |
| `output_type` | string | No | Alternative output type field |
| `target` | callable | Yes | Function to execute |

### Tool Factory

The `ToolFactory` class handles tool creation from various configuration formats:

```python
@staticmethod
def create(config):
    """
    Creates a new list of tools. This method iterates over the `tools` configuration option and creates a Tool instance
    for each entry. This supports the following:

      - Tool instance
      - Dictionary with `name`, `description`, `inputs`, `output` and `target` function configuration
      - String with a tool alias name

    Returns:
        list of tools
    """
```

**Supported Tool Input Formats:**

1. **Tool Instance**: Direct `smolagents.Tool` objects
2. **Dictionary**: Config with `name`, `description`, `inputs`, `output`, and `target`
3. **String Alias**: Named shortcuts (e.g., `"webview"`, `"defaults"`)
4. **MCP URL**: HTTP URLs for MCP tool collections
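
A sketch of this dispatch (hypothetical and simplified from the behavior described above — not the actual `ToolFactory` code):

```python
def classify(entry):
    # Decide how a single tools-config entry should be materialized
    if isinstance(entry, dict):
        return "function"                            # name/description/inputs/target config
    if isinstance(entry, str) and entry.startswith(("http://", "https://")):
        return "mcp"                                 # MCP tool collection URL
    if isinstance(entry, str):
        return "alias"                               # named shortcut like "webview"
    return "instance"                                # already a Tool object
```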

### Default Tools

| Tool Name | Description |
|-----------|-------------|
| `webview` | Web content reading and extraction |
| `read` | File and content reading capabilities |
| `defaults` | Loads all default tools |

## Agent Configuration

### Dependencies

The agent module requires the following dependencies (specified in `setup.py:17`):

```python
extras["agent"] = [
    "jinja2>=3.1.6",
    "mcpadapt>=0.1.0",
    "smolagents>=1.23",
]
```

| Dependency | Version | Purpose |
|------------|---------|---------|
| `jinja2` | >=3.1.6 | Template processing for prompts |
| `mcpadapt` | >=0.1.0 | MCP tool protocol adapter |
| `smolagents` | >=1.23 | Core agent framework |

### Configuration Options

| Parameter | Type | Description |
|-----------|------|-------------|
| `llm` | dict | LLM configuration with model path and parameters |
| `tools` | list | List of tool configurations or aliases |
| `template` | string | Prompt template for agent instructions |
| `max_iterations` | int | Maximum agent execution iterations |
| `tool` | string | Primary execution tool/function |

## Agent Execution Flow

```mermaid
sequenceDiagram
    participant U as User
    participant A as Agent
    participant L as LLM
    participant T as Tool Manager
    
    U->>A: Execute Task
    A->>L: Generate Tool Call
    L-->>A: Tool Selection
    A->>T: Execute Tool
    T-->>A: Tool Result
    A->>L: Process Result
    L-->>A: Response
    A-->>U: Final Answer
```
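
The loop in the diagram can be sketched as a minimal tool-calling cycle. This is purely illustrative — smolagents implements the real loop, and every name below is hypothetical:

```python
def agent_loop(llm, tools, task, max_iterations=5):
    # Alternate between LLM reasoning and tool execution until an answer emerges
    context = task
    for _ in range(max_iterations):
        step = llm(context)
        if step["type"] == "answer":
            return step["value"]
        observation = tools[step["tool"]](**step["args"])
        context = f"{context}\nObservation: {observation}"
    return None

# Stub LLM: call the search tool once, then answer with the observation
def stub_llm(context):
    if "Observation" in context:
        return {"type": "answer", "value": context.split("Observation: ")[-1]}
    return {"type": "tool", "tool": "search", "args": {"query": "txtai"}}

answer = agent_loop(stub_llm, {"search": lambda query: f"results for {query}"}, "What is txtai?")
```

The `max_iterations` bound mirrors the agent configuration option above: it caps how many reason/act cycles run before giving up.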

## Integration with LLM Pipelines

Agents integrate with txtai's LLM pipeline system:

```python
# From llm.py - LLM generation entry point used for agent reasoning and tool calls
def __call__(self, text, maxlength=512, stream=False, stop=None, defaultrole="auto", stripthink=None, **kwargs):
    """
    Runs LLM generation.
    """
    return self.generator(text, maxlength, stream, stop, defaultrole, stripthink, **kwargs)
```

### LLM Integration Methods

| Method | Purpose |
|--------|---------|
| `ischat()` | Check if the LLM supports chat mode |
| `isvision()` | Check if the LLM supports vision/multimodal |
| `generator()` | Execute LLM generation |

## Tool Execution

Tool execution is handled by the `forward` method in `FunctionTool`:

```python
def forward(self, *args, **kwargs):
    """
    Runs target function.

    Args:
        args: positional args
        kwargs: keyword args

    Returns:
        result
    """
    return self.target(*args, **kwargs)
```

### Execution Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `args` | tuple | Positional arguments for tool function |
| `kwargs` | dict | Keyword arguments for tool function |
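The pass-through behavior of `forward` can be seen in a minimal stand-in class. This is a sketch, not the smolagents `FunctionTool` implementation:

```python
# Minimal stand-in for FunctionTool's forward() dispatch: the tool stores a
# target callable and passes positional/keyword args straight through.
class SimpleTool:
    def __init__(self, name, description, target):
        self.name = name
        self.description = description
        self.target = target

    def forward(self, *args, **kwargs):
        # Runs the target function with the given args
        return self.target(*args, **kwargs)

tool = SimpleTool("add", "Adds two numbers", lambda x, y=0: x + y)
result = tool.forward(2, y=3)  # → 5
```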

## Best Practices

1. **Tool Descriptions**: Provide clear, detailed descriptions for each tool to help the LLM understand when and how to use it
2. **Input Schemas**: Define complete input schemas with types and descriptions
3. **Error Handling**: Tools should handle errors gracefully and return meaningful error messages
4. **Configuration Validation**: Use the factory pattern to validate tool configurations before execution

## See Also

- [Pipeline LLM](../pipeline/llm/) - LLM configuration and execution
- [Workflows](../workflow/) - Task orchestration
- [Embeddings](../embeddings/) - Vector search capabilities

---

<a id='database-integration'></a>

## Database Integration

### Related Pages

Related topics: [Embeddings Database](#embeddings-core), [Scoring and Retrieval Algorithms](#scoring-algorithms)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [setup.py](https://github.com/neuml/txtai/blob/main/setup.py)
- [src/python/txtai/pipeline/data/textractor.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/data/textractor.py)
- [src/python/txtai/embeddings/index/documents.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/embeddings/index/documents.py)
- [src/python/txtai/cloud/hub.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/cloud/hub.py)
- [examples/workflows.py](https://github.com/neuml/txtai/blob/main/examples/workflows.py)
- [src/python/txtai/workflow/task/template.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/workflow/task/template.py)
</details>

# Database Integration

## Overview

txtai provides comprehensive database integration capabilities that enable vector search combined with structured data storage. The database layer is a core component of the embeddings database architecture, which unifies vector indexes (sparse and dense), graph networks, and relational databases into a single powerful system.

Sources: [setup.py:42-44]()

The database integration serves as the foundation for:

- Persistent storage of indexed documents and embeddings
- SQL-based querying alongside vector similarity search
- Integration with external data sources and cloud storage
- Workflow task management and state persistence

## Architecture

The database layer in txtai follows a modular architecture with several key components:

```mermaid
graph TD
    A[Embeddings Database] --> B[Database Layer]
    B --> C[SQLAlchemy Backend]
    B --> D[DuckDB Backend]
    C --> E[Schema Models]
    D --> E
    B --> F[Database Factory]
    F --> G[Database Client]
    G --> H[Query Engine]
    H --> I[Document Storage]
    H --> J[Vector Indexes]
```

### Key Components

| Component | Purpose | Source |
|-----------|---------|--------|
| Database Base | Abstract base class defining database operations | [src/python/txtai/database/base.py]() |
| Database Client | Client interface for executing queries | [src/python/txtai/database/client.py]() |
| Database Factory | Factory pattern for creating database instances | [src/python/txtai/database/factory.py]() |
| Schema Models | SQLAlchemy ORM models for data structures | [src/python/txtai/database/schema/__init__.py]() |

## Supported Databases

### SQLite/PostgreSQL with SQLAlchemy

txtai supports traditional relational databases through SQLAlchemy, providing cross-database compatibility. This enables seamless integration with:

- **SQLite**: Lightweight, file-based database for local development
- **PostgreSQL**: Production-grade relational database with advanced features
- **pgvector**: PostgreSQL extension for vector similarity search

Sources: [setup.py:42-44]()

```yaml
# Example configuration
embeddings:
    path: sentence-transformers/all-MiniLM-L6-v2
    database:
        url: postgresql+psycopg2://user:pass@localhost:5432/txtai
```

### DuckDB

DuckDB is an embedded analytical database that provides high-performance OLAP capabilities. txtai includes DuckDB as a database extra for analytical workloads.

Sources: [setup.py:42]()

```python
extras["database"] = ["duckdb>=0.8.0", "pillow>=7.1.2", "sqlalchemy>=2.0.20"]
```

## Database Configuration

### Configuration Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | `sqlite:///txtai.db` | Database connection URL |
| `path` | string | None | Path for file-based databases |
| `indexed` | bool | True | Enable full-text indexing |
| `content` | bool | True | Store original document content |
| `objects` | string | `storage` | Object storage backend |
| `embeddings` | string | `embeddings` | Embeddings storage directory |

### Connection String Formats

```
# SQLite (default)
sqlite:///txtai.db

# PostgreSQL with pgvector
postgresql://user:pass@localhost:5432/txtai

# DuckDB
duckdb:///path/to/database.duckdb

# MySQL
mysql+pymysql://user:pass@localhost:3306/txtai
```
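Before handing a connection URL to txtai, it can be useful to validate its parts. The sketch below uses only the standard library; the URL values are placeholders:

```python
# Hedged sketch: inspecting a connection URL before configuring the database.
# The credentials and host are placeholders.
from urllib.parse import urlsplit

url = "postgresql+psycopg2://user:pass@localhost:5432/txtai"
parts = urlsplit(url)

scheme = parts.scheme                # "dialect+driver" pair
host, port = parts.hostname, parts.port
database = parts.path.lstrip("/")    # database name follows the slash
```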

## Document Storage System

The document storage system provides efficient handling of indexed content with support for batch operations and streaming.

### Core Operations

```python
class DocumentStorage:
    def add(self, documents):
        """Add documents to storage"""
        
    def close(self):
        """Close and persist storage"""
        
    def delete(self, ids):
        """Remove documents by ID"""
        
    def get(self, ids):
        """Retrieve documents by ID"""
```

Sources: [src/python/txtai/embeddings/index/documents.py:1-50]()

### Batch Processing

Documents are processed in batches for optimal memory usage and performance:

```python
# Batch-based document indexing
self.serializer.savestream(documents, self.documents)
self.batch += 1
self.size += len(documents)
```

## Integration with Pipeline Data Processing

The Textractor module demonstrates database integration with external data sources, supporting extraction from various file formats and URLs.

Sources: [src/python/txtai/pipeline/data/textractor.py:1-60]()

### Supported Input Methods

| Method | Description |
|--------|-------------|
| URL Fetching | Extract text from web pages via HTTP |
| File Processing | Handle local and remote file paths |
| Cloud Storage | Integration with cloud object storage |
| Database Queries | Direct content retrieval from databases |

### Safe Open Mode

For security-sensitive environments, Textractor supports safe open mode that restricts access to local temporary directories and non-private URLs only:

```python
self.safeopen = os.path.realpath(
    tempfile.gettempdir() if isinstance(safeopen, bool) else safeopen
) if safeopen else safeopen
```
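The intent of the expression above can be illustrated with a standalone path check. This mirrors the logic described (bool selects the temp directory, a string names an explicit directory, falsy disables the restriction), not the exact Textractor code, and it covers only the file-path half of safe open mode:

```python
# Sketch of the safe open check: a path is only readable when it resolves
# inside the configured safe directory. Illustrative, not Textractor's code.
import os
import tempfile

def issafe(path, safeopen=True):
    """Returns True if path is allowed under the safe open policy."""
    if not safeopen:
        # Falsy value disables the restriction entirely
        return True

    # Bool selects the temp directory; a string names an explicit directory
    root = os.path.realpath(tempfile.gettempdir() if isinstance(safeopen, bool) else safeopen)
    return os.path.realpath(path).startswith(root + os.sep)

inside = issafe(os.path.join(tempfile.gettempdir(), "doc.txt"))
outside = issafe("/etc/passwd")
```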

## Cloud Storage Integration

txtai extends database capabilities with cloud storage integration for managing large embeddings and document collections.

### HuggingFace Hub Integration

The cloud hub module provides seamless upload and download of index files to HuggingFace repositories:

Sources: [src/python/txtai/cloud/hub.py:1-60]()

```python
# Automatic LFS tracking for large files
if "embeddings" not in content:
    content += "documents filter=lfs diff=lfs merge=lfs -text\n"
    content += "embeddings filter=lfs diff=lfs merge=lfs -text\n"
```

### Supported Cloud Providers

| Provider | Package | Features |
|----------|---------|----------|
| HuggingFace Hub | `huggingface-hub` | Model hosting, datasets |
| S3 Compatible | `apache-libcloud` | Object storage |
| Google Cloud | `apache-libcloud` | Cloud storage |
| Azure | `apache-libcloud` | Blob storage |

## Workflow Database Tasks

Workflows can include database operations as part of complex data processing pipelines.

### Task Configuration

```python
elif wtype == "tabular":
    data[wtype] = component
    tasks.append({"action": wtype})
```

Sources: [examples/workflows.py:1-100]()

### Workflow Task Types

| Task Type | Description | Data Format |
|-----------|-------------|-------------|
| `tabular` | Database table operations | CSV, DataFrame |
| `service` | Database connection pooling | Configuration |
| `extract` | Data extraction to database | Various sources |

## Query Patterns

### Vector + SQL Hybrid Queries

txtai supports combining vector similarity search with traditional SQL filtering:

```python
# Semantic search combined with SQL filtering. This uses txtai's SQL syntax
# with the similar() clause and requires content storage to be enabled.
results = embeddings.search(
    "SELECT id, text, score FROM txtai "
    "WHERE similar('machine learning') AND category = 'technical' AND year > 2020"
)
```

### Batch Query Optimization

For high-throughput applications:

```python
# Batch query for multiple inputs
queries = ["query1", "query2", "query3"]
results = embeddings.batchsearch(queries, limit=10)
```

## Performance Considerations

### Database Selection Guide

| Use Case | Recommended Database |
|----------|----------------------|
| Local development | SQLite |
| Production workloads | PostgreSQL + pgvector |
| Analytical queries | DuckDB |
| Large-scale embeddings | S3 + PostgreSQL hybrid |

### Optimization Tips

1. **Index Management**: Regular index rebuilding for maintaining query performance
2. **Batch Operations**: Use batch operations for bulk inserts and updates
3. **Connection Pooling**: Configure appropriate pool sizes for concurrent access
4. **Object Storage**: Offload large embeddings to object storage while keeping metadata in database

## Installation

Install database dependencies using extras:

```bash
# Basic database support (includes DuckDB, Pillow and SQLAlchemy)
pip install txtai[database]

# With all optional dependencies
pip install txtai[all]

# Approximate nearest neighbor backends, including pgvector support
pip install txtai[ann]
```

## See Also

- [Embeddings Configuration](embeddings/configuration/database.md) - Detailed configuration guide
- [Vector Search](../vectors/overview.md) - Combining database with vector search
- [Workflow Tasks](../workflow/overview.md) - Database tasks in workflows
- [Cloud Storage](../cloud/hub.md) - HuggingFace Hub integration

---

<a id='graph-networks'></a>

## Graph Networks

### Related Pages

Related topics: [Embeddings Database](#embeddings-core), [Database Integration](#database-integration)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [src/python/txtai/graph/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/graph/__init__.py)
- [src/python/txtai/graph/base.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/graph/base.py)
- [src/python/txtai/graph/factory.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/graph/factory.py)
- [src/python/txtai/graph/networkx.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/graph/networkx.py)
- [docs/embeddings/configuration/graph.md](https://github.com/neuml/txtai/blob/main/docs/embeddings/configuration/graph.md)
</details>

# Graph Networks

Graph Networks in txtai provide a powerful mechanism for building and traversing graph-based knowledge structures. This module enables semantic relationship mapping, topic modeling, and community detection capabilities within the broader txtai ecosystem.

## Overview

Graph Networks are a core component of txtai's architecture that combines vector search capabilities with graph-based relationship analysis. The graph module creates network representations that can be queried, traversed, and analyzed for patterns and communities.

```mermaid
graph TB
    subgraph "txtai Graph Module"
        GB[Graph Base Class]
        GX[NetworkX Backend]
        GF[Graph Factory]
        GT[Topics Module]
    end
    
    GB --> GX
    GB --> GT
    GF --> GB
    
    GX --> CD[Community Detection]
    GX --> NT[Node Traversal]
    GX --> RL[Relationship Links]
```

Sources: [src/python/txtai/graph/base.py:1-20]()

## Architecture

### Base Graph Class

The `Graph` class serves as the foundational abstraction for all graph implementations. It provides a consistent interface for graph operations regardless of the underlying backend.

```python
class Graph:
    """
    Base class for Graph instances. This class builds graph networks. Supports topic modeling
    and relationship traversal.
    """
```

**Key Attributes:**

| Attribute | Type | Description |
|-----------|------|-------------|
| `config` | dict | Graph configuration settings |
| `backend` | object | Graph backend implementation |
| `categories` | list | Topic modeling categories |
| `topics` | object | Topics instance for topic analysis |
| `text` | str | Column name for text data (default: "text") |
| `object` | str | Column name for object data (default: "object") |
| `copyattributes` | bool | Flag to copy all attributes when True |
| `relationships` | str | Column name for manually-provided edges |
| `relations` | dict | Stored relationships dictionary |

Sources: [src/python/txtai/graph/base.py:26-43]()

### NetworkX Backend

The NetworkX backend (`NetworkX` class) provides the actual graph implementation using the NetworkX library. This backend supports:

- **Community Detection**: Uses Louvain algorithm for partition detection
- **Graph Operations**: Node/edge addition, removal, and querying
- **Serialization**: Pickle-based serialization for persistence
- **Distance Computation**: Weight-based distance calculations between nodes

Sources: [src/python/txtai/graph/networkx.py:1-30]()

## Core Methods

### Graph Creation and Management

```python
def create(self):
    """Creates the graph network."""
    raise NotImplementedError

def count(self):
    """Returns the total number of nodes in graph."""
    raise NotImplementedError
```

| Method | Description | Returns |
|--------|-------------|---------|
| `create()` | Initializes and returns a new graph network | Graph instance |
| `count()` | Returns total number of nodes | int |
| `scan(attribute, data)` | Iterates over nodes matching criteria | Node iterator or tuples |
| `node(nodeid)` | Retrieves node attributes | dict or None |

Sources: [src/python/txtai/graph/base.py:45-65]()

### Node Operations

| Method | Description |
|--------|-------------|
| `addnode(node, **attrs)` | Adds a single node with attributes |
| `addnodes(nodes)` | Adds multiple nodes from iterable |
| `removenode(node)` | Removes a node from the graph |
| `hasnode(node)` | Checks if node exists |

Sources: [src/python/txtai/graph/networkx.py:55-75]()

## Community Detection

The NetworkX backend implements advanced community detection using the Louvain algorithm:

```python
def communities(self, config):
    """
    Runs community detection on graph.

    Args:
        config: topic configuration

    Returns:
        list of [ids] per community
    """
    level = config.get("level", "best")
    results = list(louvain_partitions(self.backend, weight="weight", resolution=config.get("resolution", 100), seed=0))
    return results[0] if level == "first" else results[-1]
```

**Configuration Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `level` | str | "best" | Partition level to use ("first" or "best") |
| `resolution` | int | 100 | Louvain resolution parameter |

Sources: [src/python/txtai/graph/networkx.py:35-55]()

## Distance Computation

The graph module calculates distances between nodes based on edge weights:

```python
def distance(self, source, target, attrs):
    """
    Computes distance between source and target nodes using weight.

    Returns:
        distance between source and target
    """
    distance = max(1.0 - attrs["weight"], 0.0)
    return distance if distance >= 0.15 else 1.00
```

**Distance Algorithm:**
- Distance = `1 - weight` (inverted edge weight), floored at 0.0
- Distances below 0.15 indicate near-duplicate nodes; such edges are reset to 1.00 to de-prioritize them during traversal

Sources: [src/python/txtai/graph/networkx.py:60-75]()
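A standalone version of this rule makes the behavior easy to check in isolation:

```python
# Standalone version of the distance rule described above, for illustration.
def distance(weight):
    """Converts an edge weight into a traversal distance."""
    # Invert the weight and floor at zero
    d = max(1.0 - weight, 0.0)

    # Near-duplicate edges (distance < 0.15) are pushed out to 1.00
    return d if d >= 0.15 else 1.00

distance(0.5)   # → 0.5
distance(0.95)  # → 1.0 (near-duplicate edge)
```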

## Serialization

### TAR File Loading (Legacy Format)

The NetworkX backend supports loading graphs from legacy TAR archives:

```python
def loadtar(self, path):
    """
    Loads a graph from the legacy TAR file.

    Args:
        path: path to graph
    """
    serializer = SerializeFactory.create("pickle")
    
    with TemporaryDirectory() as directory:
        archive = ArchiveFactory.create(directory)
        archive.load(path, "tar")
        self.backend = serializer.load(f"{directory}/graph")
```

**Serialization Features:**
- Pickle-based serialization for backwards compatibility
- TAR archive extraction to temporary directory
- Optional category loading from separate file

Sources: [src/python/txtai/graph/networkx.py:80-100]()

## Configuration

Graph networks are configured through a dictionary passed to the constructor:

```yaml
graph:
    columns:
        text: "content"        # Column containing text data
        object: "data"        # Column containing object data
        relationships: "edges" # Column for manual edge definitions
    copyattributes: false      # Copy all attributes when True
```

**Column Configuration Options:**

| Key | Default | Description |
|-----|---------|-------------|
| `text` | "text" | Name of the text column |
| `object` | "object" | Name of the object column |
| `relationships` | "relationships" | Name of the relationships column |

Sources: [docs/embeddings/configuration/graph.md]() and [src/python/txtai/graph/base.py:38-40]()

## Topic Modeling Integration

The Graph module integrates with txtai's topic modeling capabilities:

```mermaid
graph LR
    A[Input Data] --> B[Graph Indexing]
    B --> C[Topic Modeling]
    C --> D[Community Detection]
    D --> E[Category Assignment]
    
    C --> F[Topics Module]
    F --> G[Topic Analysis]
```

**Topics Integration:**
- `self.categories`: Stores discovered topic categories
- `self.topics`: Topics instance for analysis operations
- Community detection feeds into topic assignment

Sources: [src/python/txtai/graph/base.py:30-32]()

## Error Handling

The Graph module requires the NetworkX library for operation:

```python
if not NETWORKX:
    raise ImportError('NetworkX is not available - install "graph" extra to enable')
```

To enable Graph Networks functionality, install the graph extra:

```bash
pip install txtai[graph]
```

Sources: [src/python/txtai/graph/networkx.py:23-25]()

## Usage Example

```python
# Sketch based on the Graph/NetworkX classes above. The GraphFactory call
# follows the factory pattern in src/python/txtai/graph/factory.py and
# requires the "graph" extra; check method names against the base class.
from txtai.graph import GraphFactory

# Create a NetworkX-backed graph
graph = GraphFactory.create({"backend": "networkx"})
graph.initialize()

# Add nodes and a weighted edge
graph.addnode(1, text="example node")
graph.addnode(2, text="another node")
graph.addedge(1, 2, weight=0.7)

# Query nodes
nodes = list(graph.scan())
count = graph.count()

# Run community detection
communities = graph.communities({"level": "best", "resolution": 100})
```

## Related Components

| Component | Description |
|-----------|-------------|
| `Embeddings` | Vector index that may utilize graphs for relationship mapping |
| `Topics` | Topic modeling module integrated with graph analysis |
| `Archive` | Serialization system for graph persistence |
| `Workflow` | Orchestration that can incorporate graph operations |

---

<a id='scoring-algorithms'></a>

## Scoring and Retrieval Algorithms

### Related Pages

Related topics: [Embeddings Database](#embeddings-core), [Database Integration](#database-integration)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [setup.py](https://github.com/neuml/txtai/blob/main/setup.py)
- [src/python/txtai/embeddings/index/documents.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/embeddings/index/documents.py)
- [src/python/txtai/api/cluster.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/api/cluster.py)
- [src/python/txtai/pipeline/llm/llm.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/pipeline/llm/llm.py)
- [src/python/txtai/embeddings/database/__init__.py](https://github.com/neuml/txtai/blob/main/src/python/txtai/embeddings/database/__init__.py)
</details>

# Scoring and Retrieval Algorithms

## Overview

txtai provides a sophisticated scoring and retrieval system that powers its semantic search capabilities. At its core, the system combines vector-based similarity search with traditional BM25 sparse retrieval, enabling hybrid search approaches that leverage both semantic understanding and exact term matching.

The scoring module is an integral part of the embeddings database architecture, which unifies vector indexes, graph networks, and relational databases into a single powerful knowledge source for LLM applications.

Sources: [README.md]()

## Architecture

The scoring and retrieval system follows a modular factory pattern that allows different scoring algorithms to be plugged in based on configuration. The architecture is designed to support both standalone scoring operations and integrated retrieval within the embeddings database.

```mermaid
graph TD
    A[Query Input] --> B[Query Processing]
    B --> C{Scoring Algorithm}
    C --> D[BM25 Scoring]
    C --> E[Vector Similarity]
    C --> F[Hybrid Scoring]
    D --> G[Results Aggregation]
    E --> G
    F --> G
    G --> H[Ranked Results]
```

Sources: [setup.py](), [src/python/txtai/scoring/factory.py]()

## Scoring Module Structure

The scoring subsystem is defined in the `txtai.scoring` package and includes base classes and concrete implementations for different retrieval algorithms.

### Key Components

| Component | Purpose | Location |
|-----------|---------|----------|
| `Base` | Abstract base class defining scoring interface | `scoring/base.py` |
| `BM25` | Traditional sparse retrieval algorithm | `scoring/bm25.py` |
| `Factory` | Factory for creating scoring instances | `scoring/factory.py` |
| Package `__init__` | Module exports and configuration | `scoring/__init__.py` |

Sources: [setup.py:50-51]()

## BM25 Retrieval Algorithm

BM25 (Best Matching 25) is a probabilistic ranking function used for information retrieval. It extends the binary independence model to incorporate term frequency and document length normalization.

### Configuration Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `k1` | float | 1.5 | Term frequency saturation parameter |
| `b` | float | 0.75 | Length normalization factor |
| `avgdl` | float | auto | Average document length |

### How BM25 Works

1. **Term Frequency (TF)**: Measures how often a term appears in a document, with saturation to prevent over-weighting
2. **Inverse Document Frequency (IDF)**: Weights terms by their rarity across the corpus
3. **Document Length Normalization**: Adjusts scores based on document length relative to average

Sources: [src/python/txtai/scoring/bm25.py](), [src/python/txtai/scoring/base.py]()
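The three components above combine into the standard BM25 scoring formula. The following is a from-scratch sketch of that formula for a single term, not txtai's implementation:

```python
# Minimal BM25 scorer for one term in one document, following the three
# components described above. Illustrative sketch, not txtai's code.
import math

def bm25(tf, df, n, doclen, avgdl, k1=1.5, b=0.75):
    """
    Scores one term in one document.

    Args:
        tf: term frequency in the document
        df: number of documents containing the term
        n: total documents in the corpus
        doclen: length of this document
        avgdl: average document length
    """
    # IDF: rare terms score higher
    idf = math.log(1 + (n - df + 0.5) / (df + 0.5))

    # TF with saturation (k1) and length normalization (b)
    norm = k1 * (1 - b + b * doclen / avgdl)
    return idf * tf * (k1 + 1) / (tf + norm)
```

Higher term frequency raises the score with diminishing returns, while rarer terms (lower `df`) contribute more weight.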

## Integration with Embeddings Database

The scoring algorithms are tightly integrated with the embeddings database system, which handles document indexing, storage, and retrieval.

```mermaid
graph LR
    A[Documents] --> B[Indexing Pipeline]
    B --> C[Vector Index]
    B --> D[BM25 Index]
    B --> E[Document Store]
    F[Query] --> G[Hybrid Search]
    C --> G
    D --> G
    E --> G
    G --> H[Scored Results]
```

### Document Storage

Documents are managed through a streaming serializer that writes to temporary files, enabling efficient handling of large document collections:

```python
# Add batch of documents
self.serializer.savestream(documents, self.documents)
self.batch += 1
self.size += len(documents)
```

Sources: [src/python/txtai/embeddings/index/documents.py:18-23]()
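The batching pattern shown can be sketched end to end with the standard library. txtai uses its own serializer; this stand-in only illustrates the idea of appending batches to a stream while tracking counts:

```python
# Sketch of batch-oriented streaming writes: each batch is appended to a
# single temp file, so documents never need to be held in memory all at once.
import pickle
import tempfile

class BatchWriter:
    def __init__(self):
        self.documents = tempfile.TemporaryFile()
        self.batch, self.size = 0, 0

    def add(self, documents):
        # Append one pickled batch and update counters
        pickle.dump(documents, self.documents)
        self.batch += 1
        self.size += len(documents)

writer = BatchWriter()
writer.add([(0, "first doc"), (1, "second doc")])
writer.add([(2, "third doc")])
```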

## Search API

The scoring system exposes search capabilities through both local and cluster APIs, supporting single queries, batch searches, and hybrid scoring with configurable weights.

### Search Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `query` | str | Search query string |
| `limit` | int | Maximum number of results (default: 10) |
| `weights` | list | Hybrid score weights for combining methods |
| `index` | str | Specific index to search |
| `parameters` | dict | Named parameters for advanced queries |
| `graph` | bool | Include graph network results |

### Query Execution Flow

```mermaid
sequenceDiagram
    participant Client
    participant API
    participant Scorer
    participant Index
    
    Client->>API: search(query, limit)
    API->>Scorer: execute(query)
    Scorer->>Index: score(query)
    Index-->>Scorer: raw scores
    Scorer-->>API: scored results
    API->>API: aggregate(query, results)
    API->>API: sort and limit
    API-->>Client: ranked results
```

Sources: [src/python/txtai/api/cluster.py:85-105]()

### Batch Search

For processing multiple queries efficiently, the API supports batch search operations:

```python
def batchsearch(self, queries, limit=None, weights=None, index=None, parameters=None, graph=False):
    """
    Finds documents most similar to the input queries.
    """
```

Sources: [src/python/txtai/api/cluster.py:107-127]()

## Hybrid Scoring

txtai supports hybrid scoring that combines multiple retrieval methods. This is particularly valuable when you want to leverage both semantic similarity (vector-based) and exact term matching (BM25).

### Weight Configuration

The `weights` parameter controls how different scoring methods are combined:

| Weights | Effect |
|---------|--------|
| `[0.0, 1.0]` | Pure vector search |
| `[1.0, 0.0]` | Pure BM25 search |
| `[0.5, 0.5]` | Equal contribution |
| `[0.7, 0.3]` | Vector-biased hybrid |
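The weighting reduces to a simple weighted sum per result. The sketch below assumes scores are ordered `[bm25, vector]` to match the table; the actual ordering and normalization should be checked against txtai's hybrid scoring code:

```python
# Sketch of weighted score aggregation for hybrid search: each method's
# score is scaled by its weight and summed. Ordering is an assumption.
def hybrid(scores, weights):
    """Combines per-method scores using the given weights."""
    return sum(score * weight for score, weight in zip(scores, weights))

# BM25 score 0.8, vector score 0.6
hybrid([0.8, 0.6], [0.0, 1.0])   # → 0.6 (pure vector)
hybrid([0.8, 0.6], [1.0, 0.0])   # → 0.8 (pure BM25)
hybrid([0.8, 0.6], [0.5, 0.5])   # → 0.7 (equal contribution)
```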

### Result Aggregation

Results from multiple scoring methods are combined using weighted aggregation and sorted by the combined score:

```python
# Combine aggregate functions and sort
results = self.aggregate(query, results)
# Limit results
return results[: (limit if limit else 10)]
```

Sources: [src/python/txtai/api/cluster.py:97-99]()

## Configuration Reference

### Scoring Extra Dependencies

To use the scoring module, install the `scoring` extra:

```bash
pip install txtai[scoring]
```

Required dependencies from `setup.py`:

| Package | Version | Purpose |
|---------|---------|---------|
| sqlalchemy | >=2.0.20 | Database operations |

Sources: [setup.py:50-51]()

### Complete Installation Options

For full functionality including scoring:

```bash
pip install txtai[all]
```

This includes:
- `ann` - Approximate nearest neighbor indexes
- `vectors` - Vector similarity components
- `scoring` - Retrieval algorithms
- `api` - REST API service

Sources: [setup.py:66-81]()

## Usage Examples

### Basic Embeddings Search

```python
import txtai

embeddings = txtai.Embeddings()
embeddings.index(["Correct", "Not what we hoped"])
embeddings.search("positive", 1)
# Returns: [(0, 0.29862046241760254)]
```

### Configuration via YAML

```yaml
# app.yml
embeddings:
    path: sentence-transformers/all-MiniLM-L6-v2
```

Sources: [README.md]()

## Related Components

| Component | Relationship |
|-----------|--------------|
| **Embeddings Database** | Primary consumer of scoring algorithms |
| **Vector Indexes** | Works alongside scoring for hybrid search |
| **API Service** | Exposes scoring via REST endpoints |
| **Workflow Engine** | Uses scoring for document routing |

The scoring and retrieval algorithms form the computational backbone of txtai's search capabilities, enabling both high-precision keyword matching through BM25 and semantic understanding through vector similarity. This combination allows applications to benefit from the strengths of both traditional information retrieval and modern neural embeddings.

---


## Doramagic 踩坑日志

项目：neuml/txtai

摘要：发现 22 个潜在踩坑项，其中 0 个为 high/blocking；最高优先级：安装坑 - 来源证据：Add `txtai_minimal` package。

## 1. 安装坑 · 来源证据：Add `txtai_minimal` package

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：Add `txtai_minimal` package
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_4e4050b59fdf4d51ac21e82d248e589e | https://github.com/neuml/txtai/issues/1090 | 来源讨论提到 python 相关条件，需在安装/试用前复核。

## 2. 安装坑 · 来源证据：Add custom Captions implementation

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：Add custom Captions implementation
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_ff6fe05065564e85a549d7656b85a852 | https://github.com/neuml/txtai/issues/1084 | 来源类型 github_issue 暴露的待验证使用条件。

## 3. 安装坑 · 来源证据：Add custom Questions pipeline implementation

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：Add custom Questions pipeline implementation
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_fa68ea29089d497bb8278c3c360c363c | https://github.com/neuml/txtai/issues/1087 | 来源类型 github_issue 暴露的待验证使用条件。

## 4. 安装坑 · 来源证据：Add custom Sequences pipeline implementation

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：Add custom Sequences pipeline implementation
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_0099afb385ef498d8020521bbf3e9e64 | https://github.com/neuml/txtai/issues/1086 | 来源类型 github_issue 暴露的待验证使用条件。

## 5. 安装坑 · 来源证据：Add custom Summary pipeline implementation

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：Add custom Summary pipeline implementation
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_9d8f75c31fc6450d8659b310f25d0bf4 | https://github.com/neuml/txtai/issues/1085 | 来源类型 github_issue 暴露的待验证使用条件。

## 6. 安装坑 · 来源证据：Add minimal Docker build script

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：Add minimal Docker build script
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_38e1e56a991b412bbc749d2c0b687d54 | https://github.com/neuml/txtai/issues/1092 | 来源讨论提到 docker 相关条件，需在安装/试用前复核。

## 7. 安装坑 · 来源证据：Make `transformers` package optional for minimal install

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：Make `transformers` package optional for minimal install
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_d755829200de4d93b2f4d2f71b36aaf1 | https://github.com/neuml/txtai/issues/1091 | 来源类型 github_issue 暴露的待验证使用条件。

## 8. 安装坑 · 来源证据：Reduce dependencies to just `numpy` for minimal install

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：Reduce dependencies to just `numpy` for minimal install
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_395b735968ab4ec8ad849070d43cd18f | https://github.com/neuml/txtai/issues/1093 | 来源类型 github_issue 暴露的待验证使用条件。

## 9. 安装坑 · 来源证据：Support minimal install for edge devices

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：Support minimal install for edge devices
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_3ec8e039baf4412888ed3ac543ea1361 | https://github.com/neuml/txtai/issues/1089 | 来源讨论提到 python 相关条件，需在安装/试用前复核。

## 10. Installation pitfall · Source evidence: Various training pipeline fixes for v5

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows an unverified installation-related issue for this project: Various training pipeline fixes for v5
- User impact: may increase the cost of first-time trials and production adoption.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must note the applicable versions.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; note the applicable versions and review status.
- Evidence: community_evidence:github | cevd_b47325d7496248a993d980c3d59f3269 | https://github.com/neuml/txtai/issues/1088 | unverified usage conditions surfaced by a github_issue source.

## 11. Installation pitfall · Source evidence: Zero dependency minimal install

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows an unverified installation-related issue for this project: Zero dependency minimal install
- User impact: may increase the cost of first-time trials and production adoption.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must note the applicable versions.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; note the applicable versions and review status.
- Evidence: community_evidence:github | cevd_7839a465e7e64f94acfff001c1da978c | https://github.com/neuml/txtai/issues/1094 | unverified usage conditions surfaced by a github_issue source.
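The minimal-install items above (Docker build, optional `transformers`, numpy-only, zero-dependency) all come down to which heavy dependencies a given txtai version actually pulls in. A minimal sketch for checking this locally after `pip install txtai`; the module names probed below are illustrative guesses at the relevant optional dependencies, not a list confirmed by the txtai documentation:

```python
import importlib.util

def check_optional_deps(modules):
    """Map each module name to whether it is importable in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

# Heavy optional dependencies a minimal txtai install may or may not include
status = check_optional_deps(["torch", "transformers", "onnxruntime"])
for name, present in status.items():
    print(f"{name}: {'installed' if present else 'missing'}")
```

Running this in a fresh virtual environment before and after installation shows exactly which extras an install command brought in, which is the quickest way to re-verify the minimal-install claims against the version you actually received.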

## 12. Installation pitfall · Source evidence: v9.9.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows an unverified installation-related issue for this project: v9.9.0
- User impact: may increase the cost of first-time trials and production adoption.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must note the applicable versions.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; note the applicable versions and review status.
- Evidence: community_evidence:github | cevd_8b842ce68a1c4c1ab0a40a3ed1fd213c | https://github.com/neuml/txtai/releases/tag/v9.9.0 | unverified usage conditions surfaced by a github_release source.

## 13. Capability pitfall · Capability assessment relies on an assumption

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: if the assumption fails, users do not get the promised capability.
- Suggested check: convert the assumption into a downstream verification checklist.
- Safeguard: assumptions must become verification items; do not state them as fact before verification results exist.
- Evidence: capability.assumptions | github_repo:286301447 | https://github.com/neuml/txtai | README/documentation is current enough for a first validation pass.

## 14. Runtime pitfall · Source evidence: v9.7.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows an unverified runtime-related issue for this project: v9.7.0
- User impact: may increase the cost of first-time trials and production adoption.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must note the applicable versions.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; note the applicable versions and review status.
- Evidence: community_evidence:github | cevd_e3f61e41f6e84b9b9ee33d5e18dbff63 | https://github.com/neuml/txtai/releases/tag/v9.7.0 | unverified usage conditions surfaced by a github_release source.

## 15. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: last_activity_observed was not recorded.
- User impact: new, abandoned, and active projects get mixed together, lowering the trustworthiness of any recommendation.
- Suggested check: backfill GitHub signals for recent commits, releases, and issue/PR responsiveness.
- Safeguard: while maintenance activity is unknown, recommendation strength must not be marked as high-trust.
- Evidence: evidence.maintainer_signals | github_repo:286301447 | https://github.com/neuml/txtai | last_activity_observed missing
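The missing last_activity_observed signal can be backfilled directly from the GitHub REST API. A sketch, assuming unauthenticated access (rate-limited by GitHub) and an illustrative 90-day staleness threshold; the helper names are not part of any txtai tooling:

```python
import json
import urllib.request
from datetime import datetime, timezone

def is_stale(iso_timestamp, max_days=90):
    """True if an ISO-8601 timestamp like '2025-11-01T12:00:00Z' is older than max_days."""
    observed = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - observed).days > max_days

def last_commit_date(repo="neuml/txtai"):
    """Fetch the timestamp of the most recent commit on the default branch."""
    url = f"https://api.github.com/repos/{repo}/commits?per_page=1"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)[0]["commit"]["committer"]["date"]

# Usage (requires network access):
# print(is_stale(last_commit_date()))
```

A single check like this is enough to separate an actively maintained repository from a stale one before the recommendation strength is set.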

## 16. Security/permissions pitfall · Downstream validation found a risk item

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: downstream review has already been requested; this must not be downplayed on this page.
- Suggested check: enter the security/permissions governance review queue.
- Safeguard: while downstream risks exist, the review/recommendation downgrade must be kept in place.
- Evidence: downstream_validation.risk_items | github_repo:286301447 | https://github.com/neuml/txtai | no_demo; severity=medium

## 17. Security/permissions pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: the risk affects whether the project is suitable for ordinary users to install.
- Suggested check: write the risk into the boundary card and confirm whether manual review is needed.
- Safeguard: scoring risks must enter the boundary card; they cannot remain internal scores only.
- Evidence: risks.scoring_risks | github_repo:286301447 | https://github.com/neuml/txtai | no_demo; severity=medium

## 18. Security/permissions pitfall · Source evidence: Add LiteRT-LM LLM

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows an unverified security/permissions-related issue for this project: Add LiteRT-LM LLM
- User impact: may affect authorization, key configuration, or security boundaries.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must note the applicable versions.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; note the applicable versions and review status.
- Evidence: community_evidence:github | cevd_8731971fd8404df082f75ecf565fef8b | https://github.com/neuml/txtai/issues/1095 | the source discussion mentions python-related conditions; re-verify before installing or trialing.

## 19. Security/permissions pitfall · Source evidence: v9.6.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows an unverified security/permissions-related issue for this project: v9.6.0
- User impact: may affect authorization, key configuration, or security boundaries.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must note the applicable versions.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; note the applicable versions and review status.
- Evidence: community_evidence:github | cevd_fe429d6152304930bda149c4fbd35318 | https://github.com/neuml/txtai/releases/tag/v9.6.0 | unverified usage conditions surfaced by a github_release source.

## 20. Security/permissions pitfall · Source evidence: v9.8.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows an unverified security/permissions-related issue for this project: v9.8.0
- User impact: may affect authorization, key configuration, or security boundaries.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must note the applicable versions.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; note the applicable versions and review status.
- Evidence: community_evidence:github | cevd_8bebd6701b454baaae22d73522ce9704 | https://github.com/neuml/txtai/releases/tag/v9.8.0 | the source discussion mentions python-related conditions; re-verify before installing or trialing.

## 21. Maintenance pitfall · Issue/PR responsiveness unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: users cannot tell whether anyone will respond when they run into problems.
- Suggested check: sample recent issues/PRs to see whether they go unhandled for long periods.
- Safeguard: while issue/PR responsiveness is unknown, the maintenance risk must be flagged.
- Evidence: evidence.maintainer_signals | github_repo:286301447 | https://github.com/neuml/txtai | issue_or_pr_quality=unknown

## 22. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: install commands and documentation may lag behind the code, raising the chance that users hit pitfalls.
- Suggested check: confirm that the latest release/tag matches the install commands in the README.
- Safeguard: while release cadence is unknown or stale, install instructions must note possible drift.
- Evidence: evidence.maintainer_signals | github_repo:286301447 | https://github.com/neuml/txtai | release_recency=unknown
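One way to resolve the release_recency=unknown signal is to compare the latest GitHub release tag against the locally installed version. A sketch, assuming unauthenticated API access and that release tags follow the `v<semver>` pattern seen in the evidence links above; the helper names are illustrative:

```python
import json
import urllib.request

def latest_release(repo="neuml/txtai"):
    """Return (tag_name, published_at) for the repo's most recent GitHub release."""
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    return data["tag_name"], data["published_at"]

def matches_installed(tag, installed):
    """True when a release tag such as 'v9.9.0' matches an installed version '9.9.0'."""
    return tag.removeprefix("v") == installed

# Usage (requires network access and an installed txtai):
# from importlib.metadata import version
# tag, published = latest_release()
# print(tag, published, matches_installed(tag, version("txtai")))
```

If the tag and the installed version diverge, the README install commands should be treated as possibly drifted until re-verified.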

<!-- canonical_name: neuml/txtai; human_manual_source: deepwiki_human_wiki -->
