# https://github.com/timescale/pgai Project Guide

Generated: 2026-06-23 15:48:12 UTC

## Table of Contents

- [Overview, Architecture & Installation](#page-1)
- [Vectorizer & Embedding Pipeline](#page-2)
- [Semantic Catalog & Text-to-SQL](#page-3)
- [Embedders, Worker & Operational Concerns](#page-4)

## Overview, Architecture & Installation

### Related Pages

Related topics: [Vectorizer & Embedding Pipeline](#page-2), [Embedders, Worker & Operational Concerns](#page-4)

Related source files

The following source files were used to generate this page:

- [README.md](https://github.com/timescale/pgai/blob/main/README.md)
- [projects/extension/README.md](https://github.com/timescale/pgai/blob/main/projects/extension/README.md)
- [examples/simple_fastapi_app/README.md](https://github.com/timescale/pgai/blob/main/examples/simple_fastapi_app/README.md)
- [examples/embeddings_from_documents/README.md](https://github.com/timescale/pgai/blob/main/examples/embeddings_from_documents/README.md)
- [examples/text_to_sql/README.md](https://github.com/timescale/pgai/blob/main/examples/text_to_sql/README.md)
- [examples/evaluations/litellm_vectorizer/README.md](https://github.com/timescale/pgai/blob/main/examples/evaluations/litellm_vectorizer/README.md)
- [examples/evaluations/voyage_vectorizer/README.md](https://github.com/timescale/pgai/blob/main/examples/evaluations/voyage_vectorizer/README.md)
- [scripts/vectorizer-load-test/README.md](https://github.com/timescale/pgai/blob/main/scripts/vectorizer-load-test/README.md)

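As orientation for the semantic search queries on this page, which order rows by pgvector's `<=>` cosine distance operator, here is a minimal plain-Python sketch of what that distance computes. The helper name `cosine_distance` and the sample vectors are illustrative assumptions; they are not part of pgai or pgvector, and the database computes this natively.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # points the same way as the query
orthogonal = [0.0, 3.0]       # unrelated direction
```

A smaller distance means the stored chunk points in a direction closer to the query embedding, which is why the queries on this page sort by distance in ascending order.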
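# Overview, Architecture & Installation

## 1. Project Overview

pgai is an open-source project developed by Timescale, positioned as a **PostgreSQL integration layer for AI applications**. It turns PostgreSQL into a retrieval engine for retrieval-augmented generation (RAG) and agentic applications, so developers can handle vector embeddings, the semantic catalog, model calls, and text-to-SQL conversion at the SQL level. Source: [README.md:1-40]()

The project's core goals can be summarized in three points:

1. **Simplify the data embedding workflow**: the vectorizer mechanism automatically creates and maintains vector embeddings for PostgreSQL tables or S3 documents, so the application layer does not have to handle scheduling or retry logic. Source: [README.md:42-60]()
2. **Call LLMs directly from inside the database**: summarization, classification, text generation, and similar tasks run in SQL through functions such as `ai.ollama_generate`, `ai.ollama_chat_complete`, and `ai.text_to_sql`. Source: [projects/extension/README.md:30-60]()
3. **Support a semantic catalog**: database objects (tables, views, functions) are described in natural language and paired with example SQL, which gives text-to-SQL its context. Source: [examples/text_to_sql/README.md:1-30]()

pgai ships both a Python package (`pip install pgai`) and a PostgreSQL extension (`CREATE EXTENSION ai`). The two work together but can be deployed independently.

## 2. Core Architecture

The pgai architecture has three layers: **the application layer, the database layer, and stateless vectorizer workers**. The application layer writes and queries business data. The database layer exposes vectors, embeddings, the semantic catalog, and related objects through the `ai` schema. The vectorizer worker reads the vectorizer configuration, processes the work queue, and writes the results back. Source: [README.md:48-58]()

```mermaid
flowchart LR
    App["Application<br/>FastAPI / application code"] -->|INSERT/UPDATE/DELETE| DB[("PostgreSQL<br/>ai schema<br/>vectorizer configuration")]
    DB -->|poll work queue| Worker["Stateless vectorizer worker"]
    Worker -->|call LLM/embedding API| Provider["Ollama / OpenAI / Voyage / Cohere / LiteLLM"]
    Provider -->|return vectors| Worker
    Worker -->|write chunk + embedding| DB
    App -->|vector search / RAG| DB
```

**Key design principle**: application writes (INSERT/UPDATE/DELETE) are decoupled from embedding generation. Failures, rate limits, or latency at the LLM endpoint do not block business transactions. Source: [README.md:58-60]()

The vectorizer worker can run in any of the following ways:

- As a built-in `Worker` started inside a Python application (suited to single-process setups such as FastAPI). Source: [examples/simple_fastapi_app/README.md:30-50]()
- From a terminal through the `pgai vectorizer worker` CLI command.
- As an independently deployed Docker container (see `projects/pgai/compose-dev.yaml`). Source: [examples/embeddings_from_documents/README.md:1-15]()

## 3. Vectorizer Pipeline

The vectorizer is pgai's core abstraction. It defines a configurable pipeline from raw data to embedding vectors. The pipeline consists of five stages executed in sequence. Source: [README.md:88-104]()

| Stage | Purpose | Configuration reference |
|------|------|----------|
| Loading | Defines the data source: a column in the table, or a URI pointing to a file or S3 object | `ai.loading_column`, `ai.loading_uri` |
| Parsing | Parses non-text documents (PDF, HTML, Markdown) | `ai.parsing_*` |
| Chunking | Splits text into smaller pieces | `ai.chunking_*` |
| Formatting | Adds metadata such as a prefix to each chunk before embedding | `ai.formatting_*` |
| Embedding | Specifies the LLM provider, model, and parameters | `ai.embedding_ollama`, `ai.embedding_openai`, and others |

Supported embedding providers include Ollama, OpenAI, Voyage AI, Cohere, HuggingFace, Mistral, Azure OpenAI, AWS Bedrock, and Vertex AI. Source: [README.md:106-118]()

Once created, a vectorizer exposes its results as a **destination table** or **view** that contains an `embedding` column and a `chunk` column. Applications run semantic search against it with pgvector operators such as `<=>` (cosine distance). Source: [README.md:158-175]()

## 4. Installation and Quick Start

### 4.1 Installing the Python Package

Install the core package with pip:

```bash
pip install pgai
```

To run the vectorizer worker, install the package with the `vectorizer-worker` extra. Source: [README.md:60-66](), [examples/simple_fastapi_app/README.md:75-85]()

### 4.2 Installing the Database Objects

The database objects can be installed through the CLI or from Python code. With the CLI:

```bash
# DB_URL holds your PostgreSQL connection string; the -d flag expects that URL.
pgai install -d "$DB_URL"
```

From Python (run at application startup):

```python
import pgai

pgai.install(DB_URL)
```

After installation, all objects live in the `ai` schema. Source: [README.md:60-70](), [examples/simple_fastapi_app/README.md:30-40]()

### 4.3 Minimal Example

The following walkthrough uses Wikipedia semantic search to show the full flow. Source: [README.md:120-175]()

1. Create the `wiki` table and load data into it.
2. Create a vectorizer with `ai.create_vectorizer()`:

   ```sql
   SELECT ai.create_vectorizer(
       'wiki'::regclass,
       if_not_exists => true,
       loading => ai.loading_column(column_name=>'text'),
       embedding => ai.embedding_openai(model=>'text-embedding-ada-002', dimensions=>'1536'),
       destination => ai.destination_table(view_name=>'wiki_embedding')
   );
   ```

3. Run the worker once to process the existing data:

   ```python
   # Import path as given in this guide; see Issue #925 if it fails in your environment.
   from pgai import Worker

   worker = Worker(DB_URL, once=True)  # once=True: process pending work, then exit
   worker.run()
   ```

4. Run a semantic search against the `wiki_embedding` view, where `$1` is the query embedding:

   ```sql
   SELECT w.id, w.title, w.chunk, w.embedding <=> $1 AS distance
   FROM wiki_embedding w
   ORDER BY distance
   LIMIT 2;
   ```

5. Inject the retrieved rows into the LLM prompt as context to complete the RAG flow. Source: [examples/simple_fastapi_app/README.md:50-90]()

### 4.4 Advanced Scenarios

- **Embeddings from PDF/HTML/Markdown documents**: use `ai.loading_uri` together with a parser. See [examples/embeddings_from_documents/README.md](https://github.com/timescale/pgai/blob/main/examples/embeddings_from_documents/README.md).
- **Large-scale load testing**: use the scripts in [scripts/vectorizer-load-test/README.md](https://github.com/timescale/pgai/blob/main/scripts/vectorizer-load-test/README.md) to generate a `wiki` table of about 1.5 million rows.
- **Multi-model comparison**: see the evaluation setups in `examples/evaluations/litellm_vectorizer` (Cohere/Mistral/OpenAI) and `examples/evaluations/voyage_vectorizer` (Voyage AI). Source: [examples/evaluations/litellm_vectorizer/README.md:1-40](), [examples/evaluations/voyage_vectorizer/README.md:1-15]()
- **Text-to-SQL and the semantic catalog**: see [examples/text_to_sql/README.md](https://github.com/timescale/pgai/blob/main/examples/text_to_sql/README.md), which uses `ai.create_semantic_catalog` and `ai.text_to_sql_openai`.

## 5. Known Issues and Community Concerns

- **Ollama version compatibility**: `ai.ollama_embed` fails on Ollama 13.3+ (Issue #921, reported with pgai extension 0.11.2 and pgai library 0.12.1). Check this before upgrading Ollama. Source: [Issue #921](https://github.com/timescale/pgai/issues/921)
- **`Worker` import problem**: `from pgai import Worker` fails in some `pgai[vectorizer-worker]` installations (Issue #925). Source: [Issue #925](https://github.com/timescale/pgai/issues/925)
- **HTTP client connection leak**: the HTTP clients created by the `AsyncOpenAI`, `Ollama`, and `VoyageAI` embedders are not closed explicitly, so long-running processes can exhaust file descriptors (Issue #919). Source: [Issue #919](https://github.com/timescale/pgai/issues/919)
- **Outdated dependency versions**: key dependencies such as `litellm` and `openai` are pinned to older version ranges, which conflicts with projects that use newer AI/ML libraries (Issue #915). Source: [Issue #915](https://github.com/timescale/pgai/issues/915)
- **Missing Ollama functionality**: the community has asked to use ReRank models (such as qwen-reranker) through Ollama. The Ollama integration currently covers embeddings only, not reranking (Issue #866). Source: [Issue #866](https://github.com/timescale/pgai/issues/866)
- **RDS support**: deploying pgai on Amazon RDS is a long-standing community request (Issue #304). Source: [Issue #304](https://github.com/timescale/pgai/issues/304)

## 6. See Also

- [Vectorizer Overview](/docs/vectorizer/overview.md)
- [Vectorizer API Reference](/docs/vectorizer/api-reference.md)
- [Vectorizer Worker Deployment](/docs/vectorizer/worker.md)
- [Semantic Catalog](/docs/semantic_catalog/README.md)
- [API Key Management](/projects/extension/docs/security/handling-api-keys.md)

---

## Vectorizer & Embedding Pipeline

### Related Pages

Related topics: [Overview, Architecture & Installation](#page-1), [Embedders, Worker & Operational Concerns](#page-4)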

Related source files

The following source files were used to generate this page:

- [README.md](https://github.com/timescale/pgai/blob/main/README.md)
- [docs/vectorizer/api-reference.md](https://github.com/timescale/pgai/blob/main/docs/vectorizer/api-reference.md)
- [docs/vectorizer/overview.md](https://github.com/timescale/pgai/blob/main/docs/vectorizer/overview.md)
- [examples/embeddings_from_documents/README.md](https://github.com/timescale/pgai/blob/main/examples/embeddings_from_documents/README.md)
- [examples/evaluations/litellm_vectorizer/README.md](https://github.com/timescale/pgai/blob/main/examples/evaluations/litellm_vectorizer/README.md)
- [examples/simple_fastapi_app/README.md](https://github.com/timescale/pgai/blob/main/examples/simple_fastapi_app/README.md)
- [projects/extension/README.md](https://github.com/timescale/pgai/blob/main/projects/extension/README.md)

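The chunking stage covered on this page is configured with a chunk size and an overlap, for example 512 and 50. To illustrate what those two numbers control, here is a deliberately simplified sketch: a fixed-size character chunker with overlap. It is not pgai's recursive character text splitter, which additionally prefers to break on separators; the function name `chunk_text` and its parameters are assumptions made for illustration.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Small numbers make the overlap visible: each chunk repeats the
# last character of the previous one.
example_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
```

The overlap trades some duplicated storage and embedding cost for a lower chance that a sentence relevant to a query is cut in half at a chunk boundary.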
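# Vectorizer & Embedding Pipeline

## Overview and Core Architecture

The pgai vectorizer is a configurable end-to-end pipeline that automatically converts raw data in PostgreSQL tables into vector embeddings and keeps the source data and the embeddings continuously in sync. Its core design idea is to decouple embedding generation from the application's data modifications (INSERT/UPDATE/DELETE), so that intermittent failures or rate limits at the LLM service do not affect the availability of the core business. Source: [README.md:88-92]()

Logically, the system consists of three kinds of entities:

- **Application**: the user's business code, which defines the vectorizer configuration and inserts or updates data.
- **PostgreSQL database**: stores the source data and the destination table/view created through `ai.create_vectorizer()`.
- **Stateless vectorizer worker**: reads work from the database, calls the LLM embedding endpoint, and writes the results back.

The main advantage of this application, database, and worker triangle is resilience: workers are stateless and horizontally scalable, and data modification is fully separated from embedding. Source: [README.md:42-48]()

## Pipeline Stages in Detail

The vectorizer pipeline consists of five components applied in order. Source: [README.md:62-72]()

```mermaid
flowchart LR
    A["Source data<br/>PostgreSQL column"] --> B["Loading"]
    B --> C["Parsing"]
    C --> D["Chunking"]
    D --> E["Formatting"]
    E --> F["Embedding"]
    F --> G[("Destination table/view<br/>with embedding column")]
    H["Vectorizer Worker"] -.->|poll for work| A
    H -.->|write results| G
```

### Loading

Defines where the data to embed comes from. It can be text stored directly in a column of the source table, or a URI stored in a column that points to a local file or a remote object such as one in an S3 bucket. Source: [examples/embeddings_from_documents/README.md:18-32]()

### Parsing

Defines how to parse the data when it is a non-text document (PDF, HTML, Markdown). The community has requested OCR support for PDF/DOCX through docling ([Issue #795](https://github.com/timescale/pgai/issues/795)). Source: [examples/embeddings_from_documents/README.md:34-38]()

### Chunking

Splits text into pieces sized for the embedding model. A typical configuration is `ai.chunking_recursive_character_text_splitter('text', 512, 50)`. Source: [examples/evaluations/litellm_vectorizer/README.md:74-82]()

### Formatting

Defines the format of each chunk before it is sent to the embedding endpoint. For example, placing the document title on the first line of each chunk can improve the quality of the retrieval context. Source: [README.md:70-72]()

### Embedding

Specifies the LLM provider, model, and parameters. For the full API reference, see the documentation of the `ai.embedding_*` family of functions. Source: [docs/vectorizer/api-reference.md]()

## Supported Embedding Models

pgai supports multiple embedding providers, all exposed through the `ai.embedding_*` interface. Source: [README.md:74-86]()

- **Ollama**: local/open-source models. The community has requested support for ReRank models such as qwen-reranker ([Issue #866](https://github.com/timescale/pgai/issues/866)).
- **OpenAI**: the default cloud option (Azure OpenAI is available through LiteLLM).
- **Voyage AI**: a commercial embedding service with its own quick-start document. Source: [docs/vectorizer/quick-start-voyage.md]()
- **Cohere / Huggingface / Mistral**: integrated through LiteLLM.
- **AWS Bedrock / Vertex AI**: integrated through LiteLLM.

Known community issue: when the Ollama server is newer than 13.3, the `ollama_embed` function fails with a compatibility error ([Issue #921](https://github.com/timescale/pgai/issues/921)). Source: [README.md:74-86]()

## Vectorizer Worker

The worker is the stateless process that actually generates embeddings. Install it with `pip install "pgai[vectorizer-worker]"`, then either start it inside a Python process with `from pgai import Worker`, or run it independently through the CLI or Docker. Source: [examples/simple_fastapi_app/README.md:30-36]()

In the FastAPI example, the worker is bound to the application lifecycle and stays resident in the background, continuously polling the work queue. Source: [examples/simple_fastapi_app/README.md:42-50]()

The system supports concurrent batch processing to generate embeddings efficiently, and has built-in handling for model failures, rate limits, and latency spikes. Source: [README.md:96-102]()

### Common Problems Reported by the Community

- **Import failure**: `from pgai import Worker` raises `cannot import name 'Worker' from 'pgai'` ([Issue #925](https://github.com/timescale/pgai/issues/925)).
- **HTTP client connection leak**: the HTTP clients created by embedders such as AsyncOpenAI, Ollama, and VoyageAI are not closed explicitly. Over long runs, connections accumulate in the `CLOSE_WAIT` state and exhaust file descriptors ([Issue #919](https://github.com/timescale/pgai/issues/919)).
- **Outdated dependency versions**: the upper version bounds on dependencies such as `litellm` and `openai` restrict integration with newer AI/ML libraries ([Issue #915](https://github.com/timescale/pgai/issues/915)).

## Typical Usage Flow

1. **Install the package and database objects**: run `pip install pgai`, then `pgai.install(DB_URL)` to install the required objects under the `ai` schema. Source: [projects/extension/README.md:30-38]()
2. **Create a vectorizer**: call `ai.create_vectorizer()` and specify the source, loading, embedding, destination, and other settings. Source: [projects/extension/README.md:40-50]()
3. **Start the worker**: run it once (`once=True`) or as a resident process to handle existing data and new changes. Source: [projects/extension/README.md:52-60]()
4. **Query the results**: the destination view generated by the vectorizer (for example `wiki_embedding`) contains all columns of the source table plus the `embedding` and `chunk` columns, so semantic search and RAG can use pgvector's `<=>` cosine distance operator directly. Source: [examples/simple_fastapi_app/README.md:82-96]()

## See Also

- [Vectorizer Overview](https://github.com/timescale/pgai/blob/main/docs/vectorizer/overview.md)
- [Vectorizer API Reference](https://github.com/timescale/pgai/blob/main/docs/vectorizer/api-reference.md)
- [Vectorizer Quick Start (OpenAI)](https://github.com/timescale/pgai/blob/main/docs/vectorizer/quick-start.md)
- [Vectorizer Quick Start (Voyage AI)](https://github.com/timescale/pgai/blob/main/docs/vectorizer/quick-start-voyage.md)
- [Vectorizer Worker Documentation](https://github.com/timescale/pgai/blob/main/docs/vectorizer/worker.md)
- [Document Embeddings Example](https://github.com/timescale/pgai/blob/main/examples/embeddings_from_documents/README.md)
- [Semantic Catalog (Text-to-SQL)](https://github.com/timescale/pgai/blob/main/docs/semantic_catalog/README.md)

---

## Semantic Catalog & Text-to-SQL

### Related Pages

Related topics: [Vectorizer & Embedding Pipeline](#page-2), [Embedders, Worker & Operational Concerns](#page-4)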

Related source files

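The text-to-SQL flow described on this page has two steps before the LLM is called: retrieve the catalog entries closest to the question, then assemble them into a prompt. The sketch below mimics that shape in plain Python. Everything in it is an illustrative assumption: the function names `retrieve` and `build_prompt`, the prompt wording, the two-dimensional toy embeddings, and the catalog descriptions are hypothetical, and pgai's actual implementation runs inside the database.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def retrieve(question_embedding, catalog, top_k=2):
    """Return the top_k catalog entries closest to the question embedding."""
    ranked = sorted(
        catalog,
        key=lambda entry: cosine_distance(question_embedding, entry["embedding"]),
    )
    return ranked[:top_k]


def build_prompt(question, entries):
    """Assemble retrieved descriptions and the question into one prompt string."""
    context = "\n".join(f"- {e['name']}: {e['description']}" for e in entries)
    return (
        f"Relevant database objects:\n{context}\n\n"
        f"Question: {question}\nWrite one SQL query that answers it."
    )


# Hypothetical catalog entries with toy embeddings.
catalog = [
    {"name": "postgres_air.flight",
     "description": "flights with arrival airport and scheduled times",
     "embedding": [0.9, 0.1]},
    {"name": "postgres_air.passenger",
     "description": "passengers attached to bookings",
     "embedding": [0.1, 0.9]},
]
question = "How many flights arrived in Houston, TX in June 2024?"
question_embedding = [1.0, 0.0]  # stand-in for a real embedding of the question

top_entries = retrieve(question_embedding, catalog, top_k=1)
prompt = build_prompt(question, top_entries)
```

Keeping retrieval separate from prompt assembly mirrors the decoupling this page describes: the embeddings of the descriptions are maintained ahead of time, and only the question is embedded at call time.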
# Semantic Catalog & Text-to-SQL

## Overview

pgai introduces the concept of a "semantic catalog" on top of PostgreSQL's `pg_catalog`. Database objects (tables, views, functions, and so on) can carry natural-language descriptions and example SQL, which provides the semantic retrieval basis for LLM-driven text-to-SQL. Source: [README.md:40-42]().

The `ai.text_to_sql` function first runs a semantic search over the semantic catalog to find the database objects and example SQL statements relevant to the question, then passes that context to the LLM to generate SQL. Source: [examples/text_to_sql/README.md:14-22]().

> Community note: issue #24, "Text-to-SQL", asked early on whether `pg_catalog` could be embedded to drive text-to-SQL. The semantic catalog is the core implementation of that direction. Source: [GitHub Issue #24](https://github.com/timescale/pgai/issues/24).

## Architecture and Data Flow

The semantic catalog reuses pgai's [Vectorizer pipeline](/docs/vectorizer/overview.md) to generate and maintain the embeddings of the descriptions, decoupled from text-to-SQL calls. Source: [examples/text_to_sql/README.md:14-22]().

```mermaid
flowchart LR
    A["Database objects<br/>tables/views/functions"] --> B["ai.create_semantic_catalog"]
    B --> C["Description table + example SQL"]
    C --> D["Vectorizer pipeline"]
    D --> E["Embedding vectors<br/>embedding column"]
    U["User question"] --> F["ai.text_to_sql"]
    E --> F
    F --> G["LLM generates SQL"]
```

The call flow works as follows:

1. An administrator runs `ai.create_semantic_catalog` to initialize the catalog and configure the vectorizer and the text-to-SQL provider. Source: [examples/text_to_sql/README.md:42-58]().
2. The vectorizer embeds the descriptions and example SQL into a vector table and updates them automatically as the data changes. Source: [README.md:40-50]().
3. A user calls `ai.text_to_sql('...')`. The function embeds the question, retrieves the relevant objects and examples from the semantic catalog, assembles the prompt, and calls the configured LLM to generate SQL. Source: [examples/text_to_sql/README.md:14-22]().

## Configuration and Enabling

### Enabling the Feature Flag

Text-to-SQL currently has to be enabled explicitly through a feature flag:

```sql
select set_config('ai.enable_feature_flag_text_to_sql', 'true', false);
create extension ai cascade;
```

Source: [examples/text_to_sql/README.md:32-40]().

### Creating the Semantic Catalog

`ai.create_semantic_catalog` accepts parameters similar to those of `ai.create_vectorizer` and also specifies the text-to-SQL provider. Different providers can be chosen for embedding and for SQL generation:

```sql
select ai.create_semantic_catalog(
    embedding => ai.embedding_openai('text-embedding-3-small', 1024),
    text_to_sql => ai.text_to_sql_openai(model => 'o3-mini')
);
```

Source: [examples/text_to_sql/README.md:42-58]().

### API Keys

API keys must be exposed to pgai as described in the [handling-api-keys documentation](/projects/extension/docs/security/handling-api-keys.md). A common approach is to set the `OPENAI_API_KEY` environment variable. Source: [examples/text_to_sql/README.md:60-66]().

## Usage Example

The following example, taken from the `postgres_air` demo dataset, shows one complete text-to-SQL call. Source: [examples/text_to_sql/README.md:1-12]().

Input, a natural-language question:

```sql
select ai.text_to_sql('How many flights arrived in Houston, TX in June 2024?');
```

Returned SQL:

```sql
SELECT COUNT(*) AS num_flights
FROM postgres_air.flight
WHERE arrival_airport = 'IAH'
  AND scheduled_arrival >= '2024-06-01'::timestamptz
  AND scheduled_arrival < '2024-07-01'::timestamptz;
```

Running this SQL returns 1273. Source: [examples/text_to_sql/README.md:1-12]().

## Known Issues and Community Feedback

- **Refinement loop can exhaust its iterations**: issue #926 reports that `ai.text_to_sql` can get stuck in a "refinement iterations" loop in some scenarios until it reaches the maximum number of iterations and raises an exception. Source: [GitHub Issue #926](https://github.com/timescale/pgai/issues/926).
- **Ollama version compatibility**: after upgrading Ollama above 13.3, the `ollama_embed` function fails, which may affect embedding and retrieval when the Ollama provider is used. Source: [GitHub Issue #921](https://github.com/timescale/pgai/issues/921), and the earlier [GitHub Issue #21](https://github.com/timescale/pgai/issues/21).
- **HTTP client connection leak**: the HTTP clients of embedders such as AsyncOpenAI, Ollama, and VoyageAI are not closed explicitly, and long-running processes eventually hit `Too many open files`. Source: [GitHub Issue #919](https://github.com/timescale/pgai/issues/919).
- **Ollama supports embedding but not reranking**: in the current documentation Ollama can be used only for embedding, and the rerank providers do not yet support open-source models such as qwen-reranker. Source: [GitHub Issue #866](https://github.com/timescale/pgai/issues/866).

## Configuration Options at a Glance

| Parameter | Example value | Purpose |
| --- | --- | --- |
| `embedding` | `ai.embedding_openai('text-embedding-3-small', 1024)` | Embedding model and dimensions for the semantic catalog descriptions |
| `text_to_sql` | `ai.text_to_sql_openai(model => 'o3-mini')` | LLM provider and model used to generate SQL |
| `ai.enable_feature_flag_text_to_sql` | `true` | Controls whether the text-to-SQL feature is enabled |

Source: [examples/text_to_sql/README.md:32-58]().

## See Also

- [docs/vectorizer/overview.md](https://github.com/timescale/pgai/blob/main/docs/vectorizer/overview.md)
- [docs/vectorizer/api-reference.md](https://github.com/timescale/pgai/blob/main/docs/vectorizer/api-reference.md)
- [projects/extension/docs/security/handling-api-keys.md](https://github.com/timescale/pgai/blob/main/projects/extension/docs/security/handling-api-keys.md)

---

## Embedders, Worker & Operational Concerns

### Related Pages

Related topics: [Overview, Architecture & Installation](#page-1), [Vectorizer & Embedding Pipeline](#page-2), [Semantic Catalog & Text-to-SQL](#page-3)

Related source files

The following source files were used to generate this page:

- [README.md](https://github.com/timescale/pgai/blob/main/README.md)
- [projects/extension/README.md](https://github.com/timescale/pgai/blob/main/projects/extension/README.md)
- [examples/simple_fastapi_app/README.md](https://github.com/timescale/pgai/blob/main/examples/simple_fastapi_app/README.md)
- [examples/embeddings_from_documents/README.md](https://github.com/timescale/pgai/blob/main/examples/embeddings_from_documents/README.md)
- [examples/evaluations/litellm_vectorizer/README.md](https://github.com/timescale/pgai/blob/main/examples/evaluations/litellm_vectorizer/README.md)
- [examples/evaluations/voyage_vectorizer/README.md](https://github.com/timescale/pgai/blob/main/examples/evaluations/voyage_vectorizer/README.md)
- [scripts/vectorizer-load-test/README.md](https://github.com/timescale/pgai/blob/main/scripts/vectorizer-load-test/README.md)
- [Issue #919: HTTP client connection leak](https://github.com/timescale/pgai/issues/919)
- [Issue #921: ollama_embed error on ollama higher 13.3](https://github.com/timescale/pgai/issues/921)
- [Issue #925: cannot import name 'Worker' from 'pgai'](https://github.com/timescale/pgai/issues/925)
- [Issue #915: Outdated dependency versions](https://github.com/timescale/pgai/issues/915)

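This page reports upper-bounded dependency ranges such as `litellm` `>=1.65.0,<1.73.0` and `openai` `>=1.44,<2.0`. The sketch below shows how such a half-open range decides whether a given version is acceptable. It is a simplified illustration under stated assumptions: versions are plain dotted integers with at most three components, and the helper names `parse_version` and `in_range` are hypothetical. Real resolvers follow the full PEP 440 rules, which this does not attempt.

```python
def parse_version(version, width=3):
    """Turn '1.44' into (1, 44, 0); assumes dotted integers, at most `width` parts."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (width - len(parts))  # pad so '1.44' and '1.44.0' compare equal
    return tuple(parts)


def in_range(version, lower_inclusive, upper_exclusive):
    """True when lower_inclusive <= version < upper_exclusive."""
    v = parse_version(version)
    return parse_version(lower_inclusive) <= v < parse_version(upper_exclusive)


# Ranges as quoted on this page; the candidate versions are made-up inputs.
litellm_ok = in_range("1.72.9", "1.65.0", "1.73.0")
litellm_too_new = in_range("1.73.0", "1.65.0", "1.73.0")
openai_too_new = in_range("2.1.0", "1.44", "2.0")
```

The exclusive upper bound is what produces the conflict: a downstream project that needs a version at or above the bound cannot be installed alongside the constrained package in the same environment.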
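# Embedders, Worker & Operational Concerns

## 1. Overview and Basic Architecture

pgai's core goal is to turn PostgreSQL into a retrieval engine for production-grade RAG and agent applications. Its basic architecture has three parts: the application, the PostgreSQL database, and the stateless **Vectorizer Worker**. The application defines vectorizer configurations to embed data from sources such as PostgreSQL table columns or S3 documents. The worker reads those configurations, processes the data queue into embedding vectors and chunked text, and writes the results back to the database. The application then queries this data to power RAG and semantic search.

The key advantage of this architecture is **resilience**: the application's INSERT/UPDATE/DELETE operations are decoupled from the embedding process, so failures or delays in the embedding service do not block core data writes. Source: [README.md:1-120]()

```mermaid
flowchart LR
    A["Application<br/>INSERT/UPDATE"] --> B[("PostgreSQL<br/>ai schema")]
    B <-->|Poll config & queue| C["Vectorizer Worker"]
    C -->|HTTP calls| D["Embedding Providers<br/>OpenAI / Ollama / VoyageAI / LiteLLM"]
    D --> C
    C -->|write embedding + chunk| B
    B -->|semantic search / RAG| A
```

## 2. Embedder Ecosystem and Configuration

### 2.1 Supported Embedding Providers

pgai supports multiple embedding providers through a unified interface. The main options are Ollama, OpenAI, Voyage AI, Cohere, HuggingFace, Mistral, Azure OpenAI, AWS Bedrock, and Vertex AI. Of these, Cohere, HuggingFace, Mistral, Azure OpenAI, AWS Bedrock, and Vertex AI are integrated through the LiteLLM adapter layer. Source: [README.md:142-158]()

### 2.2 Declaring a Vectorizer in Python

Taking Ollama as an example, the `CreateVectorizer` SQL statement builder creates a vectorizer with `wiki` as the source table, `wiki_embedding_storage` as the target table, column loading as the loading method, and the local Ollama `all-minilm` model (384 dimensions) as the embedding provider. If the vectorizer already exists, the resulting exception is ignored. Source: [examples/simple_fastapi_app/README.md:60-84]()

```python
# Import paths are an assumption; check them against your installed pgai version.
from pgai.vectorizer import CreateVectorizer
from pgai.vectorizer.configuration import EmbeddingOllamaConfig, LoadingColumnConfig

# Build the SQL statement that creates the vectorizer.
vectorizer_statement = CreateVectorizer(
    source="wiki",
    target_table="wiki_embedding_storage",
    loading=LoadingColumnConfig(column_name="text"),
    embedding=EmbeddingOllamaConfig(
        model="all-minilm",
        dimensions=384,
        base_url="http://localhost:11434",
    ),
).to_sql()
```

LiteLLM can also be used to switch between models behind the same SQL interface for evaluation. For example, vectorizers for OpenAI `text-embedding-3-small` and Voyage AI `finance-2` can run side by side on an SEC filings dataset so that their retrieval quality can be compared. Source: [examples/evaluations/voyage_vectorizer/README.md:1-80]()

### 2.3 Calling Models Directly from SQL

Besides the vectorizer pipeline, the pgai extension allows LLMs to be called directly from SQL. For example, `ai.ollama_embed()` generates a query vector, and `ai.ollama_chat_complete()` assembles the prompt and generates the answer inside a RAG function. Source: [projects/extension/README.md:130-176]()

## 3. Installing and Running the Vectorizer Worker

### 3.1 Installation and Import in Python

Running `pgai.install(DB_URL)` at application startup creates the required objects under the `ai` schema in the database. The vectorizer worker usually runs together with the FastAPI application lifecycle, or in the background as a separate process, through the CLI, or in Docker. Source: [examples/simple_fastapi_app/README.md:40-58](), [README.md:166-210]()

### 3.2 Worker Import Path Problem

The community has reported that `from pgai import Worker` fails after `pip install "pgai[vectorizer-worker]"`. This suggests that `Worker` may not be exported from the top-level namespace in the affected installations; possible workarounds are importing it from a submodule (such as `pgai.vectorizer.worker`) or running the worker through the CLI or the Docker image. Source: [Issue #925]()

### 3.3 One-off Runs and Continuous Runs

For quick validation, `Worker(DB_URL, once=True)` makes the worker run a single pass and exit. In production it should run as a resident process that picks up work from the vectorizer configuration by polling. Source: [README.md:194-210]()

## 4. Operational Concerns and Known Issues

### 4.1 HTTP Client Connection Leak

Embedders such as `AsyncOpenAI`, `Ollama`, and `VoyageAI` create HTTP clients internally but do not close them explicitly. Over long runs, connections pile up in the `CLOSE_WAIT` state, file descriptors run out, and the process fails with "Too many open files". Mitigations include setting a file descriptor limit for the worker process, restarting the worker periodically, and watching for upstream connection-pool fixes when upgrading. Source: [Issue #919]()

### 4.2 Ollama Version Compatibility

After upgrading Ollama to a version above 13.3, the `ollama_embed` function raises an error. The problem is confirmed for pgai extension 0.11.2 and pgai library 0.12.1 on PostgreSQL 17 / Ubuntu 24.04. Operationally, either pin the local Ollama version or wait for pgai to release a compatibility fix. Source: [Issue #921]()

### 4.3 Overly Tight Dependency Constraints

pgai currently sets fairly tight upper bounds on key libraries such as `litellm` (`>=1.65.0,<1.73.0`) and `openai` (`>=1.44,<2.0`), which causes dependency conflicts with downstream projects that use newer AI/ML libraries. Resolving this requires either waiting for the bounds to be relaxed or using a dependency resolution strategy that lets the `[vectorizer-worker]` extra coexist with newer LLM libraries. Source: [Issue #915]()

### 4.4 Custom Base URLs and Third-Party Inference Services

With OpenAI-compatible inference services such as llama.cpp, `embedding_openai` counts tokens with tiktoken by default and raises an error when it meets an unsupported tokenizer. The vectorizer worker needs to support a custom base URL setting, or the OpenAI-compatible endpoint has to be declared under the LiteLLM adapter layer. Source: [Issue #850]()

### 4.5 Design Principle for Failure Handling

The README stresses that LLM endpoints suffer from intermittent failures and latency fluctuations. The right approach is to make sure primary data writes (INSERT/UPDATE/DELETE) do **not** depend on embedding operations, so that business availability is unaffected even when the embedding service is degraded. Batching, rate limiting, and retries are handled by the vectorizer itself. Source: [README.md:160-166]()

## 5. Quick Reference for Common Run Modes

| Scenario | Recommended approach | Reference |
| --- | --- | --- |
| Local prototype | `pgai.install(DB_URL)` + `Worker(DB_URL, once=True)` | [README.md:194-210]() |
| Web application | Install at FastAPI startup; the worker follows the application lifecycle | [examples/simple_fastapi_app/README.md:40-58]() |
| Bulk document embedding | `ai.loading_uri` + mount the document directory into the worker container | [examples/embeddings_from_documents/README.md:1-44]() |
| Multi-model evaluation | Attach several vectorizers through LiteLLM and compare retrieval quality | [examples/evaluations/litellm_vectorizer/README.md:1-40]() |
| Load testing | `scripts/vectorizer-load-test/` generates ~1.5M rows and stresses the worker | [scripts/vectorizer-load-test/README.md:1-22]() |

## See Also

- [README.md](https://github.com/timescale/pgai/blob/main/README.md): project overview and vectorizer pipeline
- [docs/vectorizer/overview.md](https://github.com/timescale/pgai/blob/main/docs/vectorizer/overview.md): vectorizer usage guide
- [docs/vectorizer/worker.md](https://github.com/timescale/pgai/blob/main/docs/vectorizer/worker.md): worker deployment details
- [docs/vectorizer/api-reference.md](https://github.com/timescale/pgai/blob/main/docs/vectorizer/api-reference.md): SQL API reference
- [projects/extension/README.md](https://github.com/timescale/pgai/blob/main/projects/extension/README.md): model-calling functions in the extension layer

---

## Doramagic Pitfall Log

Project: timescale/pgai

Summary: 13 potential pitfalls found, 2 of them high/blocking. Highest priority: Installation pitfall, source evidence: [Bug]: semantic catalog text to sql stuck in refinement iterations.

## 1. Installation pitfall · Source evidence: [Bug]: semantic catalog text to sql stuck in refinement iterations

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence shows an installation-related issue in this project that still needs verification: [Bug]: semantic catalog text to sql stuck in refinement iterations
- Impact on users: may raise the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/timescale/pgai/issues/926 | The source discussion mentions node-related conditions; re-check before installing or trialing.

## 2. Security/permissions pitfall · Source evidence: Outdated dependency versions are blocking adoption

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence shows a security/permissions-related issue in this project that still needs verification: Outdated dependency versions are blocking adoption
- Impact on users: may affect authorization, key configuration, or security boundaries.
- Evidence: community_evidence:github | https://github.com/timescale/pgai/issues/915 | The source discussion mentions python-related conditions; re-check before installing or trialing.

## 3. Installation pitfall · Source evidence: [Bug]: ollama_embed error on ollama higher 13.3

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows an installation-related issue in this project that still needs verification: [Bug]: ollama_embed error on ollama higher 13.3
- Impact on users: may raise the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/timescale/pgai/issues/921 | The source discussion mentions python-related conditions; re-check before installing or trialing.

## 4. Installation pitfall · Source evidence: cannot import name 'Worker' from 'pgai'

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows an installation-related issue in this project that still needs verification: cannot import name 'Worker' from 'pgai'
- Impact on users: may raise the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/timescale/pgai/issues/925 | Usage condition to be verified, surfaced by source type github_issue.

## 5. Capability pitfall · Source evidence: [Feature]: Use ReRank models (i.e. qwen-reranker) through Ollama

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows a capability-understanding issue in this project that still needs verification: [Feature]: Use ReRank models (i.e. qwen-reranker) through Ollama
- Impact on users: may raise the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/timescale/pgai/issues/866 | Usage condition to be verified, surfaced by source type github_issue.

## 6. Capability pitfall · Capability assessment depends on assumptions

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- Impact on users: if the assumption does not hold, users will not get the promised capability.
- Evidence: capability.assumptions | https://github.com/timescale/pgai | README/documentation is current enough for a first validation pass.

## 7. Runtime pitfall · Source evidence: HTTP client connection leak in embedders causes file descriptor exhaustion

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows a runtime-related issue in this project that still needs verification: HTTP client connection leak in embedders causes file descriptor exhaustion
- Impact on users: may raise the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/timescale/pgai/issues/919 | Usage condition to be verified, surfaced by source type github_issue.

## 8. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: last_activity_observed is not recorded.
- Impact on users: new, abandoned, and active projects get mixed together, which lowers trust in the recommendation.
- Evidence: evidence.maintainer_signals | https://github.com/timescale/pgai | last_activity_observed missing

## 9.

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- Evidence: downstream_validation.risk_items | https://github.com/timescale/pgai | no_demo; severity=medium

## 10. Security/permissions pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- Impact on users: the risk affects whether the project is suitable for ordinary users to install.
- Evidence: risks.scoring_risks | https://github.com/timescale/pgai | no_demo; severity=medium

## 11. Security/permissions pitfall · Source evidence: [Feature]: llama.cpp embedding worker integration

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence shows a security/permissions-related issue in this project that still needs verification: [Feature]: llama.cpp embedding worker integration
- Impact on users: may affect authorization, key configuration, or security boundaries.
- Evidence: community_evidence:github | https://github.com/timescale/pgai/issues/850 | Usage condition to be verified, surfaced by source type github_issue.

## 12. Maintenance pitfall · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- Impact on users: users cannot tell whether anyone will maintain the project when they run into problems.
- Evidence: evidence.maintainer_signals | https://github.com/timescale/pgai | issue_or_pr_quality=unknown

## 13. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- Impact on users: install commands and documentation may lag behind the code, which makes pitfalls more likely.
- Evidence: evidence.maintainer_signals | https://github.com/timescale/pgai | release_recency=unknown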