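# https://github.com/lm-sys/FastChat Project Guide

Generated: 2026-06-13 18:37:33 UTC

## Table of Contents

- [FastChat Overview and Distributed Serving Architecture](#page-overview-architecture)
- [Model Inference Backends and the OpenAI-Compatible API](#page-inference-backends)
- [MT-Bench Evaluation and Chatbot Arena](#page-evaluation-arena)
- [Training, Fine-Tuning, Data Pipeline, and Deployment Operations](#page-training-deployment)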

<a id='page-overview-architecture'></a>

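## FastChat Overview and Distributed Serving Architecture

### Related Pages

Related topics: [Model Inference Backends and the OpenAI-Compatible API](#page-inference-backends), [Training, Fine-Tuning, Data Pipeline, and Deployment Operations](#page-training-deployment)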

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/lm-sys/FastChat/blob/main/README.md)
- [playground/test_embedding/README.md](https://github.com/lm-sys/FastChat/blob/main/playground/test_embedding/README.md)
- [fastchat/llm_judge/README.md](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md)
- [fastchat/serve/monitor/classify/README.md](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/monitor/classify/README.md)
</details>

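# FastChat Overview and Distributed Serving Architecture

## Project Scope and Core Capabilities

FastChat is an open platform for training, serving, and evaluating large language model (LLM) chatbots, maintained by the LMSYS team. It powers Chatbot Arena ([lmarena.ai](https://lmarena.ai)), has served over 10 million chat requests for 70+ LLMs, and has collected over 1.5 million human preference votes for the online leaderboard [Source: [README.md:1-7]()]. Its core capabilities are:

- **Training and evaluation**: training and evaluation code for state-of-the-art models and benchmarks such as Vicuna and MT-Bench.
- **Distributed multi-model serving**: a distributed multi-model serving system with a Web UI and OpenAI-compatible RESTful APIs.
- **Ecosystem integration**: integration paths for LangChain, vLLM, SGLang, LightLLM, the HuggingFace Generation API, and others [Source: [README.md:11-15]()].

FastChat also ships training code, built on Stanford Alpaca's code, for fine-tuning base models such as Llama on ShareGPT-style conversation data [Source: [README.md:96-104]()].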

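## Distributed Serving Architecture

FastChat's serving system is built from three kinds of processes: **Controller, Model Worker, and Web Server**. Each has a distinct role, and workers can be added horizontally to handle concurrent requests and multi-model routing [Source: [README.md:152-176]()]. The diagram below shows the core components and the request path.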

```mermaid
flowchart LR
    U[User browser / API client] -->|HTTP / OpenAI API| W[Web or API server<br/>gradio_web_server / openai_api_server]
    W -->|Look up worker address| C[Controller<br/>fastchat.serve.controller]
    W -->|Generation request| M1[Model Worker 1<br/>Vicuna-7B]
    W -->|Generation request| M2[Model Worker 2<br/>FastChat-T5]
    W -->|Generation request| M3[Model Worker 3<br/>vLLM / SGLang backend]
    M1 -. Register / heartbeat .-> C
    M2 -. Register / heartbeat .-> C
    M3 -. Register / heartbeat .-> C
    M1 -->|Streamed output| W
    M2 -->|Streamed output| W
    M3 -->|Streamed output| W
```

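Component responsibilities:

| Component | Launch command | Responsibility |
| --- | --- | --- |
| Controller | `python3 -m fastchat.serve.controller` | Maintains the worker registry, tracks heartbeats, and resolves a model name to a worker address |
| Model Worker | `python3 -m fastchat.serve.model_worker --model-path ...` | Loads and hosts one or more models and exposes inference endpoints |
| WebServer | `python3 -m fastchat.serve.gradio_web_server` | Serves the Gradio UI; the OpenAI-compatible RESTful API has its own entry point, `python3 -m fastchat.serve.openai_api_server` |

The Controller never loads a model; it only coordinates. Model Workers do the actual inference and can be spread across GPUs to serve several models or several replicas in parallel. The web or API server asks the Controller for a worker address and then sends the generation request to that worker [Source: [README.md:152-176]()].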
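The coordination role described above can be illustrated with a small in-memory registry. This is a minimal sketch under stated assumptions, not FastChat's controller code: the class `ToyController`, its method names, the 90-second expiry, and the shortest-queue selection rule are all chosen here for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class WorkerInfo:
    model_names: list
    queue_length: int
    last_heartbeat: float


@dataclass
class ToyController:
    """Illustrative registry only; not FastChat's Controller class."""

    heartbeat_expiration: float = 90.0  # seconds; assumed value
    workers: dict = field(default_factory=dict)

    def register_worker(self, address, model_names, queue_length=0, now=None):
        now = time.time() if now is None else now
        self.workers[address] = WorkerInfo(list(model_names), queue_length, now)

    def receive_heartbeat(self, address, queue_length, now=None):
        now = time.time() if now is None else now
        info = self.workers.get(address)
        if info is None:
            return False  # unknown worker: the caller should register again
        info.queue_length = queue_length
        info.last_heartbeat = now
        return True

    def remove_stale_workers(self, now=None):
        now = time.time() if now is None else now
        stale = [addr for addr, w in self.workers.items()
                 if now - w.last_heartbeat > self.heartbeat_expiration]
        for addr in stale:
            del self.workers[addr]

    def get_worker_address(self, model_name):
        """Return the least-loaded worker serving model_name, or "" if none."""
        candidates = [(w.queue_length, addr) for addr, w in self.workers.items()
                      if model_name in w.model_names]
        return min(candidates)[1] if candidates else ""
```

The registry holds only addresses and load figures, which is why the coordinating process needs no GPU: the caller receives an address and talks to the worker itself.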

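## Deployment and Startup

A typical local deployment starts three processes in order [Source: [README.md:160-176]()]:

1. **Start the Controller**: the coordination hub, which listens on `http://localhost:21001` by default.
2. **Start a Model Worker**: `--model-path` takes a Hugging Face repo ID or a local weights path. Once the model has loaded, the worker registers itself with the controller; it is ready when the log shows `Uvicorn running on ...`.
3. **Start the Gradio WebServer**: the user-facing entry point. Before launching it, the `test_message.py` script can confirm that the worker is reachable through the controller.

To scale across GPUs, register several workers with the same controller, each with its own `CUDA_VISIBLE_DEVICES`, port, and model. This raises throughput for a single model or serves several models side by side [Source: [README.md:189-200]()]. For example: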

```
# worker 0
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.model_worker \
  --model-path lmsys/vicuna-7b-v1.5 \
  --controller http://localhost:21001 --port 31000 --worker http://localhost:31000
# worker 1
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.model_worker \
  --model-path lmsys/fastchat-t5-3b-v1.0 \
  --controller http://localhost:21001 --port 31001 --worker http://localhost:31001
```

When launching Chatbot Arena (the battle-style UI), API-hosted models (OpenAI, Anthropic, Gemini, Mistral, and others) can be registered with `--register-api-endpoint-file api_endpoint.json` [Source: [README.md:179-186]()].

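## Inference Backends and Model Ecosystem

FastChat's default Model Worker is built on Hugging Face Transformers, which offers the broadest compatibility but limited throughput. For high-throughput batched serving, see the [vLLM integration doc](docs/vllm_integration.md) [Source: [README.md:201-204]()]. Release v0.2.36 added an SGLang worker (with vision-language model support) and a LightLLM worker, and the OpenAI-compatible API server gained image input support [Source: v0.2.36 Release Notes]().

Supported models include Llama 2, Vicuna, Alpaca, Baize, ChatGLM, Dolly, Falcon, FastChat-T5, GPT4ALL, Guanaco, MPT, OpenAssistant, OpenChat, RedPajama, StableLM, WizardLM, xDAN-AI, and more. New models can be added by following the conventions in [docs/model_support.md](docs/model_support.md) [Source: [README.md:65-71]()]. For command-line inference, the `--device` flag selects CPU, CUDA, Metal (Apple Silicon), Intel XPU, or Ascend NPU, and `--load-8bit` enables 8-bit compression to reduce memory use [Source: [README.md:121-149]()].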

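## Evaluation and Extensions

### Automated MT-Bench Evaluation

The `fastchat/llm_judge` subpackage implements the MT-Bench workflow based on LLM-as-a-judge: a strong model such as GPT-4 acts as the judge and scores answers to multi-turn open-ended questions [Source: [fastchat/llm_judge/README.md:1-13]()]. `gen_model_answer.py` generates the answers, and `gen_judgment.py` calls the judge model to produce scores. The project also released 3.3K human judgments for checking how well the GPT-4 judge agrees with humans; the agreement exceeds 80% [Source: [fastchat/llm_judge/README.md:55-61]()].

### Embedding and Classification Extensions

`playground/test_embedding` demonstrates text similarity, classification, and semantic search using Vicuna and OpenAI embeddings [Source: [playground/test_embedding/README.md:1-13]()]. The `fastchat/serve/monitor/classify` submodule provides an extension point for labeling Chatbot Arena data by category (creative writing, vision, and so on): add a subclass in `category.py` and register it in `config.yaml` to plug in a custom classifier [Source: [fastchat/serve/monitor/classify/README.md:1-22]()].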

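## Known Risks and Community Concerns

When deploying the distributed service, keep these community-reported issues in mind:

- **Unauthenticated SSRF and worker spoofing**: the Controller's `/register_worker` endpoint has no authentication. An attacker can submit a crafted `worker_name` to trigger server-side request forgery and register a fake worker [Source: issue #3886](https://github.com/lm-sys/FastChat/issues/3886). Public deployments need reverse-proxy authentication or network isolation in front of the controller.
- **OpenAI API `stop` parameter regression**: the issue reports that the request-level `stop` parameter has not taken effect since v0.2.5, reportedly because the template's `conv.stop_str` is used instead of the value from the request, so custom stop strings are ignored [Source: issue #1048](https://github.com/lm-sys/FastChat/issues/1048). If you depend on this behavior, test it on the release you deploy, or pin a release earlier than v0.2.5.
- **FastChat-T5 "4K context" misunderstanding**: FastChat-T5 supports 2K input plus 2K output, 4K in total, rather than a 4K context for a single turn [Source: issue #1711](https://github.com/lm-sys/FastChat/issues/1711).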

## See Also

- Supported models and how to add a new one: [docs/model_support.md](docs/model_support.md)
- High-throughput vLLM integration: [docs/vllm_integration.md](docs/vllm_integration.md)
- OpenAI-compatible API guide: [docs/openai_api.md](docs/openai_api.md)
- MT-Bench evaluation workflow: [fastchat/llm_judge/README.md](fastchat/llm_judge/README.md)
- Data cleaning workflow: [docs/commands/data_cleaning.md](docs/commands/data_cleaning.md)

---

<a id='page-inference-backends'></a>

## Model Inference Backends and the OpenAI-Compatible API

### Related Pages

Related topics: [FastChat Overview and Distributed Serving Architecture](#page-overview-architecture), [MT-Bench Evaluation and Chatbot Arena](#page-evaluation-arena)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [fastchat/serve/model_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py)
- [fastchat/serve/vllm_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py)
- [fastchat/serve/sglang_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/sglang_worker.py)
- [fastchat/serve/lightllm_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/lightllm_worker.py)
- [fastchat/serve/dashinfer_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/dashinfer_worker.py)
- [fastchat/serve/mlx_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/mlx_worker.py)
- [fastchat/serve/openai_api_server.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py)
- [fastchat/serve/openai_api_protocol.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_protocol.py)
- [fastchat/serve/api.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/api.py)
- [fastchat/serve/api_provider.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/api_provider.py)
- [fastchat/serve/controller.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/controller.py)
- [fastchat/serve/gradio_web_server.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/gradio_web_server.py)
- [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py)
</details>

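# Model Inference Backends and the OpenAI-Compatible API

FastChat provides a pluggable abstraction over model inference backends and exposes an OpenAI-compatible RESTful API, so a local model can stand in for the OpenAI service and be called directly by the `openai-python` SDK, LangChain, third-party UIs, and similar tools [README.md](https://github.com/lm-sys/FastChat/blob/main/README.md). This page covers the backend implementations, message routing, OpenAI protocol conversion, and backend selection.

## 1. Architecture and Component Relationships

FastChat's distributed service consists of three kinds of processes: a web server (Gradio or the OpenAI-compatible entry point), the Controller (the registry), and Model Workers (one or more model processes) [README.md](https://github.com/lm-sys/FastChat/blob/main/README.md). The inference backends (HF Transformers, vLLM, SGLang, LightLLM, DashInfer, MLX) build on the shared `BaseModelWorker` base class (defined in `fastchat.serve.base_model_worker`) and expose the same HTTP endpoints, such as `/worker_get_status`, `/worker_generate_stream`, and `/count_token` [model_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py).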

```mermaid
flowchart LR
    Client[Client<br/>openai-python / LangChain] -->|HTTP /v1/chat/completions| OAI[OpenAI API Server<br/>openai_api_server.py]
    Client -->|Web UI| GW[Gradio Web Server<br/>gradio_web_server.py]
    OAI -->|Worker lookup| Ctl[Controller<br/>controller.py]
    GW -->|Worker lookup| Ctl
    MW[Model Worker<br/>model_worker.py] -->|Register / heartbeat| Ctl
    MW --> Backend{Inference backend}
    Backend --> HF[huggingface/transformers]
    Backend --> VLLM[vllm_worker.py]
    Backend --> SG[sglang_worker.py]
    Backend --> LL[lightllm_worker.py]
    Backend --> DI[dashinfer_worker.py]
    Backend --> MLX[mlx_worker.py]
```

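## 2. Inference Backends (Model Worker)

`BaseModelWorker` defines the protocol shared by all backends: registration and heartbeats (`register_to_controller`, `send_heart_beat`), status reporting (`get_status`), token counting (`count_token`), conversation template lookup (`get_conv_template`), and the generation and embedding hooks that each backend overrides (`generate_stream_gate`, `get_embeddings`) [model_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py). The backends trade off along these dimensions:

| Backend | Entry point | Characteristics | Typical use |
| --- | --- | --- | --- |
| HuggingFace Transformers | `fastchat.serve.model_worker` | Default implementation, broadest compatibility | Debugging, small deployments |
| vLLM | `vllm_worker.py` | High throughput through continuous batching | Production serving at high QPS |
| SGLang | `sglang_worker.py` | VLM support, structured prompt programming | Multimodal or low-latency serving |
| LightLLM | `lightllm_worker.py` | High-throughput token-level scheduling | Resource-constrained deployments |
| DashInfer | `dashinfer_worker.py` | Alibaba's (ModelScope) open-source inference engine, optimized for CPU inference | CPU-based serving |
| MLX | `mlx_worker.py` | Optimized for Apple Silicon | Local inference on macOS |

For example, `VLLMWorker` builds `AsyncEngineArgs`, creates the vLLM async engine, and streams output from `engine.generate` [vllm_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py). `SGLangWorker` relies on the SGLang runtime (RadixAttention) and supports vision-language models (VLMs), matching the SGLang vision worker added in v0.2.36 [sglang_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/sglang_worker.py).

On startup each worker registers with the Controller through the `/register_worker` endpoint and then sends periodic heartbeats. The community has reported that this endpoint lacks authentication, which exposes SSRF and worker/model spoofing risks [#3886](https://github.com/lm-sys/FastChat/issues/3886); production deployments should restrict access at the network layer or behind a reverse proxy.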

## 3. OpenAI-Compatible API

### Protocol Layer

`openai_api_protocol.py` defines Pydantic data models compatible with the official OpenAI SDK, including `ChatCompletionRequest`, `ChatCompletionResponse`, and `ChatMessage`. They cover parameters such as `model`, `messages`, `temperature`, `top_p`, `n`, `stream`, `stop`, `max_tokens`, `presence_penalty`, `frequency_penalty`, and `user` [openai_api_protocol.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_protocol.py). Since v0.2.36, the `messages` field accepts `image_url` content blocks, which enables vision-language input [openai_api_protocol.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_protocol.py).

### Routing and Conversation Templates

When a request arrives, `openai_api_server.py` queries the Controller (`get_worker_address`) for the address of a worker hosting the requested model name. It then uses a `Conversation` template (`fastchat.conversation`) to convert the OpenAI-style `messages` array into the prompt string expected by that model [openai_api_server.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py). How `stop` is applied depends on the template: according to the issue report, a request could override `conv.stop_str` before v0.2.5, but from v0.2.5 onward the template's own `stop_str` is used and the request-level `stop` no longer takes effect [#1048](https://github.com/lm-sys/FastChat/issues/1048).
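The conversion from an OpenAI-style `messages` array to a single prompt string can be sketched as follows. This is a minimal illustration under stated assumptions: `ToyTemplate`, its default system text, the `USER`/`ASSISTANT` role names, and the single-space separator are hypothetical and do not reproduce any real template in `fastchat.conversation`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTemplate:
    """Hypothetical template; real templates live in fastchat.conversation."""

    system: str = "A chat between a curious user and an assistant."
    roles: tuple = ("USER", "ASSISTANT")
    sep: str = " "
    messages: list = field(default_factory=list)

    def append_message(self, role, text):
        self.messages.append((role, text))

    def get_prompt(self):
        parts = [self.system]
        for role, text in self.messages:
            # text=None marks the open slot that the model should complete.
            parts.append(f"{role}: {text}" if text is not None else f"{role}:")
        return self.sep.join(parts)


def openai_messages_to_prompt(messages):
    """Map system/user/assistant messages onto the template's roles."""
    conv = ToyTemplate()
    for msg in messages:
        if msg["role"] == "system":
            conv.system = msg["content"]
        elif msg["role"] == "user":
            conv.append_message(conv.roles[0], msg["content"])
        elif msg["role"] == "assistant":
            conv.append_message(conv.roles[1], msg["content"])
        else:
            raise ValueError(f"unknown role: {msg['role']}")
    conv.append_message(conv.roles[1], None)  # leave the assistant turn open
    return conv.get_prompt()
```

Because the separator and role names belong to the template, the same `messages` array yields a different prompt for each model family, which is why a wrong template degrades output quality even when the request itself is valid.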

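### Streaming Responses

With `stream=True`, the server returns `text/event-stream` (SSE) chunks whose `delta` field carries the incremental content; the final chunk sets `finish_reason` (for example `stop`). The non-streaming path aggregates the generated tokens internally and returns them in a single response [openai_api_server.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py). The `get_response` helper in `api.py` is described as reusable by custom proxies that add caching, retries, or logging [api.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/api.py).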
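A client reassembles the streamed answer by concatenating the `delta` fragments. The sketch below assumes the usual OpenAI-style chunk layout (`choices[0].delta.content`, a `data:` prefix per event, and a final `data: [DONE]` sentinel); the function name `collect_stream_text` is illustrative, and real clients would normally use an SDK instead.

```python
import json


def collect_stream_text(sse_lines):
    """Concatenate delta.content from OpenAI-style chat completion chunks."""
    pieces, finish_reason = [], None
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        # The first chunk usually carries only the role, so content may be absent.
        pieces.append(choice.get("delta", {}).get("content") or "")
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(pieces), finish_reason
```

Passing the lines of an HTTP response body to this function yields the full text plus the reason generation stopped, which is useful when checking whether a custom `stop` string actually took effect.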

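### External API Providers

`api_provider.py` wraps hosted models from OpenAI, Anthropic, Google Gemini, Mistral, and others as clients that behave like local workers, so they can be mixed with local models in the multi-tab Gradio UI and in the Arena [api_provider.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/api_provider.py). This is what lets Chatbot Arena put 70+ LLMs in the same battles [README.md](https://github.com/lm-sys/FastChat/blob/main/README.md).

## 4. Backend Selection and Common Issues

**Startup order**: start the Controller first, then the Workers, then the API server or Gradio; otherwise worker registration fails [README.md](https://github.com/lm-sys/FastChat/blob/main/README.md). Use `python3 -m fastchat.serve.test_message --model-name <name>` to verify that a worker is reachable through the Controller [README.md](https://github.com/lm-sys/FastChat/blob/main/README.md).

**Memory and quantization**: the default HF worker has a high memory footprint. Enable 8-bit compression with `--load-8bit` to reduce it, or switch to vLLM or SGLang for higher throughput [README.md](https://github.com/lm-sys/FastChat/blob/main/README.md). On macOS, the MLX worker uses Apple Silicon's unified memory, so there is no separate VRAM limit to manage [mlx_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/mlx_worker.py).

**Context length**: FastChat-T5 is labeled as supporting 4K, but community feedback indicates this means "2K input + 2K output" rather than 4K of contiguous input [#1711](https://github.com/lm-sys/FastChat/issues/1711). Check each model's own length limits before relying on an advertised context size.

**Known gaps**: the `stop` parameter does not take effect with some templates (see above); some hosted models in `api_provider.py` do not implement tool calling (function calling); the Controller's `/register_worker` endpoint has no authentication [#3886](https://github.com/lm-sys/FastChat/issues/3886).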

---

**See Also**
- [Server architecture overview](https://github.com/lm-sys/FastChat/blob/main/docs/server_arch.md)
- [vLLM integration doc](https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md)
- [OpenAI API doc](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md)
- [LangChain integration](https://github.com/lm-sys/FastChat/blob/main/docs/langchain_integration.md)
- [Supported models](https://github.com/lm-sys/FastChat/blob/main/docs/model_support.md)

---

<a id='page-evaluation-arena'></a>

## MT-Bench Evaluation and Chatbot Arena

### Related Pages

Related topics: [FastChat Overview and Distributed Serving Architecture](#page-overview-architecture), [Model Inference Backends and the OpenAI-Compatible API](#page-inference-backends)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/lm-sys/FastChat/blob/main/README.md)
- [fastchat/llm_judge/README.md](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md)
- [fastchat/llm_judge/common.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/common.py)
- [fastchat/llm_judge/gen_model_answer.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/gen_model_answer.py)
- [fastchat/llm_judge/gen_judgment.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/gen_judgment.py)
- [fastchat/llm_judge/gen_api_answer.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/gen_api_answer.py)
- [fastchat/llm_judge/show_result.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/show_result.py)
- [fastchat/llm_judge/qa_browser.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/qa_browser.py)
</details>

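# MT-Bench Evaluation and Chatbot Arena

## Overview

MT-Bench is a set of multi-turn open-ended questions for evaluating chat assistants. It follows the "LLM-as-a-judge" approach: a strong LLM such as GPT-4 acts as the judge and scores response quality automatically. Chatbot Arena is a side-by-side battle web interface built on FastChat and hosted at [lmarena.ai](https://lmarena.ai/). Real users compare the responses of two anonymous models and vote, and the accumulated human preference data feeds the [LLM Elo leaderboard](https://lmarena.ai/?leaderboard).

Source: [README.md:1-12]() describes FastChat as an open platform for training, serving, and evaluating LLM chatbots, and reports that Chatbot Arena has served over 10 million chat requests and collected over 1.5 million human votes. Source: [README.md:13-21]() adds that FastChat provides training and evaluation code for Vicuna and MT-Bench, along with the distributed multi-model serving system.

The community has also raised Arena-related questions. For example, issue #3296 asks which model "gpt2-chatbot" is, and issue #1504 requests adding Falcon 40B, which shows continued interest in the Arena's transparency and model coverage.

## MT-Bench Evaluation Workflow

The core workflow has two stages. In stage one, the model under evaluation answers the MT-Bench questions. In stage two, a judge model (usually GPT-4) grades the answers, either pairwise or one answer at a time.

### Stage 1: Generate Model Answers

The `fastchat/llm_judge` package contains the full set of evaluation scripts. Installation requires the `llm_judge` extra: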

```bash
pip install -e ".[model_worker,llm_judge]"
```

Source: [fastchat/llm_judge/README.md:11-17]() lists the installation steps and the `download_mt_bench_pregenerated.py` tool for downloading pre-generated data.

Generate answers with:

```bash
python gen_model_answer.py --model-path lmsys/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5
```

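Source: [fastchat/llm_judge/README.md:30-45]() explains that `[MODEL-PATH]` can be a local path or a Hugging Face repo ID and that `[MODEL-ID]` is a name you choose for the model. Output is saved to `data/mt_bench/model_answer/[MODEL-ID].jsonl`.

For very large models such as 65B, use `--num-gpus-per-model` for model parallelism and `--num-gpus-total` to parallelize answer generation across GPUs. For models that vLLM supports, answers can also be generated through vLLM, which can be faster: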

```bash
vllm serve [MODEL-PATH] --dtype auto
python gen_api_answer.py --model [MODEL-NAME] --openai-api-base http://localhost:8000/v1 --parallel 50
```

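Source: [fastchat/llm_judge/README.md:64-78]() describes this two-step flow with vLLM as the inference backend, where `--parallel` sets the number of concurrent requests.

### Stage 2: Judge Scoring

Source: [fastchat/llm_judge/README.md:42-55]() describes how the judge works: GPT-4 grades each model answer against a predefined judging prompt and produces a structured judgment. Afterwards, `show_result.py` summarizes the scores, and `qa_browser.py --share` starts a local QA browser for inspecting answers and judgments.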

```mermaid
flowchart LR
    A[MT-Bench question set] --> C[gen_model_answer.py<br/>or gen_api_answer.py]
    C --> B[Model under evaluation<br/>generates answers]
    B --> D[model_answer JSONL]
    D --> F[gen_judgment.py]
    F --> E[Judge model GPT-4]
    E --> G[judgment JSONL]
    G --> H[show_result.py<br/>summarize scores]
    D --> I[qa_browser.py<br/>visual browsing]
```
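The score summary step at the end of this flow amounts to averaging judge grades per model and turn. The sketch below is an illustration, not `show_result.py` itself: the record fields `model`, `turn`, and `score`, and the convention that a score of `-1` marks an unparsable judgment, are assumptions about the judgment JSONL layout and should be checked against the actual files.

```python
import json
from collections import defaultdict


def mean_scores(jsonl_lines):
    """Average single-answer grades per (model, turn), skipping score -1."""
    totals = defaultdict(lambda: [0.0, 0])  # (model, turn) -> [sum, count]
    for line in jsonl_lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec["score"] == -1:  # assumed marker for a failed score extraction
            continue
        bucket = totals[(rec["model"], rec["turn"])]
        bucket[0] += rec["score"]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

Keeping the key as `(model, turn)` makes it easy to report first-turn and second-turn averages separately, since multi-turn quality often differs from single-turn quality.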

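### Agreement Metrics

Source: [fastchat/llm_judge/README.md:80-89]() notes that the project released 3.3K human annotations covering 6 models on the 80 MT-Bench questions, hosted in the `lmsys/mt_bench_human_judgments` dataset. A Colab notebook computes the agreement between human judges and the GPT-4 judge; the result exceeds 80%, on par with the agreement between two humans.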
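An agreement rate of this kind is the fraction of shared comparisons on which two judges pick the same winner. The following sketch shows the arithmetic only; the vote encoding (a dict keyed by a comparison identifier with a winner label as the value) is an assumption for illustration, and the project's notebook may treat ties and multiple votes per comparison differently.

```python
def agreement_rate(votes_a, votes_b):
    """Fraction of shared comparisons on which two judges pick the same winner."""
    shared = votes_a.keys() & votes_b.keys()
    if not shared:
        raise ValueError("no overlapping comparisons")
    same = sum(votes_a[key] == votes_b[key] for key in shared)
    return same / len(shared)
```

Only comparisons that both judges rated enter the denominator, so the rate is not diluted by items one side never saw.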

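## Chatbot Arena Side-by-Side Battle Interface

Chatbot Arena is FastChat's multi-model side-by-side battle web interface. Users vote on the responses of two models (win, tie, or loss) without knowing which models they are. FastChat supports popular API models such as OpenAI, Anthropic, Gemini, and Mistral, as well as local GPU models.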
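Pairwise votes like these can be turned into ratings with the standard Elo update. The sketch below shows the textbook formula only; the starting rating of 1000, the K-factor of 4, and the function names are assumptions, and the live leaderboard's actual rating procedure may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=4.0, initial=1000.0):
    """Apply one vote. outcome: 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie."""
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (outcome - ea)
    ratings[model_b] = rb + k * ((1.0 - outcome) - (1.0 - ea))
    return ratings
```

Each update moves the two ratings by equal and opposite amounts, so the total rating mass is conserved and a tie between equally rated models changes nothing.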

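### Launch Steps

Source: [README.md:174-198]() describes the full launch procedure. First configure each model's API endpoint in `api_endpoint.json` (fields such as `model_name`, `api_base`, `api_type`, `api_key`, and `anony_only`), then run the multi-tab Gradio server: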

```bash
python3 -m fastchat.serve.gradio_web_server_multi --register-api-endpoint-file api_endpoint.json
```
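The endpoint file passed to this command can be generated from a script instead of being edited by hand. The sketch below uses only the field names listed above; the top-level layout (one JSON object keyed by a display name) and the helper names `make_endpoint` and `build_endpoint_config` are assumptions, so compare the output with the README example before using it.

```python
import json


def make_endpoint(model_name, api_base, api_key, api_type="openai", anony_only=False):
    """Build one entry using the field names listed above."""
    return {
        "model_name": model_name,
        "api_type": api_type,
        "api_base": api_base,
        "api_key": api_key,
        "anony_only": anony_only,
    }


def build_endpoint_config(entries):
    """Serialize entries keyed by display name; reject blank string fields."""
    config = {}
    for display_name, entry in entries.items():
        blank = [k for k, v in entry.items() if isinstance(v, str) and not v.strip()]
        if blank:
            raise ValueError(f"{display_name}: blank fields {blank}")
        config[display_name] = entry
    return json.dumps(config, indent=2)
```

Writing the returned string to `api_endpoint.json` keeps API keys out of hand-edited files, for example by reading them from environment variables in the calling script.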

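For local GPU models, first start the three services: controller, model worker, and gradio_web_server. Source: [README.md:142-166]() details the distributed architecture: the controller manages registered workers; each worker hosts a model, sends heartbeats to the controller, and reports its state through `/worker_get_status`; the Gradio web server is the user-facing layer.

### Advanced Features

Source: [README.md:200-217]() lists the following extensions:

- Register multiple workers with a single controller by giving each worker its own GPU and port, either to raise throughput or to serve several models at once.
- Launch `gradio_web_server_multi` to get a multi-tab Gradio interface that includes the Chatbot Arena tabs.
- For high-throughput batched serving, integrate the vLLM worker as described in [vllm_integration.md](https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md).
- For third-party UIs, see [third_party_ui.md](https://github.com/lm-sys/FastChat/blob/main/docs/third_party_ui.md).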

## Datasets and Releases

Source: [README.md:13-21]() lists the datasets the FastChat team has released around Chatbot Arena:

| Dataset | Description | Link |
| --- | --- | --- |
| Chatbot Arena Conversations | 33K real user conversations with preference labels | [lmsys/chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) |
| LMSYS-Chat-1M | Large-scale real-world LLM conversation dataset | See the paper [arxiv.org/abs/2309.11998](https://arxiv.org/abs/2309.11998) |
| MT-bench Human Annotations | Human annotations for validating judge agreement | [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments) |

Source: [fastchat/llm_judge/README.md:90-94]() confirms the availability of these datasets and the citation convention: papers that use MT-Bench should cite the `zheng2023judging` BibTeX entry.

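## Known Issues and Caveats

Based on community context, note the following:

1. **Using the MT-Bench dataset**: make sure the model is loaded with the correct conversation template during evaluation; otherwise inconsistent answer formatting will skew the judge's scores. Source: [fastchat/llm_judge/README.md:38-42]() points to `docs/model_support.md` for supported models and how to add new ones.
2. **Viewing pre-generated data**: on first use, run `download_mt_bench_pregenerated.py` to download the pre-generated answers and judgments from Hugging Face, then browse them locally with `qa_browser.py --share`.
3. **Inference speed**: the default Transformers model worker is highly compatible but has limited throughput; when evaluating large models, consider switching to the vLLM backend to speed up answer generation. Source: [fastchat/llm_judge/README.md:62-78]().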

## See Also

- [MT-Bench evaluation scripts](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge)
- [Chatbot Arena technical report](https://arxiv.org/abs/2403.04132)
- [LMSYS-Chat-1M paper](https://arxiv.org/abs/2309.11998)
- [FastChat distributed serving architecture](https://github.com/lm-sys/FastChat/blob/main/docs/server_arch.md)
- [Supported models and how to add new ones](https://github.com/lm-sys/FastChat/blob/main/docs/model_support.md)
- [vLLM integration doc](https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md)

---

<a id='page-training-deployment'></a>

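## Training, Fine-Tuning, Data Pipeline, and Deployment Operations

### Related Pages

Related topics: [FastChat Overview and Distributed Serving Architecture](#page-overview-architecture), [Model Inference Backends and the OpenAI-Compatible API](#page-inference-backends)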

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/lm-sys/FastChat/blob/main/README.md)
- [fastchat/train/train.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train.py)
- [fastchat/train/train_mem.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_mem.py)
- [fastchat/train/train_lora.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_lora.py)
- [fastchat/serve/controller.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/controller.py)
- [fastchat/serve/model_worker.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py)
- [fastchat/serve/gradio_web_server.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/gradio_web_server.py)
- [fastchat/serve/cli.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/cli.py)
- [docs/commands/data_cleaning.md](https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md)
- [docs/server_arch.md](https://github.com/lm-sys/FastChat/blob/main/docs/server_arch.md)
- [docs/training.md](https://github.com/lm-sys/FastChat/blob/main/docs/training.md)
</details>

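# Training, Fine-Tuning, Data Pipeline, and Deployment Operations

## 1. Overview and Scope

FastChat is an open platform for LLM chatbots that covers four stages: training, supervised fine-tuning (SFT), evaluation, and serving [README.md:1-15](README.md). Its core capabilities are:

- **Training and evaluation**: training code and evaluation pipelines for models and benchmarks such as Vicuna and MT-Bench.
- **Distributed multi-model serving**: a Controller + Model Worker + Web Server architecture, the same one that serves 70+ LLMs in Chatbot Arena.
- **Ecosystem integration**: compatible with the OpenAI RESTful API, LangChain, the Hugging Face Generation API, and other access paths.

This page follows the end-to-end path from raw data to production deployment: training and fine-tuning scripts, the data cleaning pipeline, and deployment and operations across hardware types and backends.

## 2. Training and Fine-Tuning

### 2.1 Training Entry Points and Scripts

FastChat provides several training scripts for different memory budgets and hardware [README.md:108-130](README.md):

| Script | Use case | Key feature |
| --- | --- | --- |
| `fastchat/train/train.py` | Standard single-node or multi-node training | Adapted from Stanford Alpaca |
| `fastchat/train/train_mem.py` | Memory-constrained environments | Enables FlashAttention to reduce memory use |
| `fastchat/train/train_xformers.py` | xFormers acceleration | Memory and speed optimizations |
| `fastchat/train/train_lora.py` | Lightweight LoRA fine-tuning | Targets the Llama family |
| `fastchat/train/train_lora_t5.py` | LoRA fine-tuning for FastChat-T5 | Adapted to the T5 encoder-decoder |
| `fastchat/train/train_flant5.py` | Full fine-tuning of Flan-T5 | Updates all parameters |

A typical launch command [README.md:108-115](README.md):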

```bash
torchrun --nproc-per-node 4 fastchat/train/train_mem.py \
  --model_name_or_path lmsys/vicuna-7b-v1.5 \
  --data_path data/dummy_conversation.json \
  --bf16 True --output_dir output
```

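### 2.2 LoRA and Low-Resource Training

`fastchat/train/train_lora.py` supports lightweight fine-tuning of 7B/13B models with a much smaller memory footprint than full fine-tuning [README.md:115-118](README.md). Combined with quantized base weights through `bitsandbytes` (QLoRA, enabled in the training docs with `--q_lora True`), fine-tuning may fit on a consumer GPU with about 16GB of memory, depending on sequence length and batch size.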
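The memory saving comes from training two small low-rank matrices instead of the full weight matrix. The arithmetic below follows the standard LoRA formulation (an adapter of rank `r` on a `d_out x d_in` weight adds `r * (d_in + d_out)` parameters); the 4096 dimension and rank 8 used in the usage note are illustrative values, not settings taken from FastChat's scripts.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in, d_out, rank):
    """Adapter parameters as a fraction of the frozen (d_out x d_in) weight."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)
```

For a 4096 x 4096 projection with rank 8, the adapter has 65,536 trainable parameters against 16,777,216 frozen ones, about 0.39%, so gradients and optimizer state are kept for only a small fraction of the weights.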

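### 2.3 Common Failure Modes

A typical problem reported by the community [GitHub Issue #1711](https://github.com/lm-sys/FastChat/issues/1711): FastChat-T5 is advertised as supporting a 4K context, but in practice it supports 2K input plus 2K output, and concatenated context beyond 2K triggers an error. The limit reflects how the encoder-decoder model splits its length budget between input and output rather than a misconfiguration.

## 3. Data Pipeline

### 3.1 Data Format

FastChat training data uses a ShareGPT-style multi-turn conversation JSON format [README.md:50-58](README.md). The core structure is: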

```json
[
  {
    "id": "unique-id",
    "conversations": [
      {"from": "human", "value": "Hello"},
      {"from": "gpt", "value": "Hello! How can I help you?"}
    ]
  }
]
```

The bundled [data/dummy_conversation.json](data/dummy_conversation.json) works for a smoke test before you swap in real data.
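Before training, it can help to check that each record matches this structure. The validator below is a minimal sketch: the function name is hypothetical, and the rule that turns strictly alternate starting with `human` is an assumption made here for illustration; real datasets may contain other role names that the training code handles differently.

```python
def validate_record(record):
    """Return a list of problems found in one ShareGPT-style record."""
    problems = []
    if not isinstance(record.get("id"), str) or not record.get("id"):
        problems.append("missing id")
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return problems + ["missing conversations"]
    expected = "human"  # assumed: conversations start with the human side
    for i, turn in enumerate(turns):
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected '{expected}', got {turn.get('from')!r}")
        if not isinstance(turn.get("value"), str) or not turn["value"].strip():
            problems.append(f"turn {i}: empty value")
        expected = "gpt" if expected == "human" else "human"
    return problems
```

Running this over every record and logging the non-empty results gives a quick picture of how much cleaning a dataset still needs.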

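### 3.2 ShareGPT Data Cleaning

The raw ShareGPT data was gathered from ShareGPT.com through its public APIs. It then goes through HTML-to-Markdown conversion, filtering of low-quality samples, and splitting of overly long conversations into segments [README.md:51-55](README.md). The cleaning commands and parameters are documented in [docs/commands/data_cleaning.md](docs/commands/data_cleaning.md).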
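The splitting step can be pictured as greedy packing of whole question-and-answer rounds into segments that fit a length budget. This is a sketch of the idea, not the project's cleaning script: the function name is hypothetical, whitespace word count stands in for a real tokenizer, and a single round longer than the budget is kept as its own segment rather than truncated.

```python
def split_conversation(turns, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily pack whole (human, gpt) rounds into segments of at most max_tokens."""
    segments, current, used = [], [], 0
    # Step over complete pairs; a trailing unpaired turn is dropped.
    for i in range(0, len(turns) - 1, 2):
        pair = turns[i:i + 2]
        cost = sum(count_tokens(t["value"]) for t in pair)
        if current and used + cost > max_tokens:
            segments.append(current)
            current, used = [], 0
        current.extend(pair)
        used += cost
    if current:
        segments.append(current)
    return segments
```

Splitting only at round boundaries keeps every segment a valid conversation that starts with a human turn, which matters because each segment becomes an independent training example.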

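> **Community concern**: users have long asked for the raw ShareGPT dataset to be released [GitHub Issue #90](https://github.com/lm-sys/FastChat/issues/90). The maintainers have not released it, citing concerns around the data; the repository provides the cleaning scripts and dummy data instead.

### 3.3 MT-Bench Evaluation Data

Evaluation uses the 80 multi-turn open-ended MT-Bench questions, graded by GPT-4 as the judge [fastchat/llm_judge/README.md:30-45](fastchat/llm_judge/README.md). Answer generation, judgment generation, and score aggregation are three decoupled steps, and several grading modes are supported (single-answer grading, pairwise-baseline, pairwise-all).

## 4. Deployment and Operations

### 4.1 Distributed Serving Architecture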

```mermaid
flowchart LR
    User[User browser] -->|HTTPS| Web[Gradio Web Server<br/>gradio_web_server.py]
    Client[API client] -->|REST| API[OpenAI-Compatible API<br/>openai_api_server.py]
    Web -->|Worker lookup| Ctrl[Controller<br/>controller.py]
    API -->|Worker lookup| Ctrl
    Ctrl <-->|Heartbeat / status| MW1[Model Worker 1<br/>model_worker.py]
    Ctrl <-->|Heartbeat / status| MW2[Model Worker 2<br/>model_worker.py]
    Ctrl <-->|Heartbeat / status| MW3[Model Worker N<br/>vLLM/SGLang/LightLLM]
    MW1 -->|Inference| GPU1[(GPU cluster)]
    MW2 -->|Inference| GPU2[(GPU cluster)]
```

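The architecture consists of three kinds of processes [README.md:32-50](README.md): the Controller handles worker registration and routing; Model Workers run model inference; web servers provide the Gradio UI and the OpenAI-compatible API. A worker can be swapped for a high-performance backend such as vLLM, SGLang, or LightLLM to raise throughput [Release v0.2.36](https://github.com/lm-sys/FastChat/releases/tag/v0.2.36).

### 4.2 Multi-Hardware Deployment

`fastchat/serve/cli.py` supports several hardware targets [README.md:90-105](README.md):

- **Single GPU**: about 14GB of GPU memory for Vicuna-7B and 28GB for Vicuna-13B.
- **Multiple GPUs**: `--num-gpus` and `--max-gpu-memory` control how the model is split.
- **CPU only**: about 30GB of RAM for 7B and 60GB for 13B.
- **Intel XPU**: `--device xpu` enables Intel GPU (Arc) acceleration.
- **Ascend NPU**: `--device npu` enables Ascend acceleration.
- **Apple Metal**: enabled with `--device mps`.
- **When memory is short**: `--load-8bit` enables 8-bit compression, roughly halving memory use.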
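These figures track a simple rule of thumb: weight memory is roughly the parameter count times the bytes per parameter. The helper below is a back-of-the-envelope sketch, assuming 2 bytes per parameter for fp16, 4 for fp32, and 1 for int8, with 1 GB taken as 1e9 bytes. It counts weights only, so real usage (activations, KV cache, framework overhead) is higher.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params_billion, dtype="fp16"):
    """Weights-only estimate in GB (1 GB = 1e9 bytes); excludes activations and KV cache."""
    return num_params_billion * BYTES_PER_PARAM[dtype]
```

For a 7B model this gives 14 GB in fp16 and 7 GB in int8, which matches the 14GB figure and the "roughly half" effect of `--load-8bit` quoted above. The quoted 13B and CPU figures sit somewhat above the weights-only estimate.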

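### 4.3 Operations and Security Notes

**Security warning**: the Controller's `/register_worker` endpoint performs no identity check. An attacker can forge the `worker_name` field to trigger server-side request forgery (SSRF) or register a fake model [GitHub Issue #3886](https://github.com/lm-sys/FastChat/issues/3886). For production deployments:

1. Bind the Controller to an internal address, or put it behind a reverse proxy with an allowlist.
2. Enable `WORKER_API_KEY` checking on the worker side where the release you run supports it (the source describes this as available in some versions from v0.2.30; verify before relying on it).
3. Do not expose ports `21001` (Controller) and `21002` (Worker) to the public network.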
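The allowlist idea in step 1 can also be enforced in a proxy or wrapper before a registration request reaches the controller. The check below is a hypothetical mitigation sketch, not part of FastChat: the function name, the allowed networks, and the policy of rejecting hostnames outright are all assumptions chosen for illustration.

```python
import ipaddress
from urllib.parse import urlparse

# Assumed internal ranges; replace with the networks your workers actually use.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("127.0.0.0/8"),
]


def is_allowed_worker_address(url, allowed_networks=ALLOWED_NETWORKS):
    """Accept only http(s) URLs whose host is a literal IP inside an allowed network."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        host = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        return False  # hostnames are rejected so DNS cannot redirect the request
    return any(host in net for net in allowed_networks)
```

Rejecting anything that is not a literal IP in a known range blocks the common SSRF targets, such as cloud metadata addresses, at the cost of requiring workers to register by IP.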

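**Monitoring and logging**: the built-in `fastchat/serve/monitor/` module includes category classifiers for labeling conversation data, which can support content review [fastchat/serve/monitor/classify/README.md:1-10](fastchat/serve/monitor/classify/README.md).

### 4.4 Elastic Training in the Cloud

With [SkyPilot](https://github.com/skypilot-org/skypilot), Vicuna can be trained on AWS, GCP, Azure, Lambda, and other clouds using spot instances, which lowers cost considerably [README.md:130-135](README.md).

## 5. End-to-End Best Practices

| Stage | Recommendation |
| --- | --- |
| Data preparation | Clean data with the `data_cleaning.md` workflow; keep the dummy data as a CI smoke test |
| Model training | Use `train_mem.py` with bf16 for 7B/13B; use `train_lora.py` (optionally with QLoRA) when memory is tight |
| Evaluation | Run MT-Bench with the GPT-4 judge; prefer the vLLM backend to speed up answer generation |
| Production deployment | Isolate the Controller and Workers on an internal network; use SGLang or vLLM workers; monitor and classify conversations |
| Gradual rollout | Compare models in the Arena to collect user preferences, then iterate on prompts and fine-tuning data |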

## See Also

- [Architecture overview: how Controller, Worker, and WebServer cooperate](docs/server_arch.md)
- [Supported models and how to add a new one](docs/model_support.md)
- [OpenAI-compatible API guide](docs/openai_api.md)
- [vLLM integration](docs/vllm_integration.md)
- [Third-party UI integration](docs/third_party_ui.md)
- [MT-Bench LLM evaluation](fastchat/llm_judge/README.md)

---


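## Doramagic Pitfall Log

Project: lm-sys/FastChat

Summary: 10 potential pitfalls found, 2 of them high/blocking. Highest priority: maintenance pitfall, source evidence: The stop parameter in openai API doesn't work since v0.2.5.

## 1. Maintenance pitfall · Source evidence: The stop parameter in openai API doesn't work since v0.2.5

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified maintenance/version-related issue: The stop parameter in openai API doesn't work since v0.2.5
- Impact on users: may raise the cost of trials for new users and of production integration.
- Evidence: community_evidence:github | https://github.com/lm-sys/FastChat/issues/1048 | Source type github_issue exposes a usage condition that still needs verification.

## 2. Capability pitfall · Source evidence: FastChat-T5 4K context

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified capability-related issue: FastChat-T5 4K context
- Impact on users: users may overestimate the usable input length and hit errors on long inputs.
- Evidence: community_evidence:github | https://github.com/lm-sys/FastChat/issues/1711 | Source type github_issue exposes a usage condition that still needs verification.

## 3. Identity pitfall · Repository name and install name differ

- Severity: medium
- Evidence strength: runtime_trace
- Finding: the repository name `fastchat` does not match the install name `fschat`.
- Impact on users: searching for the package by repository name, or for the repository by package name, can lead to the wrong entry point.
- Reproduce with: `pip install fschat`
- Evidence: identity.distribution | github_repo:615882673 | https://github.com/lm-sys/FastChat | repo=fastchat; install=fschat

## 4. Capability pitfall · Capability assessment relies on an assumption

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- Impact on users: if the assumption does not hold, users will not get the promised capability.
- Evidence: capability.assumptions | github_repo:615882673 | https://github.com/lm-sys/FastChat | README/documentation is current enough for a first validation pass.

## 5. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: last_activity_observed was not recorded.
- Impact on users: new, abandoned, and active projects cannot be told apart, which lowers confidence in the recommendation.
- Evidence: evidence.maintainer_signals | github_repo:615882673 | https://github.com/lm-sys/FastChat | last_activity_observed missing

## 6. Downstream validation risk · no_demo

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- Evidence: downstream_validation.risk_items | github_repo:615882673 | https://github.com/lm-sys/FastChat | no_demo; severity=medium

## 7. Security/permission pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- Impact on users: the risk affects whether the project is suitable for ordinary users to install.
- Evidence: risks.scoring_risks | github_repo:615882673 | https://github.com/lm-sys/FastChat | no_demo; severity=medium

## 8. Security/permission pitfall · Source evidence: Unauthenticated SSRF and worker/model spoofing via the controller /register_worker endpoint

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permission-related issue: Unauthenticated SSRF and worker/model spoofing via the controller /register_worker endpoint
- Impact on users: may affect authorization, key configuration, or security boundaries.
- Evidence: community_evidence:github | https://github.com/lm-sys/FastChat/issues/3886 | The source discussion mentions Python-related conditions; re-check before installing or trialing.

## 9. Maintenance pitfall · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- Impact on users: users cannot tell whether anyone will respond when they run into a problem.
- Evidence: evidence.maintainer_signals | github_repo:615882673 | https://github.com/lm-sys/FastChat | issue_or_pr_quality=unknown

## 10. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- Impact on users: install commands and docs may lag behind the code, making pitfalls more likely.
- Evidence: evidence.maintainer_signals | github_repo:615882673 | https://github.com/lm-sys/FastChat | release_recency=unknown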

<!-- canonical_name: lm-sys/FastChat; human_manual_source: deepwiki_human_wiki -->