# Project Documentation: https://github.com/vllm-project/vllm

Generated: 2026-05-15 22:05:46 UTC

## Table of Contents

- [vLLM Overview](#page-1)
- [Getting Started](#page-2)
- [Core Engine Architecture](#page-3)
- [Model Executor and Worker Architecture](#page-4)
- [Scheduling and Request Processing](#page-5)
- [PagedAttention and KV Cache Management](#page-6)
- [Attention Backends and Kernels](#page-7)
- [Quantization Support](#page-8)
- [Distributed Inference and Parallelism](#page-9)
- [Model Architecture Support](#page-10)

<a id='page-1'></a>

## vLLM Overview

### Related Pages

Related topics: [Getting Started](#page-2), [Core Engine Architecture](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/vllm-project/vllm/blob/main/README.md)
- [vllm/entrypoints/cli/main.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/main.py)
- [vllm/entrypoints/cli/serve.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py)
- [vllm/entrypoints/cli/launch.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/launch.py)
- [examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)
- [examples/disaggregated/example_connector/README.md](https://github.com/vllm-project/vllm/blob/main/examples/disaggregated/example_connector/README.md)
- [examples/disaggregated/disaggregated_encoder/README.md](https://github.com/vllm-project/vllm/blob/main/examples/disaggregated/disaggregated_encoder/README.md)
- [examples/features/structured_outputs/README.md](https://github.com/vllm-project/vllm/blob/main/examples/features/structured_outputs/README.md)
- [examples/pooling/embed/openai_embedding_long_text/README.md](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/openai_embedding_long_text/README.md)
- [examples/disaggregated/kv_load_failure_recovery_offline/README.md](https://github.com/vllm-project/vllm/blob/main/examples/disaggregated/kv_load_failure_recovery_offline/README.md)
- [examples/observability/opentelemetry/README.md](https://github.com/vllm-project/vllm/blob/main/examples/observability/opentelemetry/README.md)
- [examples/observability/dashboards/README.md](https://github.com/vllm-project/vllm/blob/main/examples/observability/dashboards/README.md)
- [tools/profiler/nsys_profile_tools/README.md](https://github.com/vllm-project/vllm/blob/main/tools/profiler/nsys_profile_tools/README.md)
</details>

# vLLM Overview

## What is vLLM?

vLLM is a fast and easy-to-use library for LLM (Large Language Model) inference and serving. It provides high-throughput, memory-efficient inference with an OpenAI-compatible API server, making it suitable for both research and production environments.

Source: [README.md](https://github.com/vllm-project/vllm/blob/main/README.md)

## Key Features

### Offline Inference

The `LLM` class provides the primary Python interface for offline inference—interacting with a model without using a separate model inference server. This enables direct model interaction for batch processing, experimentation, and development workflows.

Source: [examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)

### OpenAI-Compatible API Server

vLLM serves LLM completions via HTTP through an OpenAI-compatible API. The server can be started with a simple command:

```bash
vllm serve Qwen/Qwen2.5-3B-Instruct
```

Source: [vllm/entrypoints/cli/serve.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py)

### Structured Outputs

vLLM supports constrained decoding for structured outputs including JSON schema, regex patterns, and structural tags. This is essential for building reliable applications that require predictable output formats.

Source: [examples/features/structured_outputs/README.md](https://github.com/vllm-project/vllm/blob/main/examples/features/structured_outputs/README.md)

### Disaggregated Prefill and Decode

vLLM supports a disaggregated prefill architecture in which the prefill stage (processing the prompt and producing the KV cache) and the decode stage (consuming the KV cache to generate new tokens) run on separate instances. This enables:

- Independent scaling of prefill and decode workloads
- Improved resource utilization
- Better support for multi-turn conversations

Source: [examples/disaggregated/example_connector/README.md](https://github.com/vllm-project/vllm/blob/main/examples/disaggregated/example_connector/README.md)

### Encoder Cache Transfer

For disaggregated serving, vLLM also supports transferring cached intermediate data between producer and consumer instances. The disaggregated encoder example demonstrates this with an encoder cache (EC) connector:

| Component | Role | Description |
|-----------|------|-------------|
| `ECExampleConnector` | Cache Storage | Stores encoder cache on local disk |
| EC Producer | Precompute | Pre-computes the encoder cache |
| EC Consumer | Retrieve | Retrieves the cached encoder data |

Source: [examples/disaggregated/disaggregated_encoder/README.md](https://github.com/vllm-project/vllm/blob/main/examples/disaggregated/disaggregated_encoder/README.md)

### KV Load Failure Recovery

vLLM implements robust recovery mechanisms for KV load failures in both synchronous and asynchronous loading modes. The system:

1. Identifies invalid KV blocks
2. Reschedules affected requests
3. Ensures consistent output through recovery logic

Source: [examples/disaggregated/kv_load_failure_recovery_offline/README.md](https://github.com/vllm-project/vllm/blob/main/examples/disaggregated/kv_load_failure_recovery_offline/README.md)

### Long Text Embedding with Chunked Processing

vLLM supports embedding models with chunked processing for texts exceeding the model's maximum context length:

```json
{
  "pooling_type": "auto",
  "use_activation": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}
```

This enables processing of extremely long documents (up to 3M+ tokens) including academic papers, legal documents, and code repositories.

Source: [examples/pooling/embed/openai_embedding_long_text/README.md](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/openai_embedding_long_text/README.md)
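
As a usage sketch, the endpoint can be queried with the standard OpenAI Python client once a server with the chunked-processing configuration above is running; the model name, host, port, and input text below are assumptions chosen for illustration.

```python
# Minimal sketch: query the OpenAI-compatible /v1/embeddings endpoint of a
# vLLM server started with chunked processing enabled. Model name, host,
# and port are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

long_document = "vLLM " * 200_000  # stand-in for a very long input text

response = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",
    input=long_document,
)
print(len(response.data[0].embedding))  # embedding dimensionality
```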

## Architecture Overview

### CLI Entry Point

The vLLM CLI provides a flexible command-line interface built with `FlexibleArgumentParser`:

```mermaid
graph TD
    A[vllm CLI] --> B[main.py Entry Point]
    B --> C[Parse Arguments]
    C --> D[Load CMD_MODULES]
    D --> E[Create Subparsers]
    E --> F[Execute Subcommand]
    
    F --> G[serve]
    F --> H[launch]
    F --> I[other modules]
```

The CLI supports:
- `-v, --version` flag for version information
- Subcommand system with plugin architecture
- Command validation before execution

Source: [vllm/entrypoints/cli/main.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/main.py)

### Serve Subcommand

The `serve` subcommand handles online inference server startup:

```mermaid
graph TD
    A[serve command] --> B{Model specified?}
    B -->|Yes| C[Use CLI model]
    B -->|No| D[Default: Qwen/Qwen3-0.6B]
    
    C --> E{GRPC enabled?}
    D --> E
    
    E -->|Yes| F[Start gRPC Server]
    E -->|No| G{Headless mode?}
    
    G -->|Yes| H[Set api_server_count=0]
    G -->|No| I[Check LB mode]
    
    I --> J[Start FastAPI Server]
    H --> J
```

The serve command supports multiple deployment modes:
- **Standard mode**: Full API server with GPU inference
- **Headless mode**: No API servers, only engine processing
- **gRPC mode**: Alternative RPC interface
- **Load-balanced mode**: Data-parallel external/hybrid load balancing

Source: [vllm/entrypoints/cli/serve.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py)

### Launch Subcommand

The `launch` subcommand provides a modular component launch system:

```mermaid
graph LR
    A[launch command] --> B[LaunchSubcommand]
    B --> C[launch_component subparser]
    C --> D[LaunchSubcommandBase subclasses]
    
    D --> E[run_launch_fastapi]
    D --> F[other components]
    
    E --> G[Socket binding]
    E --> H[Build API Server]
    E --> I[EngineArgs configuration]
```

The launch system starts render servers that perform preprocessing only: they run no inference, skip quantization kernels, and never allocate a KV cache.

Source: [vllm/entrypoints/cli/launch.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/launch.py)

## Observability

### OpenTelemetry Integration

vLLM includes built-in OpenTelemetry support for distributed tracing:

```bash
opentelemetry-instrument vllm serve facebook/opt-125m
```

Core packages are bundled with vLLM:
- `opentelemetry-sdk`
- `opentelemetry-api`
- `opentelemetry-exporter-otlp`
- `opentelemetry-semantic-conventions-ai`

Source: [examples/observability/opentelemetry/README.md](https://github.com/vllm-project/vllm/blob/main/examples/observability/opentelemetry/README.md)

### Prometheus Metrics and Dashboards

vLLM exports Prometheus-compatible metrics and supports integration with:

| Platform | Dashboard Format | Import Method |
|----------|------------------|---------------|
| Grafana | JSON | UI or API |
| Perses | YAML | CLI |

Source: [examples/observability/dashboards/README.md](https://github.com/vllm-project/vllm/blob/main/examples/observability/dashboards/README.md)

### Performance Profiling

The `nsys_profile_tools` enable GPU kernel-level profiling:

```bash
nsys profile -t cuda -o run1 -f true --trace-fork-before-exec=true \
    --cuda-graph-trace=node --delay <DELAY> --duration <DURATION> \
    vllm serve openai/gpt-oss-120b ...
```

The `gputrc2graph.py` script generates kernel-level summaries and visualizations from `.nsys-rep` files.

Source: [tools/profiler/nsys_profile_tools/README.md](https://github.com/vllm-project/vllm/blob/main/tools/profiler/nsys_profile_tools/README.md)

## Supported Features Summary

| Feature | Description | Configuration |
|---------|-------------|---------------|
| **Offline Inference** | Batch processing without server | `LLM` class |
| **OpenAI API** | HTTP API compatibility | `vllm serve` |
| **Structured Outputs** | JSON/regex/structural constraints | `--reasoning-parser` |
| **Disaggregated Serving** | Split prefill/decode | `--ec-transfer-config` |
| **KV Recovery** | Failure resilience | Custom connectors |
| **Long Text Embedding** | Chunked processing | `--pooler-config` |
| **Observability** | Tracing and metrics | OpenTelemetry |
| **Quantization** | GGUF support | `repo_id:quant_type` |

## Usage Modes

### Offline Inference Mode

```python
from vllm import LLM

llm = LLM("Qwen/Qwen2.5-3B-Instruct")
outputs = llm.generate(["Hello, world!"])
print(outputs[0].outputs[0].text)
```

### API Server Mode

```bash
vllm serve Qwen/Qwen2.5-3B-Instruct --tensor-parallel-size 2
```

### Disaggregated Mode

```bash
# Prefill instance
vllm serve --prefill-only --ec-transfer-config ' {...} '

# Decode instance
vllm serve --decode-only --ec-transfer-config ' {...} '
```

## Quick Reference

| Command | Purpose |
|---------|---------|
| `vllm serve <model>` | Start API server |
| `vllm launch <component>` | Launch specific component |
| `opentelemetry-instrument vllm serve` | Enable tracing |

## See Also

- [Offline Inference Guide](examples/basic/offline_inference/README.md)
- [Structured Outputs](examples/features/structured_outputs/README.md)
- [Disaggregated Prefill](examples/disaggregated/example_connector/README.md)
- [Observability Setup](examples/observability/opentelemetry/README.md)
- [Dashboard Configuration](examples/observability/dashboards/README.md)

---

<a id='page-2'></a>

## Getting Started

### Related Pages

Related topics: [vLLM Overview](#page-1)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/vllm-project/vllm/blob/main/README.md)
- [vllm/entrypoints/cli/main.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/main.py)
- [vllm/entrypoints/cli/serve.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py)
- [vllm/entrypoints/cli/launch.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/launch.py)
- [examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)
- [examples/features/structured_outputs/README.md](https://github.com/vllm-project/vllm/blob/main/examples/features/structured_outputs/README.md)
- [examples/pooling/embed/openai_embedding_long_text/README.md](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/openai_embedding_long_text/README.md)
</details>

# Getting Started

vLLM is a fast and easy-to-use library for Large Language Model (LLM) inference and serving. It provides both an offline inference interface via the `LLM` class and an online serving layer with an OpenAI-compatible API server. Source: [README.md:1-30]()

This guide covers the essential steps to get started with vLLM, from installation through basic inference and serving.

## Installation

vLLM can be installed via pip or built from source. For detailed installation instructions, refer to the [official documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html).

### Quick Installation

```bash
pip install vllm
```

### Building from Source

For custom builds or development:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```

### GPU Requirements

vLLM requires CUDA-compatible GPUs. The library supports various CUDA versions. Verify your environment has the necessary GPU drivers and CUDA toolkit installed. Source: [README.md:1-15]()

## Core Concepts

Before diving into usage, understand these fundamental concepts:

| Concept | Description |
|---------|-------------|
| **LLM Class** | Primary Python interface for offline inference |
| **Engine Args** | Configuration parameters for the inference engine |
| **Sampling Params** | Controls generation behavior (temperature, max_tokens, etc.) |
| **OpenAI API Server** | HTTP server providing OpenAI-compatible REST endpoints |

### Architecture Overview

```mermaid
graph TD
    A[User Code] --> B[LLM Class / API Server]
    B --> C[AsyncLLMEngine]
    C --> D[Worker Pool]
    D --> E[GPU Devices]
    E --> F[PagedAttention KV Cache]
    
    G[HTTP Clients] --> H[OpenAI API Server]
    H --> B
```

## Offline Inference

Offline inference involves running model inference directly in Python without a separate server. This is ideal for batch processing, testing, or embedding vLLM into applications. Source: [examples/basic/offline_inference/README.md:1-25]()

### Basic Usage

```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM("Qwen/Qwen2.5-3B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=256
)

# Run inference
outputs = llm.generate(["Hello, how are you?", "What is vLLM?"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### Supported Models

vLLM supports a wide range of models including autoregressive transformers, mixture-of-experts models, and quantized models. For the complete list, see the [supported models documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). Source: [README.md:20-25]()

## Serving with the CLI

vLLM provides a command-line interface for serving models via an OpenAI-compatible API. Source: [vllm/entrypoints/cli/serve.py:1-30]()

### Starting the Server

```bash
vllm serve Qwen/Qwen3-0.6B
```

### Command-Line Options

The serve command supports extensive configuration through CLI arguments:

```bash
vllm serve <model> [options]
```

Use `--help=all` to show all available flags, or `--help=<ConfigGroup>` to explore options by section (e.g., `--help=ModelConfig`, `--help=Frontend`). Source: [vllm/entrypoints/cli/serve.py:5-20]()

### Key Server Options

| Option | Description | Default |
|--------|-------------|---------|
| `--model` | Model name or path | Required |
| `--gpu-memory-utilization` | Fraction of GPU memory to use | 0.9 |
| `--max-model-len` | Maximum sequence length | Model default |
| `--tensor-parallel-size` | Number of GPUs for parallelism | 1 |
| `--port` | Server port | 8000 |

### Headless Mode

For distributed setups where API servers are managed externally:

```bash
vllm serve <model> --headless
```

In headless mode, no API servers are started, and `--api-server-count` cannot be used. Source: [vllm/entrypoints/cli/serve.py:30-45]()

## API Usage

Once the server is running, you can interact with it using the OpenAI-compatible API.

### Completions API

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 50
  }'
```

### Chat API

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
```
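
The same chat request can also be issued from Python with the official `openai` client; the sketch below assumes the package is installed and that the server started earlier is listening on localhost:8000.

```python
# Minimal sketch: call the OpenAI-compatible chat endpoint of a local vLLM
# server. Assumes `pip install openai` and a server on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
print(response.choices[0].message.content)
```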

## Offline Inference Examples

The repository includes practical examples demonstrating various vLLM capabilities. Source: [examples/basic/offline_inference/README.md:25-60]()

### Running Examples

```bash
# Basic example
python examples/basic/offline_inference/basic.py

# Chat example with sampling parameters
python examples/basic/offline_inference/chat.py --max_tokens 100 --temperature 0.8

# Generate example
python examples/basic/offline_inference/generate.py --generation-config auto
```

### Generation Config

The `--generation-config` argument specifies where the generation config loads from:

- `'auto'` - Load from model path
- `<folder_path>` - Load from specified directory
- Not provided - Use vLLM defaults

```bash
python examples/basic/offline_inference/generate.py --generation-config auto
```

> **Note:** If `max_new_tokens` is specified in generation config, it sets a server-wide limit on output tokens for all requests. Source: [examples/basic/offline_inference/README.md:55-70]()

## Advanced Features

### Structured Outputs

vLLM supports constrained decoding for structured outputs including JSON schemas, regex patterns, and grammar-based constraints. Source: [examples/features/structured_outputs/README.md:1-40]()

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --reasoning-parser deepseek_r1
```

```bash
# Run structured outputs example
uv run structured_outputs_offline.py --constraint json_mode regex
```
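
As an illustrative sketch of a constrained request (not the bundled example script), a JSON schema can be attached through vLLM-specific extra fields on an OpenAI-compatible request; the `guided_json` field name is an assumption and may differ across vLLM versions.

```python
# Sketch: request JSON-schema-constrained output from a vLLM server.
# The `guided_json` extra field is an assumption and may vary by version.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Give me a JSON object about Paris."}],
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)
```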

### Long Text Embedding

For embedding models, vLLM supports chunked processing to handle texts exceeding the model's maximum context length:

```bash
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./service.sh
```

Configuration example with chunked processing:

```json
{
  "pooling_type": "auto",
  "use_activation": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}
```

Source: [examples/pooling/embed/openai_embedding_long_text/README.md:1-60]()

### GGUF Quantized Models

vLLM supports GGUF-quantized models loaded directly from HuggingFace:

```bash
--model unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B
```

### CPU Offload

For systems with limited GPU memory, CPU offload allows loading larger models:

```bash
--cpu-offload-gb 10
```

This creates a virtual 34GB GPU when you have a 24GB GPU, enabling 13B model loading with BF16 weights. Source: [examples/basic/offline_inference/README.md:75-85]()
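
The same offload option is available in offline mode through the `LLM` constructor; the sketch below assumes your vLLM version accepts the `cpu_offload_gb` engine argument, and the model name is a placeholder.

```python
# Sketch: offline inference with part of the weights offloaded to CPU memory.
# `cpu_offload_gb=10` mirrors the `--cpu-offload-gb 10` CLI flag above;
# the 13B model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", cpu_offload_gb=10)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```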

## Configuration Workflow

```mermaid
graph LR
    A[Define Engine Args] --> B[Create Model Config]
    B --> C[Initialize Engine]
    C --> D[Process Requests]
    D --> E[Return Outputs]
    
    F[CLI Arguments] --> A
    G[Python API] --> A
```

### Programmatic Configuration

```python
from dataclasses import asdict

from vllm import LLM, EngineArgs

engine_args = EngineArgs(
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_memory_utilization=0.85,
    tensor_parallel_size=2,
    max_model_len=4096,
)

# EngineArgs is a dataclass, so expand it into keyword arguments for LLM.
llm = LLM(**asdict(engine_args))
```

## Next Steps

| Resource | Description |
|----------|-------------|
| [Documentation](https://docs.vllm.ai) | Comprehensive guides and API reference |
| [Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html) | Complete list of supported architectures |
| [Examples](https://github.com/vllm-project/vllm/tree/main/examples) | Usage examples for various features |
| [Paper](https://arxiv.org/abs/2309.06180) | Technical details behind vLLM's design |

## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**: Reduce `gpu_memory_utilization` or use smaller batch sizes
2. **Model Not Found**: Ensure HuggingFace credentials are configured for gated models
3. **Import Errors**: Verify all dependencies are installed with `pip install vllm`

### Getting Help

- **Technical Questions**: Use [GitHub Issues](https://github.com/vllm-project/vllm/issues)
- **Community Discussion**: [vLLM Forum](https://discuss.vllm.ai)
- **Development Coordination**: [Developer Slack](https://slack.vllm.ai)

Source: [README.md:60-75]()

---

<a id='page-3'></a>

## Core Engine Architecture

### Related Pages

Related topics: [vLLM Overview](#page-1), [Model Executor and Worker Architecture](#page-4), [Scheduling and Request Processing](#page-5)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [vllm/engine/llm_engine.py](https://github.com/vllm-project/vllm/blob/main/vllm/engine/llm_engine.py)
- [vllm/engine/async_llm_engine.py](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py)
- [vllm/v1/engine/core.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/core.py)
- [vllm/v1/engine/async_llm.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/async_llm.py)
- [vllm/v1/engine/llm_engine.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/llm_engine.py)
- [vllm/entrypoints/llm.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py)
- [vllm/entrypoints/cli/main.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/main.py)
- [vllm/entrypoints/cli/serve.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py)
- [vllm/entrypoints/cli/launch.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/launch.py)
</details>

# Core Engine Architecture

## Overview

The vLLM Core Engine Architecture is the central orchestration layer responsible for managing LLM inference workflows, request scheduling, and model execution. vLLM supports two engine versions: the legacy V0 engine and the current V1 engine, both designed to provide high-throughput LLM serving through efficient request batching and GPU memory management.

The engine architecture serves as the foundation for both offline inference via the `LLM` class and online serving via the OpenAI-compatible API server. Source: [vllm/entrypoints/llm.py:1-50]()

## Architecture Components

### V0 Engine (Legacy)

The V0 engine is the original implementation found in `vllm/engine/`. It consists of:

| Component | File | Purpose |
|-----------|------|---------|
| `LLMEngine` | `vllm/engine/llm_engine.py` | Synchronous inference engine with blocking operations |
| `AsyncLLMEngine` | `vllm/engine/async_llm_engine.py` | Async wrapper enabling concurrent request handling |

The V0 engine uses an event-loop-based async architecture where `AsyncLLMEngine` wraps `LLMEngine` to provide non-blocking request processing.

### V1 Engine (Current)

The V1 engine (`vllm/v1/engine/`) is the current production-ready implementation featuring a modular design:

| Component | File | Purpose |
|-----------|------|---------|
| `EngineCore` | `vllm/v1/engine/core.py` | Low-level engine core managing scheduling and model execution |
| `AsyncLLM` | `vllm/v1/engine/async_llm.py` | Async interface for online inference |
| `LLMEngine` | `vllm/v1/engine/llm_engine.py` | Synchronous front-end for offline inference |

The V1 engine architecture separates concerns into distinct layers: `AsyncLLM` provides the public async interface for online serving, `LLMEngine` provides the synchronous interface used for offline inference, and `EngineCore` manages scheduling and low-level GPU execution.

## Entry Points

vLLM provides multiple entry points for interacting with the engine:

```mermaid
graph TD
    A[vllm serve] --> B[ServeSubcommand]
    A --> C[LaunchSubcommand]
    B --> D[serve_grpc / API Server]
    C --> E[run_launch_fastapi]
    F[vllm run-batch] --> G[BatchRunner]
    H[LLM Class] --> I[AsyncLLM Engine]
```

### CLI Entry Point

The CLI entry point in `vllm/entrypoints/cli/main.py` provides command-line access to vLLM functionality:

```python
# Simplified CLI structure
parser = FlexibleArgumentParser(description="vLLM CLI")
subparsers = parser.add_subparsers(required=False, dest="subparser")
cmds = {}
for cmd_module in CMD_MODULES:
    new_cmds = cmd_module.cmd_init()
    for cmd in new_cmds:
        cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)
```

Source: [vllm/entrypoints/cli/main.py:1-40]()

### Serve Subcommand

The `serve` subcommand initializes the HTTP API server or gRPC service:

```python
class ServeSubcommand(CLISubcommand):
    name = "serve"
    
    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        if hasattr(args, "model_tag") and args.model_tag is not None:
            args.model = args.model_tag
```

Source: [vllm/entrypoints/cli/serve.py:1-30]()

### Launch Subcommand

The `launch` subcommand provides component-level launching capabilities:

```python
def cmd_init() -> list[CLISubcommand]:
    return [LaunchSubcommand()]

async def run_launch_fastapi(args: argparse.Namespace) -> None:
    listen_address, sock = setup_server(args)
    engine_args = AsyncEngineArgs.from_cli_args(args)
```

Source: [vllm/entrypoints/cli/launch.py:1-60]()

### Python API Entry Point

The `LLM` class in `vllm/entrypoints/llm.py` provides the primary Python interface for offline inference:

```python
class LLM:
    """
    An LLM for offline inference.
    """
    
    def __init__(self, model: str, ...):
        ...
```

Source: [vllm/entrypoints/llm.py:1-100]()

## Engine Initialization Flow

The following diagram illustrates the initialization flow from CLI to engine:

```mermaid
sequenceDiagram
    participant CLI as vllm serve
    participant Parser as ArgumentParser
    participant EngineArgs as AsyncEngineArgs
    participant Engine as AsyncLLM / EngineCore
    participant Model as ModelConfig
    
    CLI->>Parser: Parse CLI arguments
    Parser->>EngineArgs: from_cli_args()
    EngineArgs->>Model: create_model_config()
    Model-->>EngineArgs: Config validated
    EngineArgs->>Engine: Initialize engine
    Engine->>Engine: Load model weights
```

## Core Engine Components

### AsyncLLM (V1)

The `AsyncLLM` class is the primary async interface for V1 engine:

```python
class AsyncLLM:
    """
    Async implementation of LLM engine.
    """
    
    async def add_request(self, request_id: str, prompt: str, ...):
        ...
    
    async def step(self) -> List[RequestOutput]:
        ...
```

Source: [vllm/v1/engine/async_llm.py:1-50]()

### EngineCore (V1)

The `EngineCore` class manages low-level model execution:

```python
class EngineCore:
    """
    Core engine for V1.
    """
    
    def __init__(self, engine_config: VllmConfig, ...):
        ...
    
    def get_config(self) -> VllmConfig:
        ...
```

Source: [vllm/v1/engine/core.py:1-80]()

### LLMEngine (V1)

The V1 `LLMEngine` orchestrates request processing:

```python
class LLMEngine:
    """
    V1 LLM Engine implementation.
    """
    
    def __init__(self, vllm_config: VllmConfig, ...):
        ...
```

Source: [vllm/v1/engine/llm_engine.py:1-50]()

## Configuration System

### AsyncEngineArgs

Configuration flows from CLI/API to engine via `AsyncEngineArgs`:

| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | `str` | Model name or path |
| `tensor_parallel_size` | `int` | Number of GPUs for tensor parallelism |
| `gpu_memory_utilization` | `float` | Fraction of GPU memory to use |
| `max_model_len` | `Optional[int]` | Maximum sequence length |
| `dtype` | `str` | Model data type (float16, bfloat16, etc.) |
| `quantization` | `Optional[str]` | Quantization method (awq, gptq, etc.) |

The engine validates configuration through `create_model_config()`:

```python
def create_model_config(self) -> ModelConfig:
    """Create model config from engine args."""
    return ModelConfig(...)
```

Source: [vllm/entrypoints/cli/launch.py:40-50]()

## Headless Mode

The V1 engine supports headless operation where no API servers are started:

```python
if args.headless:
    if args.api_server_count is not None and args.api_server_count > 0:
        raise ValueError(
            f"--api-server-count={args.api_server_count} cannot be "
            "used with --headless (no API servers are started in "
            "headless mode)."
        )
    args.api_server_count = 0
```

Source: [vllm/entrypoints/cli/serve.py:25-35]()

## gRPC Support

vLLM supports gRPC for disaggregated prefill scenarios:

```python
if getattr(args, "grpc", False):
    from vllm.entrypoints.grpc_server import serve_grpc
    uvloop.run(serve_grpc(args))
    return
```

Source: [vllm/entrypoints/cli/serve.py:18-22]()

## Request Processing Pipeline

```mermaid
graph LR
    A[Request] --> B[Parser]
    B --> C[AsyncLLM.add_request]
    C --> D[Scheduler]
    D --> E[Model Executor]
    E --> F[Output]
    D --> G[KV Cache]
```

## Data Parallel Modes

The engine supports multiple data parallel configurations:

| Mode | Flag | Description |
|------|------|-------------|
| External LB | `--data-parallel-external-lb` | External load balancer coordinates workers |
| Hybrid LB | `--data-parallel-hybrid-lb` | Hybrid approach with custom rank assignment |
| Rank | `--data-parallel-rank` | Specify worker rank for distributed setup |

```python
# Detection logic
is_external_lb = getattr(args, "data_parallel_external_lb", False)
is_hybrid_lb = getattr(args, "data_parallel_hybrid_lb", False)
```

Source: [vllm/entrypoints/cli/serve.py:35-40]()

## Model Config Validation

The engine performs validation during model config creation:

```python
model_config = engine_args.create_model_config()

# Clear quantization for render servers (preprocessing only)
if render_mode:
    model_config.quantization = None
```

Source: [vllm/entrypoints/cli/launch.py:45-55]()

## Key Design Patterns

### Async/Await Architecture

The V1 engine uses native async/await for concurrent request handling:

```python
async def step(self) -> List[RequestOutput]:
    """Execute one iteration of the engine."""
    ...
```

### Command Pattern

CLI commands follow the Command pattern with `CLISubcommand` base class:

```python
class ServeSubcommand(CLISubcommand):
    name = "serve"
    
    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        ...
```

### Subparser Registration

Commands register themselves via `cmd_init()` factory functions:

```python
def cmd_init() -> list[CLISubcommand]:
    return [LaunchSubcommand(), ServeSubcommand()]
```
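
As a hedged sketch of this pattern (not an actual vLLM command), a new subcommand would subclass `CLISubcommand`, implement `subparser_init` and `cmd`, and be returned from a module-level `cmd_init()`; the import path and the "hello" command itself are assumptions for illustration.

```python
# Hypothetical subcommand following the registration pattern shown above.
# `CLISubcommand`, `subparser_init`, and `cmd_init` mirror the snippets in
# this page; the command itself is purely illustrative.
import argparse

from vllm.entrypoints.cli.types import CLISubcommand  # import path is an assumption


class HelloSubcommand(CLISubcommand):
    name = "hello"

    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        print(f"Hello, {args.name}!")

    def subparser_init(self, subparsers) -> argparse.ArgumentParser:
        parser = subparsers.add_parser(self.name, help="Print a greeting")
        parser.add_argument("--name", default="vLLM")
        return parser


def cmd_init() -> list[CLISubcommand]:
    return [HelloSubcommand()]
```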

## Summary

The vLLM Core Engine Architecture provides a flexible, multi-layered design supporting both V0 (legacy) and V1 (current) engine implementations. Key characteristics include:

1. **Dual Engine Support**: V0 for backward compatibility, V1 for production workloads
2. **Multiple Entry Points**: CLI, Python API, HTTP server, gRPC
3. **Async-First Design**: Native async/await for concurrent request processing
4. **Modular Components**: Clear separation between CLI, engine core, and model execution
5. **Flexible Configuration**: Comprehensive argument system via `AsyncEngineArgs`
6. **Data Parallel Support**: Multiple modes for distributed serving scenarios

The architecture prioritizes performance through efficient GPU memory management, request batching, and pipelined execution while maintaining a clean, extensible design.

---

<a id='page-4'></a>

## Model Executor and Worker Architecture

### Related Pages

Related topics: [Core Engine Architecture](#page-3), [Scheduling and Request Processing](#page-5), [Model Architecture Support](#page-10)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [vllm/v1/worker/gpu_model_runner.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py)
- [vllm/v1/worker/worker_base.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/worker_base.py)
- [vllm/v1/worker/gpu_worker.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_worker.py)
- [vllm/model_executor/model_loader/__init__.py](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/model_loader/__init__.py)
- [vllm/model_executor/models/__init__.py](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/__init__.py)
</details>

# Model Executor and Worker Architecture

## Overview

The vLLM Model Executor and Worker Architecture forms the core execution layer responsible for model loading, inference, and batch-level processing. This architecture separates concerns between high-level request orchestration and low-level GPU-based model execution, enabling efficient parallel processing of LLM inference requests.

The architecture consists of two primary components:

| Component | Responsibility |
|-----------|----------------|
| **Model Executor** | Manages model loading, weight initialization, and model-specific execution logic |
| **Worker** | Handles GPU-side computation, memory management, and kernel execution |

## Architecture Diagram

```mermaid
graph TD
    A[AsyncEngineArgs] --> B[Model Executor]
    B --> C[Model Loader]
    C --> D[Model Weights]
    B --> E[GPU Worker]
    E --> F[GPU Model Runner]
    F --> G[CUDA Kernels]
    F --> H[Attention Layers]
    F --> I[MLP Layers]
    
    J[Request Batch] --> E
    E --> K[Generated Tokens]
    
    style B fill:#e1f5fe
    style E fill:#fff3e0
    style F fill:#f3e5f5
```

## Model Executor Layer

### Purpose and Scope

The Model Executor layer handles all aspects of model lifecycle management, including initialization, weight loading, and providing the interface for model execution. This layer operates independently of the worker layer, allowing for flexibility in model configuration.

### Model Loading

The model loading subsystem supports multiple backends and quantization schemes:

```python
class ModelLoader:
    def load_model(self, model_config, parallel_config, device_config):
        # Load model weights based on configuration
        pass
```

**Supported Loading Modes:**

| Mode | Description |
|------|-------------|
| Auto | Automatically detect optimal loading strategy |
| Naive | Standard PyTorch loading |
| Sharded | Load model shards across multiple devices |
| Quantized | Load quantized weights (AWQ, GPTQ, GGUF) |

Source: [vllm/model_executor/model_loader/__init__.py]()

### Model Registry

vLLM maintains a registry of supported model architectures:

```python
# Model architecture registration
@register_model("LlamaForCausalLM")
class LlamaForCausalLM(nn.Module):
    ...
```

The registry maps model names to their corresponding model classes, enabling automatic model instantiation based on HuggingFace model architecture detection.

Source: [vllm/model_executor/models/__init__.py]()
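
Conceptually, the registry is a mapping from architecture names to model classes. The following self-contained sketch is not vLLM's actual `ModelRegistry`; it only illustrates the name-to-class lookup the registry performs.

```python
# Simplified, self-contained illustration of an architecture registry.
# This is NOT vLLM's implementation; it only mirrors the name -> class lookup.
from typing import Callable

import torch.nn as nn

_MODEL_REGISTRY: dict[str, type[nn.Module]] = {}


def register_model(arch_name: str) -> Callable[[type[nn.Module]], type[nn.Module]]:
    def decorator(cls: type[nn.Module]) -> type[nn.Module]:
        _MODEL_REGISTRY[arch_name] = cls
        return cls
    return decorator


@register_model("LlamaForCausalLM")
class LlamaForCausalLM(nn.Module):
    pass


# Instantiation based on the architecture reported by the HuggingFace config.
model_cls = _MODEL_REGISTRY["LlamaForCausalLM"]
model = model_cls()
```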

## Worker Architecture

### Base Worker

The worker base class defines the interface for all worker implementations:

```python
class WorkerBase:
    def __init__(self, vllm_config):
        self.vllm_config = vllm_config
        self.model = None
        self.device = None
    
    def execute_model(self, batch):
        raise NotImplementedError
```

Source: [vllm/v1/worker/worker_base.py]()

### GPU Worker

The GPU Worker is the primary worker implementation for GPU-based inference:

```mermaid
graph LR
    A[Input Batch] --> B[Model Input Preparation]
    B --> C[Forward Pass]
    C --> D[Output Extraction]
    D --> E[Token Generation]
```

**Key Responsibilities:**

1. Initialize CUDA context and memory pools
2. Prepare model inputs with proper padding and masking
3. Execute forward passes on GPU
4. Manage KV cache memory allocation

Source: [vllm/v1/worker/gpu_worker.py]()

### GPU Model Runner

The GPU Model Runner handles the low-level model execution details:

```python
class GPUModelRunner:
    def __init__(self, config):
        self.kv_cache = None
        self.attn_metadata = None
        self.block_manager = None
    
    def prepare_inputs(self, batch):
        # Prepare input tensors with proper device placement
        pass
    
    def execute_model(self, input_tokens, positions):
        # Execute model forward pass
        pass
```

**Core Components:**

| Component | Function |
|-----------|----------|
| `kv_cache` | Stores key-value tensors for attention |
| `attn_metadata` | Manages attention metadata for paged attention |
| `block_manager` | Handles physical memory block allocation |

Source: [vllm/v1/worker/gpu_model_runner.py]()

## Execution Flow

```mermaid
sequenceDiagram
    participant API as API Server
    participant Executor as Model Executor
    participant Worker as GPU Worker
    participant Runner as GPU Model Runner
    participant Kernels as CUDA Kernels
    
    API->>Executor: Initialize model
    Executor->>Worker: Load weights
    Worker->>Runner: Initialize GPU state
    Runner->>Kernels: Allocate memory
    
    API->>Executor: Execute batch
    Executor->>Worker: Forward request
    Worker->>Runner: Prepare inputs
    Runner->>Kernels: Compute attention
    Runner->>Kernels: Compute mlp
    Kernels-->>Runner: Output logits
    Runner-->>Worker: Return results
    Worker-->>Executor: Output tokens
```

## Memory Management

### KV Cache Architecture

vLLM uses a block-based KV cache management system:

1. **Logical Blocks**: Abstract representation of KV cache entries
2. **Physical Blocks**: Actual GPU memory allocations
3. **Block Mapping**: Links logical blocks to physical locations

```python
class BlockManager:
    def allocate(self, num_blocks):
        # Allocate physical blocks
        pass
    
    def get_physical_block(self, logical_id):
        # Get physical location for logical block
        pass
```

### Memory Allocation Strategy

| Strategy | Use Case |
|----------|----------|
| Dynamic | Default, allocates on demand |
| Static | Pre-allocates at initialization |
| Hybrid | Mix of static and dynamic |

## Configuration Parameters

The Model Executor and Worker architecture is configured through `AsyncEngineArgs`:

| Parameter | Description | Default |
|-----------|-------------|---------|
| `model` | Model name or path | Required |
| `tensor_parallel_size` | Number of GPUs for tensor parallelism | 1 |
| `pipeline_parallel_size` | Number of pipeline stages | 1 |
| `gpu_memory_utilization` | Fraction of GPU memory for KV cache | 0.9 |
| `max_model_len` | Maximum sequence length | Auto |
| `block_size` | KV cache block size | 16 |

## Summary

The Model Executor and Worker Architecture in vLLM provides a modular, extensible system for LLM inference:

- **Separation of Concerns**: Clear boundaries between model loading and execution
- **GPU Optimization**: Efficient CUDA kernel integration and memory management
- **Flexible Configuration**: Support for various parallelism and quantization strategies
- **Extensibility**: Plugin-based model registration system

This architecture enables vLLM to achieve high throughput through batch processing while maintaining low latency through careful memory management and kernel optimization.

---

<a id='page-5'></a>

## Scheduling and Request Processing

### Related Pages

Related topics: [Core Engine Architecture](#page-3), [Model Executor and Worker Architecture](#page-4), [Distributed Inference and Parallelism](#page-9)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [vllm/v1/core/sched/scheduler.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py)
- [vllm/v1/core/sched/request_queue.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/request_queue.py)
- [vllm/v1/core/sched/async_scheduler.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/async_scheduler.py)
- [vllm/v1/request.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/request.py)
- [vllm/config/scheduler.py](https://github.com/vllm-project/vllm/blob/main/vllm/config/scheduler.py)
</details>

# Scheduling and Request Processing

## Overview

The Scheduling and Request Processing system is a core component of vLLM's v1 engine architecture. It manages the lifecycle of inference requests from arrival to completion, coordinating resource allocation, batch scheduling, and execution ordering to maximize GPU utilization while maintaining quality of service guarantees.

In vLLM, the scheduler operates as an asynchronous event-driven system that continuously evaluates pending requests, determines optimal batching strategies, and dispatches work to the underlying execution engine. This design enables vLLM to handle high-throughput serving workloads with efficient memory management through mechanisms like paged attention and dynamic batch composition.

The scheduler works in conjunction with the request queue, which serves as the primary buffer for incoming inference requests. It makes real-time decisions about which requests to include in the next execution batch based on available GPU memory, request priorities, and fairness constraints.

## Architecture Overview

### Component Hierarchy

The scheduling system consists of several interconnected components that work together to manage request processing:

```mermaid
graph TD
    A[API Layer] --> B[Request Queue]
    B --> C[Async Scheduler]
    C --> D[Scheduler]
    D --> E[Execution Engine]
    E --> F[GPU]
    
    G[Config] --> C
    G --> D
```

### Core Components

| Component | File | Purpose |
|-----------|------|---------|
| `Scheduler` | `vllm/v1/core/sched/scheduler.py` | Core scheduling logic and batch selection |
| `AsyncScheduler` | `vllm/v1/core/sched/async_scheduler.py` | Async wrapper for scheduler operations |
| `RequestQueue` | `vllm/v1/core/sched/request_queue.py` | Request buffering and ordering |
| `Request` | `vllm/v1/request.py` | Request data model and state |
| `SchedulerConfig` | `vllm/config/scheduler.py` | Scheduler configuration parameters |

## Request Model

### Request Data Structure

The `Request` class represents an individual inference request with all associated metadata and state. Each request maintains its own context including tokenized inputs, sampling parameters, and execution state.

The request model tracks the following key attributes:

| Attribute | Type | Description |
|-----------|------|-------------|
| `request_id` | `str` | Unique identifier for the request |
| `prompt` | `str` | Original input text or tokens |
| `prompt_token_ids` | `List[int]` | Tokenized prompt |
| `sampling_params` | `SamplingParams` | Sampling configuration |
| `arrival_time` | `float` | Request arrival timestamp |
| `state` | `RequestState` | Current execution state |

### Request States

Requests transition through a defined state machine during their lifecycle:

```mermaid
stateDiagram-v2
    [*] --> WAITING: Request Arrival
    WAITING --> SCHEDULED: Scheduler Selection
    SCHEDULED --> RUNNING: Dispatch to GPU
    RUNNING --> WAITING: Preemption/Continuation
    RUNNING --> FINISHED: Completion
    RUNNING --> WAITING: KV Cache Reuse
    WAITING --> CANCELLED: Client Cancellation
    SCHEDULED --> CANCELLED: Client Cancellation
```

1. **WAITING**: Request is queued and awaiting scheduling
2. **SCHEDULED**: Request has been selected for the next batch
3. **RUNNING**: Request is actively being processed on GPU
4. **FINISHED**: Request has completed successfully
5. **CANCELLED**: Request was cancelled before completion

Source: [vllm/v1/request.py]()

## Request Queue

The `RequestQueue` serves as the primary buffering mechanism for incoming requests. It provides thread-safe operations for enqueueing, dequeuing, and managing request priorities.

### Queue Operations

| Operation | Description |
|-----------|-------------|
| `enqueue()` | Add new request to queue |
| `dequeue()` | Remove and return next request |
| `peek()` | View next request without removal |
| `cancel()` | Remove cancelled request |
| `requeue()` | Return request to queue (preemption) |

### Priority Handling

The request queue supports priority-based ordering where higher priority requests can bypass lower priority ones. Priority is determined by a combination of factors:

- Explicit priority values in `SamplingParams`
- Arrival time (older requests may get priority for fairness)
- Request type (prefill vs decode operations)

Source: [vllm/v1/core/sched/request_queue.py]()
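
A minimal sketch of this ordering, assuming a smaller priority value means higher priority and arrival time breaks ties, could use a heap; it is illustrative only and not vLLM's `RequestQueue` implementation.

```python
# Illustrative priority ordering: lower priority value first, then earlier
# arrival time. Not vLLM's actual RequestQueue, just the described policy.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    priority: int
    arrival_time: float
    request_id: str = field(compare=False)


queue: list[QueuedRequest] = []
heapq.heappush(queue, QueuedRequest(priority=1, arrival_time=10.0, request_id="b"))
heapq.heappush(queue, QueuedRequest(priority=0, arrival_time=12.0, request_id="a"))

print(heapq.heappop(queue).request_id)  # "a": higher priority despite later arrival
```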

## Scheduler

### Core Scheduling Logic

The scheduler is responsible for selecting which requests to include in the next execution batch. It operates on a continuous scheduling loop that evaluates the current state of all pending requests and available resources.

#### Scheduling Criteria

The scheduler makes decisions based on multiple factors:

1. **Memory Availability**: Sufficient GPU memory must be available for the request's KV cache
2. **Batch Size Limits**: Maximum batch size constraints
3. **Chunked Prefill Decisions**: Whether to split a prefill into smaller chunks
4. **Priority Weighting**: Relative importance of pending requests
5. **Latency Targets**: QoS requirements for specific request categories

#### Scheduling Loop

```mermaid
graph LR
    A[Evaluate Pending Requests] --> B{Sufficient Resources?}
    B -->|Yes| C[Select Request]
    C --> D[Add to Batch]
    D --> E{Batch Full?}
    E -->|No| A
    E -->|Yes| F[Dispatch Batch]
    B -->|No| G[Wait / Preempt]
    F --> H[Update State]
    H --> A
```

### Async Scheduler Interface

The `AsyncScheduler` provides an asynchronous interface to the core scheduler, enabling non-blocking scheduling operations. This is essential for maintaining high throughput in production serving scenarios where the scheduler must coexist with network I/O and other async operations.

Key async operations include:

- `schedule_async()`: Async wrapper for schedule iteration
- `add_request()`: Async request enqueueing
- `abort_request()`: Async request cancellation

Source: [vllm/v1/core/sched/async_scheduler.py]()

## Scheduler Configuration

The `SchedulerConfig` class defines all configurable parameters for the scheduler behavior. These settings control batching strategies, memory management, and QoS characteristics.

### Configuration Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `max_num_seqs` | `int` | `256` | Maximum sequences per iteration |
| `max_num_batched_tokens` | `int` | `8192` | Max tokens per batch |
| `max_model_len` | `int` | `8192` | Maximum model context length |
| `enable_chunked_prefill` | `bool` | `True` | Enable prefill chunking |
| `num_prefill_groups` | `int` | `1` | Number of prefill groups |
| `async_scheduling` | `bool` | `True` | Enable async scheduling |

### Memory Management Settings

| Parameter | Description |
|-----------|-------------|
| `gpu_memory_utilization` | Fraction of GPU memory for KV cache (0.0-1.0) |
| `num_causal_layers` | Number of causal attention layers |
| `head_dim` | Dimension of attention heads |

Source: [vllm/config/scheduler.py]()

## Batching Strategies

### Continuous Batching

vLLM employs continuous batching (also known as iteration-level scheduling) to maximize GPU utilization. Unlike static batching, where batch composition is fixed at the start, continuous batching allows requests to enter and exit the batch at each iteration.

#### Advantages

- **Higher Throughput**: GPU is never idle waiting for batch to complete
- **Lower Latency**: Short requests don't wait for long ones
- **Better Memory Utilization**: Dynamic allocation based on actual needs

### Prefill Batching

Prefill operations (processing new tokens) and decode operations (generating new tokens) have different characteristics. The scheduler can:

1. **Combined Batching**: Mix prefill and decode in same batch
2. **Separate Batching**: Process prefill and decode in distinct batches
3. **Chunked Prefill**: Split large prefill requests into smaller chunks

### Chunked Prefill

When `enable_chunked_prefill` is enabled, large prefill requests are split into smaller chunks to:

- Reduce memory pressure
- Allow shorter requests to be scheduled faster
- Improve fairness between requests of different lengths

Source: [vllm/v1/core/sched/scheduler.py]()
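
For intuition, a prompt longer than the per-step token budget is processed across several scheduling steps. The sketch below, with assumed numbers, shows how chunk boundaries could be derived from a `max_num_batched_tokens`-style budget.

```python
# Illustrative chunking of a long prefill under a per-step token budget.
# Numbers are assumptions chosen for readability.
prompt_len = 20_000          # tokens in the prompt
token_budget = 8_192         # e.g. max_num_batched_tokens

chunks = []
start = 0
while start < prompt_len:
    end = min(start + token_budget, prompt_len)
    chunks.append((start, end))
    start = end

print(chunks)  # [(0, 8192), (8192, 16384), (16384, 20000)]
```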

## Request Processing Flow

### Full Request Lifecycle

```mermaid
sequenceDiagram
    participant Client
    participant API
    participant Queue
    participant Scheduler
    participant Engine
    participant GPU
    
    Client->>API: Submit Request
    API->>API: Validate & Tokenize
    API->>Queue: Enqueue Request
    Scheduler->>Queue: Dequeue Requests
    Scheduler->>Scheduler: Evaluate Batching
    Scheduler->>Engine: Dispatch Batch
    Engine->>GPU: Execute Forward Pass
    GPU-->>Engine: Output Tensors
    Engine-->>Scheduler: Update State
    Scheduler->>Queue: Requeue if Needed
    Scheduler->>API: Stream Tokens
    API-->>Client: Response Stream
```

### Scheduling Iteration

Each scheduling iteration follows these steps:

1. **Request Evaluation**: Scan all pending requests for eligibility
2. **Resource Calculation**: Determine available GPU memory
3. **Batch Composition**: Select requests based on scheduling policy
4. **Chunk Assignment**: Divide prefill requests if needed
5. **Batch Dispatch**: Send batch to execution engine
6. **State Update**: Update request states and metrics
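
A heavily simplified, token-budget-driven version of such an iteration might look as follows; it is a conceptual sketch, not the actual `Scheduler.schedule()` implementation, and the request representation is assumed.

```python
# Conceptual continuous-batching step: greedily admit waiting requests while
# the token budget and sequence limit allow. Not vLLM's actual scheduler.
def schedule_step(waiting, running, max_num_seqs=256, max_num_batched_tokens=8192):
    batch = list(running)                         # decodes continue by default
    budget = max_num_batched_tokens - len(batch)  # one token per running decode

    while waiting and len(batch) < max_num_seqs:
        req = waiting[0]
        cost = req["num_prompt_tokens"]           # prefill cost for a new request
        if cost > budget:
            break                                 # not enough budget this step
        batch.append(waiting.pop(0))
        budget -= cost
    return batch


waiting = [{"id": "r1", "num_prompt_tokens": 5000},
           {"id": "r2", "num_prompt_tokens": 6000}]
print([r["id"] for r in schedule_step(waiting, running=[])])  # ['r1']
```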

## Configuration Example

```python
from vllm.config import SchedulerConfig

config = SchedulerConfig(
    max_num_seqs=256,
    max_num_batched_tokens=8192,
    max_model_len=32768,
    enable_chunked_prefill=True,
)
# Note: gpu_memory_utilization belongs to the cache/engine configuration,
# not to SchedulerConfig.
```

## Performance Considerations

### Memory Management

The scheduler must balance memory allocation between:

- **KV Cache**: Storing attention key-value pairs
- **Model Weights**: The LLM parameters (typically pre-loaded)
- **Activation Memory**: Temporary tensors during computation

### Latency vs Throughput

The scheduling configuration affects the latency-throughput tradeoff:

| Setting | Effect |
|---------|--------|
| Smaller `max_num_seqs` | Lower latency, lower throughput |
| Larger `max_num_batched_tokens` | Higher throughput, variable latency |
| `enable_chunked_prefill=True` | Better latency fairness |

### Preemption

When GPU memory is insufficient for incoming requests, the scheduler may preempt existing requests to free memory. Preempted requests are returned to the queue and rescheduled later.

## Integration with Execution Engine

The scheduler interfaces with the execution engine through a well-defined API:

- **schedule()**: Main entry point for scheduling decisions
- **add_request()**: Register new inference request
- **abort_request()**: Cancel running request
- **update_from_output()**: Process execution results

The execution engine receives scheduled batches and returns completed or paused requests, allowing the scheduler to update its internal state and make subsequent scheduling decisions.

## Related Components

For a complete understanding of vLLM's request processing, also refer to:

- **Engine**: Coordinates between scheduler and model execution
- **Worker**: Executes model operations on GPU
- **Cache**: KV cache management and allocation
- **Sampling Params**: Sampling configuration affecting scheduling

## Summary

The Scheduling and Request Processing system is fundamental to vLLM's ability to serve large language models efficiently. By employing continuous batching, intelligent memory management, and flexible scheduling policies, vLLM achieves high throughput while maintaining low latency for diverse workloads. The modular design with separate scheduler, queue, and configuration components allows for fine-tuned control over serving behavior.

---

<a id='page-6'></a>

## PagedAttention and KV Cache Management

### Related Pages

Related topics: [Scheduling and Request Processing](#page-5), [Distributed Inference and Parallelism](#page-9), [Attention Backends and Kernels](#page-7)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [csrc/attention/paged_attention_v1.cu](https://github.com/vllm-project/vllm/blob/main/csrc/attention/paged_attention_v1.cu)
- [csrc/attention/paged_attention_v2.cu](https://github.com/vllm-project/vllm/blob/main/csrc/attention/paged_attention_v2.cu)
- [vllm/v1/core/kv_cache_manager.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/kv_cache_manager.py)
- [vllm/v1/core/block_pool.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/block_pool.py)
- [docs/design/paged_attention.md](https://github.com/vllm-project/vllm/blob/main/docs/design/paged_attention.md)
</details>

# PagedAttention and KV Cache Management

## Overview

PagedAttention is a novel attention mechanism that enables efficient virtual memory-based management of the Key-Value (KV) cache in large language model (LLM) inference. Inspired by the memory management technique in operating systems called paging, PagedAttention divides the KV cache into fixed-size "pages" that can be flexibly allocated and managed, eliminating the need for contiguous memory allocation.

The KV cache stores the key and value tensors from attention computation for each token position. During autoregressive decoding, this cache grows dynamically as new tokens are generated. Traditional LLM serving systems allocate contiguous memory blocks for the KV cache, leading to significant memory waste due to internal and external fragmentation when handling variable-length sequences and multi-user workloads.

vLLM's PagedAttention implementation provides:

- **Memory efficiency**: Eliminates fragmentation by using non-contiguous page-based allocation
- **Flexible batching**: Supports arbitrary sequence lengths and concurrent requests
- **Dynamic memory management**: Allocates cache pages on-demand during generation
- **GPU memory optimization**: Maximizes GPU memory utilization for higher throughput

## Architecture

### High-Level System Design

```mermaid
graph TD
    subgraph "Application Layer"
        Req[Inference Request]
        Sched[Scheduler]
    end
    
    subgraph "Memory Management Layer"
        KVM[KV Cache Manager]
        BP[Block Pool]
    end
    
    subgraph "GPU Memory Layer"
        GPU[GPU Memory]
        Pages[KV Cache Pages]
    end
    
    Req --> Sched
    Sched --> KVM
    KVM --> BP
    BP --> GPU
    Pages -.>|Physical Memory| GPU
```

### PagedAttention Version Comparison

vLLM implements two versions of PagedAttention with different performance characteristics:

| Aspect | PagedAttention V1 | PagedAttention V2 |
|--------|-------------------|-------------------|
| **Kernel Type** | Fused attention kernels | Optimized fused kernels |
| **Memory Access** | Standard page table lookup | Enhanced page table optimization |
| **Performance** | Baseline optimized | ~2.2x speedup over V1 |
| **Use Case** | General purpose | Production workloads |
| **Implementation** | `paged_attention_v1.cu` | `paged_attention_v2.cu` |

Sources: [csrc/attention/paged_attention_v1.cu:1-100](), [csrc/attention/paged_attention_v2.cu:1-100](), [docs/design/paged_attention.md:1-50]()

## KV Cache Manager

The KV Cache Manager (`kv_cache_manager.py`) is the core component responsible for tracking and managing the allocation of KV cache pages across all in-flight sequences.

### Responsibilities

| Responsibility | Description |
|----------------|-------------|
| **Block Allocation** | Allocates and deallocates cache blocks as sequences grow or complete |
| **Reference Counting** | Tracks how many sequences reference each physical block |
| **Page Table Management** | Maintains virtual-to-physical page mappings per sequence |
| **Cache Eviction** | Handles cache eviction when memory pressure occurs |

### Key Data Structures

```python
# Simplified representation of block metadata
class Block:
    block_id: int          # Physical block identifier
    page_indices: List[int]  # Virtual page indices mapping
    ref_count: int         # Number of sequences referencing this block
    is_computed: bool      # Whether this block has computed KV cache
```
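
A minimal sketch of reference-counted allocation over such blocks (illustrative only, not the actual KV cache manager) could look like this:

```python
# Illustrative reference counting over a fixed pool of block IDs.
# Mirrors the Block fields shown above; not the actual KV cache manager.
class SimpleBlockPool:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts: dict[int, int] = {}

    def allocate(self) -> int:
        block_id = self.free_blocks.pop()
        self.ref_counts[block_id] = 1
        return block_id

    def share(self, block_id: int) -> None:
        self.ref_counts[block_id] += 1   # e.g. prefix shared by another sequence

    def release(self, block_id: int) -> None:
        self.ref_counts[block_id] -= 1
        if self.ref_counts[block_id] == 0:
            del self.ref_counts[block_id]
            self.free_blocks.append(block_id)
```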

### Block Allocation Workflow

```mermaid
sequenceDiagram
    participant Scheduler
    participant KVM as KV Cache Manager
    participant BP as Block Pool
    participant GPU as GPU Memory
    
    Scheduler->>KVM: allocate_blocks(sequence_id, num_tokens)
    KVM->>BP: reserve_blocks(count)
    BP->>GPU: allocate_contiguous_pages
    GPU-->>BP: block_handles
    BP-->>KVM: allocated_blocks
    KVM-->>Scheduler: block_table
```

Sources: [vllm/v1/core/kv_cache_manager.py:1-150](), [vllm/v1/core/block_pool.py:1-100]()

## Block Pool

The Block Pool (`block_pool.py`) manages the physical memory allocation of KV cache pages on GPU memory.

### Memory Organization

| Parameter | Description | Typical Value |
|-----------|-------------|---------------|
| `block_size` | Number of tokens per cache block | 16 tokens |
| `num_blocks` | Total number of available blocks | Dynamic based on GPU memory |
| `num_layers` | Number of attention layers | Model-dependent (e.g., 32-80) |
| `num_kv_heads` | Number of key/value attention heads | Model-dependent |
| `head_dim` | Dimension of each attention head | 128 or 256 |

### Block Pool Operations

```mermaid
graph LR
    subgraph "Allocation States"
        Free[Free Blocks]
        Used[Used Blocks]
        Partial[Partially Filled]
    end
    
    Free -->|Allocate Block| Partial
    Partial -->|Append Token| Partial
    Partial -->|Block Fills Up| Used
    Used -->|Sequence Completes| Free
    Partial -->|Sequence Completes| Free
```

The block pool maintains three primary states:

1. **Free Blocks**: Available for immediate allocation
2. **Used Blocks**: Currently assigned to active sequences
3. **Partially Filled Blocks**: Blocks with available slots for new tokens

资料来源：[vllm/v1/core/block_pool.py:1-100]()
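
To make these states concrete, here is a minimal free-list pool (a hypothetical sketch, not the real `block_pool.py` implementation): blocks move out of the free list on allocation and return to it when a sequence completes.

```python
class SimpleBlockPool:
    """Toy free-list pool over `num_blocks` physical KV cache blocks."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # all blocks start free
        self.used_blocks: set[int] = set()

    def allocate(self) -> int:
        if not self.free_blocks:
            raise RuntimeError("No free KV cache blocks available")
        block_id = self.free_blocks.pop()
        self.used_blocks.add(block_id)
        return block_id

    def free(self, block_id: int) -> None:
        self.used_blocks.remove(block_id)
        self.free_blocks.append(block_id)            # block becomes reusable immediately

pool = SimpleBlockPool(num_blocks=4)
b = pool.allocate()
pool.free(b)
```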

## PagedAttention Kernel Implementation

### CUDA Kernel Architecture

Both PagedAttention V1 and V2 are implemented as optimized CUDA kernels that handle attention computation with page-table-based memory access.

#### V1 Kernel Flow

```mermaid
graph TD
    A[Load Query Tokens] --> B[Get Block Table Entry]
    B --> C[Load K/V from Pages]
    C --> D[Compute Attention Scores]
    D --> E[Softmax and Weighted Sum over V]
    E --> F[Write Attention Output]
```

#### V2 Kernel Optimizations

PagedAttention V2 introduces several optimizations:

- **Reduced Page Table Lookups**: Consolidates multiple lookups into single operations
- **Improved Memory Coalescing**: Better memory access patterns for KV cache
- **Warp-Level Primitives**: Utilizes warp-level reductions for faster softmax computation
- **Fused Operations**: Combines multiple operations into single kernel launches

资料来源：[csrc/attention/paged_attention_v1.cu:1-200](), [csrc/attention/paged_attention_v2.cu:1-200]()

### Kernel Parameters

| Parameter | Description | Range |
|-----------|-------------|-------|
| `THREADS_PER_BLOCK` | CUDA threads per block | 128-256 |
| `BLOCK_SIZE` | Tokens per block | 16 |
| `CACHE_BLOCK_SIZE` | Bytes per cache block | Dynamic |
| `num_heads` | Total query heads | Model-dependent |
| `head_dim` | Attention head dimension | 128/256 |

## Page Table Management

### Virtual-to-Physical Mapping

Each sequence maintains a page table that maps virtual token positions to physical cache blocks:

```python
# Virtual page table structure
page_table = [
    physical_block_id,  # virtual position 0-15
    physical_block_id,  # virtual position 16-31
    physical_block_id,  # virtual position 32-47
    ...
]
```

### Block Table Structure

```mermaid
graph TD
    subgraph "Sequence 1"
        S1_VP1[Virtual Page 0] --> S1_PP1[Physical Block 5]
        S1_VP2[Virtual Page 1] --> S1_PP2[Physical Block 12]
        S1_VP3[Virtual Page 2] --> S1_PP3[Physical Block 3]
    end
    
    subgraph "Sequence 2"
        S2_VP1[Virtual Page 0] --> S2_PP1[Physical Block 5]
        S2_VP2[Virtual Page 1] --> S2_PP2[Physical Block 7]
    end
```

The page table allows:
- **Non-contiguous storage**: Physical blocks need not be contiguous
- **Shared blocks**: Multiple sequences can share the same physical block (for prefixes)
- **Dynamic growth**: New pages allocated as sequence extends

资料来源：[vllm/v1/core/kv_cache_manager.py:100-200](), [docs/design/paged_attention.md:50-150]()
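
A token position is translated to a physical slot with simple integer arithmetic; the sketch below is illustrative only and assumes a block size of 16.

```python
def lookup_kv_slot(page_table: list[int], token_pos: int, block_size: int = 16):
    """Map a logical token position to (physical_block_id, offset_within_block)."""
    virtual_page = token_pos // block_size   # which entry of the page table
    offset = token_pos % block_size          # slot inside that block
    return page_table[virtual_page], offset

# Token 37 of a sequence whose page table is [5, 12, 3] lives in block 3, slot 5
print(lookup_kv_slot([5, 12, 3], token_pos=37))   # -> (3, 5)
```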

## Memory Management Strategies

### Allocation Policy

vLLM uses a dynamic allocation policy that balances memory efficiency and allocation overhead:

| Strategy | Description | Trade-off |
|----------|-------------|------------|
| **On-demand** | Allocate blocks as tokens are generated | Lower memory waste, higher overhead |
| **Pre-allocation** | Reserve blocks when sequence starts | Lower overhead, potential waste |
| **Hybrid** | Pre-allocate prefix, on-demand for new tokens | Balanced approach |

### Reference Counting

Reference counting prevents premature block deallocation:

1. **Initial allocation**: Reference count = 1
2. **Fork/continue**: Reference count incremented
3. **Completion**: Reference count decremented
4. **Free block**: When count reaches 0
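
A minimal sketch of this life cycle, reusing the hypothetical `Block` and `SimpleBlockPool` sketches from earlier in this page: forking a sequence bumps the count on every shared block, and a block only returns to the pool when its count drops to zero.

```python
def fork_sequence(block_table: list["Block"]) -> list["Block"]:
    """Share all blocks of the parent sequence with a new child sequence."""
    for block in block_table:
        block.ref_count += 1          # another sequence now references the block
    return list(block_table)          # the child starts with the same block table

def release_sequence(block_table: list["Block"], pool: "SimpleBlockPool") -> None:
    for block in block_table:
        block.ref_count -= 1
        if block.ref_count == 0:      # last reference gone: block is reusable
            pool.free(block.block_id)
```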

### Memory Reclamation

When GPU memory is exhausted:

```mermaid
graph TD
    A[Memory Pressure Detected] --> B{Sequence Can Evict?}
    B -->|Yes| C[Evict Cold Blocks]
    B -->|No| D[Wait or Reject Request]
    C --> E[Update Page Tables]
    E --> F[Allocate New Blocks]
```

资料来源：[vllm/v1/core/kv_cache_manager.py:200-300](), [vllm/v1/core/block_pool.py:100-200]()

## Integration with Scheduler

### Request Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Received: New Request
    Received --> Scheduled: Add to Queue
    Scheduled --> Allocating: Acquire Blocks
    Allocating --> Running: Blocks Available
    Running --> Running: Generate Token
    Running --> Completed: EOS Token
    Completed --> [*]: Free Blocks
    Running --> Waiting: Block Unavailable
    Waiting --> Running: Blocks Acquired
```

### Scheduling Integration Points

| Stage | KV Cache Interaction |
|-------|---------------------|
| **Request Arrival** | Pre-allocate blocks for known prefix length |
| **Token Generation** | Allocate new block when current fills up |
| **Sequence Completion** | Release all associated blocks |
| **Prefix Caching** | Share blocks with identical prefixes |

## Configuration Options

### Server Configuration

```bash
vllm serve <model> \
    --block-size 16 \
    --num-gpu-blocks-override 1000 \
    --gpu-memory-utilization 0.9 \
    --max-num-batched-tokens 8192
```

### Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `block_size` | Number of tokens per KV cache block | 16 |
| `gpu_memory_utilization` | Fraction of GPU memory for KV cache | 0.9 |
| `num_gpu_blocks_override` | Override auto-computed block count | Auto |
| `max_num_batched_tokens` | Maximum tokens in a single batch | Dynamic |

### Memory Calculation

```
total_cache_memory = num_blocks × block_size × num_layers × 2 × num_kv_heads × head_dim × dtype_size
```

Where:
- `num_blocks` = GPU memory × utilization / per_block_memory
- `2` accounts for both K and V caches
- `dtype_size` = 2 bytes for FP16, 1 byte for INT8, etc.

资料来源：[docs/design/paged_attention.md:150-250]()
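
As a worked example, the snippet below plugs illustrative Llama-2-7B-style values (32 layers, 32 KV heads, head_dim 128, FP16) into the formula; actual numbers depend on the model config and GPU.

```python
# Per-block KV memory = block_size * num_layers * 2 * num_kv_heads * head_dim * dtype_size
block_size, num_layers, num_kv_heads, head_dim, dtype_size = 16, 32, 32, 128, 2
per_block_bytes = block_size * num_layers * 2 * num_kv_heads * head_dim * dtype_size
print(per_block_bytes / 2**20)          # 8.0 MiB per block

# With a 24 GiB GPU at 0.9 utilization and ~14 GiB of FP16 weights,
# roughly 8 GiB remain for the KV cache:
cache_budget = 8 * 2**30
print(cache_budget // per_block_bytes)  # 1024 blocks, i.e. ~16k cacheable tokens
```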

## Performance Characteristics

### Throughput Improvements

PagedAttention enables significant performance improvements compared to traditional contiguous allocation:

| Metric | Traditional | PagedAttention V1 | PagedAttention V2 |
|--------|-------------|-------------------|-------------------|
| **Memory Waste** | 30-60% | <5% | <5% |
| **Throughput** | Baseline | ~2.1x | ~2.2x |
| **BS=1 Latency** | Baseline | ~1.9x | ~2.0x |
| **BS=16+ Latency** | Baseline | ~2.0x | ~2.2x |

### Scalability

The system scales efficiently with:

- **Longer sequences**: O(seq_len) memory, no fragmentation
- **More concurrent requests**: Block sharing for shared prefixes
- **Larger models**: Better memory utilization per GPU

资料来源：[csrc/attention/paged_attention_v2.cu:200-300](), [docs/design/paged_attention.md:250-350]()

## Implementation Details

### Block Metadata Tracking

```python
# Core block metadata structure (simplified)
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    block_id: int
    device: "Device"        # Placement of the block (e.g., GPU vs. CPU swap space)
    block_size: int
    num_tokens: int = 0
    
    # Physical location
    physical_block_id: Optional[int] = None
    
    # Content tracking (used for prefix caching)
    content_hash: Optional[int] = None
    computed: bool = False
    
    # Reference counting for shared blocks
    ref_count: int = 0
```

### Attention Computation Flow

```mermaid
graph TD
    subgraph "Pre-computation"
        Q[Query Tensors] --> QS[Query Split]
        K[Key Tensors] --> KS[Key Split]
        V[Value Tensors] --> VS[Value Split]
    end
    
    subgraph "Paged Attention"
        QS --> AL[Attention Layer]
        KS --> AL
        VS --> AL
        AL --> PT[Page Table Lookup]
    end
    
    subgraph "Output"
        PT --> OUT[Attention Output]
        OUT --> SOFTMAX[Softmax]
        SOFTMAX --> GEMM[Final GEMM]
    end
```

## Best Practices

### Memory Configuration

1. **Set appropriate GPU memory utilization** based on other memory needs (model weights, activations)
2. **Adjust block size** for typical sequence lengths (larger blocks = less overhead for long sequences)
3. **Monitor fragmentation** using vLLM metrics

### Request Optimization

1. **Use consistent prefixes** to enable block sharing across requests
2. **Batch similar requests** to maximize cache hit rates
3. **Set appropriate max sequence length** to avoid excessive block allocation

### Debugging

```bash
# Enable verbose logging
export VLLM_LOGGING_LEVEL=DEBUG

# Inspect KV cache usage via the Prometheus metrics endpoint
curl http://localhost:8000/metrics
```

## Related Components

| Component | File | Role |
|-----------|------|------|
| Attention Backend | `vllm/attention/backends/` | Pluggable attention implementations |
| Scheduler | `vllm/v1/core/sched.py` | Coordinates cache allocation with scheduling |
| Model Runner | `vllm/v1/core/model_runner.py` | Executes attention with managed KV cache |
| Worker | `vllm/v1/worker/worker.py` | GPU-side cache management |

## References

- vLLM Paper: ["Efficient Memory Management for Large Language Model Serving with PagedAttention"](https://arxiv.org/abs/2309.06180)
- Original Implementation: [csrc/attention/paged_attention_v1.cu]()
- Optimized Implementation: [csrc/attention/paged_attention_v2.cu]()
- Design Documentation: [docs/design/paged_attention.md]()
- KV Cache Manager: [vllm/v1/core/kv_cache_manager.py]()
- Block Pool: [vllm/v1/core/block_pool.py]()

---

<a id='page-7'></a>

## Attention Backends and Kernels

### 相关页面

相关主题：[PagedAttention and KV Cache Management](#page-6), [Quantization Support](#page-8)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [vllm/v1/attention/backends/flash_attn.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/flash_attn.py)
- [vllm/v1/attention/backends/flashinfer.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/flashinfer.py)
- [vllm/v1/attention/backends/mla/flashmla.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/mla/flashmla.py)
- [vllm/v1/attention/backends/registry.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/registry.py)
- [docs/design/attention_backends.md](https://github.com/vllm-project/vllm/blob/main/docs/design/attention_backends.md)
</details>

# Attention Backends and Kernels

## Overview

The Attention Backends and Kernels system in vLLM provides a pluggable, hardware-accelerated implementation of attention mechanisms for Large Language Model (LLM) inference. This abstraction layer enables vLLM to leverage different optimized attention implementations (FlashAttention, FlashInfer, FlashMLA) depending on hardware capabilities and model requirements, while maintaining a unified interface for the rest of the engine.

## Architecture

### High-Level Design

vLLM implements a backend registry pattern that allows runtime selection of the optimal attention implementation. The attention system is designed to support both the legacy v0 engine and the optimized v1 engine architecture.

```mermaid
graph TD
    A[Attention Interface] --> B[Attention Backends Registry]
    B --> C[FlashAttention Backend]
    B --> D[FlashInfer Backend]
    B --> E[FlashMLA Backend]
    C --> F[CUDA Kernels]
    D --> G[Template-based Kernels]
    E --> H[MLA-optimized Kernels]
    
    I[Model Request] --> J[Scheduler]
    J --> K[Attention Layer]
    K --> A
```

### Backend Selection Flow

The registry-based architecture enables automatic backend selection based on hardware and configuration:

```mermaid
graph TD
    A[Engine Initialization] --> B{Check user-specified backend}
    B -->|Specified| C[Validate backend compatibility]
    B -->|Auto| D{Detect GPU Architecture}
    D -->|H100/H200| E[Select FlashAttention3]
    D -->|A100/A10| F[Select FlashAttention2]
    D -->|Other| G[Select FlashAttention2]
    C -->|Valid| H[Initialize Backend]
    C -->|Invalid| I[Raise Configuration Error]
    E --> H
    F --> H
    G --> H
```

## Attention Backend Components

### Backend Registry

The `registry.py` module provides the central factory for attention backend selection:

| Component | Responsibility |
|-----------|----------------|
| `AttentionBackend` | Abstract base class defining the backend interface |
| `get_available_backends()` | Returns list of compiled/available backends |
| `get_backend(name)` | Retrieves a specific backend by name |
| `AttentionImpl` | Concrete implementation wrapper per backend |
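
The registry pattern itself amounts to a name-to-class mapping; the sketch below is illustrative and does not mirror the exact `registry.py` API.

```python
# Illustrative backend registry (names and signatures are assumptions)
ATTENTION_BACKENDS: dict[str, type] = {}

def register_backend(cls: type) -> type:
    """Class decorator that makes a backend discoverable by name."""
    ATTENTION_BACKENDS[cls.get_name()] = cls
    return cls

def get_available_backends() -> list[str]:
    return sorted(ATTENTION_BACKENDS)

def get_backend(name: str) -> type:
    try:
        return ATTENTION_BACKENDS[name]
    except KeyError as exc:
        raise ValueError(f"Unknown attention backend: {name!r}") from exc
```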

### FlashAttention Backend

Located at `vllm/v1/attention/backends/flash_attn.py`, this backend provides:

- FlashAttention 2/3 optimized CUDA kernels
- Support for head dimensions 64, 80, 96, 128, 160, 192, 256
- Paged attention integration
- Cross-attention support for encoder-decoder models

**Key Features:**

| Feature | Description |
|---------|-------------|
| Fused kernels | Combines attention computation steps |
| Dynamic persistent memory | Reuses KV cache blocks efficiently |
| Strided bias support | Enables sliding window attention |

### FlashInfer Backend

Located at `vllm/v1/attention/backends/flashinfer.py`, this backend offers:

- Template-based kernel generation for flexibility
- Improved performance for specific head dimensions
- Better integration with vLLM's block manager
- Support for custom attention patterns

### FlashMLA Backend

Located at `vllm/v1/attention/backends/mla/flashmla.py`, this backend is optimized for:

- **Multi-head Latent Attention (MLA)** architectures
- Low-rank KV cache compression
- Reduced memory bandwidth usage
- Optimized for DeepSeek-style models

## Attention Interface

### Core Abstraction

All attention backends implement a common interface defined by the `Attention` class:

```python
class Attention:
    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: int,
        alibi_slopes: Optional[List[float]],
        cache_config: Optional[CacheConfig],
        block_size: int,
    ) -> None: ...
    
    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: Optional[torch.Tensor],
        attn_metadata: AttentionMetadata,
    ) -> torch.Tensor: ...
```

### Attention Metadata

The `AttentionMetadata` structure carries runtime information required for attention computation:

| Field | Type | Purpose |
|-------|------|---------|
| `seq_lens` | List[int] | Sequence lengths for each request |
| `max_seq_len` | int | Maximum sequence length in batch |
| `block_tables` | torch.Tensor | Paged attention block mappings |
| `seq_start_idx` | int | Starting index in output buffer |

## Paged Attention Integration

vLLM's attention system integrates tightly with paged memory management:

```mermaid
graph LR
    A[Logical KV Cache] --> B[Physical KV Blocks]
    C[Query Tensors] --> D[Block Lookup]
    D --> E[Attention Computation]
    F[Block Table] --> D
    B --> E
    E --> G[Output Tensors]
```

### Block Table Structure

| Block Index | Physical Address | Valid Length |
|-------------|------------------|--------------|
| 0 | 15 | 16 |
| 1 | 23 | 16 |
| 2 | 7 | 16 |
| 3 | -1 (pending) | 0 |

## Backend Configuration

### Runtime Selection

Backends can be selected through multiple mechanisms:

| Method | Priority | Example |
|--------|----------|---------|
| Environment variable | Lowest | `VLLM_ATTENTION_BACKEND=FLASHINFER` |
| Model config | Medium | `"attention_backend": "flash_attn"` |
| CLI argument | Highest | `--attention-backend flashinfer` |
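
For example, the environment-variable path can also be exercised from Python, provided the variable is set before the engine is constructed (the model name is illustrative):

```python
import os

# Must be set before vLLM selects its attention backend
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")
```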

### Configuration Parameters

| Parameter | Backend | Description |
|-----------|---------|-------------|
| `head_size` | All | Dimension of each attention head |
| `num_kv_heads` | All | Number of key/value heads (for GQA) |
| `scale` | All | Attention scaling factor |
| `sliding_window` | FlashAttn | Sliding window size |
| `page_size` | FlashInfer | KV cache block size |

## Performance Characteristics

### Memory Efficiency

| Backend | KV Cache Format | Memory Overhead |
|---------|-----------------|------------------|
| FlashAttention | Standard | Baseline |
| FlashInfer | Blocked | Similar to FlashAttn |
| FlashMLA | Low-rank | 50-70% reduction |

### Throughput Optimization

- **Kernel Fusion**: Combines multiple operations to reduce memory bandwidth
- **Persistent Kernels**: Keep attention state resident on-chip across decode steps
- **Async Execution**: Overlaps attention with other operations

## Implementation Details

### FlashAttention Kernel Flow

```mermaid
sequenceDiagram
    participant Q as Query
    participant K as Key
    participant V as Value
    participant Kernel as FlashAttn Kernel
    participant Cache as KV Cache
    
    Q->>Kernel: Load Q tiles
    K->>Kernel: Load K tiles
    V->>Kernel: Load V tiles
    Kernel->>Cache: Update (if write)
    Kernel->>Kernel: Compute attention scores
    Kernel->>Cache: Read (if read)
    Kernel->>Kernel: softmax & scale
    Kernel-->>Q: Output attention
```

### Backend Initialization Sequence

```mermaid
graph TD
    A[EngineArgs parse] --> B[Attention backend selection]
    B --> C[Backend registry lookup]
    C --> D[Load backend module]
    D --> E[Initialize CUDA streams]
    E --> F[Allocate persistent buffers]
    F --> G[Register attention layers]
    G --> H[Ready for inference]
```

## Extending Attention Backends

To implement a custom attention backend:

1. **Create backend class**: Inherit from `AttentionBackend`
2. **Implement required methods**: `get_name()`, `get_impl()`
3. **Register backend**: Add to `ATTENTION_BACKENDS` registry
4. **Implement kernels**: Write CUDA/C++ kernels or wrap existing libraries

```python
class CustomAttentionBackend(AttentionBackend):
    @staticmethod
    def get_name() -> str:
        return "custom_attention"
    
    @staticmethod
    def get_impl() -> type["AttentionImpl"]:
        # Returns the implementation class (not an instance) for this backend
        return CustomAttentionImpl
```

## Troubleshooting

### Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| CUDA out of memory | Large batch/sequence | Reduce `gpu_memory_utilization` |
| Incorrect outputs | Wrong backend selected | Verify with `--attention-backend` flag |
| Kernel launch failure | Unsupported head size | Use supported head dimensions |
| Slow inference | Suboptimal backend | Benchmark available backends |

### Debugging Tips

- Set `VLLM_LOGGING_LEVEL=DEBUG` for attention kernel timing
- Use `VLLM_ATTENTION_BACKEND=FLASH_ATTN` to force specific backend
- Check `nvidia-smi` for kernel execution times

## Summary

The Attention Backends and Kernels system provides vLLM's computational core for transformer attention. Through the registry pattern, it enables seamless switching between optimized implementations while maintaining a stable interface for the rest of the engine. The pluggable architecture supports both established backends (FlashAttention, FlashInfer) and specialized implementations (FlashMLA) optimized for specific model architectures.

---

<a id='page-8'></a>

## Quantization Support

### 相关页面

相关主题：[Model Architecture Support](#page-10), [Attention Backends and Kernels](#page-7)

<details>
<summary>Related Source Files</summary>

以下源码文件用于生成本页说明：

- [vllm/model_executor/layers/quantization/fp8.py](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/fp8.py)
- [vllm/model_executor/layers/quantization/base_config.py](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/base_config.py)
- [vllm/model_executor/layers/quantization/gguf.py](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/gguf.py)
- [docs/features/quantization/README.md](https://github.com/vllm-project/vllm/blob/main/docs/features/quantization/README.md)
- [csrc/quantization](https://github.com/vllm-project/vllm/blob/main/csrc/quantization)
</details>

# Quantization Support

## Overview

vLLM provides comprehensive quantization support to reduce model memory footprint and accelerate inference through various quantization schemes. Quantization compresses model weights from higher precision (typically FP16/BF16) to lower precision formats (such as INT8, INT4, or FP8), enabling larger models to run on limited GPU resources while maintaining acceptable accuracy.

The quantization system in vLLM is designed with a modular, extensible architecture that supports multiple quantization methods through a common abstraction layer. This allows users to easily switch between different quantization schemes or implement custom quantization strategies.

资料来源：[docs/features/quantization/README.md]()

## Architecture

### Quantization System Components

The quantization system consists of several interconnected components:

```mermaid
graph TD
    A[Model Loading] --> B[QuantizationConfig]
    B --> C[QuantizationMethod]
    C --> D[Quantized Linear Layers]
    C --> E[Quantized Embedding Layers]
    D --> F[CUDA/ROCm Kernels]
    E --> F
    G[Weight Loading] --> H[Pre-quantized Weights]
    G --> I[On-the-fly Quantization]
```

### Quantization Base Architecture

The base quantization architecture defines a common interface that all quantization methods must implement:

```mermaid
classDiagram
    class QuantizationConfig {
        +get_supported_methods() List[QuantizationMethods]
        +get_override_quant_config() Optional[dict]
        +get_quant_config() dict
        +verify_quant_config() None
    }
    
    class QuantizationMethods {
        <<enumeration>>
        FP8
        GGUF
        AWQ
        GPTQ
    }
    
    class QuantizedLinear {
        <<interface>>
        +create_weights()
        +apply_weights()
    }
    
    QuantizationConfig --> QuantizationMethods
    QuantizationMethods --> QuantizedLinear
```

资料来源：[vllm/model_executor/layers/quantization/base_config.py]()

### Layer Structure

Each quantization implementation defines its own quantized layer classes:

| Layer Type | Purpose | Quantization Scope |
|------------|---------|-------------------|
| `QuantizedLinear` | Matrix multiplication with quantized weights | Weight-only or Activation+Weight |
| `QuantizedEmbedding` | Lookup table with quantized weights | Weight-only |
| `QuantizedMoE` | Mixture-of-Experts with quantized components | Per-expert quantization |

资料来源：[vllm/model_executor/layers/quantization/fp8.py]()

## Supported Quantization Methods

### FP8 (8-bit Floating Point)

FP8 quantization uses 8-bit floating point representation with two formats:

| Format | Exponent Bits | Mantissa Bits | Use Case |
|--------|--------------|---------------|----------|
| E4M3 | 4 | 3 | Activations and weights |
| E5M2 | 5 | 2 | Gradients and optimizer states |

FP8 quantization is particularly well-supported in vLLM with optimized CUDA kernels for inference acceleration.

资料来源：[vllm/model_executor/layers/quantization/fp8.py]()
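
For intuition only, the sketch below performs per-tensor dynamic FP8 (E4M3) quantization with plain PyTorch; vLLM's FP8 path instead uses fused CUDA kernels with scales taken from the checkpoint or computed at load time.

```python
import torch

def quantize_fp8_e4m3(w: torch.Tensor):
    """Per-tensor dynamic FP8 quantization (illustrative sketch, not vLLM's kernel path)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max        # 448.0 for E4M3
    scale = w.abs().max().clamp(min=1e-12) / fp8_max      # per-tensor scale
    w_fp8 = (w / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8, scale = quantize_fp8_e4m3(w)
# Dequantize for reference: w ≈ w_fp8.to(torch.float16) * scale
```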

### GGUF (GPT-Generated Unified Format)

GGUF is a quantized model format commonly used with llama.cpp. vLLM supports loading GGUF-quantized models directly:

```python
# Load GGUF quantized model
llm = LLM(model="unsloth/Qwen3-0.6B-GGUF:Q4_K_M")
```

GGUF supports multiple quantization levels:

| Quantization Type | Bits per Parameter | Description |
|-------------------|---------------------|-------------|
| Q2_K | ~2.5 | 2-bit quantization with 4-bit key-values |
| Q3_K | ~3.5 | 3-bit quantization with 4-bit key-values |
| Q4_K | ~4.5 | 4-bit quantization with 8-bit key-values |
| Q5_K | ~5.5 | 5-bit quantization with 8-bit key-values |
| Q6_K | ~6.5 | 6-bit quantization |
| Q8_0 | ~8.0 | 8-bit quantization (baseline) |

资料来源：[vllm/model_executor/layers/quantization/gguf.py]()

### AWQ (Activation-Aware Weight Quantization)

AWQ identifies weights with significant activation contributions and preserves them at higher precision while quantizing others aggressively.

### GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ performs post-training quantization with optional layer-wise precision adjustment and GPU optimization.

## Quantization Configuration

### Configuration Parameters

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `quantization` | str | Quantization method name | None |
| `quantization_param_path` | str | Path to quantization parameters file | None |
| `dtype` | str | Model precision (if not pre-quantized) | auto |
| `kv_cache_dtype` | str | KV cache quantization format | auto |

### Enabling Quantization

Quantization is enabled through the `--quantization` CLI argument or `quantization` parameter in `AsyncEngineArgs`:

```bash
vllm serve model/path --quantization fp8
```

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="fp8",
    gpu_memory_utilization=0.9
)
```

### Mixed Quantization

Some layers may remain in higher precision when quantization cannot be applied uniformly:

| Layer Type | Quantization Behavior |
|------------|----------------------|
| Input/Output embeddings | Often kept in FP16/BF16 |
| Output projection | May use different precision |
| Attention softmax | Usually in FP32 for stability |

## Implementation Details

### Weight Loading Pipeline

```mermaid
graph LR
    A[Model Checkpoint] --> B{Pre-quantized?}
    B -->|Yes| C[Load Quantized Weights]
    B -->|No| D[Dynamic Quantization]
    C --> E[Apply Quantization Config]
    D --> E
    E --> F[Initialize Quantized Layer]
    F --> G[Verify Weight Shape]
    G --> H[CUDA Kernel Ready]
```

### C++ Backend (cuBLAS/cutlass)

High-performance quantization kernels are implemented in C++ and CUDA:

| Component | Location | Purpose |
|-----------|----------|---------|
| FP8 GEMM | `csrc/quantization/fp8/` | FP8 matrix multiplication |
| W8A8 GEMM | `csrc/quantization/fp8/` | INT8 weight, INT8 activation |
| W4A16 GEMM | `csrc/quantization/` | INT4 weight, FP16 activation |
| Dequantization | `csrc/quantization/` | Convert quantized to compute dtype |

资料来源：[csrc/quantization]()

### Quantized Linear Layer Implementation

The `QuantizedLinear` layer handles the core computation:

```python
class QuantizedLinear(QuantizedLayer):
    """Base class for quantized linear layers."""
    
    def __init__(
        self,
        input_size: int,
        output_size: int,
        quantization_config: QuantizationConfig,
        bias: bool = False,
    ):
        self.input_size = input_size
        self.output_size = output_size
        self.quantization_config = quantization_config
        
    def create_weights(self):
        """Initialize quantized weight tensors."""
        raise NotImplementedError
        
    def forward(self, input_):
        """Forward pass with quantized computation."""
        raise NotImplementedError
```

资料来源：[vllm/model_executor/layers/quantization/base_config.py]()
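
For intuition, a toy weight-only INT8 linear layer might look as follows; this is an illustration of the `create_weights`/`forward` split under assumed names, not vLLM's implementation, and real kernels fuse the dequantization into the GEMM.

```python
from typing import Optional

import torch

class Int8WeightOnlyLinear:
    """Illustrative weight-only INT8 linear layer (not the vLLM implementation)."""

    def __init__(self, input_size: int, output_size: int):
        self.input_size = input_size
        self.output_size = output_size
        self.qweight: Optional[torch.Tensor] = None   # int8 weights
        self.scale: Optional[torch.Tensor] = None     # per-output-channel scales

    def create_weights(self, weight_fp16: torch.Tensor) -> None:
        # Symmetric per-channel quantization: one scale per output row
        self.scale = weight_fp16.abs().amax(dim=1, keepdim=True) / 127.0
        self.qweight = torch.round(weight_fp16 / self.scale).to(torch.int8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize on the fly; optimized kernels avoid materializing the FP16 weight
        w = self.qweight.to(x.dtype) * self.scale.to(x.dtype)
        return x @ w.t()
```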

## Usage Examples

### Loading Pre-quantized Models

```python
from vllm import LLM

# FP8 quantized model
llm_fp8 = LLM(
    model="meta-llama/Llama-2-70b-hf",
    quantization="fp8",
    tensor_parallel_size=4
)

# GGUF quantized model
llm_gguf = LLM(
    model="TheBloke/Llama-2-70B-Chat-GGUF",
    quantization="gguf",
    tokenizer="meta-llama/Llama-2-70b-chat"
)
```

### Quantization with Different Precisions

```python
# Load with specific precision settings
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    quantization="fp8",
    dtype="half",  # Compute precision
    kv_cache_dtype="fp8_e4m3"  # KV cache precision
)
```

### CLI Usage

```bash
# Serve with FP8 quantization
vllm serve meta-llama/Llama-2-70b-hf \
    --quantization fp8 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95

# Serve GGUF model directly
vllm serve TheBloke/Llama-2-7B-Chat-GGUF:Q4_K_M
```

## Performance Considerations

### Memory Reduction

| Quantization | Memory Reduction | Quality Impact |
|--------------|-------------------|----------------|
| FP8 (E4M3) | ~50% | Minimal |
| INT8 | ~50% | Low |
| INT4 | ~75% | Moderate |
| INT2 | ~87.5% | High |

### Throughput Impact

Quantization improves throughput through:

1. **Increased batch size**: More sequences fit in GPU memory
2. **Higher memory bandwidth utilization**: Smaller weights load faster
3. **Accelerated compute**: INT8/FP8 operations on tensor cores

### Accuracy Considerations

- **Post-training quantization (PTQ)**: May introduce accuracy degradation
- **Activation-aware methods (AWQ)**: Better preserves model capabilities
- **Calibration**: Some methods require calibration data for optimal accuracy

## Extension Points

### Custom Quantization Methods

To implement a custom quantization method, extend the base classes:

```python
from vllm.model_executor.layers.quantization import QuantizationConfig

class CustomQuantizationConfig(QuantizationConfig):
    """Custom quantization configuration."""
    
    @staticmethod
    def get_name() -> str:
        return "custom_quant"
    
    @classmethod
    def get_supported_methods(cls) -> list[str]:
        return ["weight_only"]
    
    def get_quant_config(self) -> dict:
        return {"method": "weight_only"}
```

### Registering Custom Quantization

Custom quantizations must be registered in the quantization registry to be discoverable at runtime.

## Summary

vLLM's quantization support provides a flexible, extensible system for serving large language models with reduced memory footprint. The architecture separates concerns through well-defined interfaces, allowing seamless integration of new quantization methods while maintaining high performance through optimized CUDA kernels.

Key takeaways:
- Multiple quantization formats supported (FP8, GGUF, AWQ, GPTQ)
- Modular architecture enables easy extension
- Optimized C++/CUDA kernels for inference acceleration
- Simple API through CLI and Python interface
- Memory reduction up to 75% with INT4 quantization

---

<a id='page-9'></a>

## Distributed Inference and Parallelism

### 相关页面

相关主题：[Scheduling and Request Processing](#page-5), [PagedAttention and KV Cache Management](#page-6)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [vllm/distributed/parallel_state.py](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py)
- [vllm/distributed/device_communicators/cuda_communicator.py](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/device_communicators/cuda_communicator.py)
- [vllm/distributed/kv_transfer/kv_connector/base.py](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/kv_transfer/kv_connector/base.py)
- [vllm/config/parallel.py](https://github.com/vllm-project/vllm/blob/main/vllm/config/parallel.py)
- [docs/serving/parallelism_scaling.md](https://github.com/vllm-project/vllm/blob/main/docs/serving/parallelism_scaling.md)
- [vllm/distributed/kv_transfer/kv_transfer_engine.py](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/kv_transfer/kv_transfer_engine.py)
- [vllm/entrypoints/llm.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py)
</details>

# Distributed Inference and Parallelism

vLLM provides comprehensive support for distributed inference and parallelism, enabling efficient serving of large language models across multiple GPUs and nodes. This document covers the architecture, configuration, and implementation details of vLLM's distributed computing capabilities.

## Overview

vLLM's distributed inference system enables horizontal scaling of LLM workloads by distributing model computation across multiple devices. The system supports multiple parallelism strategies, including tensor parallelism, pipeline parallelism, and data parallelism, along with specialized features like disaggregated prefill/decode and KV cache transfer.

The core components of distributed inference in vLLM include:

| Component | Purpose |
|-----------|---------|
| Parallel State Manager | Coordinates process groups for distributed communication |
| Device Communicators | Handle low-level tensor communication (NCCL, CUDA) |
| KV Transfer System | Enables disaggregated prefill/decode architectures |
| Configuration System | Manages parallelism parameters and device placement |

## Parallelism Strategies

vLLM supports three primary parallelism strategies, each addressing different aspects of distributed computation.

### Tensor Parallelism (TP)

Tensor parallelism splits individual weight matrices across multiple GPUs, allowing computation of large matrices that would not fit in a single device's memory. This is particularly effective for dense layers like attention and feed-forward networks.

Tensor parallelism requires:
- NVIDIA GPUs with NCCL support
- High-bandwidth interconnects (NVLink preferred)
- A tensor-parallel size that evenly divides the number of attention heads, since each rank holds only its shard of the weights
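
A minimal sketch of the underlying idea, assuming a simple column-wise split of one weight matrix (vLLM's column/row-parallel layers handle sharded weight loading and NCCL communication internally):

```python
import torch

def column_parallel_linear(x: torch.Tensor, full_weight: torch.Tensor,
                           rank: int, world_size: int) -> torch.Tensor:
    """Compute this rank's slice of a linear layer's output.

    The full weight is shown only to make the split explicit; in a real
    deployment each rank only ever holds its own shard.
    """
    out_features = full_weight.shape[0]
    shard = out_features // world_size                       # requires even divisibility
    w_local = full_weight[rank * shard:(rank + 1) * shard]   # this rank's slice of output features
    return x @ w_local.t()                                   # partial output, later all-gathered
```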

### Pipeline Parallelism (PP)

Pipeline parallelism distributes layers (stages) of the model across different GPUs or nodes. This approach reduces memory requirements per device while maintaining high GPU utilization through micro-batch pipelining.

### Data Parallelism (DP)

Data parallelism replicates the entire model across multiple GPUs, with each replica processing different batches of requests. This is the simplest form of parallelism and scales throughput linearly with the number of replicas.

## Parallel State Management

The `ParallelState` class in `vllm/distributed/parallel_state.py` is the central coordinator for distributed execution.

```mermaid
graph TD
    A[ParallelState] --> B[Tensor Parallel Group]
    A --> C[Pipeline Parallel Group]
    A --> D[Data Parallel Group]
    A --> E[World Communicator]
    B --> F[Rank 0, 1, 2, 3]
    C --> G[Stage 0, Stage 1]
    D --> H[Replica 1, Replica 2]
```

### Process Group Initialization

Parallel state is initialized through `init_distributed_environment()`, which creates the necessary process groups for communication.

```python
# From vllm/distributed/parallel_state.py:45-78
def init_distributed_environment(
    rank: int,
    world_size: int,
    local_rank: int,
    init_method: str = "env://",
    backend: str = "nccl"
):
    # Initialize distributed context
    torch.distributed.init_process_group(
        backend=backend,
        init_method=init_method,
        rank=rank,
        world_size=world_size
    )
```

### Rank and World Size Management

| Parameter | Description |
|-----------|-------------|
| `rank` | Unique identifier for each process in the distributed group |
| `world_size` | Total number of processes in the distributed group |
| `local_rank` | Rank of the process within its local node |

资料来源：[vllm/distributed/parallel_state.py:45-78](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py)

## Device Communication

### CUDA Communicator

The `CUDACommunicator` class in `vllm/distributed/device_communicators/cuda_communicator.py` provides NCCL-based communication primitives optimized for CUDA tensors.

```mermaid
graph LR
    A[Tensor] -->|all_reduce| B[Aggregated Tensor]
    A -->|broadcast| C[Same Tensor on All Ranks]
    A -->|reduce_scatter| D[Partitioned Results]
    E[Partial Tensors] -->|all_gather| F[Complete Tensor]
```

### Supported Communication Primitives

| Primitive | Function | Use Case |
|-----------|----------|----------|
| `all_reduce` | Reduce tensors across all ranks | Combining partial layer outputs across TP ranks |
| `broadcast` | Send tensor from one rank to all | Distributing weights and input metadata from the driver rank |
| `all_gather` | Collect tensors from all ranks | Output aggregation |
| `reduce_scatter` | Reduce and partition across ranks | Sharded reduction of partial results |

```python
# From vllm/distributed/device_communicators/cuda_communicator.py:23-65
class CUDACommunicator:
    def __init__(self, group: ProcessGroup):
        self.group = group
        self.world_size = group.size()
        self.rank = group.rank()
    
    def all_reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        # NCCL all-reduce implementation
        torch.distributed.all_reduce(tensor, group=self.group)
        return tensor
```

资料来源：[vllm/distributed/device_communicators/cuda_communicator.py:23-65](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/device_communicators/cuda_communicator.py)

### Communication Patterns in Distributed Inference

```mermaid
graph TD
    subgraph "Tensor Parallel Region"
        A[Attention AllReduce] --> B[FFN AllReduce]
        B --> C[AllReduce Output]
    end
    
    subgraph "Pipeline Parallel Region"
        D[Send Hidden States] --> E[Receive Hidden States]
        E --> F[Run Next Stage Layers]
    end
    
    subgraph "Data Parallel Region"
        G[Route Requests Across Replicas] --> H[Independent Per-Replica KV Caches]
    end
```

## Disaggregated Prefill and Decode

vLLM supports disaggregated prefill/decode architectures where prefill (initial prompt processing) and decode (token generation) stages run on separate GPU clusters. This enables independent scaling of prefill and decode resources.

### KV Transfer Architecture

The KV transfer system enables sharing of KV cache between prefill and decode instances.

```mermaid
graph LR
    A[Prefill Instance] -->|KV Transfer| B[Shared Storage]
    B -->|KV Load| C[Decode Instance]
    
    subgraph "Prefill Process"
        A1[Tokenize Prompts] --> A2[Process Prefill]
        A2 --> A3[Save KV Cache]
    end
    
    subgraph "Decode Process"
        C1[Load KV Cache] --> C2[Generate Tokens]
        C2 --> C3[Streaming Output]
    end
```

资料来源：[examples/disaggregated/example_connector/README.md](https://github.com/vllm-project/vllm/blob/main/examples/disaggregated/example_connector/README.md)

### KV Connector Base

The `KVConnectorBase` class defines the interface for KV cache transfer implementations.

```python
# From vllm/distributed/kv_transfer/kv_connector/base.py:15-85
class KVConnectorBase(ABC):
    def __init__(self, kv_transfer_config: KVTransferParams):
        self.config = kv_transfer_config
    
    @abstractmethod
    def load_kv_cache(
        self,
        requests: List[TransferJob],
        scheduler: Any
    ) -> None:
        """Load KV cache from external source during prefill"""
        pass
    
    @abstractmethod
    def save_kv_cache(
        self,
        blocks: List[PhysicalTokenBlock],
        kv_pair: KVCache,
        callback: Callable
    ) -> TransferJob:
        """Save KV cache to external storage during decode"""
        pass
```

| Method | Purpose |
|--------|---------|
| `load_kv_cache` | Restore KV cache from external storage on the decode instance |
| `save_kv_cache` | Persist KV cache produced on the prefill instance |
| `profile_num_available_blocks` | Determine available transfer capacity |

资料来源：[vllm/distributed/kv_transfer/kv_connector/base.py:15-85](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/kv_transfer/kv_connector/base.py)

### Example Connector Implementation

The `ExampleConnector` provides a reference implementation for KV transfer:

```python
# From examples/disaggregated/example_connector/
class ExampleConnector(KVConnectorBase):
    def __init__(self, kv_transfer_config: KVTransferParams):
        super().__init__(kv_transfer_config)
        self.local_storage = "./local_storage"
```

The connector workflow:
1. **Prefill Phase**: Process prompts and save KV state to `local_storage` directory
2. **Decode Phase**: Load KV state from storage and continue generation

资料来源：[examples/disaggregated/example_connector/README.md](https://github.com/vllm-project/vllm/blob/main/examples/disaggregated/example_connector/README.md)
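
In the same spirit, a toy file-based connector could persist KV tensors with `torch.save`/`torch.load`; the class below is a hypothetical illustration, not the actual `ExampleConnector` code.

```python
import os

import torch

class FileKVConnector:
    """Toy connector that saves/loads KV tensors via local files (illustration only)."""

    def __init__(self, storage_dir: str = "./local_storage"):
        self.storage_dir = storage_dir
        os.makedirs(storage_dir, exist_ok=True)

    def save_kv_cache(self, request_id: str, kv_blocks: list[torch.Tensor]) -> str:
        path = os.path.join(self.storage_dir, f"{request_id}.pt")
        torch.save(kv_blocks, path)     # prefill instance persists its computed KV
        return path

    def load_kv_cache(self, request_id: str) -> list[torch.Tensor]:
        path = os.path.join(self.storage_dir, f"{request_id}.pt")
        return torch.load(path)         # decode instance restores KV and continues generation
```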

## Configuration

### Parallel Configuration Parameters

The `ParallelConfig` class in `vllm/config/parallel.py` manages all parallelism settings.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tensor_parallel_size` | int | 1 | Number of GPUs for tensor parallelism |
| `pipeline_parallel_size` | int | 1 | Number of pipeline stages |
| `data_parallel_size` | int | 1 | Number of data parallel replicas |
| `data_parallel_size_per_region` | int | - | DP size for hybrid parallelism |
| `data_parallel_master_port` | int | 29500 | Port for DP master communication |
| `data_parallel_master_addr` | str | - | Address for DP master |
| `numa_aware` | bool | False | Enable NUMA-aware GPU placement |

```python
# From vllm/config/parallel.py:10-45
@dataclass
class ParallelConfig:
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    data_parallel_size_per_region: Optional[int] = None
    data_parallel_master_port: int = 29500
    data_parallel_master_addr: Optional[str] = None
    data_parallel_standalone: bool = False
    numa_aware: bool = False
```

资料来源：[vllm/config/parallel.py:10-45](https://github.com/vllm-project/vllm/blob/main/vllm/config/parallel.py)

### Environment Variables

| Variable | Description |
|----------|-------------|
| `CUDA_VISIBLE_DEVICES` | Comma-separated list of GPU IDs to use |
| `VLLM_HOST_IP` | Host IP for distributed communication |
| `VLLM_PORT` | Port for worker communication |

### Launching Distributed Inference

#### Single-Node Multi-GPU Launch

```bash
# Tensor parallelism across 4 GPUs on one node
vllm serve meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4
```

#### Multi-Node Launch

Multi-node deployments run on a Ray cluster: start Ray on every node, then launch `vllm serve` from the head node.

```bash
# On the head node (10.0.0.1):
ray start --head --port=6379

# On each worker node:
ray start --address=10.0.0.1:6379

# From the head node, split the model across 16 GPUs:
vllm serve meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2
```

## Scaling Strategies

### Scaling Guidelines

| Model Size | GPU Memory | Recommended Configuration |
|------------|------------|---------------------------|
| 7B | 24GB | TP=1, DP=single node |
| 13B | 48GB | TP=2, DP=single node |
| 70B | 320GB+ | TP=4 or 8, PP=2+ |
| 405B | 800GB+ | TP=8, PP=8, multi-node |

资料来源：[docs/serving/parallelism_scaling.md](https://github.com/vllm-project/vllm/blob/main/docs/serving/parallelism_scaling.md)

### Disaggregated Prefill Scaling

For disaggregated prefill/decode, consider:

| Workload Pattern | Prefill Resources | Decode Resources |
|-----------------|-------------------|-------------------|
| Short prompts, many requests | Scale prefill | Scale decode |
| Long prompts, few requests | Scale prefill with TP | Scale decode |
| Mixed workload | Balance both | Balance both |

### Scaling Best Practices

1. **Start with tensor parallelism** for intra-node scaling
2. **Add pipeline parallelism** for multi-node deployments
3. **Use data parallelism** to increase throughput on same-stage workloads
4. **Enable disaggregation** when prefill and decode have different resource needs

资料来源：[docs/serving/parallelism_scaling.md](https://github.com/vllm-project/vllm/blob/main/docs/serving/parallelism_scaling.md)

## Architecture Diagram

```mermaid
graph TB
    subgraph "vLLM Distributed Architecture"
        subgraph "Process 0"
            P0_M[Model Shard 0]
            P0_S[Scheduler]
            P0_C[Cache Engine]
        end
        
        subgraph "Process 1"
            P1_M[Model Shard 1]
            P1_S[Scheduler]
            P1_C[Cache Engine]
        end
        
        subgraph "Process N"
            PN_M[Model Shard N]
            PN_S[Scheduler]
            PN_C[Cache Engine]
        end
        
        NCCL[NCCL AllReduce]
        
        P0_S <-->|NCCL| P1_S
        P1_S <-->|NCCL| PN_S
        P0_S <-->|NCCL| PN_S
        
        P0_M <-->|Forward Pass| P0_S
        P1_M <-->|Forward Pass| P1_S
        PN_M <-->|Forward Pass| PN_S
    end
    
    subgraph "External Services"
        KV[KV Transfer Service]
        Redis[(Redis/Storage)]
    end
    
    P0_C <-->|Save KV| Redis
    PN_C <-->|Load KV| Redis
    Redis <--> KV
```

## See Also

- [Structured Outputs Example](../features/structured_outputs.md)
- [API Reference - LLM Class](https://docs.vllm.ai/en/latest/api/offline_inference/llm.html)
- [Sampling Parameters](https://docs.vllm.ai/en/latest/api/inference_params.html)
- [Disaggregated Prefill Example](../disaggregated/example_connector/README.md)

---

<a id='page-10'></a>

## Model Architecture Support

### 相关页面

相关主题：[Model Executor and Worker Architecture](#page-4), [Quantization Support](#page-8)

<details>
<summary>相关源码文件</summary>

以下源码文件用于生成本页说明：

- [vllm/entrypoints/cli/serve.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py)
- [vllm/entrypoints/cli/main.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/main.py)
- [vllm/entrypoints/cli/launch.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/launch.py)
- [examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)
- [examples/features/structured_outputs/README.md](https://github.com/vllm-project/vllm/blob/main/examples/features/structured_outputs/README.md)
- [README.md](https://github.com/vllm-project/vllm/blob/main/README.md)
</details>

# Model Architecture Support

## Overview

vLLM provides comprehensive support for various LLM architectures, enabling users to serve, fine-tune, and run inference with a wide range of transformer-based models. The system is designed to be architecture-agnostic while providing optimized implementations for popular model families.

## Core Model Loading Architecture

### Python API (Offline Inference)

The primary Python interface for running offline inference is the `LLM` class, which handles model loading and inference without requiring a separate inference server.

```python
# Basic usage example
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B")
output = llm.generate("Hello, world!")
```

资料来源：[examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)
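
Decoding behavior is controlled through `SamplingParams`, for example:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Hello, world!"], params)
for out in outputs:
    print(out.outputs[0].text)
```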

### CLI-based Model Serving

The vLLM CLI provides a `serve` subcommand for HTTP-based model serving:

```bash
vllm serve Qwen/Qwen3-0.6B
```

If no model is specified, the CLI defaults to `Qwen/Qwen3-0.6B`.

资料来源：[vllm/entrypoints/cli/serve.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py)

## Supported Model Categories

### Language Models

vLLM supports a broad range of causal language models including:

- **Decoder-only transformers**: Standard autoregressive models
- **Mixture of Experts (MoE)**: Sparse architectures like Mixtral
- **Multimodal models**: Models that process multiple input types

### Quantization Support

vLLM supports quantized models through GGUF format, enabling deployment of compressed models with reduced memory footprint.

Example loading a quantized model directly from HuggingFace:

```bash
--model unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B
```

资料来源：[examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)

## Model Configuration

### Generation Configuration

The `--generation-config` argument specifies where the generation config is loaded from:

| Value | Source | Description |
|-------|--------|-------------|
| `auto` | Model path | Loads from model's configuration directory |
| `<folder_path>` | Local folder | Loads from specified directory |
| Not provided | vLLM defaults | Uses built-in default parameters |

> If `max_new_tokens` is specified in generation config, it sets a server-wide limit on output tokens for all requests.

资料来源：[examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)

### Engine Arguments

Model configuration is controlled through `AsyncEngineArgs`, which processes CLI arguments and creates the model configuration:

```python
engine_args = AsyncEngineArgs.from_cli_args(args)
model_config = engine_args.create_model_config()
```

资料来源：[vllm/entrypoints/cli/launch.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/launch.py)

## Structured Outputs

vLLM supports structured output generation for models that support it, including reasoning models:

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --reasoning-parser deepseek_r1
```

This enables compliance with output format constraints defined by the model.

资料来源：[examples/features/structured_outputs/README.md](https://github.com/vllm-project/vllm/blob/main/examples/features/structured_outputs/README.md)
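
Once the server is running, structured outputs can be requested through the OpenAI-compatible API; the snippet below uses the generic JSON mode as an illustration (the exact constraint options vary by vLLM version).

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Return a JSON object with fields 'a' and 'b'."}],
    response_format={"type": "json_object"},  # constrain output to valid JSON
)
print(resp.choices[0].message.content)
```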

## CPU Offload Support

For models that exceed available GPU memory, vLLM provides CPU offload capabilities:

```bash
--cpu-offload-gb 10
```

This creates virtual GPU memory by offloading portions of the model to CPU RAM. For example, with a 24GB GPU and 10GB offload, you can effectively load a 13B model requiring ~26GB.

> **Note**: This requires fast CPU-GPU interconnect for acceptable performance.

资料来源：[examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)
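
The same option is available from the Python API via the `cpu_offload_gb` engine argument (the model choice here is illustrative):

```python
from vllm import LLM

# Offload ~10 GB of weights to CPU RAM to fit a larger model on a 24 GB GPU
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    cpu_offload_gb=10,
    gpu_memory_utilization=0.9,
)
```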

## Model Registry Architecture

The vLLM CLI uses a modular command structure where model support is registered through subcommand modules:

```python
for cmd_module in CMD_MODULES:
    new_cmds = cmd_module.cmd_init()
    for cmd in new_cmds:
        cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)
```

Each registered command includes validation logic to ensure model configurations are valid before execution.

资料来源：[vllm/entrypoints/cli/main.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/main.py)

## Serving Modes

### Standard API Server

The default serving mode starts an HTTP server with OpenAI-compatible endpoints:

```bash
vllm serve <model_name>
```

### Headless Mode

For distributed deployments, headless mode skips API server initialization:

```python
if args.headless:
    if args.api_server_count is not None and args.api_server_count > 0:
        raise ValueError(
            f"--api-server-count={args.api_server_count} cannot be "
            "used with --headless (no API servers are started in "
            "headless mode)."
        )
    args.api_server_count = 0
```

资料来源：[vllm/entrypoints/cli/serve.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py)

### gRPC Server Mode

For high-performance scenarios, vLLM supports gRPC-based serving:

```python
if getattr(args, "grpc", False):
    from vllm.entrypoints.grpc_server import serve_grpc
    uvloop.run(serve_grpc(args))
```

资料来源：[vllm/entrypoints/cli/serve.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py)

## Data Parallel Modes

vLLM supports distributed model serving through multiple load balancing strategies:

| Mode | Flag | Description |
|------|------|-------------|
| External LB | `--data-parallel-external-lb` or `--data-parallel-rank` | External load balancer manages request distribution |
| Hybrid LB | `--data-parallel-hybrid-lb` or `--data-parallel-start-rank` | Hybrid approach with internal and external coordination |

The system auto-detects load balancing mode to set appropriate default values for `api_server_count`.

资料来源：[vllm/entrypoints/cli/serve.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py)

## Installation and Quickstart

For users getting started with model architecture support:

1. Install vLLM following the [installation guide](https://docs.vllm.ai/en/latest/getting_started/installation.html)
2. Review the [quickstart documentation](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
3. Check the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html)

资料来源：[README.md](https://github.com/vllm-project/vllm/blob/main/README.md)

## Architecture Flow Diagram

```mermaid
graph TD
    A[User Request] --> B{CLI or Python API?}
    B -->|CLI| C[vllm serve command]
    B -->|Python| D[LLM class instantiation]
    C --> E[CLISubcommand processing]
    D --> F[AsyncEngineArgs configuration]
    E --> G[Model Registry Lookup]
    F --> H[Model Config Creation]
    G --> I[Load Model Architecture]
    H --> I
    I --> J{Quantization?}
    J -->|GGUF| K[Load Quantized Weights]
    J -->|BF16/FP8| L[Load Standard Weights]
    K --> M[PagedAttention Engine]
    L --> M
    M --> N[Inference Execution]
```

## Key Implementation Files

| Component | File Path | Purpose |
|-----------|-----------|---------|
| CLI Entry | `vllm/entrypoints/cli/main.py` | Main CLI dispatcher and argument parsing |
| Serve Command | `vllm/entrypoints/cli/serve.py` | HTTP server startup and configuration |
| Launch Layer | `vllm/entrypoints/cli/launch.py` | FastAPI-based serving layer |
| Offline Inference | `examples/basic/offline_inference/` | Python API usage examples |
| Structured Outputs | `examples/features/structured_outputs/` | Advanced output formatting |

---

---

## Doramagic 踩坑日志

项目：vllm-project/vllm

摘要：发现 21 个潜在踩坑项，其中 0 个为 high/blocking；最高优先级：安装坑 - 来源证据：[Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling。

## 1. 安装坑 · 来源证据：[Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：[Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_1a71634c530044a68b9160080d55de0a | https://github.com/vllm-project/vllm/issues/42182 | 来源讨论提到 python 相关条件，需在安装/试用前复核。

## 2. 安装坑 · 来源证据：[Bug]: vLLM v1 with prefix caching: first request differs from subsequent identical requests at temperature=0

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：[Bug]: vLLM v1 with prefix caching: first request differs from subsequent identical requests at temperature=0
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_58327949a4524ed082bd189b53f713a1 | https://github.com/vllm-project/vllm/issues/40896 | 来源讨论提到 python 相关条件，需在安装/试用前复核。

## 3. 安装坑 · 来源证据：[Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_l…

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：[Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_lora_adapter`?
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_fb1461834fe34049bd05182574d3e5e5 | https://github.com/vllm-project/vllm/issues/42207 | 来源讨论提到 docker 相关条件，需在安装/试用前复核。

## 4. 安装坑 · 来源证据：v0.18.1

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个安装相关的待验证问题：v0.18.1
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_317a03f9de4e459f9be42064c7318b2c | https://github.com/vllm-project/vllm/releases/tag/v0.18.1 | 来源讨论提到 python 相关条件，需在安装/试用前复核。

## 5. 能力坑 · 来源证据：[Feature]: Qwen3.5-Moe LoRA Support (experts)

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个能力理解相关的待验证问题：[Feature]: Qwen3.5-Moe LoRA Support (experts)
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源问题仍为 open，Pack Agent 需要复核是否仍影响当前版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_2d068d43c6654f3cab6b48bf98dad116 | https://github.com/vllm-project/vllm/issues/40005 | 来源类型 github_issue 暴露的待验证使用条件。

## 6. 能力坑 · 能力判断依赖假设

- 严重度：medium
- 证据强度：source_linked
- 发现：README/documentation is current enough for a first validation pass.
- 对用户的影响：假设不成立时，用户拿不到承诺的能力。
- 建议检查：将假设转成下游验证清单。
- 防护动作：假设必须转成验证项；没有验证结果前不能写成事实。
- 证据：capability.assumptions | github_repo:599547518 | https://github.com/vllm-project/vllm | README/documentation is current enough for a first validation pass.

## 7. 运行坑 · 来源证据：v0.20.2

- 严重度：medium
- 证据强度：source_linked
- 发现：GitHub 社区证据显示该项目存在一个运行相关的待验证问题：v0.20.2
- 对用户的影响：可能增加新用户试用和生产接入成本。
- 建议检查：来源显示可能已有修复、规避或版本变化，说明书中必须标注适用版本。
- 防护动作：不得脱离来源链接放大为确定性结论；需要标注适用版本和复核状态。
- 证据：community_evidence:github | cevd_ecf37722dff6494c82b384225e34bcb0 | https://github.com/vllm-project/vllm/releases/tag/v0.20.2 | 来源类型 github_release 暴露的待验证使用条件。

## 8. 维护坑 · 维护活跃度未知

- 严重度：medium
- 证据强度：source_linked
- 发现：未记录 last_activity_observed。
- 对用户的影响：新项目、停更项目和活跃项目会被混在一起，推荐信任度下降。
- 建议检查：补 GitHub 最近 commit、release、issue/PR 响应信号。
- 防护动作：维护活跃度未知时，推荐强度不能标为高信任。
- 证据：evidence.maintainer_signals | github_repo:599547518 | https://github.com/vllm-project/vllm | last_activity_observed missing

## 9. Security/permission pitfall · Downstream validation flagged a risk item

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- Impact on users: downstream validation has already asked for review; the page must not downplay it.
- Suggested check: route it into the security/permission governance review queue.
- Safeguard: while downstream risks remain, the review/recommendation downgrade must stay in place.
- Evidence: downstream_validation.risk_items | github_repo:599547518 | https://github.com/vllm-project/vllm | no_demo; severity=medium

## 10. Security/permission pitfall · Safety notes exist

- Severity: medium
- Evidence strength: source_linked
- Finding: No sandbox install has been executed yet; downstream must verify before user use.
- Impact on users: users need to know the permission boundaries and sensitive operations before installing.
- Suggested check: convert into an explicit permission checklist and security-review prompts (an isolated install probe is sketched below).
- Safeguard: safety notes must be surfaced to users up front.
- Evidence: risks.safety_notes | github_repo:599547518 | https://github.com/vllm-project/vllm | No sandbox install has been executed yet; downstream must verify before user use.
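
Since no sandbox install has been executed, a first low-risk step is to install vllm into a throwaway virtual environment rather than the system Python. The sketch below assumes Linux/macOS path conventions (a `bin/` directory inside the venv) and only verifies that installation and a basic import succeed; it is not a security review.

```python
# Isolated "sandbox install" probe: install vllm into a temporary venv and import it.
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    venv_dir = Path(tmp) / "venv"
    subprocess.run([sys.executable, "-m", "venv", str(venv_dir)], check=True)
    python = venv_dir / "bin" / "python"  # use Scripts\python.exe on Windows
    subprocess.run([str(python), "-m", "pip", "install", "vllm"], check=True)
    subprocess.run([str(python), "-c", "import vllm; print(vllm.__version__)"],
                   check=True)
```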

## 11. Security/permission pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- Impact on users: the risk bears on whether the project is suitable for ordinary users to install.
- Suggested check: write the risk into the boundary card and confirm whether manual review is needed.
- Safeguard: scoring risks must go into the boundary card, not remain only an internal score.
- Evidence: risks.scoring_risks | github_repo:599547518 | https://github.com/vllm-project/vllm | no_demo; severity=medium

## 12. Security/permission pitfall · Source evidence: [Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B / A100

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permission-related problem in this project: [Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B / A100
- Impact on users: may affect authorization, key configuration, or security boundaries.
- Suggested check: the source issue is still open; the Pack Agent should re-verify whether it still affects the current version (a greedy-output comparison is sketched below).
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; annotate the applicable version and review status.
- Evidence: community_evidence:github | cevd_9ce279a037934e8085332120bfdaca86 | https://github.com/vllm-project/vllm/issues/41758 | The source discussion mentions docker-related conditions; re-verify before installing or trialing.
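
Re-verifying the report amounts to comparing greedy output with and without ngram speculation on the same prompt. The sketch below follows the offline API; the `speculative_config` keys reflect recent vLLM versions and are an assumption, the model mirrors the issue (Qwen3-0.6B), and running the two configurations in separate processes is safer than sharing one GPU as done here.

```python
# Greedy-output comparison: baseline vs. ngram speculative decoding.
from vllm import LLM, SamplingParams

prompt = "Write one sentence about the history of GPUs."
params = SamplingParams(temperature=0.0, max_tokens=64)

baseline = LLM(model="Qwen/Qwen3-0.6B")
base_text = baseline.generate([prompt], params)[0].outputs[0].text
del baseline  # best-effort release of the first engine before starting the second

spec = LLM(
    model="Qwen/Qwen3-0.6B",
    speculative_config={          # assumed config shape for recent vLLM versions
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 4,
    },
)
spec_text = spec.generate([prompt], params)[0].outputs[0].text

print("identical" if base_text == spec_text else
      f"mismatch:\nbaseline:    {base_text!r}\nspeculative: {spec_text!r}")
```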

## 13. Security/permission pitfall · Source evidence: v0.16.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence flags an unverified security/permission-related condition surfaced in the v0.16.0 release notes.
- Impact on users: may affect upgrades, migration, or version selection.
- Suggested check: the source indicates a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; annotate the applicable version and review status.
- Evidence: community_evidence:github | cevd_57ff11ff995a4809a33a38ff7504a5ef | https://github.com/vllm-project/vllm/releases/tag/v0.16.0 | Unverified usage conditions surfaced by a source of type github_release.

## 14. Security/permission pitfall · Source evidence: v0.17.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence flags an unverified security/permission-related condition surfaced in the v0.17.0 release notes.
- Impact on users: may block installation or first run.
- Suggested check: the source indicates a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; annotate the applicable version and review status.
- Evidence: community_evidence:github | cevd_4592ced4fde24aa9aa808caa40e25b84 | https://github.com/vllm-project/vllm/releases/tag/v0.17.0 | Unverified usage conditions surfaced by a source of type github_release.

## 15. Security/permission pitfall · Source evidence: v0.18.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence flags an unverified security/permission-related condition surfaced in the v0.18.0 release notes.
- Impact on users: may block installation or first run.
- Suggested check: the source indicates a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; annotate the applicable version and review status.
- Evidence: community_evidence:github | cevd_7ca09904f8164e94a3a1bc489d32d1ff | https://github.com/vllm-project/vllm/releases/tag/v0.18.0 | Unverified usage conditions surfaced by a source of type github_release.

## 16. Security/permission pitfall · Source evidence: v0.19.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence flags an unverified security/permission-related condition surfaced in the v0.19.0 release notes.
- Impact on users: may affect authorization, key configuration, or security boundaries.
- Suggested check: the source indicates a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; annotate the applicable version and review status.
- Evidence: community_evidence:github | cevd_cdb0d7a7f491474da7b93583ec643c00 | https://github.com/vllm-project/vllm/releases/tag/v0.19.0 | The source discussion mentions docker-related conditions; re-verify before installing or trialing.

## 17. Security/permission pitfall · Source evidence: v0.19.1

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence flags an unverified security/permission-related condition surfaced in the v0.19.1 release notes.
- Impact on users: may affect authorization, key configuration, or security boundaries.
- Suggested check: the source indicates a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; annotate the applicable version and review status.
- Evidence: community_evidence:github | cevd_8c2ec43cf6f147d49f80f22bcd199e8f | https://github.com/vllm-project/vllm/releases/tag/v0.19.1 | Unverified usage conditions surfaced by a source of type github_release.

## 18. Security/permission pitfall · Source evidence: v0.20.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence flags an unverified security/permission-related condition surfaced in the v0.20.0 release notes.
- Impact on users: may affect upgrades, migration, or version selection.
- Suggested check: the source indicates a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; annotate the applicable version and review status.
- Evidence: community_evidence:github | cevd_c867dafed19f4b97978d637aea4c7308 | https://github.com/vllm-project/vllm/releases/tag/v0.20.0 | The source discussion mentions python-related conditions; re-verify before installing or trialing.

## 19. Security/permission pitfall · Source evidence: v0.20.1

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence flags an unverified security/permission-related condition surfaced in the v0.20.1 release notes.
- Impact on users: may affect authorization, key configuration, or security boundaries.
- Suggested check: the source indicates a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: do not amplify this into a definitive conclusion detached from the source link; annotate the applicable version and review status.
- Evidence: community_evidence:github | cevd_dc6a717c695b47838899da8c9791f907 | https://github.com/vllm-project/vllm/releases/tag/v0.20.1 | Unverified usage conditions surfaced by a source of type github_release.

## 20. Maintenance pitfall · Issue/PR responsiveness unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- Impact on users: users cannot tell whether anyone will respond when they hit a problem.
- Suggested check: sample recent issues/PRs to see whether they sit unanswered for long periods (see the sketch below).
- Safeguard: while issue/PR responsiveness is unknown, the maintenance risk must be called out.
- Evidence: evidence.maintainer_signals | github_repo:599547518 | https://github.com/vllm-project/vllm | issue_or_pr_quality=unknown
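
A rough sampling of responsiveness, along the lines suggested above: pull the most recently updated open issues and count how many have at least one comment. This is only a coarse proxy; unauthenticated GitHub API access is rate-limited.

```python
# Sample recently updated open issues to gauge whether anyone is responding.
import requests

REPO = "vllm-project/vllm"
issues = requests.get(
    f"https://api.github.com/repos/{REPO}/issues",
    params={"state": "open", "sort": "updated", "per_page": 20},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
).json()

issues = [i for i in issues if "pull_request" not in i]  # the issues feed also lists PRs
answered = sum(1 for i in issues if i.get("comments", 0) > 0)
print(f"{answered}/{len(issues)} recently updated open issues have at least one comment")
```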

## 21. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- Impact on users: install commands and docs may lag behind the code, raising the chance that users hit problems.
- Suggested check: confirm that the latest release/tag and the README install command still agree.
- Safeguard: while the release cadence is unknown or stale, the install instructions must note possible drift.
- Evidence: evidence.maintainer_signals | github_repo:599547518 | https://github.com/vllm-project/vllm | release_recency=unknown

<!-- canonical_name: vllm-project/vllm; human_manual_source: deepwiki_human_wiki -->
