Doramagic Project Pack · Human Manual
vLLM Overview
Related topics: Getting Started, Core Engine Architecture
What is vLLM?
vLLM is a fast and easy-to-use library for LLM (Large Language Model) inference and serving. It provides high-throughput, memory-efficient inference with an OpenAI-compatible API server, making it suitable for both research and production environments.
Sources: README.md
Key Features
Offline Inference
The LLM class provides the primary Python interface for offline inference—interacting with a model without using a separate model inference server. This enables direct model interaction for batch processing, experimentation, and development workflows.
Sources: examples/basic/offline_inference/README.md
OpenAI-Compatible API Server
vLLM serves LLM completions via HTTP through an OpenAI-compatible API. The server can be started with a simple command:
vllm serve Qwen/Qwen2.5-3B-Instruct
Sources: vllm/entrypoints/cli/serve.py
Structured Outputs
vLLM supports constrained decoding for structured outputs including JSON schema, regex patterns, and structural tags. This is essential for building reliable applications that require predictable output formats.
Sources: examples/features/structured_outputs/README.md
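As an illustration, a JSON schema constraint can be passed through the OpenAI-compatible server using vLLM's guided_json extension field. This is a minimal sketch: the schema, model name, and local server address are example values, and a server is assumed to already be running.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[{"role": "user", "content": "Invent a person and answer in JSON."}],
    extra_body={"guided_json": schema},  # vLLM extension: constrain output to the schema
)
print(resp.choices[0].message.content)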
Disaggregated Prefill and Decode
vLLM supports a disaggregated prefill architecture in which the prefill stage (processing the prompt and producing KV cache) and the decode stage (consuming KV cache to generate tokens) can run on separate instances. This enables:
- Independent scaling of prefill and decode workloads
- Improved resource utilization
- Better support for multi-turn conversations
Sources: examples/disaggregated/example_connector/README.md
KV Cache Transfer
For disaggregated serving, vLLM supports KV cache transfer between prefill and decode workers. The architecture includes:
| Component | Role | Description |
|---|---|---|
| ECExampleConnector | Cache Storage | Stores encoder cache on local disk |
| EC Producer | Precompute | Pre-computes encoder cache |
| EC Consumer | Retrieve | Retrieves cached encoder data |
Sources: examples/disaggregated/disaggregated_encoder/README.md
KV Load Failure Recovery
vLLM implements robust recovery mechanisms for KV load failures in both synchronous and asynchronous loading modes. The system:
- Identifies invalid KV blocks
- Reschedules affected requests
- Ensures consistent output through recovery logic
Sources: examples/disaggregated/kv_load_failure_recovery_offline/README.md
Long Text Embedding with Chunked Processing
vLLM supports embedding models with chunked processing for texts exceeding the model's maximum context length:
{
"pooling_type": "auto",
"use_activation": true,
"enable_chunked_processing": true,
"max_embed_len": 3072000
}
This enables processing of extremely long documents (up to 3M+ tokens) including academic papers, legal documents, and code repositories.
Sources: examples/pooling/embed/openai_embedding_long_text/README.md
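As a client-side sketch (assuming a server started with the pooler configuration above and the jinaai/jina-embeddings-v3 model from the example), long inputs go through the standard embeddings endpoint:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# A document far longer than the model's native context window
long_text = "Chunked processing lets vLLM embed very long documents. " * 100_000
resp = client.embeddings.create(model="jinaai/jina-embeddings-v3", input=long_text)
print(len(resp.data[0].embedding))  # one pooled vector for the entire document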
Architecture Overview
CLI Entry Point
The vLLM CLI provides a flexible command-line interface built with FlexibleArgumentParser:
graph TD
A[vllm CLI] --> B[main.py Entry Point]
B --> C[Parse Arguments]
C --> D[Load CMD_MODULES]
D --> E[Create Subparsers]
E --> F[Execute Subcommand]
F --> G[serve]
F --> H[launch]
F --> I[other modules]
The CLI supports:
- -v, --version flag for version information
- Subcommand system with plugin architecture
- Command validation before execution
Sources: vllm/entrypoints/cli/main.py
Serve Subcommand
The serve subcommand handles online inference server startup:
graph TD
A[serve command] --> B{Model specified?}
B -->|Yes| C[Use CLI model]
B -->|No| D[Default: Qwen/Qwen3-0.6B]
C --> E{GRPC enabled?}
D --> E
E -->|Yes| F[Start gRPC Server]
E -->|No| G{Headless mode?}
G -->|Yes| H[Set api_server_count=0]
G -->|No| I[Check LB mode]
I --> J[Start FastAPI Server]
H --> J
The serve command supports multiple deployment modes:
- Standard mode: Full API server with GPU inference
- Headless mode: No API servers, only engine processing
- gRPC mode: Alternative RPC interface
- Load-balanced mode: Data-parallel external/hybrid load balancing
Sources: vllm/entrypoints/cli/serve.py
Launch Subcommand
The launch subcommand provides a modular component launch system:
graph LR
A[launch command] --> B[LaunchSubcommand]
B --> C[launch_component subparser]
C --> D[LaunchSubcommandBase subclasses]
D --> E[run_launch_fastapi]
D --> F[other components]
E --> G[Socket binding]
E --> H[Build API Server]
E --> I[EngineArgs configuration]
The launch system starts render servers that perform preprocessing only: they run no inference, load no quantized kernels, and never allocate KV cache.
Sources: vllm/entrypoints/cli/launch.py
Observability
OpenTelemetry Integration
vLLM includes built-in OpenTelemetry support for distributed tracing:
opentelemetry-instrument vllm serve facebook/opt-125m
Core packages are bundled with vLLM:
- opentelemetry-sdk
- opentelemetry-api
- opentelemetry-exporter-otlp
- opentelemetry-semantic-conventions-ai
Sources: examples/observability/opentelemetry/README.md
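The trace destination is configured through the standard OpenTelemetry environment variables; a minimal sketch, where the endpoint is an example value for a local OTLP/gRPC collector:
export OTEL_SERVICE_NAME=vllm-server
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
opentelemetry-instrument vllm serve facebook/opt-125m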
Prometheus Metrics and Dashboards
vLLM exports Prometheus-compatible metrics and supports integration with:
| Platform | Dashboard Format | Import Method |
|---|---|---|
| Grafana | JSON | UI or API |
| Perses | YAML | CLI |
Sources: examples/observability/dashboards/README.md
Performance Profiling
The nsys_profile_tools enable GPU kernel-level profiling:
nsys profile -t cuda -o run1 -f true --trace-fork-before-exec=true \
--cuda-graph-trace=node --delay <DELAY> --duration <DURATION> \
vllm serve openai/gpt-oss-120b ...
The gputrc2graph.py script generates kernel-level summaries and visualizations from .nsys-rep files.
Sources: tools/profiler/nsys_profile_tools/README.md
Supported Features Summary
| Feature | Description | Configuration |
|---|---|---|
| Offline Inference | Batch processing without server | LLM class |
| OpenAI API | HTTP API compatibility | vllm serve |
| Structured Outputs | JSON/regex/structural constraints | --reasoning-parser |
| Disaggregated Serving | Split prefill/decode | --ec-transfer-config |
| KV Recovery | Failure resilience | Custom connectors |
| Long Text Embedding | Chunked processing | --pooler-config |
| Observability | Tracing and metrics | OpenTelemetry |
| Quantization | GGUF support | repo_id:quant_type |
Usage Modes
Offline Inference Mode
from vllm import LLM
llm = LLM("Qwen/Qwen2.5-3B-Instruct")
outputs = llm.generate("Hello, world!")
print(outputs[0].outputs[0].text)
API Server Mode
vllm serve Qwen/Qwen2.5-3B-Instruct --tensor-parallel-size 2
Disaggregated Mode
# Prefill instance
vllm serve --prefill-only --ec-transfer-config '{...}'
# Decode instance
vllm serve --decode-only --ec-transfer-config '{...}'
Quick Reference
| Command | Purpose |
|---|---|
| vllm serve <model> | Start API server |
| vllm launch <component> | Launch specific component |
| opentelemetry-instrument vllm serve | Enable tracing |
Sources: [README.md](https://github.com/vllm-project/vllm/blob/main/README.md)
Getting Started
Related topics: vLLM Overview
vLLM is a fast and easy-to-use library for Large Language Model (LLM) inference and serving. It provides both an offline inference interface via the LLM class and an online serving layer with an OpenAI-compatible API server. Sources: README.md:1-30
This guide covers the essential steps to get started with vLLM, from installation through basic inference and serving.
Installation
vLLM can be installed via pip or built from source. For detailed installation instructions, refer to the official documentation.
Quick Installation
pip install vllm
Building from Source
For custom builds or development:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
GPU Requirements
vLLM requires CUDA-compatible GPUs. The library supports various CUDA versions. Verify your environment has the necessary GPU drivers and CUDA toolkit installed. Sources: README.md:1-15
Core Concepts
Before diving into usage, understand these fundamental concepts:
| Concept | Description |
|---|---|
| LLM Class | Primary Python interface for offline inference |
| Engine Args | Configuration parameters for the inference engine |
| Sampling Params | Controls generation behavior (temperature, max_tokens, etc.) |
| OpenAI API Server | HTTP server providing OpenAI-compatible REST endpoints |
Architecture Overview
graph TD
A[User Code] --> B[LLM Class / API Server]
B --> C[AsyncLLMEngine]
C --> D[Worker Pool]
D --> E[GPU Devices]
E --> F[PagedAttention KV Cache]
G[HTTP Clients] --> H[OpenAI API Server]
H --> B
Offline Inference
Offline inference involves running model inference directly in Python without a separate server. This is ideal for batch processing, testing, or embedding vLLM into applications. Sources: examples/basic/offline_inference/README.md:1-25
Basic Usage
from vllm import LLM, SamplingParams
# Initialize the model
llm = LLM("Qwen/Qwen2.5-3B-Instruct")
# Define sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=256
)
# Run inference
outputs = llm.generate(["Hello, how are you?", "What is vLLM?"], sampling_params)
for output in outputs:
print(output.outputs[0].text)
Supported Models
vLLM supports a wide range of models including autoregressive transformers, mixture-of-experts models, and quantized models. For the complete list, see the supported models documentation. Sources: README.md:20-25
Serving with the CLI
vLLM provides a command-line interface for serving models via an OpenAI-compatible API. Sources: vllm/entrypoints/cli/serve.py:1-30
Starting the Server
vllm serve Qwen/Qwen3-0.6B
Command-Line Options
The serve command supports extensive configuration through CLI arguments:
vllm serve <model> [options]
Use --help=all to show all available flags, or --help=<ConfigGroup> to explore options by section (e.g., --help=ModelConfig, --help=Frontend). Sources: vllm/entrypoints/cli/serve.py:5-20
Key Server Options
| Option | Description | Default |
|---|---|---|
| --model | Model name or path | Required |
| --gpu-memory-utilization | Fraction of GPU memory to use | 0.9 |
| --max-model-len | Maximum sequence length | Model default |
| --tensor-parallel-size | Number of GPUs for parallelism | 1 |
| --port | Server port | 8000 |
Headless Mode
For distributed setups where API servers are managed externally:
vllm serve <model> --headless
In headless mode, no API servers are started, and --api-server-count cannot be used. Sources: vllm/entrypoints/cli/serve.py:30-45
API Usage
Once the server is running, you can interact with it using the OpenAI-compatible API.
Completions API
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-3B-Instruct",
"prompt": "The capital of France is",
"max_tokens": 50
}'
Chat API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-3B-Instruct",
"messages": [
{"role": "user", "content": "What is machine learning?"}
]
}'
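The same endpoint can also be called with the official openai Python client (the API key is a placeholder, since vLLM does not require one by default):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
print(resp.choices[0].message.content)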
Offline Inference Examples
The repository includes practical examples demonstrating various vLLM capabilities. Sources: examples/basic/offline_inference/README.md:25-60
Running Examples
# Basic example
python examples/basic/offline_inference/basic.py
# Chat example with sampling parameters
python examples/basic/offline_inference/chat.py --max_tokens 100 --temperature 0.8
# Generate example
python examples/basic/offline_inference/generate.py --generation-config auto
Generation Config
The --generation-config argument specifies where the generation config loads from:
- 'auto' - Load from the model path
- <folder_path> - Load from the specified directory
- Not provided - Use vLLM defaults
python examples/basic/offline_inference/generate.py --generation-config auto
Note: If max_new_tokens is specified in generation config, it sets a server-wide limit on output tokens for all requests. Sources: examples/basic/offline_inference/README.md:55-70
Advanced Features
Structured Outputs
vLLM supports constrained decoding for structured outputs including JSON schemas, regex patterns, and grammar-based constraints. Sources: examples/features/structured_outputs/README.md:1-40
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--reasoning-parser deepseek_r1
# Run structured outputs example
uv run structured_outputs_offline.py --constraint json_mode regex
Long Text Embedding
For embedding models, vLLM supports chunked processing to handle texts exceeding the model's maximum context length:
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./service.sh
Configuration example with chunked processing:
{
"pooling_type": "auto",
"use_activation": true,
"enable_chunked_processing": true,
"max_embed_len": 3072000
}
GGUF Quantized Models
vLLM supports GGUF-quantized models loaded directly from HuggingFace:
--model unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B
CPU Offload
For systems with limited GPU memory, CPU offload allows loading larger models:
--cpu-offload-gb 10
This creates a virtual 34GB GPU when you have a 24GB GPU, enabling 13B model loading with BF16 weights. Sources: examples/basic/offline_inference/README.md:75-85
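The same option is available programmatically through the cpu_offload_gb engine argument; a minimal sketch, with an example 13B model name:
from vllm import LLM

# Offload 10 GB of weights to CPU RAM, trading PCIe transfers for capacity
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", cpu_offload_gb=10)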
Configuration Workflow
graph LR
A[Define Engine Args] --> B[Create Model Config]
B --> C[Initialize Engine]
C --> D[Process Requests]
D --> E[Return Outputs]
F[CLI Arguments] --> A
G[Python API] --> A
Programmatic Configuration
from vllm import LLM

# LLM forwards engine keyword arguments to the underlying engine configuration
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_memory_utilization=0.85,
    tensor_parallel_size=2,
    max_model_len=4096,
)
Next Steps
| Resource | Description |
|---|---|
| Documentation | Comprehensive guides and API reference |
| Supported Models | Complete list of supported architectures |
| Examples | Usage examples for various features |
| Paper | Technical details behind vLLM's design |
Troubleshooting
Common Issues
- CUDA Out of Memory: Reduce gpu_memory_utilization or use smaller batch sizes
- Model Not Found: Ensure HuggingFace credentials are configured for gated models
- Import Errors: Verify all dependencies are installed with pip install vllm
Getting Help
- Technical Questions: Use GitHub Issues
- Community Discussion: vLLM Forum
- Development Coordination: Developer Slack
Sources: README.md:60-75
Core Engine Architecture
Related topics: vLLM Overview, Model Executor and Worker Architecture, Scheduling and Request Processing
Overview
The vLLM Core Engine Architecture is the central orchestration layer responsible for managing LLM inference workflows, request scheduling, and model execution. vLLM supports two engine versions: the legacy V0 engine and the current V1 engine (introduced in v0.6.0), both designed to provide high-throughput LLM serving through efficient request batching and GPU memory management.
The engine architecture serves as the foundation for both offline inference via the LLM class and online serving via the OpenAI-compatible API server. Sources: vllm/entrypoints/llm.py:1-50
Architecture Components
V0 Engine (Legacy)
The V0 engine is the original implementation found in vllm/engine/. It consists of:
| Component | File | Purpose |
|---|---|---|
| LLMEngine | vllm/engine/llm_engine.py | Synchronous inference engine with blocking operations |
| AsyncLLMEngine | vllm/engine/async_llm_engine.py | Async wrapper enabling concurrent request handling |
The V0 engine uses an event-loop-based async architecture where AsyncLLMEngine wraps LLMEngine to provide non-blocking request processing.
V1 Engine (Current)
The V1 engine (vllm/v1/engine/) is the current production-ready implementation featuring a modular design:
| Component | File | Purpose |
|---|---|---|
| Core | vllm/v1/engine/core.py | Low-level engine core managing model execution |
| AsyncLLM | vllm/v1/engine/async_llm.py | Main async interface for inference |
| LLMEngine | vllm/v1/engine/llm_engine.py | High-level engine orchestrator |
The V1 engine architecture separates concerns into distinct layers: AsyncLLM provides the public async interface, LLMEngine handles request orchestration, and Core manages low-level GPU operations.
Entry Points
vLLM provides multiple entry points for interacting with the engine:
graph TD
A[vllm serve] --> B[ServeSubcommand]
A --> C[LaunchSubcommand]
B --> D[serve_grpc / API Server]
C --> E[run_launch_fastapi]
F[vllm run-batch] --> G[BatchRunner]
H[LLM Class] --> I[AsyncLLM Engine]
CLI Entry Point
The CLI entry point in vllm/entrypoints/cli/main.py provides command-line access to vLLM functionality:
# Simplified CLI structure
parser = FlexibleArgumentParser(description="vLLM CLI")
subparsers = parser.add_subparsers(required=False, dest="subparser")
cmds = {}
for cmd_module in CMD_MODULES:
new_cmds = cmd_module.cmd_init()
for cmd in new_cmds:
cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)
Sources: vllm/entrypoints/cli/main.py:1-40
Serve Subcommand
The serve subcommand initializes the HTTP API server or gRPC service:
class ServeSubcommand(CLISubcommand):
name = "serve"
@staticmethod
def cmd(args: argparse.Namespace) -> None:
if hasattr(args, "model_tag") and args.model_tag is not None:
args.model = args.model_tag
Sources: vllm/entrypoints/cli/serve.py:1-30
Launch Subcommand
The launch subcommand provides component-level launching capabilities:
def cmd_init() -> list[CLISubcommand]:
return [LaunchSubcommand()]
async def run_launch_fastapi(args: argparse.Namespace) -> None:
listen_address, sock = setup_server(args)
engine_args = AsyncEngineArgs.from_cli_args(args)
Sources: vllm/entrypoints/cli/launch.py:1-60
Python API Entry Point
The LLM class in vllm/entrypoints/llm.py provides the primary Python interface for offline inference:
class LLM:
"""
An LLM for offline inference.
"""
def __init__(self, model: str, ...):
...
Sources: vllm/entrypoints/llm.py:1-100
Engine Initialization Flow
The following diagram illustrates the initialization flow from CLI to engine:
sequenceDiagram
participant CLI as vllm serve
participant Parser as ArgumentParser
participant EngineArgs as AsyncEngineArgs
participant Engine as AsyncLLM / Core
participant Model as ModelConfig
CLI->>Parser: Parse CLI arguments
Parser->>EngineArgs: from_cli_args()
EngineArgs->>Model: create_model_config()
Model-->>EngineArgs: Config validated
EngineArgs->>Engine: Initialize engine
Engine->>Engine: Load model weights
Core Engine Components
AsyncLLM (V1)
The AsyncLLM class is the primary async interface for V1 engine:
class AsyncLLM:
"""
Async implementation of LLM engine.
"""
async def add_request(self, request_id: str, prompt: str, ...):
...
async def step(self) -> List[RequestOutput]:
...
Sources: vllm/v1/engine/async_llm.py:1-50
Core (V1)
The Core class manages low-level model execution:
class Core:
"""
Core engine for V1.
"""
def __init__(self, engine_config: VllmConfig, ...):
...
def get_config(self) -> VllmConfig:
...
Sources: vllm/v1/engine/core.py:1-80
LLMEngine (V1)
The V1 LLMEngine orchestrates request processing:
class LLMEngine:
"""
V1 LLM Engine implementation.
"""
def __init__(self, vllm_config: VllmConfig, ...):
...
Sources: vllm/v1/engine/llm_engine.py:1-50
Configuration System
AsyncEngineArgs
Configuration flows from CLI/API to engine via AsyncEngineArgs:
| Parameter | Type | Description |
|---|---|---|
| model | str | Model name or path |
| tensor_parallel_size | int | Number of GPUs for tensor parallelism |
| gpu_memory_utilization | float | Fraction of GPU memory to use |
| max_model_len | Optional[int] | Maximum sequence length |
| dtype | str | Model data type (float16, bfloat16, etc.) |
| quantization | Optional[str] | Quantization method (awq, gptq, etc.) |
The engine validates configuration through create_model_config():
def create_model_config(self) -> ModelConfig:
"""Create model config from engine args."""
return ModelConfig(...)
Sources: vllm/entrypoints/cli/launch.py:40-50
Headless Mode
The V1 engine supports headless operation where no API servers are started:
if args.headless:
if args.api_server_count is not None and args.api_server_count > 0:
raise ValueError(
f"--api-server-count={args.api_server_count} cannot be "
"used with --headless (no API servers are started in "
"headless mode)."
)
args.api_server_count = 0
Sources: vllm/entrypoints/cli/serve.py:25-35
gRPC Support
vLLM supports gRPC for disaggregated prefill scenarios:
if getattr(args, "grpc", False):
from vllm.entrypoints.grpc_server import serve_grpc
uvloop.run(serve_grpc(args))
return
Sources: vllm/entrypoints/cli/serve.py:18-22
Request Processing Pipeline
graph LR
A[Request] --> B[Parser]
B --> C[AsyncLLM.add_request]
C --> D[Scheduler]
D --> E[Model Executor]
E --> F[Output]
D --> G[KV Cache]
Data Parallel Modes
The engine supports multiple data parallel configurations:
| Mode | Flag | Description |
|---|---|---|
| External LB | --data-parallel-external-lb | External load balancer coordinates workers |
| Hybrid LB | --data-parallel-hybrid-lb | Hybrid approach with custom rank assignment |
| Rank | --data-parallel-rank | Specify worker rank for distributed setup |
# Detection logic
is_external_lb = getattr(args, "data_parallel_external_lb", False)
is_hybrid_lb = getattr(args, "data_parallel_hybrid_lb", False)
Sources: vllm/entrypoints/cli/serve.py:35-40
Model Config Validation
The engine performs validation during model config creation:
model_config = engine_args.create_model_config()
# Clear quantization for render servers (preprocessing only)
if render_mode:
model_config.quantization = None
Sources: vllm/entrypoints/cli/launch.py:45-55
Key Design Patterns
Async/Await Architecture
The V1 engine uses native async/await for concurrent request handling:
async def step(self) -> List[RequestOutput]:
"""Execute one iteration of the engine."""
...
Command Pattern
CLI commands follow the Command pattern with CLISubcommand base class:
class ServeSubcommand(CLISubcommand):
name = "serve"
@staticmethod
def cmd(args: argparse.Namespace) -> None:
...
Subparser Registration
Commands register themselves via cmd_init() factory functions:
def cmd_init() -> list[CLISubcommand]:
return [LaunchSubcommand(), ServeSubcommand()]
Summary
The vLLM Core Engine Architecture provides a flexible, multi-layered design supporting both V0 (legacy) and V1 (current) engine implementations. Key characteristics include:
- Dual Engine Support: V0 for backward compatibility, V1 for production workloads
- Multiple Entry Points: CLI, Python API, HTTP server, gRPC
- Async-First Design: Native async/await for concurrent request processing
- Modular Components: Clear separation between CLI, engine core, and model execution
- Flexible Configuration: Comprehensive argument system via AsyncEngineArgs
- Data Parallel Support: Multiple modes for distributed serving scenarios
The architecture prioritizes performance through efficient GPU memory management, request batching, and pipelined execution while maintaining a clean, extensible design.
Sources: vllm/entrypoints/cli/main.py:1-40
Model Executor and Worker Architecture
Related topics: Core Engine Architecture, Scheduling and Request Processing, Model Architecture Support
Overview
The vLLM Model Executor and Worker Architecture forms the core execution layer responsible for model loading, inference, and batch-level processing. This architecture separates concerns between high-level request orchestration and low-level GPU-based model execution, enabling efficient parallel processing of LLM inference requests.
The architecture consists of two primary components:
| Component | Responsibility |
|---|---|
| Model Executor | Manages model loading, weight initialization, and model-specific execution logic |
| Worker | Handles GPU-side computation, memory management, and kernel execution |
Architecture Diagram
graph TD
A[AsyncEngineArgs] --> B[Model Executor]
B --> C[Model Loader]
C --> D[Model Weights]
B --> E[GPU Worker]
E --> F[GPU Model Runner]
F --> G[CUDA Kernels]
F --> H[Attention Layers]
F --> I[MLP Layers]
J[Request Batch] --> E
E --> K[Generated Tokens]
style B fill:#e1f5fe
style E fill:#fff3e0
style F fill:#f3e5f5
Model Executor Layer
Purpose and Scope
The Model Executor layer handles all aspects of model lifecycle management, including initialization, weight loading, and providing the interface for model execution. This layer operates independently of the worker layer, allowing for flexibility in model configuration.
Model Loading
The model loading subsystem supports multiple backends and quantization schemes:
class ModelLoader:
def load_model(self, model_config, parallel_config, device_config):
# Load model weights based on configuration
pass
Supported Loading Modes:
| Mode | Description |
|---|---|
| Auto | Automatically detect optimal loading strategy |
| Naive | Standard PyTorch loading |
| Sharded | Load model shards across multiple devices |
| Quantized | Load quantized weights (AWQ, GPTQ, GGUF) |
Sources: vllm/model_executor/model_loader/__init__.py
Model Registry
vLLM maintains a registry of supported model architectures:
# Model architecture registration
@register_model("LlamaForCausalLM")
class LlamaForCausalLM(nn.Module):
...
The registry maps model names to their corresponding model classes, enabling automatic model instantiation based on HuggingFace model architecture detection.
Sources: vllm/model_executor/models/__init__.py
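A self-contained sketch of such a registry (illustrative only; vLLM's actual registry also handles lazy imports and many more architectures):
# Illustrative registry sketch: map architecture names (as found in a
# HuggingFace config's "architectures" field) to model classes.
_MODEL_REGISTRY: dict[str, type] = {}

def register_model(name: str):
    def wrap(cls: type) -> type:
        _MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register_model("LlamaForCausalLM")
class LlamaForCausalLM:
    pass

def resolve(architecture: str) -> type:
    # Look up the class registered for a detected architecture
    return _MODEL_REGISTRY[architecture]

print(resolve("LlamaForCausalLM"))  # <class '__main__.LlamaForCausalLM'>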
Worker Architecture
Base Worker
The worker base class defines the interface for all worker implementations:
class WorkerBase:
def __init__(self, vllm_config):
self.vllm_config = vllm_config
self.model = None
self.device = None
def execute_model(self, batch):
raise NotImplementedError
Sources: vllm/v1/worker/worker_base.py
GPU Worker
The GPU Worker is the primary worker implementation for GPU-based inference:
graph LR
A[Input Batch] --> B[Model Input Preparation]
B --> C[Forward Pass]
C --> D[Output Extraction]
D --> E[Token Generation]
Key Responsibilities:
- Initialize CUDA context and memory pools
- Prepare model inputs with proper padding and masking
- Execute forward passes on GPU
- Manage KV cache memory allocation
Sources: vllm/v1/worker/gpu_worker.py
GPU Model Runner
The GPU Model Runner handles the low-level model execution details:
class GPUModelRunner:
def __init__(self, config):
self.kv_cache = None
self.attn_metadata = None
self.block_manager = None
def prepare_inputs(self, batch):
# Prepare input tensors with proper device placement
pass
def execute_model(self, input_tokens, positions):
# Execute model forward pass
pass
Core Components:
| Component | Function |
|---|---|
| kv_cache | Stores key-value tensors for attention |
| attn_metadata | Manages attention metadata for paged attention |
| block_manager | Handles physical memory block allocation |
Sources: vllm/v1/worker/gpu_model_runner.py
Execution Flow
sequenceDiagram
participant API as API Server
participant Executor as Model Executor
participant Worker as GPU Worker
participant Runner as GPU Model Runner
participant Kernels as CUDA Kernels
API->>Executor: Initialize model
Executor->>Worker: Load weights
Worker->>Runner: Initialize GPU state
Runner->>Kernels: Allocate memory
API->>Executor: Execute batch
Executor->>Worker: Forward request
Worker->>Runner: Prepare inputs
Runner->>Kernels: Compute attention
Runner->>Kernels: Compute mlp
Kernels-->>Runner: Output logits
Runner-->>Worker: Return results
Worker-->>Executor: Output tokens
Memory Management
KV Cache Architecture
vLLM uses a block-based KV cache management system:
- Logical Blocks: Abstract representation of KV cache entries
- Physical Blocks: Actual GPU memory allocations
- Block Mapping: Links logical blocks to physical locations
class BlockManager:
def allocate(self, num_blocks):
# Allocate physical blocks
pass
def get_physical_block(self, logical_id):
# Get physical location for logical block
pass
Memory Allocation Strategy
| Strategy | Use Case |
|---|---|
| Dynamic | Default, allocates on demand |
| Static | Pre-allocates at initialization |
| Hybrid | Mix of static and dynamic |
Configuration Parameters
The Model Executor and Worker architecture is configured through AsyncEngineArgs:
| Parameter | Description | Default |
|---|---|---|
| model | Model name or path | Required |
| tensor_parallel_size | Number of GPUs for tensor parallelism | 1 |
| pipeline_parallel_size | Number of pipeline stages | 1 |
| gpu_memory_utilization | Fraction of GPU memory for KV cache | 0.9 |
| max_model_len | Maximum sequence length | Auto |
| block_size | KV cache block size | 16 |
Summary
The Model Executor and Worker Architecture in vLLM provides a modular, extensible system for LLM inference:
- Separation of Concerns: Clear boundaries between model loading and execution
- GPU Optimization: Efficient CUDA kernel integration and memory management
- Flexible Configuration: Support for various parallelism and quantization strategies
- Extensibility: Plugin-based model registration system
This architecture enables vLLM to achieve high throughput through batch processing while maintaining low latency through careful memory management and kernel optimization.
Sources: vllm/model_executor/model_loader/__init__.py
Scheduling and Request Processing
Related topics: Core Engine Architecture, Model Executor and Worker Architecture, Distributed Inference and Parallelism
Overview
The Scheduling and Request Processing system is a core component of vLLM's v1 engine architecture. It manages the lifecycle of inference requests from arrival to completion, coordinating resource allocation, batch scheduling, and execution ordering to maximize GPU utilization while maintaining quality of service guarantees.
In vLLM, the scheduler operates as an asynchronous event-driven system that continuously evaluates pending requests, determines optimal batching strategies, and dispatches work to the underlying execution engine. This design enables vLLM to handle high-throughput serving workloads with efficient memory management through mechanisms like paged attention and dynamic batch composition.
The scheduler works in conjunction with the request queue, which serves as the primary buffer for incoming inference requests. It makes real-time decisions about which requests to include in the next execution batch based on available GPU memory, request priorities, and fairness constraints.
Architecture Overview
Component Hierarchy
The scheduling system consists of several interconnected components that work together to manage request processing:
graph TD
A[API Layer] --> B[Request Queue]
B --> C[Async Scheduler]
C --> D[Scheduler]
D --> E[Execution Engine]
E --> F[GPU]
G[Config] --> C
G --> D
Core Components
| Component | File | Purpose |
|---|---|---|
| Scheduler | vllm/v1/core/sched/scheduler.py | Core scheduling logic and batch selection |
| AsyncScheduler | vllm/v1/core/sched/async_scheduler.py | Async wrapper for scheduler operations |
| RequestQueue | vllm/v1/core/sched/request_queue.py | Request buffering and ordering |
| Request | vllm/v1/request.py | Request data model and state |
| SchedulerConfig | vllm/config/scheduler.py | Scheduler configuration parameters |
Request Model
Request Data Structure
The Request class represents an individual inference request with all associated metadata and state. Each request maintains its own context including tokenized inputs, sampling parameters, and execution state.
The request model tracks the following key attributes:
| Attribute | Type | Description |
|---|---|---|
| request_id | str | Unique identifier for the request |
| prompt | str | Original input text or tokens |
| prompt_token_ids | List[int] | Tokenized prompt |
| sampling_params | SamplingParams | Sampling configuration |
| arrival_time | float | Request arrival timestamp |
| state | RequestState | Current execution state |
Request States
Requests transition through a defined state machine during their lifecycle:
stateDiagram-v2
[*] --> WAITING: Request Arrival
WAITING --> SCHEDULED: Scheduler Selection
SCHEDULED --> RUNNING: Dispatch to GPU
RUNNING --> WAITING: Preemption/Continuation
RUNNING --> FINISHED: Completion
RUNNING --> WAITING: KV Cache Reuse
WAITING --> CANCELLED: Client Cancellation
SCHEDULED --> CANCELLED: Client Cancellation
- WAITING: Request is queued and awaiting scheduling
- SCHEDULED: Request has been selected for the next batch
- RUNNING: Request is actively being processed on GPU
- FINISHED: Request has completed successfully
- CANCELLED: Request was cancelled before completion
Sources: vllm/v1/request.py
Request Queue
The RequestQueue serves as the primary buffering mechanism for incoming requests. It provides thread-safe operations for enqueueing, dequeuing, and managing request priorities.
Queue Operations
| Operation | Description |
|---|---|
| enqueue() | Add new request to queue |
| dequeue() | Remove and return next request |
| peek() | View next request without removal |
| cancel() | Remove cancelled request |
| requeue() | Return request to queue (preemption) |
Priority Handling
The request queue supports priority-based ordering where higher priority requests can bypass lower priority ones. Priority is determined by a combination of factors:
- Explicit priority values in SamplingParams
- Arrival time (older requests may get priority for fairness)
- Request type (prefill vs decode operations)
Sources: vllm/v1/core/sched/request_queue.py
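A toy priority queue showing these operations and ordering rules (illustrative only; vLLM's RequestQueue is considerably more involved):
import heapq
import itertools

class ToyRequestQueue:
    def __init__(self):
        self._heap = []                  # entries: (priority, seq, request)
        self._seq = itertools.count()    # arrival order breaks priority ties

    def enqueue(self, request, priority: int = 0):
        # Lower number = higher priority; FIFO among equal priorities
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def dequeue(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

    def peek(self):
        return self._heap[0][2] if self._heap else None

q = ToyRequestQueue()
q.enqueue("decode-r1", priority=0)
q.enqueue("prefill-r2", priority=1)
print(q.dequeue())  # decode-r1 (higher priority)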
Scheduler
Core Scheduling Logic
The scheduler is responsible for selecting which requests to include in the next execution batch. It operates on a continuous scheduling loop that evaluates the current state of all pending requests and available resources.
#### Scheduling Criteria
The scheduler makes decisions based on multiple factors:
- Memory Availability: Sufficient GPU memory must be available for the request's KV cache
- Batch Size Limits: Maximum batch size constraints
- Chunked Prefill Decisions: Whether to split prefill into smaller chunks
- Priority Weighting: Relative importance of pending requests
- Latency Targets: QoS requirements for specific request categories
#### Scheduling Loop
graph LR
A[Evaluate Pending Requests] --> B{Sufficient Resources?}
B -->|Yes| C[Select Request]
C --> D[Add to Batch]
D --> E{Batch Full?}
E -->|No| A
E -->|Yes| F[Dispatch Batch]
B -->|No| G[Wait / Preempt]
F --> H[Update State]
H --> A
Async Scheduler Interface
The AsyncScheduler provides an asynchronous interface to the core scheduler, enabling non-blocking scheduling operations. This is essential for maintaining high throughput in production serving scenarios where the scheduler must coexist with network I/O and other async operations.
Key async operations include:
- schedule_async(): Async wrapper for a scheduling iteration
- add_request(): Async request enqueueing
- abort_request(): Async request cancellation
Sources: vllm/v1/core/sched/async_scheduler.py
Scheduler Configuration
The SchedulerConfig class defines all configurable parameters for the scheduler behavior. These settings control batching strategies, memory management, and QoS characteristics.
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_num_seqs | int | 256 | Maximum sequences per iteration |
| max_num_batched_tokens | int | 8192 | Max tokens per batch |
| max_model_len | int | 8192 | Maximum model context length |
| enable_chunked_prefill | bool | True | Enable prefill chunking |
| num_prefill_groups | int | 1 | Number of prefill groups |
| async_scheduling | bool | True | Enable async scheduling |
Memory Management Settings
| Parameter | Description |
|---|---|
| gpu_memory_utilization | Fraction of GPU memory for KV cache (0.0-1.0) |
| num_causal_layers | Number of causal attention layers |
| head_dim | Dimension of attention heads |
Sources: vllm/config/scheduler.py
Batching Strategies
Continuous Batching
vLLM employs continuous batching (also known as iteration-level scheduling) to maximize GPU utilization. Unlike static batching, where batch composition is fixed at the start, continuous batching allows requests to enter and exit the batch at each iteration.
#### Advantages
- Higher Throughput: GPU is never idle waiting for batch to complete
- Lower Latency: Short requests don't wait for long ones
- Better Memory Utilization: Dynamic allocation based on actual needs
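To make the iteration-level idea concrete, here is a toy simulation (purely illustrative, with a hypothetical max_num_seqs of 2): requests join and leave the running batch at every step instead of waiting for the whole batch to finish.
# Toy continuous-batching simulation: requests join/leave the batch each step.
running = []
waiting = [("r1", 3), ("r2", 1), ("r3", 2)]   # (request_id, decode steps left)
step = 0
while running or waiting:
    while waiting and len(running) < 2:        # admit up to max_num_seqs = 2
        running.append(waiting.pop(0))
    running = [(rid, left - 1) for rid, left in running]   # one decode step each
    done = [rid for rid, left in running if left == 0]
    running = [(rid, left) for rid, left in running if left > 0]
    step += 1
    print(f"step {step}: finished {done}")     # r2 exits early; r3 backfills its slot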
Prefill Batching
Prefill operations (processing new tokens) and decode operations (generating new tokens) have different characteristics. The scheduler can:
- Combined Batching: Mix prefill and decode in same batch
- Separate Batching: Process prefill and decode in distinct batches
- Chunked Prefill: Split large prefill requests into smaller chunks
Chunked Prefill
When enable_chunked_prefill is enabled, large prefill requests are split into smaller chunks to:
- Reduce memory pressure
- Allow shorter requests to be scheduled faster
- Improve fairness between requests of different lengths
Sources: vllm/v1/core/sched/scheduler.py
Request Processing Flow
Full Request Lifecycle
sequenceDiagram
participant Client
participant API
participant Queue
participant Scheduler
participant Engine
participant GPU
Client->>API: Submit Request
API->>API: Validate & Tokenize
API->>Queue: Enqueue Request
Scheduler->>Queue: Dequeue Requests
Scheduler->>Scheduler: Evaluate Batching
Scheduler->>Engine: Dispatch Batch
Engine->>GPU: Execute Forward Pass
GPU-->>Engine: Output Tensors
Engine-->>Scheduler: Update State
Scheduler->>Queue: Requeue if Needed
Scheduler->>API: Stream Tokens
API-->>Client: Response Stream
Scheduling Iteration
Each scheduling iteration follows these steps:
- Request Evaluation: Scan all pending requests for eligibility
- Resource Calculation: Determine available GPU memory
- Batch Composition: Select requests based on scheduling policy
- Chunk Assignment: Divide prefill requests if needed
- Batch Dispatch: Send batch to execution engine
- State Update: Update request states and metrics
Configuration Example
from vllm.config import SchedulerConfig
config = SchedulerConfig(
max_num_seqs=256,
max_num_batched_tokens=8192,
max_model_len=32768,
enable_chunked_prefill=True,
gpu_memory_utilization=0.9,
)
Performance Considerations
Memory Management
The scheduler must balance memory allocation between:
- KV Cache: Storing attention key-value pairs
- Model Weights: The LLM parameters (typically pre-loaded)
- Activation Memory: Temporary tensors during computation
Latency vs Throughput
The scheduling configuration affects the latency-throughput tradeoff:
| Setting | Effect |
|---|---|
| Smaller max_num_seqs | Lower latency, lower throughput |
| Larger max_num_batched_tokens | Higher throughput, variable latency |
| enable_chunked_prefill=True | Better latency fairness |
Preemption
When GPU memory is insufficient for incoming requests, the scheduler may preempt existing requests to free memory. Preempted requests are returned to the queue and rescheduled later.
Integration with Execution Engine
The scheduler interfaces with the execution engine through a well-defined API:
- schedule(): Main entry point for scheduling decisions
- add_request(): Register new inference request
- abort_request(): Cancel running request
- update_from_output(): Process execution results
The execution engine receives scheduled batches and returns completed or paused requests, allowing the scheduler to update its internal state and make subsequent scheduling decisions.
Related Components
For a complete understanding of vLLM's request processing, also refer to:
- Engine: Coordinates between scheduler and model execution
- Worker: Executes model operations on GPU
- Cache: KV cache management and allocation
- Sampling Params: Sampling configuration affecting scheduling
Summary
The Scheduling and Request Processing system is fundamental to vLLM's ability to serve large language models efficiently. By employing continuous batching, intelligent memory management, and flexible scheduling policies, vLLM achieves high throughput while maintaining low latency for diverse workloads. The modular design with separate scheduler, queue, and configuration components allows for fine-tuned control over serving behavior.
Sources: vllm/v1/request.py
PagedAttention and KV Cache Management
Related topics: Scheduling and Request Processing, Distributed Inference and Parallelism, Attention Backends and Kernels
Overview
PagedAttention is a novel attention mechanism that enables efficient virtual memory-based management of the Key-Value (KV) cache in large language model (LLM) inference. Inspired by the memory management technique in operating systems called paging, PagedAttention divides the KV cache into fixed-size "pages" that can be flexibly allocated and managed, eliminating the need for contiguous memory allocation.
The KV cache stores the key and value tensors from attention computation for each token position. During autoregressive decoding, this cache grows dynamically as new tokens are generated. Traditional LLM serving systems allocate contiguous memory blocks for the KV cache, leading to significant memory waste due to internal and external fragmentation when handling variable-length sequences and multi-user workloads.
vLLM's PagedAttention implementation provides:
- Memory efficiency: Eliminates fragmentation by using non-contiguous page-based allocation
- Flexible batching: Supports arbitrary sequence lengths and concurrent requests
- Dynamic memory management: Allocates cache pages on-demand during generation
- GPU memory optimization: Maximizes GPU memory utilization for higher throughput
Architecture
High-Level System Design
graph TD
subgraph "Application Layer"
Req[Inference Request]
Sched[Scheduler]
end
subgraph "Memory Management Layer"
KVM[KV Cache Manager]
BP[Block Pool]
end
subgraph "GPU Memory Layer"
GPU[GPU Memory]
Pages[KV Cache Pages]
end
Req --> Sched
Sched --> KVM
KVM --> BP
BP --> GPU
Pages -.->|Physical Memory| GPU
PagedAttention Version Comparison
vLLM implements two versions of PagedAttention with different performance characteristics:
| Aspect | PagedAttention V1 | PagedAttention V2 |
|---|---|---|
| Kernel Type | Fused attention kernels | Optimized fused kernels |
| Memory Access | Standard page table lookup | Enhanced page table optimization |
| Performance | Baseline optimized | ~2.2x speedup over V1 |
| Use Case | General purpose | Production workloads |
| Implementation | paged_attention_v1.cu | paged_attention_v2.cu |
Sources: csrc/attention/paged_attention_v1.cu:1-100, csrc/attention/paged_attention_v2.cu:1-100, docs/design/paged_attention.md:1-50
KV Cache Manager
The KV Cache Manager (kv_cache_manager.py) is the core component responsible for tracking and managing the allocation of KV cache pages across all in-flight sequences.
Responsibilities
| Responsibility | Description |
|---|---|
| Block Allocation | Allocates and deallocates cache blocks as sequences grow or complete |
| Reference Counting | Tracks how many sequences reference each physical block |
| Page Table Management | Maintains virtual-to-physical page mappings per sequence |
| Cache Eviction | Handles cache eviction when memory pressure occurs |
Key Data Structures
# Simplified representation of block metadata
class Block:
block_id: int # Physical block identifier
page_indices: List[int] # Virtual page indices mapping
ref_count: int # Number of sequences referencing this block
is_computed: bool # Whether this block has computed KV cache
Block Allocation Workflow
sequenceDiagram
participant Scheduler
participant KVM as KV Cache Manager
participant BP as Block Pool
participant GPU as GPU Memory
Scheduler->>KVM: allocate_blocks(sequence_id, num_tokens)
KVM->>BP: reserve_blocks(count)
BP->>GPU: allocate_contiguous_pages
GPU-->>BP: block_handles
BP-->>KVM: allocated_blocks
KVM-->>Scheduler: block_table
Sources: vllm/v1/core/kv_cache_manager.py:1-150, vllm/v1/core/block_pool.py:1-100
Block Pool
The Block Pool (block_pool.py) manages the physical memory allocation of KV cache pages on GPU memory.
Memory Organization
| Parameter | Description | Typical Value |
|---|---|---|
| block_size | Number of tokens per cache block | 16 tokens |
| num_blocks | Total number of available blocks | Dynamic based on GPU memory |
| num_layers | Number of attention layers | Model-dependent (e.g., 32-80) |
| num_kv_heads | Number of key/value attention heads | Model-dependent |
| head_dim | Dimension of each attention head | 128 or 256 |
Block Pool Operations
graph LR
subgraph "Allocation States"
Free[Free Blocks]
Used[Used Blocks]
Partial[Partially Filled]
end
Free -->|Allocate Block| Used
Used -->|Release Block| Free
Partial -->|Append Token| Used
The block pool maintains three primary states:
- Free Blocks: Available for immediate allocation
- Used Blocks: Currently assigned to active sequences
- Partially Filled Blocks: Blocks with available slots for new tokens
Sources: vllm/v1/core/block_pool.py:1-100
PagedAttention Kernel Implementation
CUDA Kernel Architecture
Both PagedAttention V1 and V2 are implemented as optimized CUDA kernels that handle attention computation with page-table-based memory access.
#### V1 Kernel Flow
graph TD
A[Load Query Tokens] --> B[Get Block Table Entry]
B --> C[Load K/V from Pages]
C --> D[Compute Attention Scores]
D --> E[Write Output to GEMM]
#### V2 Kernel Optimizations
PagedAttention V2 introduces several optimizations:
- Reduced Page Table Lookups: Consolidates multiple lookups into single operations
- Improved Memory Coalescing: Better memory access patterns for KV cache
- Warp-Level Primitives: Utilizes warp-level reductions for faster softmax computation
- Fused Operations: Combines multiple operations into single kernel launches
Sources: csrc/attention/paged_attention_v1.cu:1-200, csrc/attention/paged_attention_v2.cu:1-200
Kernel Parameters
| Parameter | Description | Range |
|---|---|---|
| THREADS_PER_BLOCK | CUDA threads per block | 128-256 |
| BLOCK_SIZE | Tokens per block | 16 |
| CACHE_BLOCK_SIZE | Bytes per cache block | Dynamic |
| num_heads | Total query heads | Model-dependent |
| head_dim | Attention head dimension | 128/256 |
Page Table Management
Virtual-to-Physical Mapping
Each sequence maintains a page table that maps virtual token positions to physical cache blocks:
# Virtual page table structure
page_table = [
physical_block_id, # virtual position 0-15
physical_block_id, # virtual position 16-31
physical_block_id, # virtual position 32-47
...
]
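The lookup itself is simple integer arithmetic; a sketch, using the 16-token block size from the defaults above:
BLOCK_SIZE = 16  # tokens per KV cache block

def lookup(page_table, token_pos):
    # Which logical page the token falls in, and its slot within that page
    virtual_page, slot = divmod(token_pos, BLOCK_SIZE)
    return page_table[virtual_page], slot

page_table = [5, 12, 3]           # virtual page -> physical block id
print(lookup(page_table, 37))     # token 37 -> (physical block 3, slot 5)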
Block Table Structure
graph TD
subgraph "Sequence 1"
S1_VP1[Virtual Page 0] --> S1_PP1[Physical Block 5]
S1_VP2[Virtual Page 1] --> S1_PP2[Physical Block 12]
S1_VP3[Virtual Page 2] --> S1_PP3[Physical Block 3]
end
subgraph "Sequence 2"
S2_VP1[Virtual Page 0] --> S2_PP1[Physical Block 5]
S2_VP2[Virtual Page 1] --> S2_PP2[Physical Block 7]
end
The page table allows:
- Non-contiguous storage: Physical blocks need not be contiguous
- Shared blocks: Multiple sequences can share the same physical block (for prefixes)
- Dynamic growth: New pages allocated as sequence extends
Sources: vllm/v1/core/kv_cache_manager.py:100-200, docs/design/paged_attention.md:50-150
Memory Management Strategies
Allocation Policy
vLLM uses a dynamic allocation policy that balances memory efficiency and allocation overhead:
| Strategy | Description | Trade-off |
|---|---|---|
| On-demand | Allocate blocks as tokens are generated | Lower memory waste, higher overhead |
| Pre-allocation | Reserve blocks when sequence starts | Lower overhead, potential waste |
| Hybrid | Pre-allocate prefix, on-demand for new tokens | Balanced approach |
Reference Counting
Reference counting prevents premature block deallocation:
- Initial allocation: Reference count = 1
- Fork/continue: Reference count incremented
- Completion: Reference count decremented
- Free block: When count reaches 0
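In sketch form (illustrative, not vLLM's implementation):
class PhysicalBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_count = 1            # initial allocation

def fork(block: PhysicalBlock):
    block.ref_count += 1              # a second sequence now shares the block

def release(block: PhysicalBlock, free_list: list):
    block.ref_count -= 1
    if block.ref_count == 0:          # last reference gone: block is reusable
        free_list.append(block.block_id)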
Memory Reclamation
When GPU memory is exhausted:
graph TD
A[Memory Pressure Detected] --> B{Sequence Can Evict?}
B -->|Yes| C[Evict Cold Blocks]
B -->|No| D[Wait or Reject Request]
C --> E[Update Page Tables]
E --> F[Allocate New Blocks]
Sources: vllm/v1/core/kv_cache_manager.py:200-300, vllm/v1/core/block_pool.py:100-200
Integration with Scheduler
Request Lifecycle
stateDiagram-v2
[*] --> Received: New Request
Received --> Scheduled: Add to Queue
Scheduled --> Allocating: Acquire Blocks
Allocating --> Running: Blocks Available
Running --> Running: Generate Token
Running --> Completed: EOS Token
Completed --> [*]: Free Blocks
Running --> Waiting: Block Unavailable
Waiting --> Running: Blocks Acquired
Scheduling Integration Points
| Stage | KV Cache Interaction |
|---|---|
| Request Arrival | Pre-allocate blocks for known prefix length |
| Token Generation | Allocate new block when current fills up |
| Sequence Completion | Release all associated blocks |
| Prefix Caching | Share blocks with identical prefixes |
Configuration Options
Server Configuration
vllm serve <model> \
--block-size 16 \
--num-gpu-blocks-override 1000 \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 8192
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| block_size | Number of tokens per KV cache block | 16 |
| gpu_memory_utilization | Fraction of GPU memory for KV cache | 0.9 |
| num_gpu_blocks_override | Override auto-computed block count | Auto |
| max_num_batched_tokens | Maximum tokens in a single batch | Dynamic |
Memory Calculation
total_cache_memory = num_blocks × block_size × num_layers × 2 × num_kv_heads × head_dim × dtype_size
Where:
- num_blocks = GPU memory × utilization / per_block_memory
- 2 accounts for both K and V caches
- dtype_size = 2 bytes for FP16, 1 byte for INT8, etc.
Sources: docs/design/paged_attention.md:150-250
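Plugging in numbers for a hypothetical 7B-class model (32 layers, 8 KV heads, head dim 128, FP16 cache) makes the formula concrete:
num_layers, num_kv_heads, head_dim = 32, 8, 128   # hypothetical 7B-class model
block_size, dtype_size = 16, 2                    # 16 tokens/block, FP16 = 2 bytes
per_block = block_size * num_layers * 2 * num_kv_heads * head_dim * dtype_size
print(per_block // 2**20, "MiB per block")        # 2 MiB
cache_bytes = 40 * 2**30                          # e.g. 40 GiB left after weights
num_blocks = cache_bytes // per_block
print(num_blocks, "blocks =", num_blocks * block_size, "cacheable tokens")  # 20480 blocks, 327680 tokens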
Performance Characteristics
Throughput Improvements
PagedAttention enables significant performance improvements compared to traditional contiguous allocation:
| Metric | Traditional | PagedAttention V1 | PagedAttention V2 |
|---|---|---|---|
| Memory Waste | 30-60% | <5% | <5% |
| Throughput | Baseline | ~2.1x | ~2.2x |
| BS=1 Latency | Baseline | ~1.9x | ~2.0x |
| BS=16+ Latency | Baseline | ~2.0x | ~2.2x |
Scalability
The system scales efficiently with:
- Longer sequences: O(seq_len) memory, no fragmentation
- More concurrent requests: Block sharing for shared prefixes
- Larger models: Better memory utilization per GPU
Sources: csrc/attention/paged_attention_v2.cu:200-300, docs/design/paged_attention.md:250-350
Implementation Details
Block Metadata Tracking
# Core block metadata structure
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
block_id: int
device: Device
block_size: int
num_tokens: int = 0
# Physical location
physical_block_id: Optional[int] = None
# Content tracking
content_hash: Optional[int] = None
computed: bool = False
# Reference counting for shared blocks
ref_count: int = 0
Attention Computation Flow
graph TD
subgraph "Pre-computation"
Q[Query Tensors] --> QS[Query Split]
K[Key Tensors] --> KS[Key Split]
V[Value Tensors] --> VS[Value Split]
end
subgraph "Paged Attention"
QS --> AL[Attention Layer]
KS --> AL
VS --> AL
AL --> PT[Page Table Lookup]
end
subgraph "Output"
PT --> OUT[Attention Output]
OUT --> SOFTMAX[Softmax]
SOFTMAX --> GEMM[Final GEMM]
end
Best Practices
Memory Configuration
- Set appropriate GPU memory utilization based on other memory needs (model weights, activations)
- Adjust block size for typical sequence lengths (larger blocks = less overhead for long sequences)
- Monitor fragmentation using vLLM metrics
Request Optimization
- Use consistent prefixes to enable block sharing across requests
- Batch similar requests to maximize cache hit rates
- Set appropriate max sequence length to avoid excessive block allocation
Debugging
# Enable verbose logging
export VLLM_LOGGING_LEVEL=DEBUG
# Inspect Prometheus metrics (KV cache usage is exported here)
curl http://localhost:8000/metrics
Related Components
| Component | File | Role |
|---|---|---|
| Attention Backend | vllm/attention/backends/ | Pluggable attention implementations |
| Scheduler | vllm/v1/core/sched.py | Coordinates cache allocation with scheduling |
| Model Runner | vllm/v1/core/model_runner.py | Executes attention with managed KV cache |
| Worker | vllm/v1/worker/worker.py | GPU-side cache management |
References
- vLLM Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention"
- Original Implementation: csrc/attention/paged_attention_v1.cu
- Optimized Implementation: csrc/attention/paged_attention_v2.cu
- Design Documentation: docs/design/paged_attention.md
- KV Cache Manager: vllm/v1/core/kv_cache_manager.py
- Block Pool: vllm/v1/core/block_pool.py
Sources: csrc/attention/paged_attention_v1.cu:1-100, csrc/attention/paged_attention_v2.cu:1-100, docs/design/paged_attention.md:1-50
Attention Backends and Kernels
Related topics: PagedAttention and KV Cache Management, Quantization Support
Overview
The Attention Backends and Kernels system in vLLM provides a pluggable, hardware-accelerated implementation of attention mechanisms for Large Language Model (LLM) inference. This abstraction layer enables vLLM to leverage different optimized attention implementations (FlashAttention, FlashInfer, FlashMLA) depending on hardware capabilities and model requirements, while maintaining a unified interface for the rest of the engine.
Architecture
High-Level Design
vLLM implements a backend registry pattern that allows runtime selection of the optimal attention implementation. The attention system is designed to support both the legacy v0 engine and the optimized v1 engine architecture.
graph TD
A[Attention Interface] --> B[Attention Backends Registry]
B --> C[FlashAttention Backend]
B --> D[FlashInfer Backend]
B --> E[FlashMLA Backend]
C --> F[CUDA Kernels]
D --> G[Template-based Kernels]
E --> H[MLA-optimized Kernels]
I[Model Request] --> J[Scheduler]
J --> K[Attention Layer]
K --> A
Backend Selection Flow
The registry-based architecture enables automatic backend selection based on hardware and configuration:
graph TD
A[Engine Initialization] --> B{Check user-specified backend}
B -->|Specified| C[Validate backend compatibility]
B -->|Auto| D{Detect GPU Architecture}
D -->|H100/H200| E[Select FlashAttention3]
D -->|A100/A10| F[Select FlashAttention2]
D -->|Other| G[Select FlashAttention2]
C -->|Valid| H[Initialize Backend]
C -->|Invalid| I[Raise Configuration Error]
E --> H
F --> H
G --> H
Attention Backend Components
Backend Registry
The registry.py module provides the central factory for attention backend selection:
| Component | Responsibility |
|---|---|
AttentionBackend | Abstract base class defining the backend interface |
get_available_backends() | Returns list of compiled/available backends |
get_backend(name) | Retrieves a specific backend by name |
AttentionImpl | Concrete implementation wrapper per backend |
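To illustrate the registry pattern the table describes, here is a minimal, self-contained sketch. The function names mirror the table, but the bodies are illustrative, not vLLM's actual implementation:
from abc import ABC, abstractmethod

# Hypothetical registry mirroring the components in the table above.
_ATTENTION_BACKENDS: dict[str, type["AttentionBackend"]] = {}

class AttentionBackend(ABC):
    """Abstract base class: every backend must at least report its name."""
    @staticmethod
    @abstractmethod
    def get_name() -> str: ...

def register_backend(cls: type[AttentionBackend]) -> type[AttentionBackend]:
    # Usable as a class decorator on concrete backends.
    _ATTENTION_BACKENDS[cls.get_name()] = cls
    return cls

def get_available_backends() -> list[str]:
    return sorted(_ATTENTION_BACKENDS)

def get_backend(name: str) -> type[AttentionBackend]:
    try:
        return _ATTENTION_BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown attention backend: {name}") from None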
FlashAttention Backend
Located at vllm/v1/attention/backends/flash_attn.py, this backend provides:
- FlashAttention 2/3 optimized CUDA kernels
- Support for head dimensions 64, 80, 96, 128, 160, 192, 256
- Paged attention integration
- Cross-attention support for encoder-decoder models
Key Features:
| Feature | Description |
|---|---|
| Fused kernels | Combines attention computation steps |
| Dynamic persistent memory | Reuses KV cache blocks efficiently |
| Strided bias support | Enables sliding window attention |
FlashInfer Backend
Located at vllm/v1/attention/backends/flashinfer.py, this backend offers:
- Template-based kernel generation for flexibility
- Improved performance for specific head dimensions
- Better integration with vLLM's block manager
- Support for custom attention patterns
FlashMLA Backend
Located at vllm/v1/attention/backends/mla/flashmla.py, this backend is optimized for:
- Multi-head Latent Attention (MLA) architectures
- Low-rank KV cache compression
- Reduced memory bandwidth usage
- Optimized for DeepSeek-style models
Attention Interface
Core Abstraction
All attention backends implement a common interface defined by the Attention class:
from typing import List, Optional

import torch

class Attention:
    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: int,
        alibi_slopes: Optional[List[float]],
        cache_config: Optional["CacheConfig"],
        block_size: int,
    ) -> None: ...

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: Optional[torch.Tensor],
        attn_metadata: "AttentionMetadata",
    ) -> torch.Tensor: ...
Attention Metadata
The AttentionMetadata structure carries runtime information required for attention computation:
| Field | Type | Purpose |
|---|---|---|
seq_lens | List[int] | Sequence lengths for each request |
max_seq_len | int | Maximum sequence length in batch |
block_tables | torch.Tensor | Paged attention block mappings |
seq_start_idx | int | Starting index in output buffer |
Paged Attention Integration
vLLM's attention system integrates tightly with paged memory management:
graph LR
A[Logical KV Cache] --> B[Physical KV Blocks]
C[Query Tensors] --> D[Block Lookup]
D --> E[Attention Computation]
F[Block Table] --> D
B --> E
E --> G[Output Tensors]
Block Table Structure
| Block Index | Physical Address | Valid Length |
|---|---|---|
| 0 | 15 | 16 |
| 1 | 23 | 16 |
| 2 | 7 | 16 |
| 3 | -1 (pending) | 0 |
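A minimal sketch of how a logical token position resolves to a physical cache slot through a block table like the one above, assuming a block size of 16 and -1 marking a not-yet-allocated block; the names are illustrative:
BLOCK_SIZE = 16
block_table = [15, 23, 7, -1]  # physical block id per logical block, as in the table

def physical_slot(token_pos: int) -> tuple[int, int]:
    """Map a logical token position to (physical_block, offset_in_block)."""
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    physical_block = block_table[logical_block]
    if physical_block < 0:
        raise LookupError(f"block for token {token_pos} is not allocated yet")
    return physical_block, offset

print(physical_slot(37))  # token 37 -> logical block 2 -> physical block 7, offset 5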
Backend Configuration
Runtime Selection
Backends can be selected through multiple mechanisms:
| Method | Priority | Example |
|---|---|---|
| Environment variable | Lowest | VLLM_ATTENTION_BACKEND=FLASHINFER |
| Model config | Medium | "attention_backend": "flash_attn" |
| CLI argument | Highest | --attention-backend flashinfer |
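For example, forcing the FlashInfer backend through the environment variable from the table (the model name is the small default model used elsewhere in this manual):
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Qwen/Qwen3-0.6B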
Configuration Parameters
| Parameter | Backend | Description |
|---|---|---|
head_size | All | Dimension of each attention head |
num_kv_heads | All | Number of key/value heads (for GQA) |
scale | All | Attention scaling factor |
sliding_window | FlashAttn | Sliding window size |
page_size | FlashInfer | KV cache block size |
Performance Characteristics
Memory Efficiency
| Backend | KV Cache Format | Memory Overhead |
|---|---|---|
| FlashAttention | Standard | Baseline |
| FlashInfer | Blocked | Similar to FlashAttn |
| FlashMLA | Low-rank | 50-70% reduction |
Throughput Optimization
- Kernel Fusion: Combines multiple operations to reduce memory bandwidth
- Persistent Kernels: Keep thread blocks resident across decode steps to avoid relaunch overhead
- Async Execution: Overlaps attention with other operations
Implementation Details
FlashAttention Kernel Flow
sequenceDiagram
participant Q as Query
participant K as Key
participant V as Value
participant Kernel as FlashAttn Kernel
participant Cache as KV Cache
Q->>Kernel: Load Q tiles
K->>Kernel: Load K tiles
V->>Kernel: Load V tiles
Kernel->>Cache: Update (if write)
Kernel->>Kernel: Compute attention scores
Kernel->>Cache: Read (if read)
Kernel->>Kernel: softmax & scale
Kernel-->>Q: Output attention
Backend Initialization Sequence
graph TD
A[EngineArgs parse] --> B[Attention backend selection]
B --> C[Backend registry lookup]
C --> D[Load backend module]
D --> E[Initialize CUDA streams]
E --> F[Allocate persistent buffers]
F --> G[Register attention layers]
G --> H[Ready for inference]
Extending Attention Backends
To implement a custom attention backend:
- Create backend class: Inherit from AttentionBackend
- Implement required methods: get_name(), get_impl()
- Register backend: Add it to the ATTENTION_BACKENDS registry
- Implement kernels: Write CUDA/C++ kernels or wrap existing libraries
class CustomAttentionBackend(AttentionBackend):
@staticmethod
def get_name() -> str:
return "custom_attention"
@staticmethod
def get_impl() -> AttentionImpl:
return CustomAttentionImpl
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| CUDA out of memory | Large batch/sequence | Reduce gpu_memory_utilization |
| Incorrect outputs | Wrong backend selected | Verify with --attention-backend flag |
| Kernel launch failure | Unsupported head size | Use supported head dimensions |
| Slow inference | Suboptimal backend | Benchmark available backends |
Debugging Tips
- Set VLLM_LOGGING_LEVEL=DEBUG for attention kernel timing
- Use VLLM_ATTENTION_BACKEND=FLASH_ATTN to force a specific backend
- Check nvidia-smi for kernel execution times
Summary
The Attention Backends and Kernels system provides vLLM's computational core for transformer attention. Through the registry pattern, it enables seamless switching between optimized implementations while maintaining a stable interface for the rest of the engine. The pluggable architecture supports both established backends (FlashAttention, FlashInfer) and specialized implementations (FlashMLA) optimized for specific model architectures.
Source: https://github.com/vllm-project/vllm / Human Manual
Quantization Support
Related topics: Model Architecture Support, Attention Backends and Kernels
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Related Pages
Related topics: Model Architecture Support, Attention Backends and Kernels
Quantization Support
Overview
vLLM provides comprehensive quantization support to reduce model memory footprint and accelerate inference through various quantization schemes. Quantization compresses model weights from higher precision (typically FP16/BF16) to lower precision formats (such as INT8, INT4, or FP8), enabling larger models to run on limited GPU resources while maintaining acceptable accuracy.
The quantization system in vLLM is designed with a modular, extensible architecture that supports multiple quantization methods through a common abstraction layer. This allows users to easily switch between different quantization schemes or implement custom quantization strategies.
Sources: docs/features/quantization/README.md
Architecture
Quantization System Components
The quantization system consists of several interconnected components:
graph TD
A[Model Loading] --> B[QuantizationConfig]
B --> C[QuantizationMethod]
C --> D[Quantized Linear Layers]
C --> E[Quantized Embedding Layers]
D --> F[CUDA/ROCm Kernels]
E --> F
G[Weight Loading] --> H[Pre-quantized Weights]
G --> I[On-the-fly Quantization]
Quantization Base Architecture
The base quantization architecture defines a common interface that all quantization methods must implement:
classDiagram
class QuantizationConfig {
+get_supported_methods() List[QuantizationMethods]
+get_override_quant_config() Optional[dict]
+get_quant_config() dict
+verify_quant_config() None
}
class QuantizationMethods {
<<enumeration>>
FP8
GGUF
AWQ
GPTQ
}
class QuantizedLinear {
<<interface>>
+create_weights()
+apply_weights()
}
QuantizationConfig --> QuantizationMethods
QuantizationMethods --> QuantizedLinear
Sources: vllm/model_executor/layers/quantization/base_config.py
Layer Structure
Each quantization implementation defines its own quantized layer classes:
| Layer Type | Purpose | Quantization Scope |
|---|---|---|
QuantizedLinear | Matrix multiplication with quantized weights | Weight-only or Activation+Weight |
QuantizedEmbedding | Lookup table with quantized weights | Weight-only |
QuantizedMoE | Mixture-of-Experts with quantized components | Per-expert quantization |
Sources: vllm/model_executor/layers/quantization/fp8.py
Supported Quantization Methods
FP8 (8-bit Floating Point)
FP8 quantization uses 8-bit floating point representation with two formats:
| Format | Exponent Bits | Mantissa Bits | Use Case |
|---|---|---|---|
| E4M3 | 4 | 3 | Activations and weights |
| E5M2 | 5 | 2 | Gradients and optimizer states |
FP8 quantization is particularly well-supported in vLLM with optimized CUDA kernels for inference acceleration.
Sources: vllm/model_executor/layers/quantization/fp8.py
GGUF (GPT-Generated Unified Format)
GGUF is a quantized model format commonly used with llama.cpp. vLLM supports loading GGUF-quantized models directly:
# Load GGUF quantized model
llm = LLM(model="unsloth/Qwen3-0.6B-GGUF:Q4_K_M")
GGUF supports multiple quantization levels:
| Quantization Type | Bits per Parameter | Description |
|---|---|---|
| Q2_K | ~2.5 | 2-bit quantization with 4-bit key-values |
| Q3_K | ~3.5 | 3-bit quantization with 4-bit key-values |
| Q4_K | ~4.5 | 4-bit quantization with 8-bit key-values |
| Q5_K | ~5.5 | 5-bit quantization with 8-bit key-values |
| Q6_K | ~6.5 | 6-bit quantization |
| Q8_0 | ~8.0 | 8-bit quantization (baseline) |
Sources: vllm/model_executor/layers/quantization/gguf.py
AWQ (Activation-Aware Weight Quantization)
AWQ identifies weights with significant activation contributions and preserves them at higher precision while quantizing others aggressively.
GPTQ (Generative Pre-trained Transformer Quantization)
GPTQ performs post-training quantization with optional layer-wise precision adjustment and GPU optimization.
Quantization Configuration
Configuration Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
quantization | str | Quantization method name | None |
quantization_param_path | str | Path to quantization parameters file | None |
dtype | str | Model precision (if not pre-quantized) | auto |
kv_cache_dtype | str | KV cache quantization format | auto |
Enabling Quantization
Quantization is enabled through the --quantization CLI argument or quantization parameter in AsyncEngineArgs:
vllm serve model/path --quantization fp8
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-2-7b-hf",
quantization="fp8",
gpu_memory_utilization=0.9
)
Mixed Quantization
Some layers may remain in higher precision when quantization cannot be applied uniformly:
| Layer Type | Quantization Behavior |
|---|---|
| Input/Output embeddings | Often kept in FP16/BF16 |
| Output projection | May use different precision |
| Attention softmax | Usually in FP32 for stability |
Implementation Details
Weight Loading Pipeline
graph LR
A[Model Checkpoint] --> B{Pre-quantized?}
B -->|Yes| C[Load Quantized Weights]
B -->|No| D[Dynamic Quantization]
C --> E[Apply Quantization Config]
D --> E
E --> F[Initialize Quantized Layer]
F --> G[Verify Weight Shape]
G --> H[CUDA Kernel Ready]
C++ Backend (cuBLAS/cutlass)
High-performance quantization kernels are implemented in C++ and CUDA:
| Component | Location | Purpose |
|---|---|---|
| FP8 GEMM | csrc/quantization/fp8/ | FP8 matrix multiplication |
| W8A8 GEMM | csrc/quantization/fp8/ | INT8 weight, INT8 activation |
| W4A16 GEMM | csrc/quantization/ | INT4 weight, FP16 activation |
| Dequantization | csrc/quantization/ | Convert quantized to compute dtype |
Sources: csrc/quantization
Quantized Linear Layer Implementation
The QuantizedLinear layer handles the core computation:
class QuantizedLinear(QuantizedLayer):
"""Base class for quantized linear layers."""
def __init__(
self,
input_size: int,
output_size: int,
quantization_config: QuantizationConfig,
bias: bool = False,
):
self.input_size = input_size
self.output_size = output_size
self.quantization_config = quantization_config
def create_weights(self):
"""Initialize quantized weight tensors."""
raise NotImplementedError
def forward(self, input_):
"""Forward pass with quantized computation."""
raise NotImplementedError
Sources: vllm/model_executor/layers/quantization/base_config.py
Usage Examples
Loading Pre-quantized Models
from vllm import LLM
# FP8 quantized model
llm_fp8 = LLM(
model="meta-llama/Llama-2-70b-hf",
quantization="fp8",
tensor_parallel_size=4
)
# GGUF quantized model
llm_gguf = LLM(
model="TheBloke/Llama-2-70B-Chat-GGUF",
quantization="gguf",
tokenizer="meta-llama/Llama-2-70b-chat"
)
Quantization with Different Precisions
# Load with specific precision settings
llm = LLM(
model="Qwen/Qwen2.5-72B-Instruct",
quantization="fp8",
dtype="half", # Compute precision
kv_cache_dtype="fp8_e4m3" # KV cache precision
)
CLI Usage
# Serve with FP8 quantization
vllm serve meta-llama/Llama-2-70b-hf \
--quantization fp8 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95
# Serve GGUF model directly
vllm serve TheBloke/Llama-2-7B-Chat-GGUF:Q4_K_M
Performance Considerations
Memory Reduction
| Quantization | Memory Reduction | Quality Impact |
|---|---|---|
| FP8 (E4M3) | ~50% | Minimal |
| INT8 | ~50% | Low |
| INT4 | ~75% | Moderate |
| INT2 | ~87.5% | High |
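A rough worked example behind the table, counting weights only (KV cache and activations are extra; the 70B parameter count is illustrative):
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB for a given parameter precision."""
    return num_params * bits_per_param / 8 / 1e9

params = 70e9  # e.g., a 70B-parameter model
for label, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gb(params, bits):.0f} GB")
# FP16: ~140 GB; FP8: ~70 GB (about 50% reduction); INT4: ~35 GB (about 75%)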
Throughput Impact
Quantization improves throughput through:
- Increased batch size: More sequences fit in GPU memory
- Higher memory bandwidth utilization: Smaller weights load faster
- Accelerated compute: INT8/FP8 operations on tensor cores
Accuracy Considerations
- Post-training quantization (PTQ): May introduce accuracy degradation
- Activation-aware methods (AWQ): Better preserves model capabilities
- Calibration: Some methods require calibration data for optimal accuracy
Extension Points
Custom Quantization Methods
To implement a custom quantization method, extend the base classes:
from vllm.model_executor.layers.quantization import QuantizationConfig

class CustomQuantizationConfig(QuantizationConfig):
    """Custom quantization configuration."""

    def __init__(self, quant_method: str = "weight_only"):
        self.quant_method = quant_method

    @staticmethod
    def get_name() -> str:
        return "custom_quant"

    @classmethod
    def get_supported_methods(cls) -> list[str]:
        return ["weight_only"]

    def get_quant_config(self) -> dict:
        return {"method": self.quant_method}
Registering Custom Quantization
Custom quantizations must be registered in the quantization registry to be discoverable at runtime.
Summary
vLLM's quantization support provides a flexible, extensible system for serving large language models with reduced memory footprint. The architecture separates concerns through well-defined interfaces, allowing seamless integration of new quantization methods while maintaining high performance through optimized CUDA kernels.
Key takeaways:
- Multiple quantization formats supported (FP8, GGUF, AWQ, GPTQ)
- Modular architecture enables easy extension
- Optimized C++/CUDA kernels for inference acceleration
- Simple API through CLI and Python interface
- Memory reduction up to 75% with INT4 quantization
Sources: docs/features/quantization/README.md
Distributed Inference and Parallelism
Related topics: Scheduling and Request Processing, PagedAttention and KV Cache Management
vLLM provides comprehensive support for distributed inference and parallelism, enabling efficient serving of large language models across multiple GPUs and nodes. This document covers the architecture, configuration, and implementation details of vLLM's distributed computing capabilities.
Overview
vLLM's distributed inference system enables horizontal scaling of LLM workloads by distributing model computation across multiple devices. The system supports multiple parallelism strategies, including tensor parallelism, pipeline parallelism, and data parallelism, along with specialized features like disaggregated prefill/decode and KV cache transfer.
The core components of distributed inference in vLLM include:
| Component | Purpose |
|---|---|
| Parallel State Manager | Coordinates process groups for distributed communication |
| Device Communicators | Handle low-level tensor communication (NCCL, CUDA) |
| KV Transfer System | Enables disaggregated prefill/decode architectures |
| Configuration System | Manages parallelism parameters and device placement |
Parallelism Strategies
vLLM supports three primary parallelism strategies, each addressing different aspects of distributed computation.
Tensor Parallelism (TP)
Tensor parallelism splits individual weight matrices across multiple GPUs, allowing computation of large matrices that would not fit in a single device's memory. This is particularly effective for dense layers like attention and feed-forward networks.
Tensor parallelism requires:
- NVIDIA GPUs with NCCL support
- High-bandwidth interconnects (NVLink preferred)
- Enough per-rank memory for that rank's weight shard plus activations and KV cache (weights are sharded, not replicated, across TP ranks)
Pipeline Parallelism (PP)
Pipeline parallelism distributes layers (stages) of the model across different GPUs or nodes. This approach reduces memory requirements per device while maintaining high GPU utilization through micro-batch pipelining.
Data Parallelism (DP)
Data parallelism replicates the entire model across multiple GPUs, with each replica processing different batches of requests. This is the simplest form of parallelism and scales throughput linearly with the number of replicas.
Parallel State Management
The ParallelState class in vllm/distributed/parallel_state.py is the central coordinator for distributed execution.
graph TD
A[ParallelState] --> B[Tensor Parallel Group]
A --> C[Pipeline Parallel Group]
A --> D[Data Parallel Group]
A --> E[World Communicator]
B --> F[Rank 0, 1, 2, 3]
C --> G[Stage 0, Stage 1]
D --> H[Replica 1, Replica 2]
Process Group Initialization
Parallel state is initialized through init_distributed_environment(), which creates the necessary process groups for communication.
# From vllm/distributed/parallel_state.py:45-78
import torch.distributed

def init_distributed_environment(
    rank: int,
    world_size: int,
    local_rank: int,
    init_method: str = "env://",
    backend: str = "nccl",
):
    # Initialize the global distributed context for this process
    torch.distributed.init_process_group(
        backend=backend,
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )
Rank and World Size Management
| Parameter | Description |
|---|---|
rank | Unique identifier for each process in the distributed group |
world_size | Total number of processes in the distributed group |
local_rank | Rank of the process within its local node |
Sources: vllm/distributed/parallel_state.py:45-78
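A short sketch of wiring these parameters up from the environment variables that torchrun exports for every process it spawns (RANK, WORLD_SIZE, LOCAL_RANK); the call assumes the function shown above:
import os

# torchrun sets these for each spawned process.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])

init_distributed_environment(rank=rank, world_size=world_size, local_rank=local_rank)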
Device Communication
CUDA Communicator
The CUDACommunicator class in vllm/distributed/device_communicators/cuda_communicator.py provides NCCL-based communication primitives optimized for CUDA tensors.
graph LR
A[Tensor] -->|all_reduce| B[Aggregated Tensor]
A -->|broadcast| C[Same Tensor on All Ranks]
A -->|reduce_scatter| D[Partitioned Results]
E[Partial Tensors] -->|all_gather| F[Complete Tensor]
Supported Communication Primitives
| Primitive | Function | Use Case |
|---|---|---|
all_reduce | Reduce tensors across all ranks | Gradient synchronization in TP |
broadcast | Send tensor from one rank to all | Weight updates |
all_gather | Collect tensors from all ranks | Output aggregation |
reduce_scatter | Reduce and partition across ranks | Gradient partitioning |
# From vllm/distributed/device_communicators/cuda_communicator.py:23-65
import torch
import torch.distributed
from torch.distributed import ProcessGroup

class CUDACommunicator:
    def __init__(self, group: ProcessGroup):
        self.group = group
        self.world_size = group.size()
        self.rank = group.rank()

    def all_reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        # NCCL all-reduce implementation (in-place sum across ranks)
        torch.distributed.all_reduce(tensor, group=self.group)
        return tensor
Sources: vllm/distributed/device_communicators/cuda_communicator.py:23-65
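To connect the primitives to tensor parallelism: in a row-parallel linear layer each rank computes a partial matrix product, and an all_reduce sums the partials into the full output. A minimal single-process sketch using the CPU gloo backend (real TP uses NCCL across GPUs; shapes and names are illustrative):
import os

import torch
import torch.distributed as dist

# One-process demo group; gloo runs on CPU so this works without GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# Row-parallel linear: each rank holds a slice of the weight rows and computes
# a partial output; summing the partials across ranks yields the full result.
x_shard = torch.randn(4, 8)    # this rank's slice of the input features
w_shard = torch.randn(8, 16)   # this rank's slice of the weight rows
partial = x_shard @ w_shard
dist.all_reduce(partial)       # with world_size=1 the sum is a no-op
dist.destroy_process_group()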
Communication Patterns in Distributed Inference
graph TD
subgraph "Tensor Parallel Region"
A[Attention AllReduce] --> B[FFN AllReduce]
B --> C[AllReduce Output]
end
subgraph "Pipeline Parallel Region"
D[Send Hidden States] --> E[Receive Hidden States]
E --> F[Backward Pass]
end
subgraph "Data Parallel Region"
G[Synchronize KV Cache] --> H[Load Balance Requests]
end
Disaggregated Prefill and Decode
vLLM supports disaggregated prefill/decode architectures where prefill (initial prompt processing) and decode (token generation) stages run on separate GPU clusters. This enables independent scaling of prefill and decode resources.
KV Transfer Architecture
The KV transfer system enables sharing of KV cache between prefill and decode instances.
graph LR
A[Prefill Instance] -->|KV Transfer| B[Shared Storage]
B -->|KV Load| C[Decode Instance]
subgraph "Prefill Process"
A1[Tokenize Prompts] --> A2[Process Prefill]
A2 --> A3[Save KV Cache]
end
subgraph "Decode Process"
C1[Load KV Cache] --> C2[Generate Tokens]
C2 --> C3[Streaming Output]
end
Sources: examples/disaggregated/example_connector/README.md
KV Connector Base
The KVConnectorBase class defines the interface for KV cache transfer implementations.
# From vllm/distributed/kv_transfer/kv_connector/base.py:15-85
from abc import ABC, abstractmethod
from typing import Any, Callable, List

class KVConnectorBase(ABC):
    def __init__(self, kv_transfer_config: "KVTransferParams"):
        self.config = kv_transfer_config

    @abstractmethod
    def load_kv_cache(
        self,
        requests: List["TransferJob"],
        scheduler: Any,
    ) -> None:
        """Load KV cache from an external source during prefill."""

    @abstractmethod
    def save_kv_cache(
        self,
        blocks: List["PhysicalTokenBlock"],
        kv_pair: "KVCache",
        callback: Callable,
    ) -> "TransferJob":
        """Save KV cache to external storage during decode."""
| Method | Purpose |
|---|---|
load_kv_cache | Load KV cache during prefill phase |
save_kv_cache | Persist KV cache during decode phase |
profile_num_available_blocks | Determine available transfer capacity |
Sources: vllm/distributed/kv_transfer/kv_connector/base.py:15-85
Example Connector Implementation
The ExampleConnector provides a reference implementation for KV transfer:
# From examples/disaggregated/example_connector/
class ExampleConnector(KVConnectorBase):
def __init__(self, kv_transfer_config: KVTransferParams):
super().__init__(kv_transfer_config)
self.local_storage = "./local_storage"
The connector workflow:
- Prefill Phase: Process prompts and save KV state to the local_storage directory
- Decode Phase: Load KV state from storage and continue generation
Sources: examples/disaggregated/example_connector/README.md
Configuration
Parallel Configuration Parameters
The ParallelConfig class in vllm/config/parallel.py manages all parallelism settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
tensor_parallel_size | int | 1 | Number of GPUs for tensor parallelism |
pipeline_parallel_size | int | 1 | Number of pipeline stages |
data_parallel_size | int | 1 | Number of data parallel replicas |
data_parallel_size_per_region | int | - | DP size for hybrid parallelism |
data_parallel_master_port | int | 29500 | Port for DP master communication |
data_parallel_master_addr | str | - | Address for DP master |
numa_aware | bool | False | Enable NUMA-aware GPU placement |
# From vllm/config/parallel.py:10-45
@dataclass
class ParallelConfig:
tensor_parallel_size: int = 1
pipeline_parallel_size: int = 1
data_parallel_size: int = 1
data_parallel_size_per_region: Optional[int] = None
data_parallel_master_port: int = 29500
data_parallel_master_addr: Optional[str] = None
data_parallel_standalone: bool = False
numa_aware: bool = False
Sources: vllm/config/parallel.py:10-45
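The three sizes multiply into the total process count, which is a quick sanity check before launching. A minimal sketch assuming the dataclass above:
config = ParallelConfig(
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
    data_parallel_size=2,
)

# Total processes (and usually GPUs) the deployment needs: TP x PP x DP.
world_size = (
    config.tensor_parallel_size
    * config.pipeline_parallel_size
    * config.data_parallel_size
)
print(world_size)  # 16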
Environment Variables
| Variable | Description |
|---|---|
CUDA_VISIBLE_DEVICES | Comma-separated list of GPU IDs to use |
VLLM_HOST_IP | Host IP for distributed communication |
VLLM_PORT | Port for worker communication |
Launching Distributed Inference
Multi-GPU Launch with torchrun
torchrun --nproc_per_node=4 \
--nnodes=1 \
vllm/entrypoints/llm.py \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4
Multi-Node Launch
torchrun --nproc_per_node=8 \
--nnodes=2 \
--node_rank=0 \
--master_addr=10.0.0.1 \
--master_port=29500 \
vllm/entrypoints/llm.py \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--pipeline-parallel-size 4
Scaling Strategies
Scaling Guidelines
| Model Size | GPU Memory | Recommended Configuration |
|---|---|---|
| 7B | 24GB | TP=1, DP=single node |
| 13B | 48GB | TP=2, DP=single node |
| 70B | 320GB+ | TP=4 or 8, PP=2+ |
| 405B | 800GB+ | TP=8, PP=8, multi-node |
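A back-of-the-envelope check behind these recommendations, counting FP16 weights at 2 bytes per parameter with ~30% headroom for KV cache and activations (the headroom factor is an assumption, not a vLLM constant):
import math

def min_gpus(num_params: float, gpu_mem_gb: float, bytes_per_param: int = 2,
             headroom: float = 1.3) -> int:
    """Rough lower bound on GPU count to hold the weights plus headroom."""
    total_gb = num_params * bytes_per_param / 1e9 * headroom
    return math.ceil(total_gb / gpu_mem_gb)

print(min_gpus(70e9, 80))   # 70B on 80GB GPUs -> 3, so TP=4 is the practical choice
print(min_gpus(405e9, 80))  # 405B -> 14, hence TP=8 x PP=2 and multi-node layouts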
Sources: docs/serving/parallelism_scaling.md
Disaggregated Prefill Scaling
For disaggregated prefill/decode, consider:
| Workload Pattern | Prefill Resources | Decode Resources |
|---|---|---|
| Short prompts, many requests | Scale prefill | Scale decode |
| Long prompts, few requests | Scale prefill with TP | Scale decode |
| Mixed workload | Balance both | Balance both |
Scaling Best Practices
- Start with tensor parallelism for intra-node scaling
- Add pipeline parallelism for multi-node deployments
- Use data parallelism to increase throughput on same-stage workloads
- Enable disaggregation when prefill and decode have different resource needs
Sources: docs/serving/parallelism_scaling.md
Architecture Diagram
graph TB
subgraph "vLLM Distributed Architecture"
subgraph "Process 0"
P0_M[Model Shard 0]
P0_S[Scheduler]
P0_C[Cache Engine]
end
subgraph "Process 1"
P1_M[Model Shard 1]
P1_S[Scheduler]
P1_C[Cache Engine]
end
subgraph "Process N"
PN_M[Model Shard N]
PN_S[Scheduler]
PN_C[Cache Engine]
end
NCCL[NCCL AllReduce]
P0_S <-->|NCCL| P1_S
P1_S <-->|NCCL| PN_S
P0_S <-->|NCCL| PN_S
P0_M <-->|Forward Pass| P0_S
P1_M <-->|Forward Pass| P1_S
PN_M <-->|Forward Pass| PN_S
end
subgraph "External Services"
KV[KV Transfer Service]
Redis[(Redis/Storage)]
end
P0_C <-->|Save KV| Redis
PN_C <-->|Load KV| Redis
Redis <--> KV
Sources: [vllm/distributed/parallel_state.py:45-78](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py)
Model Architecture Support
Related topics: Model Executor and Worker Architecture, Quantization Support
Overview
vLLM provides comprehensive support for various LLM architectures, enabling users to serve, fine-tune, and run inference with a wide range of transformer-based models. The system is designed to be architecture-agnostic while providing optimized implementations for popular model families.
Core Model Loading Architecture
Python API (Offline Inference)
The primary Python interface for running offline inference is the LLM class, which handles model loading and inference without requiring a separate inference server.
# Basic usage example
from vllm import LLM
llm = LLM(model="Qwen/Qwen3-0.6B")
output = llm.generate("Hello, world!")
Sources: examples/basic/offline_inference/README.md
CLI-based Model Serving
The vLLM CLI provides a serve subcommand for HTTP-based model serving:
vllm serve Qwen/Qwen3-0.6B
If no model is specified, the CLI defaults to Qwen/Qwen3-0.6B.
Sources: vllm/entrypoints/cli/serve.py
Supported Model Categories
Language Models
vLLM supports a broad range of causal language models including:
- Decoder-only transformers: Standard autoregressive models
- Mixture of Experts (MoE): Sparse architectures like Mixtral
- Multimodal models: Models that process multiple input types
Quantization Support
vLLM supports quantized models through GGUF format, enabling deployment of compressed models with reduced memory footprint.
Example loading a quantized model directly from HuggingFace:
--model unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B
Sources: examples/basic/offline_inference/README.md
Model Configuration
Generation Configuration
The --generation-config argument specifies where the generation config is loaded from:
| Value | Source | Description |
|---|---|---|
auto | Model path | Loads from model's configuration directory |
<folder_path> | Local folder | Loads from specified directory |
| Not provided | vLLM defaults | Uses built-in default parameters |
If max_new_tokens is specified in generation config, it sets a server-wide limit on output tokens for all requests.
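For example, loading the generation config shipped with the model itself, per the auto row in the table:
vllm serve Qwen/Qwen3-0.6B --generation-config auto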
Sources: examples/basic/offline_inference/README.md
Engine Arguments
Model configuration is controlled through AsyncEngineArgs, which processes CLI arguments and creates the model configuration:
engine_args = AsyncEngineArgs.from_cli_args(args)
model_config = engine_args.create_model_config()
Sources: vllm/entrypoints/cli/launch.py
Structured Outputs
vLLM supports structured output generation for models that support it, including reasoning models:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--reasoning-parser deepseek_r1
This enables compliance with output format constraints defined by the model.
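As an illustration of consuming structured outputs through the OpenAI-compatible API, a sketch using the guided_json parameter that vLLM accepts in the request's extra_body; the schema and prompt are illustrative:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

# vLLM's OpenAI-compatible server accepts guided decoding parameters via
# extra_body; the output is constrained to match the JSON schema.
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Answer in JSON: what is vLLM?"}],
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)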
Sources: examples/features/structured_outputs/README.md
CPU Offload Support
For models that exceed available GPU memory, vLLM provides CPU offload capabilities:
--cpu-offload-gb 10
This creates virtual GPU memory by offloading portions of the model to CPU RAM. For example, with a 24GB GPU and 10GB offload, you can effectively load a 13B model requiring ~26GB.
Note: This requires fast CPU-GPU interconnect for acceptable performance.
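A complete command for the scenario above, as a sketch (the model name is illustrative; any checkpoint around 26GB fits the example):
# ~26GB of weights with a 24GB GPU: spill 10GB of weights into CPU RAM
vllm serve meta-llama/Llama-2-13b-hf --cpu-offload-gb 10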
Sources: examples/basic/offline_inference/README.md
Model Registry Architecture
The vLLM CLI uses a modular command structure where model support is registered through subcommand modules:
for cmd_module in CMD_MODULES:
new_cmds = cmd_module.cmd_init()
for cmd in new_cmds:
cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)
Each registered command includes validation logic to ensure model configurations are valid before execution.
Sources: vllm/entrypoints/cli/main.py
Serving Modes
Standard API Server
The default serving mode starts an HTTP server with OpenAI-compatible endpoints:
vllm serve <model_name>
Headless Mode
For distributed deployments, headless mode skips API server initialization:
if args.headless:
if args.api_server_count is not None and args.api_server_count > 0:
raise ValueError(
f"--api-server-count={args.api_server_count} cannot be "
"used with --headless (no API servers are started in "
"headless mode)."
)
args.api_server_count = 0
Sources: vllm/entrypoints/cli/serve.py
gRPC Server Mode
For high-performance scenarios, vLLM supports gRPC-based serving:
if getattr(args, "grpc", False):
from vllm.entrypoints.grpc_server import serve_grpc
uvloop.run(serve_grpc(args))
Sources: vllm/entrypoints/cli/serve.py
Data Parallel Modes
vLLM supports distributed model serving through multiple load balancing strategies:
| Mode | Flag | Description |
|---|---|---|
| External LB | --data-parallel-external-lb or --data-parallel-rank | External load balancer manages request distribution |
| Hybrid LB | --data-parallel-hybrid-lb or --data-parallel-start-rank | Hybrid approach with internal and external coordination |
The system auto-detects load balancing mode to set appropriate default values for api_server_count.
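A sketch of the external-LB mode using the flags from the table; --data-parallel-size (the replica count) is assumed here as the companion flag, so verify it against your vLLM version:
# Replica 0 of 2; an external load balancer spreads requests across replicas
vllm serve Qwen/Qwen3-0.6B \
    --data-parallel-size 2 \
    --data-parallel-rank 0 \
    --data-parallel-external-lb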
Sources: vllm/entrypoints/cli/serve.py
Installation and Quickstart
For users getting started with model architecture support:
- Install vLLM following the installation guide
- Review the quickstart documentation
- Check the list of supported models
Sources: README.md
Architecture Flow Diagram
graph TD
A[User Request] --> B{CLI or Python API?}
B -->|CLI| C[vllm serve command]
B -->|Python| D[LLM class instantiation]
C --> E[CLISubcommand processing]
D --> F[AsyncEngineArgs configuration]
E --> G[Model Registry Lookup]
F --> H[Model Config Creation]
G --> I[Load Model Architecture]
H --> I
I --> J{Quantization?}
J -->|GGUF| K[Load Quantized Weights]
J -->|BF16/FP8| L[Load Standard Weights]
K --> M[PagedAttention Engine]
L --> M
M --> N[Inference Execution]
Key Implementation Files
| Component | File Path | Purpose |
|---|---|---|
| CLI Entry | vllm/entrypoints/cli/main.py | Main CLI dispatcher and argument parsing |
| Serve Command | vllm/entrypoints/cli/serve.py | HTTP server startup and configuration |
| Launch Layer | vllm/entrypoints/cli/launch.py | FastAPI-based serving layer |
| Offline Inference | examples/basic/offline_inference/ | Python API usage examples |
| Structured Outputs | examples/features/structured_outputs/ | Advanced output formatting |
Sources: [examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.
1. Installation risk: [Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling
- Severity: medium
- Finding: Installation risk is backed by a source signal: [Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/42182
2. Installation risk: [Bug]: vLLM v1 with prefix caching: first request differs from subsequent identical requests at temperature=0
- Severity: medium
- Finding: Installation risk is backed by a source signal: [Bug]: vLLM v1 with prefix caching: first request differs from subsequent identical requests at temperature=0. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/40896
3. Installation risk: [Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_l…
- Severity: medium
- Finding: Installation risk is backed by a source signal: [Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_l…. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/42207
4. Installation risk: v0.18.1
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.18.1. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/releases/tag/v0.18.1
5. Capability assumption: [Feature]: Qwen3.5-Moe LoRA Support (experts)
- Severity: medium
- Finding: Capability assumption is backed by a source signal: [Feature]: Qwen3.5-Moe LoRA Support (experts). Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/40005
6. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | github_repo:599547518 | https://github.com/vllm-project/vllm | README/documentation is current enough for a first validation pass.
7. Project risk: v0.20.2
- Severity: medium
- Finding: Project risk is backed by a source signal: v0.20.2. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/releases/tag/v0.20.2
8. Maintenance risk: Maintainer activity is unknown
- Severity: medium
- Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | github_repo:599547518 | https://github.com/vllm-project/vllm | last_activity_observed missing
9. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: downstream_validation.risk_items | github_repo:599547518 | https://github.com/vllm-project/vllm | no_demo; severity=medium
10. Security or permission risk: No sandbox install has been executed yet; downstream must verify before user use.
- Severity: medium
- Finding: No sandbox install has been executed yet; downstream must verify before user use.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: risks.safety_notes | github_repo:599547518 | https://github.com/vllm-project/vllm | No sandbox install has been executed yet; downstream must verify before user use.
11. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: risks.scoring_risks | github_repo:599547518 | https://github.com/vllm-project/vllm | no_demo; severity=medium
12. Security or permission risk: [Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B / A100
- Severity: medium
- Finding: Security or permission risk is backed by a source signal: [Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B / A100. Treat it as a review item until the current version is checked.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/41758
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using vllm with real data or production workflows.
- [[Bug]: vLLM v1 with prefix caching: first request differs from subsequen](https://github.com/vllm-project/vllm/issues/40896) - github / github_issue
- [[AMD][CI Failure][Tracker] Static dashboard tracker for current CI failu](https://github.com/vllm-project/vllm/issues/40554) - github / github_issue
- [[Usage]: How to proactively clear CPU-resident memory left behind by unl](https://github.com/vllm-project/vllm/issues/42207) - github / github_issue
- [[Feature]: Qwen3.5-Moe LoRA Support (experts)](https://github.com/vllm-project/vllm/issues/40005) - github / github_issue
- [[Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B /](https://github.com/vllm-project/vllm/issues/41758) - github / github_issue
- [[Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async sch](https://github.com/vllm-project/vllm/issues/42182) - github / github_issue
- v0.20.2 - github / github_release
- v0.20.1 - github / github_release
- v0.20.0 - github / github_release
- v0.19.1 - github / github_release
- v0.19.0 - github / github_release
- v0.18.1 - github / github_release
Source: Project Pack community evidence and pitfall evidence