Doramagic Project Pack · Human Manual
vLLM Overview
Related topics: Getting Started, Core Engine Architecture
What is vLLM?
vLLM is a fast and easy-to-use library for LLM (Large Language Model) inference and serving. It provides high-throughput, memory-efficient inference with an OpenAI-compatible API server, making it suitable for both research and production environments.
Sources: README.md
Key Features
Offline Inference
The LLM class provides the primary Python interface for offline inference—interacting with a model without using a separate model inference server. This enables direct model interaction for batch processing, experimentation, and development workflows.
Sources: examples/basic/offline_inference/README.md
OpenAI-Compatible API Server
vLLM serves LLM completions via HTTP through an OpenAI-compatible API. The server can be started with a simple command:
vllm serve Qwen/Qwen2.5-3B-Instruct
Sources: vllm/entrypoints/cli/serve.py
Structured Outputs
vLLM supports constrained decoding for structured outputs including JSON schema, regex patterns, and structural tags. This is essential for building reliable applications that require predictable output formats.
Sources: examples/features/structured_outputs/README.md
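As an illustration, a JSON schema constraint can be passed through the OpenAI-compatible server using vLLM's guided_json extension field. This is a minimal sketch: the schema, model name, and local server address are example values, and a server is assumed to already be running.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[{"role": "user", "content": "Invent a person and answer in JSON."}],
    extra_body={"guided_json": schema},  # vLLM extension: constrain output to the schema
)
print(resp.choices[0].message.content)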
Disaggregated Prefill and Decode
vLLM supports a disaggregated prefill architecture in which the prefill stage (processing the prompt and producing KV cache) and the decode stage (consuming KV cache to generate tokens) can run on separate instances. This enables:
- Independent scaling of prefill and decode workloads
- Improved resource utilization
- Better support for multi-turn conversations
Sources: examples/disaggregated/example_connector/README.md
KV Cache Transfer
For disaggregated serving, vLLM supports KV cache transfer between prefill and decode workers. The architecture includes:
| Component | Role | Description |
|---|---|---|
| ECExampleConnector | Cache Storage | Stores encoder cache on local disk |
| EC Producer | Precompute | Pre-computes encoder cache |
| EC Consumer | Retrieve | Retrieves cached encoder data |
Sources: examples/disaggregated/disaggregated_encoder/README.md
KV Load Failure Recovery
vLLM implements robust recovery mechanisms for KV load failures in both synchronous and asynchronous loading modes. The system:
- Identifies invalid KV blocks
- Reschedules affected requests
- Ensures consistent output through recovery logic
Sources: examples/disaggregated/kv_load_failure_recovery_offline/README.md
Long Text Embedding with Chunked Processing
vLLM supports embedding models with chunked processing for texts exceeding the model's maximum context length:
{
"pooling_type": "auto",
"use_activation": true,
"enable_chunked_processing": true,
"max_embed_len": 3072000
}
This enables processing of extremely long documents (up to 3M+ tokens) including academic papers, legal documents, and code repositories.
Sources: examples/pooling/embed/openai_embedding_long_text/README.md
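As a client-side sketch (assuming a server started with the pooler configuration above and the jinaai/jina-embeddings-v3 model from the example), long inputs go through the standard embeddings endpoint:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# A document far longer than the model's native context window
long_text = "Chunked processing lets vLLM embed very long documents. " * 100_000
resp = client.embeddings.create(model="jinaai/jina-embeddings-v3", input=long_text)
print(len(resp.data[0].embedding))  # one pooled vector for the entire document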
Architecture Overview
CLI Entry Point
The vLLM CLI provides a flexible command-line interface built with FlexibleArgumentParser:
graph TD
A[vllm CLI] --> B[main.py Entry Point]
B --> C[Parse Arguments]
C --> D[Load CMD_MODULES]
D --> E[Create Subparsers]
E --> F[Execute Subcommand]
F --> G[serve]
F --> H[launch]
F --> I[other modules]
The CLI supports:
- -v, --version flag for version information
- Subcommand system with plugin architecture
- Command validation before execution
Sources: vllm/entrypoints/cli/main.py
Serve Subcommand
The serve subcommand handles online inference server startup:
graph TD
A[serve command] --> B{Model specified?}
B -->|Yes| C[Use CLI model]
B -->|No| D[Default: Qwen/Qwen3-0.6B]
C --> E{GRPC enabled?}
D --> E
E -->|Yes| F[Start gRPC Server]
E -->|No| G{Headless mode?}
G -->|Yes| H[Set api_server_count=0]
G -->|No| I[Check LB mode]
I --> J[Start FastAPI Server]
H --> J
The serve command supports multiple deployment modes:
- Standard mode: Full API server with GPU inference
- Headless mode: No API servers, only engine processing
- gRPC mode: Alternative RPC interface
- Load-balanced mode: Data-parallel external/hybrid load balancing
Sources: vllm/entrypoints/cli/serve.py
Launch Subcommand
The launch subcommand provides a modular component launch system:
graph LR
A[launch command] --> B[LaunchSubcommand]
B --> C[launch_component subparser]
C --> D[LaunchSubcommandBase subclasses]
D --> E[run_launch_fastapi]
D --> F[other components]
E --> G[Socket binding]
E --> H[Build API Server]
E --> I[EngineArgs configuration]
The launch system starts render servers that perform preprocessing only: they run no inference, load no quantized kernels, and never allocate KV cache.
Sources: vllm/entrypoints/cli/launch.py
Observability
OpenTelemetry Integration
vLLM includes built-in OpenTelemetry support for distributed tracing:
opentelemetry-instrument vllm serve facebook/opt-125m
Core packages are bundled with vLLM:
- opentelemetry-sdk
- opentelemetry-api
- opentelemetry-exporter-otlp
- opentelemetry-semantic-conventions-ai
Sources: examples/observability/opentelemetry/README.md
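The trace destination is configured through the standard OpenTelemetry environment variables; a minimal sketch, where the endpoint is an example value for a local OTLP/gRPC collector:
export OTEL_SERVICE_NAME=vllm-server
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
opentelemetry-instrument vllm serve facebook/opt-125m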
Prometheus Metrics and Dashboards
vLLM exports Prometheus-compatible metrics and supports integration with:
| Platform | Dashboard Format | Import Method |
|---|---|---|
| Grafana | JSON | UI or API |
| Perses | YAML | CLI |
Sources: examples/observability/dashboards/README.md
Performance Profiling
The nsys_profile_tools enable GPU kernel-level profiling:
nsys profile -t cuda -o run1 -f true --trace-fork-before-exec=true \
--cuda-graph-trace=node --delay <DELAY> --duration <DURATION> \
vllm serve openai/gpt-oss-120b ...
The gputrc2graph.py script generates kernel-level summaries and visualizations from .nsys-rep files.
Sources: tools/profiler/nsys_profile_tools/README.md
Supported Features Summary
| Feature | Description | Configuration |
|---|---|---|
| Offline Inference | Batch processing without server | LLM class |
| OpenAI API | HTTP API compatibility | vllm serve |
| Structured Outputs | JSON/regex/structural constraints | --reasoning-parser |
| Disaggregated Serving | Split prefill/decode | --ec-transfer-config |
| KV Recovery | Failure resilience | Custom connectors |
| Long Text Embedding | Chunked processing | --pooler-config |
| Observability | Tracing and metrics | OpenTelemetry |
| Quantization | GGUF support | repo_id:quant_type |
Usage Modes
Offline Inference Mode
from vllm import LLM
llm = LLM("Qwen/Qwen2.5-3B-Instruct")
outputs = llm.generate("Hello, world!")
print(outputs[0].outputs[0].text)
API Server Mode
vllm serve Qwen/Qwen2.5-3B-Instruct --tensor-parallel-size 2
Disaggregated Mode
# Prefill instance
vllm serve --prefill-only --ec-transfer-config '{...}'
# Decode instance
vllm serve --decode-only --ec-transfer-config '{...}'
Quick Reference
| Command | Purpose |
|---|---|
| vllm serve <model> | Start API server |
| vllm launch <component> | Launch specific component |
| opentelemetry-instrument vllm serve | Enable tracing |
Sources: [README.md](https://github.com/vllm-project/vllm/blob/main/README.md)
Getting Started
Related topics: vLLM Overview
vLLM is a fast and easy-to-use library for Large Language Model (LLM) inference and serving. It provides both an offline inference interface via the LLM class and an online serving layer with an OpenAI-compatible API server. Sources: README.md:1-30
This guide covers the essential steps to get started with vLLM, from installation through basic inference and serving.
Installation
vLLM can be installed via pip or built from source. For detailed installation instructions, refer to the official documentation.
Quick Installation
pip install vllm
Building from Source
For custom builds or development:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
GPU Requirements
vLLM requires CUDA-compatible GPUs. The library supports various CUDA versions. Verify your environment has the necessary GPU drivers and CUDA toolkit installed. Sources: README.md:1-15
Core Concepts
Before diving into usage, understand these fundamental concepts:
| Concept | Description |
|---|---|
| LLM Class | Primary Python interface for offline inference |
| Engine Args | Configuration parameters for the inference engine |
| Sampling Params | Controls generation behavior (temperature, max_tokens, etc.) |
| OpenAI API Server | HTTP server providing OpenAI-compatible REST endpoints |
Architecture Overview
graph TD
A[User Code] --> B[LLM Class / API Server]
B --> C[AsyncLLMEngine]
C --> D[Worker Pool]
D --> E[GPU Devices]
E --> F[PagedAttention KV Cache]
G[HTTP Clients] --> H[OpenAI API Server]
H --> B
Offline Inference
Offline inference involves running model inference directly in Python without a separate server. This is ideal for batch processing, testing, or embedding vLLM into applications. Sources: examples/basic/offline_inference/README.md:1-25
Basic Usage
from vllm import LLM, SamplingParams
# Initialize the model
llm = LLM("Qwen/Qwen2.5-3B-Instruct")
# Define sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=256
)
# Run inference
outputs = llm.generate(["Hello, how are you?", "What is vLLM?"], sampling_params)
for output in outputs:
print(output.outputs[0].text)
Supported Models
vLLM supports a wide range of models including autoregressive transformers, mixture-of-experts models, and quantized models. For the complete list, see the supported models documentation. Sources: README.md:20-25
Serving with the CLI
vLLM provides a command-line interface for serving models via an OpenAI-compatible API. Sources: vllm/entrypoints/cli/serve.py:1-30
Starting the Server
vllm serve Qwen/Qwen3-0.6B
Command-Line Options
The serve command supports extensive configuration through CLI arguments:
vllm serve <model> [options]
Use --help=all to show all available flags, or --help=<ConfigGroup> to explore options by section (e.g., --help=ModelConfig, --help=Frontend). Sources: vllm/entrypoints/cli/serve.py:5-20
Key Server Options
| Option | Description | Default |
|---|---|---|
| --model | Model name or path | Required |
| --gpu-memory-utilization | Fraction of GPU memory to use | 0.9 |
| --max-model-len | Maximum sequence length | Model default |
| --tensor-parallel-size | Number of GPUs for parallelism | 1 |
| --port | Server port | 8000 |
Headless Mode
For distributed setups where API servers are managed externally:
vllm serve <model> --headless
In headless mode, no API servers are started, and --api-server-count cannot be used. Sources: vllm/entrypoints/cli/serve.py:30-45
API Usage
Once the server is running, you can interact with it using the OpenAI-compatible API.
Completions API
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-3B-Instruct",
"prompt": "The capital of France is",
"max_tokens": 50
}'
Chat API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-3B-Instruct",
"messages": [
{"role": "user", "content": "What is machine learning?"}
]
}'
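The same endpoint can also be called with the official openai Python client (the API key is a placeholder, since vLLM does not require one by default):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
print(resp.choices[0].message.content)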
Offline Inference Examples
The repository includes practical examples demonstrating various vLLM capabilities. Sources: examples/basic/offline_inference/README.md:25-60
Running Examples
# Basic example
python examples/basic/offline_inference/basic.py
# Chat example with sampling parameters
python examples/basic/offline_inference/chat.py --max_tokens 100 --temperature 0.8
# Generate example
python examples/basic/offline_inference/generate.py --generation-config auto
Generation Config
The --generation-config argument specifies where the generation config loads from:
- 'auto' - Load from the model path
- <folder_path> - Load from the specified directory
- Not provided - Use vLLM defaults
python examples/basic/offline_inference/generate.py --generation-config auto
Note: If max_new_tokens is specified in generation config, it sets a server-wide limit on output tokens for all requests. Sources: examples/basic/offline_inference/README.md:55-70
Advanced Features
Structured Outputs
vLLM supports constrained decoding for structured outputs including JSON schemas, regex patterns, and grammar-based constraints. Sources: examples/features/structured_outputs/README.md:1-40
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--reasoning-parser deepseek_r1
# Run structured outputs example
uv run structured_outputs_offline.py --constraint json_mode regex
Long Text Embedding
For embedding models, vLLM supports chunked processing to handle texts exceeding the model's maximum context length:
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./service.sh
Configuration example with chunked processing:
{
"pooling_type": "auto",
"use_activation": true,
"enable_chunked_processing": true,
"max_embed_len": 3072000
}
GGUF Quantized Models
vLLM supports GGUF-quantized models loaded directly from HuggingFace:
--model unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B
CPU Offload
For systems with limited GPU memory, CPU offload allows loading larger models:
--cpu-offload-gb 10
This creates a virtual 34GB GPU when you have a 24GB GPU, enabling 13B model loading with BF16 weights. Sources: examples/basic/offline_inference/README.md:75-85
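The same option is available programmatically through the cpu_offload_gb engine argument; a minimal sketch, with an example 13B model name:
from vllm import LLM

# Offload 10 GB of weights to CPU RAM, trading PCIe transfers for capacity
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", cpu_offload_gb=10)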
Configuration Workflow
graph LR
A[Define Engine Args] --> B[Create Model Config]
B --> C[Initialize Engine]
C --> D[Process Requests]
D --> E[Return Outputs]
F[CLI Arguments] --> A
G[Python API] --> A
Programmatic Configuration
from vllm import LLM

# LLM forwards engine keyword arguments to the underlying engine configuration
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_memory_utilization=0.85,
    tensor_parallel_size=2,
    max_model_len=4096,
)
Next Steps
| Resource | Description |
|---|---|
| Documentation | Comprehensive guides and API reference |
| Supported Models | Complete list of supported architectures |
| Examples | Usage examples for various features |
| Paper | Technical details behind vLLM's design |
Troubleshooting
Common Issues
- CUDA Out of Memory: Reduce gpu_memory_utilization or use smaller batch sizes
- Model Not Found: Ensure HuggingFace credentials are configured for gated models
- Import Errors: Verify all dependencies are installed with pip install vllm
Getting Help
- Technical Questions: Use GitHub Issues
- Community Discussion: vLLM Forum
- Development Coordination: Developer Slack
Sources: README.md:60-75
Core Engine Architecture
Related topics: vLLM Overview, Model Executor and Worker Architecture, Scheduling and Request Processing
Overview
The vLLM Core Engine Architecture is the central orchestration layer responsible for managing LLM inference workflows, request scheduling, and model execution. vLLM supports two engine versions: the legacy V0 engine and the current V1 engine (introduced in v0.6.0), both designed to provide high-throughput LLM serving through efficient request batching and GPU memory management.
The engine architecture serves as the foundation for both offline inference via the LLM class and online serving via the OpenAI-compatible API server. Sources: vllm/entrypoints/llm.py:1-50
Architecture Components
V0 Engine (Legacy)
The V0 engine is the original implementation found in vllm/engine/. It consists of:
| Component | File | Purpose |
|---|---|---|
| LLMEngine | vllm/engine/llm_engine.py | Synchronous inference engine with blocking operations |
| AsyncLLMEngine | vllm/engine/async_llm_engine.py | Async wrapper enabling concurrent request handling |
The V0 engine uses an event-loop-based async architecture where AsyncLLMEngine wraps LLMEngine to provide non-blocking request processing.
V1 Engine (Current)
The V1 engine (vllm/v1/engine/) is the current production-ready implementation featuring a modular design:
| Component | File | Purpose |
|---|---|---|
| Core | vllm/v1/engine/core.py | Low-level engine core managing model execution |
| AsyncLLM | vllm/v1/engine/async_llm.py | Main async interface for inference |
| LLMEngine | vllm/v1/engine/llm_engine.py | High-level engine orchestrator |
The V1 engine architecture separates concerns into distinct layers: AsyncLLM provides the public async interface, LLMEngine handles request orchestration, and Core manages low-level GPU operations.
Entry Points
vLLM provides multiple entry points for interacting with the engine:
graph TD
A[vllm serve] --> B[ServeSubcommand]
A --> C[LaunchSubcommand]
B --> D[serve_grpc / API Server]
C --> E[run_launch_fastapi]
F[vllm run-batch] --> G[BatchRunner]
H[LLM Class] --> I[AsyncLLM Engine]
CLI Entry Point
The CLI entry point in vllm/entrypoints/cli/main.py provides command-line access to vLLM functionality:
# Simplified CLI structure
parser = FlexibleArgumentParser(description="vLLM CLI")
subparsers = parser.add_subparsers(required=False, dest="subparser")
cmds = {}
for cmd_module in CMD_MODULES:
new_cmds = cmd_module.cmd_init()
for cmd in new_cmds:
cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)
Sources: vllm/entrypoints/cli/main.py:1-40
Serve Subcommand
The serve subcommand initializes the HTTP API server or gRPC service:
class ServeSubcommand(CLISubcommand):
name = "serve"
@staticmethod
def cmd(args: argparse.Namespace) -> None:
if hasattr(args, "model_tag") and args.model_tag is not None:
args.model = args.model_tag
Sources: vllm/entrypoints/cli/serve.py:1-30
Launch Subcommand
The launch subcommand provides component-level launching capabilities:
def cmd_init() -> list[CLISubcommand]:
return [LaunchSubcommand()]
async def run_launch_fastapi(args: argparse.Namespace) -> None:
listen_address, sock = setup_server(args)
engine_args = AsyncEngineArgs.from_cli_args(args)
Sources: vllm/entrypoints/cli/launch.py:1-60
Python API Entry Point
The LLM class in vllm/entrypoints/llm.py provides the primary Python interface for offline inference:
class LLM:
"""
An LLM for offline inference.
"""
def __init__(self, model: str, ...):
...
Sources: vllm/entrypoints/llm.py:1-100
Engine Initialization Flow
The following diagram illustrates the initialization flow from CLI to engine:
sequenceDiagram
participant CLI as vllm serve
participant Parser as ArgumentParser
participant EngineArgs as AsyncEngineArgs
participant Engine as AsyncLLM / Core
participant Model as ModelConfig
CLI->>Parser: Parse CLI arguments
Parser->>EngineArgs: from_cli_args()
EngineArgs->>Model: create_model_config()
Model-->>EngineArgs: Config validated
EngineArgs->>Engine: Initialize engine
Engine->>Engine: Load model weights
Core Engine Components
AsyncLLM (V1)
The AsyncLLM class is the primary async interface for V1 engine:
class AsyncLLM:
"""
Async implementation of LLM engine.
"""
async def add_request(self, request_id: str, prompt: str, ...):
...
async def step(self) -> List[RequestOutput]:
...
Sources: vllm/v1/engine/async_llm.py:1-50
Core (V1)
The Core class manages low-level model execution:
class Core:
"""
Core engine for V1.
"""
def __init__(self, engine_config: VllmConfig, ...):
...
def get_config(self) -> VllmConfig:
...
Sources: vllm/v1/engine/core.py:1-80
LLMEngine (V1)
The V1 LLMEngine orchestrates request processing:
class LLMEngine:
"""
V1 LLM Engine implementation.
"""
def __init__(self, vllm_config: VllmConfig, ...):
...
Sources: vllm/v1/engine/llm_engine.py:1-50
Configuration System
AsyncEngineArgs
Configuration flows from CLI/API to engine via AsyncEngineArgs:
| Parameter | Type | Description |
|---|---|---|
| model | str | Model name or path |
| tensor_parallel_size | int | Number of GPUs for tensor parallelism |
| gpu_memory_utilization | float | Fraction of GPU memory to use |
| max_model_len | Optional[int] | Maximum sequence length |
| dtype | str | Model data type (float16, bfloat16, etc.) |
| quantization | Optional[str] | Quantization method (awq, gptq, etc.) |
The engine validates configuration through create_model_config():
def create_model_config(self) -> ModelConfig:
"""Create model config from engine args."""
return ModelConfig(...)
Sources: vllm/entrypoints/cli/launch.py:40-50
Headless Mode
The V1 engine supports headless operation where no API servers are started:
if args.headless:
if args.api_server_count is not None and args.api_server_count > 0:
raise ValueError(
f"--api-server-count={args.api_server_count} cannot be "
"used with --headless (no API servers are started in "
"headless mode)."
)
args.api_server_count = 0
Sources: vllm/entrypoints/cli/serve.py:25-35
gRPC Support
vLLM supports gRPC for disaggregated prefill scenarios:
if getattr(args, "grpc", False):
from vllm.entrypoints.grpc_server import serve_grpc
uvloop.run(serve_grpc(args))
return
Sources: vllm/entrypoints/cli/serve.py:18-22
Request Processing Pipeline
graph LR
A[Request] --> B[Parser]
B --> C[AsyncLLM.add_request]
C --> D[Scheduler]
D --> E[Model Executor]
E --> F[Output]
D --> G[KV Cache]
Data Parallel Modes
The engine supports multiple data parallel configurations:
| Mode | Flag | Description |
|---|---|---|
| External LB | --data-parallel-external-lb | External load balancer coordinates workers |
| Hybrid LB | --data-parallel-hybrid-lb | Hybrid approach with custom rank assignment |
| Rank | --data-parallel-rank | Specify worker rank for distributed setup |
# Detection logic
is_external_lb = getattr(args, "data_parallel_external_lb", False)
is_hybrid_lb = getattr(args, "data_parallel_hybrid_lb", False)
Sources: vllm/entrypoints/cli/serve.py:35-40
Model Config Validation
The engine performs validation during model config creation:
model_config = engine_args.create_model_config()
# Clear quantization for render servers (preprocessing only)
if render_mode:
model_config.quantization = None
Sources: vllm/entrypoints/cli/launch.py:45-55
Key Design Patterns
Async/Await Architecture
The V1 engine uses native async/await for concurrent request handling:
async def step(self) -> List[RequestOutput]:
"""Execute one iteration of the engine."""
...
Command Pattern
CLI commands follow the Command pattern with CLISubcommand base class:
class ServeSubcommand(CLISubcommand):
name = "serve"
@staticmethod
def cmd(args: argparse.Namespace) -> None:
...
Subparser Registration
Commands register themselves via cmd_init() factory functions:
def cmd_init() -> list[CLISubcommand]:
return [LaunchSubcommand(), ServeSubcommand()]
Summary
The vLLM Core Engine Architecture provides a flexible, multi-layered design supporting both V0 (legacy) and V1 (current) engine implementations. Key characteristics include:
- Dual Engine Support: V0 for backward compatibility, V1 for production workloads
- Multiple Entry Points: CLI, Python API, HTTP server, gRPC
- Async-First Design: Native async/await for concurrent request processing
- Modular Components: Clear separation between CLI, engine core, and model execution
- Flexible Configuration: Comprehensive argument system via AsyncEngineArgs
- Data Parallel Support: Multiple modes for distributed serving scenarios
The architecture prioritizes performance through efficient GPU memory management, request batching, and pipelined execution while maintaining a clean, extensible design.
Sources: vllm/entrypoints/cli/main.py:1-40
Model Executor and Worker Architecture
Related topics: Core Engine Architecture, Scheduling and Request Processing, Model Architecture Support
Overview
The vLLM Model Executor and Worker Architecture forms the core execution layer responsible for model loading, inference, and batch-level processing. This architecture separates concerns between high-level request orchestration and low-level GPU-based model execution, enabling efficient parallel processing of LLM inference requests.
The architecture consists of two primary components:
| Component | Responsibility |
|---|---|
| Model Executor | Manages model loading, weight initialization, and model-specific execution logic |
| Worker | Handles GPU-side computation, memory management, and kernel execution |
Architecture Diagram
graph TD
A[AsyncEngineArgs] --> B[Model Executor]
B --> C[Model Loader]
C --> D[Model Weights]
B --> E[GPU Worker]
E --> F[GPU Model Runner]
F --> G[CUDA Kernels]
F --> H[Attention Layers]
F --> I[MLP Layers]
J[Request Batch] --> E
E --> K[Generated Tokens]
style B fill:#e1f5fe
style E fill:#fff3e0
style F fill:#f3e5f5
Model Executor Layer
Purpose and Scope
The Model Executor layer handles all aspects of model lifecycle management, including initialization, weight loading, and providing the interface for model execution. This layer operates independently of the worker layer, allowing for flexibility in model configuration.
Model Loading
The model loading subsystem supports multiple backends and quantization schemes:
class ModelLoader:
def load_model(self, model_config, parallel_config, device_config):
# Load model weights based on configuration
pass
Supported Loading Modes:
| Mode | Description |
|---|---|
| Auto | Automatically detect optimal loading strategy |
| Naive | Standard PyTorch loading |
| Sharded | Load model shards across multiple devices |
| Quantized | Load quantized weights (AWQ, GPTQ, GGUF) |
Sources: vllm/model_executor/model_loader/__init__.py
Model Registry
vLLM maintains a registry of supported model architectures:
# Model architecture registration
@register_model("LlamaForCausalLM")
class LlamaForCausalLM(nn.Module):
...
The registry maps model names to their corresponding model classes, enabling automatic model instantiation based on HuggingFace model architecture detection.
Sources: vllm/model_executor/models/__init__.py
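A self-contained sketch of such a registry (illustrative only; vLLM's actual registry also handles lazy imports and many more architectures):
# Illustrative registry sketch: map architecture names (as found in a
# HuggingFace config's "architectures" field) to model classes.
_MODEL_REGISTRY: dict[str, type] = {}

def register_model(name: str):
    def wrap(cls: type) -> type:
        _MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register_model("LlamaForCausalLM")
class LlamaForCausalLM:
    pass

def resolve(architecture: str) -> type:
    # Look up the class registered for a detected architecture
    return _MODEL_REGISTRY[architecture]

print(resolve("LlamaForCausalLM"))  # <class '__main__.LlamaForCausalLM'>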
Worker Architecture
Base Worker
The worker base class defines the interface for all worker implementations:
class WorkerBase:
def __init__(self, vllm_config):
self.vllm_config = vllm_config
self.model = None
self.device = None
def execute_model(self, batch):
raise NotImplementedError
Sources: vllm/v1/worker/worker_base.py
GPU Worker
The GPU Worker is the primary worker implementation for GPU-based inference:
graph LR
A[Input Batch] --> B[Model Input Preparation]
B --> C[Forward Pass]
C --> D[Output Extraction]
D --> E[Token Generation]
Key Responsibilities:
- Initialize CUDA context and memory pools
- Prepare model inputs with proper padding and masking
- Execute forward passes on GPU
- Manage KV cache memory allocation
Sources: vllm/v1/worker/gpu_worker.py
GPU Model Runner
The GPU Model Runner handles the low-level model execution details:
class GPUModelRunner:
def __init__(self, config):
self.kv_cache = None
self.attn_metadata = None
self.block_manager = None
def prepare_inputs(self, batch):
# Prepare input tensors with proper device placement
pass
def execute_model(self, input_tokens, positions):
# Execute model forward pass
pass
Core Components:
| Component | Function |
|---|---|
| kv_cache | Stores key-value tensors for attention |
| attn_metadata | Manages attention metadata for paged attention |
| block_manager | Handles physical memory block allocation |
Sources: vllm/v1/worker/gpu_model_runner.py
Execution Flow
sequenceDiagram
participant API as API Server
participant Executor as Model Executor
participant Worker as GPU Worker
participant Runner as GPU Model Runner
participant Kernels as CUDA Kernels
API->>Executor: Initialize model
Executor->>Worker: Load weights
Worker->>Runner: Initialize GPU state
Runner->>Kernels: Allocate memory
API->>Executor: Execute batch
Executor->>Worker: Forward request
Worker->>Runner: Prepare inputs
Runner->>Kernels: Compute attention
Runner->>Kernels: Compute mlp
Kernels-->>Runner: Output logits
Runner-->>Worker: Return results
Worker-->>Executor: Output tokens
Memory Management
KV Cache Architecture
vLLM uses a block-based KV cache management system:
- Logical Blocks: Abstract representation of KV cache entries
- Physical Blocks: Actual GPU memory allocations
- Block Mapping: Links logical blocks to physical locations
class BlockManager:
def allocate(self, num_blocks):
# Allocate physical blocks
pass
def get_physical_block(self, logical_id):
# Get physical location for logical block
pass
Memory Allocation Strategy
| Strategy | Use Case |
|---|---|
| Dynamic | Default, allocates on demand |
| Static | Pre-allocates at initialization |
| Hybrid | Mix of static and dynamic |
Configuration Parameters
The Model Executor and Worker architecture is configured through AsyncEngineArgs:
| Parameter | Description | Default |
|---|---|---|
| model | Model name or path | Required |
| tensor_parallel_size | Number of GPUs for tensor parallelism | 1 |
| pipeline_parallel_size | Number of pipeline stages | 1 |
| gpu_memory_utilization | Fraction of GPU memory for KV cache | 0.9 |
| max_model_len | Maximum sequence length | Auto |
| block_size | KV cache block size | 16 |
Summary
The Model Executor and Worker Architecture in vLLM provides a modular, extensible system for LLM inference:
- Separation of Concerns: Clear boundaries between model loading and execution
- GPU Optimization: Efficient CUDA kernel integration and memory management
- Flexible Configuration: Support for various parallelism and quantization strategies
- Extensibility: Plugin-based model registration system
This architecture enables vLLM to achieve high throughput through batch processing while maintaining low latency through careful memory management and kernel optimization.
Sources: vllm/model_executor/model_loader/__init__.py
Scheduling and Request Processing
Related topics: Core Engine Architecture, Model Executor and Worker Architecture, Distributed Inference and Parallelism
Overview
The Scheduling and Request Processing system is a core component of vLLM's v1 engine architecture. It manages the lifecycle of inference requests from arrival to completion, coordinating resource allocation, batch scheduling, and execution ordering to maximize GPU utilization while maintaining quality of service guarantees.
In vLLM, the scheduler operates as an asynchronous event-driven system that continuously evaluates pending requests, determines optimal batching strategies, and dispatches work to the underlying execution engine. This design enables vLLM to handle high-throughput serving workloads with efficient memory management through mechanisms like paged attention and dynamic batch composition.
The scheduler works in conjunction with the request queue, which serves as the primary buffer for incoming inference requests. It makes real-time decisions about which requests to include in the next execution batch based on available GPU memory, request priorities, and fairness constraints.
Architecture Overview
Component Hierarchy
The scheduling system consists of several interconnected components that work together to manage request processing:
graph TD
A[API Layer] --> B[Request Queue]
B --> C[Async Scheduler]
C --> D[Scheduler]
D --> E[Execution Engine]
E --> F[GPU]
G[Config] --> C
G --> D
Core Components
| Component | File | Purpose |
|---|---|---|
| Scheduler | vllm/v1/core/sched/scheduler.py | Core scheduling logic and batch selection |
| AsyncScheduler | vllm/v1/core/sched/async_scheduler.py | Async wrapper for scheduler operations |
| RequestQueue | vllm/v1/core/sched/request_queue.py | Request buffering and ordering |
| Request | vllm/v1/request.py | Request data model and state |
| SchedulerConfig | vllm/config/scheduler.py | Scheduler configuration parameters |
Request Model
Request Data Structure
The Request class represents an individual inference request with all associated metadata and state. Each request maintains its own context including tokenized inputs, sampling parameters, and execution state.
The request model tracks the following key attributes:
| Attribute | Type | Description |
|---|---|---|
| request_id | str | Unique identifier for the request |
| prompt | str | Original input text or tokens |
| prompt_token_ids | List[int] | Tokenized prompt |
| sampling_params | SamplingParams | Sampling configuration |
| arrival_time | float | Request arrival timestamp |
| state | RequestState | Current execution state |
Request States
Requests transition through a defined state machine during their lifecycle:
stateDiagram-v2
[*] --> WAITING: Request Arrival
WAITING --> SCHEDULED: Scheduler Selection
SCHEDULED --> RUNNING: Dispatch to GPU
RUNNING --> WAITING: Preemption/Continuation
RUNNING --> FINISHED: Completion
RUNNING --> WAITING: KV Cache Reuse
WAITING --> CANCELLED: Client Cancellation
SCHEDULED --> CANCELLED: Client Cancellation
- WAITING: Request is queued and awaiting scheduling
- SCHEDULED: Request has been selected for the next batch
- RUNNING: Request is actively being processed on GPU
- FINISHED: Request has completed successfully
- CANCELLED: Request was cancelled before completion
Sources: vllm/v1/request.py
Request Queue
The RequestQueue serves as the primary buffering mechanism for incoming requests. It provides thread-safe operations for enqueueing, dequeuing, and managing request priorities.
Queue Operations
| Operation | Description |
|---|---|
| enqueue() | Add new request to queue |
| dequeue() | Remove and return next request |
| peek() | View next request without removal |
| cancel() | Remove cancelled request |
| requeue() | Return request to queue (preemption) |
Priority Handling
The request queue supports priority-based ordering where higher priority requests can bypass lower priority ones. Priority is determined by a combination of factors:
- Explicit priority values in SamplingParams
- Arrival time (older requests may get priority for fairness)
- Request type (prefill vs decode operations)
Sources: vllm/v1/core/sched/request_queue.py
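A toy priority queue showing these operations and ordering rules (illustrative only; vLLM's RequestQueue is considerably more involved):
import heapq
import itertools

class ToyRequestQueue:
    def __init__(self):
        self._heap = []                  # entries: (priority, seq, request)
        self._seq = itertools.count()    # arrival order breaks priority ties

    def enqueue(self, request, priority: int = 0):
        # Lower number = higher priority; FIFO among equal priorities
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def dequeue(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

    def peek(self):
        return self._heap[0][2] if self._heap else None

q = ToyRequestQueue()
q.enqueue("decode-r1", priority=0)
q.enqueue("prefill-r2", priority=1)
print(q.dequeue())  # decode-r1 (higher priority)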
Scheduler
Core Scheduling Logic
The scheduler is responsible for selecting which requests to include in the next execution batch. It operates on a continuous scheduling loop that evaluates the current state of all pending requests and available resources.
#### Scheduling Criteria
The scheduler makes decisions based on multiple factors:
- Memory Availability: Sufficient GPU memory must be available for the request's KV cache
- Batch Size Limits: Maximum batch size constraints
- Chunked Prefill Decisions: Whether to split prefill into smaller chunks
- Priority Weighting: Relative importance of pending requests
- Latency Targets: QoS requirements for specific request categories
#### Scheduling Loop
graph LR
A[Evaluate Pending Requests] --> B{Sufficient Resources?}
B -->|Yes| C[Select Request]
C --> D[Add to Batch]
D --> E{Batch Full?}
E -->|No| A
E -->|Yes| F[Dispatch Batch]
B -->|No| G[Wait / Preempt]
F --> H[Update State]
H --> A
Async Scheduler Interface
The AsyncScheduler provides an asynchronous interface to the core scheduler, enabling non-blocking scheduling operations. This is essential for maintaining high throughput in production serving scenarios where the scheduler must coexist with network I/O and other async operations.
Key async operations include:
- schedule_async(): Async wrapper for a scheduling iteration
- add_request(): Async request enqueueing
- abort_request(): Async request cancellation
Sources: vllm/v1/core/sched/async_scheduler.py
Scheduler Configuration
The SchedulerConfig class defines all configurable parameters for the scheduler behavior. These settings control batching strategies, memory management, and QoS characteristics.
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_num_seqs | int | 256 | Maximum sequences per iteration |
| max_num_batched_tokens | int | 8192 | Max tokens per batch |
| max_model_len | int | 8192 | Maximum model context length |
| enable_chunked_prefill | bool | True | Enable prefill chunking |
| num_prefill_groups | int | 1 | Number of prefill groups |
| async_scheduling | bool | True | Enable async scheduling |
Memory Management Settings
| Parameter | Description |
|---|---|
| gpu_memory_utilization | Fraction of GPU memory for KV cache (0.0-1.0) |
| num_causal_layers | Number of causal attention layers |
| head_dim | Dimension of attention heads |
Sources: vllm/config/scheduler.py
Batching Strategies
Continuous Batching
vLLM employs continuous batching (also known as iteration-level scheduling) to maximize GPU utilization. Unlike static batching, where batch composition is fixed at the start, continuous batching allows requests to enter and exit the batch at each iteration.
#### Advantages
- Higher Throughput: GPU is never idle waiting for batch to complete
- Lower Latency: Short requests don't wait for long ones
- Better Memory Utilization: Dynamic allocation based on actual needs
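To make the iteration-level idea concrete, here is a toy simulation (purely illustrative, with a hypothetical max_num_seqs of 2): requests join and leave the running batch at every step instead of waiting for the whole batch to finish.
# Toy continuous-batching simulation: requests join/leave the batch each step.
running = []
waiting = [("r1", 3), ("r2", 1), ("r3", 2)]   # (request_id, decode steps left)
step = 0
while running or waiting:
    while waiting and len(running) < 2:        # admit up to max_num_seqs = 2
        running.append(waiting.pop(0))
    running = [(rid, left - 1) for rid, left in running]   # one decode step each
    done = [rid for rid, left in running if left == 0]
    running = [(rid, left) for rid, left in running if left > 0]
    step += 1
    print(f"step {step}: finished {done}")     # r2 exits early; r3 backfills its slot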
Prefill Batching
Prefill operations (processing new tokens) and decode operations (generating new tokens) have different characteristics. The scheduler can:
- Combined Batching: Mix prefill and decode in same batch
- Separate Batching: Process prefill and decode in distinct batches
- Chunked Prefill: Split large prefill requests into smaller chunks
Chunked Prefill
When enable_chunked_prefill is enabled, large prefill requests are split into smaller chunks to:
- Reduce memory pressure
- Allow shorter requests to be scheduled faster
- Improve fairness between requests of different lengths
Sources: vllm/v1/core/sched/scheduler.py
Request Processing Flow
Full Request Lifecycle
sequenceDiagram
participant Client
participant API
participant Queue
participant Scheduler
participant Engine
participant GPU
Client->>API: Submit Request
API->>API: Validate & Tokenize
API->>Queue: Enqueue Request
Scheduler->>Queue: Dequeue Requests
Scheduler->>Scheduler: Evaluate Batching
Scheduler->>Engine: Dispatch Batch
Engine->>GPU: Execute Forward Pass
GPU-->>Engine: Output Tensors
Engine-->>Scheduler: Update State
Scheduler->>Queue: Requeue if Needed
Scheduler->>API: Stream Tokens
API-->>Client: Response Stream
Scheduling Iteration
Each scheduling iteration follows these steps:
- Request Evaluation: Scan all pending requests for eligibility
- Resource Calculation: Determine available GPU memory
- Batch Composition: Select requests based on scheduling policy
- Chunk Assignment: Divide prefill requests if needed
- Batch Dispatch: Send batch to execution engine
- State Update: Update request states and metrics
Configuration Example
from vllm.config import SchedulerConfig
config = SchedulerConfig(
max_num_seqs=256,
max_num_batched_tokens=8192,
max_model_len=32768,
enable_chunked_prefill=True,
gpu_memory_utilization=0.9,
)
Performance Considerations
Memory Management
The scheduler must balance memory allocation between:
- KV Cache: Storing attention key-value pairs
- Model Weights: The LLM parameters (typically pre-loaded)
- Activation Memory: Temporary tensors during computation
Latency vs Throughput
The scheduling configuration affects the latency-throughput tradeoff:
| Setting | Effect |
|---|---|
| Smaller max_num_seqs | Lower latency, lower throughput |
| Larger max_num_batched_tokens | Higher throughput, variable latency |
| enable_chunked_prefill=True | Better latency fairness |
Preemption
When GPU memory is insufficient for incoming requests, the scheduler may preempt existing requests to free memory. Preempted requests are returned to the queue and rescheduled later.
Integration with Execution Engine
The scheduler interfaces with the execution engine through a well-defined API:
- schedule(): Main entry point for scheduling decisions
- add_request(): Register new inference request
- abort_request(): Cancel running request
- update_from_output(): Process execution results
The execution engine receives scheduled batches and returns completed or paused requests, allowing the scheduler to update its internal state and make subsequent scheduling decisions.
Related Components
For a complete understanding of vLLM's request processing, also refer to:
- Engine: Coordinates between scheduler and model execution
- Worker: Executes model operations on GPU
- Cache: KV cache management and allocation
- Sampling Params: Sampling configuration affecting scheduling
Summary
The Scheduling and Request Processing system is fundamental to vLLM's ability to serve large language models efficiently. By employing continuous batching, intelligent memory management, and flexible scheduling policies, vLLM achieves high throughput while maintaining low latency for diverse workloads. The modular design with separate scheduler, queue, and configuration components allows for fine-tuned control over serving behavior.
Sources: vllm/v1/request.py
PagedAttention and KV Cache Management
Related topics: Scheduling and Request Processing, Distributed Inference and Parallelism, Attention Backends and Kernels
Overview
PagedAttention is a novel attention mechanism that enables efficient virtual memory-based management of the Key-Value (KV) cache in large language model (LLM) inference. Inspired by the memory management technique in operating systems called paging, PagedAttention divides the KV cache into fixed-size "pages" that can be flexibly allocated and managed, eliminating the need for contiguous memory allocation.
The KV cache stores the key and value tensors from attention computation for each token position. During autoregressive decoding, this cache grows dynamically as new tokens are generated. Traditional LLM serving systems allocate contiguous memory blocks for the KV cache, leading to significant memory waste due to internal and external fragmentation when handling variable-length sequences and multi-user workloads.
vLLM's PagedAttention implementation provides:
- Memory efficiency: Eliminates fragmentation by using non-contiguous page-based allocation
- Flexible batching: Supports arbitrary sequence lengths and concurrent requests
- Dynamic memory management: Allocates cache pages on-demand during generation
- GPU memory optimization: Maximizes GPU memory utilization for higher throughput
Architecture
High-Level System Design
graph TD
subgraph "Application Layer"
Req[Inference Request]
Sched[Scheduler]
end
subgraph "Memory Management Layer"
KVM[KV Cache Manager]
BP[Block Pool]
end
subgraph "GPU Memory Layer"
GPU[GPU Memory]
Pages[KV Cache Pages]
end
Req --> Sched
Sched --> KVM
KVM --> BP
BP --> GPU
Pages -.->|Physical Memory| GPU
PagedAttention Version Comparison
vLLM implements two versions of PagedAttention with different performance characteristics:
| Aspect | PagedAttention V1 | PagedAttention V2 |
|---|---|---|
| Kernel Type | Fused attention kernels | Optimized fused kernels |
| Memory Access | Standard page table lookup | Enhanced page table optimization |
| Performance | Baseline optimized | ~2.2x speedup over V1 |
| Use Case | General purpose | Production workloads |
| Implementation | paged_attention_v1.cu | paged_attention_v2.cu |
Sources: csrc/attention/paged_attention_v1.cu:1-100, csrc/attention/paged_attention_v2.cu:1-100, docs/design/paged_attention.md:1-50
KV Cache Manager
The KV Cache Manager (kv_cache_manager.py) is the core component responsible for tracking and managing the allocation of KV cache pages across all in-flight sequences.
Responsibilities
| Responsibility | Description |
|---|---|
| Block Allocation | Allocates and deallocates cache blocks as sequences grow or complete |
| Reference Counting | Tracks how many sequences reference each physical block |
| Page Table Management | Maintains virtual-to-physical page mappings per sequence |
| Cache Eviction | Handles cache eviction when memory pressure occurs |
Key Data Structures
# Simplified representation of block metadata
class Block:
block_id: int # Physical block identifier
page_indices: List[int] # Virtual page indices mapping
ref_count: int # Number of sequences referencing this block
is_computed: bool # Whether this block has computed KV cache
Block Allocation Workflow
sequenceDiagram
participant Scheduler
participant KVM as KV Cache Manager
participant BP as Block Pool
participant GPU as GPU Memory
Scheduler->>KVM: allocate_blocks(sequence_id, num_tokens)
KVM->>BP: reserve_blocks(count)
BP->>GPU: allocate_contiguous_pages
GPU-->>BP: block_handles
BP-->>KVM: allocated_blocks
KVM-->>Scheduler: block_table
Sources: vllm/v1/core/kv_cache_manager.py:1-150, vllm/v1/core/block_pool.py:1-100
Block Pool
The Block Pool (block_pool.py) manages the physical memory allocation of KV cache pages on GPU memory.
Memory Organization
| Parameter | Description | Typical Value |
|---|---|---|
| block_size | Number of tokens per cache block | 16 tokens |
| num_blocks | Total number of available blocks | Dynamic based on GPU memory |
| num_layers | Number of attention layers | Model-dependent (e.g., 32-80) |
| num_kv_heads | Number of key/value attention heads | Model-dependent |
| head_dim | Dimension of each attention head | 128 or 256 |
Block Pool Operations
graph LR
subgraph "Allocation States"
Free[Free Blocks]
Used[Used Blocks]
Partial[Partially Filled]
end
Free -->|Allocate Block| Used
Used -->|Release Block| Free
Partial -->|Append Token| Used
The block pool maintains three primary states:
- Free Blocks: Available for immediate allocation
- Used Blocks: Currently assigned to active sequences
- Partially Filled Blocks: Blocks with available slots for new tokens
Sources: vllm/v1/core/block_pool.py:1-100
PagedAttention Kernel Implementation
CUDA Kernel Architecture
Both PagedAttention V1 and V2 are implemented as optimized CUDA kernels that handle attention computation with page-table-based memory access.
#### V1 Kernel Flow
graph TD
A[Load Query Tokens] --> B[Get Block Table Entry]
B --> C[Load K/V from Pages]
C --> D[Compute Attention Scores]
D --> E[Write Output to GEMM]
#### V2 Kernel Optimizations
PagedAttention V2 introduces several optimizations:
- Reduced Page Table Lookups: Consolidates multiple lookups into single operations
- Improved Memory Coalescing: Better memory access patterns for KV cache
- Warp-Level Primitives: Utilizes warp-level reductions for faster softmax computation
- Fused Operations: Combines multiple operations into single kernel launches
Sources: csrc/attention/paged_attention_v1.cu:1-200, csrc/attention/paged_attention_v2.cu:1-200
Kernel Parameters
| Parameter | Description | Range |
|---|---|---|
| THREADS_PER_BLOCK | CUDA threads per block | 128-256 |
| BLOCK_SIZE | Tokens per block | 16 |
| CACHE_BLOCK_SIZE | Bytes per cache block | Dynamic |
| num_heads | Total query heads | Model-dependent |
| head_dim | Attention head dimension | 128/256 |
Page Table Management
Virtual-to-Physical Mapping
Each sequence maintains a page table that maps virtual token positions to physical cache blocks:
# Virtual page table structure
page_table = [
physical_block_id, # virtual position 0-15
physical_block_id, # virtual position 16-31
physical_block_id, # virtual position 32-47
...
]
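The lookup itself is simple integer arithmetic; a sketch, using the 16-token block size from the defaults above:
BLOCK_SIZE = 16  # tokens per KV cache block

def lookup(page_table, token_pos):
    # Which logical page the token falls in, and its slot within that page
    virtual_page, slot = divmod(token_pos, BLOCK_SIZE)
    return page_table[virtual_page], slot

page_table = [5, 12, 3]           # virtual page -> physical block id
print(lookup(page_table, 37))     # token 37 -> (physical block 3, slot 5)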
Block Table Structure
graph TD
subgraph "Sequence 1"
S1_VP1[Virtual Page 0] --> S1_PP1[Physical Block 5]
S1_VP2[Virtual Page 1] --> S1_PP2[Physical Block 12]
S1_VP3[Virtual Page 2] --> S1_PP3[Physical Block 3]
end
subgraph "Sequence 2"
S2_VP1[Virtual Page 0] --> S2_PP1[Physical Block 5]
S2_VP2[Virtual Page 1] --> S2_PP2[Physical Block 7]
end
The page table allows:
- Non-contiguous storage: Physical blocks need not be contiguous
- Shared blocks: Multiple sequences can share the same physical block (for prefixes)
- Dynamic growth: New pages allocated as sequence extends
Sources: vllm/v1/core/kv_cache_manager.py:100-200, docs/design/paged_attention.md:50-150
Memory Management Strategies
Allocation Policy
vLLM uses a dynamic allocation policy that balances memory efficiency and allocation overhead:
| Strategy | Description | Trade-off |
|---|---|---|
| On-demand | Allocate blocks as tokens are generated | Lower memory waste, higher overhead |
| Pre-allocation | Reserve blocks when sequence starts | Lower overhead, potential waste |
| Hybrid | Pre-allocate prefix, on-demand for new tokens | Balanced approach |
Reference Counting
Reference counting prevents premature block deallocation:
- Initial allocation: Reference count = 1
- Fork/continue: Reference count incremented
- Completion: Reference count decremented
- Free block: When count reaches 0
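In sketch form (illustrative, not vLLM's implementation):
class PhysicalBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_count = 1            # initial allocation

def fork(block: PhysicalBlock):
    block.ref_count += 1              # a second sequence now shares the block

def release(block: PhysicalBlock, free_list: list):
    block.ref_count -= 1
    if block.ref_count == 0:          # last reference gone: block is reusable
        free_list.append(block.block_id)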
Memory Reclamation
When GPU memory is exhausted:
graph TD
A[Memory Pressure Detected] --> B{Sequence Can Evict?}
B -->|Yes| C[Evict Cold Blocks]
B -->|No| D[Wait or Reject Request]
C --> E[Update Page Tables]
E --> F[Allocate New Blocks]
Sources: vllm/v1/core/kv_cache_manager.py:200-300, vllm/v1/core/block_pool.py:100-200
Integration with Scheduler
Request Lifecycle
stateDiagram-v2
[*] --> Received: New Request
Received --> Scheduled: Add to Queue
Scheduled --> Allocating: Acquire Blocks
Allocating --> Running: Blocks Available
Running --> Running: Generate Token
Running --> Completed: EOS Token
Completed --> [*]: Free Blocks
Running --> Waiting: Block Unavailable
Waiting --> Running: Blocks Acquired
Scheduling Integration Points
| Stage | KV Cache Interaction |
|---|---|
| Request Arrival | Pre-allocate blocks for known prefix length |
| Token Generation | Allocate new block when current fills up |
| Sequence Completion | Release all associated blocks |
| Prefix Caching | Share blocks with identical prefixes |
Configuration Options
Server Configuration
vllm serve <model> \
--block-size 16 \
--num-gpu-blocks-override 1000 \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 8192
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| block_size | Number of tokens per KV cache block | 16 |
| gpu_memory_utilization | Fraction of GPU memory for KV cache | 0.9 |
| num_gpu_blocks_override | Override auto-computed block count | Auto |
| max_num_batched_tokens | Maximum tokens in a single batch | Dynamic |
Memory Calculation
total_cache_memory = num_blocks × block_size × num_layers × 2 × num_kv_heads × head_dim × dtype_size
Where:
- num_blocks = GPU memory × utilization / per_block_memory
- 2 accounts for both K and V caches
- dtype_size = 2 bytes for FP16, 1 byte for INT8, etc.
Sources: docs/design/paged_attention.md:150-250
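Plugging in numbers for a hypothetical 7B-class model (32 layers, 8 KV heads, head dim 128, FP16 cache) makes the formula concrete:
num_layers, num_kv_heads, head_dim = 32, 8, 128   # hypothetical 7B-class model
block_size, dtype_size = 16, 2                    # 16 tokens/block, FP16 = 2 bytes
per_block = block_size * num_layers * 2 * num_kv_heads * head_dim * dtype_size
print(per_block // 2**20, "MiB per block")        # 2 MiB
cache_bytes = 40 * 2**30                          # e.g. 40 GiB left after weights
num_blocks = cache_bytes // per_block
print(num_blocks, "blocks =", num_blocks * block_size, "cacheable tokens")  # 20480 blocks, 327680 tokens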
Performance Characteristics
Throughput Improvements
PagedAttention enables significant performance improvements compared to traditional contiguous allocation:
| Metric | Traditional | PagedAttention V1 | PagedAttention V2 |
|---|---|---|---|
| Memory Waste | 30-60% | <5% | <5% |
| Throughput | Baseline | ~2.1x | ~2.2x |
| BS=1 Latency | Baseline | ~1.9x | ~2.0x |
| BS=16+ Latency | Baseline | ~2.0x | ~2.2x |
Scalability
The system scales efficiently with:
- Longer sequences: O(seq_len) memory, no fragmentation
- More concurrent requests: Block sharing for shared prefixes
- Larger models: Better memory utilization per GPU
Sources: csrc/attention/paged_attention_v2.cu:200-300, docs/design/paged_attention.md:250-350
Implementation Details
Block Metadata Tracking
# Core block metadata structure
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
block_id: int
device: Device
block_size: int
num_tokens: int = 0
# Physical location
physical_block_id: Optional[int] = None
# Content tracking
content_hash: Optional[int] = None
computed: bool = False
# Reference counting for shared blocks
ref_count: int = 0
Attention Computation Flow
graph TD
subgraph "Pre-computation"
Q[Query Tensors] --> QS[Query Split]
K[Key Tensors] --> KS[Key Split]
V[Value Tensors] --> VS[Value Split]
end
subgraph "Paged Attention"
QS --> AL[Attention Layer]
KS --> AL
VS --> AL
AL --> PT[Page Table Lookup]
end
subgraph "Output"
PT --> OUT[Attention Output]
OUT --> SOFTMAX[Softmax]
SOFTMAX --> GEMM[Final GEMM]
end
Best Practices
Memory Configuration
- Set appropriate GPU memory utilization based on other memory needs (model weights, activations)
- Adjust block size for typical sequence lengths (larger blocks = less overhead for long sequences)
- Monitor fragmentation using vLLM metrics
Request Optimization
- Use consistent prefixes to enable block sharing across requests
- Batch similar requests to maximize cache hit rates
- Set appropriate max sequence length to avoid excessive block allocation
Debugging
# Enable verbose logging
export VLLM_LOGGING_LEVEL=DEBUG
# Inspect Prometheus metrics (KV cache usage is exported here)
curl http://localhost:8000/metrics
Related Components
| Component | File | Role |
|---|---|---|
| Attention Backend | vllm/attention/backends/ | Pluggable attention implementations |
| Scheduler | vllm/v1/core/sched.py | Coordinates cache allocation with scheduling |
| Model Runner | vllm/v1/core/model_runner.py | Executes attention with managed KV cache |
| Worker | vllm/v1/worker/worker.py | GPU-side cache management |
References
- vLLM Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention"
- Original Implementation: csrc/attention/paged_attention_v1.cu
- Optimized Implementation: csrc/attention/paged_attention_v2.cu
- Design Documentation: docs/design/paged_attention.md
- KV Cache Manager: vllm/v1/core/kv_cache_manager.py
- Block Pool: vllm/v1/core/block_pool.py
Sources: csrc/attention/paged_attention_v1.cu:1-100, csrc/attention/paged_attention_v2.cu:1-100, docs/design/paged_attention.md:1-50
Attention Backends and Kernels
Related topics: PagedAttention and KV Cache Management, Quantization Support
Overview
The Attention Backends and Kernels system in vLLM provides a pluggable, hardware-accelerated implementation of attention mechanisms for Large Language Model (LLM) inference. This abstraction layer enables vLLM to leverage different optimized attention implementations (FlashAttention, FlashInfer, FlashMLA) depending on hardware capabilities and model requirements, while maintaining a unified interface for the rest of the engine.
Architecture
High-Level Design
vLLM implements a backend registry pattern that allows runtime selection of the optimal attention implementation. The attention system is designed to support both the legacy v0 engine and the optimized v1 engine architecture.
graph TD
A[Attention Interface] --> B[Attention Backends Registry]
B --> C[FlashAttention Backend]
B --> D[FlashInfer Backend]
B --> E[FlashMLA Backend]
C --> F[CUDA Kernels]
D --> G[Template-based Kernels]
E --> H[MLA-optimized Kernels]
I[Model Request] --> J[Scheduler]
J --> K[Attention Layer]
K --> A
Backend Selection Flow
The registry-based architecture enables automatic backend selection based on hardware and configuration:
graph TD
A[Engine Initialization] --> B{Check user-specified backend}
B -->|Specified| C[Validate backend compatibility]
B -->|Auto| D{Detect GPU Architecture}
D -->|H100/H200| E[Select FlashAttention3]
D -->|A100/A10| F[Select FlashAttention2]
D -->|Other| G[Select FlashAttention2]
C -->|Valid| H[Initialize Backend]
C -->|Invalid| I[Raise Configuration Error]
E --> H
F --> H
G --> H
Attention Backend Components
Backend Registry
The registry.py module provides the central factory for attention backend selection:
| Component | Responsibility |
|---|---|
AttentionBackend | Abstract base class defining the backend interface |
get_available_backends() | Returns list of compiled/available backends |
get_backend(name) | Retrieves a specific backend by name |
AttentionImpl | Concrete implementation wrapper per backend |
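To illustrate the registry pattern the table describes, here is a minimal, self-contained sketch. The function names mirror the table, but the bodies are illustrative, not vLLM's actual implementation:
from abc import ABC, abstractmethod

# Hypothetical registry mirroring the components in the table above.
_ATTENTION_BACKENDS: dict[str, type["AttentionBackend"]] = {}

class AttentionBackend(ABC):
    """Abstract base class: every backend must at least report its name."""
    @staticmethod
    @abstractmethod
    def get_name() -> str: ...

def register_backend(cls: type[AttentionBackend]) -> type[AttentionBackend]:
    # Usable as a class decorator on concrete backends.
    _ATTENTION_BACKENDS[cls.get_name()] = cls
    return cls

def get_available_backends() -> list[str]:
    return sorted(_ATTENTION_BACKENDS)

def get_backend(name: str) -> type[AttentionBackend]:
    try:
        return _ATTENTION_BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown attention backend: {name}") from None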
FlashAttention Backend
Located at vllm/v1/attention/backends/flash_attn.py, this backend provides:
- FlashAttention 2/3 optimized CUDA kernels
- Support for head dimensions 64, 80, 96, 128, 160, 192, 256
- Paged attention integration
- Cross-attention support for encoder-decoder models
Key Features:
| Feature | Description |
|---|---|
| Fused kernels | Combines attention computation steps |
| Dynamic persistent memory | Reuses KV cache blocks efficiently |
| Strided bias support | Enables sliding window attention |
FlashInfer Backend
Located at vllm/v1/attention/backends/flashinfer.py, this backend offers:
- Template-based kernel generation for flexibility
- Improved performance for specific head dimensions
- Better integration with vLLM's block manager
- Support for custom attention patterns
FlashMLA Backend
Located at vllm/v1/attention/backends/mla/flashmla.py, this backend is optimized for:
- Multi-head Latent Attention (MLA) architectures
- Low-rank KV cache compression
- Reduced memory bandwidth usage
- Optimized for DeepSeek-style models
Attention Interface
Core Abstraction
All attention backends implement a common interface defined by the Attention class:
from typing import List, Optional

import torch

class Attention:
    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: int,
        alibi_slopes: Optional[List[float]],
        cache_config: Optional["CacheConfig"],
        block_size: int,
    ) -> None: ...

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: Optional[torch.Tensor],
        attn_metadata: "AttentionMetadata",
    ) -> torch.Tensor: ...
Attention Metadata
The AttentionMetadata structure carries runtime information required for attention computation:
| Field | Type | Purpose |
|---|---|---|
seq_lens | List[int] | Sequence lengths for each request |
max_seq_len | int | Maximum sequence length in batch |
block_tables | torch.Tensor | Paged attention block mappings |
seq_start_idx | int | Starting index in output buffer |
Paged Attention Integration
vLLM's attention system integrates tightly with paged memory management:
graph LR
A[Logical KV Cache] --> B[Physical KV Blocks]
C[Query Tensors] --> D[Block Lookup]
D --> E[Attention Computation]
F[Block Table] --> D
B --> E
E --> G[Output Tensors]
Block Table Structure
| Block Index | Physical Address | Valid Length |
|---|---|---|
| 0 | 15 | 16 |
| 1 | 23 | 16 |
| 2 | 7 | 16 |
| 3 | -1 (pending) | 0 |
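A minimal sketch of how a logical token position resolves to a physical cache slot through a block table like the one above, assuming a block size of 16 and -1 marking a not-yet-allocated block; the names are illustrative:
BLOCK_SIZE = 16
block_table = [15, 23, 7, -1]  # physical block id per logical block, as in the table

def physical_slot(token_pos: int) -> tuple[int, int]:
    """Map a logical token position to (physical_block, offset_in_block)."""
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    physical_block = block_table[logical_block]
    if physical_block < 0:
        raise LookupError(f"block for token {token_pos} is not allocated yet")
    return physical_block, offset

print(physical_slot(37))  # token 37 -> logical block 2 -> physical block 7, offset 5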
Backend Configuration
Runtime Selection
Backends can be selected through multiple mechanisms:
| Method | Priority | Example |
|---|---|---|
| Environment variable | Lowest | VLLM_ATTENTION_BACKEND=FLASHINFER |
| Model config | Medium | "attention_backend": "flash_attn" |
| CLI argument | Highest | --attention-backend flashinfer |
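For example, forcing the FlashInfer backend through the environment variable from the table (the model name is the small default model used elsewhere in this manual):
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Qwen/Qwen3-0.6B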
Configuration Parameters
| Parameter | Backend | Description |
|---|---|---|
head_size | All | Dimension of each attention head |
num_kv_heads | All | Number of key/value heads (for GQA) |
scale | All | Attention scaling factor |
sliding_window | FlashAttn | Sliding window size |
page_size | FlashInfer | KV cache block size |
Performance Characteristics
Memory Efficiency
| Backend | KV Cache Format | Memory Overhead |
|---|---|---|
| FlashAttention | Standard | Baseline |
| FlashInfer | Blocked | Similar to FlashAttn |
| FlashMLA | Low-rank | 50-70% reduction |
Throughput Optimization
- Kernel Fusion: Combines multiple operations to reduce memory bandwidth
- Persistent Kernels: Keep thread blocks resident across decode steps to avoid relaunch overhead
- Async Execution: Overlaps attention with other operations
Implementation Details
FlashAttention Kernel Flow
sequenceDiagram
participant Q as Query
participant K as Key
participant V as Value
participant Kernel as FlashAttn Kernel
participant Cache as KV Cache
Q->>Kernel: Load Q tiles
K->>Kernel: Load K tiles
V->>Kernel: Load V tiles
Kernel->>Cache: Update (if write)
Kernel->>Kernel: Compute attention scores
Kernel->>Cache: Read (if read)
Kernel->>Kernel: softmax & scale
Kernel-->>Q: Output attention
Backend Initialization Sequence
graph TD
A[EngineArgs parse] --> B[Attention backend selection]
B --> C[Backend registry lookup]
C --> D[Load backend module]
D --> E[Initialize CUDA streams]
E --> F[Allocate persistent buffers]
F --> G[Register attention layers]
G --> H[Ready for inference]
Extending Attention Backends
To implement a custom attention backend:
- Create backend class: Inherit from AttentionBackend
- Implement required methods: get_name(), get_impl()
- Register backend: Add it to the ATTENTION_BACKENDS registry
- Implement kernels: Write CUDA/C++ kernels or wrap existing libraries
class CustomAttentionBackend(AttentionBackend):
@staticmethod
def get_name() -> str:
return "custom_attention"
@staticmethod
def get_impl() -> AttentionImpl:
return CustomAttentionImpl
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| CUDA out of memory | Large batch/sequence | Reduce gpu_memory_utilization |
| Incorrect outputs | Wrong backend selected | Verify with --attention-backend flag |
| Kernel launch failure | Unsupported head size | Use supported head dimensions |
| Slow inference | Suboptimal backend | Benchmark available backends |
Debugging Tips
- Set VLLM_LOGGING_LEVEL=DEBUG for attention kernel timing
- Use VLLM_ATTENTION_BACKEND=FLASH_ATTN to force a specific backend
- Check nvidia-smi for kernel execution times
Summary
The Attention Backends and Kernels system provides vLLM's computational core for transformer attention. Through the registry pattern, it enables seamless switching between optimized implementations while maintaining a stable interface for the rest of the engine. The pluggable architecture supports both established backends (FlashAttention, FlashInfer) and specialized implementations (FlashMLA) optimized for specific model architectures.
Source: https://github.com/vllm-project/vllm / Human Manual
Quantization Support
Related topics: Model Architecture Support, Attention Backends and Kernels
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Related Pages
Related topics: Model Architecture Support, Attention Backends and Kernels
Quantization Support
Overview
vLLM provides comprehensive quantization support to reduce model memory footprint and accelerate inference through various quantization schemes. Quantization compresses model weights from higher precision (typically FP16/BF16) to lower precision formats (such as INT8, INT4, or FP8), enabling larger models to run on limited GPU resources while maintaining acceptable accuracy.
The quantization system in vLLM is designed with a modular, extensible architecture that supports multiple quantization methods through a common abstraction layer. This allows users to easily switch between different quantization schemes or implement custom quantization strategies.
Sources: docs/features/quantization/README.md
Architecture
Quantization System Components
The quantization system consists of several interconnected components:
graph TD
A[Model Loading] --> B[QuantizationConfig]
B --> C[QuantizationMethod]
C --> D[Quantized Linear Layers]
C --> E[Quantized Embedding Layers]
D --> F[CUDA/ROCm Kernels]
E --> F
G[Weight Loading] --> H[Pre-quantized Weights]
G --> I[On-the-fly Quantization]
Quantization Base Architecture
The base quantization architecture defines a common interface that all quantization methods must implement:
classDiagram
class QuantizationConfig {
+get_supported_methods() List[QuantizationMethods]
+get_override_quant_config() Optional[dict]
+get_quant_config() dict
+verify_quant_config() None
}
class QuantizationMethods {
<<enumeration>>
FP8
GGUF
AWQ
GPTQ
}
class QuantizedLinear {
<<interface>>
+create_weights()
+apply_weights()
}
QuantizationConfig --> QuantizationMethods
QuantizationMethods --> QuantizedLinear
Sources: vllm/model_executor/layers/quantization/base_config.py
Layer Structure
Each quantization implementation defines its own quantized layer classes:
| Layer Type | Purpose | Quantization Scope |
|---|---|---|
QuantizedLinear | Matrix multiplication with quantized weights | Weight-only or Activation+Weight |
QuantizedEmbedding | Lookup table with quantized weights | Weight-only |
QuantizedMoE | Mixture-of-Experts with quantized components | Per-expert quantization |
Sources: vllm/model_executor/layers/quantization/fp8.py
Supported Quantization Methods
FP8 (8-bit Floating Point)
FP8 quantization uses 8-bit floating point representation with two formats:
| Format | Exponent Bits | Mantissa Bits | Use Case |
|---|---|---|---|
| E4M3 | 4 | 3 | Activations and weights |
| E5M2 | 5 | 2 | Gradients and optimizer states |
FP8 quantization is particularly well-supported in vLLM with optimized CUDA kernels for inference acceleration.
Sources: vllm/model_executor/layers/quantization/fp8.py
GGUF (GPT-Generated Unified Format)
GGUF is a quantized model format commonly used with llama.cpp. vLLM supports loading GGUF-quantized models directly:
# Load GGUF quantized model
llm = LLM(model="unsloth/Qwen3-0.6B-GGUF:Q4_K_M")
GGUF supports multiple quantization levels:
| Quantization Type | Bits per Parameter | Description |
|---|---|---|
| Q2_K | ~2.5 | 2-bit quantization with 4-bit key-values |
| Q3_K | ~3.5 | 3-bit quantization with 4-bit key-values |
| Q4_K | ~4.5 | 4-bit quantization with 8-bit key-values |
| Q5_K | ~5.5 | 5-bit quantization with 8-bit key-values |
| Q6_K | ~6.5 | 6-bit quantization |
| Q8_0 | ~8.0 | 8-bit quantization (baseline) |
Sources: vllm/model_executor/layers/quantization/gguf.py
AWQ (Activation-Aware Weight Quantization)
AWQ identifies weights with significant activation contributions and preserves them at higher precision while quantizing others aggressively.
GPTQ (Generative Pre-trained Transformer Quantization)
GPTQ performs post-training quantization with optional layer-wise precision adjustment and GPU optimization.
Quantization Configuration
Configuration Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
quantization | str | Quantization method name | None |
quantization_param_path | str | Path to quantization parameters file | None |
dtype | str | Model precision (if not pre-quantized) | auto |
kv_cache_dtype | str | KV cache quantization format | auto |
Enabling Quantization
Quantization is enabled through the --quantization CLI argument or quantization parameter in AsyncEngineArgs:
vllm serve model/path --quantization fp8
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-2-7b-hf",
quantization="fp8",
gpu_memory_utilization=0.9
)
Mixed Quantization
Some layers may remain in higher precision when quantization cannot be applied uniformly:
| Layer Type | Quantization Behavior |
|---|---|
| Input/Output embeddings | Often kept in FP16/BF16 |
| Output projection | May use different precision |
| Attention softmax | Usually in FP32 for stability |
Implementation Details
Weight Loading Pipeline
graph LR
A[Model Checkpoint] --> B{Pre-quantized?}
B -->|Yes| C[Load Quantized Weights]
B -->|No| D[Dynamic Quantization]
C --> E[Apply Quantization Config]
D --> E
E --> F[Initialize Quantized Layer]
F --> G[Verify Weight Shape]
G --> H[CUDA Kernel Ready]
C++ Backend (cuBLAS/cutlass)
High-performance quantization kernels are implemented in C++ and CUDA:
| Component | Location | Purpose |
|---|---|---|
| FP8 GEMM | csrc/quantization/fp8/ | FP8 matrix multiplication |
| W8A8 GEMM | csrc/quantization/fp8/ | INT8 weight, INT8 activation |
| W4A16 GEMM | csrc/quantization/ | INT4 weight, FP16 activation |
| Dequantization | csrc/quantization/ | Convert quantized to compute dtype |
Sources: csrc/quantization
Quantized Linear Layer Implementation
The QuantizedLinear layer handles the core computation:
class QuantizedLinear(QuantizedLayer):
"""Base class for quantized linear layers."""
def __init__(
self,
input_size: int,
output_size: int,
quantization_config: QuantizationConfig,
bias: bool = False,
):
self.input_size = input_size
self.output_size = output_size
self.quantization_config = quantization_config
def create_weights(self):
"""Initialize quantized weight tensors."""
raise NotImplementedError
def forward(self, input_):
"""Forward pass with quantized computation."""
raise NotImplementedError
Sources: vllm/model_executor/layers/quantization/base_config.py
Usage Examples
Loading Pre-quantized Models
from vllm import LLM
# FP8 quantized model
llm_fp8 = LLM(
model="meta-llama/Llama-2-70b-hf",
quantization="fp8",
tensor_parallel_size=4
)
# GGUF quantized model
llm_gguf = LLM(
model="TheBloke/Llama-2-70B-Chat-GGUF",
quantization="gguf",
tokenizer="meta-llama/Llama-2-70b-chat"
)
Quantization with Different Precisions
# Load with specific precision settings
llm = LLM(
model="Qwen/Qwen2.5-72B-Instruct",
quantization="fp8",
dtype="half", # Compute precision
kv_cache_dtype="fp8_e4m3" # KV cache precision
)
CLI Usage
# Serve with FP8 quantization
vllm serve meta-llama/Llama-2-70b-hf \
--quantization fp8 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95
# Serve GGUF model directly
vllm serve TheBloke/Llama-2-7B-Chat-GGUF:Q4_K_M
Performance Considerations
Memory Reduction
| Quantization | Memory Reduction | Quality Impact |
|---|---|---|
| FP8 (E4M3) | ~50% | Minimal |
| INT8 | ~50% | Low |
| INT4 | ~75% | Moderate |
| INT2 | ~87.5% | High |
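A rough worked example behind the table, counting weights only (KV cache and activations are extra; the 70B parameter count is illustrative):
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB for a given parameter precision."""
    return num_params * bits_per_param / 8 / 1e9

params = 70e9  # e.g., a 70B-parameter model
for label, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gb(params, bits):.0f} GB")
# FP16: ~140 GB; FP8: ~70 GB (about 50% reduction); INT4: ~35 GB (about 75%)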
Throughput Impact
Quantization improves throughput through:
- Increased batch size: More sequences fit in GPU memory
- Higher memory bandwidth utilization: Smaller weights load faster
- Accelerated compute: INT8/FP8 operations on tensor cores
Accuracy Considerations
- Post-training quantization (PTQ): May introduce accuracy degradation
- Activation-aware methods (AWQ): Better preserves model capabilities
- Calibration: Some methods require calibration data for optimal accuracy
Extension Points
Custom Quantization Methods
To implement a custom quantization method, extend the base classes:
from vllm.model_executor.layers.quantization import QuantizationConfig

class CustomQuantizationConfig(QuantizationConfig):
    """Custom quantization configuration."""

    def __init__(self, quant_method: str = "weight_only"):
        self.quant_method = quant_method

    @staticmethod
    def get_name() -> str:
        return "custom_quant"

    @classmethod
    def get_supported_methods(cls) -> list[str]:
        return ["weight_only"]

    def get_quant_config(self) -> dict:
        return {"method": self.quant_method}
Registering Custom Quantization
Custom quantizations must be registered in the quantization registry to be discoverable at runtime.
Summary
vLLM's quantization support provides a flexible, extensible system for serving large language models with reduced memory footprint. The architecture separates concerns through well-defined interfaces, allowing seamless integration of new quantization methods while maintaining high performance through optimized CUDA kernels.
Key takeaways:
- Multiple quantization formats supported (FP8, GGUF, AWQ, GPTQ)
- Modular architecture enables easy extension
- Optimized C++/CUDA kernels for inference acceleration
- Simple API through CLI and Python interface
- Memory reduction up to 75% with INT4 quantization
Sources: docs/features/quantization/README.md
Distributed Inference and Parallelism
Related topics: Scheduling and Request Processing, PagedAttention and KV Cache Management
vLLM provides comprehensive support for distributed inference and parallelism, enabling efficient serving of large language models across multiple GPUs and nodes. This document covers the architecture, configuration, and implementation details of vLLM's distributed computing capabilities.
Overview
vLLM's distributed inference system enables horizontal scaling of LLM workloads by distributing model computation across multiple devices. The system supports multiple parallelism strategies, including tensor parallelism, pipeline parallelism, and data parallelism, along with specialized features like disaggregated prefill/decode and KV cache transfer.
The core components of distributed inference in vLLM include:
| Component | Purpose |
|---|---|
| Parallel State Manager | Coordinates process groups for distributed communication |
| Device Communicators | Handle low-level tensor communication (NCCL, CUDA) |
| KV Transfer System | Enables disaggregated prefill/decode architectures |
| Configuration System | Manages parallelism parameters and device placement |
Parallelism Strategies
vLLM supports three primary parallelism strategies, each addressing different aspects of distributed computation.
Tensor Parallelism (TP)
Tensor parallelism splits individual weight matrices across multiple GPUs, allowing computation of large matrices that would not fit in a single device's memory. This is particularly effective for dense layers like attention and feed-forward networks.
Tensor parallelism requires:
- NVIDIA GPUs with NCCL support
- High-bandwidth interconnects (NVLink preferred)
- Enough per-rank memory for that rank's weight shard plus activations and KV cache (weights are sharded, not replicated, across TP ranks)
Pipeline Parallelism (PP)
Pipeline parallelism distributes layers (stages) of the model across different GPUs or nodes. This approach reduces memory requirements per device while maintaining high GPU utilization through micro-batch pipelining.
Data Parallelism (DP)
Data parallelism replicates the entire model across multiple GPUs, with each replica processing different batches of requests. This is the simplest form of parallelism and scales throughput linearly with the number of replicas.
Parallel State Management
The ParallelState class in vllm/distributed/parallel_state.py is the central coordinator for distributed execution.
graph TD
A[ParallelState] --> B[Tensor Parallel Group]
A --> C[Pipeline Parallel Group]
A --> D[Data Parallel Group]
A --> E[World Communicator]
B --> F[Rank 0, 1, 2, 3]
C --> G[Stage 0, Stage 1]
D --> H[Replica 1, Replica 2]
Process Group Initialization
Parallel state is initialized through init_distributed_environment(), which creates the necessary process groups for communication.
# From vllm/distributed/parallel_state.py:45-78
import torch.distributed

def init_distributed_environment(
    rank: int,
    world_size: int,
    local_rank: int,
    init_method: str = "env://",
    backend: str = "nccl",
):
    # Initialize the global distributed context for this process
    torch.distributed.init_process_group(
        backend=backend,
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )
Rank and World Size Management
| Parameter | Description |
|---|---|
rank | Unique identifier for each process in the distributed group |
world_size | Total number of processes in the distributed group |
local_rank | Rank of the process within its local node |
Sources: vllm/distributed/parallel_state.py:45-78
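A short sketch of wiring these parameters up from the environment variables that torchrun exports for every process it spawns (RANK, WORLD_SIZE, LOCAL_RANK); the call assumes the function shown above:
import os

# torchrun sets these for each spawned process.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])

init_distributed_environment(rank=rank, world_size=world_size, local_rank=local_rank)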
Device Communication
CUDA Communicator
The CUDACommunicator class in vllm/distributed/device_communicators/cuda_communicator.py provides NCCL-based communication primitives optimized for CUDA tensors.
graph LR
A[Tensor] -->|all_reduce| B[Aggregated Tensor]
A -->|broadcast| C[Same Tensor on All Ranks]
A -->|reduce_scatter| D[Partitioned Results]
E[Partial Tensors] -->|all_gather| F[Complete Tensor]
Supported Communication Primitives
| Primitive | Function | Use Case |
|---|---|---|
all_reduce | Reduce tensors across all ranks | Gradient synchronization in TP |
broadcast | Send tensor from one rank to all | Weight updates |
all_gather | Collect tensors from all ranks | Output aggregation |
reduce_scatter | Reduce and partition across ranks | Gradient partitioning |
# From vllm/distributed/device_communicators/cuda_communicator.py:23-65
import torch
import torch.distributed
from torch.distributed import ProcessGroup

class CUDACommunicator:
    def __init__(self, group: ProcessGroup):
        self.group = group
        self.world_size = group.size()
        self.rank = group.rank()

    def all_reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        # NCCL all-reduce implementation (in-place sum across ranks)
        torch.distributed.all_reduce(tensor, group=self.group)
        return tensor
Sources: vllm/distributed/device_communicators/cuda_communicator.py:23-65
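To connect the primitives to tensor parallelism: in a row-parallel linear layer each rank computes a partial matrix product, and an all_reduce sums the partials into the full output. A minimal single-process sketch using the CPU gloo backend (real TP uses NCCL across GPUs; shapes and names are illustrative):
import os

import torch
import torch.distributed as dist

# One-process demo group; gloo runs on CPU so this works without GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# Row-parallel linear: each rank holds a slice of the weight rows and computes
# a partial output; summing the partials across ranks yields the full result.
x_shard = torch.randn(4, 8)    # this rank's slice of the input features
w_shard = torch.randn(8, 16)   # this rank's slice of the weight rows
partial = x_shard @ w_shard
dist.all_reduce(partial)       # with world_size=1 the sum is a no-op
dist.destroy_process_group()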
Communication Patterns in Distributed Inference
graph TD
subgraph "Tensor Parallel Region"
A[Attention AllReduce] --> B[FFN AllReduce]
B --> C[AllReduce Output]
end
subgraph "Pipeline Parallel Region"
D[Send Hidden States] --> E[Receive Hidden States]
E --> F[Backward Pass]
end
subgraph "Data Parallel Region"
G[Synchronize KV Cache] --> H[Load Balance Requests]
end
Disaggregated Prefill and Decode
vLLM supports disaggregated prefill/decode architectures where prefill (initial prompt processing) and decode (token generation) stages run on separate GPU clusters. This enables independent scaling of prefill and decode resources.
KV Transfer Architecture
The KV transfer system enables sharing of KV cache between prefill and decode instances.
graph LR
A[Prefill Instance] -->|KV Transfer| B[Shared Storage]
B -->|KV Load| C[Decode Instance]
subgraph "Prefill Process"
A1[Tokenize Prompts] --> A2[Process Prefill]
A2 --> A3[Save KV Cache]
end
subgraph "Decode Process"
C1[Load KV Cache] --> C2[Generate Tokens]
C2 --> C3[Streaming Output]
end
Sources: examples/disaggregated/example_connector/README.md
KV Connector Base
The KVConnectorBase class defines the interface for KV cache transfer implementations.
# From vllm/distributed/kv_transfer/kv_connector/base.py:15-85
from abc import ABC, abstractmethod
from typing import Any, Callable, List

class KVConnectorBase(ABC):
    def __init__(self, kv_transfer_config: "KVTransferParams"):
        self.config = kv_transfer_config

    @abstractmethod
    def load_kv_cache(
        self,
        requests: List["TransferJob"],
        scheduler: Any,
    ) -> None:
        """Load KV cache from an external source during prefill."""

    @abstractmethod
    def save_kv_cache(
        self,
        blocks: List["PhysicalTokenBlock"],
        kv_pair: "KVCache",
        callback: Callable,
    ) -> "TransferJob":
        """Save KV cache to external storage during decode."""
| Method | Purpose |
|---|---|
load_kv_cache | Load KV cache during prefill phase |
save_kv_cache | Persist KV cache during decode phase |
profile_num_available_blocks | Determine available transfer capacity |
Sources: vllm/distributed/kv_transfer/kv_connector/base.py:15-85
Example Connector Implementation
The ExampleConnector provides a reference implementation for KV transfer:
# From examples/disaggregated/example_connector/
class ExampleConnector(KVConnectorBase):
def __init__(self, kv_transfer_config: KVTransferParams):
super().__init__(kv_transfer_config)
self.local_storage = "./local_storage"
The connector workflow:
- Prefill Phase: Process prompts and save KV state to the local_storage directory
- Decode Phase: Load KV state from storage and continue generation
Sources: examples/disaggregated/example_connector/README.md
Configuration
Parallel Configuration Parameters
The ParallelConfig class in vllm/config/parallel.py manages all parallelism settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
tensor_parallel_size | int | 1 | Number of GPUs for tensor parallelism |
pipeline_parallel_size | int | 1 | Number of pipeline stages |
data_parallel_size | int | 1 | Number of data parallel replicas |
data_parallel_size_per_region | int | - | DP size for hybrid parallelism |
data_parallel_master_port | int | 29500 | Port for DP master communication |
data_parallel_master_addr | str | - | Address for DP master |
numa_aware | bool | False | Enable NUMA-aware GPU placement |
# From vllm/config/parallel.py:10-45
@dataclass
class ParallelConfig:
tensor_parallel_size: int = 1
pipeline_parallel_size: int = 1
data_parallel_size: int = 1
data_parallel_size_per_region: Optional[int] = None
data_parallel_master_port: int = 29500
data_parallel_master_addr: Optional[str] = None
data_parallel_standalone: bool = False
numa_aware: bool = False
Sources: vllm/config/parallel.py:10-45
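The three sizes multiply into the total process count, which is a quick sanity check before launching. A minimal sketch assuming the dataclass above:
config = ParallelConfig(
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
    data_parallel_size=2,
)

# Total processes (and usually GPUs) the deployment needs: TP x PP x DP.
world_size = (
    config.tensor_parallel_size
    * config.pipeline_parallel_size
    * config.data_parallel_size
)
print(world_size)  # 16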
Environment Variables
| Variable | Description |
|---|---|
CUDA_VISIBLE_DEVICES | Comma-separated list of GPU IDs to use |
VLLM_HOST_IP | Host IP for distributed communication |
VLLM_PORT | Port for worker communication |
Launching Distributed Inference
Multi-GPU Launch with torchrun
torchrun --nproc_per_node=4 \
--nnodes=1 \
vllm/entrypoints/llm.py \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4
Multi-Node Launch
torchrun --nproc_per_node=8 \
--nnodes=2 \
--node_rank=0 \
--master_addr=10.0.0.1 \
--master_port=29500 \
vllm/entrypoints/llm.py \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--pipeline-parallel-size 4
Scaling Strategies
Scaling Guidelines
| Model Size | GPU Memory | Recommended Configuration |
|---|---|---|
| 7B | 24GB | TP=1, DP=single node |
| 13B | 48GB | TP=2, DP=single node |
| 70B | 320GB+ | TP=4 or 8, PP=2+ |
| 405B | 800GB+ | TP=8, PP=8, multi-node |
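A back-of-the-envelope check behind these recommendations, counting FP16 weights at 2 bytes per parameter with ~30% headroom for KV cache and activations (the headroom factor is an assumption, not a vLLM constant):
import math

def min_gpus(num_params: float, gpu_mem_gb: float, bytes_per_param: int = 2,
             headroom: float = 1.3) -> int:
    """Rough lower bound on GPU count to hold the weights plus headroom."""
    total_gb = num_params * bytes_per_param / 1e9 * headroom
    return math.ceil(total_gb / gpu_mem_gb)

print(min_gpus(70e9, 80))   # 70B on 80GB GPUs -> 3, so TP=4 is the practical choice
print(min_gpus(405e9, 80))  # 405B -> 14, hence TP=8 x PP=2 and multi-node layouts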
Sources: docs/serving/parallelism_scaling.md
Disaggregated Prefill Scaling
For disaggregated prefill/decode, consider:
| Workload Pattern | Prefill Resources | Decode Resources |
|---|---|---|
| Short prompts, many requests | Scale prefill | Scale decode |
| Long prompts, few requests | Scale prefill with TP | Scale decode |
| Mixed workload | Balance both | Balance both |
Scaling Best Practices
- Start with tensor parallelism for intra-node scaling
- Add pipeline parallelism for multi-node deployments
- Use data parallelism to increase throughput on same-stage workloads
- Enable disaggregation when prefill and decode have different resource needs
Sources: docs/serving/parallelism_scaling.md
Architecture Diagram
graph TB
subgraph "vLLM Distributed Architecture"
subgraph "Process 0"
P0_M[Model Shard 0]
P0_S[Scheduler]
P0_C[Cache Engine]
end
subgraph "Process 1"
P1_M[Model Shard 1]
P1_S[Scheduler]
P1_C[Cache Engine]
end
subgraph "Process N"
PN_M[Model Shard N]
PN_S[Scheduler]
PN_C[Cache Engine]
end
NCCL[NCCL AllReduce]
P0_S <-->|NCCL| P1_S
P1_S <-->|NCCL| PN_S
P0_S <-->|NCCL| PN_S
P0_M <-->|Forward Pass| P0_S
P1_M <-->|Forward Pass| P1_S
PN_M <-->|Forward Pass| PN_S
end
subgraph "External Services"
KV[KV Transfer Service]
Redis[(Redis/Storage)]
end
P0_C <-->|Save KV| Redis
PN_C <-->|Load KV| Redis
Redis <--> KV
Sources: [vllm/distributed/parallel_state.py:45-78](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py)
Model Architecture Support
Related topics: Model Executor and Worker Architecture, Quantization Support
Overview
vLLM provides comprehensive support for various LLM architectures, enabling users to serve, fine-tune, and run inference with a wide range of transformer-based models. The system is designed to be architecture-agnostic while providing optimized implementations for popular model families.
Core Model Loading Architecture
Python API (Offline Inference)
The primary Python interface for running offline inference is the LLM class, which handles model loading and inference without requiring a separate inference server.
# Basic usage example
from vllm import LLM
llm = LLM(model="Qwen/Qwen3-0.6B")
output = llm.generate("Hello, world!")
Sources: examples/basic/offline_inference/README.md
CLI-based Model Serving
The vLLM CLI provides a serve subcommand for HTTP-based model serving:
vllm serve Qwen/Qwen3-0.6B
If no model is specified, the CLI defaults to Qwen/Qwen3-0.6B.
Sources: vllm/entrypoints/cli/serve.py
Supported Model Categories
Language Models
vLLM supports a broad range of causal language models including:
- Decoder-only transformers: Standard autoregressive models
- Mixture of Experts (MoE): Sparse architectures like Mixtral
- Multimodal models: Models that process multiple input types
Quantization Support
vLLM supports quantized models through GGUF format, enabling deployment of compressed models with reduced memory footprint.
Example loading a quantized model directly from HuggingFace:
--model unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B
Sources: examples/basic/offline_inference/README.md
Model Configuration
Generation Configuration
The --generation-config argument specifies where the generation config is loaded from:
| Value | Source | Description |
|---|---|---|
auto | Model path | Loads from model's configuration directory |
<folder_path> | Local folder | Loads from specified directory |
| Not provided | vLLM defaults | Uses built-in default parameters |
If max_new_tokens is specified in generation config, it sets a server-wide limit on output tokens for all requests.
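For example, loading the generation config shipped with the model itself, per the auto row in the table:
vllm serve Qwen/Qwen3-0.6B --generation-config auto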
Sources: examples/basic/offline_inference/README.md
Engine Arguments
Model configuration is controlled through AsyncEngineArgs, which processes CLI arguments and creates the model configuration:
engine_args = AsyncEngineArgs.from_cli_args(args)
model_config = engine_args.create_model_config()
Sources: vllm/entrypoints/cli/launch.py
Structured Outputs
vLLM supports structured output generation for models that support it, including reasoning models:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--reasoning-parser deepseek_r1
This enables compliance with output format constraints defined by the model.
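As an illustration of consuming structured outputs through the OpenAI-compatible API, a sketch using the guided_json parameter that vLLM accepts in the request's extra_body; the schema and prompt are illustrative:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

# vLLM's OpenAI-compatible server accepts guided decoding parameters via
# extra_body; the output is constrained to match the JSON schema.
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Answer in JSON: what is vLLM?"}],
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)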
Sources: examples/features/structured_outputs/README.md
CPU Offload Support
For models that exceed available GPU memory, vLLM provides CPU offload capabilities:
--cpu-offload-gb 10
This creates virtual GPU memory by offloading portions of the model to CPU RAM. For example, with a 24GB GPU and 10GB offload, you can effectively load a 13B model requiring ~26GB.
Note: This requires fast CPU-GPU interconnect for acceptable performance.
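A complete command for the scenario above, as a sketch (the model name is illustrative; any checkpoint around 26GB fits the example):
# ~26GB of weights with a 24GB GPU: spill 10GB of weights into CPU RAM
vllm serve meta-llama/Llama-2-13b-hf --cpu-offload-gb 10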
Sources: examples/basic/offline_inference/README.md
Model Registry Architecture
The vLLM CLI uses a modular command structure where model support is registered through subcommand modules:
for cmd_module in CMD_MODULES:
new_cmds = cmd_module.cmd_init()
for cmd in new_cmds:
cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)
Each registered command includes validation logic to ensure model configurations are valid before execution.
Sources: vllm/entrypoints/cli/main.py
Serving Modes
Standard API Server
The default serving mode starts an HTTP server with OpenAI-compatible endpoints:
vllm serve <model_name>
Headless Mode
For distributed deployments, headless mode skips API server initialization:
if args.headless:
if args.api_server_count is not None and args.api_server_count > 0:
raise ValueError(
f"--api-server-count={args.api_server_count} cannot be "
"used with --headless (no API servers are started in "
"headless mode)."
)
args.api_server_count = 0
Sources: vllm/entrypoints/cli/serve.py
gRPC Server Mode
For high-performance scenarios, vLLM supports gRPC-based serving:
if getattr(args, "grpc", False):
from vllm.entrypoints.grpc_server import serve_grpc
uvloop.run(serve_grpc(args))
Sources: vllm/entrypoints/cli/serve.py
Data Parallel Modes
vLLM supports distributed model serving through multiple load balancing strategies:
| Mode | Flag | Description |
|---|---|---|
| External LB | --data-parallel-external-lb or --data-parallel-rank | External load balancer manages request distribution |
| Hybrid LB | --data-parallel-hybrid-lb or --data-parallel-start-rank | Hybrid approach with internal and external coordination |
The system auto-detects load balancing mode to set appropriate default values for api_server_count.
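A sketch of the external-LB mode using the flags from the table; --data-parallel-size (the replica count) is assumed here as the companion flag, so verify it against your vLLM version:
# Replica 0 of 2; an external load balancer spreads requests across replicas
vllm serve Qwen/Qwen3-0.6B \
    --data-parallel-size 2 \
    --data-parallel-rank 0 \
    --data-parallel-external-lb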
Sources: vllm/entrypoints/cli/serve.py
Installation and Quickstart
For users getting started with model architecture support:
- Install vLLM following the installation guide
- Review the quickstart documentation
- Check the list of supported models
Sources: README.md
Architecture Flow Diagram
graph TD
A[User Request] --> B{CLI or Python API?}
B -->|CLI| C[vllm serve command]
B -->|Python| D[LLM class instantiation]
C --> E[CLISubcommand processing]
D --> F[AsyncEngineArgs configuration]
E --> G[Model Registry Lookup]
F --> H[Model Config Creation]
G --> I[Load Model Architecture]
H --> I
I --> J{Quantization?}
J -->|GGUF| K[Load Quantized Weights]
J -->|BF16/FP8| L[Load Standard Weights]
K --> M[PagedAttention Engine]
L --> M
M --> N[Inference Execution]
Key Implementation Files
| Component | File Path | Purpose |
|---|---|---|
| CLI Entry | vllm/entrypoints/cli/main.py | Main CLI dispatcher and argument parsing |
| Serve Command | vllm/entrypoints/cli/serve.py | HTTP server startup and configuration |
| Launch Layer | vllm/entrypoints/cli/launch.py | FastAPI-based serving layer |
| Offline Inference | examples/basic/offline_inference/ | Python API usage examples |
| Structured Outputs | examples/features/structured_outputs/ | Advanced output formatting |
Sources: [examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.
1. Installation risk: [Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling
- Severity: medium
- Finding: Installation risk is backed by a source signal: [Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/42182
2. Installation risk: [Bug]: vLLM v1 with prefix caching: first request differs from subsequent identical requests at temperature=0
- Severity: medium
- Finding: Installation risk is backed by a source signal: [Bug]: vLLM v1 with prefix caching: first request differs from subsequent identical requests at temperature=0. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/40896
3. Installation risk: [Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_l…
- Severity: medium
- Finding: Installation risk is backed by a source signal: [Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_l…. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/42207
4. Installation risk: v0.18.1
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.18.1. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/releases/tag/v0.18.1
5. Capability assumption: [Feature]: Qwen3.5-Moe LoRA Support (experts)
- Severity: medium
- Finding: Capability assumption is backed by a source signal: [Feature]: Qwen3.5-Moe LoRA Support (experts). Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/40005
6. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | github_repo:599547518 | https://github.com/vllm-project/vllm | README/documentation is current enough for a first validation pass.
7. Project risk: v0.20.2
- Severity: medium
- Finding: Project risk is backed by a source signal: v0.20.2. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/releases/tag/v0.20.2
8. Maintenance risk: Maintainer activity is unknown
- Severity: medium
- Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | github_repo:599547518 | https://github.com/vllm-project/vllm | last_activity_observed missing
9. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: downstream_validation.risk_items | github_repo:599547518 | https://github.com/vllm-project/vllm | no_demo; severity=medium
10. Security or permission risk: No sandbox install has been executed yet; downstream must verify before user use.
- Severity: medium
- Finding: No sandbox install has been executed yet; downstream must verify before user use.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: risks.safety_notes | github_repo:599547518 | https://github.com/vllm-project/vllm | No sandbox install has been executed yet; downstream must verify before user use.
11. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: risks.scoring_risks | github_repo:599547518 | https://github.com/vllm-project/vllm | no_demo; severity=medium
12. Security or permission risk: [Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B / A100
- Severity: medium
- Finding: Security or permission risk is backed by a source signal: [Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B / A100. Treat it as a review item until the current version is checked.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/41758
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using vllm with real data or production workflows.
- [[Bug]: vLLM v1 with prefix caching: first request differs from subsequen](https://github.com/vllm-project/vllm/issues/40896) - github / github_issue
- [[AMD][CI Failure][Tracker] Static dashboard tracker for current CI failu](https://github.com/vllm-project/vllm/issues/40554) - github / github_issue
- [[Usage]: How to proactively clear CPU-resident memory left behind by unl](https://github.com/vllm-project/vllm/issues/42207) - github / github_issue
- [[Feature]: Qwen3.5-Moe LoRA Support (experts)](https://github.com/vllm-project/vllm/issues/40005) - github / github_issue
- [[Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B /](https://github.com/vllm-project/vllm/issues/41758) - github / github_issue
- [[Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async sch](https://github.com/vllm-project/vllm/issues/42182) - github / github_issue
- v0.20.2 - github / github_release
- v0.20.1 - github / github_release
- v0.20.0 - github / github_release
- v0.19.1 - github / github_release
- v0.19.0 - github / github_release
- v0.18.1 - github / github_release
Source: Project Pack community evidence and pitfall evidence