Doramagic Project Pack · Human Manual

Related topics: Getting Started, Core Engine Architecture

vLLM Overview

What is vLLM?

vLLM is a fast and easy-to-use library for LLM (Large Language Model) inference and serving. It provides high-throughput, memory-efficient inference with an OpenAI-compatible API server, making it suitable for both research and production environments.

Sources: README.md

Key Features

Offline Inference

The LLM class provides the primary Python interface for offline inference—interacting with a model without using a separate model inference server. This enables direct model interaction for batch processing, experimentation, and development workflows.

Sources: examples/basic/offline_inference/README.md

OpenAI-Compatible API Server

vLLM serves LLM completions via HTTP through an OpenAI-compatible API. The server can be started with a simple command:

vllm serve Qwen/Qwen2.5-3B-Instruct

Sources: vllm/entrypoints/cli/serve.py

Structured Outputs

vLLM supports constrained decoding for structured outputs including JSON schema, regex patterns, and structural tags. This is essential for building reliable applications that require predictable output formats.

Sources: examples/features/structured_outputs/README.md

Disaggregated Prefill and Decode

vLLM supports a disaggregated prefill architecture where the prefill stage (processing the input prompt) and the decode stage (generating output tokens) can run on separate instances. This enables:

  • Independent scaling of prefill and decode workloads
  • Improved resource utilization
  • Better support for multi-turn conversations

Sources: examples/disaggregated/example_connector/README.md

KV Cache Transfer

For disaggregated serving, vLLM supports KV cache transfer between prefill and decode workers. The architecture includes:

| Component | Role | Description |
|---|---|---|
| ECExampleConnector | Cache Storage | Stores encoder cache on local disk |
| EC Producer | Precompute | Pre-computes encoder cache |
| EC Consumer | Retrieve | Retrieves cached KV data |

Sources: examples/disaggregated/disaggregated_encoder/README.md

KV Load Failure Recovery

vLLM implements robust recovery mechanisms for KV load failures in both synchronous and asynchronous loading modes. The system:

  1. Identifies invalid KV blocks
  2. Reschedules affected requests
  3. Ensures consistent output through recovery logic

Sources: examples/disaggregated/kv_load_failure_recovery_offline/README.md

Long Text Embedding with Chunked Processing

vLLM supports embedding models with chunked processing for texts exceeding the model's maximum context length:

{
  "pooling_type": "auto",
  "use_activation": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}

This enables processing of extremely long documents (up to 3M+ tokens) including academic papers, legal documents, and code repositories.

Sources: examples/pooling/embed/openai_embedding_long_text/README.md

Architecture Overview

CLI Entry Point

The vLLM CLI provides a flexible command-line interface built with FlexibleArgumentParser:

graph TD
    A[vllm CLI] --> B[main.py Entry Point]
    B --> C[Parse Arguments]
    C --> D[Load CMD_MODULES]
    D --> E[Create Subparsers]
    E --> F[Execute Subcommand]
    
    F --> G[serve]
    F --> H[launch]
    F --> I[other modules]

The CLI supports:

  • -v, --version flag for version information
  • Subcommand system with plugin architecture
  • Command validation before execution

Sources: vllm/entrypoints/cli/main.py

Serve Subcommand

The serve subcommand handles online inference server startup:

graph TD
    A[serve command] --> B{Model specified?}
    B -->|Yes| C[Use CLI model]
    B -->|No| D[Default: Qwen/Qwen3-0.6B]
    
    C --> E{GRPC enabled?}
    D --> E
    
    E -->|Yes| F[Start gRPC Server]
    E -->|No| G{Headless mode?}
    
    G -->|Yes| H[Set api_server_count=0]
    G -->|No| I[Check LB mode]
    
    I --> J[Start FastAPI Server]
    H --> J

The serve command supports multiple deployment modes:

  • Standard mode: Full API server with GPU inference
  • Headless mode: No API servers, only engine processing
  • gRPC mode: Alternative RPC interface
  • Load-balanced mode: Data-parallel external/hybrid load balancing

Sources: vllm/entrypoints/cli/serve.py

Launch Subcommand

The launch subcommand provides a modular component launch system:

graph LR
    A[launch command] --> B[LaunchSubcommand]
    B --> C[launch_component subparser]
    C --> D[LaunchSubcommandBase subclasses]
    
    D --> E[run_launch_fastapi]
    D --> F[other components]
    
    E --> G[Socket binding]
    E --> H[Build API Server]
    E --> I[EngineArgs configuration]

The launch system can start render servers that perform preprocessing only: they run no inference, load no quantized kernels, and never allocate KV cache.

Sources: vllm/entrypoints/cli/launch.py

Observability

OpenTelemetry Integration

vLLM includes built-in OpenTelemetry support for distributed tracing:

opentelemetry-instrument vllm serve facebook/opt-125m

Core packages are bundled with vLLM:

  • opentelemetry-sdk
  • opentelemetry-api
  • opentelemetry-exporter-otlp
  • opentelemetry-semantic-conventions-ai

Sources: examples/observability/opentelemetry/README.md

Prometheus Metrics and Dashboards

vLLM exports Prometheus-compatible metrics and supports integration with:

| Platform | Dashboard Format | Import Method |
|---|---|---|
| Grafana | JSON | UI or API |
| Perses | YAML | CLI |

Sources: examples/observability/dashboards/README.md

Performance Profiling

The nsys_profile_tools enable GPU kernel-level profiling:

nsys profile -t cuda -o run1 -f true --trace-fork-before-exec=true \
    --cuda-graph-trace=node --delay <DELAY> --duration <DURATION> \
    vllm serve openai/gpt-oss-120b ...

The gputrc2graph.py script generates kernel-level summaries and visualizations from .nsys-rep files.

Sources: tools/profiler/nsys_profile_tools/README.md

Supported Features Summary

| Feature | Description | Configuration |
|---|---|---|
| Offline Inference | Batch processing without server | LLM class |
| OpenAI API | HTTP API compatibility | vllm serve |
| Structured Outputs | JSON/regex/structural constraints | --reasoning-parser |
| Disaggregated Serving | Split prefill/decode | --ec-transfer-config |
| KV Recovery | Failure resilience | Custom connectors |
| Long Text Embedding | Chunked processing | --pooler-config |
| Observability | Tracing and metrics | OpenTelemetry |
| Quantization | GGUF support | repo_id:quant_type |

Usage Modes

Offline Inference Mode

from vllm import LLM

llm = LLM("Qwen/Qwen2.5-3B-Instruct")
output = llm.generate("Hello, world!")

API Server Mode

vllm serve Qwen/Qwen2.5-3B-Instruct --tensor-parallel-size 2

Disaggregated Mode

# Prefill instance
vllm serve --prefill-only --ec-transfer-config ' {...} '

# Decode instance
vllm serve --decode-only --ec-transfer-config ' {...} '

Quick Reference

| Command | Purpose |
|---|---|
| vllm serve <model> | Start API server |
| vllm launch <component> | Launch specific component |
| opentelemetry-instrument vllm serve | Enable tracing |

See Also

Sources: [README.md](https://github.com/vllm-project/vllm/blob/main/README.md)

Getting Started

Related topics: vLLM Overview

vLLM is a fast and easy-to-use library for Large Language Model (LLM) inference and serving. It provides both an offline inference interface via the LLM class and an online serving layer with an OpenAI-compatible API server. Sources: README.md:1-30

This guide covers the essential steps to get started with vLLM, from installation through basic inference and serving.

Installation

vLLM can be installed via pip or built from source. For detailed installation instructions, refer to the official documentation.

Quick Installation

pip install vllm

Building from Source

For custom builds or development:

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

GPU Requirements

vLLM requires CUDA-compatible GPUs. The library supports various CUDA versions. Verify your environment has the necessary GPU drivers and CUDA toolkit installed. Sources: README.md:1-15

Core Concepts

Before diving into usage, understand these fundamental concepts:

| Concept | Description |
|---|---|
| LLM Class | Primary Python interface for offline inference |
| Engine Args | Configuration parameters for the inference engine |
| Sampling Params | Controls generation behavior (temperature, max_tokens, etc.) |
| OpenAI API Server | HTTP server providing OpenAI-compatible REST endpoints |

Architecture Overview

graph TD
    A[User Code] --> B[LLM Class / API Server]
    B --> C[AsyncLLMEngine]
    C --> D[Worker Pool]
    D --> E[GPU Devices]
    E --> F[PagedAttention KV Cache]
    
    G[HTTP Clients] --> H[OpenAI API Server]
    H --> B

Offline Inference

Offline inference involves running model inference directly in Python without a separate server. This is ideal for batch processing, testing, or embedding vLLM into applications. Sources: examples/basic/offline_inference/README.md:1-25

Basic Usage

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM("Qwen/Qwen2.5-3B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=256
)

# Run inference
outputs = llm.generate(["Hello, how are you?", "What is vLLM?"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Supported Models

vLLM supports a wide range of models including autoregressive transformers, mixture-of-experts models, and quantized models. For the complete list, see the supported models documentation. Sources: README.md:20-25

Serving with the CLI

vLLM provides a command-line interface for serving models via an OpenAI-compatible API. Sources: vllm/entrypoints/cli/serve.py:1-30

Starting the Server

vllm serve Qwen/Qwen3-0.6B

Command-Line Options

The serve command supports extensive configuration through CLI arguments:

vllm serve <model> [options]

Use --help=all to show all available flags, or --help=<ConfigGroup> to explore options by section (e.g., --help=ModelConfig, --help=Frontend). Sources: vllm/entrypoints/cli/serve.py:5-20

Key Server Options

| Option | Description | Default |
|---|---|---|
| --model | Model name or path | Required |
| --gpu-memory-utilization | Fraction of GPU memory to use | 0.9 |
| --max-model-len | Maximum sequence length | Model default |
| --tensor-parallel-size | Number of GPUs for parallelism | 1 |
| --port | Server port | 8000 |

Headless Mode

For distributed setups where API servers are managed externally:

vllm serve <model> --headless

In headless mode, no API servers are started, and --api-server-count cannot be used. Sources: vllm/entrypoints/cli/serve.py:30-45

API Usage

Once the server is running, you can interact with it using the OpenAI-compatible API.

Completions API

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 50
  }'

Chat API

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
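
The same endpoints can be reached from Python with the official openai client pointed at the local server; a minimal sketch (the api_key value is a placeholder, since the server only checks it when started with --api-key):

from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)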

Offline Inference Examples

The repository includes practical examples demonstrating various vLLM capabilities. Sources: examples/basic/offline_inference/README.md:25-60

Running Examples

# Basic example
python examples/basic/offline_inference/basic.py

# Chat example with sampling parameters
python examples/basic/offline_inference/chat.py --max_tokens 100 --temperature 0.8

# Generate example
python examples/basic/offline_inference/generate.py --generation-config auto

Generation Config

The --generation-config argument specifies where the generation config loads from:

  • 'auto' - Load from model path
  • <folder_path> - Load from specified directory
  • Not provided - Use vLLM defaults

python examples/basic/offline_inference/generate.py --generation-config auto

Note: If max_new_tokens is specified in the generation config, it sets a server-wide limit on output tokens for all requests. Sources: examples/basic/offline_inference/README.md:55-70

Advanced Features

Structured Outputs

vLLM supports constrained decoding for structured outputs including JSON schemas, regex patterns, and grammar-based constraints. Sources: examples/features/structured_outputs/README.md:1-40

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --reasoning-parser deepseek_r1

# Run structured outputs example
uv run structured_outputs_offline.py --constraint json_mode regex
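
For offline use, the same constraints can be requested through SamplingParams; the sketch below assumes the GuidedDecodingParams helper from vllm.sampling_params (the exact structured-output API has shifted between vLLM versions) and a simple JSON schema:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain generation to a JSON object matching this schema.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM("Qwen/Qwen2.5-3B-Instruct")
params = SamplingParams(
    max_tokens=128,
    guided_decoding=GuidedDecodingParams(json=schema),
)
outputs = llm.generate(["Give me a JSON profile for a fictional person."], params)
print(outputs[0].outputs[0].text)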

Long Text Embedding

For embedding models, vLLM supports chunked processing to handle texts exceeding the model's maximum context length:

MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./service.sh

Configuration example with chunked processing:

{
  "pooling_type": "auto",
  "use_activation": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}
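
Once such a server is running, long inputs go through the standard embeddings endpoint and are chunked server-side; a minimal client sketch, assuming the jinaai/jina-embeddings-v3 deployment shown above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A stand-in for a very long document; chunked processing splits it
# internally when it exceeds the model's native context length.
long_text = "vLLM chunked embedding example. " * 50000

result = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",
    input=long_text,
)
print(len(result.data[0].embedding))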

GGUF Quantized Models

vLLM supports GGUF-quantized models loaded directly from HuggingFace:

--model unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B

CPU Offload

For systems with limited GPU memory, CPU offload allows loading larger models:

--cpu-offload-gb 10

This creates a virtual 34GB GPU when you have a 24GB GPU, enabling 13B model loading with BF16 weights. Sources: examples/basic/offline_inference/README.md:75-85

Configuration Workflow

graph LR
    A[Define Engine Args] --> B[Create Model Config]
    B --> C[Initialize Engine]
    C --> D[Process Requests]
    D --> E[Return Outputs]
    
    F[CLI Arguments] --> A
    G[Python API] --> A

Programmatic Configuration

from vllm import LLM

# These keyword arguments mirror EngineArgs fields and are forwarded
# to the engine configuration internally.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_memory_utilization=0.85,
    tensor_parallel_size=2,
    max_model_len=4096,
)

Next Steps

| Resource | Description |
|---|---|
| Documentation | Comprehensive guides and API reference |
| Supported Models | Complete list of supported architectures |
| Examples | Usage examples for various features |
| Paper | Technical details behind vLLM's design |

Troubleshooting

Common Issues

  1. CUDA Out of Memory: Reduce gpu_memory_utilization or use smaller batch sizes
  2. Model Not Found: Ensure HuggingFace credentials are configured for gated models
  3. Import Errors: Verify all dependencies are installed with pip install vllm

Getting Help

Sources: [README.md:60-75]()

Core Engine Architecture

Related topics: vLLM Overview, Model Executor and Worker Architecture, Scheduling and Request Processing

Overview

The vLLM Core Engine Architecture is the central orchestration layer responsible for managing LLM inference workflows, request scheduling, and model execution. vLLM supports two engine versions: the legacy V0 engine and the current V1 engine, both designed to provide high-throughput LLM serving through efficient request batching and GPU memory management.

The engine architecture serves as the foundation for both offline inference via the LLM class and online serving via the OpenAI-compatible API server. Sources: vllm/entrypoints/llm.py:1-50

Architecture Components

V0 Engine (Legacy)

The V0 engine is the original implementation found in vllm/engine/. It consists of:

| Component | File | Purpose |
|---|---|---|
| LLMEngine | vllm/engine/llm_engine.py | Synchronous inference engine with blocking operations |
| AsyncLLMEngine | vllm/engine/async_llm_engine.py | Async wrapper enabling concurrent request handling |

The V0 engine uses an event-loop-based async architecture where AsyncLLMEngine wraps LLMEngine to provide non-blocking request processing.

V1 Engine (Current)

The V1 engine (vllm/v1/engine/) is the current production-ready implementation featuring a modular design:

| Component | File | Purpose |
|---|---|---|
| Core | vllm/v1/engine/core.py | Low-level engine core managing model execution |
| AsyncLLM | vllm/v1/engine/async_llm.py | Main async interface for inference |
| LLMEngine | vllm/v1/engine/llm_engine.py | High-level engine orchestrator |
The V1 engine architecture separates concerns into distinct layers: AsyncLLM provides the public async interface, LLMEngine handles request orchestration, and Core manages low-level GPU operations.

Entry Points

vLLM provides multiple entry points for interacting with the engine:

graph TD
    A[vllm serve] --> B[ServeSubcommand]
    A --> C[LaunchSubcommand]
    B --> D[serve_grpc / API Server]
    C --> E[run_launch_fastapi]
    F[vllm run-batch] --> G[BatchRunner]
    H[LLM Class] --> I[AsyncLLM Engine]

CLI Entry Point

The CLI entry point in vllm/entrypoints/cli/main.py provides command-line access to vLLM functionality:

# Simplified CLI structure
parser = FlexibleArgumentParser(description="vLLM CLI")
subparsers = parser.add_subparsers(required=False, dest="subparser")
cmds = {}
for cmd_module in CMD_MODULES:
    new_cmds = cmd_module.cmd_init()
    for cmd in new_cmds:
        cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)

Sources: vllm/entrypoints/cli/main.py:1-40

Serve Subcommand

The serve subcommand initializes the HTTP API server or gRPC service:

class ServeSubcommand(CLISubcommand):
    name = "serve"
    
    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        if hasattr(args, "model_tag") and args.model_tag is not None:
            args.model = args.model_tag

Sources: vllm/entrypoints/cli/serve.py:1-30

Launch Subcommand

The launch subcommand provides component-level launching capabilities:

def cmd_init() -> list[CLISubcommand]:
    return [LaunchSubcommand()]

async def run_launch_fastapi(args: argparse.Namespace) -> None:
    listen_address, sock = setup_server(args)
    engine_args = AsyncEngineArgs.from_cli_args(args)

Sources: vllm/entrypoints/cli/launch.py:1-60

Python API Entry Point

The LLM class in vllm/entrypoints/llm.py provides the primary Python interface for offline inference:

class LLM:
    """
    An LLM for offline inference.
    """
    
    def __init__(self, model: str, ...):
        ...

Sources: vllm/entrypoints/llm.py:1-100

Engine Initialization Flow

The following diagram illustrates the initialization flow from CLI to engine:

sequenceDiagram
    participant CLI as vllm serve
    participant Parser as ArgumentParser
    participant EngineArgs as AsyncEngineArgs
    participant Engine as AsyncLLM / Core
    participant Model as ModelConfig
    
    CLI->>Parser: Parse CLI arguments
    Parser->>EngineArgs: from_cli_args()
    EngineArgs->>Model: create_model_config()
    Model-->>EngineArgs: Config validated
    EngineArgs->>Engine: Initialize engine
    Engine->>Engine: Load model weights

Core Engine Components

AsyncLLM (V1)

The AsyncLLM class is the primary async interface for V1 engine:

class AsyncLLM:
    """
    Async implementation of LLM engine.
    """
    
    async def add_request(self, request_id: str, prompt: str, ...):
        ...
    
    async def step(self) -> List[RequestOutput]:
        ...

Sources: vllm/v1/engine/async_llm.py:1-50
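
A minimal consumption sketch built only from the two methods listed above (add_request and step); the real signatures take sampling parameters and return richer output objects, so treat this as illustrative rather than the actual API:

import asyncio

async def run_requests(engine, prompts):
    # Submit every prompt, then drive the engine until all requests finish.
    for i, prompt in enumerate(prompts):
        await engine.add_request(request_id=str(i), prompt=prompt)

    finished = []
    while len(finished) < len(prompts):
        step_outputs = await engine.step()
        finished.extend(o for o in step_outputs if getattr(o, "finished", True))
    return finished

# asyncio.run(run_requests(async_llm, ["Hello!", "What is vLLM?"]))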

Core (V1)

The Core class manages low-level model execution:

class Core:
    """
    Core engine for V1.
    """
    
    def __init__(self, engine_config: VllmConfig, ...):
        ...
    
    def get_config(self) -> VllmConfig:
        ...

Sources: vllm/v1/engine/core.py:1-80

LLMEngine (V1)

The V1 LLMEngine orchestrates request processing:

class LLMEngine:
    """
    V1 LLM Engine implementation.
    """
    
    def __init__(self, vllm_config: VllmConfig, ...):
        ...

Sources: vllm/v1/engine/llm_engine.py:1-50

Configuration System

AsyncEngineArgs

Configuration flows from CLI/API to engine via AsyncEngineArgs:

| Parameter | Type | Description |
|---|---|---|
| model | str | Model name or path |
| tensor_parallel_size | int | Number of GPUs for tensor parallelism |
| gpu_memory_utilization | float | Fraction of GPU memory to use |
| max_model_len | Optional[int] | Maximum sequence length |
| dtype | str | Model data type (float16, bfloat16, etc.) |
| quantization | Optional[str] | Quantization method (awq, gptq, etc.) |

The engine validates configuration through create_model_config():

def create_model_config(self) -> ModelConfig:
    """Create model config from engine args."""
    return ModelConfig(...)

Sources: vllm/entrypoints/cli/launch.py:40-50

Headless Mode

The V1 engine supports headless operation where no API servers are started:

if args.headless:
    if args.api_server_count is not None and args.api_server_count > 0:
        raise ValueError(
            f"--api-server-count={args.api_server_count} cannot be "
            "used with --headless (no API servers are started in "
            "headless mode)."
        )
    args.api_server_count = 0

Sources: vllm/entrypoints/cli/serve.py:25-35

gRPC Support

vLLM supports gRPC for disaggregated prefill scenarios:

if getattr(args, "grpc", False):
    from vllm.entrypoints.grpc_server import serve_grpc
    uvloop.run(serve_grpc(args))
    return

Sources: vllm/entrypoints/cli/serve.py:18-22

Request Processing Pipeline

graph LR
    A[Request] --> B[Parser]
    B --> C[AsyncLLM.add_request]
    C --> D[Scheduler]
    D --> E[Model Executor]
    E --> F[Output]
    D --> G[KV Cache]

Data Parallel Modes

The engine supports multiple data parallel configurations:

| Mode | Flag | Description |
|---|---|---|
| External LB | --data-parallel-external-lb | External load balancer coordinates workers |
| Hybrid LB | --data-parallel-hybrid-lb | Hybrid approach with custom rank assignment |
| Rank | --data-parallel-rank | Specify worker rank for distributed setup |

# Detection logic
is_external_lb = getattr(args, "data_parallel_external_lb", False)
is_hybrid_lb = getattr(args, "data_parallel_hybrid_lb", False)

Sources: vllm/entrypoints/cli/serve.py:35-40

Model Config Validation

The engine performs validation during model config creation:

model_config = engine_args.create_model_config()

# Clear quantization for render servers (preprocessing only)
if render_mode:
    model_config.quantization = None

Sources: vllm/entrypoints/cli/launch.py:45-55

Key Design Patterns

Async/Await Architecture

The V1 engine uses native async/await for concurrent request handling:

async def step(self) -> List[RequestOutput]:
    """Execute one iteration of the engine."""
    ...

Command Pattern

CLI commands follow the Command pattern with CLISubcommand base class:

class ServeSubcommand(CLISubcommand):
    name = "serve"
    
    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        ...

Subparser Registration

Commands register themselves via cmd_init() factory functions:

def cmd_init() -> list[CLISubcommand]:
    return [LaunchSubcommand(), ServeSubcommand()]
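
A hypothetical third-party subcommand would follow the same pattern; the sketch below reuses only the hooks shown above (name, cmd, subparser_init, cmd_init) and is not part of the vLLM codebase:

import argparse

# from vllm.entrypoints.cli.types import CLISubcommand  # exact import path may differ

class HelloSubcommand(CLISubcommand):
    """Hypothetical subcommand: `vllm hello <name>`."""

    name = "hello"

    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        print(f"Hello, {args.name}!")

    def subparser_init(self, subparsers) -> argparse.ArgumentParser:
        parser = subparsers.add_parser(self.name, help="Say hello")
        parser.add_argument("name")
        return parser

def cmd_init() -> list[CLISubcommand]:
    # Returned commands are registered by the main CLI loop shown earlier.
    return [HelloSubcommand()]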

Summary

The vLLM Core Engine Architecture provides a flexible, multi-layered design supporting both V0 (legacy) and V1 (current) engine implementations. Key characteristics include:

  1. Dual Engine Support: V0 for backward compatibility, V1 for production workloads
  2. Multiple Entry Points: CLI, Python API, HTTP server, gRPC
  3. Async-First Design: Native async/await for concurrent request processing
  4. Modular Components: Clear separation between CLI, engine core, and model execution
  5. Flexible Configuration: Comprehensive argument system via AsyncEngineArgs
  6. Data Parallel Support: Multiple modes for distributed serving scenarios

The architecture prioritizes performance through efficient GPU memory management, request batching, and pipelined execution while maintaining a clean, extensible design.

Sources: [vllm/entrypoints/cli/main.py:1-40]()

Model Executor and Worker Architecture

Related topics: Core Engine Architecture, Scheduling and Request Processing, Model Architecture Support

Overview

The vLLM Model Executor and Worker Architecture forms the core execution layer responsible for model loading, inference, and batch-level processing. This architecture separates concerns between high-level request orchestration and low-level GPU-based model execution, enabling efficient parallel processing of LLM inference requests.

The architecture consists of two primary components:

| Component | Responsibility |
|---|---|
| Model Executor | Manages model loading, weight initialization, and model-specific execution logic |
| Worker | Handles GPU-side computation, memory management, and kernel execution |

Architecture Diagram

graph TD
    A[AsyncEngineArgs] --> B[Model Executor]
    B --> C[Model Loader]
    C --> D[Model Weights]
    B --> E[GPU Worker]
    E --> F[GPU Model Runner]
    F --> G[CUDA Kernels]
    F --> H[Attention Layers]
    F --> I[MLP Layers]
    
    J[Request Batch] --> E
    E --> K[Generated Tokens]
    
    style B fill:#e1f5fe
    style E fill:#fff3e0
    style F fill:#f3e5f5

Model Executor Layer

Purpose and Scope

The Model Executor layer handles all aspects of model lifecycle management, including initialization, weight loading, and providing the interface for model execution. This layer operates independently of the worker layer, allowing for flexibility in model configuration.

Model Loading

The model loading subsystem supports multiple backends and quantization schemes:

class ModelLoader:
    def load_model(self, model_config, parallel_config, device_config):
        # Load model weights based on configuration
        pass

Supported Loading Modes:

| Mode | Description |
|---|---|
| Auto | Automatically detect optimal loading strategy |
| Naive | Standard PyTorch loading |
| Sharded | Load model shards across multiple devices |
| Quantized | Load quantized weights (AWQ, GPTQ, GGUF) |

Sources: vllm/model_executor/model_loader/__init__.py

Model Registry

vLLM maintains a registry of supported model architectures:

# Model architecture registration
@register_model("LlamaForCausalLM")
class LlamaForCausalLM(nn.Module):
    ...

The registry maps model names to their corresponding model classes, enabling automatic model instantiation based on HuggingFace model architecture detection.

Sources: vllm/model_executor/models/__init__.py
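
Out-of-tree architectures can be registered through the public ModelRegistry as well; a minimal sketch, assuming a custom MyLlamaForCausalLM class defined in a hypothetical my_package module:

from vllm import ModelRegistry
from my_package.modeling import MyLlamaForCausalLM  # hypothetical module

# Map the architecture name reported by the HF config to the custom class.
ModelRegistry.register_model("MyLlamaForCausalLM", MyLlamaForCausalLM)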

Worker Architecture

Base Worker

The worker base class defines the interface for all worker implementations:

class WorkerBase:
    def __init__(self, vllm_config):
        self.vllm_config = vllm_config
        self.model = None
        self.device = None
    
    def execute_model(self, batch):
        raise NotImplementedError

Sources: vllm/v1/worker/worker_base.py

GPU Worker

The GPU Worker is the primary worker implementation for GPU-based inference:

graph LR
    A[Input Batch] --> B[Model Input Preparation]
    B --> C[Forward Pass]
    C --> D[Output Extraction]
    D --> E[Token Generation]

Key Responsibilities:

  1. Initialize CUDA context and memory pools
  2. Prepare model inputs with proper padding and masking
  3. Execute forward passes on GPU
  4. Manage KV cache memory allocation

Sources: vllm/v1/worker/gpu_worker.py

GPU Model Runner

The GPU Model Runner handles the low-level model execution details:

class GPUModelRunner:
    def __init__(self, config):
        self.kv_cache = None
        self.attn_metadata = None
        self.block_manager = None
    
    def prepare_inputs(self, batch):
        # Prepare input tensors with proper device placement
        pass
    
    def execute_model(self, input_tokens, positions):
        # Execute model forward pass
        pass

Core Components:

| Component | Function |
|---|---|
| kv_cache | Stores key-value tensors for attention |
| attn_metadata | Manages attention metadata for paged attention |
| block_manager | Handles physical memory block allocation |

Sources: vllm/v1/worker/gpu_model_runner.py

Execution Flow

sequenceDiagram
    participant API as API Server
    participant Executor as Model Executor
    participant Worker as GPU Worker
    participant Runner as GPU Model Runner
    participant Kernels as CUDA Kernels
    
    API->>Executor: Initialize model
    Executor->>Worker: Load weights
    Worker->>Runner: Initialize GPU state
    Runner->>Kernels: Allocate memory
    
    API->>Executor: Execute batch
    Executor->>Worker: Forward request
    Worker->>Runner: Prepare inputs
    Runner->>Kernels: Compute attention
    Runner->>Kernels: Compute mlp
    Kernels-->>Runner: Output logits
    Runner-->>Worker: Return results
    Worker-->>Executor: Output tokens

Memory Management

KV Cache Architecture

vLLM uses a block-based KV cache management system:

  1. Logical Blocks: Abstract representation of KV cache entries
  2. Physical Blocks: Actual GPU memory allocations
  3. Block Mapping: Links logical blocks to physical locations

class BlockManager:
    def allocate(self, num_blocks):
        # Allocate physical blocks
        pass
    
    def get_physical_block(self, logical_id):
        # Get physical location for logical block
        pass

Memory Allocation Strategy

| Strategy | Use Case |
|---|---|
| Dynamic | Default, allocates on demand |
| Static | Pre-allocates at initialization |
| Hybrid | Mix of static and dynamic |

Configuration Parameters

The Model Executor and Worker architecture is configured through AsyncEngineArgs:

| Parameter | Description | Default |
|---|---|---|
| model | Model name or path | Required |
| tensor_parallel_size | Number of GPUs for tensor parallelism | 1 |
| pipeline_parallel_size | Number of pipeline stages | 1 |
| gpu_memory_utilization | Fraction of GPU memory for KV cache | 0.9 |
| max_model_len | Maximum sequence length | Auto |
| block_size | KV cache block size | 16 |

Summary

The Model Executor and Worker Architecture in vLLM provides a modular, extensible system for LLM inference:

  • Separation of Concerns: Clear boundaries between model loading and execution
  • GPU Optimization: Efficient CUDA kernel integration and memory management
  • Flexible Configuration: Support for various parallelism and quantization strategies
  • Extensibility: Plugin-based model registration system

This architecture enables vLLM to achieve high throughput through batch processing while maintaining low latency through careful memory management and kernel optimization.

Sources: [vllm/model_executor/model_loader/__init__.py]()

Scheduling and Request Processing

Related topics: Core Engine Architecture, Model Executor and Worker Architecture, Distributed Inference and Parallelism

Overview

The Scheduling and Request Processing system is a core component of vLLM's v1 engine architecture. It manages the lifecycle of inference requests from arrival to completion, coordinating resource allocation, batch scheduling, and execution ordering to maximize GPU utilization while maintaining quality of service guarantees.

In vLLM, the scheduler operates as an asynchronous event-driven system that continuously evaluates pending requests, determines optimal batching strategies, and dispatches work to the underlying execution engine. This design enables vLLM to handle high-throughput serving workloads with efficient memory management through mechanisms like paged attention and dynamic batch composition.

The scheduler works in conjunction with the request queue, which serves as the primary buffer for incoming inference requests. It makes real-time decisions about which requests to include in the next execution batch based on available GPU memory, request priorities, and fairness constraints.

Architecture Overview

Component Hierarchy

The scheduling system consists of several interconnected components that work together to manage request processing:

graph TD
    A[API Layer] --> B[Request Queue]
    B --> C[Async Scheduler]
    C --> D[Scheduler]
    D --> E[Execution Engine]
    E --> F[GPU]
    
    G[Config] --> C
    G --> D

Core Components

| Component | File | Purpose |
|---|---|---|
| Scheduler | vllm/v1/core/sched/scheduler.py | Core scheduling logic and batch selection |
| AsyncScheduler | vllm/v1/core/sched/async_scheduler.py | Async wrapper for scheduler operations |
| RequestQueue | vllm/v1/core/sched/request_queue.py | Request buffering and ordering |
| Request | vllm/v1/request.py | Request data model and state |
| SchedulerConfig | vllm/config/scheduler.py | Scheduler configuration parameters |

Request Model

Request Data Structure

The Request class represents an individual inference request with all associated metadata and state. Each request maintains its own context including tokenized inputs, sampling parameters, and execution state.

The request model tracks the following key attributes:

| Attribute | Type | Description |
|---|---|---|
| request_id | str | Unique identifier for the request |
| prompt | str | Original input text or tokens |
| prompt_token_ids | List[int] | Tokenized prompt |
| sampling_params | SamplingParams | Sampling configuration |
| arrival_time | float | Request arrival timestamp |
| state | RequestState | Current execution state |
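
A minimal dataclass sketch mirroring the attributes in the table above (field names follow the table; the real class carries additional scheduling bookkeeping):

import time
from dataclasses import dataclass, field
from enum import Enum, auto

class RequestState(Enum):
    WAITING = auto()
    SCHEDULED = auto()
    RUNNING = auto()
    FINISHED = auto()
    CANCELLED = auto()

@dataclass
class Request:
    request_id: str
    prompt: str
    prompt_token_ids: list[int]
    sampling_params: "SamplingParams"
    arrival_time: float = field(default_factory=time.time)
    state: RequestState = RequestState.WAITING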

Request States

Requests transition through a defined state machine during their lifecycle:

stateDiagram-v2
    [*] --> WAITING: Request Arrival
    WAITING --> SCHEDULED: Scheduler Selection
    SCHEDULED --> RUNNING: Dispatch to GPU
    RUNNING --> WAITING: Preemption/Continuation
    RUNNING --> FINISHED: Completion
    RUNNING --> WAITING: KV Cache Reuse
    WAITING --> CANCELLED: Client Cancellation
    SCHEDULED --> CANCELLED: Client Cancellation

  1. WAITING: Request is queued and awaiting scheduling
  2. SCHEDULED: Request has been selected for the next batch
  3. RUNNING: Request is actively being processed on GPU
  4. FINISHED: Request has completed successfully
  5. CANCELLED: Request was cancelled before completion

Sources: vllm/v1/request.py

Request Queue

The RequestQueue serves as the primary buffering mechanism for incoming requests. It provides thread-safe operations for enqueueing, dequeuing, and managing request priorities.

Queue Operations

| Operation | Description |
|---|---|
| enqueue() | Add new request to queue |
| dequeue() | Remove and return next request |
| peek() | View next request without removal |
| cancel() | Remove cancelled request |
| requeue() | Return request to queue (preemption) |

Priority Handling

The request queue supports priority-based ordering where higher priority requests can bypass lower priority ones. Priority is determined by a combination of factors:

  • Explicit priority values in SamplingParams
  • Arrival time (older requests may get priority for fairness)
  • Request type (prefill vs decode operations)

Sources: vllm/v1/core/sched/request_queue.py

Scheduler

Core Scheduling Logic

The scheduler is responsible for selecting which requests to include in the next execution batch. It operates on a continuous scheduling loop that evaluates the current state of all pending requests and available resources.

#### Scheduling Criteria

The scheduler makes decisions based on multiple factors:

  1. Memory Availability: Sufficient GPU memory must be available for the request's KV cache
  2. Batch Size Limits: Maximum batch size constraints
  3. Chunked Prefill Decisions: Whether to split prefill into smaller chunks
  4. Priority Weighting: Relative importance of pending requests
  5. Latency Targets: QoS requirements for specific request categories

#### Scheduling Loop

graph LR
    A[Evaluate Pending Requests] --> B{Sufficient Resources?}
    B -->|Yes| C[Select Request]
    C --> D[Add to Batch]
    D --> E{Batch Full?}
    E -->|No| A
    E -->|Yes| F[Dispatch Batch]
    B -->|No| G[Wait / Preempt]
    F --> H[Update State]
    H --> A

Async Scheduler Interface

The AsyncScheduler provides an asynchronous interface to the core scheduler, enabling non-blocking scheduling operations. This is essential for maintaining high throughput in production serving scenarios where the scheduler must coexist with network I/O and other async operations.

Key async operations include:

  • schedule_async(): Async wrapper for schedule iteration
  • add_request(): Async request enqueueing
  • abort_request(): Async request cancellation

Sources: vllm/v1/core/sched/async_scheduler.py

Scheduler Configuration

The SchedulerConfig class defines all configurable parameters for the scheduler behavior. These settings control batching strategies, memory management, and QoS characteristics.

Configuration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_num_seqs | int | 256 | Maximum sequences per iteration |
| max_num_batched_tokens | int | 8192 | Max tokens per batch |
| max_model_len | int | 8192 | Maximum model context length |
| enable_chunked_prefill | bool | True | Enable prefill chunking |
| num_prefill_groups | int | 1 | Number of prefill groups |
| async_scheduling | bool | True | Enable async scheduling |

Memory Management Settings

| Parameter | Description |
|---|---|
| gpu_memory_utilization | Fraction of GPU memory for KV cache (0.0-1.0) |
| num_causal_layers | Number of causal attention layers |
| head_dim | Dimension of attention heads |

Sources: vllm/config/scheduler.py

Batching Strategies

Continuous Batching

vLLM employs continuous batching (also known as iteration-level scheduling) to maximize GPU utilization. Unlike static batching, where batch composition is fixed at the start, continuous batching allows requests to enter and exit the batch at each iteration.

#### Advantages

  • Higher Throughput: GPU is never idle waiting for batch to complete
  • Lower Latency: Short requests don't wait for long ones
  • Better Memory Utilization: Dynamic allocation based on actual needs

Prefill Batching

Prefill operations (processing the prompt tokens) and decode operations (generating output tokens one at a time) have different characteristics. The scheduler can:

  1. Combined Batching: Mix prefill and decode in same batch
  2. Separate Batching: Process prefill and decode in distinct batches
  3. Chunked Prefill: Split large prefill requests into smaller chunks

Chunked Prefill

When enable_chunked_prefill is enabled, large prefill requests are split into smaller chunks to:

  • Reduce memory pressure
  • Allow shorter requests to be scheduled faster
  • Improve fairness between requests of different lengths

Sources: vllm/v1/core/sched/scheduler.py

Request Processing Flow

Full Request Lifecycle

sequenceDiagram
    participant Client
    participant API
    participant Queue
    participant Scheduler
    participant Engine
    participant GPU
    
    Client->>API: Submit Request
    API->>API: Validate & Tokenize
    API->>Queue: Enqueue Request
    Scheduler->>Queue: Dequeue Requests
    Scheduler->>Scheduler: Evaluate Batching
    Scheduler->>Engine: Dispatch Batch
    Engine->>GPU: Execute Forward Pass
    GPU-->>Engine: Output Tensors
    Engine-->>Scheduler: Update State
    Scheduler->>Queue: Requeue if Needed
    Scheduler->>API: Stream Tokens
    API-->>Client: Response Stream

Scheduling Iteration

Each scheduling iteration follows these steps:

  1. Request Evaluation: Scan all pending requests for eligibility
  2. Resource Calculation: Determine available GPU memory
  3. Batch Composition: Select requests based on scheduling policy
  4. Chunk Assignment: Divide prefill requests if needed
  5. Batch Dispatch: Send batch to execution engine
  6. State Update: Update request states and metrics

Configuration Example

from vllm.config import SchedulerConfig

config = SchedulerConfig(
    max_num_seqs=256,
    max_num_batched_tokens=8192,
    max_model_len=32768,
    enable_chunked_prefill=True,
)

Performance Considerations

Memory Management

The scheduler must balance memory allocation between:

  • KV Cache: Storing attention key-value pairs
  • Model Weights: The LLM parameters (typically pre-loaded)
  • Activation Memory: Temporary tensors during computation

Latency vs Throughput

The scheduling configuration affects the latency-throughput tradeoff:

| Setting | Effect |
|---|---|
| Smaller max_num_seqs | Lower latency, lower throughput |
| Larger max_num_batched_tokens | Higher throughput, variable latency |
| enable_chunked_prefill=True | Better latency fairness |

Preemption

When GPU memory is insufficient for incoming requests, the scheduler may preempt existing requests to free memory. Preempted requests are returned to the queue and rescheduled later.

Integration with Execution Engine

The scheduler interfaces with the execution engine through a well-defined API:

  • schedule(): Main entry point for scheduling decisions
  • add_request(): Register new inference request
  • abort_request(): Cancel running request
  • update_from_output(): Process execution results

The execution engine receives scheduled batches and returns completed or paused requests, allowing the scheduler to update its internal state and make subsequent scheduling decisions.
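
A simplified sketch of how one engine iteration might wire these calls together (method names taken from the list above; the real loop also handles streaming, preemption, and error paths):

def engine_step(scheduler, executor):
    # 1. Ask the scheduler which requests to run in this iteration.
    scheduler_output = scheduler.schedule()

    # 2. Execute the selected batch on the model executor / workers.
    model_output = executor.execute_model(scheduler_output)

    # 3. Feed results back so the scheduler can finish, requeue,
    #    or continue each request.
    return scheduler.update_from_output(scheduler_output, model_output)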

For a complete understanding of vLLM's request processing, also refer to:

  • Engine: Coordinates between scheduler and model execution
  • Worker: Executes model operations on GPU
  • Cache: KV cache management and allocation
  • Sampling Params: Sampling configuration affecting scheduling

Summary

The Scheduling and Request Processing system is fundamental to vLLM's ability to serve large language models efficiently. By employing continuous batching, intelligent memory management, and flexible scheduling policies, vLLM achieves high throughput while maintaining low latency for diverse workloads. The modular design with separate scheduler, queue, and configuration components allows for fine-tuned control over serving behavior.

Sources: [vllm/v1/request.py]()

PagedAttention and KV Cache Management

Related topics: Scheduling and Request Processing, Distributed Inference and Parallelism, Attention Backends and Kernels

Overview

PagedAttention is a novel attention mechanism that enables efficient virtual memory-based management of the Key-Value (KV) cache in large language model (LLM) inference. Inspired by the memory management technique in operating systems called paging, PagedAttention divides the KV cache into fixed-size "pages" that can be flexibly allocated and managed, eliminating the need for contiguous memory allocation.

The KV cache stores the key and value tensors from attention computation for each token position. During autoregressive decoding, this cache grows dynamically as new tokens are generated. Traditional LLM serving systems allocate contiguous memory blocks for the KV cache, leading to significant memory waste due to internal and external fragmentation when handling variable-length sequences and multi-user workloads.

vLLM's PagedAttention implementation provides:

  • Memory efficiency: Eliminates fragmentation by using non-contiguous page-based allocation
  • Flexible batching: Supports arbitrary sequence lengths and concurrent requests
  • Dynamic memory management: Allocates cache pages on-demand during generation
  • GPU memory optimization: Maximizes GPU memory utilization for higher throughput

Architecture

High-Level System Design

graph TD
    subgraph "Application Layer"
        Req[Inference Request]
        Sched[Scheduler]
    end
    
    subgraph "Memory Management Layer"
        KVM[KV Cache Manager]
        BP[Block Pool]
    end
    
    subgraph "GPU Memory Layer"
        GPU[GPU Memory]
        Pages[KV Cache Pages]
    end
    
    Req --> Sched
    Sched --> KVM
    KVM --> BP
    BP --> GPU
    Pages -.->|Physical Memory| GPU

PagedAttention Version Comparison

vLLM implements two versions of PagedAttention with different performance characteristics:

| Aspect | PagedAttention V1 | PagedAttention V2 |
|---|---|---|
| Kernel Type | Fused attention kernels | Optimized fused kernels |
| Memory Access | Standard page table lookup | Enhanced page table optimization |
| Performance | Baseline optimized | ~2.2x speedup over V1 |
| Use Case | General purpose | Production workloads |
| Implementation | paged_attention_v1.cu | paged_attention_v2.cu |

Sources: csrc/attention/paged_attention_v1.cu:1-100, csrc/attention/paged_attention_v2.cu:1-100, docs/design/paged_attention.md:1-50

KV Cache Manager

The KV Cache Manager (kv_cache_manager.py) is the core component responsible for tracking and managing the allocation of KV cache pages across all in-flight sequences.

Responsibilities

| Responsibility | Description |
|---|---|
| Block Allocation | Allocates and deallocates cache blocks as sequences grow or complete |
| Reference Counting | Tracks how many sequences reference each physical block |
| Page Table Management | Maintains virtual-to-physical page mappings per sequence |
| Cache Eviction | Handles cache eviction when memory pressure occurs |

Key Data Structures

# Simplified representation of block metadata
class Block:
    block_id: int          # Physical block identifier
    page_indices: List[int]  # Virtual page indices mapping
    ref_count: int         # Number of sequences referencing this block
    is_computed: bool      # Whether this block has computed KV cache

Block Allocation Workflow

sequenceDiagram
    participant Scheduler
    participant KVM as KV Cache Manager
    participant BP as Block Pool
    participant GPU as GPU Memory
    
    Scheduler->>KVM: allocate_blocks(sequence_id, num_tokens)
    KVM->>BP: reserve_blocks(count)
    BP->>GPU: allocate_contiguous_pages
    GPU-->>BP: block_handles
    BP-->>KVM: allocated_blocks
    KVM-->>Scheduler: block_table

Sources: vllm/v1/core/kv_cache_manager.py:1-150, vllm/v1/core/block_pool.py:1-100

Block Pool

The Block Pool (block_pool.py) manages the physical memory allocation of KV cache pages on GPU memory.

Memory Organization

| Parameter | Description | Typical Value |
|---|---|---|
| block_size | Number of tokens per cache block | 16 tokens |
| num_blocks | Total number of available blocks | Dynamic based on GPU memory |
| num_layers | Number of attention layers | Model-dependent (e.g., 32-80) |
| num_kv_heads | Number of key/value attention heads | Model-dependent |
| head_dim | Dimension of each attention head | 128 or 256 |

Block Pool Operations

graph LR
    subgraph "Allocation States"
        Free[Free Blocks]
        Used[Used Blocks]
        Partial[Partially Filled]
    end
    
    Free -->|Allocate Block| Used
    Used -->|Release Block| Free
    Partial -->|Append Token| Used

The block pool maintains three primary states:

  1. Free Blocks: Available for immediate allocation
  2. Used Blocks: Currently assigned to active sequences
  3. Partially Filled Blocks: Blocks with available slots for new tokens

Sources: vllm/v1/core/block_pool.py:1-100

PagedAttention Kernel Implementation

CUDA Kernel Architecture

Both PagedAttention V1 and V2 are implemented as optimized CUDA kernels that handle attention computation with page-table-based memory access.

#### V1 Kernel Flow

graph TD
    A[Load Query Tokens] --> B[Get Block Table Entry]
    B --> C[Load K/V from Pages]
    C --> D[Compute Attention Scores]
    D --> E[Write Output to GEMM]

#### V2 Kernel Optimizations

PagedAttention V2 introduces several optimizations:

  • Reduced Page Table Lookups: Consolidates multiple lookups into single operations
  • Improved Memory Coalescing: Better memory access patterns for KV cache
  • Warp-Level Primitives: Utilizes warp-level reductions for faster softmax computation
  • Fused Operations: Combines multiple operations into single kernel launches

Sources: csrc/attention/paged_attention_v1.cu:1-200, csrc/attention/paged_attention_v2.cu:1-200

Kernel Parameters

| Parameter | Description | Range |
|---|---|---|
| THREADS_PER_BLOCK | CUDA threads per block | 128-256 |
| BLOCK_SIZE | Tokens per block | 16 |
| CACHE_BLOCK_SIZE | Bytes per cache block | Dynamic |
| num_heads | Total query heads | Model-dependent |
| head_dim | Attention head dimension | 128/256 |

Page Table Management

Virtual-to-Physical Mapping

Each sequence maintains a page table that maps virtual token positions to physical cache blocks:

# Virtual page table structure
page_table = [
    physical_block_id,  # virtual position 0-15
    physical_block_id,  # virtual position 16-31
    physical_block_id,  # virtual position 32-47
    ...
]
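
Translating a token position into a physical cache location is then simple integer arithmetic over this table; a minimal sketch with a block size of 16 tokens:

BLOCK_SIZE = 16  # tokens per KV cache block

def locate_token(page_table: list[int], position: int) -> tuple[int, int]:
    """Return (physical_block_id, slot_within_block) for a token position."""
    physical_block_id = page_table[position // BLOCK_SIZE]
    slot = position % BLOCK_SIZE
    return physical_block_id, slot

# Example: with page_table = [5, 12, 3], token position 20 maps to block 12, slot 4.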

Block Table Structure

graph TD
    subgraph "Sequence 1"
        S1_VP1[Virtual Page 0] --> S1_PP1[Physical Block 5]
        S1_VP2[Virtual Page 1] --> S1_PP2[Physical Block 12]
        S1_VP3[Virtual Page 2] --> S1_PP3[Physical Block 3]
    end
    
    subgraph "Sequence 2"
        S2_VP1[Virtual Page 0] --> S2_PP1[Physical Block 5]
        S2_VP2[Virtual Page 1] --> S2_PP2[Physical Block 7]
    end

The page table allows:

  • Non-contiguous storage: Physical blocks need not be contiguous
  • Shared blocks: Multiple sequences can share the same physical block (for prefixes)
  • Dynamic growth: New pages allocated as sequence extends

Sources: vllm/v1/core/kv_cache_manager.py:100-200, docs/design/paged_attention.md:50-150

Memory Management Strategies

Allocation Policy

vLLM uses a dynamic allocation policy that balances memory efficiency and allocation overhead:

| Strategy | Description | Trade-off |
|---|---|---|
| On-demand | Allocate blocks as tokens are generated | Lower memory waste, higher overhead |
| Pre-allocation | Reserve blocks when sequence starts | Lower overhead, potential waste |
| Hybrid | Pre-allocate prefix, on-demand for new tokens | Balanced approach |

Reference Counting

Reference counting prevents premature block deallocation (see the sketch after the list below):

  1. Initial allocation: Reference count = 1
  2. Fork/continue: Reference count incremented
  3. Completion: Reference count decremented
  4. Free block: When count reaches 0
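
The same lifecycle can be written as a small sketch (illustrative only; the real block pool tracks considerably more state):

class RefCountedBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_count = 1              # initial allocation

    def fork(self) -> None:
        self.ref_count += 1             # another sequence now shares this block

    def release(self, free_list: list[int]) -> None:
        self.ref_count -= 1             # a sharing sequence completed
        if self.ref_count == 0:
            free_list.append(self.block_id)   # block can be reused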

Memory Reclamation

When GPU memory is exhausted:

graph TD
    A[Memory Pressure Detected] --> B{Sequence Can Evict?}
    B -->|Yes| C[Evict Cold Blocks]
    B -->|No| D[Wait or Reject Request]
    C --> E[Update Page Tables]
    E --> F[Allocate New Blocks]

Sources: vllm/v1/core/kv_cache_manager.py:200-300, vllm/v1/core/block_pool.py:100-200

Integration with Scheduler

Request Lifecycle

stateDiagram-v2
    [*] --> Received: New Request
    Received --> Scheduled: Add to Queue
    Scheduled --> Allocating: Acquire Blocks
    Allocating --> Running: Blocks Available
    Running --> Running: Generate Token
    Running --> Completed: EOS Token
    Completed --> [*]: Free Blocks
    Running --> Waiting: Block Unavailable
    Waiting --> Running: Blocks Acquired

Scheduling Integration Points

| Stage | KV Cache Interaction |
|---|---|
| Request Arrival | Pre-allocate blocks for known prefix length |
| Token Generation | Allocate new block when current fills up |
| Sequence Completion | Release all associated blocks |
| Prefix Caching | Share blocks with identical prefixes |

Configuration Options

Server Configuration

vllm serve <model> \
    --block-size 16 \
    --num-gpu-blocks-override 1000 \
    --gpu-memory-utilization 0.9 \
    --max-num-batched-tokens 8192

Key Parameters

| Parameter | Description | Default |
|---|---|---|
| block_size | Number of tokens per KV cache block | 16 |
| gpu_memory_utilization | Fraction of GPU memory for KV cache | 0.9 |
| num_gpu_blocks_override | Override auto-computed block count | Auto |
| max_num_batched_tokens | Maximum tokens in a single batch | Dynamic |

Memory Calculation

total_cache_memory = num_blocks × block_size × num_layers × 2 × num_kv_heads × head_dim × dtype_size

Where:

  • num_blocks = GPU memory × utilization / per_block_memory
  • 2 accounts for both K and V caches
  • dtype_size = 2 bytes for FP16, 1 byte for INT8, etc.

Sources: docs/design/paged_attention.md:150-250
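
As a worked example, the formula above can be evaluated for a hypothetical configuration: 1,000 blocks of 16 tokens, 32 layers, 8 KV heads of dimension 128, and an FP16 cache:

# Hypothetical model/cache dimensions for illustration only.
num_blocks = 1000
block_size = 16        # tokens per block
num_layers = 32
num_kv_heads = 8
head_dim = 128
dtype_size = 2         # bytes per element (FP16)

total_cache_bytes = (
    num_blocks * block_size * num_layers * 2 * num_kv_heads * head_dim * dtype_size
)
print(f"{total_cache_bytes / 1024**3:.2f} GiB")  # ~1.95 GiB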

Performance Characteristics

Throughput Improvements

PagedAttention enables significant performance improvements compared to traditional contiguous allocation:

| Metric | Traditional | PagedAttention V1 | PagedAttention V2 |
|---|---|---|---|
| Memory Waste | 30-60% | <5% | <5% |
| Throughput | Baseline | ~2.1x | ~2.2x |
| BS=1 Latency | Baseline | ~1.9x | ~2.0x |
| BS=16+ Latency | Baseline | ~2.0x | ~2.2x |

Scalability

The system scales efficiently with:

  • Longer sequences: O(seq_len) memory, no fragmentation
  • More concurrent requests: Block sharing for shared prefixes
  • Larger models: Better memory utilization per GPU

Sources: csrc/attention/paged_attention_v2.cu:200-300, docs/design/paged_attention.md:250-350

Implementation Details

Block Metadata Tracking

# Core block metadata structure
@dataclass
class Block:
    block_id: int
    device: Device
    block_size: int
    num_tokens: int = 0
    
    # Physical location
    physical_block_id: Optional[int] = None
    
    # Content tracking
    content_hash: Optional[int] = None
    computed: bool = False
    
    # Reference counting for shared blocks
    ref_count: int = 0

Attention Computation Flow

graph TD
    subgraph "Pre-computation"
        Q[Query Tensors] --> QS[Query Split]
        K[Key Tensors] --> KS[Key Split]
        V[Value Tensors] --> VS[Value Split]
    end
    
    subgraph "Paged Attention"
        QS --> AL[Attention Layer]
        KS --> AL
        VS --> AL
        AL --> PT[Page Table Lookup]
    end
    
    subgraph "Output"
        PT --> OUT[Attention Output]
        OUT --> SOFTMAX[Softmax]
        SOFTMAX --> GEMM[Final GEMM]
    end

Best Practices

Memory Configuration

  1. Set appropriate GPU memory utilization based on other memory needs (model weights, activations)
  2. Adjust block size for typical sequence lengths (larger blocks = less overhead for long sequences)
  3. Monitor fragmentation using vLLM metrics

Request Optimization

  1. Use consistent prefixes to enable block sharing across requests
  2. Batch similar requests to maximize cache hit rates
  3. Set appropriate max sequence length to avoid excessive block allocation

Debugging

# Enable verbose logging
export VLLM_LOGGING_LEVEL=DEBUG

# Check Prometheus metrics (includes KV cache usage gauges)
curl http://localhost:8000/metrics

The KV cache subsystem interacts with the following components:

| Component | File | Role |
|---|---|---|
| Attention Backend | vllm/attention/backends/ | Pluggable attention implementations |
| Scheduler | vllm/v1/core/sched/scheduler.py | Coordinates cache allocation with scheduling |
| Model Runner | vllm/v1/worker/gpu_model_runner.py | Executes attention with managed KV cache |
| Worker | vllm/v1/worker/gpu_worker.py | GPU-side cache management |

References

Sources: [csrc/attention/paged_attention_v1.cu:1-100](), [csrc/attention/paged_attention_v2.cu:1-100](), [docs/design/paged_attention.md:1-50]()

Attention Backends and Kernels

Related topics: PagedAttention and KV Cache Management, Quantization Support

Overview

The Attention Backends and Kernels system in vLLM provides a pluggable, hardware-accelerated implementation of attention mechanisms for Large Language Model (LLM) inference. This abstraction layer enables vLLM to leverage different optimized attention implementations (FlashAttention, FlashInfer, FlashMLA) depending on hardware capabilities and model requirements, while maintaining a unified interface for the rest of the engine.

Architecture

High-Level Design

vLLM implements a backend registry pattern that allows runtime selection of the optimal attention implementation. The attention system is designed to support both the legacy v0 engine and the optimized v1 engine architecture.

graph TD
    A[Attention Interface] --> B[Attention Backends Registry]
    B --> C[FlashAttention Backend]
    B --> D[FlashInfer Backend]
    B --> E[FlashMLA Backend]
    C --> F[CUDA Kernels]
    D --> G[Template-based Kernels]
    E --> H[MLA-optimized Kernels]
    
    I[Model Request] --> J[Scheduler]
    J --> K[Attention Layer]
    K --> A

Backend Selection Flow

The registry-based architecture enables automatic backend selection based on hardware and configuration:

graph TD
    A[Engine Initialization] --> B{Check user-specified backend}
    B -->|Specified| C[Validate backend compatibility]
    B -->|Auto| D{Detect GPU Architecture}
    D -->|H100/H200| E[Select FlashAttention3]
    D -->|A100/A10| F[Select FlashAttention2]
    D -->|Other| G[Select FlashAttention2]
    C -->|Valid| H[Initialize Backend]
    C -->|Invalid| I[Raise Configuration Error]
    E --> H
    F --> H
    G --> H

Attention Backend Components

Backend Registry

The registry.py module provides the central factory for attention backend selection:

| Component | Responsibility |
|---|---|
| AttentionBackend | Abstract base class defining the backend interface |
| get_available_backends() | Returns list of compiled/available backends |
| get_backend(name) | Retrieves a specific backend by name |
| AttentionImpl | Concrete implementation wrapper per backend |

FlashAttention Backend

Located at vllm/v1/attention/backends/flash_attn.py, this backend provides:

  • FlashAttention 2/3 optimized CUDA kernels
  • Support for head dimensions 64, 80, 96, 128, 160, 192, 256
  • Paged attention integration
  • Cross-attention support for encoder-decoder models

Key Features:

| Feature | Description |
|---|---|
| Fused kernels | Combines attention computation steps |
| Dynamic persistent memory | Reuses KV cache blocks efficiently |
| Strided bias support | Enables sliding window attention |

FlashInfer Backend

Located at vllm/v1/attention/backends/flashinfer.py, this backend offers:

  • Template-based kernel generation for flexibility
  • Improved performance for specific head dimensions
  • Better integration with vLLM's block manager
  • Support for custom attention patterns

FlashMLA Backend

Located at vllm/v1/attention/backends/mla/flashmla.py, this backend is optimized for:

  • Multi-head Latent Attention (MLA) architectures
  • Low-rank KV cache compression
  • Reduced memory bandwidth usage
  • Optimized for DeepSeek-style models

Attention Interface

Core Abstraction

All attention backends implement a common interface defined by the Attention class:

from typing import List, Optional

import torch


class Attention:
    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: int,
        alibi_slopes: Optional[List[float]],
        cache_config: Optional["CacheConfig"],
        block_size: int,
    ) -> None: ...

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: Optional[torch.Tensor],
        attn_metadata: "AttentionMetadata",
    ) -> torch.Tensor: ...

Attention Metadata

The AttentionMetadata structure carries runtime information required for attention computation:

| Field | Type | Purpose |
|---|---|---|
| seq_lens | List[int] | Sequence lengths for each request |
| max_seq_len | int | Maximum sequence length in batch |
| block_tables | torch.Tensor | Paged attention block mappings |
| seq_start_idx | int | Starting index in output buffer |
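
As an illustration only, the fields above map onto a small dataclass like the one below; vLLM's real metadata classes are backend-specific and carry additional fields.

from dataclasses import dataclass
from typing import List

import torch

@dataclass
class AttentionMetadataSketch:
    """Simplified stand-in for the per-batch metadata described above."""
    seq_lens: List[int]          # sequence length of each request in the batch
    max_seq_len: int             # longest sequence in the batch
    block_tables: torch.Tensor   # [num_seqs, max_blocks_per_seq] physical block ids
    seq_start_idx: int = 0       # starting index in the output buffer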

Paged Attention Integration

vLLM's attention system integrates tightly with paged memory management:

graph LR
    A[Logical KV Cache] --> B[Physical KV Blocks]
    C[Query Tensors] --> D[Block Lookup]
    D --> E[Attention Computation]
    F[Block Table] --> D
    B --> E
    E --> G[Output Tensors]

Block Table Structure

| Block Index | Physical Address | Valid Length |
|---|---|---|
| 0 | 15 | 16 |
| 1 | 23 | 16 |
| 2 | 7 | 16 |
| 3 | -1 (pending) | 0 |
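
A worked example of the lookup this table implies, assuming a block size of 16 tokens:

# Map a logical token position to its physical KV cache location.
BLOCK_SIZE = 16
block_table = [15, 23, 7, -1]  # physical block ids from the table above; -1 = pending

def physical_location(token_idx: int) -> tuple[int, int]:
    logical_block = token_idx // BLOCK_SIZE
    offset = token_idx % BLOCK_SIZE
    physical_block = block_table[logical_block]
    assert physical_block >= 0, "block not yet allocated"
    return physical_block, offset

print(physical_location(37))  # token 37 -> logical block 2 -> (physical block 7, offset 5)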

Backend Configuration

Runtime Selection

Backends can be selected through multiple mechanisms:

| Method | Priority | Example |
|---|---|---|
| Environment variable | Lowest | VLLM_ATTENTION_BACKEND=FLASHINFER |
| Model config | Medium | "attention_backend": "flash_attn" |
| CLI argument | Highest | --attention-backend flashinfer |
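
For example, forcing a backend through the lowest-priority mechanism (the environment variable) might look like the sketch below; the model name is a placeholder and the variable must be set before the engine is created.

import os

# Must be set before the LLM engine is constructed.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")  # placeholder model
print(llm.generate("Hello")[0].outputs[0].text)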

Configuration Parameters

| Parameter | Backend | Description |
|---|---|---|
| head_size | All | Dimension of each attention head |
| num_kv_heads | All | Number of key/value heads (for GQA) |
| scale | All | Attention scaling factor |
| sliding_window | FlashAttn | Sliding window size |
| page_size | FlashInfer | KV cache block size |

Performance Characteristics

Memory Efficiency

| Backend | KV Cache Format | Memory Overhead |
|---|---|---|
| FlashAttention | Standard | Baseline |
| FlashInfer | Blocked | Similar to FlashAttn |
| FlashMLA | Low-rank | 50-70% reduction |

Throughput Optimization

  • Kernel Fusion: Combines multiple operations to reduce memory bandwidth
  • Persistent RNN: Reuses computation across decode steps
  • Async Execution: Overlaps attention with other operations

Implementation Details

FlashAttention Kernel Flow

sequenceDiagram
    participant Q as Query
    participant K as Key
    participant V as Value
    participant Kernel as FlashAttn Kernel
    participant Cache as KV Cache
    
    Q->>Kernel: Load Q tiles
    K->>Kernel: Load K tiles
    V->>Kernel: Load V tiles
    Kernel->>Cache: Update (if write)
    Kernel->>Kernel: Compute attention scores
    Kernel->>Cache: Read (if read)
    Kernel->>Kernel: softmax & scale
    Kernel-->>Q: Output attention

Backend Initialization Sequence

graph TD
    A[EngineArgs parse] --> B[Attention backend selection]
    B --> C[Backend registry lookup]
    C --> D[Load backend module]
    D --> E[Initialize CUDA streams]
    E --> F[Allocate persistent buffers]
    F --> G[Register attention layers]
    G --> H[Ready for inference]

Extending Attention Backends

To implement a custom attention backend:

  1. Create backend class: Inherit from AttentionBackend
  2. Implement required methods: get_name(), get_impl()
  3. Register backend: Add to ATTENTION_BACKENDS registry
  4. Implement kernels: Write CUDA/C++ kernels or wrap existing libraries

# Sketch of a custom backend; AttentionBackend and AttentionImpl come from
# vLLM's attention backend interfaces.
class CustomAttentionBackend(AttentionBackend):
    @staticmethod
    def get_name() -> str:
        return "custom_attention"

    @staticmethod
    def get_impl() -> AttentionImpl:
        return CustomAttentionImpl

Troubleshooting

Common Issues

| Issue | Cause | Solution |
|---|---|---|
| CUDA out of memory | Large batch/sequence | Reduce gpu_memory_utilization |
| Incorrect outputs | Wrong backend selected | Verify with --attention-backend flag |
| Kernel launch failure | Unsupported head size | Use supported head dimensions |
| Slow inference | Suboptimal backend | Benchmark available backends |

Debugging Tips

  • Set VLLM_LOGGING_LEVEL=DEBUG for attention kernel timing
  • Use VLLM_ATTENTION_BACKEND=FLASH_ATTN to force specific backend
  • Check nvidia-smi for kernel execution times

Summary

The Attention Backends and Kernels system provides vLLM's computational core for transformer attention. Through the registry pattern, it enables seamless switching between optimized implementations while maintaining a stable interface for the rest of the engine. The pluggable architecture supports both established backends (FlashAttention, FlashInfer) and specialized implementations (FlashMLA) optimized for specific model architectures.

Source: https://github.com/vllm-project/vllm / Human Manual

Quantization Support

Related topics: Model Architecture Support, Attention Backends and Kernels

Overview

vLLM provides comprehensive quantization support to reduce model memory footprint and accelerate inference through various quantization schemes. Quantization compresses model weights from higher precision (typically FP16/BF16) to lower precision formats (such as INT8, INT4, or FP8), enabling larger models to run on limited GPU resources while maintaining acceptable accuracy.

The quantization system in vLLM is designed with a modular, extensible architecture that supports multiple quantization methods through a common abstraction layer. This allows users to easily switch between different quantization schemes or implement custom quantization strategies.

Sources: docs/features/quantization/README.md

Architecture

Quantization System Components

The quantization system consists of several interconnected components:

graph TD
    A[Model Loading] --> B[QuantizationConfig]
    B --> C[QuantizationMethod]
    C --> D[Quantized Linear Layers]
    C --> E[Quantized Embedding Layers]
    D --> F[CUDA/ROCm Kernels]
    E --> F
    G[Weight Loading] --> H[Pre-quantized Weights]
    G --> I[On-the-fly Quantization]

Quantization Base Architecture

The base quantization architecture defines a common interface that all quantization methods must implement:

classDiagram
    class QuantizationConfig {
        +get_supported_methods() List[QuantizationMethods]
        +get_override_quant_config() Optional[dict]
        +get_quant_config() dict
        +verify_quant_config() None
    }
    
    class QuantizationMethods {
        <<enumeration>>
        FP8
        GGUF
        AWQ
        GPTQ
        QUANTITY
    }
    
    class QuantizedLinear {
        <<interface>>
        +create_weights()
        +apply_weights()
    }
    
    QuantizationConfig --> QuantizationMethods
    QuantizationMethods --> QuantizedLinear

Sources: vllm/model_executor/layers/quantization/base_config.py

Layer Structure

Each quantization implementation defines its own quantized layer classes:

| Layer Type | Purpose | Quantization Scope |
|---|---|---|
| QuantizedLinear | Matrix multiplication with quantized weights | Weight-only or Activation+Weight |
| QuantizedEmbedding | Lookup table with quantized weights | Weight-only |
| QuantizedMoE | Mixture-of-Experts with quantized components | Per-expert quantization |

Sources: vllm/model_executor/layers/quantization/fp8.py

Supported Quantization Methods

FP8 (8-bit Floating Point)

FP8 quantization uses 8-bit floating point representation with two formats:

| Format | Exponent Bits | Mantissa Bits | Use Case |
|---|---|---|---|
| E4M3 | 4 | 3 | Activations and weights |
| E5M2 | 5 | 2 | Gradients and optimizer states |

FP8 quantization is particularly well-supported in vLLM with optimized CUDA kernels for inference acceleration.

Sources: vllm/model_executor/layers/quantization/fp8.py

GGUF (GPT-Generated Unified Format)

GGUF is a quantized model format commonly used with llama.cpp. vLLM supports loading GGUF-quantized models directly:

# Load GGUF quantized model
llm = LLM(model="unsloth/Qwen3-0.6B-GGUF:Q4_K_M")

GGUF supports multiple quantization levels:

| Quantization Type | Bits per Parameter | Description |
|---|---|---|
| Q2_K | ~2.5 | 2-bit quantization with 4-bit key-values |
| Q3_K | ~3.5 | 3-bit quantization with 4-bit key-values |
| Q4_K | ~4.5 | 4-bit quantization with 8-bit key-values |
| Q5_K | ~5.5 | 5-bit quantization with 8-bit key-values |
| Q6_K | ~6.5 | 6-bit quantization |
| Q8_0 | ~8.0 | 8-bit quantization (baseline) |

Sources: vllm/model_executor/layers/quantization/gguf.py

AWQ (Activation-Aware Weight Quantization)

AWQ identifies weights with significant activation contributions and preserves them at higher precision while quantizing others aggressively.

GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ performs post-training quantization with optional layer-wise precision adjustment and GPU optimization.
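
Both methods are consumed like any other pre-quantized checkpoint; a minimal sketch, where the repository names are illustrative AWQ/GPTQ checkpoints:

from vllm import LLM

# Pre-quantized checkpoints; repository names are illustrative.
llm_awq = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
llm_gptq = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")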

Quantization Configuration

Configuration Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| quantization | str | Quantization method name | None |
| quantization_param_path | str | Path to quantization parameters file | None |
| dtype | str | Model precision (if not pre-quantized) | auto |
| kv_cache_dtype | str | KV cache quantization format | auto |

Enabling Quantization

Quantization is enabled through the --quantization CLI argument or quantization parameter in AsyncEngineArgs:

vllm serve model/path --quantization fp8

Or, via the Python API:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="fp8",
    gpu_memory_utilization=0.9
)

Mixed Quantization

Some layers may remain in higher precision when quantization cannot be applied uniformly:

| Layer Type | Quantization Behavior |
|---|---|
| Input/Output embeddings | Often kept in FP16/BF16 |
| Output projection | May use different precision |
| Attention softmax | Usually in FP32 for stability |

Implementation Details

Weight Loading Pipeline

graph LR
    A[Model Checkpoint] --> B{Pre-quantized?}
    B -->|Yes| C[Load Quantized Weights]
    B -->|No| D[Dynamic Quantization]
    C --> E[Apply Quantization Config]
    D --> E
    E --> F[Initialize Quantized Layer]
    F --> G[Verify Weight Shape]
    G --> H[CUDA Kernel Ready]

C++ Backend (cuBLAS/cutlass)

High-performance quantization kernels are implemented in C++ and CUDA:

| Component | Location | Purpose |
|---|---|---|
| FP8 GEMM | csrc/quantization/fp8/ | FP8 matrix multiplication |
| W8A8 GEMM | csrc/quantization/fp8/ | INT8 weight, INT8 activation |
| W4A16 GEMM | csrc/quantization/ | INT4 weight, FP16 activation |
| Dequantization | csrc/quantization/ | Convert quantized to compute dtype |

Sources: csrc/quantization

Quantized Linear Layer Implementation

The QuantizedLinear layer handles the core computation:

class QuantizedLinear(QuantizedLayer):
    """Base class for quantized linear layers."""
    
    def __init__(
        self,
        input_size: int,
        output_size: int,
        quantization_config: QuantizationConfig,
        bias: bool = False,
    ):
        self.input_size = input_size
        self.output_size = output_size
        self.quantization_config = quantization_config
        
    def create_weights(self):
        """Initialize quantized weight tensors."""
        raise NotImplementedError
        
    def forward(self, input_):
        """Forward pass with quantized computation."""
        raise NotImplementedError

Sources: vllm/model_executor/layers/quantization/base_config.py

Usage Examples

Loading Pre-quantized Models

from vllm import LLM

# FP8 quantized model
llm_fp8 = LLM(
    model="meta-llama/Llama-2-70b-hf",
    quantization="fp8",
    tensor_parallel_size=4
)

# GGUF quantized model
llm_gguf = LLM(
    model="TheBloke/Llama-2-70B-Chat-GGUF",
    quantization="gguf",
    tokenizer="meta-llama/Llama-2-70b-chat"
)

Quantization with Different Precisions

# Load with specific precision settings
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    quantization="fp8",
    dtype="half",  # Compute precision
    kv_cache_dtype="fp8_e4m3"  # KV cache precision
)

CLI Usage

# Serve with FP8 quantization
vllm serve meta-llama/Llama-2-70b-hf \
    --quantization fp8 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95

# Serve GGUF model directly
vllm serve TheBloke/Llama-2-7B-Chat-GGUF:Q4_K_M

Performance Considerations

Memory Reduction

| Quantization | Memory Reduction | Quality Impact |
|---|---|---|
| FP8 (E4M3) | ~50% | Minimal |
| INT8 | ~50% | Low |
| INT4 | ~75% | Moderate |
| INT2 | ~87.5% | High |
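
A back-of-the-envelope check of the weight-memory column, counting only parameters (activations and KV cache excluded):

# Approximate weight memory for a 7B-parameter model at different precisions.
PARAMS = 7e9
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8/INT8": 1.0, "INT4": 0.5}

for fmt, width in BYTES_PER_PARAM.items():
    print(f"{fmt}: ~{PARAMS * width / 2**30:.1f} GiB")  # ~13.0, ~6.5, ~3.3 GiB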

Throughput Impact

Quantization improves throughput through:

  1. Increased batch size: More sequences fit in GPU memory
  2. Higher memory bandwidth utilization: Smaller weights load faster
  3. Accelerated compute: INT8/FP8 operations on tensor cores

Accuracy Considerations

  • Post-training quantization (PTQ): May introduce accuracy degradation
  • Activation-aware methods (AWQ): Better preserves model capabilities
  • Calibration: Some methods require calibration data for optimal accuracy

Extension Points

Custom Quantization Methods

To implement a custom quantization method, extend the base classes:

from vllm.model_executor.layers.quantization import QuantizationConfig


class CustomQuantizationConfig(QuantizationConfig):
    """Custom quantization configuration (illustrative sketch)."""

    def __init__(self, quant_method: str = "weight_only"):
        self.quant_method = quant_method

    @staticmethod
    def get_name() -> str:
        return "custom_quant"

    @classmethod
    def get_supported_methods(cls) -> list[str]:
        return ["weight_only"]

    def get_quant_config(self) -> dict:
        return {"method": self.quant_method}

Registering Custom Quantization

Custom quantizations must be registered in the quantization registry to be discoverable at runtime.

Summary

vLLM's quantization support provides a flexible, extensible system for serving large language models with reduced memory footprint. The architecture separates concerns through well-defined interfaces, allowing seamless integration of new quantization methods while maintaining high performance through optimized CUDA kernels.

Key takeaways:

  • Multiple quantization formats supported (FP8, GGUF, AWQ, GPTQ)
  • Modular architecture enables easy extension
  • Optimized C++/CUDA kernels for inference acceleration
  • Simple API through CLI and Python interface
  • Memory reduction up to 75% with INT4 quantization

Sources: docs/features/quantization/README.md

Distributed Inference and Parallelism

Related topics: Scheduling and Request Processing, PagedAttention and KV Cache Management

vLLM provides comprehensive support for distributed inference and parallelism, enabling efficient serving of large language models across multiple GPUs and nodes. This document covers the architecture, configuration, and implementation details of vLLM's distributed computing capabilities.

Overview

vLLM's distributed inference system enables horizontal scaling of LLM workloads by distributing model computation across multiple devices. The system supports multiple parallelism strategies, including tensor parallelism, pipeline parallelism, and data parallelism, along with specialized features like disaggregated prefill/decode and KV cache transfer.

The core components of distributed inference in vLLM include:

| Component | Purpose |
|---|---|
| Parallel State Manager | Coordinates process groups for distributed communication |
| Device Communicators | Handle low-level tensor communication (NCCL, CUDA) |
| KV Transfer System | Enables disaggregated prefill/decode architectures |
| Configuration System | Manages parallelism parameters and device placement |

Parallelism Strategies

vLLM supports three primary parallelism strategies, each addressing different aspects of distributed computation.

Tensor Parallelism (TP)

Tensor parallelism splits individual weight matrices across multiple GPUs, allowing computation of large matrices that would not fit in a single device's memory. This is particularly effective for dense layers like attention and feed-forward networks.

Tensor parallelism requires:

  • NVIDIA GPUs with NCCL support
  • High-bandwidth interconnects (NVLink preferred)
  • Frequent collective communication (all-reduce), since each tensor-parallel rank holds only a shard of each weight matrix

Pipeline Parallelism (PP)

Pipeline parallelism distributes layers (stages) of the model across different GPUs or nodes. This approach reduces memory requirements per device while maintaining high GPU utilization through micro-batch pipelining.

Data Parallelism (DP)

Data parallelism replicates the entire model across multiple GPUs, with each replica processing different batches of requests. This is the simplest form of parallelism and scales throughput linearly with the number of replicas.
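
A sketch of combining these strategies; the model name is a placeholder, tensor_parallel_size and pipeline_parallel_size are standard engine arguments, and the data-parallel flag in the comment is assumed from the configuration table later in this section.

from vllm import LLM

# Tensor parallelism across 4 GPUs plus 2 pipeline stages (8 GPUs total).
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)

# Data parallelism is typically configured when serving, e.g.:
#   vllm serve meta-llama/Llama-2-70b-hf --tensor-parallel-size 4 --data-parallel-size 2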

Parallel State Management

The ParallelState class in vllm/distributed/parallel_state.py is the central coordinator for distributed execution.

graph TD
    A[ParallelState] --> B[Tensor Parallel Group]
    A --> C[Pipeline Parallel Group]
    A --> D[Data Parallel Group]
    A --> E[World Communicator]
    B --> F[Rank 0, 1, 2, 3]
    C --> G[Stage 0, Stage 1]
    D --> H[Replica 1, Replica 2]

Process Group Initialization

Parallel state is initialized through init_distributed_environment(), which creates the necessary process groups for communication.

# Adapted from vllm/distributed/parallel_state.py:45-78
import torch.distributed


def init_distributed_environment(
    rank: int,
    world_size: int,
    local_rank: int,
    init_method: str = "env://",
    backend: str = "nccl"
):
    # Initialize distributed context
    torch.distributed.init_process_group(
        backend=backend,
        init_method=init_method,
        rank=rank,
        world_size=world_size
    )

Rank and World Size Management

| Parameter | Description |
|---|---|
| rank | Unique identifier for each process in the distributed group |
| world_size | Total number of processes in the distributed group |
| local_rank | Rank of the process within its local node |

Sources: vllm/distributed/parallel_state.py:45-78

Device Communication

CUDA Communicator

The CUDACommunicator class in vllm/distributed/device_communicators/cuda_communicator.py provides NCCL-based communication primitives optimized for CUDA tensors.

graph LR
    A[Tensor] -->|all_reduce| B[Aggregated Tensor]
    A -->|broadcast| C[Same Tensor on All Ranks]
    A -->|reduce_scatter| D[Partitioned Results]
    E[Partial Tensors] -->|all_gather| F[Complete Tensor]

Supported Communication Primitives

| Primitive | Function | Use Case |
|---|---|---|
| all_reduce | Reduce tensors across all ranks | Gradient synchronization in TP |
| broadcast | Send tensor from one rank to all | Weight updates |
| all_gather | Collect tensors from all ranks | Output aggregation |
| reduce_scatter | Reduce and partition across ranks | Gradient partitioning |

# Adapted from vllm/distributed/device_communicators/cuda_communicator.py:23-65
import torch
import torch.distributed
from torch.distributed import ProcessGroup


class CUDACommunicator:
    def __init__(self, group: ProcessGroup):
        self.group = group
        self.world_size = group.size()
        self.rank = group.rank()

    def all_reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        # NCCL all-reduce implementation
        torch.distributed.all_reduce(tensor, group=self.group)
        return tensor

Sources: vllm/distributed/device_communicators/cuda_communicator.py:23-65

Communication Patterns in Distributed Inference

graph TD
    subgraph "Tensor Parallel Region"
        A[Attention AllReduce] --> B[FFN AllReduce]
        B --> C[AllReduce Output]
    end
    
    subgraph "Pipeline Parallel Region"
        D[Send Hidden States] --> E[Receive Hidden States]
        E --> F[Backward Pass]
    end
    
    subgraph "Data Parallel Region"
        G[Synchronize KV Cache] --> H[Load Balance Requests]
    end

Disaggregated Prefill and Decode

vLLM supports disaggregated prefill/decode architectures where prefill (initial prompt processing) and decode (token generation) stages run on separate GPU clusters. This enables independent scaling of prefill and decode resources.

KV Transfer Architecture

The KV transfer system enables sharing of KV cache between prefill and decode instances.

graph LR
    A[Prefill Instance] -->|KV Transfer| B[Shared Storage]
    B -->|KV Load| C[Decode Instance]
    
    subgraph "Prefill Process"
        A1[Tokenize Prompts] --> A2[Process Prefill]
        A2 --> A3[Save KV Cache]
    end
    
    subgraph "Decode Process"
        C1[Load KV Cache] --> C2[Generate Tokens]
        C2 --> C3[Streaming Output]
    end

Sources: examples/disaggregated/example_connector/README.md

KV Connector Base

The KVConnectorBase class defines the interface for KV cache transfer implementations.

# Adapted from vllm/distributed/kv_transfer/kv_connector/base.py:15-85
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any, Callable, List


class KVConnectorBase(ABC):
    # KVTransferParams, TransferJob, PhysicalTokenBlock, and KVCache are
    # vLLM-internal types referenced by the annotations below.
    def __init__(self, kv_transfer_config: KVTransferParams):
        self.config = kv_transfer_config

    @abstractmethod
    def load_kv_cache(
        self,
        requests: List[TransferJob],
        scheduler: Any
    ) -> None:
        """Load KV cache from external source during prefill"""
        pass

    @abstractmethod
    def save_kv_cache(
        self,
        blocks: List[PhysicalTokenBlock],
        kv_pair: KVCache,
        callback: Callable
    ) -> TransferJob:
        """Save KV cache to external storage during decode"""
        pass

| Method | Purpose |
|---|---|
| load_kv_cache | Load KV cache during prefill phase |
| save_kv_cache | Persist KV cache during decode phase |
| profile_num_available_blocks | Determine available transfer capacity |

Sources: vllm/distributed/kv_transfer/kv_connector/base.py:15-85

Example Connector Implementation

The ExampleConnector provides a reference implementation for KV transfer:

# From examples/disaggregated/example_connector/
class ExampleConnector(KVConnectorBase):
    def __init__(self, kv_transfer_config: KVTransferParams):
        super().__init__(kv_transfer_config)
        self.local_storage = "./local_storage"

The connector workflow:

  1. Prefill Phase: Process prompts and save KV state to local_storage directory
  2. Decode Phase: Load KV state from storage and continue generation

Sources: examples/disaggregated/example_connector/README.md
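
A heavily simplified sketch of a file-backed store in the spirit of this workflow; the class name, file layout, and use of torch.save/torch.load are assumptions for illustration and do not reflect the ExampleConnector internals.

import os
import torch

class FileBackedKVStore:
    """Toy KV store: one file per request id. Illustrative only."""

    def __init__(self, root: str = "./local_storage"):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, request_id: str, kv_tensors: list) -> None:
        # Prefill side: persist computed KV tensors for a later decode instance.
        torch.save(kv_tensors, os.path.join(self.root, f"{request_id}.pt"))

    def load(self, request_id: str) -> list:
        # Decode side: restore the KV tensors and continue generation from them.
        return torch.load(os.path.join(self.root, f"{request_id}.pt"))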

Configuration

Parallel Configuration Parameters

The ParallelConfig class in vllm/config/parallel.py manages all parallelism settings.

| Parameter | Type | Default | Description |
|---|---|---|---|
| tensor_parallel_size | int | 1 | Number of GPUs for tensor parallelism |
| pipeline_parallel_size | int | 1 | Number of pipeline stages |
| data_parallel_size | int | 1 | Number of data parallel replicas |
| data_parallel_size_per_region | int | - | DP size for hybrid parallelism |
| data_parallel_master_port | int | 29500 | Port for DP master communication |
| data_parallel_master_addr | str | - | Address for DP master |
| numa_aware | bool | False | Enable NUMA-aware GPU placement |

# Adapted from vllm/config/parallel.py:10-45
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParallelConfig:
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    data_parallel_size_per_region: Optional[int] = None
    data_parallel_master_port: int = 29500
    data_parallel_master_addr: Optional[str] = None
    data_parallel_standalone: bool = False
    numa_aware: bool = False

Sources: vllm/config/parallel.py:10-45

Environment Variables

| Variable | Description |
|---|---|
| CUDA_VISIBLE_DEVICES | Comma-separated list of GPU IDs to use |
| VLLM_HOST_IP | Host IP for distributed communication |
| VLLM_PORT | Port for worker communication |

Launching Distributed Inference

Multi-GPU Launch with torchrun

torchrun --nproc_per_node=4 \
    --nnodes=1 \
    vllm/entrypoints/llm.py \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4

Multi-Node Launch

torchrun --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=10.0.0.1 \
    --master_port=29500 \
    vllm/entrypoints/llm.py \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 4

Scaling Strategies

Scaling Guidelines

| Model Size | GPU Memory | Recommended Configuration |
|---|---|---|
| 7B | 24GB | TP=1, DP=single node |
| 13B | 48GB | TP=2, DP=single node |
| 70B | 320GB+ | TP=4 or 8, PP=2+ |
| 405B | 800GB+ | TP=8, PP=8, multi-node |

Sources: docs/serving/parallelism_scaling.md

Disaggregated Prefill Scaling

For disaggregated prefill/decode, consider:

| Workload Pattern | Prefill Resources | Decode Resources |
|---|---|---|
| Short prompts, many requests | Scale prefill | Scale decode |
| Long prompts, few requests | Scale prefill with TP | Scale decode |
| Mixed workload | Balance both | Balance both |

Scaling Best Practices

  1. Start with tensor parallelism for intra-node scaling
  2. Add pipeline parallelism for multi-node deployments
  3. Use data parallelism to increase throughput on same-stage workloads
  4. Enable disaggregation when prefill and decode have different resource needs

Sources: docs/serving/parallelism_scaling.md

Architecture Diagram

graph TB
    subgraph "vLLM Distributed Architecture"
        subgraph "Process 0"
            P0_M[Model Shard 0]
            P0_S[Scheduler]
            P0_C[Cache Engine]
        end
        
        subgraph "Process 1"
            P1_M[Model Shard 1]
            P1_S[Scheduler]
            P1_C[Cache Engine]
        end
        
        subgraph "Process N"
            PN_M[Model Shard N]
            PN_S[Scheduler]
            PN_C[Cache Engine]
        end
        
        NCCL[NCCL AllReduce]
        
        P0_S <-->|NCCL| P1_S
        P1_S <-->|NCCL| PN_S
        P0_S <-->|NCCL| PN_S
        
        P0_M <-->|Forward Pass| P0_S
        P1_M <-->|Forward Pass| P1_S
        PN_M <-->|Forward Pass| PN_S
    end
    
    subgraph "External Services"
        KV[KV Transfer Service]
        Redis[(Redis/Storage)]
    end
    
    P0_C <-->|Save KV| Redis
    PN_C <-->|Load KV| Redis
    Redis <--> KV

See Also

Sources: [vllm/distributed/parallel_state.py:45-78](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py)

Model Architecture Support

Related topics: Model Executor and Worker Architecture, Quantization Support

Overview

vLLM provides comprehensive support for various LLM architectures, enabling users to serve, fine-tune, and run inference with a wide range of transformer-based models. The system is designed to be architecture-agnostic while providing optimized implementations for popular model families.

Core Model Loading Architecture

Python API (Offline Inference)

The primary Python interface for running offline inference is the LLM class, which handles model loading and inference without requiring a separate inference server.

# Basic usage example
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B")
output = llm.generate("Hello, world!")

Sources: examples/basic/offline_inference/README.md

CLI-based Model Serving

The vLLM CLI provides a serve subcommand for HTTP-based model serving:

vllm serve Qwen/Qwen3-0.6B

If no model is specified, the CLI defaults to Qwen/Qwen3-0.6B.

Sources: vllm/entrypoints/cli/serve.py

Supported Model Categories

Language Models

vLLM supports a broad range of causal language models including:

  • Decoder-only transformers: Standard autoregressive models
  • Mixture of Experts (MoE): Sparse architectures like Mixtral
  • Multimodal models: Models that process multiple input types

Quantization Support

vLLM supports quantized models through GGUF format, enabling deployment of compressed models with reduced memory footprint.

Example loading a quantized model directly from HuggingFace:

--model unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B

Sources: examples/basic/offline_inference/README.md

Model Configuration

Generation Configuration

The --generation-config argument specifies where the generation config is loaded from:

| Value | Source | Description |
|---|---|---|
| auto | Model path | Loads from model's configuration directory |
| <folder_path> | Local folder | Loads from specified directory |
| Not provided | vLLM defaults | Uses built-in default parameters |

If max_new_tokens is specified in the generation config, it sets a server-wide limit on output tokens for all requests.

Sources: examples/basic/offline_inference/README.md

Engine Arguments

Model configuration is controlled through AsyncEngineArgs, which processes CLI arguments and creates the model configuration:

from vllm.engine.arg_utils import AsyncEngineArgs

# args comes from the CLI argument parser
engine_args = AsyncEngineArgs.from_cli_args(args)
model_config = engine_args.create_model_config()

Sources: vllm/entrypoints/cli/launch.py

Structured Outputs

vLLM supports structured output generation for models that support it, including reasoning models:

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --reasoning-parser deepseek_r1

This enables compliance with output format constraints defined by the model.

Sources: examples/features/structured_outputs/README.md
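
As a sketch, a client can request JSON-constrained output from the OpenAI-compatible server via vLLM's extra_body extensions; the guided_json field name and the schema below are assumptions and may differ between vLLM versions.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}, "confidence": {"type": "number"}},
    "required": ["answer"],
}

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Is the sky blue? Answer as JSON."}],
    extra_body={"guided_json": schema},  # vLLM-specific extension; name assumed
)
print(completion.choices[0].message.content)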

CPU Offload Support

For models that exceed available GPU memory, vLLM provides CPU offload capabilities:

--cpu-offload-gb 10

This creates virtual GPU memory by offloading portions of the model to CPU RAM. For example, with a 24GB GPU and 10GB offload, you can effectively load a 13B model requiring ~26GB.

Note: This requires fast CPU-GPU interconnect for acceptable performance.

Sources: examples/basic/offline_inference/README.md
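
A sketch of the equivalent Python configuration, assuming cpu_offload_gb mirrors the CLI flag above; the model name is a placeholder.

from vllm import LLM

# 24 GB GPU + 10 GB offloaded to CPU RAM gives roughly 34 GB of weight capacity.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # placeholder model
    cpu_offload_gb=10,                  # mirrors --cpu-offload-gb 10
)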

Model Registry Architecture

The vLLM CLI uses a modular command structure where model support is registered through subcommand modules:

for cmd_module in CMD_MODULES:
    new_cmds = cmd_module.cmd_init()
    for cmd in new_cmds:
        cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)

Each registered command includes validation logic to ensure model configurations are valid before execution.

Sources: vllm/entrypoints/cli/main.py

Serving Modes

Standard API Server

The default serving mode starts an HTTP server with OpenAI-compatible endpoints:

vllm serve <model_name>

Headless Mode

For distributed deployments, headless mode skips API server initialization:

if args.headless:
    if args.api_server_count is not None and args.api_server_count > 0:
        raise ValueError(
            f"--api-server-count={args.api_server_count} cannot be "
            "used with --headless (no API servers are started in "
            "headless mode)."
        )
    args.api_server_count = 0

Sources: vllm/entrypoints/cli/serve.py

gRPC Server Mode

For high-performance scenarios, vLLM supports gRPC-based serving:

if getattr(args, "grpc", False):
    from vllm.entrypoints.grpc_server import serve_grpc
    uvloop.run(serve_grpc(args))

Sources: vllm/entrypoints/cli/serve.py

Data Parallel Modes

vLLM supports distributed model serving through multiple load balancing strategies:

| Mode | Flag | Description |
|---|---|---|
| External LB | --data-parallel-external-lb or --data-parallel-rank | External load balancer manages request distribution |
| Hybrid LB | --data-parallel-hybrid-lb or --data-parallel-start-rank | Hybrid approach with internal and external coordination |

The system auto-detects load balancing mode to set appropriate default values for api_server_count.

Sources: vllm/entrypoints/cli/serve.py

Installation and Quickstart

For users getting started with model architecture support:

  1. Install vLLM following the installation guide
  2. Review the quickstart documentation
  3. Check the list of supported models

Sources: README.md

Architecture Flow Diagram

graph TD
    A[User Request] --> B{CLI or Python API?}
    B -->|CLI| C[vllm serve command]
    B -->|Python| D[LLM class instantiation]
    C --> E[CLISubcommand processing]
    D --> F[AsyncEngineArgs configuration]
    E --> G[Model Registry Lookup]
    F --> H[Model Config Creation]
    G --> I[Load Model Architecture]
    H --> I
    I --> J{Quantization?}
    J -->|GGUF| K[Load Quantized Weights]
    J -->|BF16/FP8| L[Load Standard Weights]
    K --> M[PagedAttention Engine]
    L --> M
    M --> N[Inference Execution]

Key Implementation Files

| Component | File Path | Purpose |
|---|---|---|
| CLI Entry | vllm/entrypoints/cli/main.py | Main CLI dispatcher and argument parsing |
| Serve Command | vllm/entrypoints/cli/serve.py | HTTP server startup and configuration |
| Launch Layer | vllm/entrypoints/cli/launch.py | FastAPI-based serving layer |
| Offline Inference | examples/basic/offline_inference/ | Python API usage examples |
| Structured Outputs | examples/features/structured_outputs/ | Advanced output formatting |

Sources: [examples/basic/offline_inference/README.md](https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/README.md)

Doramagic Pitfall Log

Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: [Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: [Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/42182

2. Installation risk: [Bug]: vLLM v1 with prefix caching: first request differs from subsequent identical requests at temperature=0

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: [Bug]: vLLM v1 with prefix caching: first request differs from subsequent identical requests at temperature=0. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/40896

3. Installation risk: [Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_l…

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: [Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_l…. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/42207

4. Installation risk: v0.18.1

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v0.18.1. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/releases/tag/v0.18.1

5. Capability assumption: [Feature]: Qwen3.5-Moe LoRA Support (experts)

  • Severity: medium
  • Finding: Capability assumption is backed by a source signal: [Feature]: Qwen3.5-Moe LoRA Support (experts). Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/40005

6. Capability assumption: README/documentation is current enough for a first validation pass.

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: capability.assumptions | github_repo:599547518 | https://github.com/vllm-project/vllm | README/documentation is current enough for a first validation pass.

7. Project risk: v0.20.2

  • Severity: medium
  • Finding: Project risk is backed by a source signal: v0.20.2. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/releases/tag/v0.20.2

8. Maintenance risk: Maintainer activity is unknown

  • Severity: medium
  • Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | github_repo:599547518 | https://github.com/vllm-project/vllm | last_activity_observed missing

9. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: downstream_validation.risk_items | github_repo:599547518 | https://github.com/vllm-project/vllm | no_demo; severity=medium

10. Security or permission risk: No sandbox install has been executed yet; downstream must verify before user use.

  • Severity: medium
  • Finding: No sandbox install has been executed yet; downstream must verify before user use.
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: risks.safety_notes | github_repo:599547518 | https://github.com/vllm-project/vllm | No sandbox install has been executed yet; downstream must verify before user use.

11. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: risks.scoring_risks | github_repo:599547518 | https://github.com/vllm-project/vllm | no_demo; severity=medium

12. Security or permission risk: [Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B / A100

  • Severity: medium
  • Finding: Security or permission risk is backed by a source signal: [Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B / A100. Treat it as a review item until the current version is checked.
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vllm-project/vllm/issues/41758

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using vllm with real data or production workflows.

  • [[Bug]: vLLM v1 with prefix caching: first request differs from subsequen](https://github.com/vllm-project/vllm/issues/40896) - github / github_issue
  • [[AMD][CI Failure][Tracker] Static dashboard tracker for current CI failu](https://github.com/vllm-project/vllm/issues/40554) - github / github_issue
  • [[Usage]: How to proactively clear CPU-resident memory left behind by unl](https://github.com/vllm-project/vllm/issues/42207) - github / github_issue
  • [[Feature]: Qwen3.5-Moe LoRA Support (experts)](https://github.com/vllm-project/vllm/issues/40005) - github / github_issue
  • [[Bug]: ngram speculative decoding changes greedy output on Qwen3-0.6B /](https://github.com/vllm-project/vllm/issues/41758) - github / github_issue
  • [[Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async sch](https://github.com/vllm-project/vllm/issues/42182) - github / github_issue
  • v0.20.2 - github / github_release
  • v0.20.1 - github / github_release
  • v0.20.0 - github / github_release
  • v0.19.1 - github / github_release
  • v0.19.0 - github / github_release
  • v0.18.1 - github / github_release

Source: Project Pack community evidence and pitfall evidence