# https://github.com/antoinezambelli/forge Project Documentation

Generated: 2026-05-19 20:57:03 UTC

## Table of Contents

- [Introduction to Forge](#page-introduction)
- [Installation Guide](#page-installation)
- [Quick Start Guide](#page-quickstart)
- [System Architecture](#page-architecture)
- [Module Structure and API](#page-module-structure)
- [Architecture Decision Records](#page-adr-index)
- [WorkflowRunner and Agentic Loop](#page-workflowrunner)
- [Guardrails Middleware for External Loops](#page-guardrails-middleware)
- [Proxy Server Setup](#page-proxy-server)
- [Backend Clients](#page-backend-clients)
- [Backend Setup Guide](#page-backend-setup)
- [Model Selection Guide](#page-model-guide)

<a id='page-introduction'></a>

## Introduction to Forge

### Related Pages

Related topics: [System Architecture](#page-architecture), [Quick Start Guide](#page-quickstart)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)
- [src/forge/guardrails/guardrails.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)
- [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)
- [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)
- [src/forge/errors.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/errors.py)
- [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
- [examples/foreign_loop.py](https://github.com/antoinezambelli/forge/blob/main/examples/foreign_loop.py)
- [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)
</details>

Forge is a framework-agnostic LLM orchestration library that provides structured tool-calling workflows, guardrail enforcement, and context management for building reliable AI agents. It supports multiple LLM backends (Ollama, Llamafile, llama.cpp) and exposes both high-level runners and granular APIs for embedding into foreign orchestration loops.

## Overview

Forge addresses the core challenges of LLM-based agent development:

| Challenge | Forge Solution |
|-----------|----------------|
| Unreliable tool parsing | Rescue parsing with retry mechanisms |
| Missing required steps | StepEnforcer validates call sequences |
| Context overflow | Tiered context compaction strategies |
| Model-specific sampling | Per-model verified sampling defaults |
| Multi-backend support | Unified client abstraction layer |

Source: [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)

## Core Architecture

Forge follows a layered architecture with clear separation of concerns:

```mermaid
graph TD
    subgraph "User Layer"
        A[User Workflow] --> B[WorkflowRunner]
    end
    
    subgraph "Core Layer"
        B --> C[ContextManager]
        B --> D[LLMClient]
        B --> E[Guardrails]
    end
    
    subgraph "Client Layer"
        D --> F[OllamaClient]
        D --> G[LlamafileClient]
        D --> H[AnthropicClient]
    end
    
    subgraph "Backend Layer"
        F --> I[Ollama Backend]
        G --> J[Llamafile Backend]
        H --> K[Anthropic API]
    end
```

Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

### Project Structure

```
src/forge/           # Library source
  clients/           # LLM backend adapters (one per backend)
  core/              # Workflow, runner, messages, steps
  context/           # Context management and compaction
  prompts/           # Prompt templates and nudges
  guardrails/        # Response validation and step enforcement
  proxy/             # OpenAI-compatible proxy server
```

Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

## Quick Start

The fundamental building blocks of a Forge workflow:

```python
import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())
```

Source: [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)

## Core Components

### Workflow

The `Workflow` class defines the structure and constraints of an agent task:

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | `str` | Workflow identifier |
| `description` | `str` | Human-readable description |
| `tools` | `dict[str, ToolDef]` | Tool definitions keyed by name |
| `required_steps` | `list[str]` | Tools that must precede terminal tool |
| `terminal_tool` | `str` | Tool(s) that can end the workflow |

Source: [src/forge/core/workflow.py:1-50](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)

### ToolDef

Binds a tool schema to its implementation:

```python
@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)
```

The `prerequisites` field supports:
- **String entries**: Name-only requirements ("read_file" — any prior call satisfies)
- **Dict entries**: Arg-matched requirements (`{"tool": "read_file", "match_arg": "path"}`)

Source: [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)

### LLM Clients

Forge provides backend-specific clients:

| Client | Backend | Features |
|--------|---------|----------|
| `OllamaClient` | Ollama API | recommended_sampling, async streaming |
| `LlamafileClient` | Llamafile binary | Context length detection, reasoning extraction |
| `AnthropicClient` | Anthropic API | Native Claude support |

All clients support `send()` and `send_stream()` methods with sampling parameter overrides:

```python
# Instance-level sampling
client = OllamaClient(model="qwen3:8b", recommended_sampling=True)

# Per-call override (merged without mutation)
await client.send(messages, sampling={"temperature": 0.5})
```

Source: [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)

### Per-Model Sampling Defaults

The `sampling_defaults` module provides verified per-model sampling parameters sourced from HuggingFace model cards:

```python
def get_sampling_defaults(model: str) -> dict[str, float | int]:
    """Pure lookup - returns copy of map value or {} for unknown models."""
    return dict(MODEL_SAMPLING_DEFAULTS.get(model, {}))

def apply_sampling_defaults(model: str, *, strict: bool) -> dict[str, float | int]:
    """Policy layer with four-quadrant behavior."""
```

| strict | model in map | behavior |
|--------|--------------|----------|
| True | yes | return dict |
| True | no | raise `UnsupportedModelError` |
| False | yes | one-shot INFO log; return `{}` |
| False | no | return `{}` (silent) |

Source: [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)

## Guardrails System

Forge's guardrails provide multi-layered response validation:

```mermaid
graph LR
    A[LLM Response] --> B[ResponseValidator]
    B --> C{Valid ToolCall?}
    C -->|No| D[Rescue Parsing]
    D --> E{Rescued?}
    E -->|Yes| F[Return ToolCall]
    E -->|No| G[Retry Nudge]
    C -->|Yes| H[StepEnforcer]
    H --> I{Valid Sequence?}
    I -->|Yes| J[ErrorTracker]
    I -->|No| K[Step Blocked Nudge]
    J --> L{Error Limit?}
    L -->|Yes| M[Fatal Error]
    L -->|No| N[Execute]
```

Source: [src/forge/guardrails/guardrails.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)

### Guardrails API

```python
class Guardrails:
    def __init__(
        self,
        tool_names: list[str],
        terminal_tool: str | frozenset[str],
        required_steps: list[str] | None = None,
        max_retries: int = 3,
        max_tool_errors: int = 2,
        rescue_enabled: bool = True,
        max_premature_attempts: int = 3,
        retry_nudge: Callable[[str], str] | None = None,
    ) -> None:
```

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_retries` | 3 | Consecutive bad responses before fatal |
| `max_tool_errors` | 2 | Tool execution failures before exhaustion |
| `max_premature_attempts` | 3 | Premature terminal attempts before fatal |
| `rescue_enabled` | True | Parse tool calls from plain text |
| `retry_nudge` | None | Custom nudge for bare text responses |

Source: [src/forge/guardrails/guardrails.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)

### CheckResult Actions

```python
@dataclass
class CheckResult:
    action: Literal["execute", "retry", "step_blocked", "fatal"]
    tool_calls: list[ToolCall] | None = None
    nudge: Nudge | None = None  # Set when action is "retry" or "step_blocked"
    reason: str | None = None  # Only when action == "fatal"
```

Source: [src/forge/guardrails/guardrails.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)

## Context Management

The `ContextManager` handles token budget enforcement and message compaction:

| Strategy | Purpose |
|----------|---------|
| `TieredCompact` | Keeps recent N messages, compacts older ones |
| Custom strategies | Pluggable compaction algorithms |

```python
ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
```

Source: [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
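
The "keep recent N, compact older ones" idea can be sketched in a few lines. This is not the `TieredCompact` algorithm, whose tiers are not described on this page; the `compact` function, the dict-based message format, and truncation as the compaction step are all assumptions chosen to keep the example small.

```python
def compact(
    messages: list[dict[str, str]],
    keep_recent: int,
    max_chars: int = 40,
) -> list[dict[str, str]]:
    """Keep the last `keep_recent` messages verbatim; truncate older long ones."""
    cutoff = max(len(messages) - keep_recent, 0)
    compacted = []
    for index, msg in enumerate(messages):
        is_old = index < cutoff
        # The system prompt is never shortened in this sketch.
        if is_old and msg["role"] != "system" and len(msg["content"]) > max_chars:
            msg = {**msg, "content": msg["content"][:max_chars] + " [truncated]"}
        compacted.append(msg)
    return compacted


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the report."},
    {"role": "tool", "content": "x" * 500},
    {"role": "assistant", "content": "Reading the next section."},
    {"role": "tool", "content": "y" * 500},
]
compacted = compact(history, keep_recent=2)
print([len(m["content"]) for m in compacted])
```

With `keep_recent=2`, only the older 500-character tool result is shortened; the two most recent messages and the system prompt pass through unchanged.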

## Server Manager

For local GGUF model execution, `ServerManager` handles backend lifecycle:

```python
server, ctx = await forge.serve(
    backend="llamaserver",
    gguf_path="/path/to/model.gguf",
    mode=BudgetMode.FORGE_FAST,
    client=client,
)
runner = WorkflowRunner(client=client, context_manager=ctx)
```

| Parameter | Description |
|-----------|-------------|
| `backend` | "ollama", "llamaserver", or "llamafile" |
| `gguf_path` | Path to GGUF file (not for Ollama) |
| `mode` | Budget mode (FORGE_FAST, FORGE_BALANCED, FORGE_DEEP) |
| `n_slots` | Concurrent slots for multi-agent |
| `kv_unified` | Single shared KV cache across slots |

Source: [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)

## Error Handling

Forge defines a structured exception hierarchy:

```python
class ForgeError(Exception):  # Base exception
    pass

class UnsupportedModelError(ForgeError):
    """raised when recommended_sampling=True for unknown model."""
    
class ToolCallError(ForgeError):
    """LLM failed to produce valid tool call after retries."""
    
class ToolExecutionError(ForgeError):
    """Tool callable raised during execution."""
```

Source: [src/forge/errors.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/errors.py)

## Foreign Loop Integration

Forge can be embedded into existing orchestration systems using the `Guardrails` middleware API:

```python
from forge.guardrails import Guardrails

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response(response):
    result = guardrails.check(response)
    
    if result.action == "fatal":
        return f"FATAL: {result.reason}"
    
    if result.action in ("retry", "step_blocked"):
        return f"{result.action}: {result.nudge.content[:80]}..."
    
    # Execute tools, then record
    executed = [tc.tool for tc in result.tool_calls]
    done = guardrails.record(executed)
    return f"executed {executed}" + (" -- DONE" if done else "")
```

Source: [examples/foreign_loop.py](https://github.com/antoinezambelli/forge/blob/main/examples/foreign_loop.py)

### Granular Component Access

For fine-grained control, access components directly:

```python
from forge.guardrails import ErrorTracker, ResponseValidator, StepEnforcer

validator = ResponseValidator(tool_names=["search", "lookup", "answer"], rescue_enabled=True)
enforcer = StepEnforcer(required_steps=["search", "lookup"], terminal_tool="answer")
errors = ErrorTracker(max_retries=3, max_tool_errors=2)
```

Source: [examples/foreign_loop.py](https://github.com/antoinezambelli/forge/blob/main/examples/foreign_loop.py)

## Code Style and Requirements

| Requirement | Details |
|-------------|---------|
| Python | 3.12+ |
| Async | `asyncio` throughout — all client methods and runner are async |
| Type Safety | Pydantic for tool parameter schemas |
| Type Syntax | Modern unions with `\|` |

Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

## Running Tests

```bash
# Full suite (865 tests)
python -m pytest tests/unit/ -v --tb=short

# With coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing

# Skip integration tests (require live backend)
python -m pytest tests/ -m "not integration"
```

Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

---

<a id='page-installation'></a>

## Installation Guide

### Related Pages

Related topics: [Backend Setup Guide](#page-backend-setup), [Quick Start Guide](#page-quickstart)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)
- [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)
- [CHANGELOG.md](https://github.com/antoinezambelli/forge/blob/main/CHANGELOG.md)
- [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
- [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)
- [src/forge/proxy/__main__.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)
</details>

This guide covers all aspects of setting up the **forge** framework, from basic installation to advanced configuration for different LLM backends.

## Overview

Forge is a Python library for building LLM-powered workflows with automatic tool calling, context management, and multi-backend support. The installation process involves:

1. Python environment setup (Python 3.12+)
2. Package installation via pip
3. LLM backend selection and configuration
4. Optional development environment for contributors

**Source: [CONTRIBUTING.md:1-15](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)**

## Prerequisites

### System Requirements

| Component | Requirement |
|-----------|-------------|
| Python | 3.12+ |
| OS | Linux, macOS, Windows |
| LLM Backend | Ollama, llama-server, or llamafile |
| VRAM | Varies by model (8GB minimum for small models) |

### Backend Options

Forge supports three LLM backends:

| Backend | Description | Use Case |
|---------|-------------|----------|
| **Ollama** | Local model management with simple API | Quick setup, model management |
| **llama-server** | llama.cpp server binary | Production, GGUF files |
| **llamafile** | Single-file executable models | Distribution, portability |

Each backend requires different installation steps:

- **Ollama**: Install Ollama, then fetch models with `ollama pull`
- **llama-server**: Download llama.cpp server binary
- **llamafile**: Download pre-built executables or convert GGUF files

**Source: [src/forge/server.py:1-50](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)**

## Installation Methods

### Standard Installation

For end users wanting to use forge as a library:

```bash
pip install forge
```

This installs the core package with all necessary dependencies.

### Development Installation

For contributors and those wanting to modify the codebase:

```bash
git clone https://github.com/antoinezambelli/forge.git
cd forge
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or: .venv\Scripts\activate  # Windows
pip install -e ".[dev]"
```

**Source: [CONTRIBUTING.md:7-14](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)**

The `.[dev]` extras include:

- Testing dependencies (`pytest`)
- Development tools
- Documentation build tools

### Package Configuration

The project uses `pyproject.toml` for dependency management:

```toml
[project]
name = "forge"
version = "0.5.0"
requires-python = ">=3.12"

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "pytest-asyncio",
    "pytest-cov",
]
```

## Environment Setup

### Virtual Environment

Using a virtual environment is **strongly recommended** to avoid dependency conflicts:

```bash
python -m venv forge-env
source forge-env/bin/activate
```

### Backend Installation

#### Ollama Setup

1. Install Ollama from [ollama.ai](https://ollama.ai)
2. Pull a model:

```bash
ollama pull ministral-3:8b-instruct-2512-q4_K_M
```

3. Verify the installation:

```bash
ollama list
```

#### llama-server Setup

1. Download the llama.cpp server binary for your platform
2. Place the binary in your models directory or system PATH
3. Verify with:

```bash
./llama-server --help
```

#### llamafile Setup

1. Download a pre-built llamafile (e.g., from TheBloke)
2. Make it executable:

```bash
chmod +x model-name.Q4_K_M.llamafile
```

**Source: [src/forge/server.py:180-220](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)**

## Configuration Flow

```mermaid
graph TD
    A[Install forge] --> B{Use Case}
    B -->|Library| C[pip install forge]
    B -->|Development| D["pip install -e .[dev]"]
    C --> E[Choose Backend]
    D --> E
    E -->|Ollama| F[Install Ollama + Pull Model]
    E -->|llama-server| G[Download llama.cpp binary]
    E -->|llamafile| H[Download llamafile]
    F --> I[Verify with test script]
    G --> I
    H --> I
```

## Quick Start Verification

After installation, verify your setup with this minimal example:

```python
import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())
```

**Source: [README.md:1-60](https://github.com/antoinezambelli/forge/blob/main/README.md)**

## Advanced Backend Configuration

### KV Cache Quantization

Forge supports KV cache quantization to reduce VRAM usage:

| Setting | VRAM Savings | Quality Impact |
|---------|--------------|----------------|
| `q8_0` | ~50% vs F16 | Minimal |
| `q4_0` | ~75% vs F16 | Low |

```python
# In setup_backend call
from forge.server import setup_backend, BudgetMode

server, ctx = await setup_backend(
    backend="llamaserver",
    gguf_path="/path/to/model.gguf",
    budget_mode=BudgetMode.FORGE_FAST,
    cache_type_k="q4_0",  # Key cache quantization
    cache_type_v="q4_0",   # Value cache quantization
)
```

**Source: [src/forge/server.py:20-45](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)**
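
The savings in the table follow from how many bytes each cached value occupies. The calculation below is a rough sketch: the model shape is hypothetical, and the byte counts assume GGML's block layouts, where `q8_0` packs 32 values into 34 bytes and `q4_0` packs 32 values into 18 bytes.

```python
# Bytes per cached value. F16 stores 2 bytes; q8_0 and q4_0 pack blocks of
# 32 values into 34 and 18 bytes (quantized payload plus one fp16 scale).
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(
    n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int, cache_type: str
) -> float:
    """Size of the K and V caches together for one full context window."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2: keys and values
    return values * BYTES_PER_VALUE[cache_type]


# Hypothetical 8B-class shape, chosen only to make the numbers concrete.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
baseline = kv_cache_bytes(**shape, cache_type="f16")
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(**shape, cache_type=cache_type)
    print(f"{cache_type}: {size / 2**30:.2f} GiB, {1 - size / baseline:.0%} saved")
```

For this shape the script reports 1.00 GiB at F16, about 47% saved with `q8_0`, and about 72% saved with `q4_0`, which is in line with the rounded figures in the table.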

### Multi-Slot Configuration

For multi-agent architectures:

```python
server, ctx = await setup_backend(
    backend="llamaserver",
    gguf_path="/path/to/model.gguf",
    n_slots=4,           # Number of concurrent slots
    kv_unified=True,     # Shared KV cache pool
)
```

When `kv_unified=True`, all slots share a single KV cache pool, allowing each slot to use the full context window.

**Source: [src/forge/server.py:40-55](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)**
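
The effect of `kv_unified` on per-slot capacity can be shown with simple arithmetic. This sketch assumes that, without a unified cache, the server divides the context window evenly across slots (the usual llama.cpp server behavior); `context_per_slot` is a hypothetical helper, not part of Forge.

```python
def context_per_slot(n_ctx: int, n_slots: int, kv_unified: bool) -> int:
    """Upper bound on the tokens a single slot can hold."""
    if kv_unified:
        # One shared pool: a busy slot can grow while the others are idle.
        return n_ctx
    # Split cache: every slot gets a fixed, equal share.
    return n_ctx // n_slots


print(context_per_slot(32768, 4, kv_unified=False))  # 8192
print(context_per_slot(32768, 4, kv_unified=True))   # 32768
```

In the unified case the total across all slots is still bounded by `n_ctx`, so four fully loaded agents cannot each hold the full window at the same time.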

## Recommended Sampling Parameters

Forge provides recommended sampling defaults for specific models:

```python
client = OllamaClient(
    model="qwen3:8b-q4_K_M",
    recommended_sampling=True  # Enable recommended defaults
)
```

The `recommended_sampling=True` parameter enables tuned temperature, top_p, top_k, and other sampling parameters sourced from HuggingFace model cards.

**Source: [src/forge/clients/sampling_defaults.py:1-80](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)**

## Testing Your Installation

### Unit Tests

Run the deterministic unit test suite (no backend required):

```bash
python -m pytest tests/unit/ -v --tb=short
```

**Source: [CONTRIBUTING.md:18-25](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)**

### Integration Tests

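Integration tests require a running backend. Without one, skip them with the first command below, or run only the unit suite with coverage using the second: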

```bash
# Skip integration tests
python -m pytest tests/ -m "not integration"

# Run with coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing
```

## Troubleshooting

### Common Issues

| Issue | Solution |
|-------|----------|
| `ModuleNotFoundError: forge` | Run `pip install forge` or check virtual environment |
| Backend connection refused | Verify backend is running on correct port |
| Model not found (Ollama) | Run `ollama pull <model-name>` |
| VRAM out of memory | Enable KV cache quantization or use smaller model |

### Backend Health Check

Verify backend connectivity:

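```python
import asyncio

import httpx


async def check_backend(url: str = "http://localhost:8080/props") -> None:
    """Probe the backend's /props endpoint and report whether it responds."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            resp = await client.get(url)
    except httpx.HTTPError:
        # Covers connection failures and timeouts alike.
        print("Backend not reachable")
        return
    if resp.status_code == 200:
        print("Backend is ready")
    else:
        print(f"Backend responded with HTTP {resp.status_code}")


asyncio.run(check_backend())
```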

**Source: [src/forge/server.py:250-280](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)**

## Next Steps

After installation, refer to:

- [User Guide](docs/USER_GUIDE.md) — Complete workflow creation and execution
- [Backend Setup Guide](docs/BACKEND_SETUP.md) — Detailed backend configuration
- [Model Guide](docs/MODEL_GUIDE.md) — Recommended models by hardware tier
- [Architecture Decision Records](docs/decisions/) — Design rationale documentation

---

<a id='page-quickstart'></a>

## Quick Start Guide

### Related Pages

Related topics: [WorkflowRunner and Agentic Loop](#page-workflowrunner), [System Architecture](#page-architecture)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)
- [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)
- [src/forge/core/runner.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/runner.py)
- [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
- [examples/foreign_loop.py](https://github.com/antoinezambelli/forge/blob/main/examples/foreign_loop.py)
- [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)
</details>

This guide provides a practical introduction to **Forge**, a reliability layer for self-hosted LLM tool-calling. Forge is designed to make an 8B local model markedly more reliable on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction).

## Purpose and Scope

The Quick Start Guide covers:

- **Installation and environment setup** for local LLM backends
- **Core concepts** including Workflows, ToolDefs, and the WorkflowRunner
- **Basic usage patterns** for single-step and multi-step agentic workflows
- **Integration options** for foreign orchestration loops
- **Backend management** with auto-start capabilities

Forge targets developers building agentic applications that require structured tool-calling with local models. It works with llama.cpp-based backends (llama-server, llamafile) and Ollama.

Source: [README.md:1-20](https://github.com/antoinezambelli/forge/blob/main/README.md)

## Installation

### Prerequisites

| Requirement | Version | Notes |
|-------------|---------|-------|
| Python | 3.12+ | Modern syntax required (type unions with `\|`) |
| pip | Latest | For package installation |
| LLM Backend | llama.cpp / Ollama | For inference |

### Setup Commands

```bash
git clone https://github.com/antoinezambelli/forge.git
cd forge
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or: .venv\Scripts\activate  # Windows
pip install -e ".[dev]"
```

Source: [CONTRIBUTING.md:1-15](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

## Core Concepts

### Architecture Overview

```mermaid
graph TD
    A[User Input] --> B[WorkflowRunner]
    B --> C[LLM Client]
    C --> D[Tool Call Response]
    D --> E[Guardrails Check]
    E -->|execute| F[Tool Execution]
    F --> G[Context Manager]
    G -->|compact| B
    E -->|retry| C
    E -->|fatal| H[Error Handling]
```

### Workflow

The `Workflow` is the central definition for an agentic task. It binds together:

| Component | Type | Purpose |
|-----------|------|---------|
| `name` | `str` | Workflow identifier |
| `description` | `str` | Human-readable description |
| `tools` | `dict[str, ToolDef]` | Tool name → definition mapping |
| `required_steps` | `list[str]` | Tools that must execute before terminal |
| `terminal_tool` | `str` | Tool that ends the workflow |
| `system_prompt_template` | `str` | System prompt for the LLM |

Source: [src/forge/core/workflow.py:1-50](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)

### ToolDef and ToolSpec

**ToolSpec** defines the schema exposed to the LLM:

```python
class ToolSpec(BaseModel):
    name: str
    description: str
    parameters: type[BaseModel]  # Pydantic model
```

**ToolDef** binds the schema to its Python implementation:

```python
@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)
```

The `prerequisites` field enables conditional dependencies:
- `str`: "if you call this tool, you must have called tool X first"
- `dict`: `{"tool": "read_file", "match_arg": "path"}` — arg-matched prerequisites

Source: [src/forge/core/workflow.py:60-90](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)

### WorkflowRunner

The `WorkflowRunner` manages the full lifecycle:

- System prompt injection
- Tool execution and result handling
- Context compaction
- Guardrail enforcement
- Multi-turn conversation state

```python
class WorkflowRunner:
    def __init__(self, client, context_manager):
        ...
    
    async def run(self, workflow, user_message):
        ...
```

Source: [src/forge/core/runner.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/runner.py)
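
The lifecycle listed above can be pictured as a simple loop. The following is a toy sketch rather than the real `WorkflowRunner`: `ScriptedClient`, `run_loop`, the dict-based tool-call format, and the unknown-tool nudge are all hypothetical, and context compaction is left out to keep it short.

```python
import asyncio
from typing import Any, Callable


class ScriptedClient:
    """Stand-in for an LLM client: replays a fixed list of tool calls."""

    def __init__(self, script: list[dict[str, Any]]) -> None:
        self._script = list(script)

    async def send(self, messages: list[dict[str, str]]) -> dict[str, Any]:
        return self._script.pop(0)


async def run_loop(
    client: ScriptedClient,
    tools: dict[str, Callable[..., str]],
    terminal_tool: str,
    system_prompt: str,
    user_message: str,
    max_turns: int = 8,
) -> str:
    # System prompt injection, followed by the user's request.
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_turns):
        call = await client.send(messages)
        if call["tool"] not in tools:
            # Guardrail stand-in: nudge the model instead of executing.
            messages.append({"role": "user", "content": f"Unknown tool {call['tool']!r}."})
            continue
        result = tools[call["tool"]](**call["args"])
        messages.append({"role": "tool", "content": result})
        if call["tool"] == terminal_tool:
            return result
    raise RuntimeError("turn limit reached without a terminal tool call")


tools = {"get_weather": lambda city: f"Sunny in {city}"}
client = ScriptedClient([
    {"tool": "get_wether", "args": {"city": "Paris"}},   # misspelled on purpose
    {"tool": "get_weather", "args": {"city": "Paris"}},
])
print(asyncio.run(run_loop(client, tools, "get_weather", "Use the tools.", "Weather in Paris?")))
# Sunny in Paris
```

The first scripted reply names a tool that does not exist, so the loop appends a nudge and asks again; the second reply hits the terminal tool and ends the run.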

### ContextManager and Budget Modes

Forge provides VRAM-aware context management through budget modes:

| Mode | Behavior |
|------|----------|
| `BudgetMode.MANUAL` | User-specified token budget |
| `BudgetMode.FORGE_FAST` | VRAM-optimized fast inference budget |
| `BudgetMode.FORGE_DEEP` | Extended context for complex reasoning |

The `ContextManager` resolves budgets at runtime based on the backend:

```python
async def resolve_budget(self, mode: BudgetMode, manual_tokens: int | None = None) -> int:
    if mode == BudgetMode.MANUAL:
        if self._backend == "ollama":
            return manual_tokens
        return await self.get_server_context()
```

Source: [src/forge/server.py:80-120](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)

## Quick Start Example

### Basic Single-Tool Workflow

```python
import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())
```

Source: [README.md:20-60](https://github.com/antoinezambelli/forge/blob/main/README.md)

### Key Components Explained

| Component | Import | Purpose |
|-----------|--------|---------|
| `OllamaClient` | `forge` | LLM backend adapter for Ollama |
| `TieredCompact` | `forge` | Context compaction strategy |
| `ContextManager` | `forge` | Token budget management |
| `WorkflowRunner` | `forge` | Orchestrates the agent loop |

## Multi-Step Workflows

For workflows requiring sequential tool execution:

```python
# Define multi-step workflow with prerequisites
workflow = Workflow(
    name="research_assistant",
    description="Research and answer questions",
    tools={
        "search": ToolDef(spec=search_spec, callable=do_search),
        "lookup": ToolDef(spec=lookup_spec, callable=do_lookup),
        "answer": ToolDef(spec=answer_spec, callable=final_answer),
    },
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
```

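The `required_steps` list enforces that `search` and `lookup` must execute before `answer`. Calling the terminal tool prematurely produces a `step_blocked` result carrying a nudge, and repeated premature attempts (bounded by `max_premature_attempts`) become fatal.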

Source: [examples/foreign_loop.py:1-50](https://github.com/antoinezambelli/forge/blob/main/examples/foreign_loop.py)
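
The gating rule described above can be sketched as a small class. `StepEnforcerSketch` is a hypothetical stand-in, not Forge's `StepEnforcer`; in particular it only checks that each required step has run at least once and does not enforce an order among them, which may differ from the real component.

```python
class StepEnforcerSketch:
    """Blocks the terminal tool until every required step has been recorded."""

    def __init__(self, required_steps: list[str], terminal_tool: str) -> None:
        self.required = list(required_steps)
        self.terminal_tool = terminal_tool
        self.completed: set[str] = set()

    def missing(self) -> list[str]:
        """Required steps that have not run yet, in declaration order."""
        return [step for step in self.required if step not in self.completed]

    def allows(self, tool: str) -> bool:
        # Only the terminal tool is gated; other tools may run at any time.
        return tool != self.terminal_tool or not self.missing()

    def record(self, tool: str) -> None:
        self.completed.add(tool)


enforcer = StepEnforcerSketch(["search", "lookup"], terminal_tool="answer")
print(enforcer.allows("answer"), enforcer.missing())  # False ['search', 'lookup']
enforcer.record("search")
enforcer.record("lookup")
print(enforcer.allows("answer"), enforcer.missing())  # True []
```

A caller would use `missing()` to build the nudge text, so the model is told exactly which steps it still owes before it may answer.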

## Guardrails API

For foreign orchestration loops (non-WorkflowRunner usage), Forge provides standalone guardrails:

### Simple API

```python
from forge.guardrails import Guardrails

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response(response):
    result = guardrails.check(response)
    
    if result.action == "fatal":
        return f"FATAL: {result.reason}"
    
    if result.action in ("retry", "step_blocked"):
        return f"{result.action}: {result.nudge.content[:80]}..."
    
    # Execute tools
    executed = [tc.tool for tc in result.tool_calls]
    done = guardrails.record(executed)
    return f"executed {executed}" + (" -- DONE" if done else "")
```

### Granular API

Direct access to individual components:

```python
from forge.guardrails import ErrorTracker, ResponseValidator, StepEnforcer

validator = ResponseValidator(
    tool_names=["search", "lookup", "answer"],
    rescue_enabled=True,
)
enforcer = StepEnforcer(
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
errors = ErrorTracker(max_retries=3, max_tool_errors=2)
```

| Component | Purpose |
|-----------|---------|
| `ResponseValidator` | Parses tool calls from LLM responses, rescue mode |
| `StepEnforcer` | Enforces required step sequence |
| `ErrorTracker` | Tracks retry attempts and tool errors |

Source: [examples/foreign_loop.py:80-150](https://github.com/antoinezambelli/forge/blob/main/examples/foreign_loop.py)
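To make the retry and tool-error limits concrete, here is a minimal sketch of consecutive-failure counting. `MiniErrorTracker` and its method names are hypothetical and are not forge's `ErrorTracker` API. The sketch assumes that a limit counts as exhausted once the streak reaches the configured value, which may differ from forge's exact boundary.

```python
from dataclasses import dataclass


@dataclass
class MiniErrorTracker:
    """Illustrative stand-in (not forge's ErrorTracker)."""

    max_retries: int = 3
    max_tool_errors: int = 2
    bad_responses: int = 0
    tool_errors: int = 0

    def record_bad_response(self) -> bool:
        # Extend the bad-response streak; True means the retry budget is spent.
        self.bad_responses += 1
        return self.bad_responses >= self.max_retries

    def record_good_response(self) -> None:
        # A valid response ends the bad-response streak.
        self.bad_responses = 0

    def record_tool_result(self, ok: bool) -> bool:
        # A successful tool call resets the streak; a failure extends it.
        self.tool_errors = 0 if ok else self.tool_errors + 1
        return self.tool_errors >= self.max_tool_errors


tracker = MiniErrorTracker(max_retries=3, max_tool_errors=2)
retry_flags = [tracker.record_bad_response() for _ in range(3)]
print(retry_flags)
```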

## Respond Tool Pattern

For conversational turns where the model should respond directly:

```python
from forge.tools import RESPOND_TOOL_NAME, respond_spec

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer", RESPOND_TOOL_NAME],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response_with_respond(response):
    result = guardrails.check(response)
    
    # Check for respond() call (tool_calls may be None on retry/fatal results)
    for tc in result.tool_calls or []:
        if tc.tool == RESPOND_TOOL_NAME:
            message = tc.args.get("message", "")
            return f"MODEL SAYS: {message}"
    
    # Normal tool execution
    ...
```

Source: [examples/foreign_loop.py:160-200](https://github.com/antoinezambelli/forge/blob/main/examples/foreign_loop.py)

## Backend Auto-Management

Forge can auto-start backends for multi-agent architectures:

```python
import asyncio

from forge import WorkflowRunner
from forge.server import run_with_server
from forge.clients import LlamafileClient, BudgetMode


async def main():
    # `async with` must run inside a coroutine.
    async with run_with_server(
        backend="llamafile",
        gguf_path="/path/to/model.gguf",
        budget_mode=BudgetMode.FORGE_FAST,
    ) as (server, ctx):
        client = LlamafileClient(model="my-model")
        runner = WorkflowRunner(client=client, context_manager=ctx)
        # Run workflows...


asyncio.run(main())
```

### Backend Options

| Backend | Model Source | GGUF Support |
|---------|-------------|--------------|
| `ollama` | Model name (e.g., `ministral-3:8b`) | No |
| `llamaserver` | GGUF file path | Yes |
| `llamafile` | GGUF file path | Yes |

Source: [src/forge/server.py:1-80](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)

## Recommended Sampling Parameters

Forge provides curated sampling defaults for supported models:

```python
from forge.clients import OllamaClient, get_sampling_defaults

# Opt-in to recommended sampling
client = OllamaClient(
    model="ministral-3:8b-q4_K_M",
    recommended_sampling=True  # Raises UnsupportedModelError for unknown models
)
```

| Parameter | Source | Verification |
|-----------|--------|--------------|
| `temperature` | HF model cards | Per-model verification |
| `top_p` | HF model cards | Per-model verification |
| `top_k` | HF model cards | Per-model verification |
| `min_p` | HF model cards | Per-model verification |
| `repeat_penalty` | HF model cards | Per-model verification |

Source: [src/forge/clients/sampling_defaults.py:1-50](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)

## Testing

### Unit Tests (No Backend Required)

```bash
# Full suite (865 tests)
python -m pytest tests/unit/ -v --tb=short

# With coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing

# Single file
python -m pytest tests/unit/test_runner.py -v
```

### Integration Tests (Requires Backend)

```bash
# Skip integration tests
python -m pytest tests/ -m "not integration"
```

Source: [CONTRIBUTING.md:15-30](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

## Next Steps

| Topic | Description |
|-------|-------------|
| [User Guide](docs/USER_GUIDE.md) | Multi-step workflows, long-running sessions |
| [Model Guide](docs/MODEL_GUIDE.md) | Model-specific configurations |
| [Architecture Decisions](docs/decisions/) | Design rationale and ADRs |
| [Eval Suite](tests/eval/) | Performance evaluation methodology |

---

<a id='page-architecture'></a>

## System Architecture

### Related Pages

Related topics: [WorkflowRunner and Agentic Loop](#page-workflowrunner)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)
- [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)
- [src/forge/guardrails/guardrails.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)
- [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)
- [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
- [examples/foreign_loop.py](https://github.com/antoinezambelli/forge/blob/main/examples/foreign_loop.py)
</details>

# System Architecture

## Overview

Forge is an LLM agent framework that orchestrates multi-step tool-calling workflows with built-in guardrails, context management, and automatic backend server management. The architecture separates concerns cleanly: clients abstract LLM backends, workflows define agent behavior, guardrails enforce execution policies, and the context manager handles token budgeting. Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

The framework is designed for determinism in unit tests and supports three backend types: Ollama, llama-server, and llamafile. Source: [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)

## Project Layout

```
src/forge/           # Library source
  clients/           # LLM backend adapters (one per backend)
  core/              # Workflow, runner, messages, steps
  context/           # Context management and compaction
  prompts/           # Prompt templates and nudges
tests/
  unit/              # Deterministic tests
  eval/              # Eval harness (requires live backends)
    scenarios/       # Eval scenario definitions
    dashboard/       # React-based HTML dashboard (separate npm build)
docs/                # User-facing documentation
  decisions/         # Architecture Decision Records (ADRs)
  results/           # Eval results and raw data tables
```

Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

## Core Architecture Components

### Architecture Diagram

```mermaid
graph TD
    User["User / Application"]
    Runner["WorkflowRunner"]
    Workflow["Workflow"]
    Guardrails["Guardrails"]
    ContextMgr["ContextManager"]
    Client["LLM Client"]
    ServerMgr["ServerManager"]
    
    User --> Runner
    Runner --> Workflow
    Runner --> Guardrails
    Runner --> ContextMgr
    Runner --> Client
    Client --> ServerMgr
    
    subgraph "forge Library"
        Runner
        Workflow
        Guardrails
        ContextMgr
        Client
    end
    
    subgraph "Backend"
        ServerMgr
    end
```

### LLM Clients

The client layer abstracts different LLM backends behind a common async interface. Each client handles backend-specific protocol differences.

| Client | Backend | Protocol |
|--------|---------|----------|
| `OllamaClient` | Ollama | OpenAI-compatible REST |
| `LlamafileClient` | Llamafile | OpenAI-compatible REST |
| `AnthropicClient` | Anthropic API | Anthropic native |
| `OpenAIClient` | OpenAI API | OpenAI native |

Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

#### Sampling Defaults

Each client can optionally apply recommended sampling parameters sourced from HuggingFace model cards. The policy layer provides four-quadrant behavior:

| `strict` | Model in map | Behavior |
|----------|--------------|----------|
| `True` | Yes | Return dict |
| `True` | No | Raise `UnsupportedModelError` |
| `False` | Yes | One-shot INFO log; return `{}` |
| `False` | No | Return `{}` (silent) |

Source: [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)

Clients no longer ship hardcoded temperature defaults. With `recommended_sampling=False` (the default), forge sends no sampling parameters and the backend's own defaults apply. Source: [CHANGELOG.md](https://github.com/antoinezambelli/forge/blob/main/CHANGELOG.md)
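The four-quadrant table can be read as a small decision function. The sketch below is an illustration under stated assumptions, not the code in `sampling_defaults.py`: the `RECOMMENDED` table, its model name and values, the `sampling_policy` function name, and the locally defined `UnsupportedModelError` are all placeholders.

```python
import logging

logger = logging.getLogger("sampling_sketch")

# Hypothetical table: the model name and values are placeholders, not forge's data.
RECOMMENDED = {"example-model:8b": {"temperature": 0.7, "top_p": 0.9}}
_already_logged = set()


class UnsupportedModelError(Exception):
    """Local stand-in for the error named in the table."""


def sampling_policy(model: str, *, strict: bool) -> dict:
    known = model in RECOMMENDED
    if strict:
        if not known:
            # strict + unknown model: refuse rather than guess.
            raise UnsupportedModelError(model)
        # strict + known model: return a copy so callers cannot mutate the table.
        return dict(RECOMMENDED[model])
    if known and model not in _already_logged:
        # non-strict + known model: mention once that defaults exist.
        _already_logged.add(model)
        logger.info("recommended sampling exists for %s but is not enabled", model)
    # non-strict: send nothing, so the backend's own defaults apply.
    return {}


print(sampling_policy("example-model:8b", strict=True))
print(sampling_policy("example-model:8b", strict=False))
```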

### Workflow System

The `Workflow` class defines an agent's behavior declaratively:

```python
workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(...),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools.",
)
```

Source: [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)

#### Tool Definition

`ToolDef` binds a tool schema to its implementation:

```python
@dataclass
class ToolDef:
    """Binds a tool schema to its implementation."""
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)
```

Prerequisites express conditional dependencies:
- `str`: Name-only (`"read_file"` — any prior call satisfies it)
- `dict`: Arg-matched (`{"tool": "read_file", "match_arg": "path"}`)

Source: [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)
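The difference between the two prerequisite forms is easiest to see in a small checker. `prerequisites_met` is a hypothetical helper written for this page, not a forge function. It assumes the call history is available as `(tool_name, args)` pairs and that an arg-matched prerequisite compares the named argument of the current call with earlier calls.

```python
def prerequisites_met(prerequisites, history, args):
    """Return True if every prerequisite is satisfied by earlier calls.

    history is a list of (tool_name, args_dict) pairs for calls already made;
    args holds the arguments of the call being checked.
    """
    for prereq in prerequisites:
        if isinstance(prereq, str):
            # Name-only: any earlier call to that tool is enough.
            ok = any(name == prereq for name, _ in history)
        else:
            # Arg-matched: an earlier call to prereq["tool"] must have used the
            # same value for prereq["match_arg"] as the current call.
            key = prereq["match_arg"]
            ok = key in args and any(
                name == prereq["tool"] and past.get(key) == args[key]
                for name, past in history
            )
        if not ok:
            return False
    return True


history = [("read_file", {"path": "a.txt"})]
rule = [{"tool": "read_file", "match_arg": "path"}]
same_path = prerequisites_met(rule, history, {"path": "a.txt"})
other_path = prerequisites_met(rule, history, {"path": "b.txt"})
name_only = prerequisites_met(["read_file"], history, {"path": "b.txt"})
print(same_path, other_path, name_only)
```

With the arg-matched rule, reading `a.txt` does not unlock an operation on `b.txt`, while the name-only rule accepts any earlier `read_file` call.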

### Guardrails System

Guardrails enforce execution policies through three coordinated components:

```mermaid
graph LR
    Response["LLM Response"] --> Validator["ResponseValidator"]
    Validator --> Enforcer["StepEnforcer"]
    Enforcer --> Tracker["ErrorTracker"]
    
    Validator --> Rescue["rescue parsing"]
    Enforcer --> Steps["required steps"]
    Tracker --> Limits["retry limits"]
```

#### Guardrails Configuration

| Parameter | Purpose | Default |
|-----------|---------|---------|
| `tool_names` | List of available tools | Required |
| `terminal_tool` | Final allowed tool | Required |
| `required_steps` | Ordered prerequisite chain | `None` |
| `max_retries` | Consecutive bad responses before fatal | 3 |
| `max_tool_errors` | Consecutive tool failures | 2 |
| `rescue_enabled` | Enable XML rescue parsing | `True` |
| `max_premature_attempts` | Premature terminal attempts before fatal | 3 |

Source: [src/forge/guardrails/guardrails.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)

#### Check Result Actions

The `check()` method returns a `CheckResult` with these actions:

| Action | Meaning |
|--------|---------|
| `execute` | Response passes all guardrails; run the returned tool calls |
| `retry` | Invalid response, apply nudge and retry |
| `step_blocked` | Missing required step |
| `fatal` | Max retries exceeded |

### Context Manager

The `ContextManager` handles token budgeting and context compaction to prevent context overflow during long conversations. Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

#### Budget Resolution

Budget is resolved based on the mode:

| Mode | Resolution Strategy |
|------|---------------------|
| `MANUAL` | Use `manual_tokens` parameter or query server |
| `FORGE_FAST` | Server-reported context / 4 |
| `FORGE_BALANCED` | Server-reported context / 2 |
| `FORGE_DEEP` | Server-reported context * 3 / 4 |

For Ollama backends, the context length is obtained from `ollama show`. For llama-server and llamafile, a `/props` query retrieves the actual `n_ctx`. Source: [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
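As a worked illustration of the fractions in the table, the sketch below computes a budget from a server-reported context length. It is not forge's `resolve_budget()`: `BudgetModeSketch` and `resolve_budget_sketch` are hypothetical names, the `"forge_balanced"` string value is a placeholder, and the use of integer division and the `MANUAL` fallback to `n_ctx` are assumptions.

```python
from enum import Enum
from typing import Optional


class BudgetModeSketch(Enum):
    """Local stand-in mirroring the mode names in the table."""

    MANUAL = "manual"
    FORGE_FAST = "forge_fast"
    FORGE_BALANCED = "forge_balanced"  # placeholder value
    FORGE_DEEP = "forge_deep"


def resolve_budget_sketch(
    mode: BudgetModeSketch, n_ctx: int, manual_tokens: Optional[int] = None
) -> int:
    if mode is BudgetModeSketch.MANUAL:
        # Prefer the explicit value; otherwise use the server-reported context.
        return manual_tokens if manual_tokens is not None else n_ctx
    if mode is BudgetModeSketch.FORGE_FAST:
        return n_ctx // 4
    if mode is BudgetModeSketch.FORGE_BALANCED:
        return n_ctx // 2
    if mode is BudgetModeSketch.FORGE_DEEP:
        return n_ctx * 3 // 4
    raise ValueError(f"unknown mode: {mode}")


budgets = {m.name: resolve_budget_sketch(m, n_ctx=8192) for m in BudgetModeSketch}
print(budgets)
```

For a server reporting an 8192-token context, the fractions give 2048, 4096, and 6144 tokens for the fast, balanced, and deep modes.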

### Server Manager

`ServerManager` handles lifecycle management of backend servers (llama-server and llamafile only; Ollama is managed externally).

```python
server = ServerManager(backend="llamaserver", port=8080)
context, ctx_mgr = await server.start_with_budget(
    model="qwen3:8b-q4_K_M",
    budget_mode=BudgetMode.FORGE_FAST,
    client=client,
)
```

#### Server Configuration Parameters

| Parameter | Description | Backend |
|-----------|-------------|---------|
| `model` | Model identity for server | All |
| `gguf_path` | Path to GGUF file | llamaserver/llamafile |
| `mode` | Operation mode | All |
| `extra_flags` | Additional CLI flags | llamaserver/llamafile |
| `ctx_override` | Override context length (`-c value`) | llamaserver/llamafile |
| `cache_type_k` | KV cache quantization type for keys | llamaserver |
| `cache_type_v` | KV cache quantization type for values | llamaserver |
| `n_slots` | Concurrent slots count | llamaserver |
| `kv_unified` | Single unified KV cache | llamaserver |

The server reuses an existing process if the same configuration is requested, avoiding unnecessary restarts. Source: [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
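The reuse behavior can be sketched as a comparison between the requested configuration and the one currently running. Everything here is hypothetical (`ServerConfigSketch`, `ReusingManagerSketch`, and the chosen fields); it shows the idea of configuration equality, not how `ServerManager` tracks its process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServerConfigSketch:
    """Hypothetical subset of the server parameters, compared by value."""

    model: str
    gguf_path: str
    ctx_override: Optional[int] = None


class ReusingManagerSketch:
    def __init__(self) -> None:
        self.current: Optional[ServerConfigSketch] = None
        self.starts = 0

    def ensure_running(self, config: ServerConfigSketch) -> bool:
        """Return True if a (re)start was needed, False if the process was reused."""
        if config == self.current:
            return False  # same configuration: keep the existing process
        self.current = config
        self.starts += 1  # a real manager would stop and start a process here
        return True


manager = ReusingManagerSketch()
base = ServerConfigSketch(model="my-model", gguf_path="/path/to/model.gguf")
first = manager.ensure_running(base)
second = manager.ensure_running(
    ServerConfigSketch(model="my-model", gguf_path="/path/to/model.gguf")
)
third = manager.ensure_running(
    ServerConfigSketch(model="my-model", gguf_path="/path/to/model.gguf", ctx_override=4096)
)
print(first, second, third, manager.starts)
```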

## Execution Flow

```mermaid
sequenceDiagram
    participant User
    participant Runner
    participant Workflow
    participant Guardrails
    participant Context
    participant Client
    
    User->>Runner: run(workflow, input)
    Runner->>Context: begin_session()
    Runner->>Workflow: Get system prompt
    Runner->>Client: send(messages)
    
    loop Until terminal or max iterations
        Client-->>Runner: LLMResponse
        Runner->>Guardrails: check(response)
        Guardrails-->>Runner: CheckResult
        
        alt execute
            Runner->>Runner: Execute tools
            Runner->>Context: append(messages)
            Runner->>Client: send(messages)
        else retry
            Runner->>Runner: Apply nudge, retry
        else fatal
            Runner->>User: Return error
        end
    end
    
    Runner-->>User: Final result
```

## Workflow Runner Integration

The `WorkflowRunner` orchestrates all components:

```python
async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")
```

Source: [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)

## Design Principles

### Async-First

All client methods and the runner are async, enabling efficient I/O handling across multiple concurrent requests. Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)
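The benefit of an async interface is that several requests can wait on I/O at the same time. The sketch below uses only the standard library; `fake_send` is a hypothetical stand-in for an async client call and does not contact any backend.

```python
import asyncio


async def fake_send(prompt: str, delay: float = 0.01) -> str:
    # Stand-in for an async client call; the sleep models network latency.
    await asyncio.sleep(delay)
    return f"reply to: {prompt}"


async def run_concurrently(prompts):
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_send(p) for p in prompts))


replies = asyncio.run(run_concurrently(["a", "b", "c"]))
print(replies)
```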

### Type Safety

Pydantic is used for tool parameter schemas and response validation, ensuring runtime type safety for tool arguments. Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

### Modern Python

The codebase targets Python 3.12+ and uses modern syntax including:
- Type unions with `|` (e.g., `str | None`)
- `dataclass` decorators
- `field(default_factory=list)` patterns

Source: [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)

## Guardrails Configuration Examples

```python
from forge.guardrails import Guardrails
from forge.tools import RESPOND_TOOL_NAME

# Basic guardrails
guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

# With respond tool for middleware
respond_guardrails = Guardrails(
    tool_names=["search", "lookup", "answer", RESPOND_TOOL_NAME],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
```

Source: [examples/foreign_loop.py](https://github.com/antoinezambelli/forge/blob/main/examples/foreign_loop.py)

## Related Documentation

- [User Guide](docs/USER_GUIDE.md) — Multi-step workflows and backend auto-management
- [Model Guide](docs/MODEL_GUIDE.md) — Model recommendations by tier
- [Architecture Decisions](docs/decisions/) — Design rationale for significant changes

---

<a id='page-module-structure'></a>

## Module Structure and API

### Related Pages

Related topics: [System Architecture](#page-architecture), [WorkflowRunner and Agentic Loop](#page-workflowrunner)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/forge/__init__.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/__init__.py)
- [src/forge/core/messages.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/messages.py)
- [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)
- [src/forge/errors.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/errors.py)
- [src/forge/guardrails/guardrails.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)
- [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
- [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)
- [src/forge/clients/llamafile.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/llamafile.py)
</details>

# Module Structure and API

## Overview

The forge repository implements a modular LLM orchestration framework designed to handle multi-step tool-calling workflows with built-in guardrails, context management, and support for multiple LLM backends. The architecture follows a clean separation of concerns with distinct modules for client handling, workflow orchestration, guardrails enforcement, and server management.

## Core Architecture

The forge library is organized into a layered architecture that separates backend communication, workflow definition, execution enforcement, and context management.

```mermaid
graph TD
    A[User Code] --> B[WorkflowRunner]
    B --> C[LLM Clients]
    C --> D[llama.cpp / Ollama / Llamafile]
    B --> E[Guardrails]
    B --> F[ContextManager]
    E --> G[ResponseValidator]
    E --> H[StepEnforcer]
    E --> I[ErrorTracker]
```

## Module Hierarchy

| Module | Purpose | Key Classes |
|--------|---------|-------------|
| `forge.core` | Core workflow orchestration | `Workflow`, `WorkflowRunner`, `ToolSpec`, `ToolDef`, `ToolCall` |
| `forge.clients` | LLM backend adapters | `OllamaClient`, `LlamafileClient`, `LlamaServerClient` |
| `forge.guardrails` | Response validation and enforcement | `Guardrails`, `ResponseValidator`, `StepEnforcer`, `ErrorTracker` |
| `forge.context` | Token budget and context management | `ContextManager`, `TieredCompact` |
| `forge.server` | Backend server lifecycle management | `ServerManager`, `BudgetMode` |

Source: [src/forge/__init__.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/__init__.py)

## Tool Definition API

### ToolSpec

The `ToolSpec` class defines the interface for tools that the LLM can invoke. It wraps a Pydantic model representing the tool's parameters.

```python
class ToolSpec(BaseModel):
    name: str
    description: str
    parameters: type[BaseModel]
```

**Construction from OpenAI Schema:**

Tools can be defined from an OpenAI-style JSON Schema:

```python
tool_spec = ToolSpec.from_openai_schema(
    name="get_weather",
    description="Get current weather for a city",
    schema={
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
    }
)
```

Source: [src/forge/core/workflow.py:1-50](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)

### ToolDef

The `ToolDef` dataclass binds a tool schema to its implementation callable, along with prerequisites:

```python
@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)
```

**Prerequisites Syntax:**

Prerequisites express conditional dependencies between tool calls:

| Type | Example | Behavior |
|------|---------|----------|
| String (name-only) | `"read_file"` | Any prior call to `read_file` satisfies it |
| Dict (arg-matched) | `{"tool": "read_file", "match_arg": "path"}` | Prior call with same `path` value required |

```python
tool_def = ToolDef(
    spec=tool_spec,
    callable=get_weather_function,
    prerequisites=[{"tool": "search", "match_arg": "query"}]
)
```

Source: [src/forge/core/workflow.py:52-72](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)

## Workflow Definition API

### Workflow Class

The `Workflow` class is the central configuration object for a multi-step LLM task:

```python
workflow = Workflow(
    name="weather",
    description="Look up weather for a city",
    tools={
        "get_weather": ToolDef(spec=..., callable=get_weather)
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant."
)
```

**Key Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | `str` | Yes | Workflow identifier |
| `description` | `str` | Yes | Human-readable description for the LLM |
| `tools` | `dict[str, ToolDef]` | Yes | Map of tool name to ToolDef |
| `required_steps` | `list[str]` | No | Tools that must be called before terminal_tool |
| `terminal_tool` | `str` | Yes | Tool(s) that can end the workflow |
| `system_prompt_template` | `str` | No | System prompt injected into context |

Source: [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)

## LLM Client API

### Client Architecture

Forge provides backend-agnostic client adapters that implement a common interface:

```mermaid
graph LR
    A[WorkflowRunner] --> B[Client Interface]
    B --> C[OllamaClient]
    B --> D[LlamafileClient]
    B --> E[LlamaServerClient]
```

### OllamaClient

```python
client = OllamaClient(
    model="ministral-3:8b-instruct-2512-q4_K_M",
    recommended_sampling=True
)
```

### Sampling Defaults

Per-model recommended sampling parameters are managed through `sampling_defaults.py`:

```python
def apply_sampling_defaults(
    model: str,
    *,
    strict: bool,
) -> dict[str, float | int]:
    """Apply the recommended-sampling policy for model."""
```

**Sampling Policy Quadrant:**

| `strict` | Model in Map | Behavior |
|----------|-------------|----------|
| `True` | Yes | Return dict copy |
| `True` | No | Raise `UnsupportedModelError` |
| `False` | Yes | One-shot INFO log; return `{}` |
| `False` | No | Return `{}` (silent) |

Source: [src/forge/clients/sampling_defaults.py:1-80](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)

### ToolCall Response Model

The `ToolCall` class represents a validated tool invocation returned by an LLM client:

```python
class ToolCall(BaseModel):
    tool: str
```

Additional fields may be populated by client implementations (e.g., `args`, `reasoning`).

Source: [src/forge/core/workflow.py:74-77](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)

## Guardrails API

The guardrails system provides middleware for orchestrating LLM responses with built-in validation, step enforcement, and error handling.

### Guardrails Class

The main entry point for the guardrails system:

```python
guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
    max_retries=3,
    max_tool_errors=2,
    rescue_enabled=True,
    max_premature_attempts=3
)
```

**Constructor Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tool_names` | `list[str]` | Required | Valid tool names for this workflow |
| `required_steps` | `list[str]` | `None` | Tools that must be called before terminal_tool |
| `terminal_tool` | `str \| frozenset` | Required | Tool(s) that can end the workflow |
| `max_retries` | `int` | `3` | Consecutive bad responses before fatal |
| `max_tool_errors` | `int` | `2` | Consecutive tool failures before exhaustion |
| `rescue_enabled` | `bool` | `True` | Attempt to parse tool calls from plain text |
| `max_premature_attempts` | `int` | `3` | Premature terminal attempts before fatal |
| `retry_nudge` | `Callable[[str], str]` | `None` | Custom nudge for bare text responses |

Source: [src/forge/guardrails/guardrails.py:1-80](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)
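The idea behind `rescue_enabled` is to recover a tool call that the model wrote as plain text rather than as a structured call. The sketch below shows one way such recovery could work. The `<tool_call>` tag format, the `name` and `arguments` keys, and the `rescue_tool_calls` function are assumptions made for illustration and may not match what forge's `ResponseValidator` accepts.

```python
import json
import re

# Hypothetical wire format: a JSON object wrapped in <tool_call> tags.
_TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def rescue_tool_calls(text, tool_names):
    """Extract (name, arguments) pairs for known tools embedded in plain text."""
    calls = []
    for raw in _TOOL_CALL.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: nothing to rescue from this block
        if isinstance(payload, dict) and payload.get("name") in tool_names:
            calls.append((payload["name"], payload.get("arguments", {})))
    return calls


reply = (
    "I will search first.\n"
    '<tool_call>{"name": "search", "arguments": {"query": "forge"}}</tool_call>'
)
rescued = rescue_tool_calls(reply, ["search", "lookup", "answer"])
print(rescued)
```

Calls that name a tool outside `tool_names` are ignored here. In a guardrails loop that case would fall through to a retry.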

### CheckResult

The return type of `Guardrails.check()`:

```python
class CheckResult:
    action: Literal["execute", "retry", "step_blocked", "fatal"]
    tool_calls: list[ToolCall] | None
    nudge: Nudge | None
    reason: str | None
```

**Action Meanings:**

| Action | Description |
|--------|-------------|
| `execute` | Safe to proceed; `tool_calls` contains valid calls |
| `retry` | Invalid response; inject `nudge` and retry |
| `step_blocked` | Attempted terminal tool before required steps |
| `fatal` | Max retries exhausted; `reason` contains explanation |

### Two-Method Guardrails API

```python
# After each LLM response
result = guardrails.check(response)

if result.action == "fatal":
    return f"FATAL: {result.reason}"

if result.action in ("retry", "step_blocked"):
    return f"{result.action}: {result.nudge.content}"

# result.action == "execute"
# Run tools yourself, then record results
tool_calls = result.tool_calls
executed = [tc.tool for tc in tool_calls]
done = guardrails.record(executed)
```

Source: [src/forge/guardrails/guardrails.py:82-130](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)

### Granular API

For advanced use cases, individual guardrail components can be used directly:

```python
from forge.guardrails import ResponseValidator, StepEnforcer, ErrorTracker

validator = ResponseValidator(
    tool_names=["search", "lookup", "answer"],
    rescue_enabled=True,
)
enforcer = StepEnforcer(
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
errors = ErrorTracker(max_retries=3, max_tool_errors=2)
```

## Server Management API

### ServerManager

The `ServerManager` class handles lifecycle management for llama.cpp-based backends:

```python
server = ServerManager(backend="llamaserver", port=8080)
```

**Backend Options:**

| Backend | Description |
|---------|-------------|
| `"ollama"` | Ollama server (model name, no GGUF path) |
| `"llamaserver"` | llama.cpp server via `llama-server` |
| `"llamafile"` | Mozilla llamafile binary |

### Budget Resolution

The `resolve_budget()` method determines context length based on mode:

```python
async def resolve_budget(
    self,
    mode: BudgetMode,
    manual_tokens: int | None = None,
) -> int:
```

| Mode | Behavior |
|------|----------|
| `MANUAL` | Use `manual_tokens` directly |
| `FORGE_FAST` / `FORGE_DEEP` | Query server `/props` for context |

Source: [src/forge/server.py:1-100](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)

## Context Management

### ContextManager

Token budget management for long-running conversations:

```python
ctx = ContextManager(
    strategy=TieredCompact(keep_recent=2),
    budget_tokens=8192
)
```
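To give a feel for what compaction under a token budget involves, here is a deliberately simple sketch. It is not the `TieredCompact` algorithm: `estimate_tokens` and `compact` are hypothetical helpers, the word-count token estimate is a crude placeholder, and the policy (drop the oldest non-system messages while protecting the newest `keep_recent`) is an assumption.

```python
def estimate_tokens(message):
    # Crude placeholder: one token per whitespace-separated word.
    return len(message["content"].split())


def compact(messages, budget_tokens, keep_recent=2):
    """Drop the oldest non-system messages until the estimate fits the budget."""
    kept = list(messages)
    while sum(estimate_tokens(m) for m in kept) > budget_tokens:
        # Only messages older than the newest keep_recent may be dropped.
        cutoff = max(0, len(kept) - keep_recent)
        droppable = [i for i, m in enumerate(kept[:cutoff]) if m["role"] != "system"]
        if not droppable:
            break  # nothing left that may be dropped
        del kept[droppable[0]]
    return kept


history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six"},
    {"role": "user", "content": "seven"},
]
compacted = compact(history, budget_tokens=6, keep_recent=2)
print([m["role"] for m in compacted])
```

The oldest user message is dropped first, while the system prompt and the two most recent messages are always preserved, even when the budget cannot be met.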

### BudgetMode Enum

```python
class BudgetMode(Enum):
    MANUAL = "manual"
    FORGE_FAST = "forge_fast"
    FORGE_DEEP = "forge_deep"
```

Source: [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)

## Error Types

Forge defines custom exceptions for specific error conditions:

```python
class UnsupportedModelError(Exception):
    """Raised when strict sampling defaults are requested for unknown models."""
    pass
```

Additional error types in `errors.py`:

| Error | Use Case |
|-------|----------|
| `BudgetResolutionError` | Server unreachable or missing n_ctx |
| `BackendError` | Backend communication failures |

Source: [src/forge/errors.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/errors.py)

## Quick Start Example

```python
import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())
```

Source: [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)

## Summary

The forge module structure provides:

1. **Tool Definition** — `ToolSpec` and `ToolDef` for declaring LLM-callable functions with prerequisites
2. **Workflow Orchestration** — `Workflow` and `WorkflowRunner` for managing multi-step tasks
3. **Client Abstraction** — Backend-agnostic clients with sampling defaults
4. **Guardrails Middleware** — Built-in validation, step enforcement, and error handling
5. **Server Management** — Lifecycle control for llama.cpp backends
6. **Context Management** — Token budget and compaction strategies

---

<a id='page-adr-index'></a>

## Architecture Decision Records

### Related Pages

Related topics: [System Architecture](#page-architecture)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)
- [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)
- [src/forge/guardrails/guardrails.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)
- [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
- [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)
- [src/forge/proxy/__main__.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)
</details>

# Architecture Decision Records

## Overview

Architecture Decision Records (ADRs) serve as the authoritative documentation for significant design choices within the forge project. They capture the *why* behind implementation decisions, enabling current and future contributors to understand the reasoning without reconstructing the original context.

The forge project stores ADRs in `docs/decisions/`, using a numbered naming convention (e.g., `001-ablation-framework.md`, `011-guardrail-middleware.md`, `013-text-response-intent.md`). This numbering scheme allows for easy chronological tracking and establishes precedent relationships between decisions.

## Purpose and Scope

ADRs in forge address several critical aspects:

| Category | Description | Example Documents |
|----------|-------------|-------------------|
| Framework Design | Ablation study methodology and tooling | `001-ablation-framework.md` |
| Middleware Patterns | Guardrail implementation and composition | `011-guardrail-middleware.md` |
| Response Handling | Intent classification for text responses | `013-text-response-intent.md` |
| Backend Integration | LLM client configuration and sampling defaults | `sampling_defaults.py` |
| Server Management | Context resolution and budget modes | `server.py` |

Each ADR documents not only the chosen approach but also considered alternatives and the tradeoffs that influenced the final decision. This creates a historical record that prevents repeated debates over settled questions while enabling informed reconsideration when circumstances change.

## ADR Contribution Workflow

According to the contribution guidelines, the process for introducing a new architecture decision follows a structured review pattern:

```mermaid
graph TD
    A[Identify Design Decision] --> B[Review Existing ADRs]
    B --> C{Decision Already Documented?}
    C -->|Yes| D[Reference Existing ADR]
    C -->|No| E[Draft New ADR]
    E --> F[Propose ADR Format]
    F --> G[Review Against Project Standards]
    G --> H[Merge and Publish]
    
    style A fill:#e1f5fe
    style H fill:#c8e6c9
```

The contribution workflow integrates with the broader project development cycle:

1. **Proposal Phase**: Before implementing significant changes, contributors should draft an ADR following the established format
2. **Review Phase**: The ADR undergoes peer review alongside code review
3. **Adoption Phase**: Once approved, the ADR becomes the reference for implementation decisions
4. **Maintenance Phase**: ADRs may be updated if subsequent decisions supersede them

Source: [CONTRIBUTING.md:1-50]()

## Core Architecture Components

### Workflow Engine

The `Workflow` and `WorkflowRunner` classes form the central orchestration layer. A workflow defines the available tools, required execution steps, and terminal conditions.

```mermaid
graph TD
    subgraph Workflow Definition
        W[Workflow] --> TD[Tool Definitions]
        W --> RS[Required Steps]
        W --> TT[Terminal Tool]
        W --> SP[System Prompt Template]
    end
    
    subgraph Execution Layer
        WR[WorkflowRunner] --> CM[Context Manager]
        WR --> GR[Guardrails]
        WR --> CL[LLM Client]
    end
    
    subgraph Tool Layer
        TC[ToolCall] --> T[Tool Execution]
        T --> TR[Tool Response]
    end
    
    WR --> TC
    TC -->|Result| CM
    CM -->|Context| WR
    
    style W fill:#fff3e0
    style WR fill:#e3f2fd
```

The `ToolDef` dataclass binds tool schemas to implementations, while `ToolSpec` defines the JSON Schema for parameter validation. Tool calls are represented as `ToolCall` objects containing the tool name and arguments.

Source: [src/forge/core/workflow.py:1-100]()

### Guardrail Middleware

The guardrail system provides a composable validation layer that intercepts LLM responses before tool execution:

```mermaid
graph LR
    LLM[LLM Response] --> GR[Guardrails.check]
    GR --> RV[ResponseValidator]
    GR --> SE[StepEnforcer]
    GR --> ET[ErrorTracker]
    
    RV -->|Valid| TC[ToolCalls]
    RV -->|Invalid| NR[Retry Nudge]
    SE -->|Correct Order| TC
    SE -->|Wrong Order| SB[Step Blocked]
    ET -->|OK| TC
    ET -->|Max Errors| FT[Fatal]
    
    style GR fill:#fce4ec
    style TC fill:#c8e6c9
```

The `Guardrails` class orchestrates three sub-components:

| Component | Responsibility | Key Parameters |
|-----------|---------------|----------------|
| `ResponseValidator` | Parses tool calls, enables rescue parsing | `rescue_enabled`, `retry_nudge_fn` |
| `StepEnforcer` | Ensures required steps precede terminal tool | `required_steps`, `max_premature_attempts` |
| `ErrorTracker` | Tracks consecutive errors and retries | `max_retries`, `max_tool_errors` |

Source: [src/forge/guardrails/guardrails.py:1-100]()

### Server Management and Budget Resolution

The `ServerManager` handles lifecycle management for llama.cpp-based backends, while the `ContextManager` implements token budget strategies:

```mermaid
graph TD
    SM[ServerManager] --> BM[BudgetMode]
    BM -->|FORGE_FAST| FT[Fast Budget]
    BM -->|FORGE_BALANCED| BT[Balanced Budget]
    BM -->|FORGE_DEEP| DT[Deep Budget]
    BM -->|MANUAL| MT[Manual Tokens]
    
    CM[ContextManager] --> TC[TieredCompact]
    CM --> SC[SimpleCompact]
    
    SM -->|Context Query| Props["/props endpoint"]
    Props -->|n_ctx| CM
```

Budget resolution follows platform-specific paths:

- **Ollama**: Uses `manual_tokens` parameter for `MANUAL` mode
- **Llamafile/Llama Server**: Queries `/props` endpoint for server-configured context length

Source: [src/forge/server.py:1-100]()
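
The two resolution paths above can be sketched as a single pure function. This is a minimal sketch under stated assumptions, not forge's `ServerManager` code: the function name, the backend strings, and the shape of the `/props` payload (an `n_ctx` value, possibly nested under `default_generation_settings`) are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any


def resolve_context_budget(
    backend: str,
    manual_tokens: int | None = None,
    props: dict[str, Any] | None = None,
) -> int:
    """Pick a context budget in tokens for a backend (hypothetical helper)."""
    if backend == "ollama":
        # Ollama path: the caller supplies the budget explicitly.
        if manual_tokens is None:
            raise ValueError("manual_tokens is required for the Ollama path")
        return manual_tokens
    if backend in ("llamafile", "llamaserver"):
        # llama.cpp-style path: read n_ctx from an already-fetched /props payload.
        if props is None:
            raise ValueError("a /props payload is required for this backend")
        # Assumed payload shape: n_ctx nested in default_generation_settings,
        # with a top-level n_ctx accepted as a fallback.
        settings = props.get("default_generation_settings", {})
        n_ctx = settings.get("n_ctx", props.get("n_ctx"))
        if n_ctx is None:
            raise ValueError("no n_ctx found in the /props payload")
        return int(n_ctx)
    raise ValueError(f"unknown backend: {backend!r}")


print(resolve_context_budget("ollama", manual_tokens=4096))
```

Keeping the HTTP request outside the function makes the resolution logic easy to test without a running server.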

## Sampling Configuration System

The sampling defaults system separates lookup from policy, enabling fine-grained control over model parameters:

```mermaid
graph TD
    subgraph Lookup Layer
        GM[get_model_defaults] --> MAP[MODEL_SAMPLING_DEFAULTS]
    end
    
    subgraph Policy Layer
        AS[apply_sampling_defaults] --> |strict=True| KR[Known + Known]
        AS --> |strict=False| KU[Known + Unknown]
        KR -->|In Map| ReturnDict[Return Dict]
        KU -->|Not In Map| InfoLog[INFO Log Once]
    end
    
    subgraph Client Integration
        OC[OllamaClient] --> AS
        LC[LlamafileClient] --> AS
        AC[AnthropicClient] --> AS
    end
```

The two-function design (`get_sampling_defaults` for pure lookup, `apply_sampling_defaults` for policy) ensures that:

- Unknown models don't cause errors when `strict=False`
- Known models log a one-time INFO message when not opted in
- Explicit opt-in via `recommended_sampling=True` enables strict behavior

Source: [src/forge/clients/sampling_defaults.py:1-100]()
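
The separation of lookup from policy can be illustrated with a small sketch. The names `lookup_defaults` and `apply_defaults`, the table contents, and the exact strict-mode behavior (raising for unknown models) are assumptions for illustration, not forge's actual implementation.

```python
from __future__ import annotations

import logging

logger = logging.getLogger("sampling_sketch")

# Hypothetical table; the model name and values are placeholders.
SAMPLING_DEFAULTS: dict[str, dict[str, float]] = {
    "example-model:7b": {"temperature": 0.7, "top_p": 0.9},
}
_already_logged: set[str] = set()


def lookup_defaults(model: str) -> dict[str, float] | None:
    """Pure lookup: no logging, no errors, no side effects."""
    return SAMPLING_DEFAULTS.get(model)


def apply_defaults(model: str, strict: bool = False) -> dict[str, float]:
    """Policy layer: decide what to do with the lookup result."""
    defaults = lookup_defaults(model)
    if strict:
        # Opted in: an unknown model is treated as an error in this sketch.
        if defaults is None:
            raise KeyError(f"no recommended sampling for {model!r}")
        return dict(defaults)
    # Not opted in: never fail, but mention once that defaults exist.
    if defaults is not None and model not in _already_logged:
        _already_logged.add(model)
        logger.info("recommended sampling exists for %s; opt in to use it", model)
    return {}


print(apply_defaults("example-model:7b", strict=True))
```

Because the lookup has no side effects, the policy (raise, log, or ignore) can change without touching the table or its accessor.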

## Proxy Server Architecture

The `ProxyServer` provides a forwarding layer with additional control features:

```mermaid
graph TD
    subgraph Proxy Layer
        PS[ProxyServer] --> SF[Serialize Flag]
        PS --> RT[Retry Logic]
        PS --> RC[Rescue Parser]
    end
    
    subgraph Backend Routing
        PS --> Ollama[Ollama Backend]
        PS --> Llama[Llama Backend]
        PS --> LLF[Llamafile Backend]
    end
    
    subgraph Configuration
        SF --> |serialize=True| Serial[Serialize Requests]
        SF --> |serialize=False| Parallel[Parallel Requests]
        RT --> |max_retries=N| RetryN[N Attempts]
    end
```

Key proxy options include:

| Flag | Default | Purpose |
|------|---------|---------|
| `--host` | `127.0.0.1` | Proxy listen address |
| `--port` | `8081` | Proxy listen port |
| `--serialize` | `None` | Request serialization control |
| `--max-retries` | `3` | Retries per request |
| `--no-rescue` | `False` | Disable rescue parsing |

Source: [src/forge/proxy/__main__.py:1-80]()
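
A parser with the flags and defaults from the table above might look like the following sketch. It uses only the standard-library `argparse` module; treating `--serialize` as a presence flag whose unset value is `None` is an assumption, and the program name is a placeholder.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented proxy flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="proxy-sketch")
    parser.add_argument("--host", default="127.0.0.1", help="Proxy listen address")
    parser.add_argument("--port", type=int, default=8081, help="Proxy listen port")
    # Assumption: None means "not specified", leaving the choice to the server.
    parser.add_argument("--serialize", action="store_true", default=None,
                        help="Request serialization control")
    parser.add_argument("--max-retries", type=int, default=3, help="Retries per request")
    parser.add_argument("--no-rescue", action="store_true", help="Disable rescue parsing")
    return parser


args = build_parser().parse_args(["--port", "9000", "--no-rescue"])
print(args.host, args.port, args.serialize, args.max_retries, args.no_rescue)
```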

## ADR Format and Standards

Each ADR in the forge repository follows a consistent structure:

1. **Title**: Descriptive name with ADR number
2. **Status**: Proposed, Accepted, Deprecated, or Superseded
3. **Context**: Background and problem statement
4. **Decision**: The chosen approach with rationale
5. **Consequences**: Benefits, drawbacks, and tradeoffs
6. **Related Decisions**: Links to dependent or related ADRs

This format ensures that future maintainers can quickly assess whether an ADR is current and understand the full context of each decision.

## Versioning and Evolution

The CHANGELOG maintains a parallel record of implementation milestones, cross-referenced with ADRs. Major architectural changes increment the minor version number, while bug fixes increment the patch version (semantic versioning).

Changes that require ADR updates include:

- New LLM backend support
- Guardrail algorithm modifications
- Context management strategy changes
- Tool execution model alterations
- Breaking API changes

Source: [CHANGELOG.md:1-100]()

## Best Practices for ADR Readers

When reviewing ADRs to understand forge's architecture:

1. **Start with the index**: The `docs/decisions/` directory lists all ADRs chronologically
2. **Check status**: Deprecated ADRs indicate historical context, not current practice
3. **Cross-reference implementations**: Source files in `src/forge/` implement ADR decisions
4. **Review CHANGELOG**: Implementation dates and version numbers provide temporal context
5. **Examine tests**: Unit tests in `tests/unit/` validate ADR-enforced behaviors

---

<a id='page-workflowrunner'></a>

## WorkflowRunner and Agentic Loop

### Related Pages

Related topics: [System Architecture](#page-architecture)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [src/forge/core/runner.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/runner.py)
- [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)
- [src/forge/core/steps.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/steps.py)
- [src/forge/guardrails/guardrails.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)
- [src/forge/guardrails/response_validator.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/response_validator.py)
- [src/forge/guardrails/step_enforcer.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/step_enforcer.py)
- [src/forge/guardrails/error_tracker.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/error_tracker.py)
- [src/forge/prompts/nudges.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/prompts/nudges.py)
</details>

# WorkflowRunner and Agentic Loop

## Overview

The `WorkflowRunner` is the central execution engine in the Forge framework, implementing an **agentic loop** that orchestrates multi-step tool-calling workflows with an LLM backend. It manages the complete lifecycle of a workflow: from initializing messages, through iterative LLM inference and tool execution, to context management and termination.

**Core responsibilities:**

1. Building initial message lists (system prompt + user input)
2. Coordinating LLM inference with streaming or batch responses
3. Validating and executing tool calls returned by the LLM
4. Managing context budget through the `ContextManager`
5. Enforcing required step sequences via the `StepEnforcer`
6. Handling retries for malformed responses
7. Terminating on terminal tool execution or max iterations

Source: [src/forge/core/runner.py:1-50]()

---

## Architecture

### Component Relationships

```mermaid
graph TD
    subgraph "Forge Agentic System"
        WR[WorkflowRunner] --> CM[ContextManager]
        WR --> IV[Inference Validator]
        WR --> SE[StepEnforcer]
        WR --> ET[ErrorTracker]
        CM --> CC[Compaction Chain]
        IV --> RV[ResponseValidator]
    end
    
    WR --> Client[LLMClient]
    Client --> Ollama[OllamaClient]
    Client --> LlamaServer[LlamafileClient]
    Client --> Proxy[ProxyClient]
    
    WF[Workflow] --> WR
    WF --> TD[ToolDefs]
    TD --> Impl[Callable Implementations]
```

Source: [src/forge/core/runner.py:1-50]()

### Agentic Loop Flow

The `WorkflowRunner.run()` method implements a 7-phase agentic loop:

```mermaid
graph TD
    A[Start: User Query] --> B["1. Build Messages<br/>system_prompt + user_input"]
    B --> C["2. Send to LLM<br/>Streaming or Batch"]
    C --> D{Response Type?}
    D -->|TextResponse<br/>Malformed/Refusal| E["3. Retry with Nudge"]
    E --> C
    D -->|ToolCall| F["4. Validate Tool Calls<br/>Schema + Prerequisites"]
    F --> G{"5. Execute Tools<br/>Batch-aware"}
    G --> H{Success?}
    H -->|Error| I["Track Error<br/>Retry if < max_tool_errors"]
    I --> C
    H -->|Success| J["6. Compact Context<br/>ContextManager"]
    J --> K{"7. Check Terminal<br/>terminal_tool called?"}
    K -->|No| L["Increment iteration<br/>Check < max_iterations"]
    L -->|Yes| C
    K -->|Yes| M[Done]
    L -->|No| N[MaxIterationsError]
    
    style E fill:#ffcccc
    style N fill:#ffcccc
    style M fill:#ccffcc
```

Source: [src/forge/core/runner.py:50-200]()
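
The control flow in the diagram can be approximated with a short, self-contained loop. This is a simplified sketch, not the `WorkflowRunner` implementation: `FakeResponse`, `run_loop`, the message dictionaries, and the nudge text are illustrative stand-ins, and tool-error tracking and context compaction are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeResponse:
    """Stand-in for an LLM reply: a tool call, or bare text when tool is None."""
    tool: str | None = None
    args: dict = field(default_factory=dict)
    text: str = ""


def run_loop(llm, tools, terminal_tool, user_input, max_iterations=10, max_retries=3):
    # Phase 1: build the initial message list.
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_input},
    ]
    retries = 0
    for _ in range(max_iterations):
        response = llm(messages)  # Phase 2: send to the LLM.
        if response.tool is None or response.tool not in tools:
            # Phase 3: bare text or unknown tool, so retry with a nudge.
            retries += 1
            if retries >= max_retries:
                raise RuntimeError("too many consecutive bad responses")
            messages.append({"role": "user", "content": "Please respond with a tool call."})
            continue
        retries = 0
        # Phases 4-5: validation is reduced to the name check above; execute the tool.
        result = tools[response.tool](**response.args)
        messages.append({"role": "tool", "name": response.tool, "content": result})
        # Phase 6 (context compaction) is skipped in this sketch.
        if response.tool == terminal_tool:  # Phase 7: terminal check.
            return messages
    raise RuntimeError("max iterations reached without calling the terminal tool")


# Scripted "model": one bare-text reply, then a valid tool call.
script = iter([
    FakeResponse(text="I think it is sunny."),
    FakeResponse(tool="get_weather", args={"city": "Paris"}),
])
history = run_loop(
    llm=lambda msgs: next(script),
    tools={"get_weather": lambda city: f"sunny in {city}"},
    terminal_tool="get_weather",
    user_input="Weather in Paris?",
)
print(history[-1]["content"])
```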

---

## Core Data Models

### Workflow

The `Workflow` dataclass defines the complete task specification:

```python
@dataclass
class Workflow:
    name: str
    description: str
    tools: dict[str, ToolDef]
    required_steps: list[str] = field(default_factory=list)
    terminal_tool: str | frozenset[str] = ""
    system_prompt_template: str = ""
    max_retries: int = 3
    cancel_event: asyncio.Event | None = None
```

Source: [src/forge/core/workflow.py:1-100]()

| Field | Type | Description |
|-------|------|-------------|
| `name` | `str` | Workflow identifier |
| `description` | `str` | Human-readable task description |
| `tools` | `dict[str, ToolDef]` | Tool name → ToolDef mapping |
| `required_steps` | `list[str]` | Tool names that must execute before terminal tool |
| `terminal_tool` | `str \| frozenset[str]` | Tool(s) that end the workflow |
| `system_prompt_template` | `str` | System prompt for the LLM |
| `max_retries` | `int` | Consecutive bad responses before fatal (default: 3) |
| `cancel_event` | `asyncio.Event \| None` | Optional cancellation signal |

### ToolDef

`ToolDef` binds a tool's schema to its implementation:

```python
@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)
```

Source: [src/forge/core/workflow.py:100-130]()

**Prerequisites** express conditional dependencies:

| Type | Example | Behavior |
|------|---------|----------|
| `str` | `"read_file"` | Any prior call to `read_file` satisfies it |
| `dict` | `{"tool": "read_file", "match_arg": "path"}` | Prior call with matching `path` value required |
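
The two prerequisite forms in the table can be checked along these lines. This is a sketch of the matching rule only; the function name `missing_prerequisites` and the shape of the `executed` history (tool name mapped to a list of argument dictionaries) are assumptions for illustration.

```python
from typing import Any


def missing_prerequisites(
    prerequisites: list,
    call_args: dict[str, Any],
    executed: dict[str, list[dict[str, Any]]],
) -> list[str]:
    """Return the names of prerequisite tools that are not yet satisfied."""
    missing = []
    for prereq in prerequisites:
        if isinstance(prereq, str):
            # String form: any prior call to the named tool is enough.
            if not executed.get(prereq):
                missing.append(prereq)
            continue
        # Dict form: a prior call must have used the same value for match_arg.
        tool, arg = prereq["tool"], prereq["match_arg"]
        satisfied = arg in call_args and any(
            arg in prior and prior[arg] == call_args[arg]
            for prior in executed.get(tool, [])
        )
        if not satisfied:
            missing.append(tool)
    return missing


history = {"read_file": [{"path": "a.txt"}]}
arg_matched = [{"tool": "read_file", "match_arg": "path"}]
print(missing_prerequisites(arg_matched, {"path": "b.txt"}, history))
```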

### ToolSpec

Defines the tool's interface for LLM communication using Pydantic:

```python
@dataclass
class ToolSpec:
    name: str
    description: str
    parameters: type[BaseModel]
```

Source: [src/forge/core/workflow.py:40-60]()

### ToolCall

Validated tool invocation returned by the LLM:

```python
class ToolCall(BaseModel):
    tool: str
    args: dict[str, Any] = Field(default_factory=dict)
```

Source: [src/forge/core/workflow.py:160-170]()

---

## StepTracker

The `StepTracker` maintains workflow state outside the message history, tracking which required steps have been completed and what tools have been executed with what arguments.

```python
@dataclass
class StepTracker:
    required_steps: list[str]
    completed_steps: dict[str, None] = field(default_factory=dict)
    executed_tools: dict[str, list[dict[str, Any]]] = field(default_factory=dict)
    
    def record(self, tool_name: str, args: dict[str, Any] | None = None) -> None: ...
    def is_satisfied(self) -> bool: ...
    def pending(self) -> list[str]: ...
```

Source: [src/forge/core/steps.py:20-50]()

**Key behaviors:**

- `completed_steps` uses `dict` (insertion order preserved) to track step order
- `executed_tools` stores argument history for arg-matched prerequisites
- Compaction cannot invalidate step completion (state lives outside message history)
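
The behaviors listed above can be reproduced with a minimal stand-in. The fields follow the signature shown earlier, but the method bodies are assumptions about how such a tracker could work, not forge's code; the class is named `StepTrackerSketch` to make that explicit.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class StepTrackerSketch:
    required_steps: list[str]
    # A dict is used as an ordered set: keys keep insertion order.
    completed_steps: dict[str, None] = field(default_factory=dict)
    executed_tools: dict[str, list[dict[str, Any]]] = field(default_factory=dict)

    def record(self, tool_name: str, args: dict[str, Any] | None = None) -> None:
        # Keep the argument history so arg-matched prerequisites can be checked.
        self.executed_tools.setdefault(tool_name, []).append(dict(args or {}))
        if tool_name in self.required_steps:
            self.completed_steps[tool_name] = None

    def pending(self) -> list[str]:
        return [s for s in self.required_steps if s not in self.completed_steps]

    def is_satisfied(self) -> bool:
        return not self.pending()


tracker = StepTrackerSketch(required_steps=["search", "lookup"])
tracker.record("lookup", {"id": 1})
print(tracker.pending(), tracker.is_satisfied())
```

Since the tracker never reads the message list, dropping or summarizing old messages leaves `pending()` unchanged.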

### PrerequisiteCheck

```python
@dataclass
class PrerequisiteCheck:
    satisfied: bool
    missing: list[str]
```

Source: [src/forge/core/steps.py:8-15]()

---

## Guardrails System

The guardrails system provides composable middleware for reliability without requiring `WorkflowRunner`:

```mermaid
graph LR
    subgraph "Guardrails Bundle"
        G[Guardrails]
        G --> RV[ResponseValidator]
        G --> SE[StepEnforcer]
        G --> ET[ErrorTracker]
    end
    
    Input[LLM Response] --> G
    G --> Output[CheckResult]
```

Source: [src/forge/guardrails/__init__.py:1-30]()

### Guardrails (Bundled API)

```python
class Guardrails:
    def __init__(
        self,
        tool_names: list[str],
        terminal_tool: str | frozenset[str],
        required_steps: list[str] | None = None,
        max_retries: int = 3,
        max_tool_errors: int = 2,
        rescue_enabled: bool = True,
        max_premature_attempts: int = 3,
        retry_nudge: Callable[[str], str] | None = None,
    ) -> None: ...

    def check(self, response: LLMResponse) -> CheckResult: ...
    def record(self, executed_tools: list[str]) -> bool: ...
```

Source: [src/forge/guardrails/guardrails.py:60-100]()

### CheckResult

```python
@dataclass
class CheckResult:
    action: Literal["execute", "retry", "step_blocked", "fatal"]
    tool_calls: list[ToolCall] | None = None
    nudge: Nudge | None = None
    reason: str | None = None
```

Source: [src/forge/guardrails/guardrails.py:20-35]()

| Action | Meaning |
|--------|---------|
| `execute` | Response is valid, proceed with tool calls |
| `retry` | Bare text or malformed, retry with nudge |
| `step_blocked` | Tried to call terminal tool prematurely |
| `fatal` | Exhausted retries or too many errors |

### ResponseValidator

Validates LLM responses and extracts tool calls:

- **Normal extraction**: Parses structured `tool_calls` from response
- **Rescue parsing**: Handles plain text with regex-based tool extraction for models like Qwen Coder (XML format: `<function=name>...`)
- **Retry nudging**: Injects guidance when model produces bare text

Source: [src/forge/guardrails/response_validator.py:1-50]()

### StepEnforcer

Enforces required step sequences:

```python
class StepEnforcer:
    def check(
        self,
        tool_calls: list[ToolCall],
        step_tracker: StepTracker,
    ) -> StepCheck: ...
```

Source: [src/forge/guardrails/step_enforcer.py:1-50]()

**Nudge escalation tiers:**

| Tier | Tone | Example |
|------|------|---------|
| 1 | Polite | "You cannot call X yet. You must first complete: Y, Z." |
| 2 | Direct | "You must call one of these tools now: Y, Z. Pick one." |
| 3 | Aggressive | "STOP. You MUST call one of: Y, Z. Do NOT call X." |

Source: [src/forge/prompts/nudges.py:1-40]()

### ErrorTracker

Tracks consecutive errors for retry/exhaustion logic:

```python
class ErrorTracker:
    def record_retry(self) -> int: ...
    def record_tool_error(self) -> int: ...
    def reset(self) -> None: ...
    @property
    def should_fatal(self) -> bool: ...
```

Source: [src/forge/guardrails/error_tracker.py:1-40]()
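
One way to read this interface is as a pair of consecutive-failure counters. The sketch below follows the method names shown above, but the bodies, the constructor, and the threshold comparison (`>=`) are assumptions rather than forge's actual logic.

```python
class ErrorTrackerSketch:
    """Counts consecutive retries and tool errors (illustrative stand-in)."""

    def __init__(self, max_retries: int = 3, max_tool_errors: int = 2) -> None:
        self.max_retries = max_retries
        self.max_tool_errors = max_tool_errors
        self._retries = 0
        self._tool_errors = 0

    def record_retry(self) -> int:
        """Count one more bad response and return the running total."""
        self._retries += 1
        return self._retries

    def record_tool_error(self) -> int:
        """Count one more failed tool execution and return the running total."""
        self._tool_errors += 1
        return self._tool_errors

    def reset(self) -> None:
        """Clear both counters, for example after a successful step."""
        self._retries = 0
        self._tool_errors = 0

    @property
    def should_fatal(self) -> bool:
        # Assumption: reaching either limit exhausts the budget.
        return (
            self._retries >= self.max_retries
            or self._tool_errors >= self.max_tool_errors
        )


tracker = ErrorTrackerSketch(max_retries=3, max_tool_errors=2)
tracker.record_tool_error()
print(tracker.should_fatal)
```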

---

## WorkflowRunner Configuration

### Constructor Parameters

```python
class WorkflowRunner:
    def __init__(
        self,
        client: LLMClient,
        context_manager: ContextManager,
        max_iterations: int = 10,
        max_retries_per_step: int = 3,
        max_tool_errors: int = 2,
        max_premature_attempts: int = 3,
        retry_nudge: Callable[[str], str] | None = None,
    ) -> None: ...
```

Source: [src/forge/core/runner.py:50-80]()

| Parameter | Default | Description |
|-----------|---------|-------------|
| `client` | required | `LLMClient` instance (OllamaClient, LlamafileClient, ProxyClient) |
| `context_manager` | required | `ContextManager` instance for context compaction |
| `max_iterations` | `10` | Maximum LLM turns before raising `MaxIterationsError` |
| `max_retries_per_step` | `3` | Consecutive bare-text responses before fatal |
| `max_tool_errors` | `2` | Consecutive tool execution failures before exhaustion |
| `max_premature_attempts` | `3` | Premature terminal tool attempts before fatal |
| `retry_nudge` | `None` | Custom nudge function for bare text responses |

### Run Method

```python
async def run(
    self,
    workflow: Workflow,
    user_input: str,
    cancel_event: asyncio.Event | None = None,
) -> list[Message]: ...
```

Source: [src/forge/core/runner.py:80-120]()

**Returns:** List of all `Message` objects in the conversation history

**Raises:**

| Exception | Condition |
|-----------|-----------|
| `MaxIterationsError` | Exceeded `max_iterations` without reaching terminal tool |
| `StepEnforcementError` | `max_premature_attempts` exceeded |
| `PrerequisiteError` | Tool called without satisfying prerequisites |
| `ToolCallError` | Tool name not found in workflow |
| `ToolExecutionError` | Tool callable raised an exception |
| `WorkflowCancelledError` | `cancel_event` was set |

---

## WorkflowRunner vs Guardrails

Forge provides two integration patterns:

| Aspect | WorkflowRunner | Guardrails |
|--------|---------------|------------|
| **Abstraction** | Full agentic loop | Lightweight middleware |
| **Context management** | Built-in via ContextManager | Handled externally |
| **Best for** | Standard workflows | Custom orchestration loops |
| **API surface** | `run()` | `check()` + `record()` |

Source: [src/forge/guardrails/__init__.py:1-20]()

### Guardrails Usage Pattern

```python
from forge.guardrails import Guardrails, CheckResult

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

# After each LLM response
result = guardrails.check(llm_response)

if result.action == "execute":
    for tc in result.tool_calls:
        execute_tool(tc.tool, tc.args)
    guardrails.record([tc.tool for tc in result.tool_calls])
elif result.action == "retry":
    # Re-prompt with nudge
    pass
elif result.action == "fatal":
    # Stop
    pass
```

---

## Context Management

The `ContextManager` handles conversation context compaction:

```mermaid
graph TD
    A[New Tool Result] --> B{Token Budget?}
    B -->|OK| C[Continue]
    B -->|Exceeded| D["Compact Messages<br/>TieredCompact"]
    D --> C
```

The runner automatically calls context compaction after each tool execution cycle. Compaction decisions use **real token counting** from backend-reported token usage.

Source: [src/forge/core/runner.py:150-200]()

### TieredCompact Strategy

Keeps recent messages intact while compacting older history:

- `keep_recent`: Number of recent message pairs to preserve
- Older messages are summarized or removed
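
A keep-recent compaction pass can be sketched as follows. This is an illustration of the idea, not the `TieredCompact` implementation: the message dictionaries, the interpretation of a "pair" as two consecutive messages, and the placeholder summary text are all assumptions.

```python
def compact_keep_recent(messages: list[dict], keep_recent: int) -> list[dict]:
    """Keep system messages and the newest pairs; replace the rest with a marker."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    keep = 2 * keep_recent  # assumption: one pair is two messages
    if len(rest) <= keep:
        return list(messages)  # nothing old enough to compact
    dropped = len(rest) - keep
    # A real strategy might summarize; this sketch only leaves a placeholder.
    summary = {"role": "user", "content": f"[{dropped} earlier messages compacted]"}
    return system + [summary] + rest[len(rest) - keep:]


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Research topic X."},
    {"role": "assistant", "content": "calling search"},
    {"role": "tool", "content": "search results"},
    {"role": "assistant", "content": "calling lookup"},
    {"role": "tool", "content": "lookup results"},
    {"role": "assistant", "content": "calling answer"},
]
compacted = compact_keep_recent(conversation, keep_recent=1)
print(len(conversation), len(compacted))
```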

---

## Error Handling

### Error Flow

```mermaid
graph TD
    A[Tool Execution] --> B{Success?}
    B -->|Yes| C[Record Success]
    B -->|No| D[Increment Error Count]
    D --> E{max_tool_errors?}
    E -->|Exceeded| F[Raise ToolExecutionError]
    E -->|Not exceeded| G[Retry Tool Call]
    G --> A
    F --> H[Abort Workflow]
```

### Recovery Mechanisms

| Error Type | Recovery Strategy |
|------------|-------------------|
| Malformed response | Retry with nudge (up to `max_retries_per_step`) |
| Tool execution failure | Retry tool call (up to `max_tool_errors`) |
| Premature terminal | Escalating nudge (3 tiers), then fatal |
| Prerequisite violation | Prerequisite nudge, model re-plans |

---

## Cancellation Support

Workflows can be cancelled via `asyncio.Event`:

```python
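import asyncio

# client, ctx, workflow, and user_input are defined as in the Basic Workflow example.
async def main() -> None:
    cancel_event = asyncio.Event()
    runner = WorkflowRunner(client=client, context_manager=ctx)
    # Keep a reference to the task so it can be awaited later.
    task = asyncio.create_task(runner.run(workflow, user_input, cancel_event))

    # Later, from another task:
    cancel_event.set()  # signals cancellation; set() itself does not raise
    try:
        await task
    except WorkflowCancelledError:
        pass  # raised by run() once cancel_event is set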
```

Source: [src/forge/core/workflow.py:80-90]()

---

## Usage Examples

### Basic Workflow

```python
import asyncio
from pydantic import BaseModel, Field
from forge import Workflow, ToolDef, ToolSpec, WorkflowRunner, OllamaClient
from forge.context import ContextManager, TieredCompact

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    messages = await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())
```

Source: [README.md:1-60]()

### Multi-Step Workflow with Required Steps

```python
workflow = Workflow(
    name="research",
    description="Research a topic with multiple steps",
    tools={
        "search": ToolDef(spec=search_spec, callable=search_impl),
        "lookup": ToolDef(
            spec=lookup_spec,
            callable=lookup_impl,
            prerequisites=["search"],  # Must search before lookup
        ),
        "answer": ToolDef(
            spec=answer_spec,
            callable=answer_impl,
            prerequisites=["search", "lookup"],  # All steps required
        ),
    },
    required_steps=["search", "lookup"],
    terminal_tool="answer",
    system_prompt_template="You are a research assistant.",
)
```

---

## See Also

- [User Guide](docs/USER_GUIDE.md) — Full workflow configuration and best practices
- [Eval Guide](docs/EVAL_GUIDE.md) — Running evaluation scenarios
- [ADR-011: Guardrail Middleware](docs/decisions/011-guardrail-middleware.md) — Design rationale for the guardrails system
- [Backend Setup](docs/BACKEND_SETUP.md) — LLM backend configuration (Ollama, llamafile, llama.cpp)

---

<a id='page-guardrails-middleware'></a>

## Guardrails Middleware for External Loops

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [src/forge/guardrails/guardrails.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/guardrails.py)
- [src/forge/guardrails/__init__.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/guardrails/__init__.py)
- [examples/foreign_loop.py](https://github.com/antoinezambelli/forge/blob/main/examples/foreign_loop.py)
- [src/forge/prompts/nudges.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/prompts/nudges.py)
- [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)
</details>

# Guardrails Middleware for External Loops

Forge's **Guardrails Middleware** provides a composable reliability layer that can be integrated into any external orchestration loop. Rather than requiring adoption of the full `WorkflowRunner`, external projects can embed forge's retry nudges, rescue parsing, step enforcement, and error tracking directly within their own agent execution frameworks.

---

## Overview

The guardrails system addresses a fundamental challenge in LLM tool-calling: models frequently produce malformed, premature, or incomplete tool invocations that require intervention rather than silent failure.

| Concern | Component | Purpose |
|---------|-----------|---------|
| Malformed responses | `ResponseValidator` | Parse tool calls from text, retry on bad format |
| Premature termination | `StepEnforcer` | Enforce required step sequence before terminal tool |
| Consecutive failures | `ErrorTracker` | Count retries/tool errors, signal exhaustion |
| Model-facing messages | `Nudge` | Structured retry guidance injected into prompts |

Source: [src/forge/guardrails/__init__.py:1-29]()

---

## Architecture

```mermaid
graph TD
    A["LLM Response"] --> B["Guardrails.check()"]
    B --> C{ResponseValidator}
    C -->|Valid tool call| D{StepEnforcer}
    C -->|Bare text / malformed| E["Generate retry_nudge"]
    D -->|Step violated| F["Generate premature_terminal_nudge"]
    D -->|All checks pass| G{ErrorTracker}
    G -->|Under threshold| H["action: execute"]
    G -->|Over threshold| I["action: fatal"]
    
    J["Tool Execution"] --> K["Guardrails.record()"]
    K --> G
```

The `Guardrails` class bundles three components into a single two-method API:

```python
class Guardrails:
    def check(self, response: LLMResponse) -> CheckResult: ...
    def record(self, executed: list[str]) -> bool: ...
```

Source: [src/forge/guardrails/guardrails.py:86-109]()

---

## CheckResult Data Model

Every call to `check()` returns a `CheckResult` that encodes the appropriate action for the orchestration loop:

| Field | Type | Description |
|-------|------|-------------|
| `action` | `Literal["execute", "retry", "step_blocked", "fatal"]` | Directive for the caller |
| `tool_calls` | `list[ToolCall] \| None` | Parsed tool invocations (when action is `execute`) |
| `nudge` | `Nudge \| None` | Structured message to inject (when action is `retry` or `step_blocked`) |
| `reason` | `str \| None` | Human-readable explanation (only when action is `fatal`) |

Source: [src/forge/guardrails/guardrails.py:66-85]()

### Action Semantics

| Action | Meaning | Caller Behavior |
|--------|---------|-----------------|
| `execute` | Response is valid; proceed to tool execution | Extract `tool_calls`, execute them, call `record()` |
| `retry` | Response invalid but recoverable; inject nudge | Append `nudge` to messages, re-prompt model |
| `step_blocked` | Terminal tool called before required steps | Append step-enforcement nudge, re-prompt |
| `fatal` | Exhausted retry/error budget | Log `reason`, abort workflow |

---

## Guardrails Constructor Parameters

```python
Guardrails(
    tool_names: list[str],
    terminal_tool: str | frozenset[str],
    required_steps: list[str] | None = None,
    max_retries: int = 3,
    max_tool_errors: int = 2,
    rescue_enabled: bool = True,
    max_premature_attempts: int = 3,
    retry_nudge: Callable[[str], str] | None = None,
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tool_names` | `list[str]` | — | Valid tool names for this workflow |
| `terminal_tool` | `str \| frozenset[str]` | — | Tool(s) that can end the workflow |
| `required_steps` | `list[str] \| None` | `None` | Tools that must be called before `terminal_tool` |
| `max_retries` | `int` | `3` | Consecutive bad responses before `fatal` |
| `max_tool_errors` | `int` | `2` | Consecutive tool execution failures before exhaustion |
| `rescue_enabled` | `bool` | `True` | Attempt to parse tool calls from plain text |
| `max_premature_attempts` | `int` | `3` | Premature terminal attempts before `fatal` |
| `retry_nudge` | `Callable[[str], str] \| None` | `None` | Custom nudge generator for bare text responses |

Source: [src/forge/guardrails/guardrails.py:47-69]()

---

## Integration Patterns

### Bundled API (Recommended)

For most integrations, use the two-method `Guardrails` interface:

```python
from forge.guardrails import Guardrails, CheckResult

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response(response) -> str:
    result = guardrails.check(response)
    
    if result.action == "fatal":
        return f"ABORT: {result.reason}"
    
    if result.action in ("retry", "step_blocked"):
        return f"RETRY with nudge: {result.nudge.content}"
    
    # Execute tools
    executed = [tc.tool for tc in result.tool_calls]
    done = guardrails.record(executed)
    return f"executed {executed}" + (" -- DONE" if done else "")
```

Source: [examples/foreign_loop.py:60-78]()

### Granular API (Advanced)

For fine-grained control, instantiate components directly:

```python
from forge.guardrails import ResponseValidator, StepEnforcer, ErrorTracker

validator = ResponseValidator(
    tool_names=["search", "lookup", "answer"],
    rescue_enabled=True,
)
enforcer = StepEnforcer(
    required_steps=["search", "lookup"],
    terminal_tools=frozenset(["answer"]),
)
errors = ErrorTracker(max_retries=3, max_tool_errors=2)

def handle_response_granular(response):
    # Reset state for each new response
    errors._consecutive_retries = 0
    enforcer._premature_attempts = 0
    
    validation = validator.validate(response)
    if not validation.valid:
        return "RETRY: " + validation.nudge.content
    
    step_check = enforcer.check(response.tool_calls)
    if step_check.blocked:
        return "BLOCKED: " + step_check.nudge.content
    
    # Execute, then record
    errors.record(tool_name="search", error=None)
    return "proceed to execution"
```

Source: [examples/foreign_loop.py:32-59]()

---

## Nudge Generation

The `nudges.py` module (`src/forge/prompts/nudges.py`) provides structured retry messages with three escalation tiers:

```mermaid
graph LR
    A["tier=1"] --> B["Polite:<br/>'You cannot call X yet...'"]
    A --> C["tier=2"]
    C --> D["Direct:<br/>'You must call one of...'"]
    C --> E["tier=3"]
    E --> F["Aggressive:<br/>'STOP. You MUST call...'"]
```

### Premature Terminal Nudge

Generated when the model attempts to call the terminal tool before completing required steps:

```python
def premature_terminal_nudge(
    terminal_tool: str,
    pending_steps: list[str],
    tier: int = 1,
) -> str:
    tier = max(1, min(3, tier))  # Clamped to 1-3
```

Source: [src/forge/prompts/nudges.py:1-26]()

| Tier | Tone | Example Output |
|------|------|----------------|
| 1 | Polite | "You cannot call `answer` yet. You must first complete these required steps: search, lookup. Call one of them now." |
| 2 | Direct | "You must call one of these tools now: search, lookup. Pick one." |
| 3 | Aggressive | "STOP. You MUST call one of: search, lookup. Do NOT call `answer`. Your next response MUST be a tool call to one of: search, lookup." |
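
The tier table can be turned into a small function for experimentation. The sketch below reuses the clamping line from the excerpt and the example strings from the table; the function name `premature_terminal_nudge_sketch` and the way the strings are assembled are assumptions, so the real wording in `nudges.py` may differ.

```python
def premature_terminal_nudge_sketch(
    terminal_tool: str,
    pending_steps: list[str],
    tier: int = 1,
) -> str:
    """Build an escalating nudge modeled on the example outputs above."""
    tier = max(1, min(3, tier))  # clamp to the 1-3 range
    steps = ", ".join(pending_steps)
    if tier == 1:
        return (
            f"You cannot call `{terminal_tool}` yet. You must first complete "
            f"these required steps: {steps}. Call one of them now."
        )
    if tier == 2:
        return f"You must call one of these tools now: {steps}. Pick one."
    return (
        f"STOP. You MUST call one of: {steps}. Do NOT call `{terminal_tool}`. "
        f"Your next response MUST be a tool call to one of: {steps}."
    )


# Escalate one tier per premature attempt.
for attempt in (1, 2, 3):
    print(premature_terminal_nudge_sketch("answer", ["search", "lookup"], tier=attempt))
```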

### Prerequisite Nudge

Generated when a tool is called without its prerequisites:

```python
def prerequisite_nudge(tool_name: str, missing_prereqs: list[str]) -> str: ...
```

Source: [src/forge/prompts/nudges.py:28-46]()

---

## ToolCall Model

The `tool_calls` extracted from a valid response are represented as:

```python
class ToolCall(BaseModel):
    tool: str  # Tool name
    args: dict[str, Any] = Field(default_factory=dict)  # Parsed arguments

For OpenAI-style tool definitions, the `ToolSpec` in `workflow.py` handles parameter schema conversion:

```python
class ToolSpec:
    name: str
    description: str
    parameters: type[BaseModel]  # Pydantic model
    
    @classmethod
    def from_openai_schema(cls, name: str, description: str, schema: dict) -> ToolSpec:
        properties = schema.get("properties", {})
        required = set(schema.get("required", []))
```

Source: [src/forge/core/workflow.py:1-40]()
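
To make the `properties` / `required` split concrete, here is a minimal sketch that uses only the standard library. The helper `describe_fields` and the `JSON_TO_PYTHON` mapping are hypothetical; the excerpt above does not show how forge turns these values into a Pydantic model, so this block only illustrates the first step of reading the schema.

```python
from typing import Any

# Hypothetical mapping from JSON Schema type names to Python types.
JSON_TO_PYTHON: dict[str, type] = {
    "string": str,
    "integer": int,
    "number": float,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def describe_fields(schema: dict[str, Any]) -> dict[str, tuple[type, bool]]:
    """Return {field_name: (python_type, is_required)} for an object schema."""
    properties = schema.get("properties", {})
    required = set(schema.get("required", []))
    fields: dict[str, tuple[type, bool]] = {}
    for name, prop in properties.items():
        # Unknown or missing types fall back to str in this sketch.
        py_type = JSON_TO_PYTHON.get(prop.get("type", "string"), str)
        fields[name] = (py_type, name in required)
    return fields


weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "days": {"type": "integer"},
    },
    "required": ["city"],
}
print(describe_fields(weather_schema))
```

A full conversion would feed these `(type, required)` pairs into a model factory so that optional fields receive defaults and required fields do not.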

---

## Respond Tool Integration

For conversational use cases where the model may produce a final text response instead of a tool call, forge provides a special `respond` tool:

```python
from forge.guardrails import Guardrails
from forge.tools import RESPOND_TOOL_NAME, respond_spec

respond_guardrails = Guardrails(
    tool_names=["search", "lookup", "answer", RESPOND_TOOL_NAME],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response_with_respond(response):
    result = respond_guardrails.check(response)
    
    if result.action == "fatal":
        return f"FATAL: {result.reason}"
    
    if result.action in ("retry", "step_blocked"):
        return f"{result.action}: {result.nudge.content[:80]}..."
    
    tool_calls = result.tool_calls
    
    # Check if the model called respond
    for tc in tool_calls:
        if tc.tool == RESPOND_TOOL_NAME:
            message = tc.args.get("message", "")
            return f"MODEL SAYS: {message}"
    
    executed = [tc.tool for tc in tool_calls]
    done = respond_guardrails.record(executed)
    return f"executed {executed}" + (" -- DONE" if done else "")
```

Source: [examples/foreign_loop.py:83-114]()

---

## Public API Reference

```python
from forge.guardrails import (
    # Bundled API
    CheckResult,
    Guardrails,
    
    # Granular components
    ErrorTracker,
    Nudge,
    ResponseValidator,
    StepCheck,
    StepEnforcer,
    ValidationResult,
)
```

Source: [src/forge/guardrails/__init__.py:7-27]()

---

## Design Rationale

The guardrails middleware was designed to solve the foreign-loop integration problem described in ADR-011. The core principle is **orthogonal concerns**: each component handles one aspect of reliability, and the `Guardrails` bundler coordinates them without hiding their individual behavior.

Key design decisions:

| Decision | Rationale |
|----------|-----------|
| Two-method API | Simple contract; orchestrator controls the loop |
| `record()` returns `bool` | `True` signals terminal tool reached; orchestrator decides when to stop |
| `Nudge` dataclass | Structured message type allows metadata (e.g., tier) beyond raw string |
| `rescue_enabled` | Opt-in parsing from plain text; some backends emit structured output natively |
| `frozenset[str]` for terminal_tool | Multiple terminal tools supported without API change |

Source: [src/forge/guardrails/__init__.py:1-10]()

---

<a id='page-proxy-server'></a>

## Proxy Server Setup

### Related Pages

Related topics: [Backend Clients](#page-backend-clients)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [src/forge/proxy/__main__.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)
- [src/forge/proxy/proxy.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/proxy.py)
- [src/forge/proxy/__init__.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__init__.py)
- [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
- [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)
- [scripts/smoke_test_proxy.py](https://github.com/antoinezambelli/forge/blob/main/scripts/smoke_test_proxy.py)
</details>

# Proxy Server Setup

## Overview

The forge proxy server is an OpenAI-compatible HTTP proxy that transparently applies forge's guardrail stack to any backend that speaks the Chat Completions API. It acts as a drop-in replacement for local model servers, enabling existing OpenAI-compatible clients to benefit from forge's reliability layer without code changes.

Source: [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)

## Architecture

The proxy operates in two distinct modes, determined at startup:

```mermaid
graph TD
    subgraph "Managed Mode"
        P["ProxyServer<br/>:8081"] --> SM["ServerManager"]
        SM --> BE["llama-server<br/>llamafile<br/>ollama<br/>:8080"]
    end
    
    subgraph "External Mode"
        P2["ProxyServer<br/>:8081"] --> BU["User-managed<br/>Backend<br/>:8080"]
    end
    
    C["OpenAI Client"] --> P
    C2["OpenAI Client"] --> P2
```

### Managed Mode

In managed mode, forge starts and controls the backend process lifecycle. The `ServerManager` class handles:

- Backend binary discovery and execution
- Model loading and initialization
- Health verification via `/props` endpoint polling
- Graceful shutdown and restart

Source: [src/forge/proxy/proxy.py:1-40](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/proxy.py)

### External Mode

In external mode, the proxy connects to a user-managed backend. This is useful when:

- The backend runs on a different machine or container
- Custom backend configurations are required
- The backend is managed by an external orchestration system

Source: [src/forge/proxy/proxy.py:35-45](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/proxy.py)

## Supported Backends

| Backend | Description | Requirements |
|---------|-------------|--------------|
| `llamaserver` | Llama.cpp's HTTP server | Local GGUF model file |
| `llamafile` | Mozilla's single-file model executable | Single-file executable |
| `ollama` | Ollama local inference server | Ollama runtime + model pulled |

Source: [src/forge/proxy/__main__.py:25-29](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)

## CLI Usage

### Basic Invocation

```bash
# External mode — you manage the backend
python -m forge.proxy --backend-url http://localhost:8080 --port 8081

# Managed mode — forge starts llama-server and proxy together
python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081

# Managed mode with ollama
python -m forge.proxy --backend ollama --model llama3.2 --port 8081
```

Source: [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)

### Command-Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--backend-url` | string | - | External backend URL (mutually exclusive with `--backend`) |
| `--backend` | choice | - | Backend type: `llamaserver`, `llamafile`, `ollama` |
| `--model` | string | - | Model name (required for ollama) |
| `--gguf` | string | - | Path to GGUF file (llamaserver/llamafile) |
| `--backend-port` | int | 8080 | Backend port for managed mode |
| `--budget-mode` | choice | backend | Context budget: `backend`, `manual`, `forge-full`, `forge-fast` |
| `--budget-tokens` | int | - | Manual token budget override |
| `--extra-flags` | list | - | Additional backend CLI flags |
| `--host` | string | 127.0.0.1 | Proxy listen host |
| `--port` | int | 8081 | Proxy listen port |
| `--serialize` | flag | - | Force request serialization |
| `--no-serialize` | flag | - | Disable request serialization |
| `--max-retries` | int | 3 | Max retries per request |
| `--no-rescue` | flag | - | Disable rescue parsing |
| `-v, --verbose` | flag | - | Enable debug logging |

Source: [src/forge/proxy/__main__.py:13-53](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)
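
The mutual exclusion between `--backend-url` and `--backend` is what selects external or managed mode. The following `argparse` sketch shows one way such a rule can be expressed. It covers only a few of the arguments from the table, and both the `required=True` group and the helper `proxy_mode` are assumptions made for illustration, not a copy of forge's `__main__.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser mirroring a few rows of the table above."""
    parser = argparse.ArgumentParser(prog="forge-proxy-sketch")
    # Exactly one of the two mode-selecting options must be given
    # (requiring one is an assumption of this sketch).
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--backend-url")
    mode.add_argument("--backend", choices=["llamaserver", "llamafile", "ollama"])
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8081)
    parser.add_argument("--max-retries", type=int, default=3)
    return parser


def proxy_mode(args: argparse.Namespace) -> str:
    """Hypothetical helper: name the mode implied by the parsed arguments."""
    return "external" if args.backend_url else "managed"


args = build_parser().parse_args(["--backend-url", "http://localhost:8080"])
print(proxy_mode(args), args.host, args.port)
```

Passing both options makes `argparse` exit with a usage error, which keeps the two modes from being combined by accident.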

## Programmatic API

### ProxyServer Class

The `ProxyServer` class provides a programmatic interface for embedding the proxy in Python applications.

```python
from forge.proxy import ProxyServer

# External mode
proxy = ProxyServer(backend_url="http://localhost:8080")
proxy.start()
print(f"Proxy running at {proxy.url}")  # http://127.0.0.1:8081
# ... use proxy ...
proxy.stop()
```

```python
# Managed mode
proxy = ProxyServer(
    backend="llamaserver",
    gguf="model.gguf",
    budget_mode="forge-fast",
    port=8081
)
proxy.start()
proxy.stop()  # Stops both backend and proxy
```

Source: [src/forge/proxy/proxy.py:50-75](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/proxy.py)

### Constructor Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `backend_url` | `str \| None` | None | External backend URL |
| `backend` | `str \| None` | None | Backend type: `llamaserver`, `llamafile`, `ollama` |
| `model` | `str \| None` | None | Model name for ollama |
| `gguf` | `str \| Path \| None` | None | Path to GGUF file |
| `backend_port` | int | 8080 | Backend port |
| `budget_mode` | BudgetMode | BudgetMode.BACKEND | Context budget strategy |
| `budget_tokens` | int | - | Manual token budget |
| `extra_flags` | `list[str] \| None` | None | Additional CLI flags |
| `host` | str | 127.0.0.1 | Listen host |
| `port` | int | 8081 | Listen port |
| `serialize` | `bool \| None` | None | Request serialization control |
| `max_retries` | int | 3 | Max retries per request |
| `rescue_enabled` | bool | True | Enable rescue parsing |

Source: [src/forge/proxy/proxy.py:56-100](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/proxy.py)

### Lifecycle Methods

| Method | Description |
|--------|-------------|
| `start()` | Start the proxy (blocks until ready, max 120s timeout) |
| `stop()` | Stop the proxy and managed backend (30s shutdown timeout) |
| `url` | Property returning the proxy's base URL |

Source: [src/forge/proxy/proxy.py:102-125](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/proxy.py)

## Respond Tool Injection

### Purpose

Small local models (~8B parameters) cannot reliably choose between text output and tool calls. The proxy automatically injects a synthetic `respond` tool when tools are present in the request, forcing the model into tool-calling mode.

### Behavior

1. When the request contains `tools`, forge injects a `respond(message="...")` tool into the tools list
2. The model calls `respond(message="...")` instead of producing bare text
3. The `respond` call is stripped from the outbound response
4. The client receives a normal text response with `finish_reason: "stop"`

This keeps the model in tool-calling mode where forge's full guardrail stack applies.

Source: [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)

```mermaid
sequenceDiagram
    participant C as OpenAI Client
    participant P as ProxyServer
    participant B as Backend
    
    C->>P: POST /v1/chat/completions<br/>(with tools)
    P->>P: Inject respond tool
    P->>B: Forward request<br/>(tools + respond)
    B->>P: respond(message="answer")
    P->>P: Strip respond call
    P->>C: Normal text response<br/>(finish_reason: "stop")
```
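
The inject-and-strip round trip can be sketched with plain dictionaries in the OpenAI Chat Completions shape. The functions `inject_respond` and `strip_respond` and the tool description text are hypothetical; this illustrates the behavior described above and is not the proxy's actual code.

```python
import copy
import json

RESPOND = "respond"  # tool name taken from the description above


def inject_respond(request: dict) -> dict:
    """Return a copy of the request with a synthetic respond tool appended."""
    if not request.get("tools"):
        return request  # no tools in the request, so nothing is injected
    patched = copy.deepcopy(request)  # leave the caller's request untouched
    patched["tools"].append({
        "type": "function",
        "function": {
            "name": RESPOND,
            "description": "Reply to the user with a text message.",
            "parameters": {
                "type": "object",
                "properties": {"message": {"type": "string"}},
                "required": ["message"],
            },
        },
    })
    return patched


def strip_respond(choice: dict) -> dict:
    """Turn a respond(message=...) tool call into a plain text choice."""
    calls = choice.get("message", {}).get("tool_calls") or []
    for call in calls:
        fn = call["function"]
        if fn["name"] == RESPOND:
            # OpenAI-style tool calls carry their arguments as a JSON string.
            message = json.loads(fn["arguments"]).get("message", "")
            return {
                "index": choice.get("index", 0),
                "message": {"role": "assistant", "content": message},
                "finish_reason": "stop",
            }
    return choice  # real tool calls pass through unchanged
```

The client never sees the `respond` tool: it sends its own tool list and receives either a genuine tool call or an ordinary text message.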

## Context Budget Modes

The proxy supports different strategies for managing context window usage:

| Mode | Description |
|------|-------------|
| `backend` | Let the backend manage context (default) |
| `manual` | Use `--budget-tokens` for fixed budget |
| `forge-full` | Full tiered compaction strategy |
| `forge-fast` | Fast tiered compaction (reduced) |

Source: [src/forge/proxy/__main__.py:35-38](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)

### Tiered Compaction

The `forge-full` and `forge-fast` modes utilize `TieredCompact`, a three-phase compaction strategy:

1. **Truncate** — Remove oldest messages
2. **Drop results** — Remove tool result content
3. **Sliding window** — Maintain recent context

Source: [src/forge/proxy/proxy.py:22](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/proxy.py)

## Request Serialization

By default, the proxy decides whether to serialize requests based on backend capabilities. Two flags override that decision:

| Flag | Behavior |
|------|----------|
| (none) | Proxy decides based on backend capabilities |
| `--serialize` | Force sequential request processing |
| `--no-serialize` | Allow concurrent processing |

Source: [src/forge/proxy/__main__.py:31-34](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)
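
Forced serialization can be pictured as a single lock around request handling. The sketch below uses `asyncio.Lock` and a hypothetical `RequestGate` class to show the difference between the two flags. How the proxy actually implements serialization, and how it picks a default when neither flag is set, is not shown in the source.

```python
import asyncio
from contextlib import asynccontextmanager


class RequestGate:
    """Hypothetical gate: one request at a time when serialize is True."""

    def __init__(self, serialize: bool) -> None:
        self._lock = asyncio.Lock() if serialize else None

    @asynccontextmanager
    async def slot(self):
        if self._lock is None:
            yield  # concurrent mode: no waiting
        else:
            async with self._lock:  # serialized mode: wait for our turn
                yield


async def max_concurrency(serialize: bool, n: int = 3) -> int:
    """Run n fake requests and report the peak number in flight."""
    gate = RequestGate(serialize)
    active = 0
    peak = 0

    async def handle() -> None:
        nonlocal active, peak
        async with gate.slot():
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the backend call
            active -= 1

    await asyncio.gather(*(handle() for _ in range(n)))
    return peak


print(asyncio.run(max_concurrency(serialize=True)))
print(asyncio.run(max_concurrency(serialize=False)))
```

Serializing trades throughput for predictability, which can matter for a backend that handles only one generation at a time.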

## Sampling Parameters Pass-Through

The proxy forwards OpenAI-compatible sampling fields directly to the backend without modification:

- `temperature`
- `top_p`
- `top_k`
- `min_p`
- `repeat_penalty`
- `presence_penalty`
- `seed`

Source: [CHANGELOG.md](https://github.com/antoinezambelli/forge/blob/main/CHANGELOG.md)

To use model-card-recommended sampling in proxy mode:

```python
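import httpx

from forge.clients import get_sampling_defaults

# Look up the recommended sampling parameters for the model
sampling = get_sampling_defaults("ministral-3-8b-instruct")

# Send them to the proxy as part of the request body
client = httpx.Client(base_url="http://127.0.0.1:8081", timeout=120.0)
response = client.post("/v1/chat/completions", json={
    "model": "ministral-3-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    **sampling,
})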
```

## Signal Handling

The proxy gracefully handles shutdown signals:

- `SIGINT` (Ctrl+C) — Immediate shutdown
- `SIGTERM` — Graceful shutdown

The main thread uses a timed sleep loop (`time.sleep(0.1)`) to allow Python to deliver signals between iterations, ensuring proper shutdown on Windows.

Source: [src/forge/proxy/__main__.py:95-105](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)

## Testing with Smoke Test Script

The repository includes a smoke test at `scripts/smoke_test_proxy.py` that:

1. Starts a mock backend on port 18080
2. Launches the proxy in external mode on port 18081
3. Verifies health endpoint
4. Sends a test chat completion request
5. Validates the response structure

```bash
python scripts/smoke_test_proxy.py
```

Source: [scripts/smoke_test_proxy.py](https://github.com/antoinezambelli/forge/blob/main/scripts/smoke_test_proxy.py)

## Health Endpoint

The proxy exposes a `/health` endpoint for monitoring:

```bash
curl http://127.0.0.1:8081/health
```

Source: [scripts/smoke_test_proxy.py:70](https://github.com/antoinezambelli/forge/blob/main/scripts/smoke_test_proxy.py)

## Configuration Example: Complete Setup

```bash
# Start llama-server with custom flags, proxy it
python -m forge.proxy \
    --backend llamaserver \
    --gguf ./models/ministral-3-8b-instruct-q8_0.gguf \
    --model ministral-3-8b-instruct \
    --budget-mode forge-full \
    --backend-port 8080 \
    --port 8081 \
    --host 0.0.0.0 \
    --extra-flags --reasoning-format auto \
    --verbose
```

Then configure your client:

```python
import asyncio

import httpx


async def main() -> None:
    # Point the client at the proxy instead of the backend.
    async with httpx.AsyncClient(
        base_url="http://localhost:8081/v1",
        timeout=120.0,
    ) as client:
        response = await client.post("/chat/completions", json={
            "model": "ministral-3-8b-instruct",
            "messages": [
                {"role": "user", "content": "What's the weather in Paris?"}
            ],
            "tools": [
                {
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "city": {"type": "string"}
                            },
                            "required": ["city"]
                        }
                    }
                }
            ]
        })
        print(response.json())


asyncio.run(main())
```

---

<a id='page-backend-clients'></a>

## Backend Clients

### Related Pages

Related topics: [Backend Setup Guide](#page-backend-setup), [Model Selection Guide](#page-model-guide)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [src/forge/clients/base.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/base.py)
- [src/forge/clients/ollama.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/ollama.py)
- [src/forge/clients/llamafile.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/llamafile.py)
- [src/forge/clients/anthropic.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/anthropic.py)
- [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)
- [src/forge/core/workflow.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/core/workflow.py)
- [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
- [src/forge/proxy/__main__.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)
</details>

# Backend Clients

## Overview

The Backend Clients subsystem provides a unified abstraction layer over various LLM backends, enabling forge to interact with different inference engines through a consistent interface. This modular design allows users to switch between Ollama, llamafile, and Anthropic backends without modifying workflow code.

Each client handles backend-specific communication protocols, response parsing, streaming, and tool call extraction while exposing a common async API for send operations and context resolution.

Source: [src/forge/clients/base.py:1-50]()

## Architecture

### Client Hierarchy

```mermaid
graph TD
    A[BaseClient] --> B[OllamaClient]
    A --> C[LlamafileClient]
    A --> D[AnthropicClient]
    
    E[sampling_defaults.py] --> B
    E --> C
    E --> D
    
    F[WorkflowRunner] --> B
    F --> C
    F --> D
```

All clients inherit from `BaseClient`, which defines the core async interface including `send()`, `send_stream()`, and `get_context_length()` methods. Backend-specific implementations override these methods to handle vendor-specific APIs and response formats.

Source: [src/forge/clients/base.py:1-100]()

### Common Interface

All clients implement the following async methods:

| Method | Purpose |
|--------|---------|
| `send(messages, tools, **kwargs)` | Send a request and receive a complete response |
| `send_stream(messages, tools, **kwargs)` | Stream responses as an async generator |
| `get_context_length()` | Query the backend for maximum context window |
| `stop()` | Stop any ongoing generation |

Source: [src/forge/clients/base.py:50-150]()

## OllamaClient

### Purpose

The `OllamaClient` connects to a local Ollama server instance, supporting both standard models and GGUF-formatted models served through Ollama's model management system.

Source: [src/forge/clients/ollama.py:1-100]()

### Configuration Options

```python
OllamaClient(
    model: str,                          # Model name (e.g., "qwen3:8b-q4_K_M")
    base_url: str = "http://localhost:11434",
    recommended_sampling: bool = False, # Use verified per-model sampling params
    **kwargs                            # Passed to httpx client
)
```

### Key Features

- **Recommended Sampling**: When `recommended_sampling=True`, the client retrieves verified sampling parameters from `forge.clients.sampling_defaults` for known models. Under the strict policy (`strict=True` in `apply_sampling_defaults()`), a model that is missing from the map raises `UnsupportedModelError`.

- **Streaming Support**: Full streaming support with token-level async generation through `send_stream()`.

- **Tool Call Extraction**: Parses Ollama's JSON tool call format and converts to forge's internal `ToolCall` format.

Source: [src/forge/clients/sampling_defaults.py:1-80]()

## LlamafileClient

### Purpose

The `LlamafileClient` communicates with llamafile or llama-server instances, providing support for GGUF models served directly without Ollama's model management layer.

Source: [src/forge/clients/llamafile.py:1-100]()

### Context Resolution

Unlike Ollama, llamafile and llama-server require querying the `/props` endpoint to determine the configured context length:

```python
async def get_context_length(self) -> int | None:
    """Query the Llamafile /props endpoint for configured context length."""
    base = self.base_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]

    resp = await self._http.get(f"{base}/props")
    data = resp.json()
    n_ctx = data.get("default_generation_settings", {}).get("n_ctx")
    return int(n_ctx) if n_ctx is not None else None
```

Source: [src/forge/clients/llamafile.py:180-200]()
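
The URL handling and payload parsing in that method can be tried without a running server. The sketch below extracts the two steps into pure functions; the names `props_url` and `parse_n_ctx` are hypothetical and exist only for this illustration.

```python
from __future__ import annotations


def props_url(base_url: str) -> str:
    """Derive the /props URL from a base URL that may end in /v1."""
    base = base_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]  # /props lives at the server root, not under /v1
    return f"{base}/props"


def parse_n_ctx(payload: dict) -> int | None:
    """Pull n_ctx out of a /props-style payload, or None if it is absent."""
    n_ctx = payload.get("default_generation_settings", {}).get("n_ctx")
    return int(n_ctx) if n_ctx is not None else None


print(props_url("http://localhost:8080/v1/"))
print(parse_n_ctx({"default_generation_settings": {"n_ctx": 8192}}))
```

Returning `None` rather than raising lets the caller decide whether a missing context length is fatal.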

### Tool Call Modes

The client supports multiple tool call parsing strategies:

| Mode | Description |
|------|-------------|
| `native` | Uses backend's native tool call format |
| `function` | Parses `<function=name>...</function>` style tags |
| `prompt` | Extracts tool calls from prompted responses |

Source: [src/forge/clients/llamafile.py:100-180]()

## AnthropicClient

### Purpose

The `AnthropicClient` integrates with Anthropic's Claude API, enabling forge workflows to leverage Claude Opus, Sonnet, and Haiku models.

Source: [src/forge/clients/anthropic.py:1-100]()

### Key Differences

- No hardcoded temperature defaults — relies on Anthropic API's own defaults
- Supports Anthropic-specific headers and request formatting
- Compatible with tools via Anthropic's tool use API

## Sampling Defaults System

### Overview

The `sampling_defaults` module provides verified per-model sampling parameters sourced from HuggingFace model cards. This ensures optimal generation quality for supported models without requiring users to manually tune hyperparameters.

Source: [src/forge/clients/sampling_defaults.py:1-50]()

### Supported Parameters

| Parameter | Description | Typical Range |
|-----------|-------------|---------------|
| `temperature` | Sampling temperature | 0.0 - 1.0 |
| `top_p` | Nucleus sampling threshold | 0.0 - 1.0 |
| `top_k` | Top-k sampling | 1 - 100 |
| `min_p` | Minimum probability threshold | 0.0 - 1.0 |
| `repeat_penalty` | Repetition penalty | 0.0 - 2.0 |
| `presence_penalty` | Presence penalty (OpenAI compat) | -2.0 - 2.0 |

### Policy Behavior

The `apply_sampling_defaults()` function implements a four-quadrant policy:

```python
def apply_sampling_defaults(model: str, *, strict: bool) -> dict[str, float | int]:
    """Apply the recommended-sampling policy for model."""
    in_map = model in MODEL_SAMPLING_DEFAULTS
    if strict:
        if not in_map:
            raise UnsupportedModelError(model)
        return dict(MODEL_SAMPLING_DEFAULTS[model])
    
    # strict=False: one-shot INFO log if known, else silent
    if in_map and model not in _INFO_LOGGED:
        log.info("Recommended sampling params exist for %r...", model)
        _INFO_LOGGED.add(model)
    return {}
```

| `strict` | Model in Map | Behavior |
|----------|--------------|----------|
| `True` | Yes | Return dict copy |
| `True` | No | Raise `UnsupportedModelError` |
| `False` | Yes | One-shot INFO log; return `{}` |
| `False` | No | Return `{}` (silent) |

Source: [src/forge/clients/sampling_defaults.py:60-120]()

### Verified Models

The following model families are currently supported with verified sampling parameters:

- Qwen3 / Qwen3.5 / Qwen3.6
- Qwen3-Coder
- Gemma 4
- Mistral Small 3.2
- Devstral Small 2
- Ministral 3 Instruct + Reasoning
- Mistral Nemo
- Granite 4.0

Each entry includes an inline HuggingFace card URL comment for verification.

Source: [src/forge/clients/sampling_defaults.py:50-80]()

## Tool Call Processing

### Extraction Flow

```mermaid
graph TD
    A[LLM Response] --> B{Response Type}
    B -->|tool_calls| C[extract_tool_call]
    B -->|text| D[TextResponse]
    
    C --> E{Tool Call Format}
    E -->|OpenAI style| F[Parse name + arguments]
    E -->|function tags| G[Parse XML-style tags]
    E -->|dict style| H[Parse dict with name field]
    
    F --> I[ToolCall object]
    G --> I
    H --> I
```

### Supported Formats

| Backend | Format | Example |
|---------|--------|---------|
| Ollama | OpenAI-style function calls | `{"name": "get_weather", "arguments": {"city": "Paris"}}` |
| Llamafile | Function tags or native | `<function=name><parameter=city>Paris</parameter></function>` |
| Anthropic | Claude `tool_use` blocks | `{"name": "get_weather", "input": {"city": "Paris"}}` |

Source: [src/forge/core/workflow.py:1-50]()
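
The function-tag format in the table is simple enough to parse with two regular expressions. The sketch below is a hypothetical parser for that one format. It returns plain dictionaries instead of forge's `ToolCall` objects, keeps every parameter value as a string, and makes no claim about how forge's own extraction handles nesting or malformed tags.

```python
import re

# Non-greedy bodies so that several calls in one response stay separate.
FUNCTION_RE = re.compile(r"<function=([\w.-]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([\w.-]+)>(.*?)</parameter>", re.DOTALL)


def parse_function_tags(text: str) -> list[dict]:
    """Extract {"tool": name, "args": {...}} entries from tagged text."""
    calls = []
    for name, body in FUNCTION_RE.findall(text):
        args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"tool": name, "args": args})
    return calls


sample = "<function=get_weather><parameter=city>Paris</parameter></function>"
print(parse_function_tags(sample))
```

Text without any tags yields an empty list, which a caller can treat as a plain text response.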

## Proxy Mode Integration

### Request Passthrough

When running in proxy mode, the proxy passes the following OpenAI-compatible body fields through to the backend unmodified, per request:

- `temperature`
- `top_p`
- `top_k`
- `min_p`
- `repeat_penalty`
- `presence_penalty`
- `seed`

For per-model recommended sampling in proxy mode, the calling client must look up `forge.clients.get_sampling_defaults(model)` and include the values in the request body.

Source: [src/forge/proxy/__main__.py:1-50]()

## Usage Examples

### Basic Workflow with OllamaClient

```python
import asyncio

from forge import OllamaClient, WorkflowRunner, ContextManager, TieredCompact

async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True  # Use verified sampling params
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    # ... run workflows

asyncio.run(main())
```

### Per-Call Sampling Override

```python
response = await client.send(
    messages,
    tools,
    sampling={
        "temperature": 0.7,
        "top_p": 0.9
    }
)
```

The caller's explicit non-None fields merge with client-level defaults without mutating the original configuration.
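
That merge rule can be sketched in a few lines. The function `merge_sampling` is hypothetical and only illustrates "explicit non-None fields win, defaults stay untouched". It is not forge's internal implementation.

```python
from __future__ import annotations


def merge_sampling(defaults: dict, overrides: dict | None) -> dict:
    """Return a new dict: defaults overlaid with the non-None overrides."""
    merged = dict(defaults)  # copy, so the client-level defaults never change
    if overrides:
        merged.update({k: v for k, v in overrides.items() if v is not None})
    return merged


client_defaults = {"temperature": 0.2, "top_p": 0.95}
per_call = {"temperature": 0.7, "top_p": None}
print(merge_sampling(client_defaults, per_call))
print(client_defaults)
```

Treating `None` as "not specified" lets a caller pass a partially filled structure without wiping out defaults it did not mean to change.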

## Error Handling

| Error | Cause | Resolution |
|-------|-------|------------|
| `UnsupportedModelError` | Model not in sampling defaults map | Add model to `sampling_defaults.py` or pass `recommended_sampling=False` |
| `httpx.HTTPError` | Backend unreachable | Verify backend is running on correct port |
| `BudgetResolutionError` | Cannot determine context length | Check backend `/props` endpoint returns `n_ctx` |

Source: [src/forge/server.py:1-50]()

## Backend Server Management

### ServerManager Integration

Forge can auto-manage backend servers through `ServerManager`, which handles starting, stopping, and context resolution:

```python
from forge.server import BudgetMode, setup_backend

# Run inside an async function; `client` is an already-constructed backend client.
server, ctx = await setup_backend(
    backend="ollama",
    model="qwen3:8b-q4_K_M",
    budget_mode=BudgetMode.FORGE_FAST,
    client=client,
)
```

### Supported Backends

| Backend | Model Specification | Port | Features |
|---------|---------------------|------|----------|
| `ollama` | Model name string | 11434 | Auto model management |
| `llamaserver` | GGUF file path | 8080 | Direct GGUF serving |
| `llamafile` | GGUF file path | 8080 | Single-file server |

Source: [src/forge/server.py:50-150]()

---

<a id='page-backend-setup'></a>

## Backend Setup Guide

### Related Pages

Related topics: [Backend Clients](#page-backend-clients), [Model Selection Guide](#page-model-guide)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
- [src/forge/proxy/__main__.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)
- [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)
- [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)
- [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)
- [CHANGELOG.md](https://github.com/antoinezambelli/forge/blob/main/CHANGELOG.md)
</details>

# Backend Setup Guide

## Overview

This guide covers how to configure, initialize, and manage LLM backend servers within forge. Forge supports three backend types (**Ollama**, **llama-server**, and **llamafile**), each with distinct initialization patterns, context management strategies, and operational characteristics. Understanding these differences matters when running forge's `Workflow` and `WorkflowRunner` components.

Forge abstracts backend management through two primary entry points: the `ServerManager` class (for direct server lifecycle control) and the `setup_backend()` function (a high-level async factory that combines server startup with `ContextManager` creation). Source: [src/forge/server.py:1-50]()

## Supported Backend Types

Forge supports three backend implementations, each targeting different deployment scenarios:

| Backend | Identity | Configuration | Use Case |
|---------|----------|---------------|----------|
| `ollama` | Model name (e.g., `qwen3:8b`) | Uses Ollama's native model management | Quick local development, model switching |
| `llamaserver` | GGUF file path | Requires explicit GGUF path and context size | Production GGUF inference with fine control |
| `llamafile` | GGUF file path | Auto-discovers llamafile runtime | Single-file distribution, portable setups |

The backend type is determined at `ServerManager` instantiation and cannot be changed afterward. Source: [src/forge/server.py:140-145]()

### Ollama Backend

The Ollama backend leverages Ollama's built-in model management system. Models are pulled and managed through the Ollama CLI rather than requiring manual GGUF file handling.

```python
server = ServerManager(backend="ollama", port=8080)
await server.start(model="qwen3:8b-q4_K_M", gguf_path="", mode="native")
```

**Key constraints:**
- Does not accept `gguf_path` (use the `model` parameter instead)
- Requires `model` to be specified
- VRAM cleanup between model switches is handled via `ollama stop`. Source: [src/forge/server.py:170-180]()

### Llama-server and Llamafile Backends

Both `llamaserver` and `llamafile` backends operate on GGUF files directly. The server identity is derived from the GGUF path, enabling cache-equality checks for server reuse. Source: [src/forge/server.py:185-200]()

```python
# Llama-server
server = ServerManager(backend="llamaserver", port=8080)
await server.start(model="identity", gguf_path="/models/qwen3-8b-q4_K_M.gguf", mode="native")

# Llamafile (auto-discovers runtime)
server = ServerManager(backend="llamafile", port=8080)
await server.start(model="identity", gguf_path="/models/llamafile-binary", mode="native")
```

## ServerManager Architecture

The `ServerManager` class encapsulates all backend server lifecycle operations, providing a unified interface across different backend types.

### State Management

```mermaid
graph TD
    A[ServerManager.__init__] --> B[_proc: Popen | None]
    A --> C[_current_model: str | None]
    A --> D[_current_mode: str | None]
    A --> E[_current_ctx: int | None]
    A --> F[_current_flags: tuple]
    A --> G[_current_cache_type_k/v: str | None]
    A --> H[_current_n_slots: int | None]
    A --> I[_current_kv_unified: bool]
    
    J[ServerManager.start] --> K{Cache Hit?}
    K -->|Yes| L[Return - reuse existing]
    K -->|No| M[await stop]
    M --> N[Build command flags]
    N --> O[Spawn subprocess]
```

### Cache Equality Check

Before starting a new server instance, `ServerManager` checks whether an existing server matches the requested configuration. This prevents unnecessary VRAM allocation and model reloading. Source: [src/forge/server.py:50-65]()

```python
flags = tuple(extra_flags) if extra_flags else ()
if (
    self._current_model == model
    and self._current_mode == mode
    and self._current_ctx == ctx_override
    and self._current_flags == flags
    and self._current_cache_type_k == cache_type_k
    and self._current_cache_type_v == cache_type_v
    and self._current_n_slots == n_slots
    and self._current_kv_unified == kv_unified
):
    return  # Reuse existing server
```
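
The same comparison can be expressed by bundling the tracked fields into one immutable value and comparing it as a whole. The sketch below uses a frozen dataclass. `ServerConfig` and `ReuseTracker` are hypothetical names, and the real `ServerManager` keeps these values as separate attributes, as the excerpt above shows.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ServerConfig:
    """Every field that must match for a running server to be reused."""

    model: str
    mode: str
    ctx_override: int | None = None
    flags: tuple[str, ...] = ()
    cache_type_k: str | None = None
    cache_type_v: str | None = None
    n_slots: int | None = None
    kv_unified: bool = False


class ReuseTracker:
    """Counts how often a (pretend) server would really be restarted."""

    def __init__(self) -> None:
        self.current: ServerConfig | None = None
        self.starts = 0

    def ensure(self, config: ServerConfig) -> bool:
        """Return True if a restart was needed, False on a cache hit."""
        if self.current == config:  # dataclass equality compares all fields
            return False
        self.current = config
        self.starts += 1
        return True


tracker = ReuseTracker()
cfg = ServerConfig(model="model.gguf", mode="native")
print(tracker.ensure(cfg), tracker.ensure(cfg), tracker.starts)
```

Grouping the fields this way means a newly tracked setting cannot be forgotten in the equality check, because the comparison is generated from the field list.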

### Server Initialization Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `backend` | `str` | Required | Backend type: `"ollama"` \| `"llamaserver"` \| `"llamafile"` |
| `port` | `int` | `8080` | Server listen port (llama-server / llamafile only) |
| `models_dir` | `str \| Path` | `None` | Directory containing GGUF files |

## Startup Parameters

The `start()` method accepts numerous parameters for fine-grained control over server behavior:

| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | `str` | Model identity (Ollama: model name; others: GGUF path as string) |
| `gguf_path` | `str \| Path` | Path to GGUF file for llamaserver/llamafile |
| `mode` | `str` | `"native"` or `"prompt"` reasoning mode |
| `extra_flags` | `list[str]` | Additional CLI flags passed to the server |
| `ctx_override` | `int \| None` | Override context window size (`-c <value>`) |
| `cache_type_k` | `str` | KV cache quantization type for keys (e.g., `"q8_0"`, `"q4_0"`) |
| `cache_type_v` | `str` | KV cache quantization type for values |
| `n_slots` | `int` | Concurrent slot count for multi-agent architectures |
| `kv_unified` | `bool` | Use unified KV cache across all slots |

Source: [src/forge/server.py:95-115]()
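
As a rough illustration of how these parameters could end up on a llama-server command line, the sketch below assembles an argument list. The helper `build_llama_server_args` is hypothetical. The flag spellings (`-m`, `--port`, `-c`, `--cache-type-k`, `--cache-type-v`, `--parallel`, `--kv-unified`) are llama.cpp's, but the mapping forge actually applies is an assumption here and may differ.

```python
from __future__ import annotations


def build_llama_server_args(
    gguf_path: str,
    port: int = 8080,
    ctx_override: int | None = None,
    cache_type_k: str | None = None,
    cache_type_v: str | None = None,
    n_slots: int | None = None,
    kv_unified: bool = False,
    extra_flags: list[str] | None = None,
) -> list[str]:
    """Hypothetical helper: turn start() style options into CLI arguments."""
    args = ["llama-server", "-m", gguf_path, "--port", str(port)]
    if ctx_override is not None:
        args += ["-c", str(ctx_override)]  # context window override
    if cache_type_k:
        args += ["--cache-type-k", cache_type_k]  # KV cache quantization, keys
    if cache_type_v:
        args += ["--cache-type-v", cache_type_v]  # KV cache quantization, values
    if n_slots is not None:
        args += ["--parallel", str(n_slots)]  # concurrent slots
    if kv_unified:
        args.append("--kv-unified")
    args += list(extra_flags or [])  # caller-supplied flags go last
    return args


args = build_llama_server_args(
    "/models/model.gguf", ctx_override=8192, cache_type_k="q8_0", n_slots=2
)
print(" ".join(args))
```

Only options that were explicitly set produce a flag, so the server's own defaults apply to everything else.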

## Budget Modes

Budget modes control how forge resolves the context window budget for the `ContextManager`. The `resolve_budget()` method maps `BudgetMode` enum values to actual token counts. Source: [src/forge/server.py:220-250]()

```mermaid
graph TD
    A[resolve_budget mode] --> B{MANUAL?}
    B -->|Yes, Ollama| C[Return manual_tokens]
    B -->|Yes, others| D[await get_server_context]
    B -->|No| E{Ollama?}
    E -->|Yes| F[await _ollama.full]
    E -->|No| G{Mode == FORGE_FAST?}
    G -->|Yes| H[await get_server_context ÷ 4]
    G -->|No| I[await get_server_context]
```

### Budget Resolution Table

| BudgetMode | Ollama Backend | Llama-server/Llamafile Backend |
|------------|----------------|--------------------------------|
| `MANUAL` | Returns `manual_tokens` parameter | Queries `/props` for `n_ctx` |
| `BACKEND` | Ollama's reported context length | Queries `/props` for `n_ctx` |
| `FORGE_FAST` | `n_ctx / 4` | `n_ctx / 4` |
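
The table can be read as a small pure function. The sketch below follows the table rows literally. The `BudgetMode` enum is a local stand-in containing only the members named above (its string values are made up), and `resolve_budget` here takes the backend-reported context as an argument instead of querying a server, so it is not forge's method.

```python
from __future__ import annotations

from enum import Enum


class BudgetMode(Enum):
    """Local stand-in for forge's enum; values are illustrative."""

    MANUAL = "manual"
    BACKEND = "backend"
    FORGE_FAST = "forge_fast"


def resolve_budget(
    mode: BudgetMode,
    backend: str,
    reported_ctx: int,
    manual_tokens: int | None = None,
) -> int:
    """Map a budget mode to a token count, following the table above."""
    if mode is BudgetMode.MANUAL and backend == "ollama":
        if manual_tokens is None:
            raise ValueError("MANUAL mode on Ollama requires manual_tokens")
        return manual_tokens
    if mode is BudgetMode.FORGE_FAST:
        return reported_ctx // 4  # a quarter of the reported context
    # BACKEND everywhere, and MANUAL on llama-server / llamafile
    return reported_ctx


print(resolve_budget(BudgetMode.FORGE_FAST, "llamaserver", 32768))
print(resolve_budget(BudgetMode.MANUAL, "ollama", 32768, manual_tokens=4096))
```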

## High-Level Setup with `setup_backend()`

For most use cases, prefer `setup_backend()`, which combines server startup with `ContextManager` creation. Source: [src/forge/server.py:280-330]()

```python
from forge.server import setup_backend, BudgetMode

async def example():
    client, ctx = await setup_backend(
        backend="llamaserver",
        gguf_path="/models/qwen3-8b-q4_K_M.gguf",
        budget_mode=BudgetMode.FORGE_FAST,
        client=None,  # Will create default client
    )
    # ... run workflows ...
    await client.close()
```

### `setup_backend()` Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `backend` | `str` | Required | Backend type |
| `model` | `str \| None` | `None` | Ollama model name |
| `gguf_path` | `str \| Path \| None` | `None` | GGUF file path |
| `budget_mode` | `BudgetMode` | `BudgetMode.BACKEND` | Context budget strategy |
| `manual_tokens` | `int \| None` | `None` | Required for `MANUAL` mode on Ollama |
| `client` | `Any \| None` | `None` | Existing client or `None` to create default |
| `mode` | `str` | `"native"` | Tool-calling mode (`"native"` or `"prompt"`) |
| `port` | `int` | `8080` | Server port |
| `extra_flags` | `list[str] \| None` | `None` | Additional backend flags |
| `on_compact` | `Callable \| None` | `None` | Callback for compaction events |
| `compact_threshold` | `float` | `0.75` | Compaction trigger threshold |
| `phase_thresholds` | `tuple` | `(0.5, 0.7, 0.9)` | Tiered compaction thresholds |

## Server Readiness Detection

Forge polls `/props` rather than `/health` to confirm readiness. This closes the gap between the moment `/health` reports OK and the moment `/props` becomes available. Source: [src/forge/server.py:260-278]()

```python
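import asyncio
import time

import httpx

async def wait_for_ready(self, timeout: float = 60.0) -> None:
    """Poll /props until the server reports generation settings (excerpt)."""
    url = f"http://localhost:{self._port}/props"
    deadline = time.monotonic() + timeout  # assumed: deadline derived from `timeout`
    async with httpx.AsyncClient() as client:  # assumed: a short-lived HTTP client
        while time.monotonic() < deadline:
            try:
                resp = await client.get(url)
                if resp.status_code == 200:
                    data = resp.json()
                    if "default_generation_settings" in data:
                        return  # model is loaded and serving
            except (httpx.ConnectError, httpx.ReadError, httpx.TimeoutException):
                pass  # server is not accepting connections yet
            await asyncio.sleep(2)
    # Timeout handling is elided in this excerpt.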

The readiness check looks for `default_generation_settings` in the response, which is a strong indicator that the model is fully loaded and serving. Source: [src/forge/server.py:260-278]()
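
To exercise this kind of polling without a live server, the HTTP call can be injected. The following sketch assumes a hypothetical `wait_until_ready` helper that takes any async `fetch_props` callable. The names, the `ConnectionError` handling, and the `TimeoutError` raised at the deadline are choices made for this example, not forge's behavior.

```python
import asyncio
import time
from typing import Awaitable, Callable


def props_ready(payload: dict) -> bool:
    """The readiness signal described above: generation settings are present."""
    return "default_generation_settings" in payload


async def wait_until_ready(
    fetch_props: Callable[[], Awaitable[dict]],
    timeout: float = 60.0,
    interval: float = 2.0,
) -> None:
    """Poll `fetch_props` until the payload looks ready or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if props_ready(await fetch_props()):
                return
        except ConnectionError:
            pass  # server not up yet; try again after the pause
        await asyncio.sleep(interval)
    raise TimeoutError(f"server not ready after {timeout} seconds")


async def _demo() -> int:
    # Scripted responses: refused connection, empty payload, then ready.
    responses = iter(
        [ConnectionError(), {}, {"default_generation_settings": {"n_ctx": 8192}}]
    )
    calls = 0

    async def fake_fetch() -> dict:
        nonlocal calls
        calls += 1
        item = next(responses)
        if isinstance(item, Exception):
            raise item
        return item

    await wait_until_ready(fake_fetch, timeout=5.0, interval=0.01)
    return calls


attempts = asyncio.run(_demo())
print(attempts)  # number of polls needed before the payload was ready
```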

## Proxy Server Configuration

Forge includes a proxy server (`forge.proxy`) that passes OpenAI-compatible sampling parameters through to backends. The proxy does not consult the sampling defaults map; it forwards whatever parameters the inbound request carries. Source: [src/forge/proxy/__main__.py:1-60]()

### Proxy CLI Options

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--backend-url` | `str` | Required | Target backend URL |
| `--backend` | `str` | Required | Backend type |
| `--model` | `str` | Required | Model identifier |
| `--gguf` | `str` | `""` | GGUF path (for non-Ollama) |
| `--budget-mode` | `str` | `"backend"` | Budget resolution mode |
| `--budget-tokens` | `int` | `None` | Manual token budget |
| `--host` | `str` | `127.0.0.1` | Proxy listen host |
| `--port` | `int` | `8081` | Proxy listen port |
| `--serialize` | `flag` | `None` | Force request serialization |
| `--max-retries` | `int` | `3` | Max retries per request |
| `--verbose` | `flag` | `False` | Enable debug logging |

### Proxy Sampling Passthrough

The proxy supports these OpenAI-compatible body fields:

| Parameter | Type | Description |
|-----------|------|-------------|
| `temperature` | `float` | Sampling temperature |
| `top_p` | `float` | Nucleus sampling threshold |
| `top_k` | `int` | Top-k sampling |
| `min_p` | `float` | Minimum probability threshold |
| `repeat_penalty` | `float` | Repetition penalty |
| `presence_penalty` | `float` | Presence penalty |
| `seed` | `int` | Deterministic sampling seed |

For per-model recommended sampling in proxy mode, callers should look up `forge.clients.get_sampling_defaults(model)` and include the values in the request body. Source: [src/forge/clients/sampling_defaults.py:1-50]()
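
A caller-side sketch of that step is shown below. `build_chat_body` is a hypothetical helper, and the `recommended` dict holds placeholder numbers rather than forge's recommendations; in real use it would come from `forge.clients.get_sampling_defaults(model)`. Only the fields listed in the passthrough table are copied into the body.

```python
from __future__ import annotations

# Fields the proxy is documented to pass through.
PASSTHROUGH_FIELDS = (
    "temperature",
    "top_p",
    "top_k",
    "min_p",
    "repeat_penalty",
    "presence_penalty",
    "seed",
)


def build_chat_body(model: str, messages: list[dict], sampling: dict) -> dict:
    """Hypothetical helper: merge supported sampling fields into a request body."""
    body: dict = {"model": model, "messages": messages}
    for key in PASSTHROUGH_FIELDS:
        value = sampling.get(key)
        if value is not None:  # skip unset fields so backend defaults apply
            body[key] = value
    return body


# Placeholder values for illustration only.
recommended = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "mirostat": 2}

body = build_chat_body(
    "qwen3:8b-q4_K_M",
    [{"role": "user", "content": "hi"}],
    recommended,
)
print(sorted(body))
```

The resulting dict can then be sent as the JSON body of a chat completion request to the proxy.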

## Per-Model Sampling Defaults

Forge ships verified per-model sampling recommendations for supported models. They apply only when explicitly opted into via `recommended_sampling=True`. Source: [src/forge/clients/sampling_defaults.py:50-80]()

### Supported Models

The sampling defaults map includes recommendations for:

- **Qwen3 / 3.5 / 3.6** series
- **Qwen3-Coder**
- **Gemma 4**
- **Mistral Small 3.2**
- **Devstral Small 2**
- **Ministral 3 Instruct + Reasoning**
- **Mistral Nemo**
- **Granite 4.0** (`h-micro`, `h-tiny`)

Each entry includes an inline HuggingFace model card URL for verification. Source: [CHANGELOG.md:0.6.0]()

### Sampling Policy

| `strict` | Model in Map | Behavior |
|----------|--------------|----------|
| `True` | Yes | Return dict copy |
| `True` | No | Raise `UnsupportedModelError` |
| `False` | Yes | One-shot INFO log; return `{}` |
| `False` | No | Return `{}` (silent) |

## Context Length Resolution

For non-Ollama backends, forge queries the server's `/props` endpoint to determine the configured context length. This value feeds into budget resolution. Source: [src/forge/server.py:200-220]()

```python
async def get_server_context(self) -> int:
    """Query /props for the server's actual n_ctx (non-Ollama backends)."""
    props = await self.query_props()
    ctx = props.get("default_generation_settings", {}).get("n_ctx")
    if ctx is None:
        # /props answered but did not report a context length
        raise BudgetResolutionError()
    return ctx

## Best Practices

### Server Reuse

Reuse a running server whenever its configuration matches the one you need. There is no need to check this manually: `ServerManager` performs the comparison internally, based on:

1. Model identity (name or GGUF path)
2. Mode (`native` or `prompt`)
3. Context override
4. CLI flags
5. KV cache quantization settings
6. Slot configuration

### VRAM Management

For Ollama backends, use `ollama stop` to cleanly unload models and free VRAM before switching to a different model. The llama-server and llamafile backends handle this by restarting the server. Source: [src/forge/server.py:200]()

### Graceful Shutdown

Always call `server.stop()` when finished to properly terminate the backend process:

```python
server = ServerManager(backend="llamaserver", port=8080)
try:
    await server.start(...)
    # ... work ...
finally:
    await server.stop()
```

### Multi-Agent Configurations

For multi-agent architectures requiring concurrent slots, configure `n_slots` and optionally `kv_unified=True` for shared KV cache across slots:

```python
await server.start(
    model="...",
    gguf_path="...",
    n_slots=4,
    kv_unified=True,  # Each slot can use full context
)
```
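
To make the trade-off concrete, the sketch below computes the context available to each slot. It assumes the context window is split evenly across slots when the KV cache is not unified, and that a unified cache lets a single slot grow up to the full window; `per_slot_context` is a hypothetical helper for this arithmetic only.

```python
def per_slot_context(n_ctx: int, n_slots: int, kv_unified: bool) -> int:
    """Upper bound on tokens one slot can hold, under the assumptions above."""
    if n_slots < 1:
        raise ValueError("n_slots must be at least 1")
    if kv_unified:
        return n_ctx  # slots draw from one shared pool
    return n_ctx // n_slots  # even split across slots


split = per_slot_context(32768, 4, kv_unified=False)
shared = per_slot_context(32768, 4, kv_unified=True)
print(split, shared)
```

With a unified cache the slots still compete for the same pool, so the full window is a ceiling for one slot, not a guarantee for all of them at once.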

## Quick Start Example

```python
import asyncio
from forge import WorkflowRunner
from forge.server import setup_backend, BudgetMode

async def main():
    # Setup backend with forge-managed context
    client, ctx = await setup_backend(
        backend="ollama",
        model="ministral-3:8b-instruct-2512-q4_K_M",
        budget_mode=BudgetMode.FORGE_FAST,
        recommended_sampling=True,
    )
    
    try:
        runner = WorkflowRunner(client=client, context_manager=ctx)
        # ... run workflows ...
    finally:
        await client.close()

asyncio.run(main())
```

For GGUF-based setups:

```python
import asyncio

from forge.server import setup_backend, BudgetMode

async def main():
    client, ctx = await setup_backend(
        backend="llamaserver",
        gguf_path="/path/to/model-q4_K_M.gguf",
        budget_mode=BudgetMode.FORGE_FAST,
        extra_flags=["--reasoning-format", "auto"],
    )
    # ... run workflows ...
    await client.close()

asyncio.run(main())
```

## Related Documentation

- [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md) - Project setup and testing
- [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md) - Quick start and Workflow overview
- [CHANGELOG.md](https://github.com/antoinezambelli/forge/blob/main/CHANGELOG.md) - Version history and breaking changes

---

<a id='page-model-guide'></a>

## Model Selection Guide

### Related Pages

Related topics: [Backend Setup Guide](#page-backend-setup), [Backend Clients](#page-backend-clients)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [src/forge/clients/sampling_defaults.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/sampling_defaults.py)
- [src/forge/errors.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/errors.py)
- [src/forge/server.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/server.py)
- [src/forge/clients/llamafile.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/clients/llamafile.py)
- [src/forge/proxy/__main__.py](https://github.com/antoinezambelli/forge/blob/main/src/forge/proxy/__main__.py)
- [CONTRIBUTING.md](https://github.com/antoinezambelli/forge/blob/main/CONTRIBUTING.md)
- [README.md](https://github.com/antoinezambelli/forge/blob/main/README.md)
- [CHANGELOG.md](https://github.com/antoinezambelli/forge/blob/main/CHANGELOG.md)
</details>

# Model Selection Guide

## Overview

The Model Selection Guide covers how to choose, configure, and deploy language models within forge. Forge provides a unified workflow engine that abstracts backend differences (Ollama, Llamafile, LlamaServer) while offering per-model recommended sampling parameters sourced directly from HuggingFace model cards.

Source: [src/forge/clients/sampling_defaults.py:1-20]()

## Supported Backends

Forge supports three LLM backend types, each with distinct configuration requirements.

| Backend | Configuration Method | Model Specification | Notes |
|---------|---------------------|---------------------|-------|
| Ollama | `model` parameter | Model name from `ollama list` | No GGUF path needed |
| Llamafile | `gguf_path` parameter | Path to .llamafile binary | Self-contained executables |
| LlamaServer | `gguf_path` parameter | Path to GGUF model file | Requires llama.cpp server binary |

Source: [src/forge/server.py:1-50]()

### Backend Selection Logic

```mermaid
graph TD
    A[Choose Backend] --> B{Backend Type?}
    B -->|Ollama| C[Use model name]
    B -->|Llamafile| D[Use gguf_path]
    B -->|LlamaServer| D
    C --> E[Connect via localhost:11434]
    D --> F[Start server process]
    F --> G[Connect via port 8080]
```

For Ollama, the `model` parameter directly references the model name from `ollama list`. For Llamafile and LlamaServer, you must provide the `gguf_path` pointing to the model file, and Forge will manage the server process lifecycle.

Source: [src/forge/server.py:80-120]()

## Supported Models

### Model Families

Forge has been tested and evaluated with the following model families across different quantization levels.

| Model Family | Variants | Recommended Quantization | Notes |
|-------------|----------|-------------------------|-------|
| Qwen3 | 8B, 3.5, 3.6 | Q4_K_M, Q8_0 | Includes Qwen3-Coder |
| Gemma | 4 (all sizes) | Q4_K_M | Use `--reasoning-budget 0` workaround |
| Mistral | Small 3.2, Nemo, 7B | Q4_K_M, Q8_0 | Ministral variants available |
| Devstral | Small 2 | Q4_K_M | Code-focused model |
| Granite | 4.0 (h-micro, h-tiny) | Q4_K_M, Q8_0 | OpenAI-style tool calls |
| Llama | 3.1 8B | Q4_K_M, Q8_0 | 8B Reasoning variants |
| Ministral | 3 Instruct, 8B Instruct, Reasoning | Q4_K_M, Q8_0 | Reasoning requires budget fix |

Source: [CHANGELOG.md:1-50]()

### Model Naming Conventions

Model names vary by backend. When using Ollama, use the exact model tag as shown in `ollama list`. For GGUF-based backends, the model name is derived from the filename stem of the GGUF file.

```python
# Ollama example
client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M")

# Llamafile/LlamaServer example - model derived from path
client = LlamafileClient(gguf_path="/models/mistral-7b-q4_K_M.gguf")
```

Source: [README.md:1-30]()
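
For GGUF-based backends, "filename stem" means the file name without its directory and final extension. A minimal sketch using the standard library follows; `model_name_from_gguf` is a hypothetical helper, not a forge function.

```python
from pathlib import PurePosixPath


def model_name_from_gguf(gguf_path: str) -> str:
    """Derive a model name from a GGUF path by taking the filename stem."""
    return PurePosixPath(gguf_path).stem


name = model_name_from_gguf("/models/mistral-7b-q4_K_M.gguf")
print(name)  # mistral-7b-q4_K_M
```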

## Recommended Sampling Configuration

### The Sampling Defaults Map

Forge ships `forge.clients.sampling_defaults` containing a verified per-model sampling recommendations map. Each entry includes parameters such as `temperature`, `top_p`, `top_k`, `min_p`, `repeat_penalty`, and `presence_penalty` sourced directly from HuggingFace model cards.

Source: [src/forge/clients/sampling_defaults.py:20-40]()

### Enabling Recommended Sampling

To use per-model recommended sampling, pass `recommended_sampling=True` when initializing the client.

```python
from forge import OllamaClient

# Opt-in to recommended sampling
client = OllamaClient(
    model="qwen3:8b-q4_K_M",
    recommended_sampling=True
)
```

If the model is not in the map and `recommended_sampling=True` is set, Forge raises `UnsupportedModelError` rather than silently falling back to backend defaults.

Source: [src/forge/errors.py:1-25]()

### Sampling Policy Behavior

The following table describes the four-quadrant behavior when applying sampling defaults.

| `strict` | Model in Map | Behavior |
|----------|-------------|----------|
| `True` | Yes | Return recommended dict |
| `True` | No | Raise `UnsupportedModelError` |
| `False` | Yes | One-shot INFO log; return `{}` |
| `False` | No | Return `{}` (silent) |

Source: [src/forge/clients/sampling_defaults.py:60-85]()

### Per-Call Sampling Overrides

The `send()` and `send_stream()` methods accept a `sampling: dict | None` kwarg that merges field-by-field with the client's instance-level sampling without mutating it. The caller's explicit non-None fields take precedence.

```python
# Merge with instance defaults
response = await client.send(
    messages,
    sampling={"temperature": 0.7, "top_p": 0.9}
)
```

Source: [CHANGELOG.md:50-70]()
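
The merge rule described above (field by field, explicit non-None values win, instance settings untouched) can be sketched as follows. `merge_sampling` is a hypothetical helper written to illustrate the rule, and the numbers are placeholders.

```python
from __future__ import annotations


def merge_sampling(instance: dict, per_call: dict | None) -> dict:
    """Overlay per-call sampling on instance-level sampling without mutation."""
    merged = dict(instance)  # copy, so the instance-level dict stays intact
    for key, value in (per_call or {}).items():
        if value is not None:  # None means "no override for this field"
            merged[key] = value
    return merged


instance_sampling = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}
effective = merge_sampling(instance_sampling, {"temperature": 0.7, "top_p": None})
print(effective)
```

Here only `temperature` is overridden for the call, while `top_p` and `top_k` fall back to the instance-level values.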

## Proxy Mode Configuration

When running forge in proxy mode, sampling parameters are plumbed through from the incoming request body. OpenAI-compatible fields supported include `temperature`, `top_p`, `top_k`, `min_p`, `repeat_penalty`, `presence_penalty`, and `seed`.

### Proxy Server Startup

```bash
python -m forge.proxy \
    --backend-url http://localhost:11434 \
    --backend ollama \
    --model qwen3:8b-q4_K_M \
    --port 8081
```

For per-model recommended sampling in proxy mode, the calling client should look up `forge.clients.get_sampling_defaults(model)` and include the values in the request body.

Source: [src/forge/proxy/__main__.py:1-40]()

## Context and Budget Management

### Server Context Resolution

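For llama-server and llamafile backends, Forge queries the backend's `/props` endpoint to determine the configured context length; for Ollama, it uses the context length Ollama reports. When switching between Ollama models, use `ollama stop` to unload the previous model from VRAM cleanly. A `ContextManager` can also be built with an explicit token budget: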

```python
from forge import ContextManager, TieredCompact

ctx = ContextManager(
    strategy=TieredCompact(keep_recent=2),
    budget_tokens=8192
)
```

### Budget Modes

| Mode | Description | Token Source |
|------|-------------|--------------|
| `MANUAL` | User-specified token budget | `manual_tokens` parameter |
| `FORGE_FAST` | Fast iteration mode | Server-reported context divided by 4 |
| `FORGE_BALANCED` | Balanced speed/quality | Server-reported context |
| `FORGE_THOROUGH` | Maximum quality | Server-reported context |

Source: [src/forge/server.py:150-200]()

### Known Issues with Reasoning Models

Models using extended reasoning (Gemma 4, Qwen 3.5, Ministral Reasoning) may hang when the reasoning budget is unbounded on llama.cpp builds after April 10, 2026. The workaround is to pass `--reasoning-budget 0` when starting the backend.

Source: [CHANGELOG.md:70-90]()

## Client Configuration Reference

### OllamaClient

```python
OllamaClient(
    model: str,                          # Model name from `ollama list`
    base_url: str = "http://localhost:11434/v1",
    api_key: str | None = None,
    timeout: float = 120.0,
    recommended_sampling: bool = False,  # Opt-in to per-model defaults
    **kwargs
)
```

### LlamafileClient

```python
LlamafileClient(
    gguf_path: str | Path,               # Path to .llamafile binary
    model: str | None = None,            # Optional model name
    base_url: str = "http://localhost:8080/v1",
    recommended_sampling: bool = False,
    **kwargs
)
```

Source: [src/forge/clients/llamafile.py:1-50]()

## Complete Workflow Example

```python
import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())
```

Source: [README.md:30-80]()

## Error Handling

### UnsupportedModelError

Raised when `recommended_sampling=True` is specified but the model is not in the sampling defaults map.

```python
from forge import OllamaClient
from forge.errors import UnsupportedModelError

try:
    client = OllamaClient(
        model="unknown-model:latest",
        recommended_sampling=True
    )
except UnsupportedModelError as e:
    print(f"Model not supported: {e.model}")
    # Solution: Either add entry to MODEL_SAMPLING_DEFAULTS
    # or drop recommended_sampling=True
```

### Tool Call Errors

Tool-related errors include `ToolCallError` (LLM failed to produce valid tool call), `ToolExecutionError` (tool callable raised an exception), and `ToolResolutionError` (valid arguments but data didn't resolve).

Source: [src/forge/errors.py:25-60]()

## Best Practices

### Selecting Quantization Levels

| Use Case | Recommended Quantization |
|----------|-------------------------|
| Development/Testing | Q4_K_M (balanced quality/size) |
| Production (quality priority) | Q8_0 (near-float quality) |
| Resource-constrained | Q4_0 (smaller, lower quality) |

### Guardrail Integration

Guardrails in forge are defined in `src/forge/core/runner.py`, and nudge templates live in `src/forge/prompts/nudges.py`. Each guardrail can be toggled independently via ablation presets for evaluation.

Source: [CONTRIBUTING.md:1-30]()

### Server Management

When running multiple evaluations, reuse `ServerManager` instances when the model and configuration match to avoid unnecessary server restarts.

```python
# ServerManager caches configuration to avoid redundant restarts
if (
    self._current_model == model
    and self._current_mode == mode
    and self._current_ctx == ctx_override
):
    # Reuse existing server
    return
```

Source: [src/forge/server.py:60-75]()

---

## Doramagic Pitfall Log

Project: antoinezambelli/forge

Summary: 15 potential pitfalls found, 0 of them high/blocking. Highest priority: installation pitfall, source evidence: Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body.

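## 1. Installation pitfall · Source evidence: Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body
- User impact: May raise the cost of trial use and production adoption for new users.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_148dff87195e42549d0ffb88b99e9cbf | https://github.com/antoinezambelli/forge/issues/58 | Unverified usage condition surfaced by source type github_issue.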

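## 2. Installation pitfall · Source evidence: Investigate: integration paths with Hermes Agent

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: Investigate: integration paths with Hermes Agent
- User impact: May raise the cost of trial use and production adoption for new users.
- Suggested check: The source issue is still open; Pack Agent needs to re-check whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_e3cbd2d1c9a84a1887887bf24b036865 | https://github.com/antoinezambelli/forge/issues/51 | Unverified usage condition surfaced by source type github_issue.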

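## 3. Installation pitfall · Source evidence: Per-model recommended sampling defaults (map keyed by HF model cards)

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: Per-model recommended sampling defaults (map keyed by HF model cards)
- User impact: May block installation or first run.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_057ca2af912e4a608259ffb2a3654d4f | https://github.com/antoinezambelli/forge/issues/59 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.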

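## 4. Installation pitfall · Source evidence: Rescue-parse ChatGPT-style XML tool calls

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: Rescue-parse ChatGPT-style XML tool calls
- User impact: May raise the cost of trial use and production adoption for new users.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_471c674c8d73451da75d6b8c9349aabf | https://github.com/antoinezambelli/forge/issues/55 | Unverified usage condition surfaced by source type github_issue.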

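## 5. Configuration pitfall · Source evidence: Proxy external mode hardcodes native FC — no prompt-injection fallback

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified configuration-related issue in this project: Proxy external mode hardcodes native FC — no prompt-injection fallback
- User impact: May raise the cost of trial use and production adoption for new users.
- Suggested check: The source issue is still open; Pack Agent needs to re-check whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_f3a85ec8447a4838b3bc4c846cd9e7a0 | https://github.com/antoinezambelli/forge/issues/53 | Unverified usage condition surfaced by source type github_issue.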

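## 6. Capability pitfall · Capability assessment relies on assumptions

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: If the assumption does not hold, users will not get the promised capability.
- Suggested check: Turn the assumption into a downstream verification checklist.
- Safeguard: Assumptions must become verification items; they cannot be stated as fact until verified.
- Evidence: capability.assumptions | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | README/documentation is current enough for a first validation pass.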

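## 7. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: `last_activity_observed` was not recorded.
- User impact: New, abandoned, and active projects get lumped together, which lowers confidence in the recommendation.
- Suggested check: Add recent GitHub commit, release, and issue/PR response signals.
- Safeguard: While maintenance activity is unknown, the recommendation must not be rated high-trust.
- Evidence: evidence.maintainer_signals | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | last_activity_observed missing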

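## 8. Security/permissions pitfall · Downstream validation found risk items

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: Downstream has already requested a review; this must not be played down on the page.
- Suggested check: Send to the security/permissions governance review queue.
- Safeguard: While downstream risk exists, the review/recommendation must stay downgraded.
- Evidence: downstream_validation.risk_items | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | no_demo; severity=medium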

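## 9. Security/permissions pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: The risk affects whether the project is suitable for ordinary users to install.
- Suggested check: Record the risk on the boundary card and confirm whether manual review is needed.
- Safeguard: Scoring risks must appear on the boundary card, not only as an internal score.
- Evidence: risks.scoring_risks | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | no_demo; severity=medium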

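## 10. Security/permissions pitfall · Source evidence: Hardware detection: AMD unified-memory rigs fall through to 4K Ollama budget

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permissions-related issue in this project: Hardware detection: AMD unified-memory rigs fall through to 4K Ollama budget
- User impact: May affect authorization, key configuration, or security boundaries.
- Suggested check: The source issue is still open; Pack Agent needs to re-check whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_4ad226a6d1fa4a5f89fa7702bec11188 | https://github.com/antoinezambelli/forge/issues/61 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.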

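## 11. Security/permissions pitfall · Source evidence: Sub-agent support: dynamic slot splitting

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permissions-related issue in this project: Sub-agent support: dynamic slot splitting
- User impact: May affect authorization, key configuration, or security boundaries.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_5b35873cf63c4647bca8a0611d441189 | https://github.com/antoinezambelli/forge/issues/28 | Unverified usage condition surfaced by source type github_issue.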

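## 12. Security/permissions pitfall · Source evidence: Sub-agent support: slot pool

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permissions-related issue in this project: Sub-agent support: slot pool
- User impact: May affect authorization, key configuration, or security boundaries.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_070d9a3d20d24123b62d7d76ee16078a | https://github.com/antoinezambelli/forge/issues/29 | Unverified usage condition surfaced by source type github_issue.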

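## 13. Security/permissions pitfall · Source evidence: llama.cpp reasoning budget sampler causes silent hangs after April 10 builds

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permissions-related issue in this project: llama.cpp reasoning budget sampler causes silent hangs after April 10 builds
- User impact: May block installation or first run.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_673be4a583984219bab90cbadff631fe | https://github.com/antoinezambelli/forge/issues/54 | Unverified usage condition surfaced by source type github_issue.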

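## 14. Maintenance pitfall · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: Users cannot tell whether anyone will respond when they run into a problem.
- Suggested check: Sample recent issues/PRs to see whether they go unhandled for long periods.
- Safeguard: While issue/PR responsiveness is unknown, the maintenance risk must be flagged.
- Evidence: evidence.maintainer_signals | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | issue_or_pr_quality=unknown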

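## 15. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: Install commands and documentation may lag behind the code, making pitfalls more likely.
- Suggested check: Confirm that the latest release/tag matches the README install commands.
- Safeguard: When release cadence is unknown or stale, the install instructions must note possible drift.
- Evidence: evidence.maintainer_signals | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | release_recency=unknown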

<!-- canonical_name: antoinezambelli/forge; human_manual_source: deepwiki_human_wiki -->