forge Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

forge

Introduction to Forge

Related topics: System Architecture, Quick Start Guide

Forge is a framework-agnostic LLM orchestration library that provides structured tool-calling workflows, guardrail enforcement, and context management for building reliable AI agents. It supports multiple LLM backends (Ollama, Llamafile, llama.cpp) and exposes both high-level runners and granular APIs for embedding into foreign orchestration loops.

Overview

Forge addresses the core challenges of LLM-based agent development:

Challenge	Forge Solution
Unreliable tool parsing	Rescue parsing with retry mechanisms
Missing required steps	StepEnforcer validates call sequences
Context overflow	Tiered context compaction strategies
Model-specific sampling	Per-model verified sampling defaults
Multi-backend support	Unified client abstraction layer

Sources: README.md

Core Architecture

Forge follows a layered architecture with clear separation of concerns:

graph TD
    subgraph "User Layer"
        A[User Workflow] --> B[WorkflowRunner]
    end
    
    subgraph "Core Layer"
        B --> C[ContextManager]
        B --> D[LLMClient]
        B --> E[Guardrails]
    end
    
    subgraph "Client Layer"
        D --> F[OllamaClient]
        D --> G[LlamafileClient]
        D --> H[AnthropicClient]
    end
    
    subgraph "Backend Layer"
        F --> I[Ollama Backend]
        G --> J[Llamafile Backend]
        H --> K[Anthropic API]
    end

Sources: CONTRIBUTING.md

Project Structure

src/forge/           # Library source
  clients/           # LLM backend adapters (one per backend)
  core/              # Workflow, runner, messages, steps
  context/           # Context management and compaction
  prompts/           # Prompt templates and nudges
  guardrails/        # Response validation and step enforcement
  proxy/             # OpenAI-compatible proxy server

Sources: CONTRIBUTING.md

Quick Start

The fundamental building blocks of a Forge workflow:

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

Sources: README.md

Core Components

Workflow

The Workflow class defines the structure and constraints of an agent task:

Parameter	Type	Description
`name`	`str`	Workflow identifier
`description`	`str`	Human-readable description
`tools`	`dict[str, ToolDef]`	Tool definitions keyed by name
`required_steps`	`list[str]`	Tools that must precede terminal tool
`terminal_tool`	`str`	Tool(s) that can end the workflow

Sources: src/forge/core/workflow.py:1-50

ToolDef

Binds a tool schema to its implementation:

@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)

The prerequisites field supports:

String entries: Name-only requirements ("read_file" — any prior call satisfies)
Dict entries: Arg-matched requirements ({"tool": "read_file", "match_arg": "path"})

Sources: src/forge/core/workflow.py

LLM Clients

Forge provides backend-specific clients:

Client	Backend	Features
`OllamaClient`	Ollama API	recommended_sampling, async streaming
`LlamafileClient`	Llamafile binary	Context length detection, reasoning extraction
`AnthropicClient`	Anthropic API	Native Claude support

All clients support send() and send_stream() methods with sampling parameter overrides:

# Instance-level sampling
client = OllamaClient(model="qwen3:8b", recommended_sampling=True)

# Per-call override (merged without mutation)
await client.send(messages, sampling={"temperature": 0.5})

Sources: src/forge/clients/sampling_defaults.py

Per-Model Sampling Defaults

The sampling_defaults module provides verified per-model sampling parameters sourced from HuggingFace model cards:

def get_sampling_defaults(model: str) -> dict[str, float | int]:
    """Pure lookup - returns copy of map value or {} for unknown models."""
    return dict(MODEL_SAMPLING_DEFAULTS.get(model, {}))

def apply_sampling_defaults(model: str, *, strict: bool) -> dict[str, float | int]:
    """Policy layer with four-quadrant behavior."""

strict	model in map	behavior
True	yes	return dict
True	no	raise `UnsupportedModelError`
False	yes	one-shot INFO log; return `{}`
False	no	return `{}` (silent)

Sources: src/forge/clients/sampling_defaults.py

Guardrails System

Forge's guardrails provide multi-layered response validation:

graph LR
    A[LLM Response] --> B[ResponseValidator]
    B --> C{Valid ToolCall?}
    C -->|No| D[Rescue Parsing]
    D --> E{Rescued?}
    E -->|Yes| F[Return ToolCall]
    E -->|No| G[Retry Nudge]
    C -->|Yes| H[StepEnforcer]
    H --> I{Valid Sequence?}
    I -->|Yes| J[ErrorTracker]
    I -->|No| K[Step Blocked Nudge]
    J --> L{Error Limit?}
    L -->|Yes| M[Fatal Error]
    L -->|No| N[Execute]

Sources: src/forge/guardrails/guardrails.py

Guardrails API

class Guardrails:
    def __init__(
        self,
        tool_names: list[str],
        terminal_tool: str | frozenset[str],
        required_steps: list[str] | None = None,
        max_retries: int = 3,
        max_tool_errors: int = 2,
        rescue_enabled: bool = True,
        max_premature_attempts: int = 3,
        retry_nudge: Callable[[str], str] | None = None,
    ) -> None:

Parameter	Default	Description
`max_retries`	3	Consecutive bad responses before fatal
`max_tool_errors`	2	Tool execution failures before exhaustion
`max_premature_attempts`	3	Premature terminal attempts before fatal
`rescue_enabled`	True	Parse tool calls from plain text
`retry_nudge`	None	Custom nudge for bare text responses

Sources: src/forge/guardrails/guardrails.py
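The limit parameters can be read as consecutive-failure counters. The sketch below is a hypothetical stand-in (`RetryCounter` is not a Forge class) and makes two assumptions: a limit is breached when the count exceeds it, and a successful tool execution resets both counters.

```python
from dataclasses import dataclass


@dataclass
class RetryCounter:
    """Hypothetical stand-in illustrating consecutive-failure limits."""
    max_retries: int = 3
    max_tool_errors: int = 2
    bad_responses: int = 0
    tool_errors: int = 0

    def record_bad_response(self) -> bool:
        """Count an unusable reply; return True once the limit is exceeded."""
        self.bad_responses += 1
        return self.bad_responses > self.max_retries

    def record_tool_error(self) -> bool:
        """Count a failed tool execution; return True once the limit is exceeded."""
        self.tool_errors += 1
        return self.tool_errors > self.max_tool_errors

    def record_success(self) -> None:
        """Assumed behavior: a successful execution resets both counters."""
        self.bad_responses = 0
        self.tool_errors = 0


counter = RetryCounter()
results = [counter.record_bad_response() for _ in range(4)]
print(results)  # [False, False, False, True]
```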

CheckResult Actions

@dataclass
class CheckResult:
    action: Literal["execute", "retry", "step_blocked", "fatal"]
    tool_calls: list[ToolCall] | None = None
    nudge: Nudge | None = None  # Set when action is "retry" or "step_blocked"
    reason: str | None = None  # Only when action == "fatal"

Sources: src/forge/guardrails/guardrails.py

Context Management

The ContextManager handles token budget enforcement and message compaction:

Strategy	Purpose
`TieredCompact`	Keeps recent N messages, compacts older ones
Custom strategies	Pluggable compaction algorithms

ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)

Sources: src/forge/server.py
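As a rough picture of what "keeps recent N messages, compacts older ones" means, here is a deliberately simplified, single-tier sketch. It is not `TieredCompact`: the `Message` type, the character-based token estimate, the placeholder summary, and the assumption that the first message is the system prompt are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str


def estimate_tokens(messages: list[Message]) -> int:
    """Crude estimate (about 4 characters per token); a real tokenizer differs."""
    return sum(len(m.content) // 4 + 1 for m in messages)


def compact(messages: list[Message], keep_recent: int, budget_tokens: int) -> list[Message]:
    """Keep the first (system) message and the last `keep_recent` messages;
    replace everything in between with one short placeholder."""
    if estimate_tokens(messages) <= budget_tokens:
        return list(messages)  # within budget, nothing to do
    system, rest = messages[:1], messages[1:]
    split = max(len(rest) - keep_recent, 0)
    older, recent = rest[:split], rest[split:]
    if not older:
        return list(messages)  # nothing older left to compact
    summary = Message("system", f"[{len(older)} earlier messages compacted]")
    return system + [summary] + recent


history = [Message("system", "You are a helpful assistant.")] + [
    Message("user" if i % 2 == 0 else "assistant", "x" * 400) for i in range(6)
]
compacted = compact(history, keep_recent=2, budget_tokens=200)
print(len(history), "->", len(compacted))  # 7 -> 4
```

A tiered strategy would presumably apply progressively more aggressive steps as the budget tightens; this sketch shows only the first such step.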

Server Manager

For local GGUF model execution, ServerManager handles backend lifecycle:

server, ctx = await forge.serve(
    backend="llamaserver",
    gguf_path="/path/to/model.gguf",
    mode=BudgetMode.FORGE_FAST,
    client=client,
)
runner = WorkflowRunner(client=client, context_manager=ctx)

Parameter	Description
`backend`	"ollama", "llamaserver", or "llamafile"
`gguf_path`	Path to GGUF file (not for Ollama)
`mode`	Budget mode (FORGE_FAST, FORGE_BALANCED, FORGE_DEEP)
`n_slots`	Concurrent slots for multi-agent
`kv_unified`	Single shared KV cache across slots

Sources: src/forge/server.py

Error Handling

Forge defines a structured exception hierarchy:

class ForgeError(Exception):  # Base exception
    pass

class UnsupportedModelError(ForgeError):
    """Raised when recommended_sampling=True is used with an unknown model."""
    
class ToolCallError(ForgeError):
    """LLM failed to produce valid tool call after retries."""
    
class ToolExecutionError(ForgeError):
    """Tool callable raised during execution."""

Sources: src/forge/errors.py
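A shared base class lets callers handle specific failures first and fall back to a catch-all for anything else Forge raises. The sketch below redefines the hierarchy locally so it runs on its own; `run_guarded` and the two failing helpers are hypothetical.

```python
class ForgeError(Exception):
    """Base exception (local mirror of the hierarchy listed above)."""


class UnsupportedModelError(ForgeError):
    pass


class ToolExecutionError(ForgeError):
    pass


def run_guarded(step):
    """Run a zero-argument callable, translating Forge errors into messages."""
    try:
        return step()
    except ToolExecutionError as exc:
        # Most specific handler first.
        return f"tool failed: {exc}"
    except ForgeError as exc:
        # Catch-all for every other error in the hierarchy.
        return f"forge error: {type(exc).__name__}"


def failing_tool():
    raise ToolExecutionError("disk full")


def unknown_model():
    raise UnsupportedModelError("no defaults")


print(run_guarded(failing_tool))
print(run_guarded(unknown_model))
```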

Foreign Loop Integration

Forge can be embedded into existing orchestration systems using the Guardrails middleware API:

from forge.guardrails import Guardrails

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response(response):
    result = guardrails.check(response)
    
    if result.action == "fatal":
        return f"FATAL: {result.reason}"
    
    if result.action in ("retry", "step_blocked"):
        return f"{result.action}: {result.nudge.content[:80]}..."
    
    # Execute tools, then record
    executed = [tc.tool for tc in result.tool_calls]
    done = guardrails.record(executed)
    return f"executed {executed}" + (" -- DONE" if done else "")

Sources: examples/foreign_loop.py

Granular Component Access

For fine-grained control, access components directly:

from forge.guardrails import ErrorTracker, ResponseValidator, StepEnforcer

validator = ResponseValidator(tool_names=["search", "lookup", "answer"], rescue_enabled=True)
enforcer = StepEnforcer(required_steps=["search", "lookup"], terminal_tool="answer")
errors = ErrorTracker(max_retries=3, max_tool_errors=2)

Sources: examples/foreign_loop.py

Code Style and Requirements

Requirement	Details
Python	3.12+
Async	`asyncio` throughout — all client methods and runner are async
Type Safety	Pydantic for tool parameter schemas
Type Syntax	Modern unions with `|` (e.g. `str | None`)

Sources: CONTRIBUTING.md

Running Tests

# Full suite (865 tests)
python -m pytest tests/unit/ -v --tb=short

# With coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing

# Skip integration tests (require live backend)
python -m pytest tests/ -m "not integration"

Sources: CONTRIBUTING.md

Sources: README.md

Installation Guide

Related topics: Backend Setup Guide, Quick Start Guide

This guide covers all aspects of setting up the forge framework, from basic installation to advanced configuration for different LLM backends.

Overview

Forge is a Python library for building LLM-powered workflows with automatic tool calling, context management, and multi-backend support. The installation process involves:

Python environment setup (Python 3.12+)
Package installation via pip
LLM backend selection and configuration
Optional development environment for contributors

Sources: CONTRIBUTING.md:1-15

Prerequisites

System Requirements

Component	Requirement
Python	3.12+
OS	Linux, macOS, Windows
LLM Backend	Ollama, llama-server, or llamafile
VRAM	Varies by model (8GB minimum for small models)

Backend Options

Forge supports three LLM backends:

Backend	Description	Use Case
Ollama	Local model management with simple API	Quick setup, model management
llama-server	llama.cpp server binary	Production, GGUF files
llamafile	Single-file executable models	Distribution, portability

Each backend requires different installation steps:

Ollama: Install Ollama, then download a model with ollama pull
llama-server: Download llama.cpp server binary
llamafile: Download pre-built executables or convert GGUF files

Sources: src/forge/server.py:1-50

Installation Methods

Standard Installation

For end users wanting to use forge as a library:

pip install forge

This installs the core package with all necessary dependencies.

Development Installation

For contributors and those wanting to modify the codebase:

git clone https://github.com/antoinezambelli/forge.git
cd forge
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or: .venv\Scripts\activate  # Windows
pip install -e ".[dev]"

Sources: CONTRIBUTING.md:7-14

The dev extra installs the test tooling declared in pyproject.toml (see Package Configuration):

pytest (test runner)
pytest-asyncio (support for async tests)
pytest-cov (coverage reporting)

Package Configuration

The project uses pyproject.toml for dependency management:

[project]
name = "forge"
version = "0.5.0"
requires-python = ">=3.12"

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "pytest-asyncio",
    "pytest-cov",
]

Environment Setup

Virtual Environment

Using a virtual environment is strongly recommended to avoid dependency conflicts:

python -m venv forge-env
source forge-env/bin/activate

Backend Installation

#### Ollama Setup

Install Ollama from ollama.ai
Pull a model:

ollama pull ministral-3:8b-instruct-2512-q4_K_M

Verify the installation:

ollama list

#### llama-server Setup

Download the llama.cpp server binary for your platform
Place the binary in your models directory or system PATH
Verify with:

./llama-server --help

#### llamafile Setup

Download a pre-built llamafile (e.g., from TheBloke)
Make it executable:

chmod +x model-name.Q4_K_M.llamafile

Sources: src/forge/server.py:180-220

Configuration Flow

graph TD
    A[Install forge] --> B{Use Case}
    B -->|Library| C[pip install forge]
    B -->|Development| D["pip install -e .[dev]"]
    C --> E[Choose Backend]
    D --> E
    E -->|Ollama| F[Install Ollama + Pull Model]
    E -->|llama-server| G[Download llama.cpp binary]
    E -->|llamafile| H[Download llamafile]
    F --> I[Verify with test script]
    G --> I
    H --> I

Quick Start Verification

After installation, verify your setup with this minimal example:

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

Sources: README.md:1-60

Advanced Backend Configuration

KV Cache Quantization

Forge supports KV cache quantization to reduce VRAM usage:

Setting	VRAM Savings	Quality Impact
`q8_0`	~50% vs F16	Minimal
`q4_0`	~75% vs F16	Low

# In setup_backend call
from forge.server import setup_backend, BudgetMode

server, ctx = await setup_backend(
    backend="llamaserver",
    gguf_path="/path/to/model.gguf",
    budget_mode=BudgetMode.FORGE_FAST,
    cache_type_k="q4_0",  # Key cache quantization
    cache_type_v="q4_0",   # Value cache quantization
)

Sources: src/forge/server.py:20-45
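The savings figures follow from the storage cost per cached value. The sketch below works through the arithmetic for a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 8192-token context); the shape is an assumption chosen for illustration. The block sizes reflect the ggml layouts for `q8_0` and `q4_0`, where each block of 32 values carries a 2-byte scale.

```python
# Bytes per cached value. f16 stores 2 bytes per value. The quantized formats
# store blocks of 32 values plus one 2-byte scale:
#   q8_0 -> 34 bytes per 32 values, q4_0 -> 18 bytes per 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(
    n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int, cache_type: str
) -> float:
    """Size of keys plus values for every layer and context position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = one K and one V
    return per_token * n_ctx * BYTES_PER_ELEMENT[cache_type]


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
baseline = kv_cache_bytes(**shape, cache_type="f16")

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(**shape, cache_type=cache_type)
    saving = 100 * (1 - size / baseline)
    print(f"{cache_type}: {size / 2**20:.0f} MiB ({saving:.0f}% smaller than f16)")
```

Because of the per-block scale, the computed savings land slightly below the rounded figures in the table: about 47% for `q8_0` and about 72% for `q4_0`.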

Multi-Slot Configuration

For multi-agent architectures:

server, ctx = await setup_backend(
    backend="llamaserver",
    gguf_path="/path/to/model.gguf",
    n_slots=4,           # Number of concurrent slots
    kv_unified=True,     # Shared KV cache pool
)

When kv_unified=True, all slots share a single KV cache pool, allowing each slot to use the full context window.

Sources: src/forge/server.py:40-55

Recommended Sampling Parameters

Forge provides recommended sampling defaults for specific models:

client = OllamaClient(
    model="qwen3:8b-q4_K_M",
    recommended_sampling=True  # Enable recommended defaults
)

The recommended_sampling=True parameter enables tuned temperature, top_p, top_k, and other sampling parameters sourced from HuggingFace model cards.

Sources: src/forge/clients/sampling_defaults.py:1-80

Testing Your Installation

Unit Tests

Run the deterministic unit test suite (no backend required):

python -m pytest tests/unit/ -v --tb=short

Sources: CONTRIBUTING.md:18-25

Integration Tests

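Integration tests require a running backend and are selected with the integration marker. Without a backend, deselect them; coverage can still be collected from the unit suite: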

# Skip integration tests
python -m pytest tests/ -m "not integration"

# Run with coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing

Troubleshooting

Common Issues

Issue	Solution
`ModuleNotFoundError: forge`	Run `pip install forge` or check virtual environment
Backend connection refused	Verify backend is running on correct port
Model not found (Ollama)	Run `ollama pull <model-name>`
VRAM out of memory	Enable KV cache quantization or use smaller model

Backend Health Check

Verify backend connectivity:

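import asyncio

import httpx

async def check_backend(url: str = "http://localhost:8080/props") -> bool:
    """Return True if the backend answers the probe with HTTP 200."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            resp = await client.get(url)
    except (httpx.ConnectError, httpx.TimeoutException):
        print("Backend not reachable")
        return False
    if resp.status_code == 200:
        print("Backend is ready")
        return True
    print(f"Backend responded with HTTP {resp.status_code}")
    return False

asyncio.run(check_backend())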

Sources: src/forge/server.py:250-280

Next Steps

After installation, refer to:

User Guide — Complete workflow creation and execution
Backend Setup Guide — Detailed backend configuration
Model Guide — Recommended models by hardware tier
Architecture Decision Records — Design rationale documentation

Source: https://github.com/antoinezambelli/forge / Human Manual

Quick Start Guide

Related topics: WorkflowRunner and Agentic Loop, System Architecture

This guide provides a practical introduction to Forge, a reliability layer for self-hosted LLM tool-calling. Forge elevates an 8B local model to top-tier performance on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction).

Purpose and Scope

The Quick Start Guide covers:

Installation and environment setup for local LLM backends
Core concepts including Workflows, ToolDefs, and the WorkflowRunner
Basic usage patterns for single-step and multi-step agentic workflows
Integration options for foreign orchestration loops
Backend management with auto-start capabilities

Forge targets developers building agentic applications that require structured tool-calling with local models. It works with llama.cpp-based backends (llama-server, llamafile) and Ollama.

Sources: README.md:1-20

Installation

Prerequisites

Requirement	Version	Notes
Python	3.12+	Modern syntax required (type unions with `|`)
pip	Latest	For package installation
LLM Backend	llama.cpp / Ollama	For inference

Setup Commands

git clone https://github.com/antoinezambelli/forge.git
cd forge
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or: .venv\Scripts\activate  # Windows
pip install -e ".[dev]"

Sources: CONTRIBUTING.md:1-15

Core Concepts

Architecture Overview

graph TD
    A[User Input] --> B[WorkflowRunner]
    B --> C[LLM Client]
    C --> D[Tool Call Response]
    D --> E[Guardrails Check]
    E -->|execute| F[Tool Execution]
    F --> G[Context Manager]
    G -->|compact| B
    E -->|retry| C
    E -->|fatal| H[Error Handling]

Workflow

The Workflow is the central definition for an agentic task. It binds together:

Component	Type	Purpose
`name`	`str`	Workflow identifier
`description`	`str`	Human-readable description
`tools`	`dict[str, ToolDef]`	Tool name → definition mapping
`required_steps`	`list[str]`	Tools that must execute before terminal
`terminal_tool`	`str`	Tool that ends the workflow
`system_prompt_template`	`str`	System prompt for the LLM

Sources: src/forge/core/workflow.py:1-50

ToolDef and ToolSpec

ToolSpec defines the schema exposed to the LLM:

class ToolSpec(BaseModel):
    name: str
    description: str
    parameters: type[BaseModel]  # Pydantic model

ToolDef binds the schema to its Python implementation:

@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)

The prerequisites field enables conditional dependencies:

str: "if you call this tool, you must have called tool X first"
dict: {"tool": "read_file", "match_arg": "path"} — arg-matched prerequisites

Sources: src/forge/core/workflow.py:60-90

WorkflowRunner

The WorkflowRunner manages the full lifecycle:

System prompt injection
Tool execution and result handling
Context compaction
Guardrail enforcement
Multi-turn conversation state

class WorkflowRunner:
    def __init__(self, client, context_manager):
        ...
    
    async def run(self, workflow, user_message):
        ...

Sources: src/forge/core/runner.py
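The lifecycle can be pictured as a loop around the client. The following is a stripped-down, self-contained sketch with a scripted fake client, not the `WorkflowRunner` implementation: `ScriptedClient`, `run_loop`, the tool-call dict shape, and the nudge text are assumptions, and context compaction is omitted for brevity.

```python
import asyncio
from typing import Any, Callable


class ScriptedClient:
    """Fake LLM that replays canned tool calls; stands in for a real client."""

    def __init__(self, replies: list[dict[str, Any]]) -> None:
        self._replies = iter(replies)

    async def send(self, messages: list[dict[str, str]]) -> dict[str, Any]:
        return next(self._replies)


async def run_loop(
    client: ScriptedClient,
    tools: dict[str, Callable[..., Any]],
    terminal_tool: str,
    system_prompt: str,
    user_message: str,
    max_turns: int = 8,
) -> Any:
    messages = [
        {"role": "system", "content": system_prompt},  # system prompt injection
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_turns):
        call = await client.send(messages)
        if call.get("name") not in tools:
            # Guardrail stand-in: nudge the model instead of executing.
            messages.append({"role": "user", "content": "Please call a known tool."})
            continue
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": str(result)})  # conversation state
        if call["name"] == terminal_tool:
            return result
    raise RuntimeError("turn limit reached without a terminal tool call")


def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"


client = ScriptedClient([
    {"name": "unknown_tool", "arguments": {}},               # triggers the nudge
    {"name": "get_weather", "arguments": {"city": "Paris"}},  # terminal call
])
answer = asyncio.run(run_loop(
    client, {"get_weather": get_weather}, "get_weather",
    "You are a helpful assistant.", "What's the weather in Paris?",
))
print(answer)
```

The first scripted reply names an unknown tool, so the loop nudges and asks again; the second reply is the terminal tool and ends the run.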

ContextManager and Budget Modes

Forge provides VRAM-aware context management through budget modes:

Mode	Behavior
`BudgetMode.MANUAL`	User-specified token budget
`BudgetMode.FORGE_FAST`	VRAM-optimized fast inference budget
`BudgetMode.FORGE_DEEP`	Extended context for complex reasoning

The ContextManager resolves budgets at runtime based on the backend:

async def resolve_budget(self, mode: BudgetMode, manual_tokens: int | None = None) -> int:
    if mode == BudgetMode.MANUAL:
        if self._backend == "ollama":
            return manual_tokens
        return await self.get_server_context()

Sources: src/forge/server.py:80-120

Quick Start Example

Basic Single-Tool Workflow

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

Sources: README.md:20-60

Key Components Explained

Component	Import	Purpose
`OllamaClient`	`forge`	LLM backend adapter for Ollama
`TieredCompact`	`forge`	Context compaction strategy
`ContextManager`	`forge`	Token budget management
`WorkflowRunner`	`forge`	Orchestrates the agent loop

Multi-Step Workflows

For workflows requiring sequential tool execution:

# Define a multi-step workflow; required_steps gates the terminal tool
workflow = Workflow(
    name="research_assistant",
    description="Research and answer questions",
    tools={
        "search": ToolDef(spec=search_spec, callable=do_search),
        "lookup": ToolDef(spec=lookup_spec, callable=do_lookup),
        "answer": ToolDef(spec=answer_spec, callable=final_answer),
    },
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

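The required_steps list enforces that search and lookup must execute before answer. A premature call to the terminal tool is not executed: the guardrails return a step_blocked action with a nudge, and repeated premature attempts (max_premature_attempts, default 3) become fatal.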

Sources: examples/foreign_loop.py:1-50
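The gating behavior of required steps can be sketched in a few lines. `RequiredStepTracker` is a hypothetical stand-in, not Forge's `StepEnforcer`; it assumes each required step only needs to have run at least once and does not enforce an order among them.

```python
class RequiredStepTracker:
    """Hypothetical tracker: blocks the terminal tool until required steps ran."""

    def __init__(self, required_steps: list[str], terminal_tool: str) -> None:
        self.required_steps = list(required_steps)
        self.terminal_tool = terminal_tool
        self.completed: set[str] = set()

    def missing(self) -> list[str]:
        """Required steps that have not been recorded yet, in declared order."""
        return [s for s in self.required_steps if s not in self.completed]

    def check(self, tool: str) -> str:
        """Return "execute" or "step_blocked" for a proposed call."""
        if tool == self.terminal_tool and self.missing():
            return "step_blocked"
        return "execute"

    def record(self, tool: str) -> None:
        """Mark a tool as executed."""
        self.completed.add(tool)


tracker = RequiredStepTracker(["search", "lookup"], terminal_tool="answer")
print(tracker.check("answer"), tracker.missing())
tracker.record("search")
tracker.record("lookup")
print(tracker.check("answer"), tracker.missing())
```

The `missing()` list is the kind of information a nudge can pass back to the model so that it knows which tools to call next.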

Guardrails API

For foreign orchestration loops (non-WorkflowRunner usage), Forge provides standalone guardrails:

Simple API

from forge.guardrails import Guardrails

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response(response):
    result = guardrails.check(response)
    
    if result.action == "fatal":
        return f"FATAL: {result.reason}"
    
    if result.action in ("retry", "step_blocked"):
        return f"{result.action}: {result.nudge.content[:80]}..."
    
    # Execute tools
    executed = [tc.tool for tc in result.tool_calls]
    done = guardrails.record(executed)
    return f"executed {executed}" + (" -- DONE" if done else "")

Granular API

Direct access to individual components:

from forge.guardrails import ErrorTracker, ResponseValidator, StepEnforcer

validator = ResponseValidator(
    tool_names=["search", "lookup", "answer"],
    rescue_enabled=True,
)
enforcer = StepEnforcer(
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
errors = ErrorTracker(max_retries=3, max_tool_errors=2)

Component	Purpose
`ResponseValidator`	Parses tool calls from LLM responses, rescue mode
`StepEnforcer`	Enforces required step sequence
`ErrorTracker`	Tracks retry attempts and tool errors

Sources: examples/foreign_loop.py:80-150

Respond Tool Pattern

For conversational turns where the model should respond directly:

from forge.guardrails import Guardrails
from forge.tools import RESPOND_TOOL_NAME, respond_spec

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer", RESPOND_TOOL_NAME],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response_with_respond(response):
    result = guardrails.check(response)

    # Check for a respond() call. tool_calls may be None
    # (for example on a retry or fatal result), so fall back to an empty list.
    for tc in result.tool_calls or []:
        if tc.tool == RESPOND_TOOL_NAME:
            message = tc.args.get("message", "")
            return f"MODEL SAYS: {message}"

    # Normal tool execution
    ...

Sources: examples/foreign_loop.py:160-200

Backend Auto-Management

Forge can auto-start backends for multi-agent architectures:

import asyncio

from forge import WorkflowRunner
from forge.server import run_with_server
from forge.clients import LlamafileClient, BudgetMode

async def main():
    # async with is only valid inside a coroutine
    async with run_with_server(
        backend="llamafile",
        gguf_path="/path/to/model.gguf",
        budget_mode=BudgetMode.FORGE_FAST,
    ) as (server, ctx):
        client = LlamafileClient(model="my-model")
        runner = WorkflowRunner(client=client, context_manager=ctx)
        # Run workflows...

asyncio.run(main())

Backend Options

Backend	Model Source	GGUF Support
`ollama`	Model name (e.g., `ministral-3:8b`)	No
`llamaserver`	GGUF file path	Yes
`llamafile`	GGUF file path	Yes

Sources: src/forge/server.py:1-80

Recommended Sampling Parameters

Forge provides curated sampling defaults for supported models:

from forge.clients import OllamaClient, get_sampling_defaults

# Opt-in to recommended sampling
client = OllamaClient(
    model="ministral-3:8b-q4_K_M",
    recommended_sampling=True  # Raises error for unknown models
)

Parameter	Source	Verification
`temperature`	HF model cards	Per-model verification
`top_p`	HF model cards	Per-model verification
`top_k`	HF model cards	Per-model verification
`min_p`	HF model cards	Per-model verification
`repeat_penalty`	HF model cards	Per-model verification

Sources: src/forge/clients/sampling_defaults.py:1-50

Testing

Unit Tests (No Backend Required)

# Full suite (865 tests)
python -m pytest tests/unit/ -v --tb=short

# With coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing

# Single file
python -m pytest tests/unit/test_runner.py -v

Integration Tests (Requires Backend)

# Skip integration tests
python -m pytest tests/ -m "not integration"

Sources: CONTRIBUTING.md:15-30

Next Steps

Topic	Description
User Guide	Multi-step workflows, long-running sessions
Model Guide	Model-specific configurations
Architecture Decisions	Design rationale and ADRs
Eval Suite	Performance evaluation methodology

Sources: README.md:1-20

System Architecture

Related topics: WorkflowRunner and Agentic Loop

Overview

Forge is an LLM agent framework that orchestrates multi-step tool-calling workflows with built-in guardrails, context management, and automatic backend server management. The architecture follows a clean separation of concerns: clients abstract LLM backends, workflows define agent behavior, guardrails enforce execution policies, and the context manager handles token budgeting.

Sources: CONTRIBUTING.md

The framework is designed for determinism in unit tests and supports three backend types: Ollama, llama-server, and llamafile.

Sources: src/forge/server.py

Project Layout

src/forge/           # Library source
  clients/           # LLM backend adapters (one per backend)
  core/              # Workflow, runner, messages, steps
  context/           # Context management and compaction
  prompts/           # Prompt templates and nudges
tests/
  unit/              # Deterministic tests
  eval/              # Eval harness (requires live backends)
    scenarios/       # Eval scenario definitions
    dashboard/       # React-based HTML dashboard (separate npm build)
docs/                # User-facing documentation
  decisions/         # Architecture Decision Records (ADRs)
  results/           # Eval results and raw data tables

Sources: CONTRIBUTING.md

Core Architecture Components

Architecture Diagram

graph TD
    User["User / Application"]
    Runner["WorkflowRunner"]
    Workflow["Workflow"]
    Guardrails["Guardrails"]
    ContextMgr["ContextManager"]
    Client["LLM Client"]
    ServerMgr["ServerManager"]
    
    User --> Runner
    Runner --> Workflow
    Runner --> Guardrails
    Runner --> ContextMgr
    Runner --> Client
    Client --> ServerMgr
    
    subgraph "forge Library"
        Runner
        Workflow
        Guardrails
        ContextMgr
        Client
    end
    
    subgraph "Backend"
        ServerMgr
    end

LLM Clients

The client layer abstracts different LLM backends behind a common async interface. Each client handles backend-specific protocol differences.

Client	Backend	Protocol
`OllamaClient`	Ollama	OpenAI-compatible REST
`LlamafileClient`	Llamafile	OpenAI-compatible REST
`AnthropicClient`	Anthropic API	Anthropic native
`OpenAIClient`	OpenAI API	OpenAI native

Sources: CONTRIBUTING.md

#### Sampling Defaults

Each client can optionally apply recommended sampling parameters sourced from HuggingFace model cards. The policy layer provides four-quadrant behavior:

`strict`	Model in map	Behavior
`True`	Yes	Return dict
`True`	No	Raise `UnsupportedModelError`
`False`	Yes	One-shot INFO log; return `{}`
`False`	No	Return `{}` (silent)

Sources: src/forge/clients/sampling_defaults.py

Clients no longer ship hardcoded temperature defaults. With recommended_sampling=False (default), forge sends nothing and the backend's default applies. Sources: CHANGELOG.md

Workflow System

The Workflow class defines an agent's behavior declaratively:

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(...),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools.",
)

Sources: README.md

#### Tool Definition

ToolDef binds a tool schema to its implementation:

@dataclass
class ToolDef:
    """Binds a tool schema to its implementation."""
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)

Prerequisites express conditional dependencies:

str: Name-only ("read_file" — any prior call satisfies it)
dict: Arg-matched ({"tool": "read_file", "match_arg": "path"})

Sources: src/forge/core/workflow.py

Guardrails System

Guardrails enforce execution policies through three coordinated components:

graph LR
    Response["LLM Response"] --> Validator["ResponseValidator"]
    Validator --> Enforcer["StepEnforcer"]
    Enforcer --> Tracker["ErrorTracker"]
    
    Validator --> RP["rescue parsing"]
    Enforcer --> RS["required steps"]
    Tracker --> RL["retry limits"]

#### Guardrails Configuration

Parameter	Purpose	Default
`tool_names`	List of available tools	Required
`terminal_tool`	Final allowed tool	Required
`required_steps`	Ordered prerequisite chain	`None`
`max_retries`	Total retry attempts	3
`max_tool_errors`	Consecutive tool failures	2
`rescue_enabled`	Enable XML rescue parsing	`True`
`max_premature_attempts`	Premature terminal attempts before fatal	3

Sources: src/forge/guardrails/guardrails.py

#### Check Result Actions

The check() method returns a CheckResult with these actions:

Action	Meaning
`execute`	Response passes all guardrails
`retry`	Invalid response, apply nudge and retry
`step_blocked`	Missing required step
`fatal`	Max retries exceeded

Context Manager

The ContextManager handles token budgeting and context compaction to prevent context overflow during long conversations. Sources: CONTRIBUTING.md

#### Budget Resolution

Budget is resolved based on the mode:

Mode	Resolution Strategy
`MANUAL`	Use `manual_tokens` parameter or query server
`FORGE_FAST`	Server-reported context / 4
`FORGE_BALANCED`	Server-reported context / 2
`FORGE_DEEP`	Server-reported context * 3 / 4

For Ollama backends, the context length is obtained from ollama show. For llama-server/llamafile, a /props query retrieves the actual n_ctx. Sources: src/forge/server.py
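
The mode table reduces to simple arithmetic on the server-reported context length. The sketch below is illustrative only: the `resolve_budget_tokens` helper, its string mode names, and the use of integer division are assumptions, not forge's own `resolve_budget()` implementation.

```python
from typing import Optional


def resolve_budget_tokens(mode: str, server_ctx: int, manual_tokens: Optional[int] = None) -> int:
    """Hypothetical helper mirroring the mode table; not forge's API."""
    if mode == "MANUAL":
        # Use the explicit value when given, otherwise fall back to the server's context.
        return manual_tokens if manual_tokens is not None else server_ctx
    if mode == "FORGE_FAST":
        return server_ctx // 4
    if mode == "FORGE_BALANCED":
        return server_ctx // 2
    if mode == "FORGE_DEEP":
        return server_ctx * 3 // 4
    raise ValueError(f"unknown budget mode: {mode}")


# Budgets for a server that reports an 8192-token context.
budgets = {
    mode: resolve_budget_tokens(mode, 8192)
    for mode in ("FORGE_FAST", "FORGE_BALANCED", "FORGE_DEEP")
}
print(budgets)
```

With an 8192-token context this gives 2048, 4096, and 6144 tokens for the three forge modes.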

Server Manager

ServerManager handles lifecycle management of backend servers (llama-server and llamafile only; Ollama is managed externally).

server = ServerManager(backend="llamaserver", port=8080)
context, ctx_mgr = await server.start_with_budget(
    model="qwen3:8b-q4_K_M",
    budget_mode=BudgetMode.FORGE_FAST,
    client=client,
)

#### Server Configuration Parameters

Parameter	Description	Backend
`model`	Model identity for server	All
`gguf_path`	Path to GGUF file	llamaserver/llamafile
`mode`	Operation mode	All
`extra_flags`	Additional CLI flags	llamaserver/llamafile
`ctx_override`	Override context length (`-c value`)	llamaserver/llamafile
`cache_type_k`	KV cache quantization type for keys	llamaserver
`cache_type_v`	KV cache quantization type for values	llamaserver
`n_slots`	Concurrent slots count	llamaserver
`kv_unified`	Single unified KV cache	llamaserver

The server reuses an existing process if the same configuration is requested, avoiding unnecessary restarts. Sources: src/forge/server.py

Execution Flow

sequenceDiagram
    participant User
    participant Runner
    participant Workflow
    participant Guardrails
    participant Context
    participant Client
    
    User->>Runner: run(workflow, input)
    Runner->>Context: begin_session()
    Runner->>Workflow: Get system prompt
    Runner->>Client: send(messages)
    
    loop Until terminal or max iterations
        Client-->>Runner: LLMResponse
        Runner->>Guardrails: check(response)
        Guardrails-->>Runner: CheckResult
        
        alt proceed
            Runner->>Runner: Execute tools
            Runner->>Context: append(messages)
            Runner->>Client: send(messages)
        else retry
            Runner->>Runner: Apply nudge, retry
        else fatal
            Runner->>User: Return error
        end
    end
    
    Runner-->>User: Final result
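
The loop in the diagram can be approximated with stand-ins. Everything in this sketch is hypothetical (`StubResponse`, `ScriptedClient`, the toy `check` function, and the nudge text); it only illustrates the control flow. The diagram's proceed branch is modeled with the `execute` action from `CheckResult`, and the retry threshold is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubResponse:
    tool: Optional[str]  # None models a bare-text reply with no tool call
    args: dict


class ScriptedClient:
    """Stand-in client that replays canned responses instead of calling a model."""

    def __init__(self, script):
        self._script = list(script)

    async def send(self, messages):
        return self._script.pop(0)


def check(response, tool_names, retries, max_retries):
    """Toy guardrail check: known tool -> execute, otherwise retry until the limit."""
    if response.tool in tool_names:
        return "execute"
    return "retry" if retries < max_retries else "fatal"


async def run_loop(client, tools, terminal_tool, user_input, max_retries=3, max_iterations=10):
    messages = [{"role": "user", "content": user_input}]
    retries = 0
    for _ in range(max_iterations):
        response = await client.send(messages)
        action = check(response, tools.keys(), retries, max_retries)
        if action == "fatal":
            raise RuntimeError("max retries exceeded")
        if action == "retry":
            retries += 1
            messages.append({"role": "user", "content": "Please respond with a tool call."})
            continue
        retries = 0  # a valid response resets the consecutive-retry counter
        result = tools[response.tool](**response.args)
        messages.append({"role": "tool", "content": str(result)})
        if response.tool == terminal_tool:
            return result
    raise RuntimeError("max iterations reached")


tools = {"get_weather": lambda city: f"72F and sunny in {city}"}
script = [StubResponse(None, {}), StubResponse("get_weather", {"city": "Paris"})]
final = asyncio.run(run_loop(ScriptedClient(script), tools, "get_weather", "Weather in Paris?"))
print(final)
```

The first scripted reply is bare text, so the loop nudges once and then executes the terminal tool on the second reply.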

Workflow Runner Integration

The WorkflowRunner orchestrates all components:

async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

Sources: README.md

Design Principles

Async-First

All client methods and the runner are async, enabling efficient I/O handling across multiple concurrent requests. Sources: CONTRIBUTING.md

Type Safety

Pydantic is used for tool parameter schemas and response validation, ensuring runtime type safety for tool arguments. Sources: CONTRIBUTING.md

Modern Python

The codebase targets Python 3.12+ and uses modern syntax including:

Type unions with | (e.g., str | None)
dataclass decorators
field(default_factory=list) patterns

Sources: CONTRIBUTING.md

Guardrails Configuration Examples

# Basic guardrails
guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

# With respond tool for middleware
respond_guardrails = Guardrails(
    tool_names=["search", "lookup", "answer", RESPOND_TOOL_NAME],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

Sources: examples/foreign_loop.py

User Guide — Multi-step workflows and backend auto-management
Model Guide — Model recommendations by tier
Architecture Decisions — Design rationale for significant changes

Sources: CONTRIBUTING.md

Module Structure and API

Related topics: System Architecture, WorkflowRunner and Agentic Loop

Overview

The forge repository implements a modular LLM orchestration framework designed to handle multi-step tool-calling workflows with built-in guardrails, context management, and support for multiple LLM backends. The architecture follows a clean separation of concerns with distinct modules for client handling, workflow orchestration, guardrails enforcement, and server management.

Core Architecture

The forge library is organized into a layered architecture that separates backend communication, workflow definition, execution enforcement, and context management.

graph TD
    A[User Code] --> B[WorkflowRunner]
    B --> C[LLM Clients]
    C --> D[llama.cpp / Ollama / Llamafile]
    B --> E[Guardrails]
    B --> F[ContextManager]
    E --> G[ResponseValidator]
    E --> H[StepEnforcer]
    E --> I[ErrorTracker]

Module Hierarchy

Module	Purpose	Key Classes
`forge.core`	Core workflow orchestration	`Workflow`, `WorkflowRunner`, `ToolSpec`, `ToolDef`, `ToolCall`
`forge.clients`	LLM backend adapters	`OllamaClient`, `LlamafileClient`, `LlamaServerClient`
`forge.guardrails`	Response validation and enforcement	`Guardrails`, `ResponseValidator`, `StepEnforcer`, `ErrorTracker`
`forge.context`	Token budget and context management	`ContextManager`, `TieredCompact`
`forge.server`	Backend server lifecycle management	`ServerManager`, `BudgetMode`

Sources: src/forge/__init__.py

Tool Definition API

ToolSpec

The ToolSpec class defines the interface for tools that the LLM can invoke. It wraps a Pydantic model representing the tool's parameters.

class ToolSpec(BaseModel):
    name: str
    description: str
    parameters: type[BaseModel]

Construction from OpenAI Schema:

Tools can be defined from an OpenAI-style JSON Schema:

tool_spec = ToolSpec.from_openai_schema(
    name="get_weather",
    description="Get current weather for a city",
    schema={
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
    }
)

Sources: src/forge/core/workflow.py:1-50
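
To make the role of the parameter schema concrete, the sketch below checks tool arguments against the same OpenAI-style schema using only the standard library. It is a simplified stand-in for the Pydantic validation described in this manual; `validate_args` and its type map are hypothetical and cover only a few primitive types.

```python
# Map JSON Schema primitive type names to Python types (subset, for illustration).
# Note: bool passes as "integer" here because bool subclasses int in Python.
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_args(schema: dict, args: dict) -> list:
    """Return a list of human-readable problems; an empty list means the args pass."""
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in properties:
            problems.append(f"unexpected argument: {name}")
            continue
        expected = _JSON_TYPES.get(properties[name].get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{name}: expected {properties[name]['type']}")
    return problems


schema = {
    "type": "object",
    "properties": {"city": {"type": "string", "description": "City name"}},
    "required": ["city"],
}
print(validate_args(schema, {"city": "Paris"}))
print(validate_args(schema, {"town": "Paris"}))
```

A problem list like this is the kind of feedback a retry nudge could pass back to the model, although how forge phrases its nudges is not shown here.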

ToolDef

The ToolDef dataclass binds a tool schema to its implementation callable, along with prerequisites:

@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)

Prerequisites Syntax:

Prerequisites express conditional dependencies between tool calls:

Type	Example	Behavior
String (name-only)	`"read_file"`	Any prior call to `read_file` satisfies it
Dict (arg-matched)	`{"tool": "read_file", "match_arg": "path"}`	Prior call with same `path` value required

tool_def = ToolDef(
    spec=tool_spec,
    callable=get_weather_function,
    prerequisites=[{"tool": "search", "match_arg": "query"}]
)

Sources: src/forge/core/workflow.py:52-72
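
The two prerequisite forms can be evaluated against a history of prior calls. The following is a minimal sketch under stated assumptions: `prerequisites_met` is a hypothetical helper, the history is modeled as `(tool_name, args)` tuples, and the arg-matched form assumes both tools use the same argument name.

```python
def prerequisites_met(prerequisites, call_args, history):
    """Check each prerequisite against prior calls.

    history is a list of (tool_name, args_dict) tuples for calls already made.
    """
    for prereq in prerequisites:
        if isinstance(prereq, str):
            # Name-only: any prior call to that tool satisfies it.
            ok = any(tool == prereq for tool, _ in history)
        else:
            # Arg-matched: a prior call must share the value of `match_arg`.
            key = prereq["match_arg"]
            ok = any(
                tool == prereq["tool"] and key in args and args[key] == call_args.get(key)
                for tool, args in history
            )
        if not ok:
            return False
    return True


history = [("read_file", {"path": "a.txt"})]
arg_matched = [{"tool": "read_file", "match_arg": "path"}]

name_only_ok = prerequisites_met(["read_file"], {"path": "b.txt"}, history)
same_path_ok = prerequisites_met(arg_matched, {"path": "a.txt"}, history)
other_path_ok = prerequisites_met(arg_matched, {"path": "b.txt"}, history)
print(name_only_ok, same_path_ok, other_path_ok)
```

The name-only form passes for any path, while the arg-matched form passes only when the earlier `read_file` call used the same `path`.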

Workflow Definition API

Workflow Class

The Workflow class is the central configuration object for a multi-step LLM task:

workflow = Workflow(
    name="weather",
    description="Look up weather for a city",
    tools={
        "get_weather": ToolDef(spec=..., callable=get_weather)
    },
    required_steps=[],               # names listed here refer to keys in `tools`
    terminal_tool="get_weather",     # also refers to a key in `tools`
    system_prompt_template="You are a helpful assistant."
)

Key Parameters:

Parameter	Type	Required	Description
`name`	`str`	Yes	Workflow identifier
`description`	`str`	Yes	Human-readable description for the LLM
`tools`	`dict[str, ToolDef]`	Yes	Map of tool name to ToolDef
`required_steps`	`list[str]`	No	Tools that must be called before terminal_tool
`terminal_tool`	`str`	Yes	Tool(s) that can end the workflow
`system_prompt_template`	`str`	No	System prompt injected into context

Sources: src/forge/core/workflow.py

LLM Client API

Client Architecture

Forge provides backend-agnostic client adapters that implement a common interface:

graph LR
    A[WorkflowRunner] --> B[Client Interface]
    B --> C[OllamaClient]
    B --> D[LlamafileClient]
    B --> E[LlamaServerClient]

OllamaClient

client = OllamaClient(
    model="ministral-3:8b-instruct-2512-q4_K_M",
    recommended_sampling=True
)

Sampling Defaults

Per-model recommended sampling parameters are managed through sampling_defaults.py:

def apply_sampling_defaults(
    model: str,
    *,
    strict: bool,
) -> dict[str, float | int]:
    """Apply the recommended-sampling policy for model."""

Sampling Policy Quadrant:

`strict`	Model in Map	Behavior
`True`	Yes	Return dict copy
`True`	No	Raise `UnsupportedModelError`
`False`	Yes	One-shot INFO log; return `{}`
`False`	No	Return `{}` (silent)

Sources: src/forge/clients/sampling_defaults.py:1-80

ToolCall Response Model

The ToolCall class represents a validated tool invocation returned by an LLM client:

class ToolCall(BaseModel):
    tool: str

Additional fields may be populated by client implementations (e.g., args, reasoning).

Sources: src/forge/core/workflow.py:74-77

Guardrails API

The guardrails system provides middleware for orchestrating LLM responses with built-in validation, step enforcement, and error handling.

Guardrails Class

The main entry point for the guardrails system:

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
    max_retries=3,
    max_tool_errors=2,
    rescue_enabled=True,
    max_premature_attempts=3
)

Constructor Parameters:

Parameter	Type	Default	Description
`tool_names`	`list[str]`	Required	Valid tool names for this workflow
`required_steps`	`list[str]`	`None`	Tools that must be called before terminal_tool
`terminal_tool`	`str | frozenset`	Required	Tool(s) that can end the workflow
`max_retries`	`int`	`3`	Consecutive bad responses before fatal
`max_tool_errors`	`int`	`2`	Consecutive tool failures before exhaustion
`rescue_enabled`	`bool`	`True`	Attempt to parse tool calls from plain text
`max_premature_attempts`	`int`	`3`	Premature terminal attempts before fatal
`retry_nudge`	`Callable[[str], str]`	`None`	Custom nudge for bare text responses

Sources: src/forge/guardrails/guardrails.py:1-80
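
As an illustration of what `rescue_enabled` is for, the sketch below recovers a tool call that a model embedded in plain text. The embedded-JSON shape (`{"tool": ..., "args": ...}`) and the `rescue_tool_call` helper are assumptions made for this example; the formats forge's rescue parser actually accepts are not shown in this section.

```python
import json
from typing import Optional


def rescue_tool_call(text: str, tool_names: list) -> Optional[dict]:
    """Scan free text for the first JSON object that names a known tool."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            candidate, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(candidate, dict) and candidate.get("tool") in tool_names:
            return candidate
    return None


reply = 'Sure, I will search. {"tool": "search", "args": {"query": "forge"}}'
rescued = rescue_tool_call(reply, ["search", "lookup", "answer"])
missed = rescue_tool_call("I think the answer is 42.", ["search"])
print(rescued, missed)
```

When nothing can be rescued, a guardrail layer would fall back to a retry nudge rather than executing anything.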

CheckResult

The return type of Guardrails.check():

class CheckResult:
    action: Literal["execute", "retry", "step_blocked", "fatal"]
    tool_calls: list[ToolCall] | None
    nudge: Nudge | None
    reason: str | None

Action Meanings:

Action	Description
`execute`	Safe to proceed; `tool_calls` contains valid calls
`retry`	Invalid response; inject `nudge` and retry
`step_blocked`	Attempted terminal tool before required steps
`fatal`	Max retries exhausted; `reason` contains explanation

Two-Method Guardrails API

# After each LLM response
result = guardrails.check(response)

if result.action == "fatal":
    return f"FATAL: {result.reason}"

if result.action in ("retry", "step_blocked"):
    return f"{result.action}: {result.nudge.content}"

# result.action == "execute"
# Run tools yourself, then record results
tool_calls = result.tool_calls
executed = [tc.tool for tc in tool_calls]
done = guardrails.record(executed)

Sources: src/forge/guardrails/guardrails.py:82-130

Granular API

For advanced use cases, individual guardrail components can be used directly:

from forge.guardrails import ResponseValidator, StepEnforcer, ErrorTracker

validator = ResponseValidator(
    tool_names=["search", "lookup", "answer"],
    rescue_enabled=True,
)
enforcer = StepEnforcer(
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
errors = ErrorTracker(max_retries=3, max_tool_errors=2)

Server Management API

ServerManager

The ServerManager class handles lifecycle management for llama.cpp-based backends:

server = ServerManager(backend="llamaserver", port=8080)

Backend Options:

Backend	Description
`"ollama"`	Ollama server (model name, no GGUF path)
`"llamaserver"`	llama.cpp server via `llama-server`
`"llamafile"`	Mozilla llamafile binary

Budget Resolution

The resolve_budget() method determines context length based on mode:

async def resolve_budget(
    self,
    mode: BudgetMode,
    manual_tokens: int | None = None,
) -> int:

Mode	Behavior
`MANUAL`	Use `manual_tokens` directly
`FORGE_FAST` / `FORGE_DEEP`	Query server `/props` for context

Sources: src/forge/server.py:1-100

Context Management

ContextManager

Token budget management for long-running conversations:

ctx = ContextManager(
    strategy=TieredCompact(keep_recent=2),
    budget_tokens=8192
)
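
A rough picture of what a compaction strategy does with `keep_recent` and a token budget is sketched below. It is a single-tier simplification, not `TieredCompact` itself: the four-characters-per-token estimate, the `compact` helper, and the rule of dropping the oldest non-system messages are all assumptions.

```python
def estimate_tokens(message: dict) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(message["content"]) // 4)


def compact(messages: list, budget_tokens: int, keep_recent: int = 2) -> list:
    """Drop the oldest non-system messages until the estimate fits the budget.

    The system prompt (index 0) and the last `keep_recent` messages are never dropped.
    """
    kept = list(messages)
    while sum(estimate_tokens(m) for m in kept) > budget_tokens and len(kept) > 1 + keep_recent:
        del kept[1]  # oldest message after the system prompt
    return kept


history = [{"role": "system", "content": "s" * 40}] + [
    {"role": "user", "content": "x" * 400} for _ in range(5)
]
trimmed = compact(history, budget_tokens=250, keep_recent=2)
print(len(history), "->", len(trimmed))
```

Here six messages (about 510 estimated tokens) shrink to the system prompt plus the two most recent messages.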

BudgetMode Enum

class BudgetMode(Enum):
    MANUAL = "manual"
    FORGE_FAST = "forge_fast"
    FORGE_DEEP = "forge_deep"

Sources: src/forge/server.py

Error Types

Forge defines custom exceptions for specific error conditions:

class UnsupportedModelError(Exception):
    """Raised when strict sampling defaults are requested for unknown models."""
    pass

Additional error types in errors.py:

Error	Use Case
`BudgetResolutionError`	Server unreachable or missing n_ctx
`BackendError`	Backend communication failures

Sources: src/forge/errors.py

Quick Start Example

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

Sources: README.md

Summary

The forge module structure provides:

Tool Definition — ToolSpec and ToolDef for declaring LLM-callable functions with prerequisites
Workflow Orchestration — Workflow and WorkflowRunner for managing multi-step tasks
Client Abstraction — Backend-agnostic clients with sampling defaults
Guardrails Middleware — Built-in validation, step enforcement, and error handling
Server Management — Lifecycle control for llama.cpp backends
Context Management — Token budget and compaction strategies

Sources: src/forge/__init__.py

Architecture Decision Records

Related topics: System Architecture

Overview

Architecture Decision Records (ADRs) serve as the authoritative documentation for significant design choices within the forge project. They capture the *why* behind implementation decisions, enabling current and future contributors to understand the reasoning without reconstructing the original context.

The forge project stores ADRs in docs/decisions/, using a numbered naming convention (e.g., 001-ablation-framework.md, 011-guardrail-middleware.md, 013-text-response-intent.md). This numbering scheme allows for easy chronological tracking and establishes precedent relationships between decisions.

Purpose and Scope

ADRs in forge address several critical aspects:

Category	Description	Example Document or Source File
Framework Design	Ablation study methodology and tooling	`001-ablation-framework.md`
Middleware Patterns	Guardrail implementation and composition	`011-guardrail-middleware.md`
Response Handling	Intent classification for text responses	`013-text-response-intent.md`
Backend Integration	LLM client configuration and sampling defaults	`sampling_defaults.py`
Server Management	Context resolution and budget modes	`server.py`

Each ADR documents not only the chosen approach but also considered alternatives and the tradeoffs that influenced the final decision. This creates a historical record that prevents repeated debates over settled questions while enabling informed reconsideration when circumstances change.

ADR Contribution Workflow

According to the contribution guidelines, the process for introducing a new architecture decision follows a structured review pattern:

graph TD
    A[Identify Design Decision] --> B[Review Existing ADRs]
    B --> C{Decision Already Documented?}
    C -->|Yes| D[Reference Existing ADR]
    C -->|No| E[Draft New ADR]
    E --> F[Propose ADR Format]
    F --> G[Review Against Project Standards]
    G --> H[Merge and Publish]
    
    style A fill:#e1f5fe
    style H fill:#c8e6c9

The contribution workflow integrates with the broader project development cycle:

Proposal Phase: Before implementing significant changes, contributors should draft an ADR following the established format
Review Phase: The ADR undergoes peer review alongside code review
Adoption Phase: Once approved, the ADR becomes the reference for implementation decisions
Maintenance Phase: ADRs may be updated if subsequent decisions supersede them

Sources: CONTRIBUTING.md:1-50

Core Architecture Components

Workflow Engine

The Workflow and WorkflowRunner classes form the central orchestration layer. A workflow defines the available tools, required execution steps, and terminal conditions.

graph TD
    subgraph Workflow Definition
        W[Workflow] --> TD[Tool Definitions]
        W --> RS[Required Steps]
        W --> TT[Terminal Tool]
        W --> SP[System Prompt Template]
    end
    
    subgraph Execution Layer
        WR[WorkflowRunner] --> CM[Context Manager]
        WR --> GR[Guardrails]
        WR --> CL[LLM Client]
    end
    
    subgraph Tool Layer
        TC[ToolCall] --> T[Tool Execution]
        T --> TR[Tool Response]
    end
    
    WR --> TC
    TC -->|Result| CM
    CM -->|Context| WR
    
    style W fill:#fff3e0
    style WR fill:#e3f2fd

The ToolDef dataclass binds tool schemas to implementations, while ToolSpec defines the JSON Schema for parameter validation. Tool calls are represented as ToolCall objects containing the tool name and arguments.

Sources: src/forge/core/workflow.py:1-100

Guardrail Middleware

The guardrail system provides a composable validation layer that intercepts LLM responses before tool execution:

graph LR
    LLM[LLM Response] --> GR[Guardrails.check]
    GR --> RV[ResponseValidator]
    GR --> SE[StepEnforcer]
    GR --> ET[ErrorTracker]
    
    RV -->|Valid| TC[ToolCalls]
    RV -->|Invalid| NR[Retry Nudge]
    SE -->|Correct Order| TC
    SE -->|Wrong Order| SB[Step Blocked]
    ET -->|OK| TC
    ET -->|Max Errors| FT[Fatal]
    
    style GR fill:#fce4ec
    style TC fill:#c8e6c9

The Guardrails class orchestrates three sub-components:

Component	Responsibility	Key Parameters
`ResponseValidator`	Parses tool calls, enables rescue parsing	`rescue_enabled`, `retry_nudge_fn`
`StepEnforcer`	Ensures required steps precede terminal tool	`required_steps`, `max_premature_attempts`
`ErrorTracker`	Tracks consecutive errors and retries	`max_retries`, `max_tool_errors`

Sources: src/forge/guardrails/guardrails.py:1-100
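
The step-enforcement rule (required steps must precede the terminal tool, with a cap on premature attempts) can be sketched as a small state machine. `StepEnforcerSketch` is a hypothetical class written for this manual; it ignores the ordering of required steps, and whether the real threshold is inclusive is an assumption.

```python
class StepEnforcerSketch:
    """Toy model of required-step enforcement; not forge's StepEnforcer."""

    def __init__(self, required_steps, terminal_tool, max_premature_attempts=3):
        self.required_steps = list(required_steps)
        self.terminal_tool = terminal_tool
        self.max_premature_attempts = max_premature_attempts
        self.completed = set()
        self.premature_attempts = 0

    def check(self, tool):
        """Return 'execute', 'step_blocked', or 'fatal' for a proposed tool call."""
        if tool != self.terminal_tool:
            return "execute"
        missing = [step for step in self.required_steps if step not in self.completed]
        if not missing:
            return "execute"
        self.premature_attempts += 1
        if self.premature_attempts >= self.max_premature_attempts:
            return "fatal"
        return "step_blocked"

    def record(self, tool):
        """Mark a tool as successfully executed."""
        self.completed.add(tool)


enforcer = StepEnforcerSketch(["search", "lookup"], "answer")
actions = [enforcer.check("answer")]      # nothing done yet
enforcer.record("search")
actions.append(enforcer.check("answer"))  # lookup still missing
enforcer.record("lookup")
actions.append(enforcer.check("answer"))  # all required steps done
print(actions)
```

The split between `check` and `record` mirrors the two-method shape of the guardrails API: the caller runs the tools itself and reports back what actually executed.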

Server Management and Budget Resolution

The ServerManager handles lifecycle management for llama.cpp-based backends, while the ContextManager implements token budget strategies:

graph TD
    SM[ServerManager] --> BM[BudgetMode]
    BM -->|FORGE_FAST| FT[Fast Budget]
    BM -->|FORGE_BALANCED| BT[Balanced Budget]
    BM -->|FORGE_DEEP| DT[Deep Budget]
    BM -->|MANUAL| MT[Manual Tokens]
    
    CM[ContextManager] --> TC[TieredCompact]
    CM --> SC[SimpleCompact]
    
    SM -->|Context Query| Props["/props endpoint"]
    Props -->|n_ctx| CM

Budget resolution follows platform-specific paths:

Ollama: Uses manual_tokens parameter for MANUAL mode
Llamafile/Llama Server: Queries /props endpoint for server-configured context length

Sources: src/forge/server.py:1-100

Sampling Configuration System

The sampling defaults system separates lookup from policy, enabling fine-grained control over model parameters:

graph TD
    subgraph Lookup Layer
        GM[get_sampling_defaults] --> MAP[MODEL_SAMPLING_DEFAULTS]
    end
    
    subgraph Policy Layer
        AS[apply_sampling_defaults] -->|"strict=True, in map"| ReturnDict["Return dict"]
        AS -->|"strict=True, not in map"| RaiseErr["Raise UnsupportedModelError"]
        AS -->|"strict=False, in map"| InfoLog["INFO log once, return empty dict"]
        AS -->|"strict=False, not in map"| Silent["Return empty dict silently"]
    end
    
    subgraph Client Integration
        OC[OllamaClient] --> AS
        LC[LlamafileClient] --> AS
        AC[AnthropicClient] --> AS
    end

The two-function design (get_sampling_defaults for pure lookup, apply_sampling_defaults for policy) ensures that:

Unknown models don't cause errors when strict=False
Known models log a one-time INFO message when not opted in
Explicit opt-in via recommended_sampling=True enables strict behavior

Sources: src/forge/clients/sampling_defaults.py:1-100

Proxy Server Architecture

The ProxyServer provides a forwarding layer with additional control features:

graph TD
    subgraph Proxy Layer
        PS[ProxyServer] --> SF[Serialize Flag]
        PS --> RT[Retry Logic]
        PS --> RC[Rescue Parser]
    end
    
    subgraph Backend Routing
        PS --> Ollama[Ollama Backend]
        PS --> Llama[Llama Backend]
        PS --> LLF[Llamafile Backend]
    end
    
    subgraph Configuration
        SF --> |serialize=True| Serial[Serialize Requests]
        SF --> |serialize=False| Parallel[Parallel Requests]
        RT --> |max_retries=N| RetryN[N Attempts]
    end

Key proxy options include:

Flag	Default	Purpose
`--host`	`127.0.0.1`	Proxy listen address
`--port`	`8081`	Proxy listen port
`--serialize`	`None`	Request serialization control
`--max-retries`	`3`	Retries per request
`--no-rescue`	`False`	Disable rescue parsing

Sources: src/forge/proxy/__main__.py:1-80

ADR Format and Standards

Each ADR in the forge repository follows a consistent structure:

Title: Descriptive name with ADR number
Status: Proposed, Accepted, Deprecated, or Superseded
Context: Background and problem statement
Decision: The chosen approach with rationale
Consequences: Benefits, drawbacks, and tradeoffs
Related Decisions: Links to dependent or related ADRs

This format ensures that future maintainers can quickly assess whether an ADR is current and understand the full context of each decision.

Versioning and Evolution

The CHANGELOG maintains a parallel record of implementation milestones, cross-referenced with ADRs. Major architectural changes increment the minor version number, while bug fixes increment the patch version (semantic versioning).

Changes that require ADR updates include:

New LLM backend support
Guardrail algorithm modifications
Context management strategy changes
Tool execution model alterations
Breaking API changes

Sources: CHANGELOG.md:1-100

Best Practices for ADR Readers

When reviewing ADRs to understand forge's architecture:

Start with the index: The docs/decisions/ directory lists all ADRs chronologically
Check status: Deprecated ADRs indicate historical context, not current practice
Cross-reference implementations: Source files in src/forge/ implement ADR decisions
Review CHANGELOG: Implementation dates and version numbers provide temporal context
Examine tests: Unit tests in tests/unit/ validate ADR-enforced behaviors

Sources: CONTRIBUTING.md:1-50

WorkflowRunner and Agentic Loop

Related topics: System Architecture

Overview

The WorkflowRunner is the central execution engine in the Forge framework, implementing an agentic loop that orchestrates multi-step tool-calling workflows with an LLM backend. It manages the complete lifecycle of a workflow: from initializing messages, through iterative LLM inference and tool execution, to context management and termination.

Core responsibilities:

Building initial message lists (system prompt + user input)
Coordinating LLM inference with streaming or batch responses
Validating and executing tool calls returned by the LLM
Managing context budget through the ContextManager
Enforcing required step sequences via the StepEnforcer
Handling retries for malformed responses
Terminating on terminal tool execution or max iterations

Sources: src/forge/core/runner.py:1-50

Guardrails Middleware for External Loops

Forge's Guardrails Middleware provides a composable reliability layer that can be integrated into any external orchestration loop. Rather than requiring adoption of the full WorkflowRunner, external projects can embed forge's retry nudges, rescue parsing, step enforcement, and error tracking directly within their own agent execution frameworks.

Source: https://github.com/antoinezambelli/forge / Human Manual

Proxy Server Setup

Overview

The forge proxy server is an OpenAI-compatible HTTP proxy that transparently applies forge's guardrail stack to any backend that speaks the Chat Completions API. It acts as a drop-in replacement for local model servers, enabling existing OpenAI-compatible clients to benefit from forge's reliability layer without code changes.

Sources: README.md

Architecture

The proxy operates in two distinct modes, determined at startup:

graph TD
    subgraph "Managed Mode"
        P["ProxyServer<br/>:8081"] --> SM["ServerManager"]
        SM --> BE["llama-server<br/>llamafile<br/>ollama<br/>:8080"]
    end
    
    subgraph "External Mode"
        P2["ProxyServer<br/>:8081"] --> BU["User-managed<br/>Backend<br/>:8080"]
    end
    
    C["OpenAI Client"] --> P
    C2["OpenAI Client"] --> P2

Managed Mode

In managed mode, forge starts and controls the backend process lifecycle. The ServerManager class handles:

Backend binary discovery and execution
Model loading and initialization
Health verification via /props endpoint polling
Graceful shutdown and restart

Sources: src/forge/proxy/proxy.py:1-40

External Mode

In external mode, the proxy connects to a user-managed backend. This is useful when:

The backend runs on a different machine or container
Custom backend configurations are required
The backend is managed by an external orchestration system

Sources: src/forge/proxy/proxy.py:35-45

Supported Backends

Backend	Description	Requirements
`llamaserver`	Llama.cpp's HTTP server	Local GGUF model file
`llamafile`	Mozilla's single-file model executable	Single-file executable
`ollama`	Ollama local inference server	Ollama runtime + model pulled

Sources: src/forge/proxy/__main__.py:25-29

CLI Usage

Basic Invocation

# External mode — you manage the backend
python -m forge.proxy --backend-url http://localhost:8080 --port 8081

# Managed mode — forge starts llama-server and proxy together
python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081

# Managed mode with ollama
python -m forge.proxy --backend ollama --model llama3.2 --port 8081

Sources: README.md
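
Once the proxy is listening, any Chat Completions client can point at it. The sketch below builds such a request with the standard library only. The `/v1/chat/completions` path is the conventional OpenAI route and is assumed here rather than confirmed for the proxy; `build_chat_request` is a hypothetical helper, and the network call is left commented out so the example runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a POST request in Chat Completions format aimed at the proxy."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://127.0.0.1:8081", "llama3.2", "What's the weather in Paris?")
print(request.get_method(), request.full_url)

# To send it once a proxy is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Because only the base URL changes, an existing client can be switched between the raw backend on port 8080 and the proxy on port 8081 without other modifications.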

Command-Line Arguments

Argument	Type	Default	Description
`--backend-url`	string	-	External backend URL (mutually exclusive with `--backend`)
`--backend`	choice	-	Backend type: `llamaserver`, `llamafile`, `ollama`
`--model`	string	-	Model name (required for ollama)
`--gguf`	string	-	Path to GGUF file (llamaserver/llamafile)
`--backend-port`	int	8080	Backend port for managed mode
`--budget-mode`	choice	backend	Context budget: `backend`, `manual`, `forge-full`, `forge-fast`
`--budget-tokens`	int	-	Manual token budget override
`--extra-flags`	list	-	Additional backend CLI flags
`--host`	string	127.0.0.1	Proxy listen host
`--port`	int	8081	Proxy listen port
`--serialize`	flag	-	Force request serialization
`--no-serialize`	flag	-	Disable request serialization
`--max-retries`	int	3	Max retries per request
`--no-rescue`	flag	-	Disable rescue parsing
`-v, --verbose`	flag	-	Enable debug logging

Sources: src/forge/proxy/__main__.py:13-53

Programmatic API

ProxyServer Class

The ProxyServer class provides a programmatic interface for embedding the proxy in Python applications.

from forge.proxy import ProxyServer

# External mode
proxy = ProxyServer(backend_url="http://localhost:8080")
proxy.start()
print(f"Proxy running at {proxy.url}")  # http://127.0.0.1:8081
# ... use proxy ...
proxy.stop()

# Managed mode
proxy = ProxyServer(
    backend="llamaserver",
    gguf="model.gguf",
    budget_mode="forge-fast",
    port=8081
)
proxy.start()
proxy.stop()  # Stops both backend and proxy

Sources: src/forge/proxy/proxy.py:50-75

Constructor Parameters

Parameter	Type	Default	Description
`backend_url`	`str | None`	None	External backend URL
`backend`	`str | None`	None	Backend type: `llamaserver`, `llamafile`, `ollama`
`model`	`str | None`	None	Model name for ollama
`gguf`	`str | Path | None`	None	Path to GGUF file
`backend_port`	int	8080	Backend port
`budget_mode`	BudgetMode	BudgetMode.BACKEND	Context budget strategy
`budget_tokens`	int	-	Manual token budget
`extra_flags`	`list[str] | None`	None	Additional CLI flags
`host`	str	127.0.0.1	Listen host
`port`	int	8081	Listen port
`serialize`	`bool | None`	None	Request serialization control
`max_retries`	int	3	Max retries per request
`rescue_enabled`	bool	True	Enable rescue parsing

Sources: src/forge/proxy/proxy.py:56-100

Lifecycle Methods

Method	Description
`start()`	Start the proxy (blocks until ready, max 120s timeout)
`stop()`	Stop the proxy and managed backend (30s shutdown timeout)
`url`	Property returning the proxy's base URL

Sources: src/forge/proxy/proxy.py:102-125

Respond Tool Injection

Purpose

Small local models (~8B parameters) cannot reliably choose between text output and tool calls. The proxy automatically injects a synthetic respond tool when tools are present in the request, forcing the model into tool-calling mode.

Behavior

When the request contains tools, forge injects a respond(message="...") tool into the tools list
The model calls respond(message="...") instead of producing bare text
The respond call is stripped from the outbound response
The client receives a normal text response with finish_reason: "stop"

This keeps the model in tool-calling mode where forge's full guardrail stack applies.

Sources: README.md

sequenceDiagram
    participant C as OpenAI Client
    participant P as ProxyServer
    participant B as Backend
    
    C->>P: POST /v1/chat/completions<br/>(with tools)
    P->>P: Inject respond tool
    P->>B: Forward request<br/>(tools + respond)
    B->>P: respond(message="answer")
    P->>P: Strip respond call
    P->>C: Normal text response<br/>(finish_reason: "stop")
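
The inject-and-strip flow above can be sketched with plain dictionaries. This is a minimal illustration, not forge's implementation: the `RESPOND_TOOL` schema and the helper names `inject_respond_tool` and `strip_respond_call` are hypothetical, and the message shape assumes the OpenAI chat-completions format.

```python
import copy
import json

# Hypothetical schema for the synthetic tool; forge's real definition may differ.
RESPOND_TOOL = {
    "type": "function",
    "function": {
        "name": "respond",
        "description": "Send a plain-text reply to the user.",
        "parameters": {
            "type": "object",
            "properties": {"message": {"type": "string"}},
            "required": ["message"],
        },
    },
}


def inject_respond_tool(request: dict) -> dict:
    """Return a copy of the request with the respond tool appended when tools are present."""
    if not request.get("tools"):
        return request  # No tools in the request: nothing to inject
    patched = copy.deepcopy(request)
    patched["tools"].append(copy.deepcopy(RESPOND_TOOL))
    return patched


def strip_respond_call(message: dict):
    """Turn a respond(...) call into plain text; return (message, finish_reason)."""
    calls = message.get("tool_calls") or []
    for call in calls:
        fn = call.get("function", {})
        if fn.get("name") == "respond":
            args = fn.get("arguments", "{}")
            if isinstance(args, str):  # OpenAI-style arguments arrive as a JSON string
                args = json.loads(args)
            return {"role": "assistant", "content": args.get("message", "")}, "stop"
    # Real tool calls (or bare text) pass through unchanged
    return message, ("tool_calls" if calls else "stop")
```

The request is deep-copied so the caller's original tools list is never mutated, which keeps the injection invisible to the client.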

Context Budget Modes

The proxy supports different strategies for managing context window usage:

Mode	Description
`backend`	Let the backend manage context (default)
`manual`	Use `--budget-tokens` for fixed budget
`forge-full`	Full tiered compaction strategy
`forge-fast`	Fast tiered compaction (reduced)

Sources: src/forge/proxy/__main__.py:35-38

Tiered Compaction

The forge-full and forge-fast modes utilize TieredCompact, a three-phase compaction strategy:

Truncate — Remove oldest messages
Drop results — Remove tool result content
Sliding window — Maintain recent context

Sources: src/forge/proxy/proxy.py:22
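
The escalation idea behind a tiered strategy can be sketched as follows. The exact behavior of each phase in TieredCompact is not documented here, so the phase bodies, the `keep` and `keep_recent` values, and the character-based token estimate are all assumptions made for illustration. The sketch also assumes system messages come first in the list.

```python
def estimate_tokens(messages):
    """Rough token estimate (assumption): about 4 characters per token, plus 1 per message."""
    return sum(len(str(m.get("content", ""))) // 4 + 1 for m in messages)


def _split(messages):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system, rest


def truncate_oldest(messages, keep=6):
    """Phase 1 (assumed): drop the oldest non-system messages, keeping the last `keep`."""
    system, rest = _split(messages)
    return system + rest[-keep:]


def drop_tool_results(messages):
    """Phase 2 (assumed): replace tool-result content with a short placeholder."""
    return [
        {**m, "content": "[tool result dropped]"} if m["role"] == "tool" else m
        for m in messages
    ]


def sliding_window(messages, keep_recent=2):
    """Phase 3 (assumed): keep system messages plus the most recent `keep_recent` others."""
    system, rest = _split(messages)
    return system + rest[-keep_recent:]


def tiered_compact(messages, budget_tokens):
    """Apply each phase in order, stopping as soon as the estimate fits the budget."""
    for phase in (truncate_oldest, drop_tool_results, sliding_window):
        if estimate_tokens(messages) <= budget_tokens:
            break
        messages = phase(messages)
    return messages
```

Each phase returns a new list, so the caller's history is left intact and the cheaper phases are tried before the more destructive ones.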

Request Serialization

By default, the proxy decides whether to serialize requests based on backend capabilities. The serialization flags override that decision:

Flag	Behavior
(none)	Proxy decides based on backend capabilities
`--serialize`	Force sequential request processing
`--no-serialize`	Allow concurrent processing

Sources: src/forge/proxy/__main__.py:31-34
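
One way to picture the three settings is a gate that wraps each request handler in an `asyncio.Lock` when serialization is on. This is a sketch only: `RequestGate`, `resolve_serialize`, and the `backend_handles_concurrency` argument are hypothetical names, and how forge detects backend capabilities is not covered here.

```python
import asyncio


def resolve_serialize(flag, backend_handles_concurrency):
    """Map the tri-state flag (True / False / None) to a final decision."""
    if flag is not None:
        return flag  # --serialize or --no-serialize given explicitly
    return not backend_handles_concurrency  # no flag: decide from the backend


class RequestGate:
    """Runs handlers one at a time when serialize=True, concurrently otherwise."""

    def __init__(self, serialize: bool):
        # Create the gate inside a running event loop so the lock binds to it
        self._lock = asyncio.Lock() if serialize else None

    async def run(self, handler, *args):
        if self._lock is None:
            return await handler(*args)
        async with self._lock:
            return await handler(*args)
```

With the lock in place, a second request waits until the first handler returns, which is the behavior a single-slot backend needs.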

Sampling Parameters Pass-Through

The proxy forwards OpenAI-compatible sampling fields directly to the backend without modification:

temperature
top_p
top_k
min_p
repeat_penalty
presence_penalty
seed

Sources: CHANGELOG.md

To use model-card-recommended sampling in proxy mode:

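import httpx

from forge.clients import get_sampling_defaults

# Client pointed at the proxy (default listen address)
client = httpx.Client(base_url="http://127.0.0.1:8081", timeout=120.0)

# Look up recommended sampling parameters
sampling = get_sampling_defaults("ministral-3-8b-instruct")
# Include in request body
response = client.post("/v1/chat/completions", json={
    "model": "ministral-3-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    **sampling
})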

Signal Handling

The proxy gracefully handles shutdown signals:

SIGINT (Ctrl+C) — Immediate shutdown
SIGTERM — Graceful shutdown

The main thread uses a timed sleep loop (time.sleep(0.1)) to allow Python to deliver signals between iterations, ensuring proper shutdown on Windows.

Sources: src/forge/proxy/__main__.py:95-105
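
The timed sleep loop can be sketched with the standard library alone. The helper names `install_shutdown_handlers` and `wait_for_shutdown` are hypothetical, and the sketch treats both signals the same way rather than reproducing forge's distinction between them.

```python
import signal
import threading
import time


def install_shutdown_handlers(stop_event: threading.Event) -> None:
    """Set stop_event when SIGINT or SIGTERM arrives (must be called from the main thread)."""
    def _handler(signum, frame):
        stop_event.set()

    signal.signal(signal.SIGINT, _handler)
    signal.signal(signal.SIGTERM, _handler)


def wait_for_shutdown(stop_event: threading.Event, poll_interval: float = 0.1) -> None:
    """Sleep in short slices so pending signal handlers can run between iterations."""
    while not stop_event.is_set():
        time.sleep(poll_interval)
```

A single long blocking wait can delay signal delivery on some platforms, which is why the loop wakes up every `poll_interval` seconds instead.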

Testing with Smoke Test Script

The repository includes a smoke test at scripts/smoke_test_proxy.py that:

Starts a mock backend on port 18080
Launches the proxy in external mode on port 18081
Verifies health endpoint
Sends a test chat completion request
Validates the response structure

python scripts/smoke_test_proxy.py

Sources: scripts/smoke_test_proxy.py

Health Endpoint

The proxy exposes a /health endpoint for monitoring:

curl http://127.0.0.1:8081/health

Sources: scripts/smoke_test_proxy.py:70

Configuration Example: Complete Setup

# Start llama-server with custom flags, proxy it
python -m forge.proxy \
    --backend llamaserver \
    --gguf ./models/ministral-3-8b-instruct-q8_0.gguf \
    --model ministral-3-8b-instruct \
    --budget-mode forge-full \
    --backend-port 8080 \
    --port 8081 \
    --host 0.0.0.0 \
    --extra-flags --reasoning-format auto \
    --verbose

Then configure your client:

import asyncio

import httpx


async def main():
    # Point the client at the proxy's OpenAI-compatible endpoint
    async with httpx.AsyncClient(
        base_url="http://localhost:8081/v1",
        timeout=120.0,
    ) as client:
        response = await client.post("/chat/completions", json={
            "model": "ministral-3-8b-instruct",
            "messages": [
                {"role": "user", "content": "What's the weather in Paris?"}
            ],
            "tools": [
                {
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "city": {"type": "string"}
                            },
                            "required": ["city"]
                        }
                    }
                }
            ]
        })
        print(response.json())


asyncio.run(main())

Sources: README.md

Backend Clients

Related topics: Backend Setup Guide, Model Selection Guide

Overview

The Backend Clients subsystem provides a unified abstraction layer over various LLM backends, enabling forge to interact with different inference engines through a consistent interface. This modular design allows users to switch between Ollama, llamafile, and Anthropic backends without modifying workflow code.

Each client handles backend-specific communication protocols, response parsing, streaming, and tool call extraction while exposing a common async API for send operations and context resolution.

Sources: src/forge/clients/base.py:1-50

Architecture

Client Hierarchy

graph TD
    A[BaseClient] --> B[OllamaClient]
    A --> C[LlamafileClient]
    A --> D[AnthropicClient]
    
    E[sampling_defaults.py] --> B
    E --> C
    E --> D
    
    F[WorkflowRunner] --> B
    F --> C
    F --> D

All clients inherit from BaseClient, which defines the core async interface including send(), send_stream(), and get_context_length() methods. Backend-specific implementations override these methods to handle vendor-specific APIs and response formats.

Sources: src/forge/clients/base.py:1-100

Common Interface

All clients implement the following async methods:

Method	Purpose
`send(messages, tools, **kwargs)`	Send a request and receive a complete response
`send_stream(messages, tools, **kwargs)`	Stream responses as an async generator
`get_context_length()`	Query the backend for maximum context window
`stop()`	Stop any ongoing generation

Sources: src/forge/clients/base.py:50-150
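
To make the shape of this interface concrete, here is a minimal sketch using a structural `Protocol` and a toy client. `ClientLike` and `EchoClient` are hypothetical stand-ins, not forge classes, and the return types are assumptions since the method table does not specify them.

```python
import asyncio
from typing import Any, AsyncIterator, Optional, Protocol


class ClientLike(Protocol):
    """Structural type mirroring the method table above (return types assumed)."""

    async def send(self, messages: list, tools: Optional[list] = None, **kwargs: Any) -> Any: ...

    def send_stream(self, messages: list, tools: Optional[list] = None, **kwargs: Any) -> AsyncIterator[str]: ...

    async def get_context_length(self) -> Optional[int]: ...


class EchoClient:
    """Toy backend that echoes the last message; useful for exercising calling code."""

    async def send(self, messages, tools=None, **kwargs):
        return messages[-1]["content"]

    async def send_stream(self, messages, tools=None, **kwargs):
        # An async generator: yields one word at a time
        for word in messages[-1]["content"].split():
            yield word

    async def get_context_length(self):
        return 4096


async def collect(client: ClientLike, messages: list):
    """Call both request paths and return (full_response, streamed_chunks)."""
    full = await client.send(messages)
    chunks = [chunk async for chunk in client.send_stream(messages)]
    return full, chunks
```

Code written against the protocol works with any client that provides these methods, which is the property that lets workflows swap backends.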

OllamaClient

Purpose

The OllamaClient connects to a local Ollama server instance, supporting both standard models and GGUF-formatted models served through Ollama's model management system.

Sources: src/forge/clients/ollama.py:1-100

Configuration Options

OllamaClient(
    model: str,                          # Model name (e.g., "qwen3:8b-q4_K_M")
    base_url: str = "http://localhost:11434",
    recommended_sampling: bool = False, # Use verified per-model sampling params
    **kwargs                            # Passed to httpx client
)

Key Features

Recommended Sampling: When recommended_sampling=True, the client retrieves verified sampling parameters from forge.clients.sampling_defaults for known models. If the model is not in the map, an UnsupportedModelError is raised instead of silently falling back to backend defaults.

Streaming Support: Full streaming support with token-level async generation through send_stream().

Tool Call Extraction: Parses Ollama's JSON tool call format and converts to forge's internal ToolCall format.

Sources: src/forge/clients/sampling_defaults.py:1-80

LlamafileClient

Purpose

The LlamafileClient communicates with llamafile or llama-server instances, providing support for GGUF models served directly without Ollama's model management layer.

Sources: src/forge/clients/llamafile.py:1-100

Context Resolution

Unlike Ollama, llamafile and llama-server require querying the /props endpoint to determine the configured context length:

async def get_context_length(self) -> int | None:
    """Query the Llamafile /props endpoint for configured context length."""
    base = self.base_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]

    resp = await self._http.get(f"{base}/props")
    data = resp.json()
    n_ctx = data.get("default_generation_settings", {}).get("n_ctx")
    return int(n_ctx) if n_ctx is not None else None

Sources: src/forge/clients/llamafile.py:180-200
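
The two pure steps in that method, URL normalization and `n_ctx` extraction, can be pulled out and checked without a running server. The helper names `props_url` and `parse_n_ctx` are hypothetical and exist only for this sketch.

```python
from __future__ import annotations


def props_url(base_url: str) -> str:
    """Build the /props URL, dropping a trailing slash and an OpenAI-style /v1 suffix."""
    base = base_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]
    return f"{base}/props"


def parse_n_ctx(props: dict) -> int | None:
    """Extract the configured context length from a /props payload, if present."""
    n_ctx = props.get("default_generation_settings", {}).get("n_ctx")
    return int(n_ctx) if n_ctx is not None else None
```

For example, `props_url("http://localhost:8080/v1/")` yields `http://localhost:8080/props`, so a client configured with an OpenAI-style base URL still reaches the server root.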

Tool Call Modes

The client supports multiple tool call parsing strategies:

Mode	Description
`native`	Uses backend's native tool call format
`function`	Parses `<function=name>...</function>` style tags
`prompt`	Extracts tool calls from prompted responses

Sources: src/forge/clients/llamafile.py:100-180
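
A function-tag parser along these lines can be sketched with two regular expressions. This is an illustration rather than forge's parser: it assumes the `<parameter=name>value</parameter>` syntax shown in this manual's format examples, treats every value as a string, and the name `parse_function_tags` is hypothetical.

```python
import re

# Non-greedy bodies so several calls in one response are matched separately
_FUNCTION_RE = re.compile(r"<function=([\w.-]+)>(.*?)</function>", re.DOTALL)
_PARAM_RE = re.compile(r"<parameter=([\w.-]+)>(.*?)</parameter>", re.DOTALL)


def parse_function_tags(text: str) -> list:
    """Return a list of {"name", "arguments"} dicts found in the model output."""
    calls = []
    for name, body in _FUNCTION_RE.findall(text):
        arguments = {key: value.strip() for key, value in _PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls
```

Text outside the tags is ignored, so a model that wraps the call in explanatory prose still produces a usable result.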

AnthropicClient

Purpose

The AnthropicClient integrates with Anthropic's Claude API, enabling forge workflows to leverage Claude Opus, Sonnet, and Haiku models.

Sources: src/forge/clients/anthropic.py:1-100

Key Differences

No hardcoded temperature defaults — relies on Anthropic API's own defaults
Supports Anthropic-specific headers and request formatting
Compatible with tools via Anthropic's tool use API

Sampling Defaults System

Overview

The sampling_defaults module provides verified per-model sampling parameters sourced from HuggingFace model cards. This ensures optimal generation quality for supported models without requiring users to manually tune hyperparameters.

Sources: src/forge/clients/sampling_defaults.py:1-50

Supported Parameters

Parameter	Description	Typical Range
`temperature`	Sampling temperature	0.0 - 1.0
`top_p`	Nucleus sampling threshold	0.0 - 1.0
`top_k`	Top-k sampling	1 - 100
`min_p`	Minimum probability threshold	0.0 - 1.0
`repeat_penalty`	Repetition penalty	0.0 - 2.0
`presence_penalty`	Presence penalty (OpenAI compat)	-2.0 - 2.0

Policy Behavior

The apply_sampling_defaults() function implements a four-quadrant policy:

def apply_sampling_defaults(model: str, *, strict: bool) -> dict[str, float | int]:
    """Apply the recommended-sampling policy for model."""
    in_map = model in MODEL_SAMPLING_DEFAULTS
    if strict:
        if not in_map:
            raise UnsupportedModelError(model)
        return dict(MODEL_SAMPLING_DEFAULTS[model])
    
    # strict=False: one-shot INFO log if known, else silent
    if in_map and model not in _INFO_LOGGED:
        log.info("Recommended sampling params exist for %r...", model)
        _INFO_LOGGED.add(model)
    return {}

`strict`	Model in Map	Behavior
`True`	Yes	Return dict copy
`True`	No	Raise `UnsupportedModelError`
`False`	Yes	One-shot INFO log; return `{}`
`False`	No	Return `{}` (silent)

Sources: src/forge/clients/sampling_defaults.py:60-120

Verified Models

The following model families are currently supported with verified sampling parameters:

Qwen3 / Qwen3.5 / Qwen3.6
Qwen3-Coder
Gemma 4
Mistral Small 3.2
Devstral Small 2
Ministral 3 Instruct + Reasoning
Mistral Nemo
Granite 4.0

Each entry includes an inline HuggingFace card URL comment for verification.

Sources: src/forge/clients/sampling_defaults.py:50-80

Tool Call Processing

Extraction Flow

graph TD
    A[LLM Response] --> B{Response Type}
    B -->|tool_calls| C[extract_tool_call]
    B -->|text| D[TextResponse]
    
    C --> E{Tool Call Format}
    E -->|OpenAI style| F[Parse name + arguments]
    E -->|function tags| G[Parse XML-style tags]
    E -->|dict style| H[Parse dict with name field]
    
    F --> I[ToolCall object]
    G --> I
    H --> I

Supported Formats

Backend	Format	Example
Ollama	OpenAI-style function calls	`{"name": "get_weather", "arguments": {"city": "Paris"}}`
Llamafile	Function tags or native	`<function=name><parameter=city>Paris</parameter></function>`
Anthropic	Claude tool_use blocks	`{"name": "get_weather", "input": {"city": "Paris"}}`

Sources: src/forge/core/workflow.py:1-50

Proxy Mode Integration

Request Passthrough

When running in proxy mode, the proxy forwards these OpenAI-compatible body fields to the backend unmodified on each request:

temperature
top_p
top_k
min_p
repeat_penalty
presence_penalty
seed

For per-model recommended sampling in proxy mode, the calling client must look up forge.clients.get_sampling_defaults(model) and include the values in the request body.

Sources: src/forge/proxy/__main__.py:1-50

Usage Examples

Basic Workflow with OllamaClient

import asyncio

from forge import OllamaClient, WorkflowRunner, ContextManager, TieredCompact

async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True  # Use verified sampling params
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    # ... run workflows

asyncio.run(main())

Per-Call Sampling Override

response = await client.send(
    messages,
    tools,
    sampling={
        "temperature": 0.7,
        "top_p": 0.9
    }
)

The caller's explicit non-None fields merge with client-level defaults without mutating the original configuration.
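
The merge rule described here can be written as a small pure function. This is a sketch of the rule as stated, not forge's code, and `merge_sampling` is a hypothetical name.

```python
def merge_sampling(defaults: dict, override=None) -> dict:
    """Return defaults updated with the override's non-None fields, leaving both inputs untouched."""
    merged = dict(defaults)  # Copy so the client-level configuration is never mutated
    if override:
        merged.update({key: value for key, value in override.items() if value is not None})
    return merged
```

Passing `None` for a field therefore means "keep the client default" rather than "clear the value".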

Error Handling

Error	Cause	Resolution
`UnsupportedModelError`	Model not in sampling defaults map	Add model to `sampling_defaults.py` or pass `recommended_sampling=False`
`httpx.HTTPError`	Backend unreachable	Verify backend is running on correct port
`BudgetResolutionError`	Cannot determine context length	Check backend `/props` endpoint returns `n_ctx`

Sources: src/forge/server.py:1-50

Backend Server Management

ServerManager Integration

Forge can auto-manage backend servers through ServerManager, which handles starting, stopping, and context resolution:

from forge.server import setup_backend, BudgetMode

server, ctx = await setup_backend(
    backend="ollama",
    model="qwen3:8b-q4_K_M",
    budget_mode=BudgetMode.FORGE_FAST,
    client=client,
)

Supported Backends

Backend	Model Specification	Port	Features
`ollama`	Model name string	11434	Auto model management
`llamaserver`	GGUF file path	8080	Direct GGUF serving
`llamafile`	GGUF file path	8080	Single-file server

Sources: src/forge/server.py:50-150

Sources: src/forge/clients/base.py:1-50

Backend Setup Guide

Related topics: Backend Clients, Model Selection Guide

Overview

The Backend Setup Guide covers how to configure, initialize, and manage LLM backend servers within forge. Forge supports multiple backend types—Ollama, llama-server, and llamafile—each with distinct initialization patterns, context management strategies, and operational characteristics. Understanding these backends is essential for running forge's Workflow and WorkflowRunner components effectively.

Forge abstracts backend management through two primary classes: ServerManager (for direct server lifecycle control) and setup_backend() (a high-level async factory that combines server startup with ContextManager creation). Sources: src/forge/server.py:1-50

Supported Backend Types

Forge supports three backend implementations, each targeting different deployment scenarios:

Backend	Identity	Configuration	Use Case
`ollama`	Model name (e.g., `qwen3:8b`)	Uses Ollama's native model management	Quick local development, model switching
`llamaserver`	GGUF file path	Requires explicit GGUF path and context size	Production GGUF inference with fine control
`llamafile`	GGUF file path	Auto-discovers llamafile runtime	Single-file distribution, portable setups

The backend type is determined at ServerManager instantiation and cannot be changed afterward. Sources: src/forge/server.py:140-145

Ollama Backend

The Ollama backend leverages Ollama's built-in model management system. Models are pulled and managed through the Ollama CLI rather than requiring manual GGUF file handling.

server = ServerManager(backend="ollama", port=8080)
await server.start(model="qwen3:8b-q4_K_M", gguf_path="", mode="native")

Key constraints:

Does not accept gguf_path (use the model parameter instead)
Requires model to be specified
VRAM cleanup between model switches is handled via ollama stop Sources: src/forge/server.py:170-180

Llama-server and Llamafile Backends

Both llamaserver and llamafile backends operate on GGUF files directly. The server identity is derived from the GGUF path, enabling cache-equality checks for server reuse. Sources: src/forge/server.py:185-200

# Llama-server
server = ServerManager(backend="llamaserver", port=8080)
await server.start(model="identity", gguf_path="/models/qwen3-8b-q4_K_M.gguf", mode="native")

# Llamafile (auto-discovers runtime)
server = ServerManager(backend="llamafile", port=8080)
await server.start(model="identity", gguf_path="/models/llamafile-binary", mode="native")

ServerManager Architecture

The ServerManager class encapsulates all backend server lifecycle operations, providing a unified interface across different backend types.

State Management

graph TD
    A[ServerManager.__init__] --> B[_proc: Popen | None]
    A --> C[_current_model: str | None]
    A --> D[_current_mode: str | None]
    A --> E[_current_ctx: int | None]
    A --> F[_current_flags: tuple]
    A --> G[_current_cache_type_k/v: str | None]
    A --> H[_current_n_slots: int | None]
    A --> I[_current_kv_unified: bool]
    
    J[ServerManager.start] --> K{Cache Hit?}
    K -->|Yes| L[Return - reuse existing]
    K -->|No| M[await stop]
    M --> N[Build command flags]
    N --> O[Spawn subprocess]

Cache Equality Check

Before starting a new server instance, ServerManager checks if an existing server matches the requested configuration. This prevents unnecessary VRAM allocation and model reloading. Sources: src/forge/server.py:50-65

flags = tuple(extra_flags) if extra_flags else ()
if (
    self._current_model == model
    and self._current_mode == mode
    and self._current_ctx == ctx_override
    and self._current_flags == flags
    and self._current_cache_type_k == cache_type_k
    and self._current_cache_type_v == cache_type_v
    and self._current_n_slots == n_slots
    and self._current_kv_unified == kv_unified
):
    return  # Reuse existing server
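
The same reuse check can be expressed by bundling the compared fields into a frozen dataclass, so equality covers every field at once. This is an alternative sketch, not how ServerManager is written: `ServerConfig` and `ReusableServer` are hypothetical, and no process is actually spawned.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ServerConfig:
    """Every field that must match for a running server to be reused."""
    model: str
    mode: str
    ctx_override: Optional[int] = None
    flags: Tuple[str, ...] = ()
    cache_type_k: Optional[str] = None
    cache_type_v: Optional[str] = None
    n_slots: Optional[int] = None
    kv_unified: bool = False


class ReusableServer:
    """Tracks the active config and counts real (re)starts."""

    def __init__(self) -> None:
        self._current: Optional[ServerConfig] = None
        self.starts = 0

    def start(self, config: ServerConfig) -> bool:
        """Return True if a (re)start happened, False on a cache hit."""
        if self._current == config:
            return False  # Same configuration: reuse the existing server
        self._current = config  # A real manager would stop and respawn here
        self.starts += 1
        return True
```

Adding a new field to the dataclass automatically includes it in the comparison, which avoids forgetting a clause in a long `and` chain.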

Server Initialization Parameters

Parameter	Type	Default	Description
`backend`	`str`	Required	Backend type: `"ollama"`, `"llamaserver"`, or `"llamafile"`
`port`	`int`	`8080`	Server listen port (llama-server / llamafile only)
`models_dir`	`str | Path`	`None`	Directory containing GGUF files

Startup Parameters

The start() method accepts numerous parameters for fine-grained control over server behavior:

Parameter	Type	Description
`model`	`str`	Model identity (Ollama: model name; others: GGUF path as string)
`gguf_path`	`str | Path`	Path to GGUF file for llamaserver/llamafile
`mode`	`str`	`"native"` or `"prompt"` reasoning mode
`extra_flags`	`list[str]`	Additional CLI flags passed to the server
`ctx_override`	`int | None`	Override context window size (`-c <value>`)
`cache_type_k`	`str`	KV cache quantization type for keys (e.g., `"q8_0"`, `"q4_0"`)
`cache_type_v`	`str`	KV cache quantization type for values
`n_slots`	`int`	Concurrent slot count for multi-agent architectures
`kv_unified`	`bool`	Use unified KV cache across all slots

Sources: src/forge/server.py:95-115

Budget Modes

Budget modes control how forge resolves the context window budget for the ContextManager. The resolve_budget() method maps BudgetMode enum values to actual token counts. Sources: src/forge/server.py:220-250

graph TD
    A[resolve_budget mode] --> B{MANUAL?}
    B -->|Yes, Ollama| C[Return manual_tokens]
    B -->|Yes, others| D[await get_server_context]
    B -->|No| E{Ollama?}
    E -->|Yes| F[await _ollama.full]
    E -->|No| G{Mode == FORGE_FAST?}
    G -->|Yes| H[await get_server_context ÷ 4]
    G -->|No| I[await get_server_context]

Budget Resolution Table

BudgetMode	Ollama Backend	Llama-server/Llamafile Backend
`MANUAL`	Returns `manual_tokens` parameter	Queries `/props` for `n_ctx`
`BACKEND`	Ollama's reported context length	Queries `/props` for `n_ctx`
`FORGE_FAST`	`n_ctx / 4`	`n_ctx / 4`
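
The table can be read as a small decision function. The sketch below follows the table rows only: `BudgetModeSketch` is a hypothetical stand-in for forge's BudgetMode, `n_ctx` is passed in rather than queried from a backend, and raising `ValueError` for a missing manual budget is an assumption.

```python
from enum import Enum
from typing import Optional


class BudgetModeSketch(Enum):
    """Hypothetical stand-in covering only the modes listed in the table."""
    MANUAL = "manual"
    BACKEND = "backend"
    FORGE_FAST = "forge-fast"


def resolve_budget(mode: BudgetModeSketch, backend: str, n_ctx: int,
                   manual_tokens: Optional[int] = None) -> int:
    """Map a budget mode to a token count, given the backend's context length."""
    if mode is BudgetModeSketch.MANUAL and backend == "ollama":
        if manual_tokens is None:
            raise ValueError("manual_tokens is required for MANUAL mode on Ollama")
        return manual_tokens
    if mode is BudgetModeSketch.FORGE_FAST:
        return n_ctx // 4  # A quarter of the context window
    # BACKEND, and MANUAL on llama-server/llamafile: use the reported context length
    return n_ctx
```

With a 32768-token context, FORGE_FAST resolves to 8192 tokens under this reading.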

High-Level Setup with `setup_backend()`

For most use cases, prefer setup_backend() which combines server startup with ContextManager creation. Sources: src/forge/server.py:280-330

from forge.server import setup_backend, BudgetMode

async def example():
    client, ctx = await setup_backend(
        backend="llamaserver",
        gguf_path="/models/qwen3-8b-q4_K_M.gguf",
        budget_mode=BudgetMode.FORGE_FAST,
        client=None,  # Will create default client
    )
    # ... run workflows ...
    await client.close()

`setup_backend()` Parameters

Parameter	Type	Default	Description
`backend`	`str`	Required	Backend type
`model`	`str | None`	`None`	Ollama model name
`gguf_path`	`str | Path | None`	`None`	GGUF file path
`budget_mode`	`BudgetMode`	`BudgetMode.BACKEND`	Context budget strategy
`manual_tokens`	`int | None`	`None`	Required for `MANUAL` mode on Ollama
`client`	`Any | None`	`None`	Existing client or `None` to create default
`mode`	`str`	`"native"`	Reasoning mode
`port`	`int`	`8080`	Server port
`extra_flags`	`list[str] | None`	`None`	Additional backend flags
`on_compact`	`Callable | None`	`None`	Callback for compaction events
`compact_threshold`	`float`	`0.75`	Compaction trigger threshold
`phase_thresholds`	`tuple`	`(0.5, 0.7, 0.9)`	Tiered compaction thresholds

Server Readiness Detection

Forge uses /props polling rather than /health for readiness confirmation. This eliminates the gap between health-ok and props-available states. Sources: src/forge/server.py:260-278

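async def wait_for_ready(self, timeout: float = 60.0) -> None:
    url = f"http://localhost:{self._port}/props"
    # Assumed setup: the original excerpt omits how deadline and client are created
    deadline = time.monotonic() + timeout
    async with httpx.AsyncClient() as client:
        while time.monotonic() < deadline:
            try:
                resp = await client.get(url)
                if resp.status_code == 200:
                    data = resp.json()
                    # Generation settings appear once the model is loaded
                    if "default_generation_settings" in data:
                        return
            except (httpx.ConnectError, httpx.ReadError, httpx.TimeoutException):
                pass  # Server not accepting connections yet; retry
            await asyncio.sleep(2)
    # Behavior after the deadline passes is not shown in this excerpt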

The readiness check looks for default_generation_settings in the response—a strong indicator that the model is fully loaded and serving. Sources: src/forge/server.py:260-278

Proxy Server Configuration

Forge includes a proxy server (forge.proxy) that plumbs OpenAI-compatible sampling parameters through to backends. The proxy does not consult the sampling defaults map; it passes through whatever parameters the inbound request carries. Sources: src/forge/proxy/__main__.py:1-60

Proxy CLI Options

Flag	Type	Default	Description
`--backend-url`	`str`	-	External backend URL (mutually exclusive with `--backend`)
`--backend`	`str`	-	Backend type: `llamaserver`, `llamafile`, `ollama`
`--model`	`str`	-	Model name (required for ollama)
`--gguf`	`str`	`""`	GGUF path (for non-Ollama)
`--budget-mode`	`str`	`"backend"`	Budget resolution mode
`--budget-tokens`	`int`	`None`	Manual token budget
`--host`	`str`	`127.0.0.1`	Proxy listen host
`--port`	`int`	`8081`	Proxy listen port
`--serialize`	`flag`	`None`	Force request serialization
`--max-retries`	`int`	`3`	Max retries per request
`--verbose`	`flag`	`False`	Enable debug logging

Proxy Sampling Passthrough

The proxy supports these OpenAI-compatible body fields:

Parameter	Type	Description
`temperature`	`float`	Sampling temperature
`top_p`	`float`	Nucleus sampling threshold
`top_k`	`int`	Top-k sampling
`min_p`	`float`	Minimum probability threshold
`repeat_penalty`	`float`	Repetition penalty
`presence_penalty`	`float`	Presence penalty
`seed`	`int`	Deterministic sampling seed

For per-model recommended sampling in proxy mode, callers should look up forge.clients.get_sampling_defaults(model) and include the values in the request body. Sources: src/forge/clients/sampling_defaults.py:1-50

Per-Model Sampling Defaults

Forge ships verified per-model sampling recommendations for supported models. These must be explicitly opted into via recommended_sampling=True. Sources: src/forge/clients/sampling_defaults.py:50-80

Supported Models

The sampling defaults map includes recommendations for:

Qwen3 / 3.5 / 3.6 series
Qwen3-Coder
Gemma 4
Mistral Small 3.2
Devstral Small 2
Ministral 3 Instruct + Reasoning
Mistral Nemo
Granite 4.0 (h-micro, h-tiny)

Each entry includes an inline HuggingFace model card URL for verification. Sources: CHANGELOG.md:0.6.0

Sampling Policy

`strict`	Model in Map	Behavior
`True`	Yes	Return dict copy
`True`	No	Raise `UnsupportedModelError`
`False`	Yes	One-shot INFO log; return `{}`
`False`	No	Return `{}` (silent)

Context Length Resolution

For non-Ollama backends, forge queries the server's /props endpoint to determine the configured context length. This value feeds into budget resolution. Sources: src/forge/server.py:200-220

async def get_server_context(self) -> int:
    """Query /props for actual n_ctx.
    
    For Ollama: ``ollama stop`` for clean VRAM unloads between model switches.
    """
    props = await self.query_props()
    ctx = props.get("default_generation_settings", {}).get("n_ctx")
    if ctx is None:
        raise BudgetResolutionError()
    return ctx

Best Practices

Server Reuse

Always check if a server with the desired configuration is already running before starting a new one. The ServerManager performs this check internally based on:

Model identity (name or GGUF path)
Mode (native or prompt)
Context override
CLI flags
KV cache quantization settings
Slot configuration

VRAM Management

For Ollama backends, use ollama stop to cleanly unload models and free VRAM before switching to a different model. The llama-server/llamafile backends handle this through server restart. Sources: src/forge/server.py:200

Graceful Shutdown

Always call server.stop() when finished to properly terminate the backend process:

server = ServerManager(backend="llamaserver", port=8080)
try:
    await server.start(...)
    # ... work ...
finally:
    await server.stop()

Multi-Agent Configurations

For multi-agent architectures requiring concurrent slots, configure n_slots and optionally kv_unified=True for shared KV cache across slots:

await server.start(
    model="...",
    gguf_path="...",
    n_slots=4,
    kv_unified=True,  # Each slot can use full context
)

Quick Start Example

import asyncio
from forge import OllamaClient, WorkflowRunner
from forge.server import setup_backend, BudgetMode
from forge.context import ContextManager, TieredCompact

async def main():
    # Setup backend with forge-managed context
    client, ctx = await setup_backend(
        backend="ollama",
        model="ministral-3:8b-instruct-2512-q4_K_M",
        budget_mode=BudgetMode.FORGE_FAST,
        recommended_sampling=True,
    )
    
    try:
        runner = WorkflowRunner(client=client, context_manager=ctx)
        # ... run workflows ...
    finally:
        await client.close()

asyncio.run(main())

For GGUF-based setups:

from forge.server import setup_backend, BudgetMode

client, ctx = await setup_backend(
    backend="llamaserver",
    gguf_path="/path/to/model-q4_K_M.gguf",
    budget_mode=BudgetMode.FORGE_FAST,
    extra_flags=["--reasoning-format", "auto"],
)

CONTRIBUTING.md - Project setup and testing
README.md - Quick start and Workflow overview
CHANGELOG.md - Version history and breaking changes

Sources: src/forge/server.py:95-115

Model Selection Guide

Related topics: Backend Setup Guide, Backend Clients

Overview

The Model Selection Guide covers how to choose, configure, and deploy language models within forge. Forge provides a unified workflow engine that abstracts backend differences (Ollama, Llamafile, LlamaServer) while offering per-model recommended sampling parameters sourced directly from HuggingFace model cards.

Sources: src/forge/clients/sampling_defaults.py:1-20

Supported Backends

Forge supports three LLM backend types, each with distinct configuration requirements.

Backend	Configuration Method	Model Specification	Notes
Ollama	`model` parameter	Model name from `ollama list`	No GGUF path needed
Llamafile	`gguf_path` parameter	Path to .llamafile binary	Self-contained executables
LlamaServer	`gguf_path` parameter	Path to GGUF model file	Requires llama.cpp server binary

Sources: src/forge/server.py:1-50

Backend Selection Logic

graph TD
    A[Choose Backend] --> B{Backend Type?}
    B -->|Ollama| C[Use model name]
    B -->|Llamafile| D[Use gguf_path]
    B -->|LlamaServer| D
    C --> E[Connect via localhost:11434]
    D --> F[Start server process]
    F --> G[Connect via port 8080]

For Ollama, the model parameter directly references the model name from ollama list. For Llamafile and LlamaServer, you must provide the gguf_path pointing to the model file, and Forge will manage the server process lifecycle.

Sources: src/forge/server.py:80-120

Supported Models

Model Families

Forge has been tested and evaluated with the following model families across different quantization levels.

Model Family	Variants	Recommended Quantization	Notes
Qwen3	8B, 3.5, 3.6	Q4_K_M, Q8_0	Includes Qwen3-Coder
Gemma	4 (all sizes)	Q4_K_M	Use `--reasoning-budget 0` workaround
Mistral	Small 3.2, Nemo, 7B	Q4_K_M, Q8_0	Ministral variants available
Devstral	Small 2	Q4_K_M	Code-focused model
Granite	4.0 (h-micro, h-tiny)	Q4_K_M, Q8_0	OpenAI-style tool calls
Llama	3.1 8B	Q4_K_M, Q8_0	8B Reasoning variants
Ministral	3 Instruct, 8B Instruct, Reasoning	Q4_K_M, Q8_0	Reasoning requires budget fix

Sources: CHANGELOG.md:1-50

Model Naming Conventions

Model names vary by backend. With Ollama, use the exact model tag shown by `ollama list`. For GGUF-based backends, the model name is derived from the filename stem of the GGUF file.

# Ollama example
client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M")

# Llamafile/LlamaServer example - model derived from path
client = LlamafileClient(gguf_path="/models/mistral-7b-q4_K_M.gguf")

Sources: README.md:1-30
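As a minimal sketch of the filename-stem rule for GGUF-based backends, the snippet below uses only the standard library. The helper name `model_name_from_gguf` is hypothetical and not part of Forge's API; Forge's own derivation may apply further normalization.

```python
from pathlib import Path
from typing import Union


def model_name_from_gguf(gguf_path: Union[str, Path]) -> str:
    """Hypothetical helper: the filename without its final extension."""
    return Path(gguf_path).stem


name = model_name_from_gguf("/models/mistral-7b-q4_K_M.gguf")
print(name)  # mistral-7b-q4_K_M
```

Because the name comes from the file on disk, renaming a GGUF file also changes the name that any per-model lookup would see.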

Recommended Sampling Configuration

The Sampling Defaults Map

Forge ships the `forge.clients.sampling_defaults` module, which contains a verified map of per-model sampling recommendations. Each entry includes parameters such as `temperature`, `top_p`, `top_k`, `min_p`, `repeat_penalty`, and `presence_penalty`, sourced directly from HuggingFace model cards.

Sources: src/forge/clients/sampling_defaults.py:20-40

Enabling Recommended Sampling

To use per-model recommended sampling, pass recommended_sampling=True when initializing the client.

from forge import OllamaClient

# Opt-in to recommended sampling
client = OllamaClient(
    model="qwen3:8b-q4_K_M",
    recommended_sampling=True
)

If the model is not in the map and `recommended_sampling=True` is set, Forge raises `UnsupportedModelError` rather than silently falling back to backend defaults.

Sources: src/forge/errors.py:1-25

Sampling Policy Behavior

The following table describes the four-quadrant behavior when applying sampling defaults, determined by the `strict` flag and whether the model is in the map.

`strict`	Model in Map	Behavior
`True`	Yes	Return recommended dict
`True`	No	Raise `UnsupportedModelError`
`False`	Yes	One-shot INFO log; return `{}`
`False`	No	Return `{}` (silent)

Sources: src/forge/clients/sampling_defaults.py:60-85
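The four quadrants can be sketched in a few lines. This is an illustration of the policy in the table, not Forge's implementation: `resolve_sampling`, `EXAMPLE_DEFAULTS`, and the local `UnsupportedModelError` class are stand-ins defined here, and the sampling values are invented for the example rather than taken from any model card.

```python
import logging

logger = logging.getLogger("sampling_sketch")


class UnsupportedModelError(Exception):
    """Local stand-in for the error named in the text (not imported from forge)."""

    def __init__(self, model: str) -> None:
        super().__init__(f"no recommended sampling for {model!r}")
        self.model = model


# Hypothetical map; the values are illustrative only.
EXAMPLE_DEFAULTS = {"example-model": {"temperature": 0.7, "top_p": 0.8}}
_already_logged = set()


def resolve_sampling(model: str, strict: bool) -> dict:
    entry = EXAMPLE_DEFAULTS.get(model)
    if strict:
        if entry is None:
            raise UnsupportedModelError(model)  # strict, not in map
        return dict(entry)  # strict, in map: return a copy of the entry
    if entry is not None and model not in _already_logged:
        # not strict, in map: log once per model, then fall through
        _already_logged.add(model)
        logger.info("recommended sampling exists for %s but is not enabled", model)
    return {}  # not strict: always an empty dict
```

Returning an empty dict in the non-strict cases leaves the backend's own defaults in effect, while the one-shot log tells the user that a recommendation exists.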

Per-Call Sampling Overrides

The `send()` and `send_stream()` methods accept a `sampling: dict | None` keyword argument. It is merged field by field with the client's instance-level sampling, which is not mutated, and the caller's explicit non-`None` fields take precedence.

# Merge with instance defaults
response = await client.send(
    messages,
    sampling={"temperature": 0.7, "top_p": 0.9}
)

Sources: CHANGELOG.md:50-70
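The merge semantics described above (field by field, no mutation, explicit non-`None` fields win) can be sketched as follows. `merge_sampling` is a hypothetical helper written for this illustration; it is not the function Forge uses internally.

```python
from typing import Optional


def merge_sampling(instance: dict, override: Optional[dict]) -> dict:
    """Return a new dict in which explicit non-None override fields win."""
    merged = dict(instance)  # copy, so the instance-level dict is untouched
    for key, value in (override or {}).items():
        if value is not None:
            merged[key] = value
    return merged


instance_sampling = {"temperature": 0.2, "top_k": 40}
effective = merge_sampling(
    instance_sampling,
    {"temperature": 0.7, "top_p": 0.9, "top_k": None},
)
print(effective)  # {'temperature': 0.7, 'top_k': 40, 'top_p': 0.9}
```

A `None` value in the override means "no opinion", so the instance-level `top_k` survives while `temperature` is replaced and `top_p` is added.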

Proxy Mode Configuration

When running forge in proxy mode, sampling parameters are plumbed through from the incoming request body. OpenAI-compatible fields supported include temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty, and seed.

Proxy Server Startup

python -m forge.proxy \
    --backend-url http://localhost:11434 \
    --backend ollama \
    --model qwen3:8b-q4_K_M \
    --port 8081

For per-model recommended sampling in proxy mode, the calling client should look up `forge.clients.get_sampling_defaults(model)` and include the returned values in the request body.

Sources: src/forge/proxy/__main__.py:1-40
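A calling client could fold the looked-up values into the request body roughly as sketched below. The `defaults` dict stands in for the result of `forge.clients.get_sampling_defaults(model)` and its values are illustrative; `build_request_body` is a hypothetical helper, and the sketch only builds the JSON payload without sending it.

```python
import json

# Sampling fields the proxy is described as accepting in the request body.
ALLOWED_FIELDS = (
    "temperature", "top_p", "top_k", "min_p",
    "repeat_penalty", "presence_penalty", "seed",
)


def build_request_body(model: str, messages: list, sampling: dict) -> dict:
    """Hypothetical helper: chat request body plus any set sampling fields."""
    body = {"model": model, "messages": messages}
    for field in ALLOWED_FIELDS:
        if sampling.get(field) is not None:
            body[field] = sampling[field]
    return body


# Stand-in for forge.clients.get_sampling_defaults(model); values are made up.
defaults = {"temperature": 0.7, "top_p": 0.8, "top_k": 20}
body = build_request_body(
    "qwen3:8b-q4_K_M",
    [{"role": "user", "content": "Hello"}],
    defaults,
)
payload = json.dumps(body)
```

The resulting `payload` would then be posted to the proxy started above (port 8081 in that command), assuming it exposes an OpenAI-style chat completions route.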

Context and Budget Management

Server Context Resolution

Forge automatically queries the backend's `/props` endpoint to determine the maximum context length. For Ollama, use `ollama stop` to cleanly unload the model from VRAM between model switches.

from forge import ContextManager, TieredCompact

ctx = ContextManager(
    strategy=TieredCompact(keep_recent=2),
    budget_tokens=8192
)

Budget Modes

Mode	Description	Token Source
`MANUAL`	User-specified token budget	`manual_tokens` parameter
`FORGE_FAST`	Fast iteration mode	Server-reported context
`FORGE_BALANCED`	Balanced speed/quality	Server-reported context
`FORGE_THOROUGH`	Maximum quality	Server-reported context

Sources: src/forge/server.py:150-200
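The "Token Source" column can be read as a simple dispatch, sketched below. `BudgetMode` here is a stand-in enum that mirrors the mode names in the table, and `base_token_budget` is a hypothetical helper. The sketch deliberately does not model how the three `FORGE_*` modes differ from one another, since the table only states where their token count comes from.

```python
from enum import Enum
from typing import Optional


class BudgetMode(Enum):
    """Stand-in enum mirroring the mode names in the table above."""
    MANUAL = "manual"
    FORGE_FAST = "forge_fast"
    FORGE_BALANCED = "forge_balanced"
    FORGE_THOROUGH = "forge_thorough"


def base_token_budget(
    mode: BudgetMode,
    server_context: int,
    manual_tokens: Optional[int] = None,
) -> int:
    """Pick the token source: user-supplied for MANUAL, server-reported otherwise."""
    if mode is BudgetMode.MANUAL:
        if manual_tokens is None:
            raise ValueError("MANUAL mode needs manual_tokens")
        return manual_tokens
    return server_context


print(base_token_budget(BudgetMode.MANUAL, 32768, manual_tokens=8192))  # 8192
print(base_token_budget(BudgetMode.FORGE_BALANCED, 32768))              # 32768
```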

Known Issues with Reasoning Models

Models that use extended reasoning (Gemma 4, Qwen 3.5, Ministral Reasoning) may hang when the reasoning budget is unbounded on llama.cpp builds after April 10, 2026. The workaround is to pass `--reasoning-budget 0` when starting the backend.

Sources: CHANGELOG.md:70-90

Client Configuration Reference

OllamaClient

OllamaClient(
    model: str,                          # Model name from `ollama list`
    base_url: str = "http://localhost:11434/v1",
    api_key: str | None = None,
    timeout: float = 120.0,
    recommended_sampling: bool = False,  # Opt-in to per-model defaults
    **kwargs
)

LlamafileClient

LlamafileClient(
    gguf_path: str | Path,               # Path to .llamafile binary
    model: str | None = None,            # Optional model name
    base_url: str = "http://localhost:8080/v1",
    recommended_sampling: bool = False,
    **kwargs
)

Sources: src/forge/clients/llamafile.py:1-50

Complete Workflow Example

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

Sources: README.md:30-80

Error Handling

UnsupportedModelError

Raised when recommended_sampling=True is specified but the model is not in the sampling defaults map.

from forge import OllamaClient
from forge.errors import UnsupportedModelError

try:
    client = OllamaClient(
        model="unknown-model:latest",
        recommended_sampling=True
    )
except UnsupportedModelError as e:
    print(f"Model not supported: {e.model}")
    # Solution: Either add entry to MODEL_SAMPLING_DEFAULTS
    # or drop recommended_sampling=True

Tool Call Errors

Tool-related errors include `ToolCallError` (the LLM failed to produce a valid tool call), `ToolExecutionError` (the tool callable raised an exception), and `ToolResolutionError` (the arguments were valid but the data did not resolve).

Sources: src/forge/errors.py:25-60

Best Practices

Selecting Quantization Levels

Use Case	Recommended Quantization
Development/Testing	Q4_K_M (balanced quality/size)
Production (quality priority)	Q8_0 (near-float quality)
Resource-constrained	Q4_0 (smaller, lower quality)

Guardrail Integration

Guardrails in forge are defined in `src/forge/core/runner.py`, and their nudge templates live in `src/forge/prompts/nudges.py`. Each guardrail can be toggled independently via ablation presets for evaluation.

Sources: CONTRIBUTING.md:1-30

Server Management

When running multiple evaluations, reuse ServerManager instances when the model and configuration match to avoid unnecessary server restarts.

# Excerpt from inside a ServerManager method: the cached configuration is
# compared with the request, and a matching server is reused, not restarted.
if (
    self._current_model == model
    and self._current_mode == mode
    and self._current_ctx == ctx_override
):
    # Reuse existing server
    return

Sources: src/forge/server.py:60-75
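The reuse check can be illustrated with a self-contained stand-in. `ReusableServer` and its `ensure` method are hypothetical names for this sketch and do not reproduce `ServerManager`'s real interface; the counter replaces the actual process start so the behavior can be observed without a backend.

```python
class ReusableServer:
    """Hypothetical stand-in for the configuration-caching behavior."""

    def __init__(self) -> None:
        self._current = None   # (model, mode, ctx_override) of the live server
        self.start_count = 0   # how many (re)starts would have happened

    def ensure(self, model: str, mode: str, ctx_override=None) -> bool:
        """Return True if a (re)start was needed, False if the server was reused."""
        requested = (model, mode, ctx_override)
        if self._current == requested:
            return False  # same configuration: reuse the existing server
        self.start_count += 1  # placeholder for stopping and starting a process
        self._current = requested
        return True


server = ReusableServer()
server.ensure("example-model", "native")        # first call starts a server
server.ensure("example-model", "native")        # identical config is reused
server.ensure("example-model", "native", 4096)  # changed context forces a restart
print(server.start_count)  # 2
```

Grouping evaluation runs by model and configuration makes the reuse path the common case, which is where the time savings come from.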

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Doramagic extracted 15 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body

Severity: medium
Finding: Installation risk is backed by a source signal: Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/58

2. Installation risk: Investigate: integration paths with Hermes Agent

Severity: medium
Finding: Installation risk is backed by a source signal: Investigate: integration paths with Hermes Agent. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/51

3. Installation risk: Per-model recommended sampling defaults (map keyed by HF model cards)

Severity: medium
Finding: Installation risk is backed by a source signal: Per-model recommended sampling defaults (map keyed by HF model cards). Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/59

4. Installation risk: Rescue-parse ChatGPT-style XML tool calls

Severity: medium
Finding: Installation risk is backed by a source signal: Rescue-parse ChatGPT-style XML tool calls. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/55

5. Configuration risk: Proxy external mode hardcodes native FC — no prompt-injection fallback

Severity: medium
Finding: Configuration risk is backed by a source signal: Proxy external mode hardcodes native FC — no prompt-injection fallback. Treat it as a review item until the current version is checked.
User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/53

6. Capability assumption: README/documentation is current enough for a first validation pass.

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: The project should not be treated as fully validated until this signal is reviewed.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: capability.assumptions | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | README/documentation is current enough for a first validation pass.

7. Maintenance risk: Maintainer activity is unknown

Severity: medium
Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: evidence.maintainer_signals | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | last_activity_observed missing

8. Security or permission risk: no_demo

Severity: medium
Finding: no_demo
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: downstream_validation.risk_items | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | no_demo; severity=medium

9. Security or permission risk: no_demo

Severity: medium
Finding: no_demo
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: risks.scoring_risks | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | no_demo; severity=medium

10. Security or permission risk: Hardware detection: AMD unified-memory rigs fall through to 4K Ollama budget

Severity: medium
Finding: Security or permission risk is backed by a source signal: Hardware detection: AMD unified-memory rigs fall through to 4K Ollama budget. Treat it as a review item until the current version is checked.
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/61

11. Security or permission risk: Sub-agent support: dynamic slot splitting

Severity: medium
Finding: Security or permission risk is backed by a source signal: Sub-agent support: dynamic slot splitting. Treat it as a review item until the current version is checked.
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/28

12. Security or permission risk: Sub-agent support: slot pool

Severity: medium
Finding: Security or permission risk is backed by a source signal: Sub-agent support: slot pool. Treat it as a review item until the current version is checked.
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/29

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 10 (the count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using forge with real data or production workflows.

Hardware detection: AMD unified-memory rigs fall through to 4K Ollama budget - github / github_issue
Per-model recommended sampling defaults (map keyed by HF model cards) - github / github_issue
Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body - github / github_issue
llama.cpp reasoning budget sampler causes silent hangs after April 10 bu - github / github_issue
Rescue-parse ChatGPT-style XML tool calls - github / github_issue
Proxy external mode hardcodes native FC — no prompt-injection fallback - github / github_issue
Investigate: integration paths with Hermes Agent - github / github_issue
Sub-agent support: slot pool - github / github_issue
Sub-agent support: dynamic slot splitting - github / github_issue
README/documentation is current enough for a first validation pass. - GitHub / issue

Source: Project Pack community evidence and pitfall evidence