Doramagic Project Pack · Human Manual

forge

Introduction to Forge

Related topics: System Architecture, Quick Start Guide


Forge is a framework-agnostic LLM orchestration library that provides structured tool-calling workflows, guardrail enforcement, and context management for building reliable AI agents. It supports multiple LLM backends (Ollama, Llamafile, llama.cpp) and exposes both high-level runners and granular APIs for embedding into foreign orchestration loops.

Overview

Forge addresses the core challenges of LLM-based agent development:

| Challenge | Forge Solution |
| --- | --- |
| Unreliable tool parsing | Rescue parsing with retry mechanisms |
| Missing required steps | StepEnforcer validates call sequences |
| Context overflow | Tiered context compaction strategies |
| Model-specific sampling | Per-model verified sampling defaults |
| Multi-backend support | Unified client abstraction layer |

Sources: README.md

Core Architecture

Forge follows a layered architecture with clear separation of concerns:

graph TD
    subgraph "User Layer"
        A[User Workflow] --> B[WorkflowRunner]
    end
    
    subgraph "Core Layer"
        B --> C[ContextManager]
        B --> D[LLMClient]
        B --> E[Guardrails]
    end
    
    subgraph "Client Layer"
        D --> F[OllamaClient]
        D --> G[LlamafileClient]
        D --> H[AnthropicClient]
    end
    
    subgraph "Backend Layer"
        F --> I[Ollama Backend]
        G --> J[Llamafile Backend]
        H --> K[Anthropic API]
    end

Sources: CONTRIBUTING.md

Project Structure

src/forge/           # Library source
  clients/           # LLM backend adapters (one per backend)
  core/              # Workflow, runner, messages, steps
  context/           # Context management and compaction
  prompts/           # Prompt templates and nudges
  guardrails/        # Response validation and step enforcement
  proxy/             # OpenAI-compatible proxy server

Sources: CONTRIBUTING.md

Quick Start

The fundamental building blocks of a Forge workflow:

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

Sources: README.md

Core Components

Workflow

The Workflow class defines the structure and constraints of an agent task:

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Workflow identifier |
| description | str | Human-readable description |
| tools | dict[str, ToolDef] | Tool definitions keyed by name |
| required_steps | list[str] | Tools that must precede terminal tool |
| terminal_tool | str | Tool(s) that can end the workflow |

Sources: src/forge/core/workflow.py:1-50

ToolDef

Binds a tool schema to its implementation:

@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)

The prerequisites field supports:

  • String entries: Name-only requirements ("read_file" — any prior call satisfies)
  • Dict entries: Arg-matched requirements ({"tool": "read_file", "match_arg": "path"})

Sources: src/forge/core/workflow.py

LLM Clients

Forge provides backend-specific clients:

| Client | Backend | Features |
| --- | --- | --- |
| OllamaClient | Ollama API | recommended_sampling, async streaming |
| LlamafileClient | Llamafile binary | Context length detection, reasoning extraction |
| AnthropicClient | Anthropic API | Native Claude support |

All clients support send() and send_stream() methods with sampling parameter overrides:

# Instance-level sampling
client = OllamaClient(model="qwen3:8b", recommended_sampling=True)

# Per-call override (merged without mutation)
await client.send(messages, sampling={"temperature": 0.5})

Sources: src/forge/clients/sampling_defaults.py

Per-Model Sampling Defaults

The sampling_defaults module provides verified per-model sampling parameters sourced from HuggingFace model cards:

def get_sampling_defaults(model: str) -> dict[str, float | int]:
    """Pure lookup - returns copy of map value or {} for unknown models."""
    return dict(MODEL_SAMPLING_DEFAULTS.get(model, {}))

def apply_sampling_defaults(model: str, *, strict: bool) -> dict[str, float | int]:
    """Policy layer with four-quadrant behavior."""

| strict | model in map | behavior |
| --- | --- | --- |
| True | yes | return dict |
| True | no | raise UnsupportedModelError |
| False | yes | one-shot INFO log; return {} |
| False | no | return {} (silent) |

Sources: src/forge/clients/sampling_defaults.py

Guardrails System

Forge's guardrails provide multi-layered response validation:

graph LR
    A[LLM Response] --> B[ResponseValidator]
    B --> C{Valid ToolCall?}
    C -->|No| D[Rescue Parsing]
    D --> E{Rescued?}
    E -->|Yes| F[Return ToolCall]
    E -->|No| G[Retry Nudge]
    C -->|Yes| H[StepEnforcer]
    H --> I{Valid Sequence?}
    I -->|Yes| J[ErrorTracker]
    I -->|No| K[Step Blocked Nudge]
    J --> L{Error Limit?}
    L -->|Yes| M[Fatal Error]
    L -->|No| N[Execute]

Sources: src/forge/guardrails/guardrails.py

Guardrails API

class Guardrails:
    def __init__(
        self,
        tool_names: list[str],
        terminal_tool: str | frozenset[str],
        required_steps: list[str] | None = None,
        max_retries: int = 3,
        max_tool_errors: int = 2,
        rescue_enabled: bool = True,
        max_premature_attempts: int = 3,
        retry_nudge: Callable[[str], str] | None = None,
    ) -> None: ...

| Parameter | Default | Description |
| --- | --- | --- |
| max_retries | 3 | Consecutive bad responses before fatal |
| max_tool_errors | 2 | Tool execution failures before exhaustion |
| max_premature_attempts | 3 | Premature terminal attempts before fatal |
| rescue_enabled | True | Parse tool calls from plain text |
| retry_nudge | None | Custom nudge for bare text responses |

Sources: src/forge/guardrails/guardrails.py

CheckResult Actions

@dataclass
class CheckResult:
    action: Literal["execute", "retry", "step_blocked", "fatal"]
    tool_calls: list[ToolCall] | None = None
    nudge: Nudge | None = None  # Set when action is "retry" or "step_blocked"
    reason: str | None = None  # Only when action == "fatal"

Sources: src/forge/guardrails/guardrails.py

Context Management

The ContextManager handles token budget enforcement and message compaction:

| Strategy | Purpose |
| --- | --- |
| TieredCompact | Keeps recent N messages, compacts older ones |
| Custom strategies | Pluggable compaction algorithms |

ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)

Sources: src/forge/server.py
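The "keep recent N, compact older ones" idea can be sketched without the library. The function below is a single-tier illustration only: the message format, the placeholder summary text, and the name `compact_keep_recent` are assumptions, and the actual tiers used by TieredCompact are not described in this manual.

```python
def compact_keep_recent(
    messages: list[dict[str, str]], keep_recent: int
) -> list[dict[str, str]]:
    """Keep system messages and the last keep_recent others; summarize the rest."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return list(messages)  # nothing old enough to compact
    cut = len(rest) - keep_recent
    older, recent = rest[:cut], rest[cut:]
    # A real strategy would summarize or truncate; this sketch uses a placeholder.
    summary = {"role": "user", "content": f"[{len(older)} earlier messages compacted]"}
    return system + [summary] + recent


msgs = [{"role": "system", "content": "You are a helpful assistant."}]
msgs += [{"role": "user", "content": str(i)} for i in range(5)]
for message in compact_keep_recent(msgs, keep_recent=2):
    print(message)
```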

Server Manager

For local GGUF model execution, ServerManager handles backend lifecycle:

server, ctx = await forge.serve(
    backend="llamaserver",
    gguf_path="/path/to/model.gguf",
    mode=BudgetMode.FORGE_FAST,
    client=client,
)
runner = WorkflowRunner(client=client, context_manager=ctx)

| Parameter | Description |
| --- | --- |
| backend | "ollama", "llamaserver", or "llamafile" |
| gguf_path | Path to GGUF file (not for Ollama) |
| mode | Budget mode (FORGE_FAST, FORGE_BALANCED, FORGE_DEEP) |
| n_slots | Concurrent slots for multi-agent |
| kv_unified | Single shared KV cache across slots |

Sources: src/forge/server.py

Error Handling

Forge defines a structured exception hierarchy:

class ForgeError(Exception):  # Base exception
    pass

class UnsupportedModelError(ForgeError):
    """Raised when recommended_sampling=True for an unknown model."""
    
class ToolCallError(ForgeError):
    """LLM failed to produce valid tool call after retries."""
    
class ToolExecutionError(ForgeError):
    """Tool callable raised during execution."""

Sources: src/forge/errors.py

Foreign Loop Integration

Forge can be embedded into existing orchestration systems using the Guardrails middleware API:

from forge.guardrails import Guardrails

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response(response):
    result = guardrails.check(response)
    
    if result.action == "fatal":
        return f"FATAL: {result.reason}"
    
    if result.action in ("retry", "step_blocked"):
        return f"{result.action}: {result.nudge.content[:80]}..."
    
    # Execute tools, then record
    executed = [tc.tool for tc in result.tool_calls]
    done = guardrails.record(executed)
    return f"executed {executed}" + (" -- DONE" if done else "")

Sources: examples/foreign_loop.py

Granular Component Access

For fine-grained control, access components directly:

from forge.guardrails import ErrorTracker, ResponseValidator, StepEnforcer

validator = ResponseValidator(tool_names=["search", "lookup", "answer"], rescue_enabled=True)
enforcer = StepEnforcer(required_steps=["search", "lookup"], terminal_tool="answer")
errors = ErrorTracker(max_retries=3, max_tool_errors=2)

Sources: examples/foreign_loop.py

Code Style and Requirements

| Requirement | Details |
| --- | --- |
| Python | 3.12+ |
| Async | asyncio throughout; all client methods and the runner are async |
| Type Safety | Pydantic for tool parameter schemas |
| Type Syntax | PEP 604 unions written with the vertical bar (X \| Y) |

Sources: CONTRIBUTING.md

Running Tests

# Full suite (865 tests)
python -m pytest tests/unit/ -v --tb=short

# With coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing

# Skip integration tests (require live backend)
python -m pytest tests/ -m "not integration"

Sources: CONTRIBUTING.md

Sources: README.md

Installation Guide

Related topics: Backend Setup Guide, Quick Start Guide


This guide covers all aspects of setting up the forge framework, from basic installation to advanced configuration for different LLM backends.

Overview

Forge is a Python library for building LLM-powered workflows with automatic tool calling, context management, and multi-backend support. The installation process involves:

  1. Python environment setup (Python 3.12+)
  2. Package installation via pip
  3. LLM backend selection and configuration
  4. Optional development environment for contributors

Sources: CONTRIBUTING.md:1-15

Prerequisites

System Requirements

| Component | Requirement |
| --- | --- |
| Python | 3.12+ |
| OS | Linux, macOS, Windows |
| LLM Backend | Ollama, llama-server, or llamafile |
| VRAM | Varies by model (8GB minimum for small models) |

Backend Options

Forge supports three LLM backends:

| Backend | Description | Use Case |
| --- | --- | --- |
| Ollama | Local model management with simple API | Quick setup, model management |
| llama-server | llama.cpp server binary | Production, GGUF files |
| llamafile | Single-file executable models | Distribution, portability |

Each backend requires different installation steps:

  • Ollama: Install the Ollama application, then download models with ollama pull
  • llama-server: Download llama.cpp server binary
  • llamafile: Download pre-built executables or convert GGUF files

Sources: src/forge/server.py:1-50

Installation Methods

Standard Installation

For end users wanting to use forge as a library:

pip install forge

This installs the core package with all necessary dependencies.

Development Installation

For contributors and those wanting to modify the codebase:

git clone https://github.com/antoinezambelli/forge.git
cd forge
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or: .venv\Scripts\activate  # Windows
pip install -e ".[dev]"

Sources: CONTRIBUTING.md:7-14

The .[dev] extras include:

  • Testing dependencies (pytest)
  • Development tools
  • Documentation build tools

Package Configuration

The project uses pyproject.toml for dependency management:

[project]
name = "forge"
version = "0.5.0"
requires-python = ">=3.12"

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "pytest-asyncio",
    "pytest-cov",
]

Environment Setup

Virtual Environment

Using a virtual environment is strongly recommended to avoid dependency conflicts:

python -m venv forge-env
source forge-env/bin/activate

Backend Installation

#### Ollama Setup

  1. Install Ollama from ollama.ai
  2. Pull a model:
ollama pull ministral-3:8b-instruct-2512-q4_K_M
  3. Verify the installation:
ollama list

#### llama-server Setup

  1. Download the llama.cpp server binary for your platform
  2. Place the binary in your models directory or system PATH
  3. Verify with:
./llama-server --help

#### llamafile Setup

  1. Download a pre-built llamafile (e.g., from TheBloke)
  2. Make it executable:
chmod +x model-name.Q4_K_M.llamafile

Sources: src/forge/server.py:180-220

Configuration Flow

graph TD
    A[Install forge] --> B{Use Case}
    B -->|Library| C[pip install forge]
    B -->|Development| D["pip install -e .[dev]"]
    C --> E[Choose Backend]
    D --> E
    E -->|Ollama| F[Install Ollama + Pull Model]
    E -->|llama-server| G[Download llama.cpp binary]
    E -->|llamafile| H[Download llamafile]
    F --> I[Verify with test script]
    G --> I
    H --> I

Quick Start Verification

After installation, verify your setup with this minimal example:

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

Sources: README.md:1-60

Advanced Backend Configuration

KV Cache Quantization

Forge supports KV cache quantization to reduce VRAM usage:

| Setting | VRAM Savings | Quality Impact |
| --- | --- | --- |
| q8_0 | ~50% vs F16 | Minimal |
| q4_0 | ~75% vs F16 | Low |

# In setup_backend call
from forge.server import setup_backend, BudgetMode

server, ctx = await setup_backend(
    backend="llamaserver",
    gguf_path="/path/to/model.gguf",
    budget_mode=BudgetMode.FORGE_FAST,
    cache_type_k="q4_0",  # Key cache quantization
    cache_type_v="q4_0",   # Value cache quantization
)

Sources: src/forge/server.py:20-45
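The approximate savings can be checked with a back-of-the-envelope calculation. The sketch below assumes the GGML block layouts for q8_0 and q4_0 (32 values plus one 16-bit scale per block, so 34 and 18 bytes respectively) and uses hypothetical model dimensions (32 layers, 8 KV heads, head size 128); under those assumptions the savings come out at roughly 47% and 72%, close to the rounded figures in the table.

```python
# Bytes per cached element for each cache type (block size 32 for the quantized types).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(
    n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int, cache_type: str
) -> float:
    """Estimate KV cache size: one K and one V vector per layer, head, and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical model shape, chosen only to make the numbers easy to read.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
baseline = kv_cache_bytes(**shape, cache_type="f16")
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(**shape, cache_type=cache_type)
    print(f"{cache_type}: {size / 2**20:.0f} MiB, saving {1 - size / baseline:.1%}")
```

The estimate ignores model weights and compute buffers, so it describes only the cache portion of VRAM use.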

Multi-Slot Configuration

For multi-agent architectures:

server, ctx = await setup_backend(
    backend="llamaserver",
    gguf_path="/path/to/model.gguf",
    n_slots=4,           # Number of concurrent slots
    kv_unified=True,     # Shared KV cache pool
)

When kv_unified=True, all slots share a single KV cache pool, allowing each slot to use the full context window.

Sources: src/forge/server.py:40-55
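The effect on per-slot context can be shown with simple arithmetic. The helper below is a hypothetical illustration that assumes a non-unified cache is divided evenly between slots, which is how llama.cpp-style servers commonly partition context; it is not a Forge function.

```python
def per_slot_context(total_ctx: int, n_slots: int, kv_unified: bool) -> int:
    """Largest context a single slot can reach under each cache layout."""
    if n_slots < 1:
        raise ValueError("n_slots must be at least 1")
    # Unified: one shared pool, so a single slot may grow to the full window.
    # Split (assumed): the window is divided evenly between the slots.
    return total_ctx if kv_unified else total_ctx // n_slots


print(per_slot_context(32768, 4, kv_unified=False))  # each slot gets a fixed share
print(per_slot_context(32768, 4, kv_unified=True))   # one slot may use the whole pool
```

With a unified pool the slots still compete for the same total capacity, so the full window is available to one slot only while the others are using little of it.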

Forge provides recommended sampling defaults for specific models:

client = OllamaClient(
    model="qwen3:8b-q4_K_M",
    recommended_sampling=True  # Enable recommended defaults
)

The recommended_sampling=True parameter enables tuned temperature, top_p, top_k, and other sampling parameters sourced from HuggingFace model cards.

Sources: src/forge/clients/sampling_defaults.py:1-80

Testing Your Installation

Unit Tests

Run the deterministic unit test suite (no backend required):

python -m pytest tests/unit/ -v --tb=short

Sources: CONTRIBUTING.md:18-25

Integration Tests

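Integration tests require a running backend. When none is available, deselect them with the integration marker; coverage can still be collected from the unit suite: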

# Skip integration tests
python -m pytest tests/ -m "not integration"

# Run with coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing

Troubleshooting

Common Issues

| Issue | Solution |
| --- | --- |
| ModuleNotFoundError: forge | Run pip install forge or check virtual environment |
| Backend connection refused | Verify backend is running on correct port |
| Model not found (Ollama) | Run ollama pull <model-name> |
| VRAM out of memory | Enable KV cache quantization or use smaller model |

Backend Health Check

Verify backend connectivity:

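import asyncio

import httpx

async def check_backend(url: str = "http://localhost:8080/props") -> bool:
    """Return True when the backend answers the probe with HTTP 200."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            resp = await client.get(url)
    except httpx.TransportError:
        # Covers refused connections and timeouts alike
        print("Backend not reachable")
        return False
    if resp.status_code == 200:
        print("Backend is ready")
        return True
    print(f"Backend responded with HTTP {resp.status_code}")
    return False

asyncio.run(check_backend())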

Sources: src/forge/server.py:250-280

Next Steps

After installation, continue with the Quick Start Guide and the Backend Setup Guide.

Source: https://github.com/antoinezambelli/forge / Human Manual

Quick Start Guide

Related topics: WorkflowRunner and Agentic Loop, System Architecture


This guide provides a practical introduction to Forge, a reliability layer for self-hosted LLM tool-calling. Forge is designed to make small local models (for example, an 8B model) more dependable on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction).

Purpose and Scope

The Quick Start Guide covers:

  • Installation and environment setup for local LLM backends
  • Core concepts including Workflows, ToolDefs, and the WorkflowRunner
  • Basic usage patterns for single-step and multi-step agentic workflows
  • Integration options for foreign orchestration loops
  • Backend management with auto-start capabilities

Forge targets developers building agentic applications that require structured tool-calling with local models. It works with llama.cpp-based backends (llama-server, llamafile) and Ollama.

Sources: README.md:1-20

Installation

Prerequisites

| Requirement | Version | Notes |
| --- | --- | --- |
| Python | 3.12+ | Modern syntax required (type unions written with the vertical bar, X \| Y) |
| pip | Latest | For package installation |
| LLM Backend | llama.cpp / Ollama | For inference |

Setup Commands

git clone https://github.com/antoinezambelli/forge.git
cd forge
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or: .venv\Scripts\activate  # Windows
pip install -e ".[dev]"

Sources: CONTRIBUTING.md:1-15

Core Concepts

Architecture Overview

graph TD
    A[User Input] --> B[WorkflowRunner]
    B --> C[LLM Client]
    C --> D[Tool Call Response]
    D --> E[Guardrails Check]
    E -->|execute| F[Tool Execution]
    F --> G[Context Manager]
    G -->|compact| B
    E -->|retry| C
    E -->|fatal| H[Error Handling]

Workflow

The Workflow is the central definition for an agentic task. It binds together:

| Component | Type | Purpose |
| --- | --- | --- |
| name | str | Workflow identifier |
| description | str | Human-readable description |
| tools | dict[str, ToolDef] | Tool name → definition mapping |
| required_steps | list[str] | Tools that must execute before terminal |
| terminal_tool | str | Tool that ends the workflow |
| system_prompt_template | str | System prompt for the LLM |

Sources: src/forge/core/workflow.py:1-50

ToolDef and ToolSpec

ToolSpec defines the schema exposed to the LLM:

class ToolSpec(BaseModel):
    name: str
    description: str
    parameters: type[BaseModel]  # Pydantic model

ToolDef binds the schema to its Python implementation:

@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)

The prerequisites field enables conditional dependencies:

  • str: "if you call this tool, you must have called tool X first"
  • dict: {"tool": "read_file", "match_arg": "path"} — arg-matched prerequisites

Sources: src/forge/core/workflow.py:60-90

WorkflowRunner

The WorkflowRunner manages the full lifecycle:

  • System prompt injection
  • Tool execution and result handling
  • Context compaction
  • Guardrail enforcement
  • Multi-turn conversation state

class WorkflowRunner:
    def __init__(self, client, context_manager):
        ...
    
    async def run(self, workflow, user_message):
        ...

Sources: src/forge/core/runner.py

ContextManager and Budget Modes

Forge provides VRAM-aware context management through budget modes:

| Mode | Behavior |
| --- | --- |
| BudgetMode.MANUAL | User-specified token budget |
| BudgetMode.FORGE_FAST | VRAM-optimized fast inference budget |
| BudgetMode.FORGE_DEEP | Extended context for complex reasoning |

The ContextManager resolves budgets at runtime based on the backend:

async def resolve_budget(self, mode: BudgetMode, manual_tokens: int | None = None) -> int:
    if mode == BudgetMode.MANUAL:
        if self._backend == "ollama":
            return manual_tokens
        return await self.get_server_context()

Sources: src/forge/server.py:80-120

Quick Start Example

Basic Single-Tool Workflow

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

Sources: README.md:20-60

Key Components Explained

| Component | Import | Purpose |
| --- | --- | --- |
| OllamaClient | forge | LLM backend adapter for Ollama |
| TieredCompact | forge | Context compaction strategy |
| ContextManager | forge | Token budget management |
| WorkflowRunner | forge | Orchestrates the agent loop |

Multi-Step Workflows

For workflows requiring sequential tool execution:

# Define a multi-step workflow with required steps
workflow = Workflow(
    name="research_assistant",
    description="Research and answer questions",
    tools={
        "search": ToolDef(spec=search_spec, callable=do_search),
        "lookup": ToolDef(spec=lookup_spec, callable=do_lookup),
        "answer": ToolDef(spec=answer_spec, callable=final_answer),
    },
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

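The required_steps list enforces that search and lookup must execute before answer. Calling the terminal tool prematurely is blocked and returns a step_blocked result with a nudge; after max_premature_attempts such attempts (3 by default) the result becomes fatal.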

Sources: examples/foreign_loop.py:1-50
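The enforcement rule can be modelled in a few lines. The class below is a hypothetical stand-in, not Forge's StepEnforcer: it assumes required steps may complete in any order and that the attempt which reaches max_premature_attempts is the one reported as fatal.

```python
class StepGate:
    """Illustrative model of required-step enforcement (not Forge's StepEnforcer)."""

    def __init__(self, required_steps, terminal_tool, max_premature_attempts=3):
        self.pending = list(required_steps)  # steps not yet executed
        self.terminal_tool = terminal_tool
        self.max_premature_attempts = max_premature_attempts
        self.premature_attempts = 0

    def check(self, tool: str) -> str:
        """Decide whether a requested tool call may run."""
        if tool == self.terminal_tool and self.pending:
            self.premature_attempts += 1
            if self.premature_attempts >= self.max_premature_attempts:
                return "fatal"
            return "step_blocked"
        return "execute"

    def record(self, tool: str) -> bool:
        """Mark a tool as executed; return True once the workflow is complete."""
        if tool in self.pending:
            self.pending.remove(tool)
        return tool == self.terminal_tool and not self.pending


gate = StepGate(["search", "lookup"], terminal_tool="answer")
print(gate.check("answer"))   # blocked: search and lookup are still pending
gate.record("search")
gate.record("lookup")
print(gate.check("answer"))   # allowed: all required steps have run
print(gate.record("answer"))  # workflow complete
```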

Guardrails API

For foreign orchestration loops (non-WorkflowRunner usage), Forge provides standalone guardrails:

Simple API

from forge.guardrails import Guardrails

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response(response):
    result = guardrails.check(response)
    
    if result.action == "fatal":
        return f"FATAL: {result.reason}"
    
    if result.action in ("retry", "step_blocked"):
        return f"{result.action}: {result.nudge.content[:80]}..."
    
    # Execute tools
    executed = [tc.tool for tc in result.tool_calls]
    done = guardrails.record(executed)
    return f"executed {executed}" + (" -- DONE" if done else "")

Granular API

Direct access to individual components:

from forge.guardrails import ErrorTracker, ResponseValidator, StepEnforcer

validator = ResponseValidator(
    tool_names=["search", "lookup", "answer"],
    rescue_enabled=True,
)
enforcer = StepEnforcer(
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
errors = ErrorTracker(max_retries=3, max_tool_errors=2)

| Component | Purpose |
| --- | --- |
| ResponseValidator | Parses tool calls from LLM responses, rescue mode |
| StepEnforcer | Enforces required step sequence |
| ErrorTracker | Tracks retry attempts and tool errors |

Sources: examples/foreign_loop.py:80-150

Respond Tool Pattern

For conversational turns where the model should respond directly:

from forge.guardrails import Guardrails
from forge.tools import RESPOND_TOOL_NAME, respond_spec

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer", RESPOND_TOOL_NAME],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

def handle_response_with_respond(response):
    result = guardrails.check(response)
    
    # Check for respond() call; tool_calls is None on retry, step_blocked, and fatal results
    for tc in result.tool_calls or []:
        if tc.tool == RESPOND_TOOL_NAME:
            message = tc.args.get("message", "")
            return f"MODEL SAYS: {message}"
    
    # Normal tool execution
    ...

Sources: examples/foreign_loop.py:160-200

Backend Auto-Management

Forge can auto-start backends for multi-agent architectures:

from forge.server import run_with_server
from forge.clients import LlamafileClient, BudgetMode

async with run_with_server(
    backend="llamafile",
    gguf_path="/path/to/model.gguf",
    budget_mode=BudgetMode.FORGE_FAST,
) as (server, ctx):
    client = LlamafileClient(model="my-model")
    runner = WorkflowRunner(client=client, context_manager=ctx)
    # Run workflows...

Backend Options

| Backend | Model Source | GGUF Support |
| --- | --- | --- |
| ollama | Model name (e.g., ministral-3:8b) | No |
| llamaserver | GGUF file path | Yes |
| llamafile | GGUF file path | Yes |

Sources: src/forge/server.py:1-80

Forge provides curated sampling defaults for supported models:

from forge.clients import OllamaClient, get_sampling_defaults

# Opt-in to recommended sampling
client = OllamaClient(
    model="ministral-3:8b-q4_K_M",
    recommended_sampling=True  # Raises error for unknown models
)

| Parameter | Source | Verification |
| --- | --- | --- |
| temperature | HF model cards | Per-model verification |
| top_p | HF model cards | Per-model verification |
| top_k | HF model cards | Per-model verification |
| min_p | HF model cards | Per-model verification |
| repeat_penalty | HF model cards | Per-model verification |

Sources: src/forge/clients/sampling_defaults.py:1-50

Testing

Unit Tests (No Backend Required)

# Full suite (865 tests)
python -m pytest tests/unit/ -v --tb=short

# With coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing

# Single file
python -m pytest tests/unit/test_runner.py -v

Integration Tests (Requires Backend)

# No backend available: deselect the integration-marked tests
python -m pytest tests/ -m "not integration"

Sources: CONTRIBUTING.md:15-30

Next Steps

| Topic | Description |
| --- | --- |
| User Guide | Multi-step workflows, long-running sessions |
| Model Guide | Model-specific configurations |
| Architecture Decisions | Design rationale and ADRs |
| Eval Suite | Performance evaluation methodology |

Sources: README.md:1-20

System Architecture

Related topics: WorkflowRunner and Agentic Loop


Overview

Forge is an LLM agent framework that orchestrates multi-step tool-calling workflows with built-in guardrails, context management, and automatic backend server management. The architecture follows a clean separation of concerns: clients abstract LLM backends, workflows define agent behavior, guardrails enforce execution policies, and the context manager handles token budgeting. Sources: CONTRIBUTING.md

The framework is designed for determinism in unit tests and supports three backend types: Ollama, llama-server, and llamafile. Sources: src/forge/server.py

Project Layout

src/forge/           # Library source
  clients/           # LLM backend adapters (one per backend)
  core/              # Workflow, runner, messages, steps
  context/           # Context management and compaction
  prompts/           # Prompt templates and nudges
tests/
  unit/              # Deterministic tests
  eval/              # Eval harness (requires live backends)
    scenarios/       # Eval scenario definitions
    dashboard/       # React-based HTML dashboard (separate npm build)
docs/                # User-facing documentation
  decisions/         # Architecture Decision Records (ADRs)
  results/           # Eval results and raw data tables

Sources: CONTRIBUTING.md

Core Architecture Components

Architecture Diagram

graph TD
    User["User / Application"]
    Runner["WorkflowRunner"]
    Workflow["Workflow"]
    Guardrails["Guardrails"]
    ContextMgr["ContextManager"]
    Client["LLM Client"]
    ServerMgr["ServerManager"]
    
    User --> Runner
    Runner --> Workflow
    Runner --> Guardrails
    Runner --> ContextMgr
    Runner --> Client
    Client --> ServerMgr
    
    subgraph "forge Library"
        Runner
        Workflow
        Guardrails
        ContextMgr
        Client
    end
    
    subgraph "Backend"
        ServerMgr
    end

LLM Clients

The client layer abstracts different LLM backends behind a common async interface. Each client handles backend-specific protocol differences.

| Client | Backend | Protocol |
| --- | --- | --- |
| OllamaClient | Ollama | OpenAI-compatible REST |
| LlamafileClient | Llamafile | OpenAI-compatible REST |
| AnthropicClient | Anthropic API | Anthropic native |
| OpenAIClient | OpenAI API | OpenAI native |

Sources: CONTRIBUTING.md

#### Sampling Defaults

Each client can optionally apply recommended sampling parameters sourced from HuggingFace model cards. The policy layer provides four-quadrant behavior:

| strict | Model in map | Behavior |
| --- | --- | --- |
| True | Yes | Return dict |
| True | No | Raise UnsupportedModelError |
| False | Yes | One-shot INFO log; return {} |
| False | No | Return {} (silent) |

Sources: src/forge/clients/sampling_defaults.py

Clients no longer ship hardcoded temperature defaults. With recommended_sampling=False (the default), forge sends no sampling parameters and the backend's own defaults apply. Sources: CHANGELOG.md

Workflow System

The Workflow class defines an agent's behavior declaratively:

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(...),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools.",
)

Sources: README.md

#### Tool Definition

ToolDef binds a tool schema to its implementation:

@dataclass
class ToolDef:
    """Binds a tool schema to its implementation."""
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)

Prerequisites express conditional dependencies:

  • str: Name-only ("read_file" — any prior call satisfies it)
  • dict: Arg-matched ({"tool": "read_file", "match_arg": "path"})

Sources: src/forge/core/workflow.py

Guardrails System

Guardrails enforce execution policies through three coordinated components:

graph LR
    Response["LLM Response"] --> Validator["ResponseValidator"]
    Validator --> Enforcer["StepEnforcer"]
    Enforcer --> Tracker["ErrorTracker"]
    
    Validator --> "rescue parsing"
    Enforcer --> "required steps"
    Tracker --> "retry limits"

#### Guardrails Configuration

| Parameter | Purpose | Default |
| --- | --- | --- |
| tool_names | List of available tools | Required |
| terminal_tool | Final allowed tool | Required |
| required_steps | Ordered prerequisite chain | None |
| max_retries | Total retry attempts | 3 |
| max_tool_errors | Consecutive tool failures | 2 |
| rescue_enabled | Enable XML rescue parsing | True |
| max_premature_attempts | Premature terminal attempts before fatal | 3 |

Sources: src/forge/guardrails/guardrails.py

#### Check Result Actions

The check() method returns a CheckResult with these actions:

| Action | Meaning |
| --- | --- |
| proceed | Response passes all guardrails |
| retry | Invalid response, apply nudge and retry |
| step_blocked | Missing required step |
| fatal | Max retries exceeded |

Context Manager

The ContextManager handles token budgeting and context compaction to prevent context overflow during long conversations. Sources: CONTRIBUTING.md

#### Budget Resolution

Budget is resolved based on the mode:

| Mode | Resolution Strategy |
| --- | --- |
| MANUAL | Use manual_tokens parameter or query server |
| FORGE_FAST | Server-reported context / 4 |
| FORGE_BALANCED | Server-reported context / 2 |
| FORGE_DEEP | Server-reported context * 3 / 4 |

For Ollama backends, the context length is obtained from ollama show. For llama-server/llamafile, a /props query retrieves the actual n_ctx. Sources: src/forge/server.py

Server Manager

ServerManager handles lifecycle management of backend servers (llama-server and llamafile only; Ollama is managed externally).

server = ServerManager(backend="llamaserver", port=8080)
context, ctx_mgr = await server.start_with_budget(
    model="qwen3:8b-q4_K_M",
    budget_mode=BudgetMode.FORGE_FAST,
    client=client,
)

#### Server Configuration Parameters

| Parameter | Description | Backend |
| --- | --- | --- |
| model | Model identity for server | All |
| gguf_path | Path to GGUF file | llamaserver/llamafile |
| mode | Operation mode | All |
| extra_flags | Additional CLI flags | llamaserver/llamafile |
| ctx_override | Override context length (-c value) | llamaserver/llamafile |
| cache_type_k | KV cache quantization type for keys | llamaserver |
| cache_type_v | KV cache quantization type for values | llamaserver |
| n_slots | Concurrent slots count | llamaserver |
| kv_unified | Single unified KV cache | llamaserver |

The server reuses an existing process if the same configuration is requested, avoiding unnecessary restarts. Sources: src/forge/server.py

Execution Flow

sequenceDiagram
    participant User
    participant Runner
    participant Workflow
    participant Guardrails
    participant Context
    participant Client
    
    User->>Runner: run(workflow, input)
    Runner->>Context: begin_session()
    Runner->>Workflow: Get system prompt
    Runner->>Client: send(messages)
    
    loop Until terminal or max iterations
        Client-->>Runner: LLMResponse
        Runner->>Guardrails: check(response)
        Guardrails-->>Runner: CheckResult
        
        alt proceed
            Runner->>Runner: Execute tools
            Runner->>Context: append(messages)
            Runner->>Client: send(messages)
        else retry
            Runner->>Runner: Apply nudge, retry
        else fatal
            Runner->>User: Return error
        end
    end
    
    Runner-->>User: Final result

Workflow Runner Integration

The WorkflowRunner orchestrates all components:

async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

Sources: README.md

Design Principles

Async-First

All client methods and the runner are async, enabling efficient I/O handling across multiple concurrent requests. Sources: CONTRIBUTING.md

Type Safety

Pydantic is used for tool parameter schemas and response validation, ensuring runtime type safety for tool arguments. Sources: CONTRIBUTING.md

Modern Python

The codebase targets Python 3.12+ and uses modern syntax including:

  • Type unions with | (e.g., str | None)
  • dataclass decorators
  • field(default_factory=list) patterns

Sources: CONTRIBUTING.md

Guardrails Configuration Examples

# Basic guardrails
guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

# With respond tool for middleware
respond_guardrails = Guardrails(
    tool_names=["search", "lookup", "answer", RESPOND_TOOL_NAME],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)

Sources: examples/foreign_loop.py

Sources: CONTRIBUTING.md

Module Structure and API

Related topics: System Architecture, WorkflowRunner and Agentic Loop

Overview

The forge repository implements a modular LLM orchestration framework designed to handle multi-step tool-calling workflows with built-in guardrails, context management, and support for multiple LLM backends. The architecture follows a clean separation of concerns with distinct modules for client handling, workflow orchestration, guardrails enforcement, and server management.

Core Architecture

The forge library is organized into a layered architecture that separates backend communication, workflow definition, execution enforcement, and context management.

graph TD
    A[User Code] --> B[WorkflowRunner]
    B --> C[LLM Clients]
    C --> D[llama.cpp / Ollama / Llamafile]
    B --> E[Guardrails]
    B --> F[ContextManager]
    E --> G[ResponseValidator]
    E --> H[StepEnforcer]
    E --> I[ErrorTracker]

Module Hierarchy

| Module | Purpose | Key Classes |
| --- | --- | --- |
| forge.core | Core workflow orchestration | Workflow, WorkflowRunner, ToolSpec, ToolDef, ToolCall |
| forge.clients | LLM backend adapters | OllamaClient, LlamafileClient, LlamaServerClient |
| forge.guardrails | Response validation and enforcement | Guardrails, ResponseValidator, StepEnforcer, ErrorTracker |
| forge.context | Token budget and context management | ContextManager, TieredCompact |
| forge.server | Backend server lifecycle management | ServerManager, BudgetMode |

Sources: src/forge/__init__.py

Tool Definition API

ToolSpec

The ToolSpec class defines the interface for tools that the LLM can invoke. It wraps a Pydantic model representing the tool's parameters.

class ToolSpec(BaseModel):
    name: str
    description: str
    parameters: type[BaseModel]

Construction from OpenAI Schema:

Tools can be defined from an OpenAI-style JSON Schema:

tool_spec = ToolSpec.from_openai_schema(
    name="get_weather",
    description="Get current weather for a city",
    schema={
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
    }
)

Sources: src/forge/core/workflow.py:1-50

ToolDef

The ToolDef dataclass binds a tool schema to its implementation callable, along with prerequisites:

@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)

Prerequisites Syntax:

Prerequisites express conditional dependencies between tool calls:

| Type | Example | Behavior |
| --- | --- | --- |
| String (name-only) | "read_file" | Any prior call to read_file satisfies it |
| Dict (arg-matched) | {"tool": "read_file", "match_arg": "path"} | Prior call with same path value required |

tool_def = ToolDef(
    spec=tool_spec,
    callable=get_weather_function,
    prerequisites=[{"tool": "search", "match_arg": "query"}]
)

Sources: src/forge/core/workflow.py:52-72

Workflow Definition API

Workflow Class

The Workflow class is the central configuration object for a multi-step LLM task:

workflow = Workflow(
    name="weather",
    description="Look up weather for a city",
    tools={
        "get_weather": ToolDef(spec=..., callable=get_weather)
    },
    required_steps=[],            # single-tool workflow: nothing must run first
    terminal_tool="get_weather",  # the only tool, so calling it ends the workflow
    system_prompt_template="You are a helpful assistant."
)

Key Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | Yes | Workflow identifier |
| description | str | Yes | Human-readable description for the LLM |
| tools | dict[str, ToolDef] | Yes | Map of tool name to ToolDef |
| required_steps | list[str] | No | Tools that must be called before terminal_tool |
| terminal_tool | str | Yes | Tool(s) that can end the workflow |
| system_prompt_template | str | No | System prompt injected into context |

Sources: src/forge/core/workflow.py

LLM Client API

Client Architecture

Forge provides backend-agnostic client adapters that implement a common interface:

graph LR
    A[WorkflowRunner] --> B[Client Interface]
    B --> C[OllamaClient]
    B --> D[LlamafileClient]
    B --> E[LlamaServerClient]

OllamaClient

client = OllamaClient(
    model="ministral-3:8b-instruct-2512-q4_K_M",
    recommended_sampling=True
)

Sampling Defaults

Per-model recommended sampling parameters are managed through sampling_defaults.py:

def apply_sampling_defaults(
    model: str,
    *,
    strict: bool,
) -> dict[str, float | int]:
    """Apply the recommended-sampling policy for model."""

Sampling Policy Quadrant:

| strict | Model in Map | Behavior |
| --- | --- | --- |
| True | Yes | Return dict copy |
| True | No | Raise UnsupportedModelError |
| False | Yes | One-shot INFO log; return {} |
| False | No | Return {} (silent) |

Sources: src/forge/clients/sampling_defaults.py:1-80

ToolCall Response Model

The ToolCall class represents a validated tool invocation returned by an LLM client:

class ToolCall(BaseModel):
    tool: str

Additional fields may be populated by client implementations (e.g., args, reasoning).

Sources: src/forge/core/workflow.py:74-77

Guardrails API

The guardrails system provides middleware for orchestrating LLM responses with built-in validation, step enforcement, and error handling.

Guardrails Class

The main entry point for the guardrails system:

guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
    max_retries=3,
    max_tool_errors=2,
    rescue_enabled=True,
    max_premature_attempts=3
)

Constructor Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tool_names | list[str] | Required | Valid tool names for this workflow |
| required_steps | list[str] | None | Tools that must be called before terminal_tool |
| terminal_tool | `str \| frozenset` | Required | Tool(s) that can end the workflow |
| max_retries | int | 3 | Consecutive bad responses before fatal |
| max_tool_errors | int | 2 | Consecutive tool failures before exhaustion |
| rescue_enabled | bool | True | Attempt to parse tool calls from plain text |
| max_premature_attempts | int | 3 | Premature terminal attempts before fatal |
| retry_nudge | Callable[[str], str] | None | Custom nudge for bare text responses |

Sources: src/forge/guardrails/guardrails.py:1-80

CheckResult

The return type of Guardrails.check():

class CheckResult:
    action: Literal["execute", "retry", "step_blocked", "fatal"]
    tool_calls: list[ToolCall] | None
    nudge: Nudge | None
    reason: str | None

Action Meanings:

| Action | Description |
| --- | --- |
| execute | Safe to proceed; tool_calls contains valid calls |
| retry | Invalid response; inject nudge and retry |
| step_blocked | Attempted terminal tool before required steps |
| fatal | Max retries exhausted; reason contains explanation |
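
To make the step_blocked and fatal outcomes concrete, here is a toy enforcer that tracks completed steps and premature terminal attempts. It is a simplified sketch, not forge's StepEnforcer: the class name, the return of plain strings, the choice to ignore step ordering, and the rule that the attempt reaching `max_premature_attempts` is the fatal one are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class StepEnforcerSketch:
    required_steps: list[str]
    terminal_tool: str
    max_premature_attempts: int = 3
    completed: set[str] = field(default_factory=set)
    premature_attempts: int = 0

    def check(self, tool: str) -> str:
        """Classify a proposed tool call as execute, step_blocked, or fatal."""
        missing = [step for step in self.required_steps if step not in self.completed]
        if tool == self.terminal_tool and missing:
            self.premature_attempts += 1
            if self.premature_attempts >= self.max_premature_attempts:
                return "fatal"
            return "step_blocked"
        return "execute"

    def record(self, tool: str) -> bool:
        """Mark a tool as executed; return True when it was the terminal tool."""
        self.completed.add(tool)
        return tool == self.terminal_tool


enforcer = StepEnforcerSketch(["search", "lookup"], "answer")
print(enforcer.check("answer"))  # step_blocked
enforcer.record("search")
enforcer.record("lookup")
print(enforcer.check("answer"))  # execute
```

Separating `check` from `record` mirrors the two-method shape of the middleware: the caller decides when a tool has actually run, so a failed execution never counts as a completed step.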

Two-Method Guardrails API

# After each LLM response
result = guardrails.check(response)

if result.action == "fatal":
    return f"FATAL: {result.reason}"

if result.action in ("retry", "step_blocked"):
    return f"{result.action}: {result.nudge.content}"

# result.action == "execute"
# Run tools yourself, then record results
tool_calls = result.tool_calls
executed = [tc.tool for tc in tool_calls]
done = guardrails.record(executed)

Sources: src/forge/guardrails/guardrails.py:82-130

Granular API

For advanced use cases, individual guardrail components can be used directly:

from forge.guardrails import ResponseValidator, StepEnforcer, ErrorTracker

validator = ResponseValidator(
    tool_names=["search", "lookup", "answer"],
    rescue_enabled=True,
)
enforcer = StepEnforcer(
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
errors = ErrorTracker(max_retries=3, max_tool_errors=2)

Server Management API

ServerManager

The ServerManager class handles lifecycle management for llama.cpp-based backends:

server = ServerManager(backend="llamaserver", port=8080)

Backend Options:

| Backend | Description |
| --- | --- |
| "ollama" | Ollama server (model name, no GGUF path) |
| "llamaserver" | llama.cpp server via llama-server |
| "llamafile" | Mozilla llamafile binary |

Budget Resolution

The resolve_budget() method determines context length based on mode:

async def resolve_budget(
    self,
    mode: BudgetMode,
    manual_tokens: int | None = None,
) -> int:
    ...

| Mode | Behavior |
| --- | --- |
| MANUAL | Use manual_tokens directly |
| FORGE_FAST / FORGE_DEEP | Query server /props for context |

Sources: src/forge/server.py:1-100

Context Management

ContextManager

Token budget management for long-running conversations:

ctx = ContextManager(
    strategy=TieredCompact(keep_recent=2),
    budget_tokens=8192
)
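
The idea behind a `keep_recent` setting can be shown with a deliberately simplified compaction pass. The source does not describe the tiers inside TieredCompact, so the sketch below implements only a single tier (drop the oldest non-system messages) and should not be read as forge's strategy. The function names and the four-characters-per-token estimate are assumptions.

```python
Message = dict[str, str]


def estimate_tokens(messages: list[Message]) -> int:
    """Rough size estimate: about four characters per token (an assumption)."""
    return sum(len(m["content"]) for m in messages) // 4


def compact(messages: list[Message], budget_tokens: int, keep_recent: int = 2) -> list[Message]:
    """Drop the oldest non-system messages until the estimate fits the budget."""
    if estimate_tokens(messages) <= budget_tokens:
        return list(messages)
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Never drop the last `keep_recent` messages, even if still over budget.
    while len(rest) > keep_recent and estimate_tokens(system + rest) > budget_tokens:
        rest.pop(0)
    return system + rest


history = [{"role": "system", "content": "sys"}] + [
    {"role": "user", "content": "x" * 400} for _ in range(5)
]
print(len(compact(history, budget_tokens=350)))  # 4
print(len(compact(history, budget_tokens=250)))  # 3
```

In the second call the result is still over budget, because the two most recent messages are protected. A tiered strategy would presumably apply a further, more aggressive step at that point.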

BudgetMode Enum

class BudgetMode(Enum):
    MANUAL = "manual"
    FORGE_FAST = "forge_fast"
    FORGE_DEEP = "forge_deep"

Sources: src/forge/server.py

Error Types

Forge defines custom exceptions for specific error conditions:

class UnsupportedModelError(Exception):
    """Raised when strict sampling defaults are requested for unknown models."""
    pass

Additional error types in errors.py:

| Error | Use Case |
| --- | --- |
| BudgetResolutionError | Server unreachable or missing n_ctx |
| BackendError | Backend communication failures |

Sources: src/forge/errors.py

Quick Start Example

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

Sources: README.md

Summary

The forge module structure provides:

  1. Tool Definition: ToolSpec and ToolDef for declaring LLM-callable functions with prerequisites
  2. Workflow Orchestration: Workflow and WorkflowRunner for managing multi-step tasks
  3. Client Abstraction: Backend-agnostic clients with sampling defaults
  4. Guardrails Middleware: Built-in validation, step enforcement, and error handling
  5. Server Management: Lifecycle control for llama.cpp backends
  6. Context Management: Token budget and compaction strategies

Sources: src/forge/__init__.py

Architecture Decision Records

Related topics: System Architecture

Overview

Architecture Decision Records (ADRs) serve as the authoritative documentation for significant design choices within the forge project. They capture the *why* behind implementation decisions, enabling current and future contributors to understand the reasoning without reconstructing the original context.

The forge project stores ADRs in docs/decisions/, using a numbered naming convention (e.g., 001-ablation-framework.md, 011-guardrail-middleware.md, 013-text-response-intent.md). This numbering scheme allows for easy chronological tracking and establishes precedent relationships between decisions.

Purpose and Scope

ADRs in forge address several critical aspects:

| Category | Description | Example Documents |
| --- | --- | --- |
| Framework Design | Ablation study methodology and tooling | 001-ablation-framework.md |
| Middleware Patterns | Guardrail implementation and composition | 011-guardrail-middleware.md |
| Response Handling | Intent classification for text responses | 013-text-response-intent.md |
| Backend Integration | LLM client configuration and sampling defaults | sampling_defaults.py |
| Server Management | Context resolution and budget modes | server.py |

Each ADR documents not only the chosen approach but also considered alternatives and the tradeoffs that influenced the final decision. This creates a historical record that prevents repeated debates over settled questions while enabling informed reconsideration when circumstances change.

ADR Contribution Workflow

According to the contribution guidelines, the process for introducing a new architecture decision follows a structured review pattern:

graph TD
    A[Identify Design Decision] --> B[Review Existing ADRs]
    B --> C{Decision Already Documented?}
    C -->|Yes| D[Reference Existing ADR]
    C -->|No| E[Draft New ADR]
    E --> F[Propose ADR Format]
    F --> G[Review Against Project Standards]
    G --> H[Merge and Publish]
    
    style A fill:#e1f5fe
    style H fill:#c8e6c9

The contribution workflow integrates with the broader project development cycle:

  1. Proposal Phase: Before implementing significant changes, contributors should draft an ADR following the established format
  2. Review Phase: The ADR undergoes peer review alongside code review
  3. Adoption Phase: Once approved, the ADR becomes the reference for implementation decisions
  4. Maintenance Phase: ADRs may be updated if subsequent decisions supersede them

Sources: CONTRIBUTING.md:1-50

Core Architecture Components

Workflow Engine

The Workflow and WorkflowRunner classes form the central orchestration layer. A workflow defines the available tools, required execution steps, and terminal conditions.

graph TD
    subgraph Workflow Definition
        W[Workflow] --> TD[Tool Definitions]
        W --> RS[Required Steps]
        W --> TT[Terminal Tool]
        W --> SP[System Prompt Template]
    end
    
    subgraph Execution Layer
        WR[WorkflowRunner] --> CM[Context Manager]
        WR --> GR[Guardrails]
        WR --> CL[LLM Client]
    end
    
    subgraph Tool Layer
        TC[ToolCall] --> T[Tool Execution]
        T --> TR[Tool Response]
    end
    
    WR --> TC
    TC -->|Result| CM
    CM -->|Context| WR
    
    style W fill:#fff3e0
    style WR fill:#e3f2fd

The ToolDef dataclass binds tool schemas to implementations, while ToolSpec defines the JSON Schema for parameter validation. Tool calls are represented as ToolCall objects containing the tool name and arguments.

Sources: src/forge/core/workflow.py:1-100

Guardrail Middleware

The guardrail system provides a composable validation layer that intercepts LLM responses before tool execution:

graph LR
    LLM[LLM Response] --> GR[Guardrails.check]
    GR --> RV[ResponseValidator]
    GR --> SE[StepEnforcer]
    GR --> ET[ErrorTracker]
    
    RV -->|Valid| TC[ToolCalls]
    RV -->|Invalid| NR[Retry Nudge]
    SE -->|Correct Order| TC
    SE -->|Wrong Order| SB[Step Blocked]
    ET -->|OK| TC
    ET -->|Max Errors| FT[Fatal]
    
    style GR fill:#fce4ec
    style TC fill:#c8e6c9

The Guardrails class orchestrates three sub-components:

| Component | Responsibility | Key Parameters |
| --- | --- | --- |
| ResponseValidator | Parses tool calls, enables rescue parsing | rescue_enabled, retry_nudge_fn |
| StepEnforcer | Ensures required steps precede terminal tool | required_steps, max_premature_attempts |
| ErrorTracker | Tracks consecutive errors and retries | max_retries, max_tool_errors |

Sources: src/forge/guardrails/guardrails.py:1-100

Server Management and Budget Resolution

The ServerManager handles lifecycle management for llama.cpp-based backends, while the ContextManager implements token budget strategies:

graph TD
    SM[ServerManager] --> BM[BudgetMode]
    BM -->|FORGE_FAST| FT[Fast Budget]
    BM -->|FORGE_BALANCED| BT[Balanced Budget]
    BM -->|FORGE_DEEP| DT[Deep Budget]
    BM -->|MANUAL| MT[Manual Tokens]
    
    CM[ContextManager] --> TC[TieredCompact]
    CM --> SC[SimpleCompact]
    
    SM -->|Context Query| Props[/props endpoint]
    Props -->|n_ctx| CM

Budget resolution follows platform-specific paths:

  • Ollama: Uses manual_tokens parameter for MANUAL mode
  • Llamafile/Llama Server: Queries /props endpoint for server-configured context length

Sources: src/forge/server.py:1-100

Sampling Configuration System

The sampling defaults system separates lookup from policy, enabling fine-grained control over model parameters:

graph TD
    subgraph Lookup Layer
        GM[get_model_defaults] --> MAP[MODEL_SAMPLING_DEFAULTS]
    end
    
    subgraph Policy Layer
        AS[apply_sampling_defaults] --> |strict=True| KR[Known + Known]
        AS --> |strict=False| KU[Known + Unknown]
        KR -->|In Map| ReturnDict[Return Dict]
        KU -->|Not In Map| InfoLog[INFO Log Once]
    end
    
    subgraph Client Integration
        OC[OllamaClient] --> AS
        LC[LlamafileClient] --> AS
        AC[AnthropicClient] --> AS
    end

The two-function design (get_sampling_defaults for pure lookup, apply_sampling_defaults for policy) ensures that:

  • Unknown models don't cause errors when strict=False
  • Known models log a one-time INFO message when not opted in
  • Explicit opt-in via recommended_sampling=True enables strict behavior

Sources: src/forge/clients/sampling_defaults.py:1-100

Proxy Server Architecture

The ProxyServer provides a forwarding layer with additional control features:

graph TD
    subgraph Proxy Layer
        PS[ProxyServer] --> SF[Serialize Flag]
        PS --> RT[Retry Logic]
        PS --> RC[Rescue Parser]
    end
    
    subgraph Backend Routing
        PS --> Ollama[Ollama Backend]
        PS --> Llama[Llama Backend]
        PS --> LLF[Llamafile Backend]
    end
    
    subgraph Configuration
        SF --> |serialize=True| Serial[Serialize Requests]
        SF --> |serialize=False| Parallel[Parallel Requests]
        RT --> |max_retries=N| RetryN[N Attempts]
    end

Key proxy options include:

| Flag | Default | Purpose |
| --- | --- | --- |
| --host | 127.0.0.1 | Proxy listen address |
| --port | 8081 | Proxy listen port |
| --serialize | None | Request serialization control |
| --max-retries | 3 | Retries per request |
| --no-rescue | False | Disable rescue parsing |

Sources: src/forge/proxy/__main__.py:1-80

ADR Format and Standards

Each ADR in the forge repository follows a consistent structure:

  1. Title: Descriptive name with ADR number
  2. Status: Proposed, Accepted, Deprecated, or Superseded
  3. Context: Background and problem statement
  4. Decision: The chosen approach with rationale
  5. Consequences: Benefits, drawbacks, and tradeoffs
  6. Related Decisions: Links to dependent or related ADRs

This format ensures that future maintainers can quickly assess whether an ADR is current and understand the full context of each decision.

Versioning and Evolution

The CHANGELOG maintains a parallel record of implementation milestones, cross-referenced with ADRs. Major architectural changes increment the minor version number, while bug fixes increment the patch version (semantic versioning).

Changes that require ADR updates include:

  • New LLM backend support
  • Guardrail algorithm modifications
  • Context management strategy changes
  • Tool execution model alterations
  • Breaking API changes

Sources: CHANGELOG.md:1-100

Best Practices for ADR Readers

When reviewing ADRs to understand forge's architecture:

  1. Start with the index: The docs/decisions/ directory lists all ADRs chronologically
  2. Check status: Deprecated ADRs indicate historical context, not current practice
  3. Cross-reference implementations: Source files in src/forge/ implement ADR decisions
  4. Review CHANGELOG: Implementation dates and version numbers provide temporal context
  5. Examine tests: Unit tests in tests/unit/ validate ADR-enforced behaviors

Sources: CONTRIBUTING.md:1-50

WorkflowRunner and Agentic Loop

Related topics: System Architecture

Overview

The WorkflowRunner is the central execution engine in the Forge framework, implementing an agentic loop that orchestrates multi-step tool-calling workflows with an LLM backend. It manages the complete lifecycle of a workflow: from initializing messages, through iterative LLM inference and tool execution, to context management and termination.

Core responsibilities:

  1. Building initial message lists (system prompt + user input)
  2. Coordinating LLM inference with streaming or batch responses
  3. Validating and executing tool calls returned by the LLM
  4. Managing context budget through the ContextManager
  5. Enforcing required step sequences via the StepEnforcer
  6. Handling retries for malformed responses
  7. Terminating on terminal tool execution or max iterations

Sources: src/forge/core/runner.py:1-50
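
The responsibilities listed above can be sketched as a compact loop. Everything in this example is a stand-in written for illustration: `ScriptedClient` replays canned responses instead of calling a model, `run_loop` is not the WorkflowRunner, and the message and response shapes are assumptions. It shows only the control flow: build messages, ask the client, nudge on a malformed response, execute a tool, and stop on the terminal tool or the iteration limit.

```python
import asyncio
from typing import Any, Callable


class ScriptedClient:
    """Stand-in for an LLM client: replays canned responses in order."""

    def __init__(self, script: list[dict[str, Any]]) -> None:
        self._script = iter(script)

    async def send(self, messages: list[dict[str, str]]) -> dict[str, Any]:
        return next(self._script)


async def run_loop(
    client: ScriptedClient,
    tools: dict[str, Callable[..., Any]],
    terminal_tool: str,
    user_input: str,
    max_iterations: int = 5,
) -> str:
    messages = [
        {"role": "system", "content": "Use the available tools."},
        {"role": "user", "content": user_input},
    ]
    for _ in range(max_iterations):
        response = await client.send(messages)
        name = response.get("tool")
        if name not in tools:
            # Bare text or unknown tool: add a nudge and ask again.
            messages.append({"role": "user", "content": "Please call one of the available tools."})
            continue
        result = str(tools[name](**response.get("args", {})))
        messages.append({"role": "tool", "content": result})
        if name == terminal_tool:
            return result
    raise RuntimeError("max iterations reached without a terminal tool call")


def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"


script = [
    {"text": "It is probably sunny."},  # malformed: no tool call
    {"tool": "get_weather", "args": {"city": "Paris"}},
]
result = asyncio.run(
    run_loop(ScriptedClient(script), {"get_weather": get_weather}, "get_weather", "What's the weather in Paris?")
)
print(result)
```

The first scripted response triggers the retry path and the second ends the run, so one pass exercises both the nudge and the terminal-tool branches.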


Guardrails Middleware for External Loops

Forge's Guardrails Middleware provides a composable reliability layer that can be integrated into any external orchestration loop. Rather than requiring adoption of the full WorkflowRunner, external projects can embed forge's retry nudges, rescue parsing, step enforcement, and error tracking directly within their own agent execution frameworks.

Source: https://github.com/antoinezambelli/forge / Human Manual

Proxy Server Setup

Related topics: Backend Clients

Overview

The forge proxy server is an OpenAI-compatible HTTP proxy that transparently applies forge's guardrail stack to any backend that speaks the Chat Completions API. It acts as a drop-in replacement for local model servers, enabling existing OpenAI-compatible clients to benefit from forge's reliability layer without code changes.

Sources: README.md

Architecture

The proxy operates in two distinct modes, determined at startup:

graph TD
    subgraph "Managed Mode"
        P["ProxyServer<br/>:8081"] --> SM["ServerManager"]
        SM --> BE["llama-server<br/>llamafile<br/>ollama<br/>:8080"]
    end
    
    subgraph "External Mode"
        P2["ProxyServer<br/>:8081"] --> BU["User-managed<br/>Backend<br/>:8080"]
    end
    
    C["OpenAI Client"] --> P
    C2["OpenAI Client"] --> P2

Managed Mode

In managed mode, forge starts and controls the backend process lifecycle. The ServerManager class handles:

  • Backend binary discovery and execution
  • Model loading and initialization
  • Health verification via /props endpoint polling
  • Graceful shutdown and restart

Sources: src/forge/proxy/proxy.py:1-40

External Mode

In external mode, the proxy connects to a user-managed backend. This is useful when:

  • The backend runs on a different machine or container
  • Custom backend configurations are required
  • The backend is managed by an external orchestration system

Sources: src/forge/proxy/proxy.py:35-45

Supported Backends

| Backend | Description | Requirements |
| --- | --- | --- |
| llamaserver | Llama.cpp's HTTP server | Local GGUF model file |
| llamafile | Mozilla's single-file model executable | Single-file executable |
| ollama | Ollama local inference server | Ollama runtime + model pulled |

Sources: src/forge/proxy/__main__.py:25-29

CLI Usage

Basic Invocation

# External mode — you manage the backend
python -m forge.proxy --backend-url http://localhost:8080 --port 8081

# Managed mode — forge starts llama-server and proxy together
python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081

# Managed mode with ollama
python -m forge.proxy --backend ollama --model llama3.2 --port 8081

Sources: README.md
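
Once the proxy is listening, any Chat Completions client can point at it. The sketch below only builds the HTTP request with the standard library and does not send it. The `/v1/chat/completions` path is the conventional OpenAI route and is an assumption about the proxy, as are the helper name `build_chat_request` and the example model name.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a Chat Completions POST aimed at the proxy (path is assumed)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://127.0.0.1:8081", "llama3.2", "What's the weather in Paris?")
print(request.full_url)
# With a proxy running, the request could be sent with:
#   with urllib.request.urlopen(request) as reply:
#       print(json.load(reply))
```

Because the client talks to port 8081 rather than the backend's 8080, switching an existing application over should require changing only its base URL.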

Command-Line Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --backend-url | string | - | External backend URL (mutually exclusive with --backend) |
| --backend | choice | - | Backend type: llamaserver, llamafile, ollama |
| --model | string | - | Model name (required for ollama) |
| --gguf | string | - | Path to GGUF file (llamaserver/llamafile) |
| --backend-port | int | 8080 | Backend port for managed mode |
| --budget-mode | choice | backend | Context budget: backend, manual, forge-full, forge-fast |
| --budget-tokens | int | - | Manual token budget override |
| --extra-flags | list | - | Additional backend CLI flags |
| --host | string | 127.0.0.1 | Proxy listen host |
| --port | int | 8081 | Proxy listen port |
| --serialize | flag | - | Force request serialization |
| --no-serialize | flag | - | Disable request serialization |
| --max-retries | int | 3 | Max retries per request |
| --no-rescue | flag | - | Disable rescue parsing |
| -v, --verbose | flag | - | Enable debug logging |

Sources: src/forge/proxy/__main__.py:13-53
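
The mode selection implied by the table (either --backend-url or --backend, never both) maps naturally onto an argparse mutually exclusive group. This is a sketch of that pattern covering a subset of the flags, not the parser in forge.proxy. Treating the group as required and rejecting `--backend ollama` without `--model` are assumptions drawn from the descriptions above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="proxy-sketch")
    # External mode and managed mode are alternatives, so only one flag is accepted.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--backend-url")
    target.add_argument("--backend", choices=["llamaserver", "llamafile", "ollama"])
    parser.add_argument("--model")
    parser.add_argument("--gguf")
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8081)
    parser.add_argument("--max-retries", type=int, default=3)
    parser.add_argument("--no-rescue", action="store_true")
    return parser


def parse(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.backend == "ollama" and not args.model:
        parser.error("--model is required with --backend ollama")
    return args


external = parse(["--backend-url", "http://localhost:8080"])
managed = parse(["--backend", "ollama", "--model", "llama3.2", "--port", "9000"])
print(external.backend_url, external.port)  # http://localhost:8080 8081
print(managed.backend, managed.model, managed.port)  # ollama llama3.2 9000
```

Passing both --backend-url and --backend makes argparse exit with a usage error, which is the behavior the "mutually exclusive" note describes.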

Programmatic API

ProxyServer Class

The ProxyServer class provides a programmatic interface for embedding the proxy in Python applications.

from forge.proxy import ProxyServer

# External mode
proxy = ProxyServer(backend_url="http://localhost:8080")
proxy.start()
print(f"Proxy running at {proxy.url}")  # http://127.0.0.1:8081
# ... use proxy ...
proxy.stop()

# Managed mode
proxy = ProxyServer(
    backend="llamaserver",
    gguf="model.gguf",
    budget_mode="forge-fast",
    port=8081
)
proxy.start()
proxy.stop()  # Stops both backend and proxy

Sources: src/forge/proxy/proxy.py:50-75

Constructor Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| backend_url | `str \| None` | None | External backend URL |
| backend | `str \| None` | None | Backend type: llamaserver, llamafile, ollama |
| model | `str \| None` | None | Model name for ollama |
| gguf | `str \| Path \| None` | None | Path to GGUF file |
| backend_port | int | 8080 | Backend port |
| budget_mode | BudgetMode | BudgetMode.BACKEND | Context budget strategy |
| budget_tokens | int | - | Manual token budget |
| extra_flags | `list[str] \| None` | None | Additional CLI flags |
| host | str | 127.0.0.1 | Listen host |
| port | int | 8081 | Listen port |
| serialize | `bool \| None` | None | Request serialization control |
| max_retries | int | 3 | Max retries per request |
| rescue_enabled | bool | True | Enable rescue parsing |

Sources: src/forge/proxy/proxy.py:56-100

Lifecycle Methods

| Method | Description |
|---|---|
| start() | Start the proxy (blocks until ready, max 120s timeout) |
| stop() | Stop the proxy and managed backend (30s shutdown timeout) |
| url | Property returning the proxy's base URL |

Sources: src/forge/proxy/proxy.py:102-125

Respond Tool Injection

Purpose

Small local models (~8B parameters) cannot reliably choose between text output and tool calls. The proxy automatically injects a synthetic respond tool when tools are present in the request, forcing the model into tool-calling mode.

Behavior

  1. When the request contains tools, forge injects a respond(message="...") tool into the tools list
  2. The model calls respond(message="...") instead of producing bare text
  3. The respond call is stripped from the outbound response
  4. The client receives a normal text response with finish_reason: "stop"

This keeps the model in tool-calling mode where forge's full guardrail stack applies.

Sources: README.md
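The inject-and-strip round trip described above can be sketched on plain OpenAI-style dictionaries. This is an illustrative sketch, not forge's implementation: the RESPOND_TOOL schema and the inject_respond and strip_respond helpers are hypothetical names, and the sketch assumes the backend returns tool call arguments as a JSON string.

```python
import json

# Hypothetical schema for the synthetic tool; forge's actual definition may differ.
RESPOND_TOOL = {
    "type": "function",
    "function": {
        "name": "respond",
        "description": "Reply to the user with a plain-text message.",
        "parameters": {
            "type": "object",
            "properties": {"message": {"type": "string"}},
            "required": ["message"],
        },
    },
}


def inject_respond(request: dict) -> dict:
    """Return a copy of the request with the respond tool appended when tools are present."""
    if not request.get("tools"):
        return request
    return {**request, "tools": [*request["tools"], RESPOND_TOOL]}


def strip_respond(message: dict):
    """Turn a lone respond(...) call into plain text; pass other messages through."""
    calls = message.get("tool_calls") or []
    if len(calls) == 1 and calls[0]["function"]["name"] == "respond":
        args = json.loads(calls[0]["function"]["arguments"])
        return {"role": "assistant", "content": args["message"]}, "stop"
    return message, ("tool_calls" if calls else "stop")


backend_message = {
    "role": "assistant",
    "tool_calls": [
        {"type": "function",
         "function": {"name": "respond", "arguments": json.dumps({"message": "It is sunny."})}}
    ],
}
print(strip_respond(backend_message))
```

The client never sees the respond tool: it is added on the way in and converted back to ordinary assistant text on the way out.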

sequenceDiagram
    participant C as OpenAI Client
    participant P as ProxyServer
    participant B as Backend
    
    C->>P: POST /v1/chat/completions<br/>(with tools)
    P->>P: Inject respond tool
    P->>B: Forward request<br/>(tools + respond)
    B->>P: respond(message="answer")
    P->>P: Strip respond call
    P->>C: Normal text response<br/>(finish_reason: "stop")

Context Budget Modes

The proxy supports different strategies for managing context window usage:

| Mode | Description |
|---|---|
| backend | Let the backend manage context (default) |
| manual | Use --budget-tokens for fixed budget |
| forge-full | Full tiered compaction strategy |
| forge-fast | Fast tiered compaction (reduced) |

Sources: src/forge/proxy/__main__.py:35-38

Tiered Compaction

The forge-full and forge-fast modes utilize TieredCompact, a three-phase compaction strategy:

  1. Truncate — Remove oldest messages
  2. Drop results — Remove tool result content
  3. Sliding window — Maintain recent context

Sources: src/forge/proxy/proxy.py:22

Request Serialization

When neither flag is given, the proxy decides whether to serialize requests based on backend capabilities. The serialization flags override that choice:

| Flag | Behavior |
|---|---|
| (none) | Proxy decides based on backend capabilities |
| --serialize | Force sequential request processing |
| --no-serialize | Allow concurrent processing |

Sources: src/forge/proxy/__main__.py:31-34
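One common way to force sequential processing in an async server is to guard the handler with a lock. The sketch below illustrates the difference between the two explicit settings; RequestGate and demo are hypothetical names, and nothing here is taken from forge's source.

```python
import asyncio


class RequestGate:
    """Hypothetical helper: run handlers one at a time when serialize is True."""

    def __init__(self, serialize: bool) -> None:
        self._lock = asyncio.Lock() if serialize else None

    async def run(self, handler, *args):
        if self._lock is None:
            return await handler(*args)  # concurrent processing
        async with self._lock:  # sequential processing
            return await handler(*args)


async def demo(serialize: bool) -> int:
    """Return the peak number of handlers that were in flight at once."""
    gate = RequestGate(serialize)
    in_flight = 0
    peak = 0

    async def handler(_request_id: int) -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for a backend call
        in_flight -= 1

    await asyncio.gather(*(gate.run(handler, i) for i in range(3)))
    return peak


print("serialized peak:", asyncio.run(demo(serialize=True)))
print("concurrent peak:", asyncio.run(demo(serialize=False)))
```

Serialization trades throughput for predictability, which matters for backends that can only work on one request at a time.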

Sampling Parameters Pass-Through

The proxy forwards OpenAI-compatible sampling fields directly to the backend without modification:

  • temperature
  • top_p
  • top_k
  • min_p
  • repeat_penalty
  • presence_penalty
  • seed

Sources: CHANGELOG.md

To use model-card-recommended sampling in proxy mode:

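import httpx

from forge.clients import get_sampling_defaults

# Look up recommended sampling parameters for the model
sampling = get_sampling_defaults("ministral-3-8b-instruct")

# Include them in the request body; the proxy forwards them unchanged
client = httpx.Client(base_url="http://127.0.0.1:8081", timeout=120.0)
response = client.post("/v1/chat/completions", json={
    "model": "ministral-3-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    **sampling,
})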

Signal Handling

The proxy gracefully handles shutdown signals:

  • SIGINT (Ctrl+C) — Immediate shutdown
  • SIGTERM — Graceful shutdown

The main thread uses a timed sleep loop (time.sleep(0.1)) to allow Python to deliver signals between iterations, ensuring proper shutdown on Windows.

Sources: src/forge/proxy/__main__.py:95-105
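The timed sleep loop can be sketched as follows. Python runs signal handlers in the main thread between bytecode instructions, so a loop of short sleeps gives pending signals regular opportunities to be handled. The function names and the use of threading.Event as the stop flag are assumptions for illustration, not forge's code; the timer in the demo stands in for a real Ctrl+C.

```python
import signal
import threading
import time


def install_handlers(stop: threading.Event) -> None:
    """Set the stop flag on SIGINT or SIGTERM (must be called from the main thread)."""
    def _handler(signum, frame):
        stop.set()

    signal.signal(signal.SIGINT, _handler)
    signal.signal(signal.SIGTERM, _handler)


def serve_until_signalled(stop: threading.Event, tick: float = 0.1) -> None:
    """Sleep in short intervals so pending signal handlers get a chance to run."""
    while not stop.is_set():
        time.sleep(tick)


stop = threading.Event()
if threading.current_thread() is threading.main_thread():
    install_handlers(stop)
threading.Timer(0.3, stop.set).start()  # demo only: simulate a shutdown request
serve_until_signalled(stop)
print("shutting down")
```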

Testing with Smoke Test Script

The repository includes a smoke test at scripts/smoke_test_proxy.py that:

  1. Starts a mock backend on port 18080
  2. Launches the proxy in external mode on port 18081
  3. Verifies health endpoint
  4. Sends a test chat completion request
  5. Validates the response structure

Run it with:

python scripts/smoke_test_proxy.py

Sources: scripts/smoke_test_proxy.py

Health Endpoint

The proxy exposes a /health endpoint for monitoring:

curl http://127.0.0.1:8081/health

Sources: scripts/smoke_test_proxy.py:70

Configuration Example: Complete Setup

# Start llama-server with custom flags, proxy it
python -m forge.proxy \
    --backend llamaserver \
    --gguf ./models/ministral-3-8b-instruct-q8_0.gguf \
    --model ministral-3-8b-instruct \
    --budget-mode forge-full \
    --backend-port 8080 \
    --port 8081 \
    --host 0.0.0.0 \
    --extra-flags --reasoning-format auto \
    --verbose

Then configure your client:

import asyncio

import httpx


async def main() -> None:
    # The proxy serves the OpenAI-compatible API under /v1
    async with httpx.AsyncClient(
        base_url="http://localhost:8081/v1",
        timeout=120.0,
    ) as client:
        response = await client.post("/chat/completions", json={
            "model": "ministral-3-8b-instruct",
            "messages": [
                {"role": "user", "content": "What's the weather in Paris?"}
            ],
            "tools": [
                {
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "city": {"type": "string"}
                            },
                            "required": ["city"]
                        }
                    }
                }
            ]
        })
        print(response.json())


asyncio.run(main())

Sources: README.md

Backend Clients

Related topics: Backend Setup Guide, Model Selection Guide

Overview

The Backend Clients subsystem provides a unified abstraction layer over various LLM backends, enabling forge to interact with different inference engines through a consistent interface. This modular design allows users to switch between Ollama, llamafile, and Anthropic backends without modifying workflow code.

Each client handles backend-specific communication protocols, response parsing, streaming, and tool call extraction while exposing a common async API for send operations and context resolution.

Sources: src/forge/clients/base.py:1-50

Architecture

Client Hierarchy

graph TD
    A[BaseClient] --> B[OllamaClient]
    A --> C[LlamafileClient]
    A --> D[AnthropicClient]
    
    E[sampling_defaults.py] --> B
    E --> C
    E --> D
    
    F[WorkflowRunner] --> B
    F --> C
    F --> D

All clients inherit from BaseClient, which defines the core async interface including send(), send_stream(), and get_context_length() methods. Backend-specific implementations override these methods to handle vendor-specific APIs and response formats.

Sources: src/forge/clients/base.py:1-100

Common Interface

All clients implement the following async methods:

| Method | Purpose |
|---|---|
| send(messages, tools, **kwargs) | Send a request and receive a complete response |
| send_stream(messages, tools, **kwargs) | Stream responses as an async generator |
| get_context_length() | Query the backend for maximum context window |
| stop() | Stop any ongoing generation |

Sources: src/forge/clients/base.py:50-150

OllamaClient

Purpose

The OllamaClient connects to a local Ollama server instance, supporting both standard models and GGUF-formatted models served through Ollama's model management system.

Sources: src/forge/clients/ollama.py:1-100

Configuration Options

OllamaClient(
    model: str,                          # Model name (e.g., "qwen3:8b-q4_K_M")
    base_url: str = "http://localhost:11434",
    recommended_sampling: bool = False, # Use verified per-model sampling params
    **kwargs                            # Passed to httpx client
)

Key Features

  • Recommended Sampling: When recommended_sampling=True, the client retrieves verified sampling parameters from forge.clients.sampling_defaults for known models. If a model is not in the map and strict=True, an UnsupportedModelError is raised.
  • Streaming Support: Full streaming support with token-level async generation through send_stream().
  • Tool Call Extraction: Parses Ollama's JSON tool call format and converts to forge's internal ToolCall format.

Sources: src/forge/clients/sampling_defaults.py:1-80

LlamafileClient

Purpose

The LlamafileClient communicates with llamafile or llama-server instances, providing support for GGUF models served directly without Ollama's model management layer.

Sources: src/forge/clients/llamafile.py:1-100

Context Resolution

Unlike Ollama, llamafile and llama-server require querying the /props endpoint to determine the configured context length:

async def get_context_length(self) -> int | None:
    """Query the Llamafile /props endpoint for configured context length."""
    base = self.base_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]

    resp = await self._http.get(f"{base}/props")
    data = resp.json()
    n_ctx = data.get("default_generation_settings", {}).get("n_ctx")
    return int(n_ctx) if n_ctx is not None else None

Sources: src/forge/clients/llamafile.py:180-200
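The two pure steps in that method, URL normalization and payload parsing, can be exercised without a running server. The helper names props_url and parse_n_ctx below are hypothetical; they only restate the logic of the excerpt above as standalone functions.

```python
from __future__ import annotations


def props_url(base_url: str) -> str:
    """Strip a trailing slash and an OpenAI-style /v1 suffix, then append /props."""
    base = base_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[: -len("/v1")]
    return f"{base}/props"


def parse_n_ctx(props: dict) -> int | None:
    """Read default_generation_settings.n_ctx from a decoded /props payload."""
    n_ctx = props.get("default_generation_settings", {}).get("n_ctx")
    return int(n_ctx) if n_ctx is not None else None


print(props_url("http://localhost:8080/v1/"))
print(parse_n_ctx({"default_generation_settings": {"n_ctx": 8192}}))
print(parse_n_ctx({}))
```

Returning None rather than raising lets the caller decide whether a missing n_ctx is an error.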

Tool Call Modes

The client supports multiple tool call parsing strategies:

| Mode | Description |
|---|---|
| native | Uses backend's native tool call format |
| function | Parses <function=name>...</function> style tags |
| prompt | Extracts tool calls from prompted responses |

Sources: src/forge/clients/llamafile.py:100-180

AnthropicClient

Purpose

The AnthropicClient integrates with Anthropic's Claude API, enabling forge workflows to leverage Claude Opus, Sonnet, and Haiku models.

Sources: src/forge/clients/anthropic.py:1-100

Key Differences

  • No hardcoded temperature defaults — relies on Anthropic API's own defaults
  • Supports Anthropic-specific headers and request formatting
  • Compatible with tools via Anthropic's tool use API

Sampling Defaults System

Overview

The sampling_defaults module provides verified per-model sampling parameters sourced from HuggingFace model cards. This ensures optimal generation quality for supported models without requiring users to manually tune hyperparameters.

Sources: src/forge/clients/sampling_defaults.py:1-50

Supported Parameters

| Parameter | Description | Typical Range |
|---|---|---|
| temperature | Sampling temperature | 0.0 - 1.0 |
| top_p | Nucleus sampling threshold | 0.0 - 1.0 |
| top_k | Top-k sampling | 1 - 100 |
| min_p | Minimum probability threshold | 0.0 - 1.0 |
| repeat_penalty | Repetition penalty | 0.0 - 2.0 |
| presence_penalty | Presence penalty (OpenAI compat) | -2.0 - 2.0 |

Policy Behavior

The apply_sampling_defaults() function implements a four-quadrant policy:

def apply_sampling_defaults(model: str, *, strict: bool) -> dict[str, float | int]:
    """Apply the recommended-sampling policy for model."""
    in_map = model in MODEL_SAMPLING_DEFAULTS
    if strict:
        if not in_map:
            raise UnsupportedModelError(model)
        return dict(MODEL_SAMPLING_DEFAULTS[model])
    
    # strict=False: one-shot INFO log if known, else silent
    if in_map and model not in _INFO_LOGGED:
        log.info("Recommended sampling params exist for %r...", model)
        _INFO_LOGGED.add(model)
    return {}

| strict | Model in Map | Behavior |
|---|---|---|
| True | Yes | Return dict copy |
| True | No | Raise UnsupportedModelError |
| False | Yes | One-shot INFO log; return {} |
| False | No | Return {} (silent) |

Sources: src/forge/clients/sampling_defaults.py:60-120

Verified Models

The following model families are currently supported with verified sampling parameters:

  • Qwen3 / Qwen3.5 / Qwen3.6
  • Qwen3-Coder
  • Gemma 4
  • Mistral Small 3.2
  • Devstral Small 2
  • Ministral 3 Instruct + Reasoning
  • Mistral Nemo
  • Granite 4.0

Each entry includes an inline HuggingFace card URL comment for verification.

Sources: src/forge/clients/sampling_defaults.py:50-80

Tool Call Processing

Extraction Flow

graph TD
    A[LLM Response] --> B{Response Type}
    B -->|tool_calls| C[extract_tool_call]
    B -->|text| D[TextResponse]
    
    C --> E{Tool Call Format}
    E -->|OpenAI style| F[Parse name + arguments]
    E -->|function tags| G[Parse XML-style tags]
    E -->|dict style| H[Parse dict with name field]
    
    F --> I[ToolCall object]
    G --> I
    H --> I

Supported Formats

| Backend | Format | Example |
|---|---|---|
| Ollama | OpenAI-style function calls | {"name": "get_weather", "arguments": {"city": "Paris"}} |
| Llamafile | Function tags or native | <function=name><parameter=city>Paris</parameter></function> |
| Anthropic | Claude tool_use blocks | {name: "get_weather", input: {city: "Paris"}} |

Sources: src/forge/core/workflow.py:1-50
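As an illustration of the function-tag format, the sketch below extracts calls from text with two regular expressions. It is not forge's parser: ParsedCall is a hypothetical stand-in for forge's ToolCall, and the sketch assumes parameters are flat string values with no nested tags.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

FUNCTION_RE = re.compile(r"<function=([\w.-]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([\w.-]+)>(.*?)</parameter>", re.DOTALL)


@dataclass
class ParsedCall:
    """Hypothetical stand-in for forge's ToolCall."""
    name: str
    arguments: dict = field(default_factory=dict)


def parse_function_tags(text: str) -> list[ParsedCall]:
    """Extract every <function=...> block and its <parameter=...> children."""
    calls = []
    for name, body in FUNCTION_RE.findall(text):
        args = {key: value.strip() for key, value in PARAMETER_RE.findall(body)}
        calls.append(ParsedCall(name=name, arguments=args))
    return calls


print(parse_function_tags(
    "<function=get_weather><parameter=city>Paris</parameter></function>"
))
```

All values come back as strings in this sketch; a real parser would also need to coerce them against the tool's parameter schema.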

Proxy Mode Integration

Request Passthrough

When running in proxy mode, forge forwards the following OpenAI-compatible body fields to the backend on every request, without modification:

  • temperature
  • top_p
  • top_k
  • min_p
  • repeat_penalty
  • presence_penalty
  • seed

For per-model recommended sampling in proxy mode, the calling client must look up forge.clients.get_sampling_defaults(model) and include the values in the request body.

Sources: src/forge/proxy/__main__.py:1-50

Usage Examples

Basic Workflow with OllamaClient

import asyncio

from forge import OllamaClient, WorkflowRunner, ContextManager, TieredCompact

async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True  # Use verified sampling params
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    # ... run workflows

asyncio.run(main())

Per-Call Sampling Override

response = await client.send(
    messages,
    tools,
    sampling={
        "temperature": 0.7,
        "top_p": 0.9
    }
)

The caller's explicit non-None fields merge with client-level defaults without mutating the original configuration.
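The merge rule can be sketched as a small pure function. The merge_sampling name and the example default values are hypothetical; the sketch only illustrates "non-None override fields win, defaults are not mutated".

```python
from __future__ import annotations


def merge_sampling(defaults: dict, overrides: dict | None) -> dict:
    """Overlay non-None override fields on a copy of the defaults."""
    merged = dict(defaults)  # copy, so the client-level defaults stay untouched
    for key, value in (overrides or {}).items():
        if value is not None:
            merged[key] = value
    return merged


client_defaults = {"temperature": 0.15, "top_p": 0.95}  # illustrative values only
print(merge_sampling(client_defaults, {"temperature": 0.7, "top_p": None, "seed": 42}))
print(client_defaults)
```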

Error Handling

| Error | Cause | Resolution |
|---|---|---|
| UnsupportedModelError | Model not in sampling defaults map | Add model to sampling_defaults.py or pass recommended_sampling=False |
| httpx.HTTPError | Backend unreachable | Verify backend is running on correct port |
| BudgetResolutionError | Cannot determine context length | Check backend /props endpoint returns n_ctx |

Sources: src/forge/server.py:1-50

Backend Server Management

ServerManager Integration

Forge can auto-manage backend servers through ServerManager, which handles starting, stopping, and context resolution:

from forge.server import create_server_and_context

server, ctx = await create_server_and_context(
    backend="ollama",
    model="qwen3:8b-q4_K_M",
    budget_mode=BudgetMode.FORGE_FAST,
    client=client,
)

Supported Backends

| Backend | Model Specification | Port | Features |
|---|---|---|---|
| ollama | Model name string | 11434 | Auto model management |
| llamaserver | GGUF file path | 8080 | Direct GGUF serving |
| llamafile | GGUF file path | 8080 | Single-file server |

Sources: src/forge/server.py:50-150

Sources: src/forge/clients/base.py:1-50

Backend Setup Guide

Related topics: Backend Clients, Model Selection Guide

Overview

The Backend Setup Guide covers how to configure, initialize, and manage LLM backend servers within forge. Forge supports multiple backend types—Ollama, llama-server, and llamafile—each with distinct initialization patterns, context management strategies, and operational characteristics. Understanding these backends is essential for running forge's Workflow and WorkflowRunner components effectively.

Forge abstracts backend management through two primary entry points: the ServerManager class (for direct server lifecycle control) and setup_backend() (a high-level async factory function that combines server startup with ContextManager creation). Sources: src/forge/server.py:1-50

Supported Backend Types

Forge supports three backend implementations, each targeting different deployment scenarios:

| Backend | Identity | Configuration | Use Case |
|---|---|---|---|
| ollama | Model name (e.g., qwen3:8b) | Uses Ollama's native model management | Quick local development, model switching |
| llamaserver | GGUF file path | Requires explicit GGUF path and context size | Production GGUF inference with fine control |
| llamafile | GGUF file path | Auto-discovers llamafile runtime | Single-file distribution, portable setups |

The backend type is determined at ServerManager instantiation and cannot be changed afterward. Sources: src/forge/server.py:140-145

Ollama Backend

The Ollama backend leverages Ollama's built-in model management system. Models are pulled and managed through the Ollama CLI rather than requiring manual GGUF file handling.

server = ServerManager(backend="ollama", port=8080)
await server.start(model="qwen3:8b-q4_K_M", gguf_path="", mode="native")

Key constraints:

  • Does not accept gguf_path (use the model parameter instead)
  • Requires model to be specified
  • VRAM cleanup between model switches is handled via ollama stop Sources: src/forge/server.py:170-180

Llama-server and Llamafile Backends

Both llamaserver and llamafile backends operate on GGUF files directly. The server identity is derived from the GGUF path, enabling cache-equality checks for server reuse. Sources: src/forge/server.py:185-200

# Llama-server
server = ServerManager(backend="llamaserver", port=8080)
await server.start(model="identity", gguf_path="/models/qwen3-8b-q4_K_M.gguf", mode="native")

# Llamafile (auto-discovers runtime)
server = ServerManager(backend="llamafile", port=8080)
await server.start(model="identity", gguf_path="/models/llamafile-binary", mode="native")

ServerManager Architecture

The ServerManager class encapsulates all backend server lifecycle operations, providing a unified interface across different backend types.

State Management

graph TD
    A[ServerManager.__init__] --> B[_proc: Popen | None]
    A --> C[_current_model: str | None]
    A --> D[_current_mode: str | None]
    A --> E[_current_ctx: int | None]
    A --> F[_current_flags: tuple]
    A --> G[_current_cache_type_k/v: str | None]
    A --> H[_current_n_slots: int | None]
    A --> I[_current_kv_unified: bool]
    
    J[ServerManager.start] --> K{Cache Hit?}
    K -->|Yes| L[Return - reuse existing]
    K -->|No| M[await stop]
    M --> N[Build command flags]
    N --> O[Spawn subprocess]

Cache Equality Check

Before starting a new server instance, ServerManager checks if an existing server matches the requested configuration. This prevents unnecessary VRAM allocation and model reloading. Sources: src/forge/server.py:50-65

flags = tuple(extra_flags) if extra_flags else ()
if (
    self._current_model == model
    and self._current_mode == mode
    and self._current_ctx == ctx_override
    and self._current_flags == flags
    and self._current_cache_type_k == cache_type_k
    and self._current_cache_type_v == cache_type_v
    and self._current_n_slots == n_slots
    and self._current_kv_unified == kv_unified
):
    return  # Reuse existing server
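The same comparison can be expressed with a frozen dataclass, whose generated equality compares all fields at once. This is an alternative sketch rather than how ServerManager is written; ServerConfig and needs_restart are hypothetical names, and the field defaults are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ServerConfig:
    """Hypothetical value object holding the fields compared in the check above."""
    model: str
    mode: str
    ctx_override: int | None = None
    flags: tuple[str, ...] = ()  # a tuple, so equality does not depend on list identity
    cache_type_k: str | None = None
    cache_type_v: str | None = None
    n_slots: int | None = None
    kv_unified: bool = False


def needs_restart(current: ServerConfig | None, requested: ServerConfig) -> bool:
    """A running server is reused only when every field matches."""
    return current != requested


running = ServerConfig(
    model="/models/qwen3-8b-q4_K_M.gguf",
    mode="native",
    flags=("--reasoning-format", "auto"),
)
print(needs_restart(running, running))                      # same config: reuse
print(needs_restart(running, replace(running, n_slots=4)))  # slot count changed
```

Adding a new startup option then requires only one new field, instead of another clause in a long boolean expression.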

Server Initialization Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| backend | str | Required | Backend type: "ollama" \| "llamaserver" \| "llamafile" |
| port | int | 8080 | Server listen port (llama-server / llamafile only) |
| models_dir | `str \| Path` | None | Directory containing GGUF files |

Startup Parameters

The start() method accepts numerous parameters for fine-grained control over server behavior:

| Parameter | Type | Description |
|---|---|---|
| model | str | Model identity (Ollama: model name; others: GGUF path as string) |
| gguf_path | `str \| Path` | Path to GGUF file for llamaserver/llamafile |
| mode | str | "native" or "prompt" reasoning mode |
| extra_flags | list[str] | Additional CLI flags passed to the server |
| ctx_override | `int \| None` | Override context window size (-c <value>) |
| cache_type_k | str | KV cache quantization type for keys (e.g., "q8_0", "q4_0") |
| cache_type_v | str | KV cache quantization type for values |
| n_slots | int | Concurrent slot count for multi-agent architectures |
| kv_unified | bool | Use unified KV cache across all slots |

Sources: src/forge/server.py:95-115

Budget Modes

Budget modes control how forge resolves the context window budget for the ContextManager. The resolve_budget() method maps BudgetMode enum values to actual token counts. Sources: src/forge/server.py:220-250

graph TD
    A[resolve_budget mode] --> B{MANUAL?}
    B -->|Yes, Ollama| C[Return manual_tokens]
    B -->|Yes, others| D[await get_server_context]
    B -->|No| E{Ollama?}
    E -->|Yes| F[await _ollama.full]
    E -->|No| G{Mode == FORGE_FAST?}
    G -->|Yes| H[await get_server_context ÷ 4]
    G -->|No| I[await get_server_context]
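The llama-server/llamafile branch of the flowchart can be sketched as a pure function over the n_ctx value reported by /props. The BudgetMode enum below is a local stand-in whose member values are assumed from the CLI choices, resolve_gguf_budget is a hypothetical name, and the use of integer division for the FORGE_FAST case is an assumption.

```python
from enum import Enum


class BudgetMode(Enum):
    """Local stand-in for forge's BudgetMode; values assumed from the CLI choices."""
    BACKEND = "backend"
    MANUAL = "manual"
    FORGE_FULL = "forge-full"
    FORGE_FAST = "forge-fast"


def resolve_gguf_budget(mode: BudgetMode, n_ctx: int) -> int:
    """Budget for llama-server/llamafile, following the non-Ollama flowchart branches."""
    if mode is BudgetMode.FORGE_FAST:
        return n_ctx // 4  # a quarter of the server context
    # MANUAL, BACKEND and FORGE_FULL all resolve to the server-reported context
    return n_ctx


for mode in BudgetMode:
    print(mode.name, resolve_gguf_budget(mode, 8192))
```

The Ollama branches are left out because they depend on how forge queries Ollama for its context length, which is not shown in this section.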

Budget Resolution Table

| BudgetMode | Ollama Backend | Llama-server/Llamafile Backend |
|---|---|---|
| MANUAL | Returns manual_tokens parameter | Queries /props for n_ctx |
| BACKEND | Ollama's reported context length | Queries /props for n_ctx |
| FORGE_FAST | n_ctx / 4 | n_ctx / 4 |

High-Level Setup with `setup_backend()`

For most use cases, prefer setup_backend() which combines server startup with ContextManager creation. Sources: src/forge/server.py:280-330

from forge.server import setup_backend, BudgetMode

async def example():
    client, ctx = await setup_backend(
        backend="llamaserver",
        gguf_path="/models/qwen3-8b-q4_K_M.gguf",
        budget_mode=BudgetMode.FORGE_FAST,
        client=None,  # Will create default client
    )
    # ... run workflows ...
    await client.close()

`setup_backend()` Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| backend | str | Required | Backend type |
| model | `str \| None` | None | Ollama model name |
| gguf_path | `str \| Path \| None` | None | GGUF file path |
| budget_mode | BudgetMode | BudgetMode.BACKEND | Context budget strategy |
| manual_tokens | `int \| None` | None | Required for MANUAL mode on Ollama |
| client | `Any \| None` | None | Existing client or None to create default |
| mode | str | "native" | Reasoning mode |
| port | int | 8080 | Server port |
| extra_flags | `list[str] \| None` | None | Additional backend flags |
| on_compact | `Callable \| None` | None | Callback for compaction events |
| compact_threshold | float | 0.75 | Compaction trigger threshold |
| phase_thresholds | tuple | (0.5, 0.7, 0.9) | Tiered compaction thresholds |

Server Readiness Detection

Forge uses /props polling rather than /health for readiness confirmation. This eliminates the gap between health-ok and props-available states. Sources: src/forge/server.py:260-278

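import asyncio
import time

import httpx

async def wait_for_ready(self, timeout: float = 60.0) -> None:
    url = f"http://localhost:{self._port}/props"
    # deadline and client are not defined in the excerpt; reconstructed here as assumptions
    deadline = time.monotonic() + timeout
    async with httpx.AsyncClient() as client:
        while time.monotonic() < deadline:
            try:
                resp = await client.get(url)
                if resp.status_code == 200:
                    data = resp.json()
                    # Generation settings appear once the model is loaded
                    if "default_generation_settings" in data:
                        return
            except (httpx.ConnectError, httpx.ReadError, httpx.TimeoutException):
                pass  # Server is not accepting connections yet
            await asyncio.sleep(2)
    # Assumption: the excerpt omits the timeout path; raising here is illustrative
    raise TimeoutError(f"backend not ready after {timeout}s")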

The readiness check looks for default_generation_settings in the response—a strong indicator that the model is fully loaded and serving. Sources: src/forge/server.py:260-278

Proxy Server Configuration

Forge includes a proxy server (forge.proxy) that plumbs OpenAI-compatible sampling parameters through to backends. The proxy does not consult the sampling defaults map; it passes through whatever parameters the inbound request carries. Sources: src/forge/proxy/__main__.py:1-60

Proxy CLI Options

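| Flag | Type | Default | Description |
|---|---|---|---|
| --backend-url | str | - | Target backend URL for external mode (mutually exclusive with --backend) |
| --backend | str | - | Backend type for managed mode |
| --model | str | - | Model identifier (required for ollama) |
| --gguf | str | "" | GGUF path (for non-Ollama) |
| --budget-mode | str | "backend" | Budget resolution mode |
| --budget-tokens | int | None | Manual token budget |
| --host | str | 127.0.0.1 | Proxy listen host |
| --port | int | 8081 | Proxy listen port |
| --serialize | flag | None | Force request serialization |
| --max-retries | int | 3 | Max retries per request |
| --verbose | flag | False | Enable debug logging |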

Proxy Sampling Passthrough

The proxy supports these OpenAI-compatible body fields:

| Parameter | Type | Description |
|---|---|---|
| temperature | float | Sampling temperature |
| top_p | float | Nucleus sampling threshold |
| top_k | int | Top-k sampling |
| min_p | float | Minimum probability threshold |
| repeat_penalty | float | Repetition penalty |
| presence_penalty | float | Presence penalty |
| seed | int | Deterministic sampling seed |

For per-model recommended sampling in proxy mode, callers should look up forge.clients.get_sampling_defaults(model) and include the values in the request body. Sources: src/forge/clients/sampling_defaults.py:1-50

Per-Model Sampling Defaults

Forge ships verified per-model sampling recommendations for supported models. These must be explicitly opted into via recommended_sampling=True. Sources: src/forge/clients/sampling_defaults.py:50-80

Supported Models

The sampling defaults map includes recommendations for:

  • Qwen3 / 3.5 / 3.6 series
  • Qwen3-Coder
  • Gemma 4
  • Mistral Small 3.2
  • Devstral Small 2
  • Ministral 3 Instruct + Reasoning
  • Mistral Nemo
  • Granite 4.0 (h-micro, h-tiny)

Each entry includes an inline HuggingFace model card URL for verification. Sources: CHANGELOG.md:0.6.0

Sampling Policy

| strict | Model in Map | Behavior |
|---|---|---|
| True | Yes | Return dict copy |
| True | No | Raise UnsupportedModelError |
| False | Yes | One-shot INFO log; return {} |
| False | No | Return {} (silent) |

Context Length Resolution

For non-Ollama backends, forge queries the server's /props endpoint to determine the configured context length. This value feeds into budget resolution. Sources: src/forge/server.py:200-220

async def get_server_context(self) -> int:
    """Query /props for actual n_ctx.
    
    For Ollama: ``ollama stop`` for clean VRAM unloads between model switches.
    """
    props = await self.query_props()
    ctx = props.get("default_generation_settings", {}).get("n_ctx")
    if ctx is None:
        raise BudgetResolutionError()
    return ctx

Best Practices

Server Reuse

Always check if a server with the desired configuration is already running before starting a new one. The ServerManager performs this check internally based on:

  1. Model identity (name or GGUF path)
  2. Mode (native or prompt)
  3. Context override
  4. CLI flags
  5. KV cache quantization settings
  6. Slot configuration

VRAM Management

For Ollama backends, use ollama stop to cleanly unload models and free VRAM before switching to a different model. The llama-server/llamafile backends handle this through server restart. Sources: src/forge/server.py:200

Graceful Shutdown

Always call server.stop() when finished to properly terminate the backend process:

server = ServerManager(backend="llamaserver", port=8080)
try:
    await server.start(...)
    # ... work ...
finally:
    await server.stop()

Multi-Agent Configurations

For multi-agent architectures requiring concurrent slots, configure n_slots and optionally kv_unified=True for shared KV cache across slots:

await server.start(
    model="...",
    gguf_path="...",
    n_slots=4,
    kv_unified=True,  # Each slot can use full context
)

Quick Start Example

import asyncio
from forge import OllamaClient, WorkflowRunner
from forge.server import setup_backend, BudgetMode
from forge.context import ContextManager, TieredCompact

async def main():
    # Setup backend with forge-managed context
    client, ctx = await setup_backend(
        backend="ollama",
        model="ministral-3:8b-instruct-2512-q4_K_M",
        budget_mode=BudgetMode.FORGE_FAST,
        recommended_sampling=True,
    )
    
    try:
        runner = WorkflowRunner(client=client, context_manager=ctx)
        # ... run workflows ...
    finally:
        await client.close()

asyncio.run(main())

For GGUF-based setups:

from forge.server import setup_backend, BudgetMode

client, ctx = await setup_backend(
    backend="llamaserver",
    gguf_path="/path/to/model-q4_K_M.gguf",
    budget_mode=BudgetMode.FORGE_FAST,
    extra_flags=["--reasoning-format", "auto"],
)

Sources: src/forge/server.py:95-115

Model Selection Guide

Related topics: Backend Setup Guide, Backend Clients

Overview

The Model Selection Guide covers how to choose, configure, and deploy language models within forge. Forge provides a unified workflow engine that abstracts backend differences (Ollama, Llamafile, LlamaServer) while offering per-model recommended sampling parameters sourced directly from HuggingFace model cards.

Sources: src/forge/clients/sampling_defaults.py:1-20

Supported Backends

Forge supports three LLM backend types, each with distinct configuration requirements.

| Backend | Configuration Method | Model Specification | Notes |
|---|---|---|---|
| Ollama | model parameter | Model name from ollama list | No GGUF path needed |
| Llamafile | gguf_path parameter | Path to .llamafile binary | Self-contained executables |
| LlamaServer | gguf_path parameter | Path to GGUF model file | Requires llama.cpp server binary |

Sources: src/forge/server.py:1-50

Backend Selection Logic

graph TD
    A[Choose Backend] --> B{Backend Type?}
    B -->|Ollama| C[Use model name]
    B -->|Llamafile| D[Use gguf_path]
    B -->|LlamaServer| D
    C --> E[Connect via localhost:11434]
    D --> F[Start server process]
    F --> G[Connect via port 8080]

For Ollama, the model parameter directly references the model name from ollama list. For Llamafile and LlamaServer, you must provide the gguf_path pointing to the model file, and Forge will manage the server process lifecycle.

Sources: src/forge/server.py:80-120

Supported Models

Model Families

Forge has been tested and evaluated with the following model families across different quantization levels.

| Model Family | Variants | Recommended Quantization | Notes |
|---|---|---|---|
| Qwen3 | 8B, 3.5, 3.6 | Q4_K_M, Q8_0 | Includes Qwen3-Coder |
| Gemma | 4 (all sizes) | Q4_K_M | Use --reasoning-budget 0 workaround |
| Mistral | Small 3.2, Nemo, 7B | Q4_K_M, Q8_0 | Ministral variants available |
| Devstral | Small 2 | Q4_K_M | Code-focused model |
| Granite | 4.0 (h-micro, h-tiny) | Q4_K_M, Q8_0 | OpenAI-style tool calls |
| Llama | 3.1 8B | Q4_K_M, Q8_0 | 8B Reasoning variants |
| Ministral | 3 Instruct, 8B Instruct, Reasoning | Q4_K_M, Q8_0 | Reasoning requires budget fix |

Sources: CHANGELOG.md:1-50

Model Naming Conventions

Model names vary by backend. When using Ollama, use the exact model tag as shown in ollama list. For GGUF-based backends, the model name is derived from the filename stem of the GGUF file.

# Ollama example
client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M")

# Llamafile/LlamaServer example - model derived from path
client = LlamafileClient(gguf_path="/models/mistral-7b-q4_K_M.gguf")

Sources: README.md:1-30
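Deriving a name from the filename stem can be illustrated with pathlib. The model_name_from_gguf helper is a hypothetical name for illustration; forge may apply further normalization that this sketch does not attempt.

```python
from pathlib import Path


def model_name_from_gguf(gguf_path) -> str:
    """Return the filename without its final extension."""
    return Path(gguf_path).stem


print(model_name_from_gguf("/models/mistral-7b-q4_K_M.gguf"))
print(model_name_from_gguf("qwen3.5-8b.gguf"))  # only the last suffix is removed
```

Because the name comes from the file, renaming a GGUF file changes the derived name, which matters wherever names are used as lookup keys such as the sampling defaults map.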

The Sampling Defaults Map

Forge ships forge.clients.sampling_defaults containing a verified per-model sampling recommendations map. Each entry includes parameters such as temperature, top_p, top_k, min_p, repeat_penalty, and presence_penalty sourced directly from HuggingFace model cards.

Sources: src/forge/clients/sampling_defaults.py:20-40

To use per-model recommended sampling, pass recommended_sampling=True when initializing the client.

from forge import OllamaClient

# Opt-in to recommended sampling
client = OllamaClient(
    model="qwen3:8b-q4_K_M",
    recommended_sampling=True
)

If the model is not in the map and recommended_sampling=True is set, Forge raises UnsupportedModelError rather than silently falling back to backend defaults.

Sources: src/forge/errors.py:1-25

Sampling Policy Behavior

The following table describes the four-quadrant behavior when applying sampling defaults.

| strict | Model in Map | Behavior |
|---|---|---|
| True | Yes | Return recommended dict |
| True | No | Raise UnsupportedModelError |
| False | Yes | One-shot INFO log; return {} |
| False | No | Return {} (silent) |

Sources: src/forge/clients/sampling_defaults.py:60-85
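
The four quadrants can be expressed in a few lines. The following is a self-contained sketch, not the code in sampling_defaults.py: `resolve_sampling`, the `DEFAULTS` contents, the log message, and the local `UnsupportedModelError` stand-in are all illustrative assumptions.

```python
import logging

logger = logging.getLogger("sampling_sketch")


class UnsupportedModelError(Exception):
    """Local stand-in for the Forge error of the same name (illustrative)."""


# Hypothetical map contents; real entries come from the model cards.
DEFAULTS = {"example-model": {"temperature": 0.7, "top_p": 0.8}}
_announced = set()  # models for which the one-shot INFO log already fired


def resolve_sampling(model: str, strict: bool) -> dict:
    if strict:
        if model not in DEFAULTS:
            raise UnsupportedModelError(model)   # strict, not in map
        return dict(DEFAULTS[model])             # strict, in map (copy)
    if model in DEFAULTS and model not in _announced:
        _announced.add(model)                    # non-strict, in map: log once
        logger.info("Recommended sampling exists for %s but is not enabled", model)
    return {}                                    # non-strict: always empty


print(resolve_sampling("example-model", strict=True))
print(resolve_sampling("example-model", strict=False))
```

Returning a copy in the strict branch keeps callers from mutating the shared map by accident.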

Per-Call Sampling Overrides

The send() and send_stream() methods accept a sampling: dict | None kwarg that merges field-by-field with the client's instance-level sampling without mutating it. The caller's explicit non-None fields take precedence.

# Merge with instance defaults
response = await client.send(
    messages,
    sampling={"temperature": 0.7, "top_p": 0.9}
)

Sources: CHANGELOG.md:50-70
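
The merge semantics described above (field-by-field, explicit non-None values win, no mutation of the instance-level settings) can be sketched as follows. `merge_sampling` is a hypothetical helper written for illustration; it is not claimed to be the function Forge uses.

```python
from __future__ import annotations


def merge_sampling(instance: dict, per_call: dict | None) -> dict:
    """Return a new dict: instance defaults overlaid with non-None per-call fields."""
    merged = dict(instance)                 # copy, so the instance dict is untouched
    for key, value in (per_call or {}).items():
        if value is not None:               # None means "no opinion", keep the default
            merged[key] = value
    return merged


instance_sampling = {"temperature": 0.2, "top_k": 40}
effective = merge_sampling(
    instance_sampling,
    {"temperature": 0.7, "top_p": 0.9, "top_k": None},
)
print(effective)          # {'temperature': 0.7, 'top_k': 40, 'top_p': 0.9}
print(instance_sampling)  # {'temperature': 0.2, 'top_k': 40}
```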

Proxy Mode Configuration

When running forge in proxy mode, sampling parameters are plumbed through from the incoming request body. OpenAI-compatible fields supported include temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty, and seed.

Proxy Server Startup

python -m forge.proxy \
    --backend-url http://localhost:11434 \
    --backend ollama \
    --model qwen3:8b-q4_K_M \
    --port 8081

For per-model recommended sampling in proxy mode, the calling client should look up forge.clients.get_sampling_defaults(model) and include the values in the request body.

Sources: src/forge/proxy/__main__.py:1-40
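
A calling client might assemble its request body along these lines. This is a sketch under assumptions: `build_chat_request` is a hypothetical helper, the sampling values are placeholders rather than real recommendations, and the OpenAI-style endpoint path mentioned in the comment is assumed rather than confirmed by the source.

```python
import json

# Fields the proxy is described as passing through.
PASSTHROUGH_FIELDS = (
    "temperature", "top_p", "top_k", "min_p",
    "repeat_penalty", "presence_penalty", "seed",
)


def build_chat_request(model: str, messages: list, sampling: dict) -> dict:
    """Build a chat request body that carries only supported sampling fields."""
    body = {"model": model, "messages": messages}
    body.update({
        key: value for key, value in sampling.items()
        if key in PASSTHROUGH_FIELDS and value is not None
    })
    return body


# Placeholder values; in practice these would come from
# forge.clients.get_sampling_defaults(model).
sampling = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "unknown": 1}
body = build_chat_request(
    "qwen3:8b-q4_K_M",
    [{"role": "user", "content": "Hello"}],
    sampling,
)
payload = json.dumps(body)
# POST `payload` to the proxy started above on port 8081
# (assumed path: /v1/chat/completions).
print(payload)
```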

Context and Budget Management

Server Context Resolution

Forge automatically queries the backend's /props endpoint to determine the maximum context length. For Ollama, run ollama stop to unload the current model from VRAM before switching to another model. The token budget can also be set explicitly on the ContextManager:

from forge import ContextManager, TieredCompact

ctx = ContextManager(
    strategy=TieredCompact(keep_recent=2),
    budget_tokens=8192
)

Budget Modes

| Mode | Description | Token Source |
|---|---|---|
| MANUAL | User-specified token budget | manual_tokens parameter |
| FORGE_FAST | Fast iteration mode | Server-reported context |
| FORGE_BALANCED | Balanced speed/quality | Server-reported context |
| FORGE_THOROUGH | Maximum quality | Server-reported context |

Sources: src/forge/server.py:150-200
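
The "Token Source" column can be read as a simple dispatch. The sketch below only shows where the starting token count comes from; it does not model how the three FORGE_* modes differ afterwards, which the table does not specify. `BudgetMode` and `base_context_tokens` are hypothetical names chosen to mirror the table.

```python
from __future__ import annotations

from enum import Enum


class BudgetMode(Enum):
    # Names follow the table; the string values are arbitrary.
    MANUAL = "manual"
    FORGE_FAST = "forge_fast"
    FORGE_BALANCED = "forge_balanced"
    FORGE_THOROUGH = "forge_thorough"


def base_context_tokens(mode: BudgetMode, server_ctx: int,
                        manual_tokens: int | None = None) -> int:
    """Pick the token count a budget starts from, per the table above."""
    if mode is BudgetMode.MANUAL:
        if manual_tokens is None:
            raise ValueError("MANUAL mode requires manual_tokens")
        return manual_tokens
    # All FORGE_* modes start from the context length the server reports.
    return server_ctx


print(base_context_tokens(BudgetMode.MANUAL, server_ctx=32768, manual_tokens=8192))
print(base_context_tokens(BudgetMode.FORGE_BALANCED, server_ctx=32768))
```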

Known Issues with Reasoning Models

Models using extended reasoning (Gemma 4, Qwen 3.5, Ministral Reasoning) may hang with unbounded reasoning budgets on builds after April 10, 2026. The workaround is to set --reasoning-budget 0 when starting the backend.

Sources: CHANGELOG.md:70-90
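
When the backend is a llama.cpp server launched by hand, the workaround amounts to adding one flag to the command line. The sketch below builds such an argument list; it assumes the binary is named `llama-server` and is on the PATH, and `llama_server_args` is a hypothetical helper, not a Forge function.

```python
def llama_server_args(gguf_path: str, port: int = 8080,
                      reasoning_budget: int = 0) -> list:
    """Build an argv list for llama-server with the reasoning budget pinned."""
    return [
        "llama-server",                       # assumed binary name
        "-m", gguf_path,                      # model file
        "--port", str(port),
        "--reasoning-budget", str(reasoning_budget),  # 0 disables extended reasoning
    ]


args = llama_server_args("/models/gemma-4-q4_K_M.gguf")
print(" ".join(args))
# The list can be passed to subprocess.Popen(args) to start the server.
```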

Client Configuration Reference

OllamaClient

OllamaClient(
    model: str,                          # Model name from `ollama list`
    base_url: str = "http://localhost:11434/v1",
    api_key: str | None = None,
    timeout: float = 120.0,
    recommended_sampling: bool = False,  # Opt-in to per-model defaults
    **kwargs
)

LlamafileClient

LlamafileClient(
    gguf_path: str | Path,               # Path to .llamafile binary
    model: str | None = None,            # Optional model name
    base_url: str = "http://localhost:8080/v1",
    recommended_sampling: bool = False,
    **kwargs
)

Sources: src/forge/clients/llamafile.py:1-50

Complete Workflow Example

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

Sources: README.md:30-80

Error Handling

UnsupportedModelError

Raised when recommended_sampling=True is specified but the model is not in the sampling defaults map.

from forge import OllamaClient
from forge.errors import UnsupportedModelError

try:
    client = OllamaClient(
        model="unknown-model:latest",
        recommended_sampling=True
    )
except UnsupportedModelError as e:
    print(f"Model not supported: {e.model}")
    # Solution: either add an entry to MODEL_SAMPLING_DEFAULTS
    # or drop recommended_sampling=True

Tool Call Errors

Tool-related errors include ToolCallError (LLM failed to produce valid tool call), ToolExecutionError (tool callable raised an exception), and ToolResolutionError (valid arguments but data didn't resolve).

Sources: src/forge/errors.py:25-60

Best Practices

Selecting Quantization Levels

| Use Case | Recommended Quantization |
|---|---|
| Development/Testing | Q4_K_M (balanced quality/size) |
| Production (quality priority) | Q8_0 (near-float quality) |
| Resource-constrained | Q4_0 (smaller, lower quality) |

Guardrail Integration

Guardrails in Forge are defined in src/forge/core/runner.py, and their nudge templates live in src/forge/prompts/nudges.py. Each guardrail can be toggled independently via ablation presets for evaluation.

Sources: CONTRIBUTING.md:1-30

Server Management

When running multiple evaluations, reuse ServerManager instances when the model and configuration match to avoid unnecessary server restarts.

# ServerManager caches configuration to avoid redundant restarts
if (
    self._current_model == model
    and self._current_mode == mode
    and self._current_ctx == ctx_override
):
    # Reuse existing server
    return

Sources: src/forge/server.py:60-75
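
The fragment above is an excerpt from inside a method. A self-contained sketch of the same reuse check is shown here; `ServerCache`, `ensure`, and the injected `start_server` callable are hypothetical names, and the real ServerManager may track more state than this.

```python
class ServerCache:
    """Restart the server only when (model, mode, ctx_override) changes."""

    def __init__(self, start_server):
        self._start_server = start_server   # callable that (re)starts the backend
        self._current = None                # key of the configuration now running

    def ensure(self, model, mode, ctx_override=None) -> bool:
        """Return True if a (re)start happened, False if the server was reused."""
        key = (model, mode, ctx_override)
        if key == self._current:
            return False                    # same configuration: reuse
        self._start_server(model, mode, ctx_override)
        self._current = key
        return True


starts = []
cache = ServerCache(lambda *cfg: starts.append(cfg))
cache.ensure("qwen3:8b-q4_K_M", "native")          # starts the server
cache.ensure("qwen3:8b-q4_K_M", "native")          # reused, no restart
cache.ensure("qwen3:8b-q4_K_M", "native", 16384)   # context changed: restart
print(len(starts))  # 2
```

Comparing one tuple keeps the three fields in step, so a new field cannot be added to the key but forgotten in the comparison.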

Sources: src/forge/clients/sampling_defaults.py:1-20

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Doramagic extracted 15 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/58

2. Installation risk: Investigate: integration paths with Hermes Agent

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: Investigate: integration paths with Hermes Agent. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/51

3. Installation risk: Per-model recommended sampling defaults (map keyed by HF model cards)

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: Per-model recommended sampling defaults (map keyed by HF model cards). Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/59

4. Installation risk: Rescue-parse ChatGPT-style XML tool calls

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: Rescue-parse ChatGPT-style XML tool calls. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/55

5. Configuration risk: Proxy external mode hardcodes native FC — no prompt-injection fallback

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: Proxy external mode hardcodes native FC — no prompt-injection fallback. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/53

6. Capability assumption: README/documentation is current enough for a first validation pass.

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: capability.assumptions | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | README/documentation is current enough for a first validation pass.

7. Maintenance risk: Maintainer activity is unknown

  • Severity: medium
  • Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | last_activity_observed missing

8. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: downstream_validation.risk_items | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | no_demo; severity=medium

9. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: risks.scoring_risks | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | no_demo; severity=medium

10. Security or permission risk: Hardware detection: AMD unified-memory rigs fall through to 4K Ollama budget

  • Severity: medium
  • Finding: Security or permission risk is backed by a source signal: Hardware detection: AMD unified-memory rigs fall through to 4K Ollama budget. Treat it as a review item until the current version is checked.
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/61

11. Security or permission risk: Sub-agent support: dynamic slot splitting

  • Severity: medium
  • Finding: Security or permission risk is backed by a source signal: Sub-agent support: dynamic slot splitting. Treat it as a review item until the current version is checked.
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/28

12. Security or permission risk: Sub-agent support: slot pool

  • Severity: medium
  • Finding: Security or permission risk is backed by a source signal: Sub-agent support: slot pool. Treat it as a review item until the current version is checked.
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/29

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 10 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using forge with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence