Doramagic Project Pack · Human Manual
forge
Introduction to Forge
Related topics: System Architecture, Quick Start Guide
Forge is a framework-agnostic LLM orchestration library that provides structured tool-calling workflows, guardrail enforcement, and context management for building reliable AI agents. It supports multiple LLM backends (Ollama, Llamafile, llama.cpp) and exposes both high-level runners and granular APIs for embedding into foreign orchestration loops.
Overview
Forge addresses the core challenges of LLM-based agent development:
| Challenge | Forge Solution |
|---|---|
| Unreliable tool parsing | Rescue parsing with retry mechanisms |
| Missing required steps | StepEnforcer validates call sequences |
| Context overflow | Tiered context compaction strategies |
| Model-specific sampling | Per-model verified sampling defaults |
| Multi-backend support | Unified client abstraction layer |
Sources: README.md
Core Architecture
Forge follows a layered architecture with clear separation of concerns:
graph TD
subgraph "User Layer"
A[User Workflow] --> B[WorkflowRunner]
end
subgraph "Core Layer"
B --> C[ContextManager]
B --> D[LLMClient]
B --> E[Guardrails]
end
subgraph "Client Layer"
D --> F[OllamaClient]
D --> G[LlamafileClient]
D --> H[AnthropicClient]
end
subgraph "Backend Layer"
F --> I[Ollama Backend]
G --> J[Llamafile Backend]
H --> K[Anthropic API]
end
Sources: CONTRIBUTING.md
Project Structure
src/forge/        # Library source
  clients/        # LLM backend adapters (one per backend)
  core/           # Workflow, runner, messages, steps
  context/        # Context management and compaction
  prompts/        # Prompt templates and nudges
  guardrails/     # Response validation and step enforcement
  proxy/          # OpenAI-compatible proxy server
Sources: CONTRIBUTING.md
Quick Start
The fundamental building blocks of a Forge workflow:
import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())
Sources: README.md
Core Components
Workflow
The Workflow class defines the structure and constraints of an agent task:
| Parameter | Type | Description |
|---|---|---|
| name | str | Workflow identifier |
| description | str | Human-readable description |
| tools | dict[str, ToolDef] | Tool definitions keyed by name |
| required_steps | list[str] | Tools that must precede terminal tool |
| terminal_tool | str | Tool(s) that can end the workflow |
Sources: src/forge/core/workflow.py:1-50
ToolDef
Binds a tool schema to its implementation:
@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)
The prerequisites field supports:
- String entries: Name-only requirements ("read_file" — any prior call satisfies)
- Dict entries: Arg-matched requirements ({"tool": "read_file", "match_arg": "path"})
Sources: src/forge/core/workflow.py
LLM Clients
Forge provides backend-specific clients:
| Client | Backend | Features |
|---|---|---|
| OllamaClient | Ollama API | recommended_sampling, async streaming |
| LlamafileClient | Llamafile binary | Context length detection, reasoning extraction |
| AnthropicClient | Anthropic API | Native Claude support |
All clients support send() and send_stream() methods with sampling parameter overrides:
# Instance-level sampling
client = OllamaClient(model="qwen3:8b", recommended_sampling=True)
# Per-call override (merged without mutation)
await client.send(messages, sampling={"temperature": 0.5})
Sources: src/forge/clients/sampling_defaults.py
Per-Model Sampling Defaults
The sampling_defaults module provides verified per-model sampling parameters sourced from HuggingFace model cards:
def get_sampling_defaults(model: str) -> dict[str, float | int]:
    """Pure lookup - returns copy of map value or {} for unknown models."""
    return dict(MODEL_SAMPLING_DEFAULTS.get(model, {}))

def apply_sampling_defaults(model: str, *, strict: bool) -> dict[str, float | int]:
    """Policy layer with four-quadrant behavior."""
| strict | model in map | behavior |
|---|---|---|
| True | yes | return dict |
| True | no | raise UnsupportedModelError |
| False | yes | one-shot INFO log; return {} |
| False | no | return {} (silent) |
Sources: src/forge/clients/sampling_defaults.py
Guardrails System
Forge's guardrails provide multi-layered response validation:
graph LR
A[LLM Response] --> B[ResponseValidator]
B --> C{Valid ToolCall?}
C -->|No| D[Rescue Parsing]
D --> E{Rescued?}
E -->|Yes| F[Return ToolCall]
E -->|No| G[Retry Nudge]
C -->|Yes| H[StepEnforcer]
H --> I{Valid Sequence?}
I -->|Yes| J[ErrorTracker]
I -->|No| K[Step Blocked Nudge]
J --> L{Error Limit?}
L -->|Yes| M[Fatal Error]
L -->|No| N[Execute]
Sources: src/forge/guardrails/guardrails.py
Guardrails API
class Guardrails:
    def __init__(
        self,
        tool_names: list[str],
        terminal_tool: str | frozenset[str],
        required_steps: list[str] | None = None,
        max_retries: int = 3,
        max_tool_errors: int = 2,
        rescue_enabled: bool = True,
        max_premature_attempts: int = 3,
        retry_nudge: Callable[[str], str] | None = None,
    ) -> None:
        ...
| Parameter | Default | Description |
|---|---|---|
| max_retries | 3 | Consecutive bad responses before fatal |
| max_tool_errors | 2 | Tool execution failures before exhaustion |
| max_premature_attempts | 3 | Premature terminal attempts before fatal |
| rescue_enabled | True | Parse tool calls from plain text |
| retry_nudge | None | Custom nudge for bare text responses |
Sources: src/forge/guardrails/guardrails.py
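The word "consecutive" in the `max_retries` description matters: a valid response in between resets the streak. The counter below is a hypothetical standalone sketch of that idea. Whether Forge turns fatal when the limit is reached or only once it is exceeded is not stated here, so the sketch assumes "exceeded".

```python
from dataclasses import dataclass


@dataclass
class RetryCounter:
    """Counts consecutive bad responses; a good response resets the count."""

    max_retries: int = 3
    consecutive_bad: int = 0

    def record_bad(self) -> str:
        self.consecutive_bad += 1
        # Assumption: the limit must be exceeded, not merely reached.
        return "fatal" if self.consecutive_bad > self.max_retries else "retry"

    def record_good(self) -> None:
        self.consecutive_bad = 0


counter = RetryCounter(max_retries=3)
actions = [counter.record_bad() for _ in range(3)]
counter.record_good()  # streak broken, count back to zero
actions.append(counter.record_bad())
print(actions)  # ['retry', 'retry', 'retry', 'retry']
```

Four bad responses are tolerated here only because a good one interrupted them; four in a row would end in "fatal" under the same assumption.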
CheckResult Actions
@dataclass
class CheckResult:
    action: Literal["execute", "retry", "step_blocked", "fatal"]
    tool_calls: list[ToolCall] | None = None
    nudge: Nudge | None = None   # Set when action is "retry" or "step_blocked"
    reason: str | None = None    # Only when action == "fatal"
Sources: src/forge/guardrails/guardrails.py
Context Management
The ContextManager handles token budget enforcement and message compaction:
| Strategy | Purpose |
|---|---|
| TieredCompact | Keeps recent N messages, compacts older ones |
| Custom strategies | Pluggable compaction algorithms |
ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
Sources: src/forge/server.py
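The "keep recent N, compact older" idea can be sketched without the library. The function below is a simplified stand-in, not `TieredCompact` itself: the message shape, the decision to always keep system messages, and the placeholder text are all assumptions made for illustration.

```python
Message = dict[str, str]  # {"role": ..., "content": ...}


def compact_keep_recent(messages: list[Message], keep_recent: int) -> list[Message]:
    """Keep system messages and the last `keep_recent` others; stub the rest."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    if len(others) <= keep_recent:
        return list(messages)
    cut = len(others) - keep_recent
    older, recent = others[:cut], others[cut:]
    # Hypothetical placeholder; a real strategy might summarise instead.
    stub = {"role": "user", "content": f"[{len(older)} earlier messages compacted]"}
    return system + [stub] + recent


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
    {"role": "assistant", "content": "second answer"},
]
for m in compact_keep_recent(messages, keep_recent=2):
    print(m["role"], "|", m["content"])
```

A real strategy would also measure tokens against the budget before deciding to compact; this sketch compacts purely by message count.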
Server Manager
For local GGUF model execution, ServerManager handles backend lifecycle:
server, ctx = await forge.serve(
backend="llamaserver",
gguf_path="/path/to/model.gguf",
mode=BudgetMode.FORGE_FAST,
client=client,
)
runner = WorkflowRunner(client=client, context_manager=ctx)
| Parameter | Description |
|---|---|
| backend | "ollama", "llamaserver", or "llamafile" |
| gguf_path | Path to GGUF file (not for Ollama) |
| mode | Budget mode (FORGE_FAST, FORGE_BALANCED, FORGE_DEEP) |
| n_slots | Concurrent slots for multi-agent |
| kv_unified | Single shared KV cache across slots |
Sources: src/forge/server.py
Error Handling
Forge defines a structured exception hierarchy:
class ForgeError(Exception):  # Base exception
    pass

class UnsupportedModelError(ForgeError):
    """Raised when recommended_sampling=True for an unknown model."""

class ToolCallError(ForgeError):
    """LLM failed to produce a valid tool call after retries."""

class ToolExecutionError(ForgeError):
    """Tool callable raised during execution."""
Sources: src/forge/errors.py
Foreign Loop Integration
Forge can be embedded into existing orchestration systems using the Guardrails middleware API:
from forge.guardrails import Guardrails
guardrails = Guardrails(
tool_names=["search", "lookup", "answer"],
required_steps=["search", "lookup"],
terminal_tool="answer",
)

def handle_response(response):
    result = guardrails.check(response)
    if result.action == "fatal":
        return f"FATAL: {result.reason}"
    if result.action in ("retry", "step_blocked"):
        return f"{result.action}: {result.nudge.content[:80]}..."
    # Execute tools, then record
    executed = [tc.tool for tc in result.tool_calls]
    done = guardrails.record(executed)
    return f"executed {executed}" + (" -- DONE" if done else "")
Sources: examples/foreign_loop.py
Granular Component Access
For fine-grained control, access components directly:
from forge.guardrails import ErrorTracker, ResponseValidator, StepEnforcer
validator = ResponseValidator(tool_names=["search", "lookup", "answer"], rescue_enabled=True)
enforcer = StepEnforcer(required_steps=["search", "lookup"], terminal_tool="answer")
errors = ErrorTracker(max_retries=3, max_tool_errors=2)
Sources: examples/foreign_loop.py
Code Style and Requirements
| Requirement | Details |
|---|---|
| Python | 3.12+ |
| Async | asyncio throughout; all client methods and the runner are async |
| Type Safety | Pydantic for tool parameter schemas |
| Type Syntax | Modern unions with `\|` |
Sources: CONTRIBUTING.md
Running Tests
# Full suite (865 tests)
python -m pytest tests/unit/ -v --tb=short
# With coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing
# Skip integration tests (require live backend)
python -m pytest tests/ -m "not integration"
Sources: CONTRIBUTING.md
Sources: README.md
Installation Guide
Related topics: Backend Setup Guide, Quick Start Guide
This guide covers all aspects of setting up the forge framework, from basic installation to advanced configuration for different LLM backends.
Overview
Forge is a Python library for building LLM-powered workflows with automatic tool calling, context management, and multi-backend support. The installation process involves:
- Python environment setup (Python 3.12+)
- Package installation via pip
- LLM backend selection and configuration
- Optional development environment for contributors
Sources: CONTRIBUTING.md:1-15
Prerequisites
System Requirements
| Component | Requirement |
|---|---|
| Python | 3.12+ |
| OS | Linux, macOS, Windows |
| LLM Backend | Ollama, llama-server, or llamafile |
| VRAM | Varies by model (8GB minimum for small models) |
Backend Options
Forge supports three LLM backends:
| Backend | Description | Use Case |
|---|---|---|
| Ollama | Local model management with simple API | Quick setup, model management |
| llama-server | llama.cpp server binary | Production, GGUF files |
| llamafile | Single-file executable models | Distribution, portability |
Each backend requires different installation steps:
- Ollama: Install via ollama run commands
- llama-server: Download llama.cpp server binary
- llamafile: Download pre-built executables or convert GGUF files
Sources: src/forge/server.py:1-50
Installation Methods
Standard Installation
For end users wanting to use forge as a library:
pip install forge
This installs the core package with all necessary dependencies.
Development Installation
For contributors and those wanting to modify the codebase:
git clone https://github.com/antoinezambelli/forge.git
cd forge
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# or: .venv\Scripts\activate # Windows
pip install -e ".[dev]"
Sources: CONTRIBUTING.md:7-14
The .[dev] extras include:
- Testing dependencies (pytest)
- Development tools
- Documentation build tools
Package Configuration
The project uses pyproject.toml for dependency management:
[project]
name = "forge"
version = "0.5.0"
requires-python = ">=3.12"
[project.optional-dependencies]
dev = [
"pytest>=7.0",
"pytest-asyncio",
"pytest-cov",
]
Environment Setup
Virtual Environment
Using a virtual environment is strongly recommended to avoid dependency conflicts:
python -m venv forge-env
source forge-env/bin/activate
Backend Installation
#### Ollama Setup
- Install Ollama from ollama.ai
- Pull a model:
ollama pull ministral-3:8b-instruct-2512-q4_K_M
- Verify the installation:
ollama list
#### llama-server Setup
- Download the llama.cpp server binary for your platform
- Place the binary in your models directory or system PATH
- Verify with:
./llama-server --help
#### llamafile Setup
- Download a pre-built llamafile (e.g., from TheBloke)
- Make it executable:
chmod +x model-name.Q4_K_M.llamafile
Sources: src/forge/server.py:180-220
Configuration Flow
graph TD
A[Install forge] --> B{Use Case}
B -->|Library| C[pip install forge]
B -->|Development| D[pip install -e .[dev]]
C --> E[Choose Backend]
D --> E
E -->|Ollama| F[Install Ollama + Pull Model]
E -->|llama-server| G[Download llama.cpp binary]
E -->|llamafile| H[Download llamafile]
F --> I[Verify with test script]
G --> I
H --> I
Quick Start Verification
After installation, verify your setup with this minimal example:
import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())
Sources: README.md:1-60
Advanced Backend Configuration
KV Cache Quantization
Forge supports KV cache quantization to reduce VRAM usage:
| Setting | VRAM Savings | Quality Impact |
|---|---|---|
| q8_0 | ~50% vs F16 | Minimal |
| q4_0 | ~75% vs F16 | Low |
# In setup_backend call
from forge.server import setup_backend, BudgetMode
server, ctx = await setup_backend(
backend="llamaserver",
gguf_path="/path/to/model.gguf",
budget_mode=BudgetMode.FORGE_FAST,
cache_type_k="q4_0", # Key cache quantization
cache_type_v="q4_0", # Value cache quantization
)
Sources: src/forge/server.py:20-45
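The savings figures follow from the storage formats. In llama.cpp, q8_0 and q4_0 pack 32 values per block together with a 2-byte fp16 scale, which gives 34 and 18 bytes per block, against 64 bytes for 32 F16 values. The sketch below works through the arithmetic; the model shape is a hypothetical 8B-class configuration chosen only for illustration, and `kv_cache_bytes` is not a Forge function.

```python
# Bytes per element for llama.cpp cache types. The quantized formats store
# 32 values per block plus a 2-byte fp16 scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + 2-byte scale
    "q4_0": 18 / 32,  # 32 4-bit values (16 bytes) + 2-byte scale
}


def kv_cache_bytes(
    n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int, cache_type: str
) -> float:
    # K and V each hold n_layers * n_kv_heads * head_dim values per token.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
baseline = kv_cache_bytes(**shape, cache_type="f16")
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(**shape, cache_type=cache_type)
    print(f"{cache_type}: {size / 2**20:.0f} MiB, {1 - size / baseline:.1%} saved")
```

For this shape the F16 cache is 1024 MiB, and the per-block scale overhead puts the savings at about 47% for q8_0 and 72% for q4_0, close to the rounded figures in the table.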
Multi-Slot Configuration
For multi-agent architectures:
server, ctx = await setup_backend(
backend="llamaserver",
gguf_path="/path/to/model.gguf",
n_slots=4, # Number of concurrent slots
kv_unified=True, # Shared KV cache pool
)
When kv_unified=True, all slots share a single KV cache pool, allowing each slot to use the full context window.
Sources: src/forge/server.py:40-55
Recommended Sampling Parameters
Forge provides recommended sampling defaults for specific models:
client = OllamaClient(
model="qwen3:8b-q4_K_M",
recommended_sampling=True # Enable recommended defaults
)
The recommended_sampling=True parameter enables tuned temperature, top_p, top_k, and other sampling parameters sourced from HuggingFace model cards.
Sources: src/forge/clients/sampling_defaults.py:1-80
Testing Your Installation
Unit Tests
Run the deterministic unit test suite (no backend required):
python -m pytest tests/unit/ -v --tb=short
Sources: CONTRIBUTING.md:18-25
Integration Tests
Integration tests require a running backend. To exclude them from a run, or to run the unit suite with coverage instead:
# Skip integration tests
python -m pytest tests/ -m "not integration"
# Run with coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing
Troubleshooting
Common Issues
| Issue | Solution |
|---|---|
| ModuleNotFoundError: forge | Run pip install forge or check virtual environment |
| Backend connection refused | Verify backend is running on correct port |
| Model not found (Ollama) | Run ollama pull <model-name> |
| VRAM out of memory | Enable KV cache quantization or use smaller model |
Backend Health Check
Verify backend connectivity:
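import asyncio
import httpx

async def check_backend():
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            # /props is served by llama-server and llamafile
            resp = await client.get("http://localhost:8080/props")
            if resp.status_code == 200:
                print("Backend is ready")
            else:
                print(f"Backend responded with HTTP {resp.status_code}")
    except httpx.ConnectError:
        print("Backend not reachable")

asyncio.run(check_backend())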
Sources: src/forge/server.py:250-280
Next Steps
After installation, refer to:
- User Guide — Complete workflow creation and execution
- Backend Setup Guide — Detailed backend configuration
- Model Guide — Recommended models by hardware tier
- Architecture Decision Records — Design rationale documentation
Source: https://github.com/antoinezambelli/forge / Human Manual
Quick Start Guide
Related topics: WorkflowRunner and Agentic Loop, System Architecture
This guide provides a practical introduction to Forge, a reliability layer for self-hosted LLM tool-calling. Forge elevates an 8B local model to top-tier performance on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction).
Purpose and Scope
The Quick Start Guide covers:
- Installation and environment setup for local LLM backends
- Core concepts including Workflows, ToolDefs, and the WorkflowRunner
- Basic usage patterns for single-step and multi-step agentic workflows
- Integration options for foreign orchestration loops
- Backend management with auto-start capabilities
Forge targets developers building agentic applications that require structured tool-calling with local models. It works with llama.cpp-based backends (llama-server, llamafile) and Ollama.
Sources: README.md:1-20
Installation
Prerequisites
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.12+ | Modern syntax required (type unions with `\|`) |
| pip | Latest | For package installation |
| LLM Backend | llama.cpp / Ollama | For inference |
Setup Commands
git clone https://github.com/antoinezambelli/forge.git
cd forge
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# or: .venv\Scripts\activate # Windows
pip install -e ".[dev]"
Sources: CONTRIBUTING.md:1-15
Core Concepts
Architecture Overview
graph TD
A[User Input] --> B[WorkflowRunner]
B --> C[LLM Client]
C --> D[Tool Call Response]
D --> E[Guardrails Check]
E -->|execute| F[Tool Execution]
F --> G[Context Manager]
G -->|compact| B
E -->|retry| C
E -->|fatal| H[Error Handling]
Workflow
The Workflow is the central definition for an agentic task. It binds together:
| Component | Type | Purpose |
|---|---|---|
| name | str | Workflow identifier |
| description | str | Human-readable description |
| tools | dict[str, ToolDef] | Tool name → definition mapping |
| required_steps | list[str] | Tools that must execute before terminal |
| terminal_tool | str | Tool that ends the workflow |
| system_prompt_template | str | System prompt for the LLM |
Sources: src/forge/core/workflow.py:1-50
ToolDef and ToolSpec
ToolSpec defines the schema exposed to the LLM:
class ToolSpec(BaseModel):
    name: str
    description: str
    parameters: type[BaseModel]  # Pydantic model

ToolDef binds the schema to its Python implementation:

@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)
The prerequisites field enables conditional dependencies:
- str: "if you call this tool, you must have called tool X first"
- dict: {"tool": "read_file", "match_arg": "path"}, an arg-matched prerequisite
Sources: src/forge/core/workflow.py:60-90
WorkflowRunner
The WorkflowRunner manages the full lifecycle:
- System prompt injection
- Tool execution and result handling
- Context compaction
- Guardrail enforcement
- Multi-turn conversation state
class WorkflowRunner:
    def __init__(self, client, context_manager):
        ...

    async def run(self, workflow, user_message):
        ...
Sources: src/forge/core/runner.py
ContextManager and Budget Modes
Forge provides VRAM-aware context management through budget modes:
| Mode | Behavior |
|---|---|
| BudgetMode.MANUAL | User-specified token budget |
| BudgetMode.FORGE_FAST | VRAM-optimized fast inference budget |
| BudgetMode.FORGE_DEEP | Extended context for complex reasoning |
The ContextManager resolves budgets at runtime based on the backend:
async def resolve_budget(self, mode: BudgetMode, manual_tokens: int | None = None) -> int:
if mode == BudgetMode.MANUAL:
if self._backend == "ollama":
return manual_tokens
return await self.get_server_context()
Sources: src/forge/server.py:80-120
Quick Start Example
Basic Single-Tool Workflow
import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())
Sources: README.md:20-60
Key Components Explained
| Component | Import | Purpose |
|---|---|---|
| OllamaClient | forge | LLM backend adapter for Ollama |
| TieredCompact | forge | Context compaction strategy |
| ContextManager | forge | Token budget management |
| WorkflowRunner | forge | Orchestrates the agent loop |
Multi-Step Workflows
For workflows requiring sequential tool execution:
# Define multi-step workflow with prerequisites
workflow = Workflow(
name="research_assistant",
description="Research and answer questions",
tools={
"search": ToolDef(spec=search_spec, callable=do_search),
"lookup": ToolDef(spec=lookup_spec, callable=do_lookup),
"answer": ToolDef(spec=answer_spec, callable=final_answer),
},
required_steps=["search", "lookup"],
terminal_tool="answer",
)
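The required_steps list enforces that search and lookup must execute before answer. Calling the terminal tool before those steps have run is not executed; the guardrails return a step_blocked result with a nudge that tells the model which steps are still missing.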
Sources: examples/foreign_loop.py:1-50
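The enforcement rule can be reduced to a small state machine. The class below is a standalone sketch under stated assumptions: `StepGate` is a hypothetical name, not Forge's `StepEnforcer`, and it ignores the premature-attempt limit and any ordering between the required steps.

```python
class StepGate:
    """Blocks the terminal tool until every required step has been recorded."""

    def __init__(self, required_steps: list[str], terminal_tool: str) -> None:
        self.required = list(required_steps)
        self.terminal_tool = terminal_tool
        self.completed: set[str] = set()

    def missing(self) -> list[str]:
        return [s for s in self.required if s not in self.completed]

    def check(self, tool: str) -> str:
        if tool == self.terminal_tool and self.missing():
            return "step_blocked"
        return "execute"

    def record(self, tool: str) -> bool:
        """Mark `tool` as executed; return True if the workflow is done."""
        self.completed.add(tool)
        return tool == self.terminal_tool


gate = StepGate(required_steps=["search", "lookup"], terminal_tool="answer")
print(gate.check("answer"))   # step_blocked
gate.record("search")
gate.record("lookup")
print(gate.check("answer"))   # execute
print(gate.record("answer"))  # True
```

Keeping `check` and `record` separate mirrors the foreign-loop pattern in this manual, where a call is validated first and only recorded after it has actually executed.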
Guardrails API
For foreign orchestration loops (non-WorkflowRunner usage), Forge provides standalone guardrails:
Simple API
from forge.guardrails import Guardrails
guardrails = Guardrails(
tool_names=["search", "lookup", "answer"],
required_steps=["search", "lookup"],
terminal_tool="answer",
)

def handle_response(response):
    result = guardrails.check(response)
    if result.action == "fatal":
        return f"FATAL: {result.reason}"
    if result.action in ("retry", "step_blocked"):
        return f"{result.action}: {result.nudge.content[:80]}..."
    # Execute tools
    executed = [tc.tool for tc in result.tool_calls]
    done = guardrails.record(executed)
    return f"executed {executed}" + (" -- DONE" if done else "")
Granular API
Direct access to individual components:
from forge.guardrails import ErrorTracker, ResponseValidator, StepEnforcer
validator = ResponseValidator(
tool_names=["search", "lookup", "answer"],
rescue_enabled=True,
)
enforcer = StepEnforcer(
required_steps=["search", "lookup"],
terminal_tool="answer",
)
errors = ErrorTracker(max_retries=3, max_tool_errors=2)
| Component | Purpose |
|---|---|
| ResponseValidator | Parses tool calls from LLM responses, rescue mode |
| StepEnforcer | Enforces required step sequence |
| ErrorTracker | Tracks retry attempts and tool errors |
Sources: examples/foreign_loop.py:80-150
Respond Tool Pattern
For conversational turns where the model should respond directly:
from forge.tools import RESPOND_TOOL_NAME, respond_spec
guardrails = Guardrails(
tool_names=["search", "lookup", "answer", RESPOND_TOOL_NAME],
required_steps=["search", "lookup"],
terminal_tool="answer",
)

def handle_response_with_respond(response):
    result = guardrails.check(response)
    # Check for respond() call; tool_calls is None on retry/fatal results
    for tc in result.tool_calls or []:
        if tc.tool == RESPOND_TOOL_NAME:
            message = tc.args.get("message", "")
            return f"MODEL SAYS: {message}"
    # Normal tool execution
    ...
Sources: examples/foreign_loop.py:160-200
Backend Auto-Management
Forge can auto-start backends for multi-agent architectures:
import asyncio
from forge import WorkflowRunner
from forge.server import run_with_server
from forge.clients import LlamafileClient, BudgetMode

async def main():
    # async with is only valid inside a coroutine
    async with run_with_server(
        backend="llamafile",
        gguf_path="/path/to/model.gguf",
        budget_mode=BudgetMode.FORGE_FAST,
    ) as (server, ctx):
        client = LlamafileClient(model="my-model")
        runner = WorkflowRunner(client=client, context_manager=ctx)
        # Run workflows...

asyncio.run(main())
Backend Options
| Backend | Model Source | GGUF Support |
|---|---|---|
| ollama | Model name (e.g., ministral-3:8b) | No |
| llamaserver | GGUF file path | Yes |
| llamafile | GGUF file path | Yes |
Sources: src/forge/server.py:1-80
Recommended Sampling Parameters
Forge provides curated sampling defaults for supported models:
from forge.clients import OllamaClient, get_sampling_defaults
# Opt-in to recommended sampling
client = OllamaClient(
model="ministral-3:8b-q4_K_M",
recommended_sampling=True # Raises error for unknown models
)
| Parameter | Source | Verification |
|---|---|---|
| temperature | HF model cards | Per-model verification |
| top_p | HF model cards | Per-model verification |
| top_k | HF model cards | Per-model verification |
| min_p | HF model cards | Per-model verification |
| repeat_penalty | HF model cards | Per-model verification |
Sources: src/forge/clients/sampling_defaults.py:1-50
Testing
Unit Tests (No Backend Required)
# Full suite (865 tests)
python -m pytest tests/unit/ -v --tb=short
# With coverage
python -m pytest tests/unit/ --cov=forge --cov-report=term-missing
# Single file
python -m pytest tests/unit/test_runner.py -v
Integration Tests (Requires Backend)
# Skip integration tests
python -m pytest tests/ -m "not integration"
Sources: CONTRIBUTING.md:15-30
Next Steps
| Topic | Description |
|---|---|
| User Guide | Multi-step workflows, long-running sessions |
| Model Guide | Model-specific configurations |
| Architecture Decisions | Design rationale and ADRs |
| Eval Suite | Performance evaluation methodology |
Sources: README.md:1-20
System Architecture
Related topics: WorkflowRunner and Agentic Loop
Overview
Forge is an LLM agent framework that orchestrates multi-step tool-calling workflows with built-in guardrails, context management, and automatic backend server management. The architecture follows a clean separation of concerns: clients abstract LLM backends, workflows define agent behavior, guardrails enforce execution policies, and the context manager handles token budgeting. Sources: CONTRIBUTING.md
The framework is designed for determinism in unit tests and supports three backend types: Ollama, llama-server, and llamafile. Sources: src/forge/server.py
Project Layout
src/forge/ # Library source
clients/ # LLM backend adapters (one per backend)
core/ # Workflow, runner, messages, steps
context/ # Context management and compaction
prompts/ # Prompt templates and nudges
tests/
unit/ # Deterministic tests
eval/ # Eval harness (requires live backends)
scenarios/ # Eval scenario definitions
dashboard/ # React-based HTML dashboard (separate npm build)
docs/ # User-facing documentation
decisions/ # Architecture Decision Records (ADRs)
results/ # Eval results and raw data tables
Sources: CONTRIBUTING.md
Core Architecture Components
Architecture Diagram
graph TD
User["User / Application"]
Runner["WorkflowRunner"]
Workflow["Workflow"]
Guardrails["Guardrails"]
ContextMgr["ContextManager"]
Client["LLM Client"]
ServerMgr["ServerManager"]
User --> Runner
Runner --> Workflow
Runner --> Guardrails
Runner --> ContextMgr
Runner --> Client
Client --> ServerMgr
subgraph "forge Library"
Runner
Workflow
Guardrails
ContextMgr
Client
end
subgraph "Backend"
ServerMgr
end
LLM Clients
The client layer abstracts different LLM backends behind a common async interface. Each client handles backend-specific protocol differences.
| Client | Backend | Protocol |
|---|---|---|
| OllamaClient | Ollama | OpenAI-compatible REST |
| LlamafileClient | Llamafile | OpenAI-compatible REST |
| AnthropicClient | Anthropic API | Anthropic native |
| OpenAIClient | OpenAI API | OpenAI native |
Sources: CONTRIBUTING.md
#### Sampling Defaults
Each client can optionally apply recommended sampling parameters sourced from HuggingFace model cards. The policy layer provides four-quadrant behavior:
| strict | Model in map | Behavior |
|---|---|---|
| True | Yes | Return dict |
| True | No | Raise UnsupportedModelError |
| False | Yes | One-shot INFO log; return {} |
| False | No | Return {} (silent) |
Sources: src/forge/clients/sampling_defaults.py
Clients no longer ship hardcoded temperature defaults. With recommended_sampling=False (default), forge sends nothing and the backend's default applies. Sources: CHANGELOG.md
Workflow System
The Workflow class defines an agent's behavior declaratively:
workflow = Workflow(
name="weather",
description="Look up weather for a city.",
tools={
"get_weather": ToolDef(
spec=ToolSpec(...),
callable=get_weather,
),
},
required_steps=[],
terminal_tool="get_weather",
system_prompt_template="You are a helpful assistant. Use the available tools.",
)
Sources: README.md
#### Tool Definition
ToolDef binds a tool schema to its implementation:
@dataclass
class ToolDef:
    """Binds a tool schema to its implementation."""
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)
Prerequisites express conditional dependencies:
- str: Name-only ("read_file"; any prior call satisfies it)
- dict: Arg-matched ({"tool": "read_file", "match_arg": "path"})
Sources: src/forge/core/workflow.py
Guardrails System
Guardrails enforce execution policies through three coordinated components:
graph LR
Response["LLM Response"] --> Validator["ResponseValidator"]
Validator --> Enforcer["StepEnforcer"]
Enforcer --> Tracker["ErrorTracker"]
Validator --> "rescue parsing"
Enforcer --> "required steps"
Tracker --> "retry limits"
#### Guardrails Configuration
| Parameter | Purpose | Default |
|---|---|---|
| tool_names | List of available tools | Required |
| terminal_tool | Final allowed tool | Required |
| required_steps | Ordered prerequisite chain | None |
| max_retries | Total retry attempts | 3 |
| max_tool_errors | Consecutive tool failures | 2 |
| rescue_enabled | Enable XML rescue parsing | True |
| max_premature_attempts | Premature terminal attempts before fatal | 3 |
Sources: src/forge/guardrails/guardrails.py
#### Check Result Actions
The check() method returns a CheckResult with these actions:
| Action | Meaning |
|---|---|
| proceed | Response passes all guardrails |
| retry | Invalid response, apply nudge and retry |
| step_blocked | Missing required step |
| fatal | Max retries exceeded |
Context Manager
The ContextManager handles token budgeting and context compaction to prevent context overflow during long conversations. Sources: CONTRIBUTING.md
#### Budget Resolution
Budget is resolved based on the mode:
| Mode | Resolution Strategy |
|---|---|
| MANUAL | Use manual_tokens parameter or query server |
| FORGE_FAST | Server-reported context / 4 |
| FORGE_BALANCED | Server-reported context / 2 |
| FORGE_DEEP | Server-reported context * 3 / 4 |
For Ollama backends, the context length is obtained from ollama show. For llama-server/llamafile, a /props query retrieves the actual n_ctx. Sources: src/forge/server.py
Server Manager
ServerManager handles lifecycle management of backend servers (llama-server and llamafile only; Ollama is managed externally).
server = ServerManager(backend="llamaserver", port=8080)
context, ctx_mgr = await server.start_with_budget(
    model="qwen3:8b-q4_K_M",
    budget_mode=BudgetMode.FORGE_FAST,
    client=client,
)
#### Server Configuration Parameters
| Parameter | Description | Backend |
|---|---|---|
| model | Model identity for server | All |
| gguf_path | Path to GGUF file | llamaserver/llamafile |
| mode | Operation mode | All |
| extra_flags | Additional CLI flags | llamaserver/llamafile |
| ctx_override | Override context length (-c value) | llamaserver/llamafile |
| cache_type_k | KV cache quantization type for keys | llamaserver |
| cache_type_v | KV cache quantization type for values | llamaserver |
| n_slots | Concurrent slots count | llamaserver |
| kv_unified | Single unified KV cache | llamaserver |
The server reuses an existing process if the same configuration is requested, avoiding unnecessary restarts. Sources: src/forge/server.py
Execution Flow
sequenceDiagram
participant User
participant Runner
participant Workflow
participant Guardrails
participant Context
participant Client
User->>Runner: run(workflow, input)
Runner->>Context: begin_session()
Runner->>Workflow: Get system prompt
Runner->>Client: send(messages)
loop Until terminal or max iterations
Client-->>Runner: LLMResponse
Runner->>Guardrails: check(response)
Guardrails-->>Runner: CheckResult
alt proceed
Runner->>Runner: Execute tools
Runner->>Context: append(messages)
Runner->>Client: send(messages)
else retry
Runner->>Runner: Apply nudge, retry
else fatal
Runner->>User: Return error
end
end
Runner-->>User: Final result
Workflow Runner Integration
The WorkflowRunner orchestrates all components:
async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True,
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192,
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")
Sources: README.md
Design Principles
Async-First
All client methods and the runner are async, enabling efficient I/O handling across multiple concurrent requests. Sources: CONTRIBUTING.md
Type Safety
Pydantic is used for tool parameter schemas and response validation, ensuring runtime type safety for tool arguments. Sources: CONTRIBUTING.md
Modern Python
The codebase targets Python 3.12+ and uses modern syntax including:
- Type unions with | (e.g., str | None)
- dataclass decorators
- field(default_factory=list) patterns
Sources: CONTRIBUTING.md
Guardrails Configuration Examples
# Basic guardrails
guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
# With respond tool for middleware
respond_guardrails = Guardrails(
    tool_names=["search", "lookup", "answer", RESPOND_TOOL_NAME],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
Sources: examples/foreign_loop.py
Related Documentation
- User Guide — Multi-step workflows and backend auto-management
- Model Guide — Model recommendations by tier
- Architecture Decisions — Design rationale for significant changes
Sources: CONTRIBUTING.md
Module Structure and API
Related topics: System Architecture, WorkflowRunner and Agentic Loop
Overview
The forge repository implements a modular LLM orchestration framework designed to handle multi-step tool-calling workflows with built-in guardrails, context management, and support for multiple LLM backends. The architecture follows a clean separation of concerns with distinct modules for client handling, workflow orchestration, guardrails enforcement, and server management.
Core Architecture
The forge library is organized into a layered architecture that separates backend communication, workflow definition, execution enforcement, and context management.
graph TD
A[User Code] --> B[WorkflowRunner]
B --> C[LLM Clients]
C --> D[llama.cpp / Ollama / Llamafile]
B --> E[Guardrails]
B --> F[ContextManager]
E --> G[ResponseValidator]
E --> H[StepEnforcer]
E --> I[ErrorTracker]
Module Hierarchy
| Module | Purpose | Key Classes |
|---|---|---|
| forge.core | Core workflow orchestration | Workflow, WorkflowRunner, ToolSpec, ToolDef, ToolCall |
| forge.clients | LLM backend adapters | OllamaClient, LlamafileClient, LlamaServerClient |
| forge.guardrails | Response validation and enforcement | Guardrails, ResponseValidator, StepEnforcer, ErrorTracker |
| forge.context | Token budget and context management | ContextManager, TieredCompact |
| forge.server | Backend server lifecycle management | ServerManager, BudgetMode |
Sources: src/forge/__init__.py
Tool Definition API
ToolSpec
The ToolSpec class defines the interface for tools that the LLM can invoke. It wraps a Pydantic model representing the tool's parameters.
from pydantic import BaseModel

class ToolSpec(BaseModel):
    name: str
    description: str
    parameters: type[BaseModel]
Construction from OpenAI Schema:
Tools can be defined from an OpenAI-style JSON Schema:
tool_spec = ToolSpec.from_openai_schema(
    name="get_weather",
    description="Get current weather for a city",
    schema={
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"],
    },
)
Sources: src/forge/core/workflow.py:1-50
ToolDef
The ToolDef dataclass binds a tool schema to its implementation callable, along with prerequisites:
@dataclass
class ToolDef:
    spec: ToolSpec
    callable: Callable[..., Any]
    prerequisites: list[str | dict[str, str]] = field(default_factory=list)
Prerequisites Syntax:
Prerequisites express conditional dependencies between tool calls:
| Type | Example | Behavior |
|---|---|---|
| String (name-only) | "read_file" | Any prior call to read_file satisfies it |
| Dict (arg-matched) | {"tool": "read_file", "match_arg": "path"} | Prior call with same path value required |
tool_def = ToolDef(
    spec=tool_spec,
    callable=get_weather_function,
    prerequisites=[{"tool": "search", "match_arg": "query"}],
)
Sources: src/forge/core/workflow.py:52-72
Workflow Definition API
Workflow Class
The Workflow class is the central configuration object for a multi-step LLM task:
workflow = Workflow(
    name="weather",
    description="Look up weather for a city",
    tools={
        "get_weather": ToolDef(spec=..., callable=get_weather)
    },
    required_steps=["search", "lookup"],
    terminal_tool="answer",
    system_prompt_template="You are a helpful assistant.",
)
Key Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Workflow identifier |
| description | str | Yes | Human-readable description for the LLM |
| tools | dict[str, ToolDef] | Yes | Map of tool name to ToolDef |
| required_steps | list[str] | No | Tools that must be called before terminal_tool |
| terminal_tool | str | Yes | Tool(s) that can end the workflow |
| system_prompt_template | str | No | System prompt injected into context |
Sources: src/forge/core/workflow.py
LLM Client API
Client Architecture
Forge provides backend-agnostic client adapters that implement a common interface:
graph LR
A[WorkflowRunner] --> B[Client Interface]
B --> C[OllamaClient]
B --> D[LlamafileClient]
B --> E[LlamaServerClient]
OllamaClient
client = OllamaClient(
    model="ministral-3:8b-instruct-2512-q4_K_M",
    recommended_sampling=True,
)
Sampling Defaults
Per-model recommended sampling parameters are managed through sampling_defaults.py:
def apply_sampling_defaults(
    model: str,
    *,
    strict: bool,
) -> dict[str, float | int]:
    """Apply the recommended-sampling policy for model."""
Sampling Policy Quadrant:
| strict | Model in Map | Behavior |
|---|---|---|
| True | Yes | Return dict copy |
| True | No | Raise UnsupportedModelError |
| False | Yes | One-shot INFO log; return {} |
| False | No | Return {} (silent) |
Sources: src/forge/clients/sampling_defaults.py:1-80
ToolCall Response Model
The ToolCall class represents a validated tool invocation returned by an LLM client:
from pydantic import BaseModel

class ToolCall(BaseModel):
    tool: str
Additional fields may be populated by client implementations (e.g., args, reasoning).
Sources: src/forge/core/workflow.py:74-77
Guardrails API
The guardrails system provides middleware for orchestrating LLM responses with built-in validation, step enforcement, and error handling.
Guardrails Class
The main entry point for the guardrails system:
guardrails = Guardrails(
    tool_names=["search", "lookup", "answer"],
    required_steps=["search", "lookup"],
    terminal_tool="answer",
    max_retries=3,
    max_tool_errors=2,
    rescue_enabled=True,
    max_premature_attempts=3,
)
Constructor Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| tool_names | list[str] | Required | Valid tool names for this workflow |
| required_steps | list[str] | None | Tools that must be called before terminal_tool |
| terminal_tool | `str \| frozenset` | Required | Tool(s) that can end the workflow |
| max_retries | int | 3 | Consecutive bad responses before fatal |
| max_tool_errors | int | 2 | Consecutive tool failures before exhaustion |
| rescue_enabled | bool | True | Attempt to parse tool calls from plain text |
| max_premature_attempts | int | 3 | Premature terminal attempts before fatal |
| retry_nudge | Callable[[str], str] | None | Custom nudge for bare text responses |
Sources: src/forge/guardrails/guardrails.py:1-80
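The two consecutive-failure limits (`max_retries` and `max_tool_errors`) can be pictured as counters that reset on success. The sketch below is a hypothetical stand-in, not forge's `ErrorTracker`: the class name `RetryCounter`, its method names, and the choice to report exhaustion as soon as the count reaches the limit are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RetryCounter:
    """Hypothetical counter pair mirroring max_retries and max_tool_errors."""
    max_retries: int = 3
    max_tool_errors: int = 2
    bad_responses: int = 0
    tool_errors: int = 0

    def record_response(self, valid: bool) -> bool:
        """Track consecutive bad responses; return False once the limit is hit."""
        # A valid response resets the streak; an invalid one extends it.
        self.bad_responses = 0 if valid else self.bad_responses + 1
        return self.bad_responses < self.max_retries

    def record_tool_result(self, ok: bool) -> bool:
        """Track consecutive tool failures; return False once the limit is hit."""
        self.tool_errors = 0 if ok else self.tool_errors + 1
        return self.tool_errors < self.max_tool_errors


counter = RetryCounter()
still_ok = counter.record_response(valid=False)  # first bad response
```

Because the counters track consecutive failures, a single valid response or successful tool call restores the full allowance.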
CheckResult
The return type of Guardrails.check():
class CheckResult:
    action: Literal["execute", "retry", "step_blocked", "fatal"]
    tool_calls: list[ToolCall] | None
    nudge: Nudge | None
    reason: str | None
Action Meanings:
| Action | Description |
|---|---|
| execute | Safe to proceed; tool_calls contains valid calls |
| retry | Invalid response; inject nudge and retry |
| step_blocked | Attempted terminal tool before required steps |
| fatal | Max retries exhausted; reason contains explanation |
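To make the `execute`, `step_blocked`, and `fatal` outcomes concrete, here is a small self-contained sketch of step enforcement. It is not forge's `StepEnforcer`: the class name `StepGate`, its fields, and the rule that the attempt which reaches `max_premature_attempts` is the fatal one are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StepGate:
    """Hypothetical gate: required steps must run before the terminal tool."""
    required_steps: list[str]
    terminal_tool: str
    max_premature_attempts: int = 3
    completed: set[str] = field(default_factory=set)
    premature_attempts: int = 0

    def check(self, tool: str) -> str:
        """Return the action for a proposed tool call."""
        if tool != self.terminal_tool:
            return "execute"
        missing = [s for s in self.required_steps if s not in self.completed]
        if not missing:
            return "execute"
        # The terminal tool was requested too early.
        self.premature_attempts += 1
        if self.premature_attempts >= self.max_premature_attempts:
            return "fatal"
        return "step_blocked"

    def record(self, tool: str) -> None:
        """Mark a tool as executed so later checks can see it."""
        self.completed.add(tool)


gate = StepGate(required_steps=["search", "lookup"], terminal_tool="answer")
first_action = gate.check("answer")  # terminal tool requested before any step
```

In a real loop the `step_blocked` result would be paired with a nudge telling the model which steps are still missing.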
Two-Method Guardrails API
# Inside your own loop function, after each LLM response
result = guardrails.check(response)
if result.action == "fatal":
    return f"FATAL: {result.reason}"
if result.action in ("retry", "step_blocked"):
    return f"{result.action}: {result.nudge.content}"
# result.action == "execute"
# Run tools yourself, then record results
tool_calls = result.tool_calls
executed = [tc.tool for tc in tool_calls]
done = guardrails.record(executed)
Sources: src/forge/guardrails/guardrails.py:82-130
Granular API
For advanced use cases, individual guardrail components can be used directly:
from forge.guardrails import ResponseValidator, StepEnforcer, ErrorTracker
validator = ResponseValidator(
    tool_names=["search", "lookup", "answer"],
    rescue_enabled=True,
)
enforcer = StepEnforcer(
    required_steps=["search", "lookup"],
    terminal_tool="answer",
)
errors = ErrorTracker(max_retries=3, max_tool_errors=2)
Server Management API
ServerManager
The ServerManager class handles lifecycle management for llama.cpp-based backends:
server = ServerManager(backend="llamaserver", port=8080)
Backend Options:
| Backend | Description |
|---|---|
| "ollama" | Ollama server (model name, no GGUF path) |
| "llamaserver" | llama.cpp server via llama-server |
| "llamafile" | Mozilla llamafile binary |
Budget Resolution
The resolve_budget() method determines context length based on mode:
async def resolve_budget(
    self,
    mode: BudgetMode,
    manual_tokens: int | None = None,
) -> int:
    ...  # signature only; body omitted
| Mode | Behavior |
|---|---|
| MANUAL | Use manual_tokens directly |
| FORGE_FAST / FORGE_DEEP | Query server /props for context |
Sources: src/forge/server.py:1-100
Context Management
ContextManager
Token budget management for long-running conversations:
ctx = ContextManager(
    strategy=TieredCompact(keep_recent=2),
    budget_tokens=8192,
)
BudgetMode Enum
from enum import Enum

class BudgetMode(Enum):
    MANUAL = "manual"
    FORGE_FAST = "forge_fast"
    FORGE_DEEP = "forge_deep"
Sources: src/forge/server.py
Error Types
Forge defines custom exceptions for specific error conditions:
class UnsupportedModelError(Exception):
    """Raised when strict sampling defaults are requested for unknown models."""
    pass
Additional error types in errors.py:
| Error | Use Case |
|---|---|
| BudgetResolutionError | Server unreachable or missing n_ctx |
| BackendError | Backend communication failures |
Sources: src/forge/errors.py
Quick Start Example
import asyncio

from pydantic import BaseModel, Field

from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())
Sources: README.md
Summary
The forge module structure provides:
- Tool Definition: ToolSpec and ToolDef for declaring LLM-callable functions with prerequisites
- Workflow Orchestration: Workflow and WorkflowRunner for managing multi-step tasks
- Client Abstraction: Backend-agnostic clients with sampling defaults
- Guardrails Middleware: Built-in validation, step enforcement, and error handling
- Server Management: Lifecycle control for llama.cpp backends
- Context Management: Token budget and compaction strategies
Sources: src/forge/__init__.py
Architecture Decision Records
Related topics: System Architecture
Overview
Architecture Decision Records (ADRs) serve as the authoritative documentation for significant design choices within the forge project. They capture the *why* behind implementation decisions, enabling current and future contributors to understand the reasoning without reconstructing the original context.
The forge project stores ADRs in docs/decisions/, using a numbered naming convention (e.g., 001-ablation-framework.md, 011-guardrail-middleware.md, 013-text-response-intent.md). This numbering scheme allows for easy chronological tracking and establishes precedent relationships between decisions.
Purpose and Scope
ADRs in forge address several critical aspects:
| Category | Description | Example Documents |
|---|---|---|
| Framework Design | Ablation study methodology and tooling | 001-ablation-framework.md |
| Middleware Patterns | Guardrail implementation and composition | 011-guardrail-middleware.md |
| Response Handling | Intent classification for text responses | 013-text-response-intent.md |
| Backend Integration | LLM client configuration and sampling defaults | sampling_defaults.py |
| Server Management | Context resolution and budget modes | server.py |
Each ADR documents not only the chosen approach but also considered alternatives and the tradeoffs that influenced the final decision. This creates a historical record that prevents repeated debates over settled questions while enabling informed reconsideration when circumstances change.
ADR Contribution Workflow
According to the contribution guidelines, the process for introducing a new architecture decision follows a structured review pattern:
graph TD
A[Identify Design Decision] --> B[Review Existing ADRs]
B --> C{Decision Already Documented?}
C -->|Yes| D[Reference Existing ADR]
C -->|No| E[Draft New ADR]
E --> F[Propose ADR Format]
F --> G[Review Against Project Standards]
G --> H[Merge and Publish]
style A fill:#e1f5fe
style H fill:#c8e6c9
The contribution workflow integrates with the broader project development cycle:
- Proposal Phase: Before implementing significant changes, contributors should draft an ADR following the established format
- Review Phase: The ADR undergoes peer review alongside code review
- Adoption Phase: Once approved, the ADR becomes the reference for implementation decisions
- Maintenance Phase: ADRs may be updated if subsequent decisions supersede them
Sources: CONTRIBUTING.md:1-50
Core Architecture Components
Workflow Engine
The Workflow and WorkflowRunner classes form the central orchestration layer. A workflow defines the available tools, required execution steps, and terminal conditions.
graph TD
subgraph Workflow Definition
W[Workflow] --> TD[Tool Definitions]
W --> RS[Required Steps]
W --> TT[Terminal Tool]
W --> SP[System Prompt Template]
end
subgraph Execution Layer
WR[WorkflowRunner] --> CM[Context Manager]
WR --> GR[Guardrails]
WR --> CL[LLM Client]
end
subgraph Tool Layer
TC[ToolCall] --> T[Tool Execution]
T --> TR[Tool Response]
end
WR --> TC
TC -->|Result| CM
CM -->|Context| WR
style W fill:#fff3e0
style WR fill:#e3f2fd
The ToolDef dataclass binds tool schemas to implementations, while ToolSpec defines the JSON Schema for parameter validation. Tool calls are represented as ToolCall objects containing the tool name and arguments.
Sources: src/forge/core/workflow.py:1-100
Guardrail Middleware
The guardrail system provides a composable validation layer that intercepts LLM responses before tool execution:
graph LR
LLM[LLM Response] --> GR[Guardrails.check]
GR --> RV[ResponseValidator]
GR --> SE[StepEnforcer]
GR --> ET[ErrorTracker]
RV -->|Valid| TC[ToolCalls]
RV -->|Invalid| NR[Retry Nudge]
SE -->|Correct Order| TC
SE -->|Wrong Order| SB[Step Blocked]
ET -->|OK| TC
ET -->|Max Errors| FT[Fatal]
style GR fill:#fce4ec
style TC fill:#c8e6c9
The Guardrails class orchestrates three sub-components:
| Component | Responsibility | Key Parameters |
|---|---|---|
| ResponseValidator | Parses tool calls, enables rescue parsing | rescue_enabled, retry_nudge_fn |
| StepEnforcer | Ensures required steps precede terminal tool | required_steps, max_premature_attempts |
| ErrorTracker | Tracks consecutive errors and retries | max_retries, max_tool_errors |
Sources: src/forge/guardrails/guardrails.py:1-100
Server Management and Budget Resolution
The ServerManager handles lifecycle management for llama.cpp-based backends, while the ContextManager implements token budget strategies:
graph TD
SM[ServerManager] --> BM[BudgetMode]
BM -->|FORGE_FAST| FT[Fast Budget]
BM -->|FORGE_BALANCED| BT[Balanced Budget]
BM -->|FORGE_DEEP| DT[Deep Budget]
BM -->|MANUAL| MT[Manual Tokens]
CM[ContextManager] --> TC[TieredCompact]
CM --> SC[SimpleCompact]
SM -->|Context Query| Props[/props endpoint]
Props -->|n_ctx| CM
Budget resolution follows platform-specific paths:
- Ollama: Uses the manual_tokens parameter for MANUAL mode
- Llamafile/Llama Server: Queries the /props endpoint for the server-configured context length
Sources: src/forge/server.py:1-100
Sampling Configuration System
The sampling defaults system separates lookup from policy, enabling fine-grained control over model parameters:
graph TD
subgraph Lookup Layer
GM[get_model_defaults] --> MAP[MODEL_SAMPLING_DEFAULTS]
end
subgraph Policy Layer
AS[apply_sampling_defaults] --> |strict=True| KR[Known + Known]
AS --> |strict=False| KU[Known + Unknown]
KR -->|In Map| ReturnDict[Return Dict]
KU -->|Not In Map| InfoLog[INFO Log Once]
end
subgraph Client Integration
OC[OllamaClient] --> AS
LC[LlamafileClient] --> AS
AC[AnthropicClient] --> AS
end
The two-function design (get_sampling_defaults for pure lookup, apply_sampling_defaults for policy) ensures that:
- Unknown models don't cause errors when strict=False
- Known models log a one-time INFO message when not opted in
- Explicit opt-in via recommended_sampling=True enables strict behavior
Sources: src/forge/clients/sampling_defaults.py:1-100
Proxy Server Architecture
The ProxyServer provides a forwarding layer with additional control features:
graph TD
subgraph Proxy Layer
PS[ProxyServer] --> SF[Serialize Flag]
PS --> RT[Retry Logic]
PS --> RC[Rescue Parser]
end
subgraph Backend Routing
PS --> Ollama[Ollama Backend]
PS --> Llama[Llama Backend]
PS --> LLF[Llamafile Backend]
end
subgraph Configuration
SF --> |serialize=True| Serial[Serialize Requests]
SF --> |serialize=False| Parallel[Parallel Requests]
RT --> |max_retries=N| RetryN[N Attempts]
end
Key proxy options include:
| Flag | Default | Purpose |
|---|---|---|
| --host | 127.0.0.1 | Proxy listen address |
| --port | 8081 | Proxy listen port |
| --serialize | None | Request serialization control |
| --max-retries | 3 | Retries per request |
| --no-rescue | False | Disable rescue parsing |
Sources: src/forge/proxy/__main__.py:1-80
ADR Format and Standards
Each ADR in the forge repository follows a consistent structure:
- Title: Descriptive name with ADR number
- Status: Proposed, Accepted, Deprecated, or Superseded
- Context: Background and problem statement
- Decision: The chosen approach with rationale
- Consequences: Benefits, drawbacks, and tradeoffs
- Related Decisions: Links to dependent or related ADRs
This format ensures that future maintainers can quickly assess whether an ADR is current and understand the full context of each decision.
Versioning and Evolution
The CHANGELOG maintains a parallel record of implementation milestones, cross-referenced with ADRs. Major architectural changes increment the minor version number, while bug fixes increment the patch version (semantic versioning).
Changes that require ADR updates include:
- New LLM backend support
- Guardrail algorithm modifications
- Context management strategy changes
- Tool execution model alterations
- Breaking API changes
Sources: CHANGELOG.md:1-100
Best Practices for ADR Readers
When reviewing ADRs to understand forge's architecture:
- Start with the index: The docs/decisions/ directory lists all ADRs chronologically
- Check status: Deprecated ADRs indicate historical context, not current practice
- Cross-reference implementations: Source files in src/forge/ implement ADR decisions
- Review CHANGELOG: Implementation dates and version numbers provide temporal context
- Examine tests: Unit tests in tests/unit/ validate ADR-enforced behaviors
Sources: CONTRIBUTING.md:1-50
WorkflowRunner and Agentic Loop
Related topics: System Architecture
Overview
The WorkflowRunner is the central execution engine in the Forge framework, implementing an agentic loop that orchestrates multi-step tool-calling workflows with an LLM backend. It manages the complete lifecycle of a workflow: from initializing messages, through iterative LLM inference and tool execution, to context management and termination.
Core responsibilities:
- Building initial message lists (system prompt + user input)
- Coordinating LLM inference with streaming or batch responses
- Validating and executing tool calls returned by the LLM
- Managing context budget through the ContextManager
- Enforcing required step sequences via the StepEnforcer
- Handling retries for malformed responses
- Terminating on terminal tool execution or max iterations
Sources: src/forge/core/runner.py:1-50
Guardrails Middleware for External Loops
Forge's Guardrails Middleware provides a composable reliability layer that can be integrated into any external orchestration loop. Rather than requiring adoption of the full WorkflowRunner, external projects can embed forge's retry nudges, rescue parsing, step enforcement, and error tracking directly within their own agent execution frameworks.
Source: https://github.com/antoinezambelli/forge / Human Manual
Proxy Server Setup
Related topics: Backend Clients
Overview
The forge proxy server is an OpenAI-compatible HTTP proxy that transparently applies forge's guardrail stack to any backend that speaks the Chat Completions API. It acts as a drop-in replacement for local model servers, enabling existing OpenAI-compatible clients to benefit from forge's reliability layer without code changes.
Sources: README.md
Architecture
The proxy operates in two distinct modes, determined at startup:
graph TD
subgraph "Managed Mode"
P["ProxyServer<br/>:8081"] --> SM["ServerManager"]
SM --> BE["llama-server<br/>llamafile<br/>ollama<br/>:8080"]
end
subgraph "External Mode"
P2["ProxyServer<br/>:8081"] --> BU["User-managed<br/>Backend<br/>:8080"]
end
C["OpenAI Client"] --> P
C2["OpenAI Client"] --> P2
Managed Mode
In managed mode, forge starts and controls the backend process lifecycle. The ServerManager class handles:
- Backend binary discovery and execution
- Model loading and initialization
- Health verification via /props endpoint polling
- Graceful shutdown and restart
Sources: src/forge/proxy/proxy.py:1-40
External Mode
In external mode, the proxy connects to a user-managed backend. This is useful when:
- The backend runs on a different machine or container
- Custom backend configurations are required
- The backend is managed by an external orchestration system
Sources: src/forge/proxy/proxy.py:35-45
Supported Backends
| Backend | Description | Requirements |
|---|---|---|
| llamaserver | llama.cpp's HTTP server | Local GGUF model file |
| llamafile | Mozilla's single-file model executable | Single-file executable |
| ollama | Ollama local inference server | Ollama runtime + model pulled |
Sources: src/forge/proxy/__main__.py:25-29
CLI Usage
Basic Invocation
# External mode — you manage the backend
python -m forge.proxy --backend-url http://localhost:8080 --port 8081
# Managed mode — forge starts llama-server and proxy together
python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081
# Managed mode with ollama
python -m forge.proxy --backend ollama --model llama3.2 --port 8081
Sources: README.md
Command-Line Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| --backend-url | string | - | External backend URL (mutually exclusive with --backend) |
| --backend | choice | - | Backend type: llamaserver, llamafile, ollama |
| --model | string | - | Model name (required for ollama) |
| --gguf | string | - | Path to GGUF file (llamaserver/llamafile) |
| --backend-port | int | 8080 | Backend port for managed mode |
| --budget-mode | choice | backend | Context budget: backend, manual, forge-full, forge-fast |
| --budget-tokens | int | - | Manual token budget override |
| --extra-flags | list | - | Additional backend CLI flags |
| --host | string | 127.0.0.1 | Proxy listen host |
| --port | int | 8081 | Proxy listen port |
| --serialize | flag | - | Force request serialization |
| --no-serialize | flag | - | Disable request serialization |
| --max-retries | int | 3 | Max retries per request |
| --no-rescue | flag | - | Disable rescue parsing |
| -v, --verbose | flag | - | Enable debug logging |
Sources: src/forge/proxy/__main__.py:13-53
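As a rough picture of how such a command line could be declared, the sketch below rebuilds a subset of the table with the standard-library `argparse` module. It is a reconstruction from the table, not the code in `forge/proxy/__main__.py`: the function name `build_parser`, making the backend choice a required mutually exclusive group, and modelling `--serialize`/`--no-serialize` as one tri-state option are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser covering a subset of the documented proxy flags."""
    parser = argparse.ArgumentParser(prog="forge.proxy")

    # --backend-url and --backend are documented as mutually exclusive.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--backend-url")
    target.add_argument("--backend", choices=["llamaserver", "llamafile", "ollama"])

    parser.add_argument("--model")
    parser.add_argument("--gguf")
    parser.add_argument("--backend-port", type=int, default=8080)
    parser.add_argument(
        "--budget-mode",
        choices=["backend", "manual", "forge-full", "forge-fast"],
        default="backend",
    )
    parser.add_argument("--budget-tokens", type=int)
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8081)
    # Tri-state: None (unset), True (--serialize), False (--no-serialize).
    parser.add_argument(
        "--serialize", action=argparse.BooleanOptionalAction, default=None
    )
    parser.add_argument("--max-retries", type=int, default=3)
    parser.add_argument("--no-rescue", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


args = build_parser().parse_args(["--backend-url", "http://localhost:8080"])
```

Leaving `serialize` as `None` when neither flag is given preserves the distinction between "not specified" and an explicit choice.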
Programmatic API
ProxyServer Class
The ProxyServer class provides a programmatic interface for embedding the proxy in Python applications.
from forge.proxy import ProxyServer
# External mode
proxy = ProxyServer(backend_url="http://localhost:8080")
proxy.start()
print(f"Proxy running at {proxy.url}") # http://127.0.0.1:8081
# ... use proxy ...
proxy.stop()
# Managed mode
proxy = ProxyServer(
    backend="llamaserver",
    gguf="model.gguf",
    budget_mode="forge-fast",
    port=8081,
)
proxy.start()
proxy.stop() # Stops both backend and proxy
Sources: src/forge/proxy/proxy.py:50-75
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| backend_url | `str \| None` | None | External backend URL |
| backend | `str \| None` | None | Backend type: llamaserver, llamafile, ollama |
| model | `str \| None` | None | Model name for ollama |
| gguf | `str \| Path \| None` | None | Path to GGUF file |
| backend_port | int | 8080 | Backend port |
| budget_mode | BudgetMode | BudgetMode.BACKEND | Context budget strategy |
| budget_tokens | int | - | Manual token budget |
| extra_flags | `list[str] \| None` | None | Additional CLI flags |
| host | str | 127.0.0.1 | Listen host |
| port | int | 8081 | Listen port |
| serialize | `bool \| None` | None | Request serialization control |
| max_retries | int | 3 | Max retries per request |
| rescue_enabled | bool | True | Enable rescue parsing |
Sources: src/forge/proxy/proxy.py:56-100
Lifecycle Methods
| Method | Description |
|---|---|
| start() | Start the proxy (blocks until ready, max 120s timeout) |
| stop() | Stop the proxy and managed backend (30s shutdown timeout) |
| url | Property returning the proxy's base URL |
Sources: src/forge/proxy/proxy.py:102-125
Respond Tool Injection
Purpose
Small local models (~8B parameters) cannot reliably choose between text output and tool calls. The proxy automatically injects a synthetic respond tool when tools are present in the request, forcing the model into tool-calling mode.
Behavior
- When the request contains tools, forge injects a respond(message="...") tool into the tools list
- The model calls respond(message="...") instead of producing bare text
- The respond call is stripped from the outbound response
- The client receives a normal text response with finish_reason: "stop"
This keeps the model in tool-calling mode where forge's full guardrail stack applies.
Sources: README.md
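The inject-and-strip behavior described above can be illustrated with a minimal sketch. This is not forge's implementation: the `RESPOND_TOOL` schema and the helper names `inject_respond_tool` and `strip_respond_call` are hypothetical, and the message shapes simply follow the OpenAI chat-completions convention.

```python
import copy
import json

# Hypothetical schema for the synthetic tool; forge's real definition may differ.
RESPOND_TOOL = {
    "type": "function",
    "function": {
        "name": "respond",
        "description": "Reply to the user with a plain-text message.",
        "parameters": {
            "type": "object",
            "properties": {"message": {"type": "string"}},
            "required": ["message"],
        },
    },
}


def inject_respond_tool(request: dict) -> dict:
    """Return a copy of the request with the respond tool appended when tools are present."""
    if not request.get("tools"):
        return request  # no tools in the request: nothing to inject
    patched = copy.deepcopy(request)
    patched["tools"].append(RESPOND_TOOL)
    return patched


def strip_respond_call(choice: dict) -> dict:
    """Turn a lone respond(...) tool call back into an ordinary text completion."""
    calls = choice.get("message", {}).get("tool_calls") or []
    if len(calls) == 1 and calls[0]["function"]["name"] == "respond":
        args = json.loads(calls[0]["function"]["arguments"])
        return {
            "message": {"role": "assistant", "content": args["message"]},
            "finish_reason": "stop",
        }
    return choice  # real tool calls pass through unchanged
```

The sketch copies the request before appending so the caller's original tools list is never mutated.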
sequenceDiagram
participant C as OpenAI Client
participant P as ProxyServer
participant B as Backend
C->>P: POST /v1/chat/completions<br/>(with tools)
P->>P: Inject respond tool
P->>B: Forward request<br/>(tools + respond)
B->>P: respond(message="answer")
P->>P: Strip respond call
P->>C: Normal text response<br/>(finish_reason: "stop")
Context Budget Modes
The proxy supports different strategies for managing context window usage:
| Mode | Description |
|---|---|
| backend | Let the backend manage context (default) |
| manual | Use --budget-tokens for fixed budget |
| forge-full | Full tiered compaction strategy |
| forge-fast | Fast tiered compaction (reduced) |
Sources: src/forge/proxy/__main__.py:35-38
Tiered Compaction
The forge-full and forge-fast modes utilize TieredCompact, a three-phase compaction strategy:
- Truncate — Remove oldest messages
- Drop results — Remove tool result content
- Sliding window — Maintain recent context
Sources: src/forge/proxy/proxy.py:22
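To make the escalation idea concrete, here is a toy sketch of a three-phase compactor. It is an illustration only, not the `TieredCompact` implementation: the function names, the four-characters-per-token estimate, the "drop the oldest half" rule, and the placeholder text are all assumptions chosen to keep the example small.

```python
def estimate_tokens(messages: list) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return sum(len(m.get("content") or "") for m in messages) // 4


def tiered_compact(messages: list, budget: int, keep_recent: int = 2) -> list:
    """Apply progressively more aggressive phases until the history fits the budget."""
    system = [dict(m) for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]
    split = max(len(rest) - keep_recent, 0)
    older, recent = rest[:split], rest[split:]

    def fits(candidate: list) -> bool:
        return estimate_tokens(candidate) <= budget

    if fits(system + older + recent):
        return system + older + recent

    # Phase 1 (truncate): remove the oldest half of the older messages.
    older = older[len(older) // 2:]
    if fits(system + older + recent):
        return system + older + recent

    # Phase 2 (drop results): blank out tool result content in older messages.
    for m in older:
        if m["role"] == "tool":
            m["content"] = "[tool result dropped]"
    if fits(system + older + recent):
        return system + older + recent

    # Phase 3 (sliding window): keep only system messages and the recent tail.
    return system + recent
```

Each phase runs only if the previous one did not bring the history under budget, so cheap reductions are tried before context is discarded wholesale.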
Request Serialization
By default, the proxy decides whether to serialize requests based on backend capabilities. The serialization flags override this behavior:
| Flag | Behavior |
|---|---|
| (none) | Proxy decides based on backend capabilities |
| --serialize | Force sequential request processing |
| --no-serialize | Allow concurrent processing |
Sources: src/forge/proxy/__main__.py:31-34
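One common way to force sequential processing in an async server is a single lock around the request handler. The sketch below shows that general technique; `RequestGate` and `demo` are hypothetical names, and the source does not say how the proxy implements serialization internally. The automatic (no flag) case is not modeled.

```python
import asyncio


class RequestGate:
    """Optionally serialize request handling with a single asyncio lock."""

    def __init__(self, serialize: bool) -> None:
        self._lock = asyncio.Lock() if serialize else None

    async def run(self, handler, *args):
        if self._lock is None:
            return await handler(*args)  # concurrent: no gating at all
        async with self._lock:  # sequential: one request at a time
            return await handler(*args)


async def demo(serialize: bool) -> int:
    """Run three fake requests and report the peak number in flight."""
    gate = RequestGate(serialize)
    in_flight = 0
    peak = 0

    async def handler(_request_id: int) -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for a backend round trip
        in_flight -= 1

    await asyncio.gather(*(gate.run(handler, i) for i in range(3)))
    return peak
```

With `serialize=True` the peak stays at one request in flight; without it, all three overlap.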
Sampling Parameters Pass-Through
The proxy forwards OpenAI-compatible sampling fields directly to the backend without modification:
- temperature
- top_p
- top_k
- min_p
- repeat_penalty
- presence_penalty
- seed
Sources: CHANGELOG.md
To use model-card-recommended sampling in proxy mode:
from forge.clients import get_sampling_defaults
# Look up recommended sampling parameters
sampling = get_sampling_defaults("ministral-3-8b-instruct")
# Include in request body
response = client.post("/v1/chat/completions", json={
"model": "ministral-3-8b-instruct",
"messages": [...],
**sampling
})
Signal Handling
The proxy gracefully handles shutdown signals:
- SIGINT (Ctrl+C): Immediate shutdown
- SIGTERM: Graceful shutdown
The main thread uses a timed sleep loop (time.sleep(0.1)) to allow Python to deliver signals between iterations, ensuring proper shutdown on Windows.
Sources: src/forge/proxy/__main__.py:95-105
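The timed sleep loop can be sketched as follows. This is a generic pattern under assumed names (`state`, `install_handlers`, `wait_for_shutdown`), not the code in `__main__.py`. The short sleep returns control to the interpreter regularly, which gives Python a chance to run pending signal handlers.

```python
import signal
import time

# Shared flag flipped by the signal handler (hypothetical name).
state = {"stop": False}


def _request_stop(signum, frame):
    """Signal handler: only records that shutdown was requested."""
    state["stop"] = True


def install_handlers() -> None:
    """Register the same handler for SIGINT and SIGTERM (main thread only)."""
    signal.signal(signal.SIGINT, _request_stop)
    signal.signal(signal.SIGTERM, _request_stop)


def wait_for_shutdown(poll_interval: float = 0.1) -> None:
    """Block in short sleeps so pending signals are handled between iterations."""
    while not state["stop"]:
        time.sleep(poll_interval)
```

A caller would start the server, call `install_handlers()`, then `wait_for_shutdown()`, and run its cleanup once the loop exits.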
Testing with Smoke Test Script
The repository includes a smoke test at scripts/smoke_test_proxy.py that:
- Starts a mock backend on port 18080
- Launches the proxy in external mode on port 18081
- Verifies health endpoint
- Sends a test chat completion request
- Validates the response structure
python scripts/smoke_test_proxy.py
Sources: scripts/smoke_test_proxy.py
Health Endpoint
The proxy exposes a /health endpoint for monitoring:
curl http://127.0.0.1:8081/health
Sources: scripts/smoke_test_proxy.py:70
Configuration Example: Complete Setup
# Start llama-server with custom flags, proxy it
python -m forge.proxy \
--backend llamaserver \
--gguf ./models/ministral-3-8b-instruct-q8_0.gguf \
--model ministral-3-8b-instruct \
--budget-mode forge-full \
--backend-port 8080 \
--port 8081 \
--host 0.0.0.0 \
--extra-flags --reasoning-format auto \
--verbose
Then configure your client:
import httpx
client = httpx.AsyncClient(
base_url="http://localhost:8081/v1",
timeout=120.0
)
response = await client.post("/chat/completions", json={
"model": "ministral-3-8b-instruct",
"messages": [
{"role": "user", "content": "What's the weather in Paris?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"}
},
"required": ["city"]
}
}
}
]
})
Sources: README.md
Backend Clients
Related topics: Backend Setup Guide, Model Selection Guide
Overview
The Backend Clients subsystem provides a unified abstraction layer over various LLM backends, enabling forge to interact with different inference engines through a consistent interface. This modular design allows users to switch between Ollama, llamafile, and Anthropic backends without modifying workflow code.
Each client handles backend-specific communication protocols, response parsing, streaming, and tool call extraction while exposing a common async API for send operations and context resolution.
Sources: src/forge/clients/base.py:1-50
Architecture
Client Hierarchy
graph TD
A[BaseClient] --> B[OllamaClient]
A --> C[LlamafileClient]
A --> D[AnthropicClient]
E[sampling_defaults.py] --> B
E --> C
E --> D
F[WorkflowRunner] --> B
F --> C
F --> D
All clients inherit from BaseClient, which defines the core async interface including send(), send_stream(), and get_context_length() methods. Backend-specific implementations override these methods to handle vendor-specific APIs and response formats.
Sources: src/forge/clients/base.py:1-100
Common Interface
All clients implement the following async methods:
| Method | Purpose |
|---|---|
| send(messages, tools, **kwargs) | Send a request and receive a complete response |
| send_stream(messages, tools, **kwargs) | Stream responses as an async generator |
| get_context_length() | Query the backend for maximum context window |
| stop() | Stop any ongoing generation |
Sources: src/forge/clients/base.py:50-150
OllamaClient
Purpose
The OllamaClient connects to a local Ollama server instance, supporting both standard models and GGUF-formatted models served through Ollama's model management system.
Sources: src/forge/clients/ollama.py:1-100
Configuration Options
OllamaClient(
model: str, # Model name (e.g., "qwen3:8b-q4_K_M")
base_url: str = "http://localhost:11434",
recommended_sampling: bool = False, # Use verified per-model sampling params
**kwargs # Passed to httpx client
)
Key Features
- Recommended Sampling: When `recommended_sampling=True`, the client retrieves verified sampling parameters from `forge.clients.sampling_defaults` for known models. If a model is not in the map and `strict=True`, an `UnsupportedModelError` is raised.
- Streaming Support: Full streaming support with token-level async generation through `send_stream()`.
- Tool Call Extraction: Parses Ollama's JSON tool call format and converts it to forge's internal `ToolCall` format.
Sources: src/forge/clients/sampling_defaults.py:1-80
LlamafileClient
Purpose
The LlamafileClient communicates with llamafile or llama-server instances, providing support for GGUF models served directly without Ollama's model management layer.
Sources: src/forge/clients/llamafile.py:1-100
Context Resolution
Unlike Ollama, llamafile and llama-server require querying the /props endpoint to determine the configured context length:
async def get_context_length(self) -> int | None:
    """Query the Llamafile /props endpoint for configured context length."""
    base = self.base_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]
    resp = await self._http.get(f"{base}/props")
    data = resp.json()
    n_ctx = data.get("default_generation_settings", {}).get("n_ctx")
    return int(n_ctx) if n_ctx is not None else None
Sources: src/forge/clients/llamafile.py:180-200
Tool Call Modes
The client supports multiple tool call parsing strategies:
| Mode | Description |
|---|---|
| native | Uses backend's native tool call format |
| function | Parses <function=name>...</function> style tags |
| prompt | Extracts tool calls from prompted responses |
Sources: src/forge/clients/llamafile.py:100-180
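For the `function` mode, a tag parser along the following lines would recover the call name and arguments. This is a sketch based only on the tag shape shown in this manual (`<function=name><parameter=key>value</parameter></function>`). The regexes and the `parse_function_tags` helper are assumptions, not forge's parser, and all values are returned as strings.

```python
import re

# Patterns inferred from the documented tag shape; real model output may vary.
FUNCTION_RE = re.compile(r"<function=(?P<name>[\w.-]+)>(?P<body>.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=(?P<key>[\w.-]+)>(?P<value>.*?)</parameter>", re.DOTALL)


def parse_function_tags(text: str) -> list:
    """Extract every <function=...> block as a {"name", "arguments"} dict."""
    calls = []
    for match in FUNCTION_RE.finditer(text):
        arguments = {
            p.group("key"): p.group("value").strip()
            for p in PARAM_RE.finditer(match.group("body"))
        }
        calls.append({"name": match.group("name"), "arguments": arguments})
    return calls
```

Text without any function tags yields an empty list, which a caller could treat as a plain text response.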
AnthropicClient
Purpose
The AnthropicClient integrates with Anthropic's Claude API, enabling forge workflows to leverage Claude Opus, Sonnet, and Haiku models.
Sources: src/forge/clients/anthropic.py:1-100
Key Differences
- No hardcoded temperature defaults — relies on Anthropic API's own defaults
- Supports Anthropic-specific headers and request formatting
- Compatible with tools via Anthropic's tool use API
Sampling Defaults System
Overview
The sampling_defaults module provides verified per-model sampling parameters sourced from HuggingFace model cards. This ensures optimal generation quality for supported models without requiring users to manually tune hyperparameters.
Sources: src/forge/clients/sampling_defaults.py:1-50
Supported Parameters
| Parameter | Description | Typical Range |
|---|---|---|
| temperature | Sampling temperature | 0.0 - 1.0 |
| top_p | Nucleus sampling threshold | 0.0 - 1.0 |
| top_k | Top-k sampling | 1 - 100 |
| min_p | Minimum probability threshold | 0.0 - 1.0 |
| repeat_penalty | Repetition penalty | 0.0 - 2.0 |
| presence_penalty | Presence penalty (OpenAI compat) | -2.0 - 2.0 |
Policy Behavior
The apply_sampling_defaults() function implements a four-quadrant policy:
def apply_sampling_defaults(model: str, *, strict: bool) -> dict[str, float | int]:
    """Apply the recommended-sampling policy for model."""
    in_map = model in MODEL_SAMPLING_DEFAULTS
    if strict:
        if not in_map:
            raise UnsupportedModelError(model)
        return dict(MODEL_SAMPLING_DEFAULTS[model])
    # strict=False: one-shot INFO log if known, else silent
    if in_map and model not in _INFO_LOGGED:
        log.info("Recommended sampling params exist for %r...", model)
        _INFO_LOGGED.add(model)
    return {}
| strict | Model in Map | Behavior |
|---|---|---|
| True | Yes | Return dict copy |
| True | No | Raise UnsupportedModelError |
| False | Yes | One-shot INFO log; return {} |
| False | No | Return {} (silent) |
Sources: src/forge/clients/sampling_defaults.py:60-120
Verified Models
The following model families are currently supported with verified sampling parameters:
- Qwen3 / Qwen3.5 / Qwen3.6
- Qwen3-Coder
- Gemma 4
- Mistral Small 3.2
- Devstral Small 2
- Ministral 3 Instruct + Reasoning
- Mistral Nemo
- Granite 4.0
Each entry includes an inline HuggingFace card URL comment for verification.
Sources: src/forge/clients/sampling_defaults.py:50-80
Tool Call Processing
Extraction Flow
graph TD
A[LLM Response] --> B{Response Type}
B -->|tool_calls| C[extract_tool_call]
B -->|text| D[TextResponse]
C --> E{Tool Call Format}
E -->|OpenAI style| F[Parse name + arguments]
E -->|function tags| G[Parse XML-style tags]
E -->|dict style| H[Parse dict with name field]
F --> I[ToolCall object]
G --> I
H --> I
Supported Formats
| Backend | Format | Example |
|---|---|---|
| Ollama | OpenAI-style function calls | {"name": "get_weather", "arguments": {"city": "Paris"}} |
| Llamafile | Function tags or native | <function=name><parameter=city>Paris</parameter></function> |
| Anthropic | Claude tool_use blocks | {name: "get_weather", input: {city: "Paris"}} |
Sources: src/forge/core/workflow.py:1-50
Proxy Mode Integration
Request Passthrough
When running in proxy mode, the client plumbs OpenAI-compatible body fields through to backends without modification:
# Proxy plumbs these fields through per request:
- temperature
- top_p
- top_k
- min_p
- repeat_penalty
- presence_penalty
- seed
For per-model recommended sampling in proxy mode, the calling client must look up forge.clients.get_sampling_defaults(model) and include the values in the request body.
Sources: src/forge/proxy/__main__.py:1-50
Usage Examples
Basic Workflow with OllamaClient
import asyncio
from forge import OllamaClient, WorkflowRunner, ContextManager, TieredCompact

async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True  # Use verified sampling params
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    # ... run workflows

asyncio.run(main())
Per-Call Sampling Override
response = await client.send(
messages,
tools,
sampling={
"temperature": 0.7,
"top_p": 0.9
}
)
The caller's explicit non-None fields merge with client-level defaults without mutating the original configuration.
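The merge rule can be pictured with a small helper. `merge_sampling` is a hypothetical name used here for illustration. It shows the described semantics (explicit non-None overrides win, defaults are left untouched), not the client's actual code.

```python
from typing import Optional


def merge_sampling(defaults: dict, overrides: Optional[dict]) -> dict:
    """Return a new dict: defaults updated with the caller's non-None fields."""
    merged = dict(defaults)  # copy so the client-level defaults are never mutated
    for key, value in (overrides or {}).items():
        if value is not None:  # None means "no opinion, keep the default"
            merged[key] = value
    return merged
```

For example, merging `{"temperature": 0.7, "top_p": None}` over defaults of `{"temperature": 0.3, "top_p": 0.95}` keeps the default `top_p` and replaces only `temperature`.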
Error Handling
| Error | Cause | Resolution |
|---|---|---|
| UnsupportedModelError | Model not in sampling defaults map | Add model to sampling_defaults.py or pass recommended_sampling=False |
| httpx.HTTPError | Backend unreachable | Verify backend is running on correct port |
| BudgetResolutionError | Cannot determine context length | Check backend /props endpoint returns n_ctx |
Sources: src/forge/server.py:1-50
Backend Server Management
ServerManager Integration
Forge can auto-manage backend servers through ServerManager, which handles starting, stopping, and context resolution:
from forge.server import create_server_and_context
server, ctx = await create_server_and_context(
backend="ollama",
model="qwen3:8b-q4_K_M",
budget_mode=BudgetMode.FORGE_FAST,
client=client,
)
Supported Backends
| Backend | Model Specification | Port | Features |
|---|---|---|---|
| ollama | Model name string | 11434 | Auto model management |
| llamaserver | GGUF file path | 8080 | Direct GGUF serving |
| llamafile | GGUF file path | 8080 | Single-file server |
Sources: src/forge/server.py:50-150
Sources: src/forge/clients/base.py:1-50
Backend Setup Guide
Related topics: Backend Clients, Model Selection Guide
Overview
The Backend Setup Guide covers how to configure, initialize, and manage LLM backend servers within forge. Forge supports multiple backend types—Ollama, llama-server, and llamafile—each with distinct initialization patterns, context management strategies, and operational characteristics. Understanding these backends is essential for running forge's Workflow and WorkflowRunner components effectively.
Forge abstracts backend management through two primary entry points: the ServerManager class (for direct server lifecycle control) and setup_backend() (a high-level async factory function that combines server startup with ContextManager creation). Sources: src/forge/server.py:1-50
Supported Backend Types
Forge supports three backend implementations, each targeting different deployment scenarios:
| Backend | Identity | Configuration | Use Case |
|---|---|---|---|
| ollama | Model name (e.g., qwen3:8b) | Uses Ollama's native model management | Quick local development, model switching |
| llamaserver | GGUF file path | Requires explicit GGUF path and context size | Production GGUF inference with fine control |
| llamafile | GGUF file path | Auto-discovers llamafile runtime | Single-file distribution, portable setups |
The backend type is determined at ServerManager instantiation and cannot be changed afterward. Sources: src/forge/server.py:140-145
Ollama Backend
The Ollama backend leverages Ollama's built-in model management system. Models are pulled and managed through the Ollama CLI rather than requiring manual GGUF file handling.
server = ServerManager(backend="ollama", port=8080)
await server.start(model="qwen3:8b-q4_K_M", gguf_path="", mode="native")
Key constraints:
- Does not accept `gguf_path` (use the `model` parameter instead)
- Requires `model` to be specified
- VRAM cleanup between model switches is handled via `ollama stop`
Sources: src/forge/server.py:170-180
Llama-server and Llamafile Backends
Both llamaserver and llamafile backends operate on GGUF files directly. The server identity is derived from the GGUF path, enabling cache-equality checks for server reuse. Sources: src/forge/server.py:185-200
# Llama-server
server = ServerManager(backend="llamaserver", port=8080)
await server.start(model="identity", gguf_path="/models/qwen3-8b-q4_K_M.gguf", mode="native")
# Llamafile (auto-discovers runtime)
server = ServerManager(backend="llamafile", port=8080)
await server.start(model="identity", gguf_path="/models/llamafile-binary", mode="native")
ServerManager Architecture
The ServerManager class encapsulates all backend server lifecycle operations, providing a unified interface across different backend types.
State Management
graph TD
A[ServerManager.__init__] --> B[_proc: Popen | None]
A --> C[_current_model: str | None]
A --> D[_current_mode: str | None]
A --> E[_current_ctx: int | None]
A --> F[_current_flags: tuple]
A --> G[_current_cache_type_k/v: str | None]
A --> H[_current_n_slots: int | None]
A --> I[_current_kv_unified: bool]
J[ServerManager.start] --> K{Cache Hit?}
K -->|Yes| L[Return - reuse existing]
K -->|No| M[await stop]
M --> N[Build command flags]
N --> O[Spawn subprocess]
Cache Equality Check
Before starting a new server instance, ServerManager checks if an existing server matches the requested configuration. This prevents unnecessary VRAM allocation and model reloading. Sources: src/forge/server.py:50-65
flags = tuple(extra_flags) if extra_flags else ()
if (
self._current_model == model
and self._current_mode == mode
and self._current_ctx == ctx_override
and self._current_flags == flags
and self._current_cache_type_k == cache_type_k
and self._current_cache_type_v == cache_type_v
and self._current_n_slots == n_slots
and self._current_kv_unified == kv_unified
):
return # Reuse existing server
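The same reuse check can be expressed more compactly by bundling the compared fields into one immutable value. The sketch below is an alternative illustration, not how `ServerManager` is written. `ServerConfig` and `ReusableServer` are hypothetical names, and the field list simply mirrors the attributes compared above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ServerConfig:
    """Everything that must match for a running server to be reused."""
    model: str
    mode: str
    ctx_override: Optional[int] = None
    flags: Tuple[str, ...] = ()
    cache_type_k: Optional[str] = None
    cache_type_v: Optional[str] = None
    n_slots: Optional[int] = None
    kv_unified: bool = False


class ReusableServer:
    """Toy manager that restarts only when the requested config changes."""

    def __init__(self) -> None:
        self._current: Optional[ServerConfig] = None
        self.starts = 0

    def start(self, config: ServerConfig) -> bool:
        """Return True if a (re)start happened, False on a cache hit."""
        if self._current == config:  # dataclass equality compares every field
            return False
        self._current = config
        self.starts += 1  # stand-in for stopping and spawning the process
        return True
```

Because a frozen dataclass derives equality from all of its fields, adding a new setting to the key means adding a single field.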
Server Initialization Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| backend | str | Required | Backend type: "ollama" \| "llamaserver" \| "llamafile" |
| port | int | 8080 | Server listen port (llama-server / llamafile only) |
| models_dir | `str \| Path` | None | Directory containing GGUF files |
Startup Parameters
The start() method accepts numerous parameters for fine-grained control over server behavior:
| Parameter | Type | Description |
|---|---|---|
| model | str | Model identity (Ollama: model name; others: GGUF path as string) |
| gguf_path | `str \| Path` | Path to GGUF file for llamaserver/llamafile |
| mode | str | "native" or "prompt" reasoning mode |
| extra_flags | list[str] | Additional CLI flags passed to the server |
| ctx_override | `int \| None` | Override context window size (-c <value>) |
| cache_type_k | str | KV cache quantization type for keys (e.g., "q8_0", "q4_0") |
| cache_type_v | str | KV cache quantization type for values |
| n_slots | int | Concurrent slot count for multi-agent architectures |
| kv_unified | bool | Use unified KV cache across all slots |
Sources: src/forge/server.py:95-115
Budget Modes
Budget modes control how forge resolves the context window budget for the ContextManager. The resolve_budget() method maps BudgetMode enum values to actual token counts. Sources: src/forge/server.py:220-250
graph TD
A[resolve_budget mode] --> B{MANUAL?}
B -->|Yes, Ollama| C[Return manual_tokens]
B -->|Yes, others| D[await get_server_context]
B -->|No| E{Ollama?}
E -->|Yes| F[await _ollama.full]
E -->|No| G{Mode == FORGE_FAST?}
G -->|Yes| H[await get_server_context ÷ 4]
G -->|No| I[await get_server_context]
Budget Resolution Table
| BudgetMode | Ollama Backend | Llama-server/Llamafile Backend |
|---|---|---|
| MANUAL | Returns manual_tokens parameter | Queries /props for n_ctx |
| BACKEND | Ollama's reported context length | Queries /props for n_ctx |
| FORGE_FAST | n_ctx / 4 | n_ctx / 4 |
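Read as a pure function, the table amounts to the following sketch. It follows the table rows literally and is not the `resolve_budget()` method itself. The `BudgetMode` enum here is a local stand-in, `reported_ctx` represents whatever context length the backend reports, and integer division for the `n_ctx / 4` rows is an assumption.

```python
from enum import Enum
from typing import Optional


class BudgetMode(Enum):
    """Local stand-in mirroring the mode names in the table above."""
    MANUAL = "manual"
    BACKEND = "backend"
    FORGE_FAST = "forge-fast"


def resolve_budget(
    mode: BudgetMode,
    backend: str,
    reported_ctx: int,
    manual_tokens: Optional[int] = None,
) -> int:
    """Map a budget mode to a token count, following the table row by row."""
    if mode is BudgetMode.MANUAL and backend == "ollama":
        if manual_tokens is None:
            raise ValueError("manual_tokens is required for MANUAL mode on Ollama")
        return manual_tokens
    if mode is BudgetMode.FORGE_FAST:
        return reported_ctx // 4  # a quarter of the reported context window
    # BACKEND on any backend, and MANUAL on llama-server/llamafile
    return reported_ctx
```

With a reported context of 32768 tokens, `FORGE_FAST` yields a budget of 8192, while `BACKEND` yields the full 32768.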
High-Level Setup with `setup_backend()`
For most use cases, prefer setup_backend() which combines server startup with ContextManager creation. Sources: src/forge/server.py:280-330
from forge.server import setup_backend, BudgetMode
async def example():
client, ctx = await setup_backend(
backend="llamaserver",
gguf_path="/models/qwen3-8b-q4_K_M.gguf",
budget_mode=BudgetMode.FORGE_FAST,
client=None, # Will create default client
)
# ... run workflows ...
await client.close()
`setup_backend()` Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| backend | str | Required | Backend type |
| model | `str \| None` | None | Ollama model name |
| gguf_path | `str \| Path \| None` | None | GGUF file path |
| budget_mode | BudgetMode | BudgetMode.BACKEND | Context budget strategy |
| manual_tokens | `int \| None` | None | Required for MANUAL mode on Ollama |
| client | `Any \| None` | None | Existing client or None to create default |
| mode | str | "native" | Reasoning mode |
| port | int | 8080 | Server port |
| extra_flags | `list[str] \| None` | None | Additional backend flags |
| on_compact | `Callable \| None` | None | Callback for compaction events |
| compact_threshold | float | 0.75 | Compaction trigger threshold |
| phase_thresholds | tuple | (0.5, 0.7, 0.9) | Tiered compaction thresholds |
Server Readiness Detection
Forge uses /props polling rather than /health for readiness confirmation. This eliminates the gap between health-ok and props-available states. Sources: src/forge/server.py:260-278
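async def wait_for_ready(self, timeout: float = 60.0) -> None:
    url = f"http://localhost:{self._port}/props"
    # The excerpt omits how `deadline` and `client` are created;
    # the next two statements are assumed reconstructions.
    deadline = time.monotonic() + timeout
    async with httpx.AsyncClient() as client:
        while time.monotonic() < deadline:
            try:
                resp = await client.get(url)
                if resp.status_code == 200:
                    data = resp.json()
                    if "default_generation_settings" in data:
                        return
            except (httpx.ConnectError, httpx.ReadError, httpx.TimeoutException):
                pass
            await asyncio.sleep(2)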
The readiness check looks for default_generation_settings in the response—a strong indicator that the model is fully loaded and serving. Sources: src/forge/server.py:260-278
Proxy Server Configuration
Forge includes a proxy server (forge.proxy) that plumbs OpenAI-compatible sampling parameters through to backends. The proxy does not consult the sampling defaults map; it passes through whatever parameters the inbound request carries. Sources: src/forge/proxy/__main__.py:1-60
Proxy CLI Options
| Flag | Type | Default | Description |
|---|---|---|---|
| --backend-url | str | Required | Target backend URL |
| --backend | str | Required | Backend type |
| --model | str | Required | Model identifier |
| --gguf | str | "" | GGUF path (for non-Ollama) |
| --budget-mode | str | "backend" | Budget resolution mode |
| --budget-tokens | int | None | Manual token budget |
| --host | str | 127.0.0.1 | Proxy listen host |
| --port | int | 8081 | Proxy listen port |
| --serialize | flag | None | Force request serialization |
| --max-retries | int | 3 | Max retries per request |
| --verbose | flag | False | Enable debug logging |
Proxy Sampling Passthrough
The proxy supports these OpenAI-compatible body fields:
| Parameter | Type | Description |
|---|---|---|
| temperature | float | Sampling temperature |
| top_p | float | Nucleus sampling threshold |
| top_k | int | Top-k sampling |
| min_p | float | Minimum probability threshold |
| repeat_penalty | float | Repetition penalty |
| presence_penalty | float | Presence penalty |
| seed | int | Deterministic sampling seed |
For per-model recommended sampling in proxy mode, callers should look up forge.clients.get_sampling_defaults(model) and include the values in the request body. Sources: src/forge/clients/sampling_defaults.py:1-50
Per-Model Sampling Defaults
Forge ships verified per-model sampling recommendations for supported models. These must be explicitly opted into via recommended_sampling=True. Sources: src/forge/clients/sampling_defaults.py:50-80
Supported Models
The sampling defaults map includes recommendations for:
- Qwen3 / 3.5 / 3.6 series
- Qwen3-Coder
- Gemma 4
- Mistral Small 3.2
- Devstral Small 2
- Ministral 3 Instruct + Reasoning
- Mistral Nemo
- Granite 4.0 (`h-micro`, `h-tiny`)
Each entry includes an inline HuggingFace model card URL for verification. Sources: CHANGELOG.md:0.6.0
Sampling Policy
| strict | Model in Map | Behavior |
|---|---|---|
| True | Yes | Return dict copy |
| True | No | Raise UnsupportedModelError |
| False | Yes | One-shot INFO log; return {} |
| False | No | Return {} (silent) |
Context Length Resolution
For non-Ollama backends, forge queries the server's /props endpoint to determine the configured context length. This value feeds into budget resolution. Sources: src/forge/server.py:200-220
async def get_server_context(self) -> int:
    """Query /props for actual n_ctx.
    For Ollama: ``ollama stop`` for clean VRAM unloads between model switches.
    """
    props = await self.query_props()
    ctx = props.get("default_generation_settings", {}).get("n_ctx")
    if ctx is None:
        raise BudgetResolutionError()
    return ctx
Best Practices
Server Reuse
You do not need to check manually whether a server with the desired configuration is already running before starting a new one. The ServerManager performs this check internally based on:
- Model identity (name or GGUF path)
- Mode (`native` or `prompt`)
- Context override
- CLI flags
- KV cache quantization settings
- Slot configuration
VRAM Management
For Ollama backends, use ollama stop to cleanly unload models and free VRAM before switching to a different model. The llama-server/llamafile backends handle this through server restart. Sources: src/forge/server.py:200
Graceful Shutdown
Always call server.stop() when finished to properly terminate the backend process:
server = ServerManager(backend="llamaserver", port=8080)
try:
    await server.start(...)
    # ... work ...
finally:
    await server.stop()
Multi-Agent Configurations
For multi-agent architectures requiring concurrent slots, configure n_slots and optionally kv_unified=True for shared KV cache across slots:
await server.start(
model="...",
gguf_path="...",
n_slots=4,
kv_unified=True, # Each slot can use full context
)
Quick Start Example
import asyncio
from forge import OllamaClient, WorkflowRunner
from forge.server import setup_backend, BudgetMode
from forge.context import ContextManager, TieredCompact

async def main():
    # Setup backend with forge-managed context
    client, ctx = await setup_backend(
        backend="ollama",
        model="ministral-3:8b-instruct-2512-q4_K_M",
        budget_mode=BudgetMode.FORGE_FAST,
        recommended_sampling=True,
    )
    try:
        runner = WorkflowRunner(client=client, context_manager=ctx)
        # ... run workflows ...
    finally:
        await client.close()

asyncio.run(main())
For GGUF-based setups:
from forge.server import setup_backend, BudgetMode
client, ctx = await setup_backend(
backend="llamaserver",
gguf_path="/path/to/model-q4_K_M.gguf",
budget_mode=BudgetMode.FORGE_FAST,
extra_flags=["--reasoning-format", "auto"],
)
Related Documentation
- CONTRIBUTING.md - Project setup and testing
- README.md - Quick start and Workflow overview
- CHANGELOG.md - Version history and breaking changes
Sources: src/forge/server.py:95-115
Model Selection Guide
Related topics: Backend Setup Guide, Backend Clients
Overview
The Model Selection Guide covers how to choose, configure, and deploy language models within forge. Forge provides a unified workflow engine that abstracts backend differences (Ollama, Llamafile, LlamaServer) while offering per-model recommended sampling parameters sourced directly from HuggingFace model cards.
Sources: src/forge/clients/sampling_defaults.py:1-20
Supported Backends
Forge supports three LLM backend types, each with distinct configuration requirements.
| Backend | Configuration Method | Model Specification | Notes |
|---|---|---|---|
| Ollama | model parameter | Model name from ollama list | No GGUF path needed |
| Llamafile | gguf_path parameter | Path to .llamafile binary | Self-contained executables |
| LlamaServer | gguf_path parameter | Path to GGUF model file | Requires llama.cpp server binary |
Sources: src/forge/server.py:1-50
Backend Selection Logic
graph TD
A[Choose Backend] --> B{Backend Type?}
B -->|Ollama| C[Use model name]
B -->|Llamafile| D[Use gguf_path]
B -->|LlamaServer| D
C --> E[Connect via localhost:11434]
D --> F[Start server process]
F --> G[Connect via port 8080]
For Ollama, the model parameter directly references the model name from ollama list. For Llamafile and LlamaServer, you must provide the gguf_path pointing to the model file, and Forge will manage the server process lifecycle.
Sources: src/forge/server.py:80-120
Supported Models
Model Families
Forge has been tested and evaluated with the following model families across different quantization levels.
| Model Family | Variants | Recommended Quantization | Notes |
|---|---|---|---|
| Qwen3 | 8B, 3.5, 3.6 | Q4_K_M, Q8_0 | Includes Qwen3-Coder |
| Gemma | 4 (all sizes) | Q4_K_M | Use --reasoning-budget 0 workaround |
| Mistral | Small 3.2, Nemo, 7B | Q4_K_M, Q8_0 | Ministral variants available |
| Devstral | Small 2 | Q4_K_M | Code-focused model |
| Granite | 4.0 (h-micro, h-tiny) | Q4_K_M, Q8_0 | OpenAI-style tool calls |
| Llama | 3.1 8B | Q4_K_M, Q8_0 | 8B Reasoning variants |
| Ministral | 3 Instruct, 8B Instruct, Reasoning | Q4_K_M, Q8_0 | Reasoning requires budget fix |
Sources: CHANGELOG.md:1-50
Model Naming Conventions
Model names vary by backend. When using Ollama, use the exact model tag as shown in ollama list. For GGUF-based backends, the model name is derived from the filename stem of the GGUF file.
# Ollama example
client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M")
# Llamafile/LlamaServer example - model derived from path
client = LlamafileClient(gguf_path="/models/mistral-7b-q4_K_M.gguf")
Sources: README.md:1-30
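Deriving a name from the filename stem is a one-liner with the standard library. The helper name `model_name_from_gguf` is hypothetical. The sketch only demonstrates what "filename stem" means for a GGUF path and makes no claim about how forge normalizes the result.

```python
from pathlib import Path


def model_name_from_gguf(gguf_path) -> str:
    """Return the filename without its directory or final extension."""
    return Path(gguf_path).stem
```

For the path used in the example above, `/models/mistral-7b-q4_K_M.gguf`, the stem is `mistral-7b-q4_K_M`.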
Recommended Sampling Configuration
The Sampling Defaults Map
Forge ships forge.clients.sampling_defaults containing a verified per-model sampling recommendations map. Each entry includes parameters such as temperature, top_p, top_k, min_p, repeat_penalty, and presence_penalty sourced directly from HuggingFace model cards.
Sources: src/forge/clients/sampling_defaults.py:20-40
Enabling Recommended Sampling
To use per-model recommended sampling, pass recommended_sampling=True when initializing the client.
from forge import OllamaClient
# Opt-in to recommended sampling
client = OllamaClient(
model="qwen3:8b-q4_K_M",
recommended_sampling=True
)
If the model is not in the map and recommended_sampling=True is set, Forge raises UnsupportedModelError rather than silently falling back to backend defaults.
Sources: src/forge/errors.py:1-25
Sampling Policy Behavior
The following table describes the four-quadrant behavior when applying sampling defaults.
| strict | Model in Map | Behavior |
|---|---|---|
| True | Yes | Return recommended dict |
| True | No | Raise UnsupportedModelError |
| False | Yes | One-shot INFO log; return {} |
| False | No | Return {} (silent) |
Sources: src/forge/clients/sampling_defaults.py:60-85
Per-Call Sampling Overrides
The send() and send_stream() methods accept a sampling: dict | None kwarg that merges field-by-field with the client's instance-level sampling without mutating it. The caller's explicit non-None fields take precedence.
# Merge with instance defaults
response = await client.send(
messages,
sampling={"temperature": 0.7, "top_p": 0.9}
)
Sources: CHANGELOG.md:50-70
Proxy Mode Configuration
When running forge in proxy mode, sampling parameters are plumbed through from the incoming request body. OpenAI-compatible fields supported include temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty, and seed.
Proxy Server Startup
python -m forge.proxy \
--backend-url http://localhost:11434 \
--backend ollama \
--model qwen3:8b-q4_K_M \
--port 8081
For per-model recommended sampling in proxy mode, the calling client should look up forge.clients.get_sampling_defaults(model) and include the values in the request body.
Sources: src/forge/proxy/__main__.py:1-40
Context and Budget Management
Server Context Resolution
Forge automatically queries the backend's /props endpoint to determine the maximum context length. For Ollama, use ollama stop to cleanly unload VRAM between model switches.
from forge import ContextManager, TieredCompact
ctx = ContextManager(
    strategy=TieredCompact(keep_recent=2),
    budget_tokens=8192,
)
Budget Modes
| Mode | Description | Token Source |
|---|---|---|
| MANUAL | User-specified token budget | manual_tokens parameter |
| FORGE_FAST | Fast iteration mode | Server-reported context |
| FORGE_BALANCED | Balanced speed/quality | Server-reported context |
| FORGE_THOROUGH | Maximum quality | Server-reported context |
Sources: src/forge/server.py:150-200
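The token-source column can be read as a simple dispatch: MANUAL takes the caller's number, and every FORGE_* mode starts from what the server reports. The sketch below models only that distinction. The `BudgetMode` enum, its string values, and `resolve_budget` are hypothetical names, and any per-mode scaling Forge applies to the server-reported context is not described here, so it is not modeled.

```python
from enum import Enum
from typing import Optional


class BudgetMode(Enum):
    """Hypothetical enum mirroring the mode names in the table."""

    MANUAL = "manual"
    FORGE_FAST = "forge_fast"
    FORGE_BALANCED = "forge_balanced"
    FORGE_THOROUGH = "forge_thorough"


def resolve_budget(
    mode: BudgetMode,
    server_ctx: Optional[int],
    manual_tokens: Optional[int] = None,
) -> int:
    """Pick the token source for a mode: manual_tokens or server-reported context."""
    if mode is BudgetMode.MANUAL:
        if manual_tokens is None:
            raise ValueError("MANUAL mode requires manual_tokens")
        return manual_tokens
    if server_ctx is None:
        raise ValueError(f"{mode.name} needs a server-reported context length")
    return server_ctx
```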
Known Issues with Reasoning Models
Models using extended reasoning (Gemma 4, Qwen 3.5, Ministral Reasoning) may hang silently when the reasoning budget is unbounded on llama.cpp builds after April 10, 2026. The workaround is to pass --reasoning-budget 0 when starting the backend.
Sources: CHANGELOG.md:70-90
Client Configuration Reference
OllamaClient
OllamaClient(
    model: str,                          # Model name from `ollama list`
    base_url: str = "http://localhost:11434/v1",
    api_key: str | None = None,
    timeout: float = 120.0,
    recommended_sampling: bool = False,  # Opt-in to per-model defaults
    **kwargs
)
LlamafileClient
LlamafileClient(
    gguf_path: str | Path,               # Path to .llamafile binary
    model: str | None = None,            # Optional model name
    base_url: str = "http://localhost:8080/v1",
    recommended_sampling: bool = False,
    **kwargs
)
Sources: src/forge/clients/llamafile.py:1-50
Complete Workflow Example
import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)


def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"


class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")


workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)


async def main():
    client = OllamaClient(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        recommended_sampling=True,
    )
    ctx = ContextManager(
        strategy=TieredCompact(keep_recent=2),
        budget_tokens=8192,
    )
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")


asyncio.run(main())
Sources: README.md:30-80
Error Handling
UnsupportedModelError
Raised when recommended_sampling=True is specified but the model is not in the sampling defaults map.
from forge import OllamaClient
from forge.errors import UnsupportedModelError

try:
    client = OllamaClient(
        model="unknown-model:latest",
        recommended_sampling=True,
    )
except UnsupportedModelError as e:
    print(f"Model not supported: {e.model}")
    # Solution: either add an entry to MODEL_SAMPLING_DEFAULTS
    # or drop recommended_sampling=True
Tool Call Errors
Tool-related errors include ToolCallError (LLM failed to produce valid tool call), ToolExecutionError (tool callable raised an exception), and ToolResolutionError (valid arguments but data didn't resolve).
Sources: src/forge/errors.py:25-60
Best Practices
Selecting Quantization Levels
| Use Case | Recommended Quantization |
|---|---|
| Development/Testing | Q4_K_M (balanced quality/size) |
| Production (quality priority) | Q8_0 (near-float quality) |
| Resource-constrained | Q4_0 (smaller, lower quality) |
Guardrail Integration
Guardrails in forge are defined in src/forge/core/runner.py, and their nudge templates live in src/forge/prompts/nudges.py. Each guardrail can be toggled independently via ablation presets for evaluation.
Sources: CONTRIBUTING.md:1-30
Server Management
When running multiple evaluations, reuse ServerManager instances when the model and configuration match to avoid unnecessary server restarts.
# ServerManager caches configuration to avoid redundant restarts
if (
    self._current_model == model
    and self._current_mode == mode
    and self._current_ctx == ctx_override
):
    # Reuse existing server
    return
Sources: src/forge/server.py:60-75
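The reuse check above is a fragment of a method. A self-contained sketch of the same idea follows; the class `ServerManagerSketch`, its `ensure_server` method, and the `restarts` counter are hypothetical, and no real server is started.

```python
from typing import Optional


class ServerManagerSketch:
    """Hypothetical stand-in that models only the configuration cache."""

    def __init__(self) -> None:
        self._current_model: Optional[str] = None
        self._current_mode: Optional[str] = None
        self._current_ctx: Optional[int] = None
        self.restarts = 0  # counts simulated server (re)starts

    def ensure_server(self, model: str, mode: str, ctx_override: Optional[int] = None) -> None:
        """Simulate a restart only when the requested configuration changed."""
        if (
            self._current_model == model
            and self._current_mode == mode
            and self._current_ctx == ctx_override
        ):
            return  # same configuration: reuse the existing server
        self.restarts += 1  # a real manager would stop and relaunch the backend here
        self._current_model = model
        self._current_mode = mode
        self._current_ctx = ctx_override


manager = ServerManagerSketch()
manager.ensure_server("qwen3:8b-q4_K_M", "FORGE_FAST")
manager.ensure_server("qwen3:8b-q4_K_M", "FORGE_FAST")        # reused, no restart
manager.ensure_server("qwen3:8b-q4_K_M", "FORGE_FAST", 8192)  # ctx changed, restart
```

Sharing one manager across evaluation runs is what makes the cache useful; a fresh instance per run would restart the server every time.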
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
First-time setup may fail or require extra isolation and rollback planning.
Doramagic extracted 15 source-linked risk signals. Review them before installing or handing real data to the project.
1. Installation risk: Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body
- Severity: medium
- Finding: Installation risk is backed by a source signal: Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/58
2. Installation risk: Investigate: integration paths with Hermes Agent
- Severity: medium
- Finding: Installation risk is backed by a source signal: Investigate: integration paths with Hermes Agent. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/51
3. Installation risk: Per-model recommended sampling defaults (map keyed by HF model cards)
- Severity: medium
- Finding: Installation risk is backed by a source signal: Per-model recommended sampling defaults (map keyed by HF model cards). Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/59
4. Installation risk: Rescue-parse ChatGPT-style XML tool calls
- Severity: medium
- Finding: Installation risk is backed by a source signal: Rescue-parse ChatGPT-style XML tool calls. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/55
5. Configuration risk: Proxy external mode hardcodes native FC — no prompt-injection fallback
- Severity: medium
- Finding: Configuration risk is backed by a source signal: Proxy external mode hardcodes native FC — no prompt-injection fallback. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/53
6. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | README/documentation is current enough for a first validation pass.
7. Maintenance risk: Maintainer activity is unknown
- Severity: medium
- Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | last_activity_observed missing
8. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: downstream_validation.risk_items | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | no_demo; severity=medium
9. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: risks.scoring_risks | hn_item:48192383 | https://news.ycombinator.com/item?id=48192383 | no_demo; severity=medium
10. Security or permission risk: Hardware detection: AMD unified-memory rigs fall through to 4K Ollama budget
- Severity: medium
- Finding: Security or permission risk is backed by a source signal: Hardware detection: AMD unified-memory rigs fall through to 4K Ollama budget. Treat it as a review item until the current version is checked.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/61
11. Security or permission risk: Sub-agent support: dynamic slot splitting
- Severity: medium
- Finding: Security or permission risk is backed by a source signal: Sub-agent support: dynamic slot splitting. Treat it as a review item until the current version is checked.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/28
12. Security or permission risk: Sub-agent support: slot pool
- Severity: medium
- Finding: Security or permission risk is backed by a source signal: Sub-agent support: slot pool. Treat it as a review item until the current version is checked.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/antoinezambelli/forge/issues/29
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using forge with real data or production workflows.
- Hardware detection: AMD unified-memory rigs fall through to 4K Ollama budget - github / github_issue
- Per-model recommended sampling defaults (map keyed by HF model cards) - github / github_issue
- Client sampling params: thread top_p/top_k/min_p/repeat_penalty through request body - github / github_issue
- llama.cpp reasoning budget sampler causes silent hangs after April 10 bu - github / github_issue
- Rescue-parse ChatGPT-style XML tool calls - github / github_issue
- Proxy external mode hardcodes native FC — no prompt-injection fallback - github / github_issue
- Investigate: integration paths with Hermes Agent - github / github_issue
- Sub-agent support: slot pool - github / github_issue
- Sub-agent support: dynamic slot splitting - github / github_issue
- README/documentation is current enough for a first validation pass. - GitHub / issue
Source: Project Pack community evidence and pitfall evidence