arksim Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

arksim

Introduction to ArkSim

Related topics: Quickstart Guide, System Architecture Overview

ArkSim is a multi-turn agent evaluation and simulation platform designed to test, measure, and improve AI agent quality across conversational scenarios. It enables developers to define test scenarios, simulate user interactions with their agents, and generate comprehensive evaluation reports with quantitative and qualitative metrics.

Overview

ArkSim addresses the challenge of systematically evaluating conversational AI agents by providing:

Scenario-based simulation: Define test cases with user profiles, knowledge bases, and expected behaviors
Automated evaluation: Score agent responses across multiple dimensions including goal completion, helpfulness, and coherence
Framework integration: Connect with various agent frameworks including LangGraph, LangChain, AutoGen, Claude Agent SDK, Rasa, Dify, and OpenClaw
Custom metrics: Extend evaluation capabilities with user-defined quantitative and qualitative metrics
Visual reporting: Generate HTML reports with conversation transcripts, failure analysis, and per-metric scores

Sources: README.md

Architecture Overview

ArkSim follows a pipeline architecture with three primary stages: simulation, evaluation, and report generation.

graph TD
    A[Configuration<br/>config.yaml] --> B[Simulation Engine]
    C[Scenarios<br/>scenarios.json] --> B
    D[Agent<br/>custom_agent.py] --> B
    B --> E[Conversation Data]
    E --> F[Evaluator]
    F --> G[HTML Report<br/>final_report.html]
    H[Custom Metrics<br/>custom_metrics.py] --> F

Core Components

Component	Description
Simulation Engine	Orchestrates multi-turn conversations between simulated users and the agent
Evaluator	Scores agent performance using built-in and custom metrics
CLI Interface	Command-line interface for running simulations (`arksim simulate-evaluate`)
Report Generator	Produces HTML reports with visualizations and conversation transcripts

Sources: examples/customer-service/README.md:1-25

Agent Integration Methods

ArkSim supports multiple agent integration patterns to accommodate different agent architectures.

Python Class Integration (BaseAgent)

The default integration method uses a Python class inheriting from BaseAgent. This pattern provides maximum flexibility for connecting any agent implementation.

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"

The agent can return either:

A plain string response
An AgentResponse object containing both content and tool calls

Sources: README.md:40-55
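
Because `execute()` may return either type, code that consumes the result has to handle both. The following is a minimal sketch of one way to normalize the two shapes. The `ToolCall` and `AgentResponse` classes here are simplified dataclass stand-ins so the example runs without arksim installed, and `normalize` is a hypothetical helper, not part of the arksim API.

```python
from dataclasses import dataclass, field
from typing import Any


# Simplified stand-ins for arksim's models (assumption: only the fields
# used in this sketch are mirrored).
@dataclass
class ToolCall:
    id: str
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class AgentResponse:
    content: str
    tool_calls: list[ToolCall] = field(default_factory=list)


def normalize(result: "str | AgentResponse") -> AgentResponse:
    """Wrap a plain string in an AgentResponse with no tool calls."""
    if isinstance(result, str):
        return AgentResponse(content=result)
    return result


plain = normalize("agent response")
structured = normalize(
    AgentResponse(
        content="Found results",
        tool_calls=[ToolCall(id="call_1", name="search", arguments={"q": "x"})],
    )
)
print(plain.tool_calls)
print(structured.tool_calls[0].name)
```

Returning a plain string is the simpler option when the agent makes no tool calls or when tool calls are captured another way.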

AgentResponse Data Model

The AgentResponse model encapsulates structured agent outputs:

class AgentResponse(BaseModel):
    """Structured return from agent execution, carrying both text and tool calls."""
    model_config = ConfigDict(extra="ignore")
    
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)

Sources: arksim/simulation_engine/tool_types.py:32-42

ToolCall Model

Tool calls are captured using the ToolCall model:

class ToolCall(BaseModel):
    """A single tool/function call observed during a turn."""
    model_config = ConfigDict(extra="ignore")
    
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None

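The extra="ignore" setting tells Pydantic to drop unknown fields silently instead of raising a validation error, so payloads that carry additional fields (for example, from a newer ArkSim version) still load.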

Sources: arksim/simulation_engine/tool_types.py:18-30

Chat Completions Endpoint Integration

For agents exposing a standard OpenAI-compatible API:

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

Sources: README.md:57-62

A2A Protocol Integration

For agents implementing the Agent-to-Agent (A2A) protocol:

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

Sources: README.md:64-70

Automatic Tool Call Capture (Tracing)

ArkSim supports automatic tool call capture through a tracing processor, eliminating the need for agents to explicitly return tool calls:

graph LR
    A[Simulator sets<br/>routing context] --> B["agent.execute()"]
    B --> C[SDK fires<br/>TracingProcessor.on_span_end]
    C --> D[arksim captures<br/>tool calls]
    D --> E[Evaluator scores]

pip install -r requirements-traced.txt
arksim simulate-evaluate config_traced.yaml

Sources: examples/customer-service/README.md:35-55

Configuration System

ArkSim uses YAML configuration files to define simulation and evaluation parameters.

Configuration File Structure

# Simulation settings
simulation:
  max_turns: 10
  timeout: 300

# Agent configuration
agent_config:
  agent_type: chat_completions  # or 'a2a', 'custom'
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

# Trace receiver (optional)
trace_receiver:
  enabled: true
  wait_timeout: 5

# Custom metrics
custom_metrics_file_paths:
  - ./custom_metrics.py

metrics_to_run:
  - goal_completion
  - verification_compliance

Sources: examples/ci/README.md:1-25

Scenario Definition

Scenarios are defined in scenarios.json with the following structure:

Field	Description
`scenario_id`	Unique identifier for the scenario
`user_profile`	Description of the simulated user
`knowledge`	Context and information available to the user
`goals`	Objectives the user aims to achieve
`expected_behavior`	Expected agent behavior patterns

Sources: examples/integrations/dify/README.md:1-20

Evaluation Metrics

ArkSim provides built-in metrics and supports custom metric definitions.

Built-in Metrics

Metric	Description	Weight
Goal Completion	Whether the agent achieved the user's objectives	60%
Turn Success Ratio	Ratio of successful turns to total turns	40%
Final Score	Weighted average of other metrics	-

Sources: arksim/utils/html_report/report_template.html:80-95

Custom Metrics

Custom metrics can be implemented as either quantitative (scored numerically) or qualitative (evaluated via LLM).

#### Quantitative Metric Pattern

from pydantic import BaseModel

from arksim.evaluator import (
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
    format_chat_history,
)

# Output schema for the LLM judge (base class assumed to be a Pydantic model)
class VerificationComplianceMetric(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str

VERIFICATION_COMPLIANCE_SYSTEM_PROMPT = """\
You are an impartial evaluator for a customer service agent.
Score how well the agent followed identity verification protocols..."""

def verification_compliance_metric(
    input_data: ScoreInput,
) -> QuantResult:
    # Implementation
    return QuantResult(score=0.85, reason="Verification completed")

Sources: examples/customer-service/custom_metrics.py:15-40

#### Qualitative Metric Pattern

from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    ScoreInput,
    format_chat_history,
)

def helpfulness_metric(input_data: ScoreInput) -> QualResult:
    system_prompt = """Evaluate the helpfulness of the agent response..."""
    return QualResult(
        assessment="The agent provided comprehensive assistance",
        score=0.9,
        reason="Clear explanation with actionable steps"
    )

Metric Configuration

To use custom metrics:

Implement the metric function in a Python file
Reference the file path in custom_metrics_file_paths in config.yaml
Add the metric name to metrics_to_run

Sources: examples/customer-service/custom_metrics.py:1-15

CLI Usage

ArkSim provides a command-line interface for running simulations and evaluations.

Primary Commands

Command	Description
`arksim simulate-evaluate config.yaml`	Run simulation and evaluation in one step
`arksim simulate config_simulate.yaml`	Run simulation only
`arksim evaluate config_evaluate.yaml`	Evaluate existing simulation results

Sources: examples/customer-service/README.md:15-22

Workflow Examples

#### Combined Simulation and Evaluation

arksim simulate-evaluate config.yaml

#### Separate Simulation and Evaluation

# Step 1: Simulate
arksim simulate config_simulate.yaml

# Step 2: Evaluate
arksim evaluate config_evaluate.yaml

Results are written to ./results/simulation/simulation.json. The evaluation report is printed to stdout with per-scenario metric scores and failure analysis.

Sources: examples/integrations/dify/README.md:15-25

Framework Integrations

ArkSim provides integration examples for popular agent frameworks.

Integration Matrix

Framework	Package Required
LangGraph	`langgraph langchain-openai`
LangChain	`langgraph langchain-openai`
AutoGen	`autogen-agentchat autogen-ext[openai]`
Claude Agent SDK	`claude-agent-sdk`
Rasa	`rasa-pro`
Dify	(custom agent)
OpenClaw	(custom agent)

Sources: examples/integrations/langgraph/README.md, examples/integrations/autogen/README.md

General Integration Pattern

Regardless of the framework, the integration follows this pattern:

Create a custom_agent.py file with a BaseAgent subclass
Implement the execute() method to call the framework's agent
Return either a string or AgentResponse with tool calls
Configure config.yaml to use the custom agent
Create scenarios.json with test cases

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class FrameworkAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-session-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Call your framework's agent here
        result = await your_agent.run(user_query)
        return str(result)

Report Generation

ArkSim generates comprehensive HTML evaluation reports containing:

Report Sections

Section	Content
Summary	Overall scores, pass/fail status
Per-Scenario Scores	Individual metric scores for each scenario
Failure Categories	Grouped failure patterns with counts
Conversation Viewer	Full transcript of each simulated conversation
Score Explanations	Rationale for each score with references to conversation turns

Sources: README.md:25-30

Report Output

The report tells you where your agent is strong and where it breaks. You get per-metric scores, categorized failures, and full conversation transcripts so you can read the exact turns where things went wrong.

Sources: README.md:30-35

Development Guidelines

Project Setup

# Fork and clone
git clone https://github.com/<your-username>/arksim.git
cd arksim

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Create a branch
git checkout -b my-feature

Code Quality

ArkSim uses Ruff for linting and formatting:

ruff check .    # Lint
ruff format .   # Format

Pre-commit hooks run both automatically on commit.

Code Style

Follow PEP 8 conventions
Code lines: maximum 120 characters
Comments and docstrings: maximum 80 characters
Type hints encouraged for function signatures
Use absolute imports over relative imports

Commit Message Format

<component>: <verb> <description>

Examples:

evaluator: add custom metric support
simulator: fix profile generation for empty attributes
cli: support verbose flag for streaming output

Keep the subject line under 72 characters, and write it in lowercase using the imperative mood.
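
The format rules above can be checked mechanically. The sketch below is a hypothetical helper, not part of the ArkSim tooling: `check_subject` and its regular expression are assumptions about how to encode the `<component>: <verb> <description>` shape. It cannot verify the imperative mood, only length, case, and structure.

```python
import re

# component, colon, space, then at least two words (verb + description).
SUBJECT_RE = re.compile(r"^[a-z][a-z0-9_-]*: \S+ \S.*$")


def check_subject(subject: str) -> list[str]:
    """Return a list of rule violations for a commit subject line."""
    problems = []
    if len(subject) >= 72:
        problems.append("subject must be under 72 characters")
    if subject != subject.lower():
        problems.append("subject must be lowercase")
    # Match against the lowercased text so a case problem is reported once.
    if not SUBJECT_RE.match(subject.lower()):
        problems.append("subject must look like '<component>: <verb> <description>'")
    return problems


for candidate in ("evaluator: add custom metric support", "Fixed stuff"):
    print(candidate, "->", check_subject(candidate) or "ok")
```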

Branch Naming

<type>/<short-description>

Examples: feat/retry-logic, fix/empty-list-handling, docs/update-quickstart.

Sources: CONTRIBUTING.md:1-50

CI/CD Integration

ArkSim can be integrated into GitHub Actions workflows for automated testing.

Workflow Options

Workflow	Use Case
`arksim.yml`	HTTP server endpoints requiring startup/shutdown
`arksim-pytest.yml`	Python-based pytest integration

Setup Requirements

Update TODO sections in arksim.yml (startup command and health-check URL)
Create tests/arksim/config.yaml pointing to your server endpoint
Create tests/arksim/scenarios.json with test cases
Add custom metrics to tests/arksim/custom_metrics/ if needed
Add OPENAI_API_KEY (and optionally AGENT_API_KEY) to GitHub secrets

Sources: examples/ci/README.md:1-15
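
Before wiring up the workflow, it can help to confirm the prerequisites locally. The following sketch checks for the files and secret named in the list above. The function name and the `env:` prefix in its output are illustrative choices; it does not inspect `arksim.yml` or the optional items.

```python
import os
from pathlib import Path

# Required items taken from the setup list; optional ones are not checked.
REQUIRED_FILES = ("tests/arksim/config.yaml", "tests/arksim/scenarios.json")
REQUIRED_ENV = ("OPENAI_API_KEY",)


def missing_prerequisites(repo_root: Path, env: dict[str, str]) -> list[str]:
    """List required files and environment variables that are absent."""
    missing = [p for p in REQUIRED_FILES if not (repo_root / p).is_file()]
    missing += [f"env:{name}" for name in REQUIRED_ENV if not env.get(name)]
    return missing


for item in missing_prerequisites(Path.cwd(), dict(os.environ)):
    print(f"missing: {item}")
```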

Summary

ArkSim provides a comprehensive framework for evaluating multi-turn conversational AI agents through:

Flexible integration: Connect agents via Python classes, REST APIs, or framework-specific connectors
Rich scenario definitions: Test agents with diverse user profiles and interaction goals
Extensible evaluation: Implement custom metrics for domain-specific assessment
Actionable reporting: Generate HTML reports that pinpoint exactly where and why agent behavior falls short

By standardizing agent evaluation, ArkSim enables data-driven improvements to conversational AI systems across different frameworks and architectures.

Sources: README.md

Quickstart Guide

Related topics: Introduction to ArkSim, Agent Types and Integration

This guide walks you through setting up and running your first agent simulation with ArkSim. By the end, you'll understand how to scaffold a project, connect your agent, define test scenarios, and execute simulation and evaluation workflows.

Prerequisites

Before starting, ensure you have:

Python 3.10 or higher
pip or pipx for package installation
An API key for your LLM provider (OpenAI, Anthropic, etc.) if your agent requires external API calls

Installation

Install ArkSim using pip:

pip install arksim

For development with test dependencies:

pip install -e ".[dev]"

Sources: README.md

Project Initialization

The arksim init command scaffolds a starter project with all necessary configuration files. Run it from your desired working directory:

arksim init

Agent Connection Types

ArkSim supports three agent connection types via the --agent-type flag:

Type	Description	Use Case
`custom`	Python class implementing `BaseAgent`	Full control, no external server needed
`chat_completions`	HTTP endpoint compatible with OpenAI format	Existing REST APIs
`a2a`	Agent-to-Agent protocol	Multi-agent systems

Sources: arksim/cli.py:120-136

Command Options

arksim init --agent-type custom    # Default: Python agent class
arksim init --agent-type chat_completions  # HTTP endpoint
arksim init --agent-type a2a       # A2A protocol
arksim init --agent-type custom --force  # Overwrite existing files

Scaffolding generates these files:

File	Purpose
`my_agent.py`	Agent implementation stub
`config.yaml`	Simulator configuration
`scenarios.json`	Test scenario definitions
`.env`	Environment variables template

Project Structure

your-project/
├── my_agent.py        # Your agent implementation
├── config.yaml        # Simulation & evaluation settings
├── scenarios.json     # Test scenarios
└── results/           # Output directory (created on run)
    ├── simulation/
    │   └── simulation.json
    └── evaluation/
        └── evaluation.json

Connecting Your Agent

Python Class (Recommended)

Replace the generated my_agent.py with your agent logic. Your class must inherit from BaseAgent:

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Your agent logic here
        return "agent response"

Sources: README.md

The execute method supports two return types:

Return Type	Description
`str`	Plain text response
`AgentResponse`	Structured response with text and tool calls

from arksim.simulation_engine.tool_types import AgentResponse, ToolCall

async def execute(self, user_query: str, **kwargs: object) -> AgentResponse:
    tool_calls = [
        ToolCall(
            id="call_123",
            name="search_database",
            arguments={"query": user_query}
        )
    ]
    return AgentResponse(
        content="Found results for your query",
        tool_calls=tool_calls
    )

Sources: arksim/simulation_engine/tool_types.py:45-72

Chat Completions Endpoint

For HTTP-based agents, configure config.yaml:

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

A2A Protocol

For Agent-to-Agent protocol support:

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

A2A agents can surface tool calls for evaluation via the protocol extension.

Sources: README.md

Configuring Simulation

Edit config.yaml to control simulation behavior:

simulator:
  max_turns: 20
  timeouts:
    agent_response: 30
    tool_execution: 10
  model: gpt-4o-mini

Core Configuration Parameters

Parameter	Type	Default	Description
`max_turns`	int	20	Maximum conversation turns per scenario
`agent_response`	int	30	Timeout for agent response (seconds)
`tool_execution`	int	10	Timeout for tool execution (seconds)
`model`	string	gpt-4o-mini	Simulator model for user simulation
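
To make the defaults concrete, here is a small sketch of how the documented values could be applied when a key is omitted. `SimulatorSettings` and `from_config` are hypothetical names for illustration; how ArkSim itself loads and merges configuration is not shown in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatorSettings:
    # Defaults copied from the parameter table above.
    max_turns: int = 20
    agent_response: int = 30  # seconds
    tool_execution: int = 10  # seconds
    model: str = "gpt-4o-mini"


def from_config(simulator: dict) -> SimulatorSettings:
    """Build settings from a `simulator` mapping, falling back to defaults."""
    timeouts = simulator.get("timeouts", {})
    defaults = SimulatorSettings()
    return SimulatorSettings(
        max_turns=simulator.get("max_turns", defaults.max_turns),
        agent_response=timeouts.get("agent_response", defaults.agent_response),
        tool_execution=timeouts.get("tool_execution", defaults.tool_execution),
        model=simulator.get("model", defaults.model),
    )


# Only two keys are overridden; the rest keep their documented defaults.
settings = from_config({"max_turns": 5, "timeouts": {"agent_response": 60}})
print(settings)
```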

Defining Test Scenarios

Edit scenarios.json to define your test cases:

[
  {
    "id": "scenario-001",
    "name": "Customer Inquiry",
    "category": "support",
    "user_profile": {
      "name": "Alice Johnson",
      "age": 34,
      "account_type": "premium"
    },
    "knowledge": [
      "Customer has a premium account",
      "Customer is inquiring about billing"
    ],
    "conversation": [
      {
        "turn": 1,
        "user": "I noticed a charge on my account that I don't recognize",
        "goal": "Customer wants to understand and dispute the unfamiliar charge"
      }
    ]
  }
]

Scenario Schema Fields

Field	Required	Description
`id`	Yes	Unique scenario identifier
`name`	Yes	Human-readable name
`category`	No	Grouping category for filtering
`user_profile`	No	Simulated user attributes
`knowledge`	No	Facts the simulated user knows
`conversation`	Yes	Multi-turn conversation structure

Sources: examples/customer-service/README.md
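
A quick pre-flight check can catch missing required fields before a run. The sketch below validates only what the table above states (`id`, `name`, and `conversation` are required) plus uniqueness of `id`. The function name is hypothetical, and ArkSim may apply stricter validation of its own.

```python
import json

REQUIRED_FIELDS = ("id", "name", "conversation")


def validate_scenarios(raw: str) -> list[str]:
    """Return human-readable problems found in a scenarios.json payload."""
    scenarios = json.loads(raw)
    problems = []
    seen_ids = set()
    for index, scenario in enumerate(scenarios):
        for field_name in REQUIRED_FIELDS:
            if field_name not in scenario:
                problems.append(f"scenario {index}: missing '{field_name}'")
        scenario_id = scenario.get("id")
        if scenario_id in seen_ids:
            problems.append(f"scenario {index}: duplicate id '{scenario_id}'")
        if scenario_id is not None:
            seen_ids.add(scenario_id)
    return problems


example = json.dumps([
    {"id": "scenario-001", "name": "Customer Inquiry", "conversation": []},
    {"id": "scenario-001", "name": "Duplicate"},
])
for problem in validate_scenarios(example):
    print(problem)
```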

Running Simulation and Evaluation

Combined Workflow

Run both simulation and evaluation in one command:

arksim simulate-evaluate config.yaml

Separate Steps

For more control, run simulation and evaluation separately:

# Step 1: Simulate
arksim simulate config_simulate.yaml

# Step 2: Evaluate
arksim evaluate config_evaluate.yaml

Sources: examples/customer-service/README.md

Command Reference

Command	Description
`arksim init`	Scaffold new project
`arksim simulate <config>`	Run simulation only
`arksim evaluate <config>`	Run evaluation only
`arksim simulate-evaluate <config>`	Run both steps
`arksim ui`	Launch web UI (port 8080)
`arksim examples`	Download example projects
`arksim prompts`	List available prompts

Sources: arksim/cli.py:89-152

Web UI

Launch the web-based control plane:

arksim ui --port 8080

The UI provides:

Scenario management
Real-time simulation progress
Log viewing
Dark/light mode toggle

Integration Examples

ArkSim supports integrations with popular agent frameworks:

Framework	Command
LangChain/LangGraph	`pip install langgraph langchain-openai`
Claude Agent SDK	`pip install claude-agent-sdk`
Microsoft AutoGen	`pip install autogen-agentchat autogen-ext[openai]`
Dify	HTTP-based integration

Example workflow for LangGraph:

cd examples/integrations/langgraph
pip install langgraph langchain-openai
export OPENAI_API_KEY="<your-key>"
arksim simulate-evaluate config.yaml

Sources: examples/integrations/langchain/README.md

Next Steps

Review the Evaluation Metrics documentation for customizing scoring
Explore the Custom Agent guide for advanced integration patterns
Download example projects with arksim examples

Sources: README.md

System Architecture Overview

Related topics: Simulation Engine, Evaluation System, LLM Provider Integration

ArkSim is a multi-turn agent evaluation framework that simulates user conversations with agents, captures tool calls, and scores agent performance against customizable metrics. This document provides a comprehensive overview of the system's architecture, core components, and data flows.

Core Components

ArkSim's architecture is organized into three primary subsystems that work together to simulate, capture, and evaluate agent behavior.

Component	Purpose
Simulation Engine	Orchestrates multi-turn conversations between user simulators and agents
Evaluator	Scores agent responses against qualitative and quantitative metrics
LLM Layer	Powers user simulation and metric evaluation via configurable model providers

High-Level Architecture

graph TD
    subgraph Simulation["Simulation Engine"]
        CLI[CLI Interface]
        Config[Config Loader]
        Simulator[Simulator Core]
        AgentConnector[Agent Connector]
        UserSimulator[User Simulator]
    end

    subgraph Evaluation["Evaluator"]
        Metrics[Custom Metrics]
        Scoring[Scoring Engine]
        Report[HTML Report Generator]
    end

    subgraph LLM["LLM Layer"]
        ChatLLM[Chat LLM]
        EvalLLM[Evaluation LLM]
    end

    subgraph External["External Systems"]
        CustomAgent[Custom Agent]
        HTTPEndpoint[HTTP Endpoint]
        A2AAgent[A2A Agent]
    end

    CLI --> Config
    Config --> Simulator
    Simulator --> UserSimulator
    Simulator --> AgentConnector
    AgentConnector --> CustomAgent
    AgentConnector --> HTTPEndpoint
    AgentConnector --> A2AAgent
    UserSimulator --> ChatLLM
    Metrics --> EvalLLM
    EvalLLM --> Scoring
    Scoring --> Report

    style CLI fill:#e1f5fe
    style Simulator fill:#fff3e0
    style Report fill:#e8f5e9

CLI Interface

The command-line interface provides multiple entry points for running simulations and evaluations. The CLI is implemented in arksim/cli.py and supports the following commands:

Command	Description
`arksim simulate-evaluate config.yaml`	Run simulation and evaluation in a single pipeline
`arksim simulate config.yaml`	Run simulation only
`arksim evaluate config.yaml`	Run evaluation on existing simulation results
`arksim init`	Scaffold starter files for agent testing
`arksim ui`	Launch web UI control plane
`arksim examples`	Download example projects from GitHub

Sources: arksim/cli.py:1-100

Simulation Modes

ArkSim supports three distinct agent integration patterns:

graph LR
    subgraph AgentTypes["Agent Types"]
        Custom[Custom Agent<br/>Python class extending BaseAgent]
        HTTP[Chat Completions<br/>HTTP endpoint /v1/chat/completions]
        A2A[A2A Protocol<br/>Agent-to-Agent standard]
    end

Custom Agent (Python class): Extend BaseAgent and implement the execute() method
Chat Completions: Configure an HTTP endpoint for OpenAI-compatible chat completions API
A2A Protocol: Connect via the Agent-to-Agent protocol standard

Sources: README.md

Simulation Engine

The simulation engine orchestrates multi-turn conversations between a user simulator and the agent under test. Each turn consists of:

User simulator generates a response based on conversation history and scenario knowledge
Agent executes the user query and returns a response (optionally with tool calls)
Simulator captures the interaction for evaluation
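
The three steps above can be sketched as a simple loop. This is an illustration under stated assumptions, not the engine's actual code: `ScriptedUser` replaces the LLM-backed user simulator with fixed messages, `EchoAgent` stands in for a `BaseAgent` subclass, and `Turn` and `run_conversation` are hypothetical names.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    user: str
    agent: str


class EchoAgent:
    """Stand-in for a BaseAgent subclass; only execute() is modeled."""

    async def execute(self, user_query: str, **kwargs: object) -> str:
        return f"You said: {user_query}"


class ScriptedUser:
    """Stand-in user simulator that replays fixed messages (no LLM call)."""

    def __init__(self, messages: list[str]) -> None:
        self._messages = list(messages)

    def next_message(self, history: list[Turn]) -> Optional[str]:
        if len(history) < len(self._messages):
            return self._messages[len(history)]
        return None  # nothing left to say, so the conversation ends


async def run_conversation(user: ScriptedUser, agent: EchoAgent, max_turns: int) -> list[Turn]:
    history: list[Turn] = []
    for _ in range(max_turns):
        message = user.next_message(history)  # 1. user simulator speaks
        if message is None:
            break
        reply = await agent.execute(message)  # 2. agent responds
        history.append(Turn(user=message, agent=reply))  # 3. capture the turn
    return history


history = asyncio.run(run_conversation(ScriptedUser(["hi", "bye"]), EchoAgent(), max_turns=10))
for turn in history:
    print(turn)
```

The loop stops at whichever comes first: the user simulator has nothing more to say, or `max_turns` is reached.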

Tool Call Capture

ArkSim captures tool/function calls in two ways:

Capture Method	Mechanism	Configuration
AgentResponse	Agent returns structured `AgentResponse` with `tool_calls` list	Default behavior
TracingProcessor	SDK's `TracingProcessor.on_span_end` captures calls automatically	`trace_receiver.enabled: true`

Sources: examples/customer-service/README.md

#### Tool Call Data Model

Tool calls are represented by the ToolCall class:

class ToolCall(BaseModel):
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None

The AgentResponse wraps both content and tool calls:

class AgentResponse(BaseModel):
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)

Sources: arksim/simulation_engine/tool_types.py:1-50

Traced Agent Flow

sequenceDiagram
    participant Simulator
    participant Agent
    participant Tracing as ArksimTracingProcessor
    participant Evaluator

    Simulator->>Agent: execute(user_query)
    Agent->>Tracing: on_span_end(span)
    Tracing->>Simulator: captured_tool_calls
    Simulator->>Agent: RunResult
    Agent->>Simulator: AgentResponse
    Simulator->>Evaluator: Tool calls + Conversation
    Evaluator->>Simulator: Metric Scores

Sources: examples/customer-service/README.md

Evaluator

The evaluator scores agent performance using both quantitative and qualitative metrics. The evaluation framework supports custom metrics defined via Python files.

Metric Types

Type	Description	Scoring Method
QuantitativeMetric	Numerical scores (0.0-1.0 scale)	Structured JSON schema validation
QualitativeMetric	Free-form evaluation with reasoning	LLM-generated analysis

Sources: examples/customer-service/custom_metrics.py:1-60

Custom Metrics Structure

Custom metrics require:

A Pydantic BaseModel defining the output schema
A system prompt describing evaluation criteria
Implementation of QuantitativeMetric or QualitativeMetric

class ConversionSchema(BaseModel):
    intent_strength: float
    conversion_outcome: float
    evidence: list[str]
    reason: str

CONVERSION_SYSTEM_PROMPT = """\
You are an impartial evaluator for an e-commerce shopping agent.
Your job is to score (1) the shopper's purchase intent and (2) whether the agent achieved a conversion outcome...
"""

Sources: examples/e-commerce/custom_metrics.py:1-40
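
An LLM judge returns text, so its output needs checking against the schema before it is scored. The sketch below does this with the standard library only, using the `ConversionSchema` field names from the example above. It assumes both scores use the 0.0-1.0 scale described for quantitative metrics, and `parse_judge_output` is a hypothetical helper rather than an arksim function (the real examples use a Pydantic model for this).

```python
import json

SCORE_FIELDS = ("intent_strength", "conversion_outcome")


def parse_judge_output(raw: str) -> dict:
    """Parse judge JSON and check it against the ConversionSchema fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    for name in SCORE_FIELDS:
        value = data.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within 0.0-1.0, got {value}")
    evidence = data.get("evidence")
    if not isinstance(evidence, list) or not all(isinstance(e, str) for e in evidence):
        raise ValueError("evidence must be a list of strings")
    if not isinstance(data.get("reason"), str):
        raise ValueError("reason must be a string")
    return data


sample = '{"intent_strength": 0.8, "conversion_outcome": 1.0, "evidence": ["added to cart"], "reason": "checkout completed"}'
print(parse_judge_output(sample)["reason"])
```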

Score Calculation

The evaluator computes a Final Score as a weighted average:

Component	Weight	Description
Turn Success Ratio	40%	Ratio of successful turns to total turns
Goal Completion Score	60%	LLM-assessed goal achievement score

The final score maps to a status:

Status	Condition
Done	Final score = 1.0
Partial Failure	0.0 < Final score < 1.0
Complete Failure	Final score = 0.0

Sources: arksim/utils/html_report/report_template.html:1-50
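
As a worked example of the weighting and status rules above, the sketch below computes the final score from its two components. The weights and status names come from the tables. The function names, the treatment of zero turns, and the use of a floating-point tolerance at the boundaries are assumptions made for illustration.

```python
import math

GOAL_WEIGHT = 0.6  # Goal Completion Score
TURN_WEIGHT = 0.4  # Turn Success Ratio


def final_score(goal_completion: float, successful_turns: int, total_turns: int) -> float:
    """Weighted average of goal completion and turn success ratio."""
    # Assumption: a conversation with no turns has a ratio of 0.0.
    ratio = successful_turns / total_turns if total_turns else 0.0
    return GOAL_WEIGHT * goal_completion + TURN_WEIGHT * ratio


def status(score: float) -> str:
    """Map a final score to the status labels used in the report."""
    if math.isclose(score, 1.0):
        return "Done"
    if math.isclose(score, 0.0, abs_tol=1e-9):
        return "Complete Failure"
    return "Partial Failure"


# Goal half achieved, 3 of 4 turns successful: 0.6 * 0.5 + 0.4 * 0.75
score = final_score(goal_completion=0.5, successful_turns=3, total_turns=4)
print(round(score, 2), status(score))
```

Because goal completion carries the larger weight, a conversation in which every turn succeeds but the goal is missed still scores only 0.4.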

Data Flow

graph LR
    subgraph Input["Input"]
        Config[config.yaml]
        Scenarios[scenarios.json]
        Metrics[custom_metrics.py]
    end

    subgraph Process["Processing"]
        Sim[Simulation]
        Eval[Evaluation]
        Score[Scoring]
    end

    subgraph Output["Output"]
        Results[simulation.json]
        Report[evaluation.html]
    end

    Config --> Sim
    Scenarios --> Sim
    Sim --> Eval
    Metrics --> Eval
    Eval --> Score
    Score --> Results
    Score --> Report

Agent Connector Types

ArkSim supports multiple agent integration patterns via the configuration system:

# Option 1: Custom Python Agent
agent_config:
  agent_type: custom
  agent_name: my-agent

# Option 2: Chat Completions HTTP Endpoint
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

# Option 3: A2A Protocol
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

Sources: README.md
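
The three options differ mainly in whether an endpoint is needed. The following sketch checks a parsed `agent_config` mapping against the shapes shown above. It assumes that `agent_name` is always required and that `chat_completions` and `a2a` need an HTTP(S) `api_config.endpoint`; those rules are inferred from the examples, and `check_agent_config` is a hypothetical helper.

```python
ENDPOINT_TYPES = {"chat_completions", "a2a"}
AGENT_TYPES = ENDPOINT_TYPES | {"custom"}


def check_agent_config(agent_config: dict) -> list[str]:
    """Return problems found in an agent_config mapping."""
    problems = []
    agent_type = agent_config.get("agent_type")
    if agent_type not in AGENT_TYPES:
        problems.append(f"unknown agent_type: {agent_type!r}")
    if not agent_config.get("agent_name"):
        problems.append("agent_name is required")
    endpoint = (agent_config.get("api_config") or {}).get("endpoint") or ""
    if agent_type in ENDPOINT_TYPES and not endpoint.startswith(("http://", "https://")):
        problems.append(f"{agent_type} needs an http(s) api_config.endpoint")
    return problems


print(check_agent_config({"agent_type": "custom", "agent_name": "my-agent"}))
print(check_agent_config({"agent_type": "a2a", "agent_name": "my-agent"}))
```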

User Simulator

The user simulator generates realistic multi-turn conversations based on:

Scenario definitions: Predefined conversation flows and expected behaviors
User profiles: Demographic and behavioral attributes for persona simulation
Knowledge bases: Domain-specific information the simulated user possesses
Conversation history: Full context of the multi-turn interaction

The simulator uses an LLM to generate user responses that adapt to the agent's actions and the conversation state as the dialogue progresses.

HTML Report Generation

After evaluation completes, ArkSim generates an interactive HTML report containing:

Section	Content
Summary Statistics	Overall scores, pass/fail rates
Per-Scenario Metrics	Individual metric scores with reasoning
Failure Categories	Grouped analysis of failure types
Conversation Transcripts	Full turn-by-turn dialogue viewer

Sources: arksim/utils/html_report/report_template.html:1-100

Integration Examples

ArkSim provides example integrations for popular agent frameworks:

Framework	Example Location	Connection Method
LangGraph	`examples/integrations/langgraph/`	Custom agent connector
LangChain	`examples/integrations/langchain/`	Custom agent connector
Claude Agent SDK	`examples/integrations/claude-agent-sdk/`	Custom agent connector
AutoGen	`examples/integrations/autogen/`	Custom agent connector
Pydantic AI	`examples/integrations/pydantic-ai/`	Custom agent connector
Dify	`examples/integrations/dify/`	HTTP client

Sources: examples/integrations/langgraph/README.md, examples/integrations/claude-agent-sdk/README.md, examples/integrations/autogen/README.md

Configuration Schema

The main configuration file (config.yaml) controls all aspects of simulation and evaluation:

Section	Key Options
`agent_config`	`agent_type`, `agent_name`, `api_config.endpoint`
`scenario_file`	Path to scenarios.json
`metrics_to_run`	List of metric names to execute
`custom_metrics_file_paths`	Paths to custom metric Python files
`trace_receiver.enabled`	Enable TracingProcessor capture (default: false)
`trace_receiver.wait_timeout`	Timeout for trace capture (seconds)

Sources: examples/customer-service/README.md

Web UI

ArkSim includes a web-based control plane for managing simulations:

graph TD
    subgraph UI["Web UI"]
        Build[Build Scenarios]
        Load[Load Existing]
        Run[Run Simulation]
        View[View Results]
    end

    subgraph Features["UI Features"]
        AutoGen[Auto-generate Scenarios PRO]
        Browse[File Browser]
        Refresh[Refresh Results]
    end

Features include:

Scenario building and loading
Auto-generate scenarios (PRO feature)
File browser integration
Results viewing and refresh

Sources: arksim/ui/frontend/index.html:1-80

Summary

ArkSim provides a comprehensive multi-turn agent evaluation framework with:

Flexible agent integration: Support for custom Python agents, HTTP endpoints, and A2A protocol
Tool call capture: Two mechanisms for capturing agent tool executions
Customizable metrics: Both quantitative and qualitative evaluation approaches
Interactive reporting: HTML-based results with conversation viewer
CLI and UI: Command-line and web-based interfaces for running evaluations
Framework integrations: Pre-built examples for LangGraph, LangChain, Claude Agent SDK, AutoGen, Pydantic AI, and Dify

Sources: arksim/cli.py:1-100

Simulation Engine

Related topics: System Architecture Overview, Evaluation System, Scenario Management

The Simulation Engine is the core component of ArkSim responsible for executing multi-turn conversations between simulated users and agent systems under test. It orchestrates the entire simulation lifecycle, from scenario loading to agent execution, while capturing tool calls, responses, and conversation state for downstream evaluation.

Architecture Overview

The Simulation Engine follows a layered architecture that separates concerns between scenario management, agent execution, tool call capture, and knowledge handling.

graph TD
    A[Scenarios JSON] --> B[Simulator]
    C[Config YAML] --> B
    B --> D[Agent Executor]
    D --> E[BaseAgent]
    E --> F[Custom Agent / Chat Completions / A2A]
    F --> G[Tool Calls Capture]
    G --> H[Simulation Results]
    D --> I[Multi-Knowledge Handler]
    I --> J[Knowledge Sources]
    
    style B fill:#e1f5fe
    style D fill:#fff3e0
    style E fill:#e8f5e9

Core Components

Component	File	Purpose
Simulator	`simulator.py`	Orchestrates the simulation lifecycle
Entities	`entities.py`	Data models for scenarios, profiles, and turns
Agent Base	`agent/base.py`	Abstract base class for all agent implementations
Tool Types	`tool_types.py`	Data models for tool calls and agent responses
Multi-Knowledge Handler	`core/multi_knowledge_handling.py`	Manages multiple knowledge sources for user profiles
Prompt Utilities	`utils/prompts.py`	Generates prompts for simulated users

Data Models

ToolCall

The ToolCall class represents a single tool or function call observed during a conversation turn.

class ToolCall(BaseModel):
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None

Key characteristics:

Declares extra="ignore" for forward compatibility with future versions
Supports capturing arguments, results, and error states
Tracks the source of tool calls (e.g., from response parsing or tracing)

Sources: arksim/simulation_engine/tool_types.py:26-37

AgentResponse

The AgentResponse class provides a structured return from agent execution.

class AgentResponse(BaseModel):
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)

Sources: arksim/simulation_engine/tool_types.py:45-52

BaseAgent Interface

All agent implementations must inherit from BaseAgent and implement the required async methods.

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"

Sources: README.md:45-55

Required Methods

Method	Return Type	Description
`get_chat_id()`	`str`	Returns a unique identifier for the chat session
`execute()`	`str | AgentResponse`	Executes the agent with a user query and returns content with optional tool calls

Sources: README.md:45-55
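
The interface can be tried out without ArkSim installed. The sketch below uses stand-in `BaseAgent` and `AgentResponse` definitions that only mirror the signatures shown above; in a real project they would be imported from `arksim.simulation_engine.agent.base` and `arksim.simulation_engine.tool_types`. The `EchoAgent` class and the `response_text` helper are hypothetical names introduced for illustration.

```python
import asyncio
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class AgentResponse:
    """Stand-in for arksim's AgentResponse (content plus optional tool calls)."""
    content: str
    tool_calls: list = field(default_factory=list)


class BaseAgent(ABC):
    """Stand-in that mirrors the two required async methods."""

    @abstractmethod
    async def get_chat_id(self) -> str: ...

    @abstractmethod
    async def execute(self, user_query: str, **kwargs: object) -> "str | AgentResponse": ...


class EchoAgent(BaseAgent):
    """Hypothetical agent: one stable chat id per instance, echoes the query."""

    def __init__(self) -> None:
        self._chat_id = str(uuid.uuid4())

    async def get_chat_id(self) -> str:
        return self._chat_id

    async def execute(self, user_query: str, **kwargs: object) -> "str | AgentResponse":
        return AgentResponse(content=f"You said: {user_query}")


def response_text(response: "str | AgentResponse") -> str:
    """Normalize either allowed return type to plain text."""
    return response if isinstance(response, str) else response.content


async def demo() -> str:
    agent = EchoAgent()
    return response_text(await agent.execute("hello"))
```

Generating the chat id once in `__init__` keeps it stable across turns, which is one plausible reading of "unique identifier for the chat session"; whether ArkSim expects one id per instance or per conversation is not stated here.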

Simulation Workflow

graph LR
    A[Load Config] --> B[Load Scenarios]
    B --> C[Initialize Agent]
    C --> D[For Each Scenario]
    D --> E[Generate User Profile]
    E --> F[Execute Turn]
    F --> G[Capture Tool Calls]
    G --> H[Record Response]
    H --> I{More Turns?}
    I -->|Yes| F
    I -->|No| J[Save Results]
    J --> K{More Scenarios?}
    K -->|Yes| D
    K -->|No| L[Complete]

Step 0: Build Scenarios

Before running a simulation, users must create or load test scenarios. The UI provides options to:

Auto-generate Scenarios (Pro feature) - Automatically generate realistic test scenarios from the agent's knowledge base
Load Existing - Load scenario files from a specified path

Sources: arksim/ui/frontend/index.html:1-50

Step 1: Simulation Execution

The simulator executes each scenario by:

Loading the scenario configuration and user profiles
Generating multi-turn conversations based on the scenario goals
Capturing all tool calls and responses
Producing a simulation.json output file

Sources: examples/integrations/dify/README.md:18-20
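
The execution steps can be pictured as a simple turn loop. The following is a minimal sketch, not ArkSim's simulator: the user messages are scripted rather than generated by a simulated user, `scripted_agent` is a hypothetical stand-in for the agent under test, and the keys of the resulting JSON (`conversations`, `turns`, and so on) are assumptions rather than the real `simulation.json` schema.

```python
import asyncio
import json


async def scripted_agent(user_query: str) -> str:
    """Hypothetical agent under test; a real run would call BaseAgent.execute."""
    return f"Handled: {user_query}"


async def run_scenario(scenario_id: str, user_turns: list) -> dict:
    """Run one scenario turn by turn and collect a transcript."""
    turns = []
    for index, user_message in enumerate(user_turns):
        agent_message = await scripted_agent(user_message)
        turns.append(
            {"turn": index, "user": user_message, "agent": agent_message, "tool_calls": []}
        )
    return {"scenario_id": scenario_id, "turns": turns}


def run_all(scenarios: dict) -> str:
    """Run every scenario in order and serialize the results as JSON text."""

    async def _run() -> list:
        return [await run_scenario(sid, turns) for sid, turns in scenarios.items()]

    return json.dumps({"conversations": asyncio.run(_run())}, indent=2)
```

Calling `run_all({"refund": ["Hi", "I want a refund"]})` returns JSON text with one conversation of two turns, which could then be written to a results file.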

Agent Types

ArkSim supports multiple agent integration patterns:

Agent Type	Configuration	Use Case
Custom Python Class	`agent_type: custom`	Full control via subclassing `BaseAgent`
Chat Completions	`agent_type: chat_completions`	HTTP endpoint compatible with OpenAI format
A2A Protocol	`agent_type: a2a`	Agent-to-Agent protocol endpoints

Sources: README.md:57-68

Chat Completions Configuration

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

A2A Protocol Configuration

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

Sources: README.md:57-68

Tool Call Capture

ArkSim provides two mechanisms for capturing tool calls:

Response-Based Capture (Default)

The agent returns an AgentResponse containing explicit tool calls:

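async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    # self.agent is your framework's agent object; run() stands in for its invoke call
    run_result = self.agent.run(user_query)
    return AgentResponse(
        content=run_result.text,
        # extract_tool_calls is a helper you write to map framework output to ToolCall objects
        tool_calls=extract_tool_calls(run_result),
    )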

Tracing-Based Capture (Automatic)

For agents using the ArkSim tracing processor, tool calls are captured automatically without modifying the agent response:

# At module load
from arksim.simulation_engine.tracing import ArksimTracingProcessor
agent.register(ArksimTracingProcessor())

# Agent returns plain str
async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    return self.agent.run(user_query)  # Returns plain string

Sources: examples/customer-service/README.md:1-35
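
For response-based capture, the agent author has to translate the framework's own record of tool usage into `ToolCall` objects. A minimal sketch of such a helper follows. It assumes a hypothetical framework that reports each step as a dict with `tool`, `input`, `output`, and optional `id` and `error` keys; real frameworks expose this differently. The `ToolCall` dataclass is a stdlib stand-in with the same field names as the ArkSim model.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    """Stand-in with the same fields as arksim's ToolCall model (minus source)."""
    id: str
    name: str
    arguments: dict = field(default_factory=dict)
    result: Optional[str] = None
    error: Optional[str] = None


def extract_tool_calls(steps: list) -> list:
    """Map framework step records (hypothetical shape) to ToolCall objects."""
    calls = []
    for step in steps:
        output = step.get("output")
        calls.append(
            ToolCall(
                # Fall back to a generated id when the framework does not supply one.
                id=step.get("id") or str(uuid.uuid4()),
                name=step["tool"],
                arguments=dict(step.get("input", {})),
                # ToolCall.result is a string, so stringify structured outputs.
                result=None if output is None else str(output),
                error=step.get("error"),
            )
        )
    return calls
```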

Configuration

CLI Usage

# Combined simulation and evaluation
arksim simulate-evaluate config.yaml

# Separate steps
arksim simulate config_simulate.yaml
arksim evaluate config_evaluate.yaml

Sources: examples/customer-service/README.md:8-14

Trace Receiver Configuration

trace_receiver:
  enabled: true
  wait_timeout: 5

When trace_receiver.enabled is false, ArkSim only captures tool calls from AgentResponse.

Sources: examples/customer-service/README.md:30-35

Integration Examples

ArkSim provides pre-built integrations for popular agent frameworks:

Framework	Package	Example Path
LangGraph	`langgraph`, `langchain-openai`	`examples/integrations/langgraph/`
AutoGen	`autogen-agentchat`, `autogen-ext[openai]`	`examples/integrations/autogen/`
Claude Agent SDK	`claude-agent-sdk`	`examples/integrations/claude-agent-sdk/`
CrewAI	`crewai`	`examples/integrations/crewai/`
Pydantic AI	`pydantic-ai`	`examples/integrations/pydantic-ai/`
LangChain	`langgraph`, `langchain-openai`	`examples/integrations/langchain/`

Sources: examples/integrations/langgraph/README.md, examples/integrations/autogen/README.md, examples/integrations/claude-agent-sdk/README.md, examples/integrations/crewai/README.md, examples/integrations/pydantic-ai/README.md, examples/integrations/langchain/README.md

Output Format

Simulation results are written to ./results/simulation/simulation.json containing:

Complete conversation transcripts
Captured tool calls with arguments and results
Turn-by-turn timing information
User profile data
Scenario metadata

Sources: examples/integrations/dify/README.md:18-20

Custom Metrics Support

The simulation engine supports custom quantitative and qualitative metrics through the evaluator. Custom metric files can be added to custom_metrics/ directories and referenced in the configuration.

Sources: examples/customer-service/custom_metrics.py:1-20

Forward Compatibility

Both ToolCall and AgentResponse declare extra="ignore" in their Pydantic configuration. This ensures that snapshots from future ArkSim versions containing new fields can be loaded by older versions without raising ValidationError.

model_config = ConfigDict(extra="ignore")

Sources: arksim/simulation_engine/tool_types.py:26-28, arksim/simulation_engine/tool_types.py:45-47
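
The effect of ignoring unknown fields can be illustrated without Pydantic. The sketch below is a stdlib stand-in, not ArkSim's model: `ToolCallRecord` and `from_snapshot` are hypothetical names, and the filtering is done by hand to mimic what `extra="ignore"` does during validation.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ToolCallRecord:
    """Stdlib stand-in; the real ToolCall is a Pydantic model."""
    id: str
    name: str
    arguments: dict = field(default_factory=dict)
    result: Optional[str] = None

    @classmethod
    def from_snapshot(cls, data: dict) -> "ToolCallRecord":
        """Keep known fields and silently drop the rest, like extra="ignore"."""
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in data.items() if key in known})
```

A snapshot written by a newer version, for example one that adds a `latency_ms` key, loads cleanly and the extra key is simply discarded.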

Evaluation System

Related topics: System Architecture Overview, Simulation Engine

The Evaluation System in ArkSim is responsible for scoring simulated conversations between users and agents. It measures goal completion, helpfulness, coherence, and other quality metrics to identify where agents succeed and where they fail.

Overview

After simulation generates conversation transcripts, the evaluator analyzes each conversation and assigns scores across multiple dimensions. The system supports both built-in metrics and custom-defined metrics.

Purpose:

Quantify agent performance objectively
Identify specific failure patterns
Generate actionable HTML reports for debugging

Scope:

Scores conversations on a 0.0-1.0 scale
Computes weighted final scores combining multiple metrics
Categorizes outcomes as done, partial failure, or failed
Supports LLM-based evaluation (using configured providers like OpenAI)

Sources: README.md

Architecture

graph TD
    A[Simulation Results] --> B[Evaluator]
    B --> C[Built-in Metrics]
    B --> D[Custom Metrics]
    B --> E[Tool Call Metrics]
    C --> F[Final Scores]
    D --> F
    E --> F
    F --> G[HTML Report Generator]
    G --> H[evaluation/final_report.html]

Core Components

Component	File	Purpose
Evaluator	`evaluator.py`	Main orchestration of evaluation pipeline
Base Metric	`base_metric.py`	Abstract base classes for metrics
Built-in Metrics	`builtin_metrics.py`	Standard metrics (faithfulness, helpfulness, etc.)
Tool Call Metrics	`tool_call_metrics.py`	Evaluation of tool usage patterns
Trajectory Matching	`trajectory_matching.py`	Compare expected vs actual agent trajectories
Thresholds	`thresholds.py`	Score classification logic
HTML Report	`generate_html_report.py`	Generate visual evaluation reports

Evaluation Workflow

sequenceDiagram
    participant S as Simulator
    participant E as Evaluator
    participant M as Metrics
    participant R as Report Generator
    
    S->>E: Simulation results (JSON)
    E->>M: For each conversation
    M->>M: Run quantitative metrics
    M->>M: Run qualitative metrics
    M->>E: Per-metric scores
    E->>E: Calculate final scores
    E->>E: Determine status
    E->>R: Score data
    R->>R: Generate HTML

Step-by-Step Process

Load Simulation Data: Read conversation transcripts from simulation output
Select Metrics: Determine which built-in and custom metrics to run
Score Each Conversation: Apply metrics to score goal completion, behavior, etc.
Compute Final Scores: Calculate weighted averages (goal_completion_weight=0.25, turn_success_ratio_weight=0.75)
Classify Status: Assign done/partial_failure/failed based on thresholds
Generate Report: Create HTML report with detailed breakdowns

Sources: arksim/utils/html_report/report_template.html

Scoring System

Score Ranges

Score Type	Range	Description
Quantitative	0.0 - 1.0	Numeric scores for measurable criteria
Qualitative	Label-based	Categorical results (compliant, professional, pass, etc.)
Goal Completion	0.0 - 1.0	Whether agent completed user goal
Final Score	0.0 - 1.0	Weighted combination of metrics

Status Classification

Status	Condition	Description
Done	final_score == 1.0	Perfect performance, goal completed
Partial Failure	0.6 <= final_score < 1.0	Acceptable but with some failures
Failed	final_score < 0.6	Poor performance requiring attention

Sources: arksim/utils/html_report/report_template.html:85-87
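
The weighting and the status table can be combined into a few lines of arithmetic. This is a sketch based only on the weights and thresholds quoted in this section; the function names are hypothetical, and how ArkSim derives the turn success ratio is not covered here, so it is taken as an input.

```python
GOAL_COMPLETION_WEIGHT = 0.25
TURN_SUCCESS_RATIO_WEIGHT = 0.75


def final_score(goal_completion: float, turn_success_ratio: float) -> float:
    """Weighted combination of the two components, both on a 0.0-1.0 scale."""
    return (
        GOAL_COMPLETION_WEIGHT * goal_completion
        + TURN_SUCCESS_RATIO_WEIGHT * turn_success_ratio
    )


def classify(score: float) -> str:
    """Apply the status table from top to bottom."""
    if score == 1.0:
        return "done"
    if score >= 0.6:
        return "partial_failure"
    return "failed"
```

For example, a conversation that completes the goal (1.0) with a turn success ratio of 0.6 scores 0.25 + 0.45 = 0.70 and is classified as a partial failure, while one that misses the goal (0.0) with a ratio of 0.5 scores 0.375 and is classified as failed.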

Built-in Metrics

The system provides seven built-in metrics that can be selected via configuration:

Metric	Purpose
`faithfulness`	Did the agent provide factually accurate information?
`helpfulness`	Was the agent's response useful to the user?
`coherence`	Were responses logically connected and consistent?
`verbosity`	Did the agent maintain appropriate response length?
`relevance`	Did responses address the user's actual query?
`goal_completion`	Did the agent help the user accomplish their goal?
`agent_behavior_failure`	Did the agent exhibit any problematic behaviors?

Metric Selection

From the frontend, users can select which built-in metrics to run:

<template x-for="m in ['faithfulness', 'helpfulness', 'coherence', 'verbosity', 'relevance', 'goal_completion', 'agent_behavior_failure']" :key="m">

If no metrics are selected, all built-in metrics run by default.

Sources: arksim/ui/frontend/index.html
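
The default-to-all behavior can be sketched as a small resolver. The function name `resolve_metrics` is hypothetical, and two details are assumptions rather than documented behavior: that custom metrics are included when nothing is selected, and that unknown names are rejected with an error.

```python
BUILTIN_METRICS = [
    "faithfulness",
    "helpfulness",
    "coherence",
    "verbosity",
    "relevance",
    "goal_completion",
    "agent_behavior_failure",
]


def resolve_metrics(selected=None, custom_names=()):
    """Return the metric names to run; an empty selection means everything."""
    available = BUILTIN_METRICS + list(custom_names)
    if not selected:
        return available
    unknown = [name for name in selected if name not in available]
    if unknown:
        # Assumption: fail fast on typos instead of silently skipping metrics.
        raise ValueError(f"Unknown metrics: {', '.join(unknown)}")
    return list(selected)
```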

Custom Metrics

Developers can define custom evaluation metrics by creating a Python module.

Creating Custom Metrics

from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
)

class MyMetric(QuantitativeMetric):
    name = "my_custom_metric"
    schema = MySchema  # Pydantic model
    
    async def evaluate(self, input: ScoreInput) -> QuantResult:
        # Evaluation logic
        return QuantResult(...)

Quantitative vs Qualitative Metrics

Type	Base Class	Output
Quantitative	`QuantitativeMetric`	`QuantResult` with float value and reason
Qualitative	`QualitativeMetric`	`QualResult` with categorical label

Configuration

Add custom metrics to config.yaml:

custom_metrics_file_paths:
  - /path/to/custom_metrics.py

metrics_to_run:
  - my_custom_metric  # optional; runs all if omitted

Sources: examples/customer-service/custom_metrics.py

Example: Verification Compliance

class VerificationComplianceSchema(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str

The evaluation prompt instructs the LLM to score:

Identity Verification: Did the agent verify customer identity before actions?
Action Gating: Did the agent gate sensitive actions behind verification?

Sources: examples/customer-service/custom_metrics.py:25-35

Example: E-commerce Conversion

class ConversionSchema(BaseModel):
    intent_strength: float
    conversion_outcome: float
    evidence: list[str]
    reason: str

Metrics track:

Intent Strength: How ready the shopper is to buy
Conversion Outcome: Whether the agent achieved a purchase decision

Sources: examples/e-commerce/custom_metrics.py

HTML Report

The evaluation system generates a detailed HTML report (evaluation/final_report.html) containing:

Report Sections

Section	Content
Summary Statistics	Overall pass/fail rates, average scores
Conversations Table	Per-conversation scores, status badges
Detailed Breakdown	Goal completion, final scores, failure reasons
Score Reasons	LLM-generated explanations for each metric

Report Features

Interactive Table: Sort and filter conversations
Score Details: Expandable sections showing metric-by-metric breakdown
Status Badges: Visual indicators (done/partial/failed)
Tooltip Explanations: Hover info for column headers

Score Display Logic

const POSITIVE_LABELS = ['compliant', 'professional', 'pass', 'good', 'complete', 'no failure', 'ok'];
const NEGATIVE_LABELS = ['flagged', 'unprofessional', 'fail', 'error', 'poor', 'missing', 'violated', 'partial'];

Qualitative scores are automatically classified as positive, negative, or neutral based on label matching.

Sources: arksim/utils/html_report/report_template.html:78-81
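
A Python rendering of the same classification is sketched below. The label lists are copied from the template; the matching rule is an assumption (exact match after trimming and lowercasing), since the template's own comparison logic is not shown here. Exact matching also avoids misreading "no failure" as negative because it contains "fail".

```python
POSITIVE_LABELS = {"compliant", "professional", "pass", "good", "complete", "no failure", "ok"}
NEGATIVE_LABELS = {"flagged", "unprofessional", "fail", "error", "poor", "missing", "violated", "partial"}


def classify_label(label: str) -> str:
    """Return "positive", "negative", or "neutral" for a qualitative label."""
    normalized = label.strip().lower()
    if normalized in POSITIVE_LABELS:
        return "positive"
    if normalized in NEGATIVE_LABELS:
        return "negative"
    # Labels outside both lists are displayed without a positive or negative tint.
    return "neutral"
```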

Configuration

Evaluation Configuration Options

Parameter	Type	Default	Description
`evalProvider`	string	-	LLM provider for evaluation
`evalModel`	string	-	Model to use for scoring
`metricsToRun`	list	all	Which built-in metrics to execute
`customMetricsFilePaths`	list	[]	Paths to custom metric modules
`evalNumWorkers`	int	auto	Parallel evaluation workers

Provider Selection

The evaluator supports multiple LLM providers:

OpenAI
Azure OpenAI
Custom endpoints (via chat_completions type)
A2A protocol agents

Sources: arksim/ui/frontend/index.html

Integration with Simulation

The evaluation system is typically invoked via the CLI:

# Combined simulation and evaluation
arksim simulate-evaluate config.yaml

# Separate steps
arksim simulate config_simulate.yaml
arksim evaluate config_evaluate.yaml

Pipeline Flow

graph LR
    A[config.yaml] --> B[Simulator]
    B --> C[simulation.json]
    C --> D[Evaluator]
    D --> E[evaluation/]
    E --> F[final_report.html]
    E --> G[scores.json]

Results are written to ./results/simulation/simulation.json for simulation and ./results/evaluation/ for evaluation output.

Sources: examples/integrations/dify/README.md

Advanced Features

Tool Call Capture

ArkSim can automatically capture tool calls via tracing:

trace_receiver:
  enabled: true
  wait_timeout: 5

When enabled, tool calls are captured automatically, without the agent having to return them explicitly in an AgentResponse.

Trajectory Matching

The trajectory_matching.py module compares expected agent trajectories against actual behavior, useful for validating that agents follow prescribed action sequences.
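
To make the idea concrete, here are two common ways to compare an expected sequence of tool names against the observed one. This is an illustrative sketch, not the algorithm in `trajectory_matching.py`; the function names and the choice of matching modes are assumptions.

```python
def exact_match(expected, actual) -> bool:
    """True when the agent made exactly the expected calls in the same order."""
    return list(expected) == list(actual)


def in_order_match(expected, actual) -> bool:
    """True when expected appears in actual as an ordered subsequence.

    Extra calls in between are tolerated; missing or reordered calls are not.
    """
    remaining = iter(actual)
    # "step in remaining" consumes the iterator up to the first match, so each
    # later step must be found after the previous one.
    return all(step in remaining for step in expected)
```

With `expected = ["verify_identity", "issue_refund"]`, an observed trajectory of `["verify_identity", "lookup_order", "issue_refund"]` passes the in-order check but fails the exact check, while `["issue_refund", "verify_identity"]` fails both.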

Thresholds

The thresholds.py module defines score boundaries and classification logic for determining pass/fail conditions.

Summary

The Evaluation System provides comprehensive, configurable scoring of agent conversations:

Flexible Metric System: Built-in metrics cover common quality dimensions; custom metrics extend evaluation to domain-specific criteria
LLM-based Scoring: Uses configurable language models to generate nuanced, explainable scores
Visual Reporting: HTML reports make results easy to understand and act upon
Status Classification: Automatic categorization into done/partial_failure/failed for quick assessment
Integration-ready: Works with Python agents, chat completions endpoints, and A2A protocol agents

Sources: README.md

Agent Types and Integration

Related topics: LLM Provider Integration, Tool Call Capture, Configuration System

ArkSim supports multiple agent connection types to accommodate different architectures and deployment scenarios. This page documents the available agent types, their configuration methods, and integration patterns.

Overview

ArkSim provides a flexible agent integration system that enables testing of various agent implementations through a unified simulation interface. The simulator communicates with agents via standardized protocols and captures responses for evaluation.

The agent integration system consists of:

A BaseAgent abstract class that defines the agent interface
Multiple client implementations for different connection types
A factory pattern for agent instantiation based on configuration
Support for tool call capture through both explicit responses and tracing

Sources: README.md

Supported Agent Types

ArkSim supports three primary agent types, each suited for different deployment scenarios.

Agent Type	Description	Use Case
`custom`	Python class extending BaseAgent	Custom agent logic, no external server required
`chat_completions`	HTTP endpoint with OpenAI-compatible API	Existing REST APIs, external agent services
`a2a`	Agent-to-Agent protocol endpoint	Multi-agent systems, A2A-compliant agents

Sources: arksim/cli.py:100-109
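
The factory pattern mentioned in the overview can be sketched as a lookup from `agent_type` to a client class. All class and function names below are hypothetical; ArkSim's actual client classes and their constructors are not shown in this manual. Defaulting to `custom` when `agent_type` is absent is also an assumption.

```python
class CustomAgentClient:
    """Hypothetical client that would load and call a BaseAgent subclass."""

    def __init__(self, agent_config: dict) -> None:
        self.name = agent_config.get("agent_name", "")


class HttpAgentClient:
    """Hypothetical base for clients that talk to a remote endpoint."""

    def __init__(self, agent_config: dict) -> None:
        self.name = agent_config.get("agent_name", "")
        self.endpoint = agent_config["api_config"]["endpoint"]


class ChatCompletionsClient(HttpAgentClient):
    """Would POST OpenAI-style chat completion requests to the endpoint."""


class A2AClient(HttpAgentClient):
    """Would speak the Agent-to-Agent protocol to the endpoint."""


AGENT_CLIENTS = {
    "custom": CustomAgentClient,
    "chat_completions": ChatCompletionsClient,
    "a2a": A2AClient,
}


def create_agent_client(agent_config: dict):
    """Pick a client class from the agent_type value in agent_config."""
    agent_type = agent_config.get("agent_type", "custom")
    try:
        client_cls = AGENT_CLIENTS[agent_type]
    except KeyError:
        raise ValueError(f"Unsupported agent_type: {agent_type!r}") from None
    return client_cls(agent_config)
```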

Custom Agent (Python Class)

The custom agent type is the default integration method. It requires a Python class that extends BaseAgent and implements both the get_chat_id and execute methods.

Implementation Pattern

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"

Sources: README.md

Returning Tool Calls

To enable tool call evaluation, return an AgentResponse object instead of a plain string. This allows the evaluator to assess whether the agent correctly invoked required tools.

from arksim.simulation_engine.tool_types import AgentResponse, ToolCall

async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    tool_calls = [
        ToolCall(
            id="call-1",  # id is a required field on ToolCall
            name="search_knowledge_base",
            arguments={"query": user_query}
        )
    ]
    return AgentResponse(
        content="Found relevant information in the knowledge base.",
        tool_calls=tool_calls
    )

Traced Agent Variant

For agents using OpenTelemetry-based instrumentation, ArkSim supports automatic tool call capture through the TracingProcessor interface. This eliminates the need to explicitly return tool calls in AgentResponse.

Simulator sets routing context -> agent.execute() runs normally
-> SDK fires TracingProcessor.on_span_end -> arksim captures -> evaluator scores

Sources: examples/customer-service/README.md

Chat Completions Agent (HTTP API)

The chat_completions agent type connects to any HTTP endpoint implementing an OpenAI-compatible chat completions interface.

Configuration

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

Sources: README.md

Request Format

ArkSim sends requests following the OpenAI chat completions format:

{
  "model": "agent-model",
  "messages": [
    {"role": "user", "content": "user query"}
  ]
}

Response Handling

The endpoint should return responses in the standard chat completions format:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "agent response"
      }
    }
  ]
}

Sources: examples/integrations/dify/README.md
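
The request and response shapes above can be handled with the standard library alone. The helpers below are a sketch with hypothetical names; they only build and parse the JSON bodies and do not reflect how ArkSim's own client is written.

```python
import json


def build_request(user_query: str, model: str = "agent-model") -> bytes:
    """Encode a single-turn chat completions request body."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_query}]}
    return json.dumps(payload).encode("utf-8")


def parse_response(body: bytes) -> str:
    """Extract the assistant text from a chat completions response body."""
    data = json.loads(body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]
```

The bytes returned by `build_request` can be sent with any HTTP client as a POST with a `Content-Type: application/json` header, and the raw response body passed to `parse_response`.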

A2A Protocol Agent

The a2a agent type connects to agents implementing the Agent-to-Agent (A2A) protocol, enabling integration with multi-agent systems.

Configuration

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

Sources: README.md

A2A with Tool Calls

A2A agents can also surface tool calls for evaluation. The agent returns both the response content and tool call information in its A2A-formatted response.

CLI Initialization

ArkSim provides a CLI command to scaffold agent implementations:

arksim init --agent-type <type>

The --agent-type flag accepts the following values:

custom (default) - Generates a Python agent file
chat_completions - Configures HTTP endpoint connection
a2a - Configures A2A protocol connection

The --force flag overwrites existing files.

Sources: arksim/cli.py:94-118

Integration Architecture

The following diagram illustrates how ArkSim communicates with different agent types:

graph TD
    subgraph ArkSim["ArkSim"]
        Simulator["Simulator Engine"]
        Evaluator["Evaluator"]
    end
    
    subgraph AgentTypes["Agent Implementations"]
        CustomAgent["Custom Agent (Python)"]
        HTTPAgent["Chat Completions API"]
        A2AAgent["A2A Protocol Agent"]
    end
    
    Simulator -->|execute| CustomAgent
    Simulator -->|HTTP POST| HTTPAgent
    Simulator -->|A2A Protocol| A2AAgent
    
    CustomAgent -->|AgentResponse| Evaluator
    HTTPAgent -->|JSON Response| Evaluator
    A2AAgent -->|A2A Response| Evaluator

Agent Configuration in UI

The ArkSim web UI provides an "Agent Config" section for configuring agent connections. This section is dynamically loaded from the configuration YAML file.

<!-- Agent Config -->
<div class="t-surface rounded-xl border t-border p-5 mb-4">
    <h2 class="font-semibold t-heading mb-1">Agent Config</h2>
    <p class="text-xs t-caption mb-3">How arksim connects to your agent.</p>
</div>

Sources: arksim/ui/frontend/index.html

Framework Integrations

ArkSim includes pre-built integrations for popular agent frameworks through example projects:

Framework	Integration File	Protocol
LangChain/LangGraph	`custom_agent.py`	Python class
CrewAI	`custom_agent.py`	Python class
AutoGen	`custom_agent.py`	Python class
Claude Agent SDK	`custom_agent.py`	Python class
Pydantic AI	`custom_agent.py`	Python class
LlamaIndex	`custom_agent.py`	Python class
Smolagents	`custom_agent.py`	Python class
OpenAI Agents SDK	`custom_agent.py`	Python class
Dify	`custom_agent.py`	HTTP API
Rasa	`custom_agent.py`	HTTP API
OpenClaw	`config.yaml`	Chat Completions

Sources: examples/integrations/*/README.md

Running Simulations with Different Agent Types

Single Command

arksim simulate-evaluate config.yaml

Separate Steps

# Step 1: Simulate
arksim simulate config_simulate.yaml

# Step 2: Evaluate
arksim evaluate config_evaluate.yaml

Sources: examples/customer-service/README.md

Best Practices

Tool Call Capture: For accurate evaluation, ensure your agent returns tool calls either explicitly via AgentResponse or implicitly through tracing instrumentation.

Async Implementation: All agent implementations should use async/await patterns for proper integration with the simulation engine.

Error Handling: Agents should handle errors gracefully and return meaningful error messages that can be evaluated by the system.

Configuration Management: Use environment variables for sensitive configuration values like API keys and endpoints.

Sources: README.md

LLM Provider Integration

Related topics: Agent Types and Integration, Configuration System

Overview

The LLM Provider Integration system in ArkSim provides a unified abstraction layer for connecting to various Large Language Model (LLM) providers. This modular architecture enables the simulator to interact with different AI backends while maintaining a consistent interface for chat completions, streaming responses, and token usage tracking.

The system follows a provider-based pattern where each supported LLM service (OpenAI, Anthropic, Google, Azure OpenAI) implements a common interface defined in the base class. This design allows users to swap between providers without changing the core simulation logic.

Architecture

System Components

graph TD
    A[Simulation Engine] --> B[LLM Manager<br/>arksim/llms/chat/llm.py]
    B --> C[Base LLM<br/>arksim/llms/chat/base/base_llm.py]
    C --> D[OpenAI Provider<br/>providers/openai.py]
    C --> E[Anthropic Provider<br/>providers/anthropic.py]
    C --> F[Google Provider<br/>providers/google.py]
    C --> G[Azure OpenAI Provider<br/>providers/azure_openai.py]
    
    H[Configuration YAML] --> B
    I[Environment Variables] --> D
    I --> E
    I --> F
    I --> G

Provider Class Hierarchy

graph TD
    A[BaseLLM<br/>base/base_llm.py] --> B[OpenAIProvider<br/>providers/openai.py]
    A --> C[AnthropicProvider<br/>providers/anthropic.py]
    A --> D[GoogleProvider<br/>providers/google.py]
    A --> E[AzureOpenAIProvider<br/>providers/azure_openai.py]
    
    F[Provider Enum] --> A
    G[ChatMessage Model] --> A
    H[ChatCompletionResponse] --> A

Supported Providers

ArkSim supports the following LLM providers through dedicated provider implementations:

Provider	Provider Class	API Style	Streaming Support
OpenAI	`OpenAIProvider`	OpenAI API	Yes
Anthropic	`AnthropicProvider`	Anthropic API	Yes
Google	`GoogleProvider`	Google AI API	Yes
Azure OpenAI	`AzureOpenAIProvider`	Azure API	Yes

Each provider class inherits from BaseLLM and implements provider-specific API call logic while adhering to the common interface contract.

Base LLM Interface

The BaseLLM class defines the contract that all provider implementations must follow. This ensures consistent behavior across different LLM backends.

Core Methods

Method	Purpose	Parameters
`chat()`	Send a chat completion request	`messages`, `model`, `temperature`, `max_tokens`, `**kwargs`
`chat_stream()`	Stream chat completion responses	`messages`, `model`, `temperature`, `max_tokens`, `**kwargs`
`count_tokens()`	Calculate token usage for messages	`messages`, `model`

Data Models

The base module defines essential data structures used across all providers:

Model	Purpose
`ChatMessage`	Represents a single message with role and content
`ChatCompletionResponse`	Wraps the API response from providers
`Provider`	Enum identifying supported LLM providers
`ModelInfo`	Metadata about available models per provider
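
The contract can be sketched as an abstract class with an offline implementation. Everything below is a stand-in: the method names and parameters follow the table above, but whether the real methods are synchronous or asynchronous, their default values, and their return types are assumptions. `CannedProvider` is a hypothetical test double that never calls a network API, and its whitespace-based token count is only a crude approximation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass
class ChatMessage:
    """Stand-in for the ChatMessage model: a role and its text content."""
    role: str
    content: str


class BaseLLM(ABC):
    """Stand-in for the provider contract described in Core Methods."""

    @abstractmethod
    def chat(self, messages, model, temperature=0.7, max_tokens=None, **kwargs) -> str: ...

    @abstractmethod
    def chat_stream(self, messages, model, temperature=0.7, max_tokens=None, **kwargs) -> Iterator[str]: ...

    @abstractmethod
    def count_tokens(self, messages, model) -> int: ...


class CannedProvider(BaseLLM):
    """Hypothetical offline provider that always returns a fixed reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def chat(self, messages, model, temperature=0.7, max_tokens=None, **kwargs) -> str:
        return self.reply

    def chat_stream(self, messages, model, temperature=0.7, max_tokens=None, **kwargs) -> Iterator[str]:
        # Emit the reply one whitespace-separated chunk at a time.
        yield from self.reply.split()

    def count_tokens(self, messages, model) -> int:
        # Word count as a rough proxy; real providers use model tokenizers.
        return sum(len(message.content.split()) for message in messages)
```

Because every provider exposes the same three methods, calling code can swap `CannedProvider` for a real provider without other changes, which is the point of the shared base class.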

Provider Implementations

OpenAI Provider

The OpenAI provider connects to OpenAI's API endpoints for chat completions. It supports both standard and streaming responses.

Configuration Requirements:

Parameter	Source	Description
`OPENAI_API_KEY`	Environment variable	API key for authentication
`model`	Config/Parameter	Model identifier (e.g., `gpt-4`, `gpt-4-turbo`)

API Endpoint:

POST https://api.openai.com/v1/chat/completions

Sources: arksim/llms/chat/providers/openai.py

Anthropic Provider

The Anthropic provider integrates with Anthropic's Claude models through their API. It handles the distinct message format and API conventions used by Anthropic.

Configuration Requirements:

Parameter	Source	Description
`ANTHROPIC_API_KEY`	Environment variable	API key for authentication
`model`	Config/Parameter	Model identifier (e.g., `claude-3-opus-20240229`)

API Endpoint:

POST https://api.anthropic.com/v1/messages

Sources: arksim/llms/chat/providers/anthropic.py

Google Provider

The Google provider connects to Google's Gemini models via the Google AI API.

Configuration Requirements:

Parameter	Source	Description
`GOOGLE_API_KEY`	Environment variable	API key for authentication
`model`	Config/Parameter	Model identifier (e.g., `gemini-pro`)

API Endpoint:

POST https://generativelanguage.googleapis.com/v1/models/{model}:generateContent

Sources: arksim/llms/chat/providers/google.py

Azure OpenAI Provider

The Azure OpenAI provider enables integration with Azure-hosted OpenAI models, supporting enterprise deployments with Azure-specific authentication and endpoint configuration.

Configuration Requirements:

Parameter	Source	Description
`AZURE_OPENAI_API_KEY`	Environment variable	API key for Azure authentication
`AZURE_OPENAI_ENDPOINT`	Environment variable	Azure endpoint URL
`AZURE_OPENAI_DEPLOYMENT`	Config	Deployment name in Azure
`AZURE_OPENAI_API_VERSION`	Config	Azure API version

Sources: arksim/llms/chat/providers/azure_openai.py

Configuration

YAML Configuration Structure

LLM providers are configured through the config.yaml file used by the simulator:

llm:
  provider: openai  # or anthropic, google, azure_openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 2048
  
agent_config:
  # Agent-specific LLM settings
  provider: anthropic
  model: claude-3-opus-20240229

Environment Variables

Variable	Providers	Purpose
`OPENAI_API_KEY`	OpenAI, AutoGen, LangChain	OpenAI API authentication
`ANTHROPIC_API_KEY`	Claude Agent SDK	Anthropic API authentication
`GOOGLE_API_KEY`	Google	Google AI API authentication
`AZURE_OPENAI_API_KEY`	Azure OpenAI	Azure API authentication
`AZURE_OPENAI_ENDPOINT`	Azure OpenAI	Azure resource endpoint
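
A pre-flight check can catch a missing key before a long simulation starts. The sketch below is a minimal example under the assumption that the provider names match the `provider` values shown in the YAML above; the mapping and the function `missing_env_vars` are illustrative and not part of ArkSim.

```python
import os

# Assumed mapping, derived from the tables in this section.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "google": ["GOOGLE_API_KEY"],
    "azure_openai": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
}


def missing_env_vars(provider, environ=None):
    """Return the required variables that are unset or empty for a provider."""
    environ = os.environ if environ is None else environ
    if provider not in REQUIRED_ENV:
        raise ValueError(f"Unknown provider: {provider!r}")
    return [name for name in REQUIRED_ENV[provider] if not environ.get(name)]
```

Calling `missing_env_vars("azure_openai")` at startup and failing fast on a non-empty result gives a clearer message than an authentication error halfway through a run.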

Integration with Custom Agents

The LLM provider system integrates with the custom agent connector pattern used in ArkSim simulations. Custom agents can be configured to use any supported LLM provider.

graph LR
    A[Scenario JSON] --> B[Simulation Engine]
    B --> C[Custom Agent<br/>custom_agent.py]
    C --> D[LLM Provider]
    D --> E[External LLM API]

Integration Examples

ArkSim provides integration examples for various agent frameworks:

Framework	Example Path	LLM Provider Used
LangChain/LangGraph	`examples/integrations/langchain/`	OpenAI
Claude Agent SDK	`examples/integrations/claude-agent-sdk/`	Anthropic
LlamaIndex	`examples/integrations/llamaindex/`	OpenAI
CrewAI	`examples/integrations/crewai/`	OpenAI
AutoGen	`examples/integrations/autogen/`	OpenAI
Pydantic AI	`examples/integrations/pydantic-ai/`	OpenAI
Smolagents	`examples/integrations/smolagents/`	OpenAI
Dify	`examples/integrations/dify/`	Custom HTTP

Each integration demonstrates how to connect the framework's agent to ArkSim's simulation engine while delegating LLM calls to the appropriate provider.

Usage Flow

sequenceDiagram
    participant User
    participant Config as config.yaml
    participant LLM as LLM Manager
    participant Provider as Provider Class
    participant API as External LLM API
    
    User->>Config: Load configuration
    User->>LLM: Initialize with provider type
    LLM->>Provider: Create provider instance
    User->>LLM: chat(messages, model)
    LLM->>Provider: _chat_completion()
    Provider->>API: HTTP POST request
    API-->>Provider: Completion response
    Provider-->>LLM: Normalized response
    LLM-->>User: ChatCompletionResponse

Adding New Providers

To add support for a new LLM provider:

Create a new provider class inheriting from BaseLLM
Implement the required methods: chat(), chat_stream(), count_tokens()
Add the provider to the Provider enum in base_llm.py
Update the provider factory logic in llm.py to instantiate the new provider
Add integration tests and documentation

Sources: arksim/llms/chat/base/base_llm.py, arksim/llms/chat/llm.py
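
The steps above can be sketched as follows. Everything here is a stand-in: `BaseLLM`, `Provider`, and `create_llm` only imitate the shape described in this section, and the real signatures in base_llm.py and llm.py may differ. `EchoLLM` is a toy provider that makes no network calls.

```python
from abc import ABC, abstractmethod
from enum import Enum


class Provider(str, Enum):  # stand-in for the enum in base_llm.py
    OPENAI = "openai"
    ECHO = "echo"  # step 3: register the (hypothetical) new provider


class BaseLLM(ABC):  # stand-in; the real base class may differ
    def __init__(self, model):
        self.model = model

    @abstractmethod
    def chat(self, messages): ...

    @abstractmethod
    def chat_stream(self, messages): ...

    @abstractmethod
    def count_tokens(self, text): ...


class EchoLLM(BaseLLM):  # steps 1 and 2: subclass and implement the methods
    """Toy provider that echoes the last message back."""

    def chat(self, messages):
        return messages[-1]["content"]

    def chat_stream(self, messages):
        yield from self.chat(messages).split()

    def count_tokens(self, text):
        return len(text.split())  # crude whitespace approximation


_REGISTRY = {Provider.ECHO: EchoLLM}


def create_llm(provider, model):  # step 4: factory lookup
    try:
        return _REGISTRY[Provider(provider)](model)
    except (ValueError, KeyError):
        raise ValueError(f"Unsupported provider: {provider!r}") from None
```

A registry keyed by the enum keeps the factory free of `if`/`elif` chains, so adding a provider touches one dictionary entry rather than control flow.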

Error Handling

Provider implementations handle common error scenarios:

Error Type	Handling Strategy
Authentication failures	Raise `AuthenticationError` with helpful message
Rate limiting	Implement automatic retry with backoff
Invalid request parameters	Raise `ValidationError` with parameter details
Network timeouts	Retry with exponential backoff
Model not found	Raise `ModelNotFoundError` listing available models
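
For the two retry rows, a generic exponential backoff wrapper looks roughly like the sketch below. It is not ArkSim's implementation: `RateLimitError` is a stand-in exception, and the attempt count, delays, and jitter are arbitrary illustrative defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit or timeout error."""


def with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run `call`, retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

The `sleep` parameter exists only so tests can inject a fake and avoid real waiting. Jitter matters here because simulations may run many workers in parallel against the same rate limit.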

Best Practices

API Key Security: Store API keys in environment variables, never in configuration files committed to version control. ArkSim automatically loads keys from environment variables.

Token Tracking: Use the count_tokens() method to monitor token usage and estimate costs before running large-scale simulations.

Streaming for Large Responses: Enable streaming (chat_stream()) for scenarios expecting long agent responses to improve perceived responsiveness.

Provider Selection: Choose Azure OpenAI for enterprise deployments requiring compliance certifications and dedicated infrastructure.

Model Selection: Refer to the integration examples for recommended model configurations per provider and use case.

Sources: arksim/llms/chat/providers/openai.py

Tool Call Capture

Related topics: Evaluation System, Agent Types and Integration

Tool Call Capture is a core mechanism in ArkSim that observes and records every tool or function invocation made by an agent during a simulation. This captured data feeds directly into the evaluator, enabling metrics like tool call accuracy, error detection, and trajectory analysis.

Overview

Tool Call Capture serves as the bridge between agent execution and evaluation. Without accurate tool call capture, the evaluator cannot determine whether an agent:

Invoked the correct tools
Passed valid arguments
Handled errors appropriately
Followed the expected execution trajectory

ArkSim supports three distinct capture mechanisms:

Explicit capture via AgentResponse.tool_calls
Automatic capture via the ArksimTracingProcessor
A2A protocol capture via task artifacts

All three methods produce the same underlying ToolCall data model, ensuring consistent evaluation regardless of how the agent is implemented.

Data Models

ToolCall

The ToolCall class represents a single tool invocation observed during a turn:

class ToolCall(BaseModel):
    model_config = ConfigDict(extra="ignore")
    
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None

Field	Type	Description
`id`	`str`	Unique identifier for this tool call
`name`	`str`	Name of the tool/function invoked
`arguments`	`dict[str, Any]`	Arguments passed to the tool
`result`	`str | None`	Response returned by the tool
`error`	`str | None`	Error message if the call failed
`source`	`ToolCallSource | None`	Origin of the capture data

The extra="ignore" configuration ensures forward compatibility with future versions that add new fields.

ToolCallSource

The ToolCallSource enum indicates how tool call data was captured:

class ToolCallSource(str, Enum):
    AGENT_RESPONSE = "agent_response"
    TRACING_PROCESSOR = "tracing_processor"
    A2A_PROTOCOL = "a2a_protocol"

AgentResponse

For explicit capture, agents return structured responses:

class AgentResponse(BaseModel):
    model_config = ConfigDict(extra="ignore")
    
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)

Capture Methods

Explicit Capture (AgentResponse)

The default approach requires agents to explicitly return tool calls alongside their text response. This is the standard pattern for custom agents implementing BaseAgent.

graph TD
    A[Simulator invokes agent] --> B[Agent executes user_query]
    B --> C[Agent returns AgentResponse]
    C --> D[Simulator extracts tool_calls]
    D --> E[Evaluator scores trajectory]

Implementation pattern:

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse, ToolCall

class MyAgent(BaseAgent):
    async def execute(self, user_query: str, **kwargs) -> str | AgentResponse:
        # Agent logic here...
        tool_calls = [
            ToolCall(
                id="call_1",
                name="get_order_status",
                arguments={"order_id": "ORD-1001"},
                source="agent_response"
            )
        ]
        return AgentResponse(
            content="The order status is shipped.",
            tool_calls=tool_calls
        )

Sources: examples/customer-service/custom_agent.py

Tracing Processor (Automatic Capture)

The ArksimTracingProcessor uses the OpenAI Agents SDK's tracing interface to capture tool calls automatically, without requiring explicit return data.

graph TD
    A[Simulator sets routing context] --> B[Agent executes normally]
    B --> C[SDK fires on_span_end]
    C --> D[ArksimTracingProcessor captures]
    D --> E[Evaluator scores trajectory]

Registration pattern:

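from agents import Agent as SDKAgent, Runner, add_trace_processor
from arksim.tracing.openai import ArksimTracingProcessor

# Register once at module load. add_trace_processor comes from the OpenAI
# Agents SDK; this assumes the processor does not register itself on construction.
_tracing_processor = ArksimTracingProcessor()
add_trace_processor(_tracing_processor)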

The traced agent returns a plain str rather than AgentResponse:

async def execute(self, user_query: str, **kwargs) -> str:
    result = await Runner.run(self._sdk_agent, user_query)
    return result.final_output

Sources: examples/customer-service/traced_agent.py

A2A Protocol Capture

For agents implementing the Agent-to-Agent (A2A) protocol, tool calls are embedded in task artifacts using the A2AToolCaptureExtension.

graph TD
    A[Simulator sends task to A2A Agent] --> B[Agent processes request]
    B --> C[Agent returns Task with artifacts]
    C --> D[Simulator extracts from metadata]
    D --> E[Evaluator scores]

Tool call extraction from artifacts:

def _extract_tool_calls_from_artifact(self, artifact: Artifact) -> list[ToolCall]:
    # Artifact metadata is optional, so fall back to an empty dict.
    metadata = artifact.metadata or {}
    raw_calls = metadata.get("tool_calls", [])
    tool_calls = []
    for raw in raw_calls:
        name = raw.get("name")
        arguments = raw.get("arguments", {})
        # Skip malformed entries: a call needs a name and dict-shaped arguments.
        if not name or not isinstance(arguments, dict):
            continue
        tool_calls.append(
            ToolCall(
                id=raw.get("id", ""),
                name=name,
                arguments=arguments,
                result=A2AAgent._coerce_to_string(raw.get("result")),
                error=A2AAgent._coerce_to_string(raw.get("error")),
                source=ToolCallSource.A2A_PROTOCOL,
            )
        )
    return tool_calls

Sources: arksim/simulation_engine/agent/clients/a2a.py

A2A agent card declaration:

from arksim.simulation_engine.tool_types import A2AToolCaptureExtension

_capabilities = AgentCapabilities(
    streaming=False,
    extensions=[A2AToolCaptureExtension],
)

Workflow Diagram

The following diagram shows the complete simulation pipeline with tool call capture:

flowchart LR
    subgraph Simulation
        A[User Query] --> B[Simulator]
        B --> C{Agent Type}
        C -->|Custom| D[explicit tool_calls]
        C -->|Traced| E[TracingProcessor]
        C -->|A2A| F[Artifact metadata]
        D --> G[Captured Tool Calls]
        E --> G
        F --> G
    end
    
    subgraph Evaluation
        G --> H[Evaluator]
        H --> I[Tool Call Metrics]
        H --> J[Error Detection]
        H --> K[Trajectory Analysis]
    end
    
    G --> L[Results/Report]
    I --> L
    J --> L
    K --> L

Comparison of Capture Methods

Aspect	Explicit (AgentResponse)	Traced (TracingProcessor)	A2A Protocol
Return type	`AgentResponse`	`str`	Via protocol
Implementation complexity	Medium	Low	High
Agent code changes	Required	Minimal	Protocol required
Best for	Custom Python agents	SDK-based agents	A2A-native agents
Tool call source field	`agent_response`	`tracing_processor`	`a2a_protocol`

Sources: examples/customer-service/README.md

Configuration

Trace Receiver Settings

For traced agents, enable the trace receiver in the simulation config:

simulation:
  max_turns: 10
  
trace_receiver:
  enabled: true
  wait_timeout: 5  # seconds to wait for traces

When Tracing is Disabled

When trace_receiver.enabled is false or omitted, ArkSim captures tool calls only from AgentResponse (the standard path).

Integration with Evaluation

Captured tool calls flow into the evaluator's scoring pipeline:

Trajectory matching - Compare actual tool sequence against expected
Argument validation - Verify tool arguments match scenario requirements
Error detection - Identify tool call failures and their handling
Coverage analysis - Determine if all required tools were invoked

The evaluator uses the source field to differentiate between capture methods when analyzing behavioral patterns.
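
To make the first, third, and fourth checks concrete, the sketch below computes them over a list of captured calls. It is an illustration only: the `ToolCall` dataclass is a simplified stand-in for ArkSim's Pydantic model, `summarize_tool_calls` is a hypothetical name, and the real evaluator's scoring logic is not shown in this manual.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:  # simplified stand-in for arksim's ToolCall model
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None


def summarize_tool_calls(actual, expected_names):
    """Compare captured calls against an expected tool sequence."""
    actual_names = [call.name for call in actual]
    return {
        # Trajectory matching: same tools in the same order.
        "trajectory_match": actual_names == list(expected_names),
        # Coverage analysis: expected tools that were never invoked.
        "missing_tools": sorted(set(expected_names) - set(actual_names)),
        # Error detection: calls that reported a failure.
        "failed_calls": [call.name for call in actual if call.error is not None],
    }
```

Exact sequence equality is the strictest form of trajectory matching; a more lenient evaluator might accept extra calls or reordered independent calls.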

Forward Compatibility

Both ToolCall and AgentResponse declare extra="ignore" in their Pydantic configuration. The setting is explicit so that snapshots written by future ArkSim versions, which may add new fields, can still be loaded by older versions without raising a ValidationError.

Sources: examples/customer-service/custom_agent.py

Scenario Management

Related topics: Simulation Engine, Configuration System

Overview

Scenario Management is the core system in ArkSim for defining, loading, validating, and executing test scenarios that simulate user interactions with an AI agent. A scenario represents a structured test case that defines a user's goal, knowledge base, behavioral characteristics, and expected outcomes.

Scenarios serve as the foundation for the simulation and evaluation pipeline, enabling reproducible testing of agent behavior across diverse conversation patterns and user profiles.

Scenario Data Model

Core Scenario Structure

Each scenario in ArkSim is a JSON object containing the following primary fields:

Field	Type	Required	Description
`scenario_id`	string	Yes	Unique identifier for the scenario
`name`	string	Yes	Human-readable scenario name
`description`	string	No	Detailed description of the scenario's purpose
`user_profile`	object	Yes	User persona characteristics
`goal`	object	Yes	The user's primary objective
`knowledge`	array	Yes	Contextual knowledge the user has access to
`expected_behavior`	object	No	Expected agent responses or behaviors
`metrics`	array	No	Custom evaluation criteria

User Profile Schema

{
  "name": "string",
  "age": "number",
  "personality": "string",
  "background": "string",
  "communication_style": "string"
}

Sources: arksim/scenario/entities.py

Goal Structure

{
  "primary": "string (main user objective)",
  "secondary": ["array of secondary objectives"],
  "constraints": ["array of constraints or boundaries"],
  "success_criteria": "string"
}

Scenario Management Workflow

graph TD
    A[Create/Load Scenarios] --> B[Validate Scenario Schema]
    B --> C{Valid?}
    C -->|Yes| D[Save to scenarios.json]
    C -->|No| E[Show Validation Errors]
    E --> A
    D --> F[Configure Simulation Parameters]
    F --> G[Run Simulation]
    G --> H[Generate Results]
    H --> I[Evaluation & Reporting]

Scenario Loading and Validation

The UI provides interactive scenario management through the "Build" page:

Load Existing - Users can load pre-existing scenario files via file path input
Auto-generate (PRO) - Automatic scenario generation from agent knowledge base
Manual Creation - Build scenarios through the UI interface

// Scenario file validation in UI
@input="validateScenarioFile()"
@blur="validateScenarioFile()"
@keydown.enter="loadScenarioFile()"

Sources: arksim/ui/frontend/index.html

Validation Rules

Scenario files must be valid JSON
Required fields must be present
scenario_id must be unique within the file
user_profile must contain at minimum a name field
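
The four rules above can be checked with a few lines of standard-library Python. This is a minimal sketch, not ArkSim's validator: the function name and error messages are made up for illustration, and the required-field list is taken from the Core Scenario Structure table.

```python
import json

REQUIRED_FIELDS = ("scenario_id", "name", "user_profile", "goal", "knowledge")


def validate_scenarios(text):
    """Return a list of validation error strings (empty when the file is valid)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    scenarios = data.get("scenarios") if isinstance(data, dict) else None
    if not isinstance(scenarios, list):
        return ["top-level 'scenarios' array is missing"]
    errors, seen_ids = [], set()
    for index, scenario in enumerate(scenarios):
        if not isinstance(scenario, dict):
            errors.append(f"#{index}: scenario must be a JSON object")
            continue
        label = scenario.get("scenario_id", f"#{index}")
        for field_name in REQUIRED_FIELDS:
            if field_name not in scenario:
                errors.append(f"{label}: missing required field '{field_name}'")
        if "scenario_id" in scenario:
            if scenario["scenario_id"] in seen_ids:
                errors.append(f"{label}: duplicate scenario_id")
            seen_ids.add(scenario["scenario_id"])
        profile = scenario.get("user_profile")
        if isinstance(profile, dict) and "name" not in profile:
            errors.append(f"{label}: user_profile needs a 'name' field")
    return errors
```

Collecting every error instead of stopping at the first one lets an author fix a scenario file in a single pass.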

Scenario File Format

ArkSim uses JSON format for scenario definitions. See the example structure:

{
  "scenarios": [
    {
      "scenario_id": "ecommerce-return-item-001",
      "name": "Return Defective Product",
      "description": "Customer wants to return a damaged item received last week",
      "user_profile": {
        "name": "John Smith",
        "age": 35,
        "personality": "patient but firm",
        "background": "Regular online shopper",
        "communication_style": "polite and direct"
      },
      "goal": {
        "primary": "Get a full refund for a damaged product",
        "secondary": ["Understand return process", "Know timeline for refund"],
        "constraints": ["Only willing to wait up to 14 days for refund"],
        "success_criteria": "Full refund issued or replacement offered"
      },
      "knowledge": [
        "Ordered product SKU-12345 on March 1, 2024",
        "Item arrived damaged with visible scratches",
        "Has original packaging and receipt",
        "Order number: ORD-987654"
      ],
      "expected_behavior": {
        "should_mention_order_number": true,
        "should_request_photo_evidence": false
      }
    }
  ]
}

Sources: examples/e-commerce/scenarios.json, examples/bank-insurance/scenarios.json, examples/customer-service/scenarios.json

Simulation Configuration

Scenarios are executed through the simulation engine with configurable parameters:

Parameter	Default	Description
`num_conversations_per_scenario`	5	Number of conversations to simulate per scenario
`max_turns`	5	Maximum turns per conversation
`num_workers`	50	Parallel workers (or 'auto')
`model`	gpt-4o	LLM model for simulation
`provider`	openai	LLM provider

model: str = Field(default=DEFAULT_MODEL, description="LLM model for simulation")
num_conversations_per_scenario: int = Field(
    default=5, 
    description="Number of conversations per scenario to simulate"
)
max_turns: int = Field(default=5, description="Maximum turns per conversation")

Sources: arksim/simulation_engine/entities.py:18-25
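
These parameters multiply, so it helps to estimate the size of a run before launching it. The sketch below is plain arithmetic under one stated assumption: each turn triggers at most one call to the agent. The function name is hypothetical, and the count ignores the LLM calls made by the simulated user and the evaluator.

```python
def estimate_workload(num_scenarios, num_conversations_per_scenario=5, max_turns=5):
    """Upper bound on conversations and agent calls for one simulation run."""
    conversations = num_scenarios * num_conversations_per_scenario
    return {
        "conversations": conversations,
        # Assumes one agent call per turn; conversations may end earlier.
        "max_agent_calls": conversations * max_turns,
    }
```

With the defaults, 10 scenarios produce 50 conversations and at most 250 agent calls.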

Configuration via config.yaml

simulation:
  scenarios_file: ./scenarios.json
  num_conversations_per_scenario: 5
  max_turns: 5
  num_workers: auto
  model: gpt-4o

Built-in Scenario Templates

ArkSim provides template scenarios for common use cases:

{
  "scenarios": [
    {
      "scenario_id": "template-basic-query",
      "name": "Basic Information Query",
      "description": "User asks a simple informational question",
      "user_profile": {...},
      "goal": {...},
      "knowledge": [...]
    }
  ]
}

Sources: arksim/templates/scenarios.json

Integration with CI/CD

Scenarios can be integrated into automated testing workflows:

# GitHub Actions workflow
steps:
  - name: Run Scenario Tests
    run: arksim simulate-evaluate config.yaml

Sources: examples/ci/README.md

Required CI Setup

Update TODO sections in arksim.yml (startup command and health-check URL)
Create tests/arksim/config.yaml pointing to your server endpoint
Create tests/arksim/scenarios.json with your test cases
Add custom metrics to tests/arksim/custom_metrics/ if needed
Configure OPENAI_API_KEY in GitHub secrets

Best Practices

Scenario Design

Distinct Goals: Each scenario should test a single, well-defined user goal
Realistic Profiles: User profiles should reflect actual customer demographics
Sufficient Knowledge: Include enough context for the simulated user to maintain coherent conversation

File Organization

project/
├── config.yaml
├── scenarios.json
├── custom_metrics/
│   └── my_metric.py
└── results/
    └── simulation/

Version Control

Commit scenarios.json with your agent code
Use descriptive scenario_id values with prefixes (e.g., ecommerce-, support-)
Document success criteria clearly in the goal.success_criteria field

Command Line Usage

Initialize with default scenarios

arksim init

Run simulation with scenarios

arksim simulate-evaluate config.yaml

Results are written to ./results/simulation/simulation.json.

Sources: README.md, examples/integrations/dify/README.md

Configuration System

Related topics: Agent Types and Integration, LLM Provider Integration, Evaluation System

ArkSim provides a flexible, multi-layered configuration system that supports both YAML-based declarative configuration and programmatic Python class configuration. The system enables users to define simulation scenarios, evaluation metrics, agent connections, and runtime behavior through structured configuration files or direct Python implementations.

Architecture Overview

The configuration system is organized into several key modules that handle different aspects of simulation and evaluation:

graph TD
    A[User Configuration] --> B[YAML Files]
    A --> C[Python Classes]
    B --> D[config.yaml]
    B --> E[config_simulate.yaml]
    B --> F[config_evaluate.yaml]
    C --> G[BaseAgent Subclass]
    C --> H[Custom Metrics]
    D --> I[Config Loader]
    E --> I
    F --> I
    G --> J[Simulation Engine]
    H --> K[Evaluator]
    I --> J
    J --> K
    J --> L[Results/Reports]
    K --> L

The system supports three execution modes: simulation only, evaluation only, and combined execution. Keeping simulation separate from evaluation lets users run simulations independently, which is useful for debugging and iterative development workflows.

Configuration Files

ArkSim uses YAML configuration files to define simulation parameters, evaluation settings, and agent connections. The repository includes three primary configuration templates at the root level.

Main Configuration (config.yaml)

The main configuration file combines both simulation and evaluation settings into a single file. This is the recommended approach for simple use cases where both phases run together using the simulate-evaluate command. The file structure typically includes agent configuration, scenario definitions, metric selections, and evaluation parameters.

Separate Phase Configuration

For more complex workflows, ArkSim supports splitting configuration into separate files:

File	Purpose	CLI Command
`config_simulate.yaml`	Simulation-only parameters	`arksim simulate config_simulate.yaml`
`config_evaluate.yaml`	Evaluation-only parameters	`arksim evaluate config_evaluate.yaml`
`config.yaml`	Combined configuration	`arksim simulate-evaluate config.yaml`

Sources: examples/customer-service/README.md

This separation enables scenarios where simulation results are cached and evaluated multiple times with different metric configurations, or where the simulation phase runs in a different environment than evaluation.

Agent Configuration

The configuration system supports multiple agent connection types, defined through the agent_type field in the agent configuration section.

Supported Agent Types

Agent Type	Description	Configuration Style
`custom`	Custom Python agent class inheriting from BaseAgent	Python class file
`chat_completions`	HTTP endpoint implementing Chat Completions API	YAML endpoint config
`a2a`	Agent-to-Agent protocol endpoint	YAML endpoint config

Sources: arksim/cli.py

Custom Agent (Python Class)

The default agent type uses a Python class that extends BaseAgent. This approach provides maximum flexibility for integrating any agent framework. The agent class must implement two required methods:

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"

Sources: README.md

The execute method can return either a plain string response or an AgentResponse object that includes both text content and tool calls for evaluation.
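
Because both return types are allowed, the caller has to normalize them before extracting tool calls. The sketch below shows one way that normalization could work; `AgentResponse` here is a dataclass stand-in for the real Pydantic model, and `normalize_response` is a hypothetical helper, not ArkSim's internal function.

```python
from dataclasses import dataclass, field


@dataclass
class AgentResponse:  # simplified stand-in for arksim's AgentResponse
    content: str
    tool_calls: list = field(default_factory=list)


def normalize_response(result):
    """Wrap a plain string so downstream code always sees an AgentResponse."""
    if isinstance(result, AgentResponse):
        return result
    if isinstance(result, str):
        # A plain string carries no tool calls.
        return AgentResponse(content=result)
    raise TypeError(f"execute() returned unsupported type: {type(result).__name__}")
```

Under this reading, returning a plain string is equivalent to returning an AgentResponse with an empty `tool_calls` list.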

Chat Completions Agent

For agents exposing an HTTP endpoint compatible with the OpenAI Chat Completions format, configuration is declarative:

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

Sources: README.md
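
An endpoint is compatible if it accepts a JSON body with a `messages` array and returns the reply under `choices[0].message.content`, as in the OpenAI Chat Completions format. The sketch below shows both sides of that exchange. The helper names are hypothetical, and sending `agent_name` as the `model` value is an assumption rather than documented ArkSim behavior.

```python
def build_chat_request(history, user_message, model="my-agent"):
    """Build a Chat Completions request body from prior turns plus a new message."""
    messages = list(history) + [{"role": "user", "content": user_message}]
    return {"model": model, "messages": messages}


def extract_reply(response_json):
    """Pull the assistant text out of a Chat Completions response body."""
    return response_json["choices"][0]["message"]["content"]
```

A server that implements only this much of the format can be exercised without writing any Python agent code.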

A2A Protocol Agent

Agents implementing the Agent-to-Agent protocol are configured similarly:

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

A2A agents can also surface tool calls for evaluation through the protocol, enabling comprehensive testing of tool-using agents.

Scenario Configuration

Scenarios define the test cases that the simulator executes against the agent. The scenarios.json file contains an array of scenario objects, each representing a simulated conversation or interaction.

Scenario Structure

Each scenario typically includes:

Scenario ID: Unique identifier for the scenario
User query: The initial prompt or question presented to the agent
Expected behavior: Criteria for successful completion
Knowledge/context: Information the simulated user has access to during the scenario

Scenario Files Location

Scenarios are referenced from the main configuration file and can be shared across different configuration files:

Example Directory	Scenario Purpose
`examples/bank-insurance/`	Financial services agent testing
`examples/customer-service/`	Customer support scenarios
`examples/integrations/*/`	Framework-specific testing
`examples/ci/`	CI/CD integration testing

Sources: examples/ci/README.md

Evaluation Metrics Configuration

The evaluator component scores agent responses against defined metrics. Configuration specifies which metrics to run and how they are calculated.

Built-in Metrics

ArkSim includes standard evaluation metrics covering common agent quality dimensions. These metrics are automatically available without additional configuration.

Custom Metrics

Users can define custom evaluation metrics by creating a Python module that implements the metric interfaces. The custom metrics file must define metrics following the provided schema:

from pydantic import BaseModel
from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
    format_chat_history,
)

Sources: examples/customer-service/custom_metrics.py

Custom metrics are referenced in the configuration file using the custom_metrics_file_paths parameter, and optionally listed in metrics_to_run to include them in the evaluation:

custom_metrics_file_paths:
  - tests/arksim/custom_metrics/my_metric.py
metrics_to_run:
  - custom_metric_name

Sources: examples/ci/README.md

Custom Metric Schema

Custom quantitative metrics should return a Pydantic model defining the score components:

class VerificationComplianceSchema(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str

Sources: examples/customer-service/custom_metrics.py
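
A schema with several components usually has to be reduced to one number for reporting. The sketch below shows one plausible reduction, an unweighted mean, together with a range check on each component. It uses a dataclass stand-in instead of Pydantic, and `overall_score` is a hypothetical helper; how the example metric in custom_metrics.py actually combines its components is not shown here.

```python
from dataclasses import dataclass


@dataclass
class VerificationCompliance:  # dataclass stand-in for the Pydantic schema
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str

    def __post_init__(self):
        # Reject component scores outside the documented range.
        for name in ("identity_verification", "action_gating"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")


def overall_score(result):
    """Unweighted mean of the two components (an illustrative choice)."""
    return (result.identity_verification + result.action_gating) / 2
```

Weighting the components differently, or taking the minimum so that one failed check dominates, are equally reasonable choices depending on the policy being tested.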

The system prompt for qualitative metrics defines the evaluation criteria that the LLM judge applies when scoring agent responses.

CLI Configuration Interface

The ArkSim CLI provides commands for initializing configurations, running simulations, and managing examples.

Initialization Command

The init command scaffolds starter files for agent testing:

arksim init --agent-type custom

Flag	Options	Default	Description
`--agent-type`	`custom`, `chat_completions`, `a2a`	`custom`	Agent connection type
`--force`	boolean	`false`	Overwrite existing files

Sources: arksim/cli.py

The --agent-type flag determines which template is generated:

custom generates a Python agent file (no server needed)
chat_completions generates YAML configuration for HTTP endpoints
a2a generates YAML configuration for Agent-to-Agent protocol

Simulation Commands

Command	Description
`arksim simulate-evaluate config.yaml`	Run simulation and evaluation in sequence
`arksim simulate config_simulate.yaml`	Run simulation only
`arksim evaluate config_evaluate.yaml`	Evaluate previously saved simulation results

Sources: examples/customer-service/README.md

Examples Command

The CLI also provides access to example projects:

arksim examples                    # Download all examples
arksim examples bank-insurance     # Download specific example
arksim examples --list            # List available examples

Integration Configuration Patterns

Each integration example follows a consistent pattern with three standard files.

File Structure

File	Purpose
`custom_agent.py`	Agent implementation connecting to the target framework
`config.yaml`	Simulation and evaluation settings
`scenarios.json`	Test scenarios for the example domain

Supported Framework Integrations

ArkSim provides example configurations for the following agent frameworks:

Framework	Installation	Example Directory
LangGraph	`pip install langgraph langchain-openai`	`examples/integrations/langgraph/`
LangChain	`pip install langgraph langchain-openai`	`examples/integrations/langchain/`
Claude Agent SDK	`pip install claude-agent-sdk`	`examples/integrations/claude-agent-sdk/`
AutoGen	`pip install autogen-agentchat autogen-ext[openai]`	`examples/integrations/autogen/`
CrewAI	`pip install crewai`	`examples/integrations/crewai/`
Pydantic AI	`pip install pydantic-ai`	`examples/integrations/pydantic-ai/`
Smolagents	`pip install smolagents`	`examples/integrations/smolagents/`
Dify	HTTP integration	`examples/integrations/dify/`
OpenClaw	Gateway token auth	`examples/openclaw/`

Sources: examples/integrations/*/README.md

Trace Receiver Configuration

For frameworks that support tracing, ArkSim includes a trace receiver component that captures tool calls automatically without requiring explicit AgentResponse returns.

Configuration Options

trace_receiver:
  enabled: true
  wait_timeout: 5

Parameter	Type	Default	Description
`enabled`	boolean	`false`	Enable trace-based tool call capture
`wait_timeout`	integer	`5`	Seconds to wait for trace data

Sources: examples/customer-service/README.md

When enabled is false or omitted, ArkSim only captures tool calls from AgentResponse objects returned by the agent's execute method.

Workflow Summary

The following diagram illustrates the configuration-driven workflow from specification to results:

graph LR
    A[config.yaml] --> B[Scenario Loading]
    C[scenarios.json] --> B
    B --> D[Simulation Engine]
    D --> E{Agent Type?}
    E -->|custom| F[Python Agent]
    E -->|chat_completions| G[HTTP Endpoint]
    E -->|a2a| H[A2A Agent]
    F --> I[Execute Scenarios]
    G --> I
    H --> I
    I --> J[Simulation Results]
    J --> K[Evaluator]
    J --> L[Conversation Viewer]
    K --> M[Evaluation Report]
    M --> N[scores.json]
    M --> O[failures.json]

Best Practices

Environment Variables: API keys should be set as environment variables rather than hardcoded in configuration files:

export OPENAI_API_KEY="<your-key>"
export ANTHROPIC_API_KEY="<your-key>"

Separation of Concerns: Use separate config_simulate.yaml and config_evaluate.yaml files when iterating on scenarios or metrics independently.

Custom Metrics Organization: Place custom metrics in dedicated directories (e.g., tests/arksim/custom_metrics/) and reference them from the main configuration.

Scenario Versioning: Keep scenarios in version-controlled JSON files that can be reviewed and updated as agent requirements evolve.

Sources: CONTRIBUTING.md

Sources: examples/customer-service/README.md

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Doramagic extracted 15 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: v0.1.0

Severity: medium
Finding: Installation risk is backed by a source signal: v0.1.0. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.1.0

2. Configuration risk: v0.0.6

Severity: medium
Finding: Configuration risk is backed by a source signal: v0.0.6. Treat it as a review item until the current version is checked.
User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.0.6

3. Configuration risk: v0.3.2

Severity: medium
Finding: Configuration risk is backed by a source signal: v0.3.2. Treat it as a review item until the current version is checked.
User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.2

4. Configuration risk: v0.3.4

Severity: medium
Finding: Configuration risk is backed by a source signal: v0.3.4. Treat it as a review item until the current version is checked.
User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.4

5. Configuration risk: v0.3.5

Severity: medium
Finding: Configuration risk is backed by a source signal: v0.3.5. Treat it as a review item until the current version is checked.
User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.5

6. Capability assumption: v0.3.1

Severity: medium
Finding: Capability assumption is backed by a source signal: v0.3.1. Treat it as a review item until the current version is checked.
User impact: The project should not be treated as fully validated until this signal is reviewed.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.1

7. Capability assumption: README/documentation is current enough for a first validation pass.

Severity: medium
Finding: This assessment assumes that the README/documentation is current enough for a first validation pass.
User impact: The project should not be treated as fully validated until this signal is reviewed.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: capability.assumptions | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | README/documentation is current enough for a first validation pass.

8. Maintenance risk: v0.3.3

Severity: medium
Finding: Maintenance risk is backed by a source signal: v0.3.3. Treat it as a review item until the current version is checked.
User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.3

9. Maintenance risk: Maintainer activity is unknown

Severity: medium
Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: evidence.maintainer_signals | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | last_activity_observed missing

10. Security or permission risk: no_demo

Severity: medium
Finding: The no_demo risk signal was recorded for this project.
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: downstream_validation.risk_items | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | no_demo; severity=medium

11. Security or permission risk: no_demo

Severity: medium
Finding: The no_demo risk signal was recorded for this project.
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: risks.scoring_risks | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | no_demo; severity=medium

12. Security or permission risk: v0.2.0

Severity: medium
Finding: Security or permission risk is backed by a source signal: v0.2.0. Treat it as a review item until the current version is checked.
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.2.0

Source: Doramagic discovery, validation, and Project Pack records
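
Several of the recommended checks above ask whether a release-tagged signal still applies to the current version. As a starting point for that review, the flagged tags can be ordered and split around the installed version. This is a minimal sketch under stated assumptions: the installed version string is supplied by the reader, tags follow the `vMAJOR.MINOR.PATCH` pattern seen in the evidence links, and `parse_version` and `split_by_installed` are hypothetical helpers. The split only organizes the review; it does not decide whether a risk applies.

```python
import re


def parse_version(tag):
    """Turn a tag such as 'v0.3.4' into a comparable tuple (0, 3, 4)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"Unrecognized version tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def split_by_installed(tags, installed):
    """Split tags into (at_or_before_installed, after_installed), sorted ascending."""
    current = parse_version(installed)
    ordered = sorted(tags, key=parse_version)
    older = [tag for tag in ordered if parse_version(tag) <= current]
    newer = [tag for tag in ordered if parse_version(tag) > current]
    return older, newer
```

For example, `split_by_installed(["v0.3.4", "v0.1.0", "v0.3.2"], "0.3.2")` returns the tags at or before 0.3.2 in the first list and `v0.3.4` in the second.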

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using arksim with real data or production workflows.

v0.3.7 - github / github_release
v0.3.6 - github / github_release
v0.3.5 - github / github_release
v0.3.4 - github / github_release
v0.3.3 - github / github_release
v0.3.2 - github / github_release
v0.3.1 - github / github_release
v0.3.0 - github / github_release
v0.2.0 - github / github_release
v0.1.0 - github / github_release
v0.0.6 - github / github_release
Zhou Yu (@Zhou_Yu_AI) / Highlights / X - x / searxng_indexed

Source: Project Pack community evidence and pitfall evidence