Doramagic Project Pack · Human Manual

arksim

Introduction to ArkSim

Related topics: Quickstart Guide, System Architecture Overview

ArkSim is a multi-turn agent evaluation and simulation platform designed to test, measure, and improve AI agent quality across conversational scenarios. It enables developers to define test scenarios, simulate user interactions with their agents, and generate comprehensive evaluation reports with quantitative and qualitative metrics.

Overview

ArkSim addresses the challenge of systematically evaluating conversational AI agents by providing:

  • Scenario-based simulation: Define test cases with user profiles, knowledge bases, and expected behaviors
  • Automated evaluation: Score agent responses across multiple dimensions including goal completion, helpfulness, and coherence
  • Framework integration: Connect with various agent frameworks including LangGraph, LangChain, AutoGen, Claude Agent SDK, Rasa, Dify, and OpenClaw
  • Custom metrics: Extend evaluation capabilities with user-defined quantitative and qualitative metrics
  • Visual reporting: Generate HTML reports with conversation transcripts, failure analysis, and per-metric scores

Sources: README.md

Architecture Overview

ArkSim follows a pipeline architecture with three primary stages (simulation, evaluation, and report generation):

graph TD
    A[Configuration<br/>config.yaml] --> B[Simulation Engine]
    C[Scenarios<br/>scenarios.json] --> B
    D[Agent<br/>custom_agent.py] --> B
    B --> E[Conversation Data]
    E --> F[Evaluator]
    F --> G[HTML Report<br/>final_report.html]
    H[Custom Metrics<br/>custom_metrics.py] --> F

Core Components

| Component | Description |
| --- | --- |
| Simulation Engine | Orchestrates multi-turn conversations between simulated users and the agent |
| Evaluator | Scores agent performance using built-in and custom metrics |
| CLI Interface | Command-line interface for running simulations (arksim simulate-evaluate) |
| Report Generator | Produces HTML reports with visualizations and conversation transcripts |

Sources: examples/customer-service/README.md:1-25

Agent Integration Methods

ArkSim supports multiple agent integration patterns to accommodate different agent architectures.

Python Class Integration (BaseAgent)

The default integration method uses a Python class inheriting from BaseAgent. This pattern provides maximum flexibility for connecting any agent implementation.

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"

The agent can return either:

  • A plain string response
  • An AgentResponse object containing both content and tool calls

Sources: README.md:40-55
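
Because execute() may return either type, code that consumes agent output usually normalizes it first. The following is a minimal sketch of that idea, not ArkSim's implementation: the two dataclasses are simplified local stand-ins for the real models, and normalize is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class ToolCall:  # simplified stand-in for arksim's ToolCall
    id: str
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class AgentResponse:  # simplified stand-in for arksim's AgentResponse
    content: str
    tool_calls: list[ToolCall] = field(default_factory=list)


def normalize(result: Union[str, AgentResponse]) -> AgentResponse:
    """Wrap a plain string so callers always see an AgentResponse."""
    if isinstance(result, str):
        return AgentResponse(content=result)
    return result


plain = normalize("agent response")
structured = normalize(
    AgentResponse(content="done", tool_calls=[ToolCall(id="call_1", name="lookup")])
)
print(plain.tool_calls)               # []
print(structured.tool_calls[0].name)  # lookup
```

With this shape, a plain string simply becomes a response with an empty tool call list, so downstream scoring code only has to handle one type.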

AgentResponse Data Model

The AgentResponse model encapsulates structured agent outputs:

class AgentResponse(BaseModel):
    """Structured return from agent execution, carrying both text and tool calls."""
    model_config = ConfigDict(extra="ignore")
    
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)

Sources: arksim/simulation_engine/tool_types.py:32-42

ToolCall Model

Tool calls are captured using the ToolCall model:

class ToolCall(BaseModel):
    """A single tool/function call observed during a turn."""
    model_config = ConfigDict(extra="ignore")
    
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None

Setting extra="ignore" tells Pydantic to drop unknown fields silently instead of rejecting them, so payloads that carry fields added in later versions still validate.

Sources: arksim/simulation_engine/tool_types.py:18-30
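
The effect of ignoring extra fields can be illustrated without Pydantic. The sketch below is a stdlib-only analogy under stated assumptions: ToolCallRecord and from_payload are hypothetical names, and the latency_ms key stands for any field a newer producer might add.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class ToolCallRecord:  # hypothetical stand-in, not arksim's ToolCall
    id: str
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)
    result: Optional[str] = None


def from_payload(payload: dict[str, Any]) -> ToolCallRecord:
    """Build a record, dropping keys the model does not declare."""
    known = {f.name for f in fields(ToolCallRecord)}
    return ToolCallRecord(**{k: v for k, v in payload.items() if k in known})


# "latency_ms" is an unknown key; it is ignored rather than causing an error.
record = from_payload({"id": "call_1", "name": "search", "latency_ms": 12})
print(record)
```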

Chat Completions Endpoint Integration

For agents exposing a standard OpenAI-compatible API:

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

Sources: README.md:57-62
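
An OpenAI-compatible chat completions endpoint expects a JSON body with a model name and a list of role-tagged messages. The sketch below shows one plausible way a conversation could be turned into such a body; the build_payload helper, the speaker labels, and the mapping of simulated-user turns to the "user" role are assumptions for illustration, not ArkSim's documented behavior.

```python
import json


def build_payload(history: list[tuple[str, str]], model: str) -> dict:
    """Turn (speaker, text) pairs into an OpenAI-style request body."""
    role_for = {"simulated_user": "user", "agent": "assistant"}
    messages = [
        {"role": role_for[speaker], "content": text} for speaker, text in history
    ]
    return {"model": model, "messages": messages}


history = [
    ("simulated_user", "I need to update my billing address."),
    ("agent", "Sure, can you confirm your account email?"),
    ("simulated_user", "It's alice@example.com."),
]
payload = build_payload(history, model="my-agent")
print(json.dumps(payload, indent=2))
```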

A2A Protocol Integration

For agents implementing the Agent-to-Agent (A2A) protocol:

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

Sources: README.md:64-70

Automatic Tool Call Capture (Tracing)

ArkSim supports automatic tool call capture through a tracing processor, eliminating the need for agents to explicitly return tool calls:

graph LR
    A[Simulator sets<br/>routing context] --> B["agent.execute()"]
    B --> C[SDK fires<br/>TracingProcessor.on_span_end]
    C --> D[arksim captures<br/>tool calls]
    D --> E[Evaluator scores]

To run the traced customer-service example:

pip install -r requirements-traced.txt
arksim simulate-evaluate config_traced.yaml

Sources: examples/customer-service/README.md:35-55
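
The flow above depends on a routing context, so that a span finishing inside the agent can be attributed to the right conversation. A minimal sketch of that mechanism using Python's contextvars follows. Only the on_span_end name comes from the text; CollectingProcessor, current_chat_id, and the single-argument span signature are hypothetical simplifications.

```python
import asyncio
import contextvars

# Routing context: which conversation the currently running task belongs to.
current_chat_id = contextvars.ContextVar("current_chat_id")


class CollectingProcessor:
    """Hypothetical processor that files tool spans under the active chat id."""

    def __init__(self) -> None:
        self.calls: dict[str, list[str]] = {}

    def on_span_end(self, tool_name: str) -> None:
        self.calls.setdefault(current_chat_id.get(), []).append(tool_name)


processor = CollectingProcessor()


async def agent_execute(query: str) -> str:
    # The agent returns only text; the tool call is reported by the "SDK".
    processor.on_span_end("lookup_order")
    return f"handled: {query}"


async def run_turn(chat_id: str, query: str) -> str:
    current_chat_id.set(chat_id)  # the simulator sets the routing context
    return await agent_execute(query)


async def main() -> list[str]:
    # gather() runs each coroutine as a task with its own context copy,
    # so chat ids set in one conversation do not leak into the other.
    return await asyncio.gather(
        run_turn("chat-a", "where is my order"),
        run_turn("chat-b", "cancel it"),
    )


replies = asyncio.run(main())
print(processor.calls)
```

The design point is that the agent code stays unchanged: attribution happens through the context, not through the agent's return value.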

Configuration System

ArkSim uses YAML configuration files to define simulation and evaluation parameters.

Configuration File Structure

# Simulation settings
simulation:
  max_turns: 10
  timeout: 300

# Agent configuration
agent_config:
  agent_type: chat_completions  # or 'a2a', 'custom'
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

# Trace receiver (optional)
trace_receiver:
  enabled: true
  wait_timeout: 5

# Custom metrics
custom_metrics_file_paths:
  - ./custom_metrics.py

metrics_to_run:
  - goal_completion
  - verification_compliance

Sources: examples/ci/README.md:1-25

Scenario Definition

Scenarios are defined in scenarios.json with the following structure:

| Field | Description |
| --- | --- |
| scenario_id | Unique identifier for the scenario |
| user_profile | Description of the simulated user |
| knowledge | Context and information available to the user |
| goals | Objectives the user aims to achieve |
| expected_behavior | Expected agent behavior patterns |

Sources: examples/integrations/dify/README.md:1-20

Evaluation Metrics

ArkSim provides built-in metrics and supports custom metric definitions.

Built-in Metrics

| Metric | Description | Weight |
| --- | --- | --- |
| Goal Completion | Whether the agent achieved the user's objectives | 60% |
| Turn Success Ratio | Ratio of successful turns to total turns | 40% |
| Final Score | Weighted average of the two metrics above | - |

Sources: arksim/utils/html_report/report_template.html:80-95
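
As a worked example of the weighting, the sketch below combines the two built-in metrics with the 60/40 weights from the table. It assumes both inputs are already on a 0.0-1.0 scale; final_score is a hypothetical helper name, not an ArkSim function.

```python
GOAL_COMPLETION_WEIGHT = 0.6
TURN_SUCCESS_WEIGHT = 0.4


def final_score(goal_completion: float, turn_success_ratio: float) -> float:
    """Weighted average of the two built-in metrics (both assumed in 0.0-1.0)."""
    return (
        GOAL_COMPLETION_WEIGHT * goal_completion
        + TURN_SUCCESS_WEIGHT * turn_success_ratio
    )


# 8 of 10 turns succeeded and goal completion was scored 0.5:
# 0.6 * 0.5 + 0.4 * 0.8 = 0.30 + 0.32 = 0.62
score = final_score(goal_completion=0.5, turn_success_ratio=8 / 10)
print(round(score, 2))  # 0.62
```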

Custom Metrics

Custom metrics can be implemented as either quantitative (scored numerically) or qualitative (evaluated via LLM).

#### Quantitative Metric Pattern

from pydantic import BaseModel

from arksim.evaluator import (
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
    format_chat_history,
)

# Structured output schema that the evaluator LLM is asked to fill in.
class VerificationComplianceMetric(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str

VERIFICATION_COMPLIANCE_SYSTEM_PROMPT = """\
You are an impartial evaluator for a customer service agent.
Score how well the agent followed identity verification protocols..."""

def verification_compliance_metric(
    input_data: ScoreInput,
) -> QuantResult:
    # Implementation
    return QuantResult(score=0.85, reason="Verification completed")

Sources: examples/customer-service/custom_metrics.py:15-40

#### Qualitative Metric Pattern

from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    ScoreInput,
    format_chat_history,
)

def helpfulness_metric(input_data: ScoreInput) -> QualResult:
    system_prompt = """Evaluate the helpfulness of the agent response..."""
    return QualResult(
        assessment="The agent provided comprehensive assistance",
        score=0.9,
        reason="Clear explanation with actionable steps"
    )

Metric Configuration

To use custom metrics:

  1. Implement the metric function in a Python file
  2. Reference the file path in custom_metrics_file_paths in config.yaml
  3. Add the metric name to metrics_to_run

Sources: examples/customer-service/custom_metrics.py:1-15

CLI Usage

ArkSim provides a command-line interface for running simulations and evaluations.

Primary Commands

| Command | Description |
| --- | --- |
| arksim simulate-evaluate config.yaml | Run simulation and evaluation in one step |
| arksim simulate config_simulate.yaml | Run simulation only |
| arksim evaluate config_evaluate.yaml | Evaluate existing simulation results |

Sources: examples/customer-service/README.md:15-22

Workflow Examples

#### Combined Simulation and Evaluation

arksim simulate-evaluate config.yaml

#### Separate Simulation and Evaluation

# Step 1: Simulate
arksim simulate config_simulate.yaml

# Step 2: Evaluate
arksim evaluate config_evaluate.yaml

Results are written to ./results/simulation/simulation.json. The evaluation report is printed to stdout with per-scenario metric scores and failure analysis.

Sources: examples/integrations/dify/README.md:15-25

Framework Integrations

ArkSim provides integration examples for popular agent frameworks.

Integration Matrix

| Framework | Package Required | Documentation |
| --- | --- | --- |
| LangGraph | langgraph langchain-openai | Link |
| LangChain | langgraph langchain-openai | Link |
| AutoGen | autogen-agentchat autogen-ext[openai] | Link |
| Claude Agent SDK | claude-agent-sdk | Link |
| Rasa | rasa-pro | Link |
| Dify | (custom agent) | Link |
| OpenClaw | (custom agent) | Link |

Sources: examples/integrations/langgraph/README.md, examples/integrations/autogen/README.md

General Integration Pattern

Regardless of the framework, the integration follows this pattern:

  1. Create a custom_agent.py file with a BaseAgent subclass
  2. Implement the execute() method to call the framework's agent
  3. Return either a string or AgentResponse with tool calls
  4. Configure config.yaml to use the custom agent
  5. Create scenarios.json with test cases

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class FrameworkAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-session-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Call your framework's agent here
        result = await your_agent.run(user_query)
        return str(result)
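
Since execute() is a coroutine, a framework that only offers a blocking call should not be invoked directly inside it. The following sketch shows one common way to handle that with asyncio.to_thread. The BaseAgent class here is a local stand-in so the example runs without ArkSim installed, and blocking_framework_call is a hypothetical placeholder for a synchronous framework entry point.

```python
import asyncio
from abc import ABC, abstractmethod


class BaseAgent(ABC):  # local stand-in mirroring the two methods shown above
    @abstractmethod
    async def get_chat_id(self) -> str: ...

    @abstractmethod
    async def execute(self, user_query: str, **kwargs: object) -> str: ...


def blocking_framework_call(prompt: str) -> str:
    """Hypothetical synchronous framework entry point."""
    return f"echo: {prompt}"


class SyncFrameworkAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-session-id"

    async def execute(self, user_query: str, **kwargs: object) -> str:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(blocking_framework_call, user_query)


reply = asyncio.run(SyncFrameworkAgent().execute("hello"))
print(reply)  # echo: hello
```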

Report Generation

ArkSim generates comprehensive HTML evaluation reports containing:

Report Sections

| Section | Content |
| --- | --- |
| Summary | Overall scores, pass/fail status |
| Per-Scenario Scores | Individual metric scores for each scenario |
| Failure Categories | Grouped failure patterns with counts |
| Conversation Viewer | Full transcript of each simulated conversation |
| Score Explanations | Rationale for each score with references to conversation turns |

Sources: README.md:25-30

Report Output

The report tells you where your agent is strong and where it breaks. You get per-metric scores, categorized failures, and full conversation transcripts so you can read the exact turns where things went wrong.

Sources: README.md:30-35

Development Guidelines

Project Setup

# Fork and clone
git clone https://github.com/<your-username>/arksim.git
cd arksim

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Create a branch
git checkout -b my-feature

Code Quality

ArkSim uses Ruff for linting and formatting:

ruff check .    # Lint
ruff format .   # Format

Pre-commit hooks run both automatically on commit.

Code Style

  • Follow PEP 8 conventions
  • Code lines: maximum 120 characters
  • Comments and docstrings: maximum 80 characters
  • Type hints encouraged for function signatures
  • Use absolute imports over relative imports

Commit Message Format

<component>: <verb> <description>

Examples:

  • evaluator: add custom metric support
  • simulator: fix profile generation for empty attributes
  • cli: support verbose flag for streaming output

Keep the subject line under 72 characters, use lowercase, imperative mood.
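
The subject-line rules can be checked mechanically. The sketch below is an illustrative checker, not a hook shipped with the project: the regular expression is one possible reading of the format, and imperative mood is left to human review.

```python
import re

# <component>: <verb> <description>, lowercase, at least two words after the colon.
SUBJECT_PATTERN = re.compile(r"^[a-z][a-z0-9_-]*: [a-z]\S*( \S+)+$")


def is_valid_subject(subject: str) -> bool:
    """Check length, case, and overall shape of a commit subject line."""
    return (
        len(subject) < 72
        and subject == subject.lower()
        and bool(SUBJECT_PATTERN.match(subject))
    )


print(is_valid_subject("evaluator: add custom metric support"))  # True
print(is_valid_subject("Evaluator: Add custom metric support"))  # False
print(is_valid_subject("fix stuff"))                             # False
```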

Branch Naming

<type>/<short-description>

Examples: feat/retry-logic, fix/empty-list-handling, docs/update-quickstart.

Sources: CONTRIBUTING.md:1-50

CI/CD Integration

ArkSim can be integrated into GitHub Actions workflows for automated testing.

Workflow Options

| Workflow | Use Case |
| --- | --- |
| arksim.yml | HTTP server endpoints requiring startup/shutdown |
| arksim-pytest.yml | Python-based pytest integration |

Setup Requirements

  1. Update TODO sections in arksim.yml (startup command and health-check URL)
  2. Create tests/arksim/config.yaml pointing to your server endpoint
  3. Create tests/arksim/scenarios.json with test cases
  4. Add custom metrics to tests/arksim/custom_metrics/ if needed
  5. Add OPENAI_API_KEY (and optionally AGENT_API_KEY) to GitHub secrets

Sources: examples/ci/README.md:1-15

Summary

ArkSim provides a comprehensive framework for evaluating multi-turn conversational AI agents through:

  1. Flexible integration: Connect agents via Python classes, REST APIs, or framework-specific connectors
  2. Rich scenario definitions: Test agents with diverse user profiles and interaction goals
  3. Extensible evaluation: Implement custom metrics for domain-specific assessment
  4. Actionable reporting: Generate HTML reports that pinpoint exactly where and why agent behavior falls short

By standardizing agent evaluation, ArkSim enables data-driven improvements to conversational AI systems across different frameworks and architectures.

Sources: README.md

Quickstart Guide

Related topics: Introduction to ArkSim, Agent Types and Integration

This guide walks you through setting up and running your first agent simulation with ArkSim. By the end, you'll understand how to scaffold a project, connect your agent, define test scenarios, and execute simulation and evaluation workflows.

Prerequisites

Before starting, ensure you have:

  • Python 3.10 or higher
  • pip or pipx for package installation
  • An API key for your LLM provider (OpenAI, Anthropic, etc.) if your agent requires external API calls

Installation

Install ArkSim using pip:

pip install arksim

For development with test dependencies:

pip install -e ".[dev]"

Sources: README.md

Project Initialization

The arksim init command scaffolds a starter project with all necessary configuration files. Run it from your desired working directory:

arksim init

Agent Connection Types

ArkSim supports three agent connection types via the --agent-type flag:

| Type | Description | Use Case |
| --- | --- | --- |
| custom | Python class implementing BaseAgent | Full control, no external server needed |
| chat_completions | HTTP endpoint compatible with OpenAI format | Existing REST APIs |
| a2a | Agent-to-Agent protocol | Multi-agent systems |

Sources: arksim/cli.py:120-136

Command Options

arksim init --agent-type custom    # Default: Python agent class
arksim init --agent-type chat_completions  # HTTP endpoint
arksim init --agent-type a2a       # A2A protocol
arksim init --agent-type custom --force  # Overwrite existing files

Scaffolding generates these files:

| File | Purpose |
| --- | --- |
| my_agent.py | Agent implementation stub |
| config.yaml | Simulator configuration |
| scenarios.json | Test scenario definitions |
| .env | Environment variables template |

Project Structure

your-project/
├── my_agent.py        # Your agent implementation
├── config.yaml         # Simulation & evaluation settings
├── scenarios.json     # Test scenarios
└── results/            # Output directory (created on run)
    ├── simulation/
    │   └── simulation.json
    └── evaluation/
        └── evaluation.json

Connecting Your Agent

Replace the generated my_agent.py with your agent logic. Your class must inherit from BaseAgent:

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Your agent logic here
        return "agent response"

Sources: README.md

The execute method supports two return types:

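| Return Type | Description |
| --- | --- |
| str | Plain text response |
| AgentResponse | Structured response with text and tool calls |

An execute method that returns tool calls alongside the text can look like this:

from arksim.simulation_engine.tool_types import AgentResponse, ToolCall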

async def execute(self, user_query: str, **kwargs: object) -> AgentResponse:
    tool_calls = [
        ToolCall(
            id="call_123",
            name="search_database",
            arguments={"query": user_query}
        )
    ]
    return AgentResponse(
        content="Found results for your query",
        tool_calls=tool_calls
    )

Sources: arksim/simulation_engine/tool_types.py:45-72

Chat Completions Endpoint

For HTTP-based agents, configure config.yaml:

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

A2A Protocol

For Agent-to-Agent protocol support:

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

A2A agents can surface tool calls for evaluation via the protocol extension.

Sources: README.md

Configuring Simulation

Edit config.yaml to control simulation behavior:

simulator:
  max_turns: 20
  timeouts:
    agent_response: 30
    tool_execution: 10
  model: gpt-4o-mini

Core Configuration Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_turns | int | 20 | Maximum conversation turns per scenario |
| agent_response | int | 30 | Timeout for agent response (seconds) |
| tool_execution | int | 10 | Timeout for tool execution (seconds) |
| model | string | gpt-4o-mini | Simulator model for user simulation |
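
To make the defaults concrete, the sketch below reads an already-parsed config dictionary and falls back to the documented default for any missing key. SimulatorSettings and from_config are hypothetical names for illustration; ArkSim's own config loader may be structured differently.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SimulatorSettings:  # hypothetical container, not an arksim class
    max_turns: int = 20
    agent_response: int = 30  # seconds
    tool_execution: int = 10  # seconds
    model: str = "gpt-4o-mini"


def from_config(config: dict[str, Any]) -> SimulatorSettings:
    """Read the `simulator` section, falling back to the documented defaults."""
    section = config.get("simulator", {})
    timeouts = section.get("timeouts", {})
    defaults = SimulatorSettings()
    return SimulatorSettings(
        max_turns=section.get("max_turns", defaults.max_turns),
        agent_response=timeouts.get("agent_response", defaults.agent_response),
        tool_execution=timeouts.get("tool_execution", defaults.tool_execution),
        model=section.get("model", defaults.model),
    )


# Only two values are overridden; the rest come from the defaults.
settings = from_config({"simulator": {"max_turns": 5, "timeouts": {"agent_response": 60}}})
print(settings)
```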

Defining Test Scenarios

Edit scenarios.json to define your test cases:

[
  {
    "id": "scenario-001",
    "name": "Customer Inquiry",
    "category": "support",
    "user_profile": {
      "name": "Alice Johnson",
      "age": 34,
      "account_type": "premium"
    },
    "knowledge": [
      "Customer has a premium account",
      "Customer is inquiring about billing"
    ],
    "conversation": [
      {
        "turn": 1,
        "user": "I noticed a charge on my account that I don't recognize",
        "goal": "Customer wants to understand and dispute the unfamiliar charge"
      }
    ]
  }
]

Scenario Schema Fields

| Field | Required | Description |
| --- | --- | --- |
| id | Yes | Unique scenario identifier |
| name | Yes | Human-readable name |
| category | No | Grouping category for filtering |
| user_profile | No | Simulated user attributes |
| knowledge | No | Facts the simulated user knows |
| conversation | Yes | Multi-turn conversation structure |

Sources: examples/customer-service/README.md
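
A quick pre-flight check of scenarios.json against the required fields can catch mistakes before a simulation run. The sketch below is an illustrative helper under the schema in the table above; find_problems is a hypothetical name, and ArkSim may perform its own validation.

```python
import json

REQUIRED_FIELDS = ("id", "name", "conversation")


def find_problems(scenarios: list[dict]) -> list[str]:
    """Return one message per missing required field or duplicate id."""
    problems = []
    seen_ids = set()
    for index, scenario in enumerate(scenarios):
        for field_name in REQUIRED_FIELDS:
            if field_name not in scenario:
                problems.append(f"scenario {index}: missing '{field_name}'")
        scenario_id = scenario.get("id")
        if scenario_id in seen_ids:
            problems.append(f"scenario {index}: duplicate id '{scenario_id}'")
        if scenario_id is not None:
            seen_ids.add(scenario_id)
    return problems


raw = """[
  {"id": "scenario-001", "name": "Customer Inquiry", "conversation": []},
  {"id": "scenario-001", "name": "Refund Request"}
]"""
problems = find_problems(json.loads(raw))
for problem in problems:
    print(problem)
```

Here the second scenario is reported twice: once for the missing conversation field and once for reusing an id.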

Running Simulation and Evaluation

Combined Workflow

Run both simulation and evaluation in one command:

arksim simulate-evaluate config.yaml

Separate Steps

For more control, run simulation and evaluation separately:

# Step 1: Simulate
arksim simulate config_simulate.yaml

# Step 2: Evaluate
arksim evaluate config_evaluate.yaml

Sources: examples/customer-service/README.md

Command Reference

| Command | Description |
| --- | --- |
| arksim init | Scaffold new project |
| arksim simulate <config> | Run simulation only |
| arksim evaluate <config> | Run evaluation only |
| arksim simulate-evaluate <config> | Run both steps |
| arksim ui | Launch web UI (port 8080) |
| arksim examples | Download example projects |
| arksim prompts | List available prompts |

Sources: arksim/cli.py:89-152

Web UI

Launch the web-based control plane:

arksim ui --port 8080

The UI provides:

  • Scenario management
  • Real-time simulation progress
  • Log viewing
  • Dark/light mode toggle

Integration Examples

ArkSim supports integrations with popular agent frameworks:

| Framework | Command |
| --- | --- |
| LangChain/LangGraph | pip install langgraph langchain-openai |
| Claude Agent SDK | pip install claude-agent-sdk |
| Microsoft AutoGen | pip install autogen-agentchat autogen-ext[openai] |
| Dify | HTTP-based integration |

Example workflow for LangGraph:

cd examples/integrations/langgraph
pip install langgraph langchain-openai
export OPENAI_API_KEY="<your-key>"
arksim simulate-evaluate config.yaml

Sources: examples/integrations/langchain/README.md

Next Steps

  • Review the Evaluation Metrics documentation for customizing scoring
  • Explore the Custom Agent guide for advanced integration patterns
  • Download example projects with arksim examples

Sources: README.md

System Architecture Overview

Related topics: Simulation Engine, Evaluation System, LLM Provider Integration

ArkSim is a multi-turn agent evaluation framework that simulates user conversations with agents, captures tool calls, and scores agent performance against customizable metrics. This document provides a comprehensive overview of the system's architecture, core components, and data flows.

Core Components

ArkSim's architecture is organized into three primary subsystems that work together to simulate, capture, and evaluate agent behavior.

| Component | Purpose |
| --- | --- |
| Simulation Engine | Orchestrates multi-turn conversations between user simulators and agents |
| Evaluator | Scores agent responses against qualitative and quantitative metrics |
| LLM Layer | Powers user simulation and metric evaluation via configurable model providers |

High-Level Architecture

graph TD
    subgraph Simulation["Simulation Engine"]
        CLI[CLI Interface]
        Config[Config Loader]
        Simulator[Simulator Core]
        AgentConnector[Agent Connector]
        UserSimulator[User Simulator]
    end

    subgraph Evaluation["Evaluator"]
        Metrics[Custom Metrics]
        Scoring[Scoring Engine]
        Report[HTML Report Generator]
    end

    subgraph LLM["LLM Layer"]
        ChatLLM[Chat LLM]
        EvalLLM[Evaluation LLM]
    end

    subgraph External["External Systems"]
        CustomAgent[Custom Agent]
        HTTPEndpoint[HTTP Endpoint]
        A2AAgent[A2A Agent]
    end

    CLI --> Config
    Config --> Simulator
    Simulator --> UserSimulator
    Simulator --> AgentConnector
    AgentConnector --> CustomAgent
    AgentConnector --> HTTPEndpoint
    AgentConnector --> A2AAgent
    UserSimulator --> ChatLLM
    Metrics --> EvalLLM
    EvalLLM --> Scoring
    Scoring --> Report

    style CLI fill:#e1f5fe
    style Simulator fill:#fff3e0
    style Report fill:#e8f5e9

CLI Interface

The command-line interface provides multiple entry points for running simulations and evaluations. The CLI is implemented in arksim/cli.py and supports the following commands:

| Command | Description |
| --- | --- |
| arksim simulate-evaluate config.yaml | Run simulation and evaluation in a single pipeline |
| arksim simulate config.yaml | Run simulation only |
| arksim evaluate config.yaml | Run evaluation on existing simulation results |
| arksim init | Scaffold starter files for agent testing |
| arksim ui | Launch web UI control plane |
| arksim examples | Download example projects from GitHub |

Sources: arksim/cli.py:1-100

Simulation Modes

ArkSim supports three distinct agent integration patterns:

graph LR
    subgraph AgentTypes["Agent Types"]
        Custom[Custom Agent<br/>Python class extending BaseAgent]
        HTTP[Chat Completions<br/>HTTP endpoint /v1/chat/completions]
        A2A[A2A Protocol<br/>Agent-to-Agent standard]
    end

  1. Custom Agent (Python class): Extend BaseAgent and implement the execute() method
  2. Chat Completions: Configure an HTTP endpoint for OpenAI-compatible chat completions API
  3. A2A Protocol: Connect via the Agent-to-Agent protocol standard

Sources: README.md

Simulation Engine

The simulation engine orchestrates multi-turn conversations between a user simulator and the agent under test. Each turn consists of:

  1. User simulator generates a response based on conversation history and scenario knowledge
  2. Agent executes the user query and returns a response (optionally with tool calls)
  3. Simulator captures the interaction for evaluation
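
The three steps of a turn can be sketched as a simple loop. This is a toy illustration, not the engine's code: simulated_user replays scenario knowledge instead of calling an LLM, agent_execute is a stand-in for the agent under test, and the history format is an assumption.

```python
import asyncio


async def simulated_user(history: list[dict], knowledge: list[str]) -> str:
    """Stand-in for the LLM-backed user simulator: replays knowledge in order."""
    user_turns = sum(1 for message in history if message["role"] == "user")
    return knowledge[user_turns % len(knowledge)]


async def agent_execute(user_query: str) -> str:
    """Stand-in for the agent under test."""
    return f"Noted: {user_query}"


async def run_conversation(knowledge: list[str], max_turns: int) -> list[dict]:
    history: list[dict] = []
    for _ in range(max_turns):
        # 1. The user simulator speaks, given history and scenario knowledge.
        user_message = await simulated_user(history, knowledge)
        history.append({"role": "user", "content": user_message})
        # 2. The agent answers the user query.
        agent_message = await agent_execute(user_message)
        # 3. The exchange is recorded for later evaluation.
        history.append({"role": "assistant", "content": agent_message})
    return history


transcript = asyncio.run(
    run_conversation(["I was double charged.", "It was on May 3."], max_turns=2)
)
for message in transcript:
    print(f"{message['role']}: {message['content']}")
```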

Tool Call Capture

ArkSim captures tool/function calls in two ways:

| Capture Method | Mechanism | Configuration |
| --- | --- | --- |
| AgentResponse | Agent returns structured AgentResponse with tool_calls list | Default behavior |
| TracingProcessor | SDK's TracingProcessor.on_span_end captures calls automatically | trace_receiver.enabled: true |

Sources: examples/customer-service/README.md

#### Tool Call Data Model

Tool calls are represented by the ToolCall class:

class ToolCall(BaseModel):
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None

The AgentResponse wraps both content and tool calls:

class AgentResponse(BaseModel):
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)

Sources: arksim/simulation_engine/tool_types.py:1-50

Traced Agent Flow

sequenceDiagram
    participant Simulator
    participant Agent
    participant Tracing as ArksimTracingProcessor
    participant Evaluator

    Simulator->>Agent: execute(user_query)
    Agent->>Tracing: on_span_end(span)
    Tracing->>Simulator: captured_tool_calls
    Simulator->>Agent: RunResult
    Agent->>Simulator: AgentResponse
    Simulator->>Evaluator: Tool calls + Conversation
    Evaluator->>Simulator: Metric Scores

Sources: examples/customer-service/README.md

Evaluator

The evaluator scores agent performance using both quantitative and qualitative metrics. The evaluation framework supports custom metrics defined via Python files.

Metric Types

| Type | Description | Scoring Method |
| --- | --- | --- |
| QuantitativeMetric | Numerical scores (0.0-1.0 scale) | Structured JSON schema validation |
| QualitativeMetric | Free-form evaluation with reasoning | LLM-generated analysis |

Sources: examples/customer-service/custom_metrics.py:1-60

Custom Metrics Structure

Custom metrics require:

  1. A Pydantic BaseModel defining the output schema
  2. A system prompt describing evaluation criteria
  3. Implementation of QuantitativeMetric or QualitativeMetric

class ConversionSchema(BaseModel):
    intent_strength: float
    conversion_outcome: float
    evidence: list[str]
    reason: str

CONVERSION_SYSTEM_PROMPT = """\
You are an impartial evaluator for an e-commerce shopping agent.
Your job is to score (1) the shopper's purchase intent and (2) whether the agent achieved a conversion outcome...
"""

Sources: examples/e-commerce/custom_metrics.py:1-40
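
An output schema is useful because the evaluator model's reply can be parsed and checked before it is turned into a score. The sketch below illustrates that step with a dataclass stand-in for the Pydantic schema above; parse_reply and the 0.0-1.0 range check are assumptions based on the quantitative metric scale described earlier, not ArkSim code.

```python
import json
from dataclasses import dataclass


@dataclass
class ConversionScores:  # dataclass stand-in for the Pydantic ConversionSchema
    intent_strength: float
    conversion_outcome: float
    evidence: list[str]
    reason: str


def parse_reply(reply_text: str) -> ConversionScores:
    """Parse an evaluator reply and reject scores outside 0.0-1.0."""
    scores = ConversionScores(**json.loads(reply_text))
    for name in ("intent_strength", "conversion_outcome"):
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
    return scores


reply = (
    '{"intent_strength": 0.8, "conversion_outcome": 1.0, '
    '"evidence": ["added item to cart"], "reason": "checkout completed"}'
)
scores = parse_reply(reply)
print(scores.intent_strength, scores.conversion_outcome)
```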

Score Calculation

The evaluator computes a Final Score as a weighted average:

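| Component | Weight | Description |
| --- | --- | --- |
| Turn Success Ratio | 40% | Ratio of successful turns to total turns |
| Goal Completion Score | 60% | LLM-assessed goal achievement score |

The final score then determines the reported status:

| Status | Condition |
| --- | --- |
| Done | Final score = 1.0 |
| Partial Failure | 0.0 < Final score < 1.0 |
| Complete Failure | Final score = 0.0 |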

Sources: arksim/utils/html_report/report_template.html:1-50
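
The status conditions translate directly into a small function. This is a sketch of the mapping as documented above; status_for is a hypothetical name, and the range check is an added safeguard rather than something the source specifies.

```python
def status_for(final_score: float) -> str:
    """Map a final score in [0.0, 1.0] to the report status."""
    if not 0.0 <= final_score <= 1.0:
        raise ValueError(f"final score out of range: {final_score}")
    if final_score == 1.0:
        return "Done"
    if final_score == 0.0:
        return "Complete Failure"
    return "Partial Failure"


for value in (1.0, 0.62, 0.0):
    print(value, status_for(value))
```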

Data Flow

graph LR
    subgraph Input["Input"]
        Config[config.yaml]
        Scenarios[scenarios.json]
        Metrics[custom_metrics.py]
    end

    subgraph Process["Processing"]
        Sim[Simulation]
        Eval[Evaluation]
        Score[Scoring]
    end

    subgraph Output["Output"]
        Results[simulation.json]
        Report[evaluation.html]
    end

    Config --> Sim
    Scenarios --> Sim
    Sim --> Eval
    Metrics --> Eval
    Eval --> Score
    Score --> Results
    Score --> Report

Agent Connector Types

ArkSim supports multiple agent integration patterns via the configuration system. The three agent_config blocks below are alternatives; a config file contains only one of them:

# Option 1: Custom Python Agent
agent_config:
  agent_type: custom
  agent_name: my-agent

# Option 2: Chat Completions HTTP Endpoint
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

# Option 3: A2A Protocol
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

Sources: README.md
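
A parsed agent_config block can be sanity-checked before a run. The sketch below works on a plain dictionary and is only illustrative: check_agent_config is a hypothetical helper, and the rule that chat_completions and a2a need api_config.endpoint is an assumption inferred from the examples above.

```python
AGENT_TYPES_NEEDING_ENDPOINT = {"chat_completions", "a2a"}
KNOWN_AGENT_TYPES = AGENT_TYPES_NEEDING_ENDPOINT | {"custom"}


def check_agent_config(agent_config: dict) -> list[str]:
    """Return readable problems; an empty list means the block looks usable."""
    problems = []
    agent_type = agent_config.get("agent_type")
    if agent_type not in KNOWN_AGENT_TYPES:
        problems.append(f"unknown agent_type: {agent_type!r}")
    if not agent_config.get("agent_name"):
        problems.append("agent_name is missing")
    endpoint = agent_config.get("api_config", {}).get("endpoint")
    if agent_type in AGENT_TYPES_NEEDING_ENDPOINT and not endpoint:
        problems.append(f"{agent_type} needs api_config.endpoint")
    return problems


ok = check_agent_config({"agent_type": "custom", "agent_name": "my-agent"})
missing_endpoint = check_agent_config({"agent_type": "a2a", "agent_name": "my-agent"})
print(ok)                # []
print(missing_endpoint)  # ['a2a needs api_config.endpoint']
```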

User Simulator

The user simulator generates realistic multi-turn conversations based on:

  • Scenario definitions: Predefined conversation flows and expected behaviors
  • User profiles: Demographic and behavioral attributes for persona simulation
  • Knowledge bases: Domain-specific information the simulated user possesses
  • Conversation history: Full context of the multi-turn interaction

The simulator uses LLM-based generation to produce contextually appropriate user responses that evolve through the conversation based on agent actions and conversation state.

HTML Report Generation

After evaluation completes, ArkSim generates an interactive HTML report containing:

| Section | Content |
| --- | --- |
| Summary Statistics | Overall scores, pass/fail rates |
| Per-Scenario Metrics | Individual metric scores with reasoning |
| Failure Categories | Grouped analysis of failure types |
| Conversation Transcripts | Full turn-by-turn dialogue viewer |

Sources: arksim/utils/html_report/report_template.html:1-100

Integration Examples

ArkSim provides example integrations for popular agent frameworks:

| Framework | Example Location | Connection Method |
| --- | --- | --- |
| LangGraph | examples/integrations/langgraph/ | Custom agent connector |
| LangChain | examples/integrations/langchain/ | Custom agent connector |
| Claude Agent SDK | examples/integrations/claude-agent-sdk/ | Custom agent connector |
| AutoGen | examples/integrations/autogen/ | Custom agent connector |
| Pydantic AI | examples/integrations/pydantic-ai/ | Custom agent connector |
| Dify | examples/integrations/dify/ | HTTP client |

Sources: examples/integrations/langgraph/README.md, examples/integrations/claude-agent-sdk/README.md, examples/integrations/autogen/README.md

Configuration Schema

The main configuration file (config.yaml) controls all aspects of simulation and evaluation:

| Section | Key Options |
| --- | --- |
| agent_config | agent_type, agent_name, api_config.endpoint |
| scenario_file | Path to scenarios.json |
| metrics_to_run | List of metric names to execute |
| custom_metrics_file_paths | Paths to custom metric Python files |
| trace_receiver.enabled | Enable TracingProcessor capture (default: false) |
| trace_receiver.wait_timeout | Timeout for trace capture (seconds) |

Sources: examples/customer-service/README.md

Web UI

ArkSim includes a web-based control plane for managing simulations:

graph TD
    subgraph UI["Web UI"]
        Build[Build Scenarios]
        Load[Load Existing]
        Run[Run Simulation]
        View[View Results]
    end

    subgraph Features["UI Features"]
        AutoGen[Auto-generate Scenarios PRO]
        Browse[File Browser]
        Refresh[Refresh Results]
    end

Features include:

  • Scenario building and loading
  • Auto-generate scenarios (PRO feature)
  • File browser integration
  • Results viewing and refresh

Sources: arksim/ui/frontend/index.html:1-80

Summary

ArkSim provides a comprehensive multi-turn agent evaluation framework with:

  • Flexible agent integration: Support for custom Python agents, HTTP endpoints, and A2A protocol
  • Tool call capture: Two mechanisms for capturing agent tool executions
  • Customizable metrics: Both quantitative and qualitative evaluation approaches
  • Interactive reporting: HTML-based results with conversation viewer
  • CLI and UI: Command-line and web-based interfaces for running evaluations
  • Framework integrations: Pre-built examples for LangGraph, LangChain, Claude SDK, AutoGen, Pydantic AI, and Dify

Sources: arksim/cli.py:1-100

Simulation Engine

Related topics: System Architecture Overview, Evaluation System, Scenario Management

The Simulation Engine is the core component of ArkSim responsible for executing multi-turn conversations between simulated users and agent systems under test. It orchestrates the entire simulation lifecycle, from scenario loading to agent execution, while capturing tool calls, responses, and conversation state for downstream evaluation.

Architecture Overview

The Simulation Engine follows a layered architecture that separates concerns between scenario management, agent execution, tool call capture, and knowledge handling.

graph TD
    A[Scenarios JSON] --> B[Simulator]
    C[Config YAML] --> B
    B --> D[Agent Executor]
    D --> E[BaseAgent]
    E --> F[Custom Agent / Chat Completions / A2A]
    F --> G[Tool Calls Capture]
    G --> H[Simulation Results]
    D --> I[Multi-Knowledge Handler]
    I --> J[Knowledge Sources]
    
    style B fill:#e1f5fe
    style D fill:#fff3e0
    style E fill:#e8f5e9

Core Components

| Component | File | Purpose |
|---|---|---|
| Simulator | simulator.py | Orchestrates the simulation lifecycle |
| Entities | entities.py | Data models for scenarios, profiles, and turns |
| Agent Base | agent/base.py | Abstract base class for all agent implementations |
| Tool Types | tool_types.py | Data models for tool calls and agent responses |
| Multi-Knowledge Handler | core/multi_knowledge_handling.py | Manages multiple knowledge sources for user profiles |
| Prompt Utilities | utils/prompts.py | Generates prompts for simulated users |

Data Models

ToolCall

The ToolCall class represents a single tool or function call observed during a conversation turn.

class ToolCall(BaseModel):
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None

Key characteristics:

  • Declares extra="ignore" for forward compatibility with future versions
  • Supports capturing arguments, results, and error states
  • Tracks the source of tool calls (e.g., from response parsing or tracing)

Sources: arksim/simulation_engine/tool_types.py:26-37

AgentResponse

The AgentResponse class provides a structured return from agent execution.

class AgentResponse(BaseModel):
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)

Sources: arksim/simulation_engine/tool_types.py:45-52

BaseAgent Interface

All agent implementations must inherit from BaseAgent and implement the required async methods.

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"

Sources: README.md:45-55
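
The interface above can be exercised end to end with a small, self-contained sketch. It assumes nothing about ArkSim internals: `BaseAgent` and `AgentResponse` are redefined here as stand-ins so the snippet runs without the package, and `EchoAgent` and `run_conversation` are hypothetical names used only for illustration.

```python
import asyncio
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


# Stand-ins for arksim's AgentResponse and BaseAgent (assumption: the real
# classes live in the arksim package and may carry more behavior).
@dataclass
class AgentResponse:
    content: str
    tool_calls: list = field(default_factory=list)


class BaseAgent(ABC):
    @abstractmethod
    async def get_chat_id(self) -> str: ...

    @abstractmethod
    async def execute(self, user_query: str, **kwargs: object) -> "str | AgentResponse": ...


class EchoAgent(BaseAgent):
    """Toy agent: keeps per-conversation history and echoes each query."""

    def __init__(self) -> None:
        self._chat_id = str(uuid.uuid4())  # one identifier per conversation
        self.history = []

    async def get_chat_id(self) -> str:
        return self._chat_id

    async def execute(self, user_query: str, **kwargs: object) -> "str | AgentResponse":
        self.history.append(user_query)
        return f"turn {len(self.history)}: you said {user_query!r}"


async def run_conversation(agent: BaseAgent, queries: list) -> list:
    """Hypothetical driver that feeds user turns to the agent one at a time."""
    replies = []
    for query in queries:
        reply = await agent.execute(query)
        # Normalize both allowed return types to plain text.
        replies.append(reply.content if isinstance(reply, AgentResponse) else reply)
    return replies


replies = asyncio.run(run_conversation(EchoAgent(), ["hi", "refund status?"]))
print(replies)
```

Generating the chat id once in the constructor, rather than on every `get_chat_id()` call, keeps the identifier stable across the turns of one conversation.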

Required Methods

| Method | Return Type | Description |
|---|---|---|
| get_chat_id() | str | Returns a unique identifier for the chat session |
| execute() | str \| AgentResponse | Executes the agent with a user query and returns content with optional tool calls |

Sources: README.md:45-55

Simulation Workflow

graph LR
    A[Load Config] --> B[Load Scenarios]
    B --> C[Initialize Agent]
    C --> D[For Each Scenario]
    D --> E[Generate User Profile]
    E --> F[Execute Turn]
    F --> G[Capture Tool Calls]
    G --> H[Record Response]
    H --> I{More Turns?}
    I -->|Yes| F
    I -->|No| J[Save Results]
    J --> K{More Scenarios?}
    K -->|Yes| D
    K -->|No| L[Complete]

Step 0: Build Scenarios

Before running a simulation, users must create or load test scenarios. The UI provides options to:

  • Auto-generate Scenarios (Pro feature) - Automatically generate realistic test scenarios from the agent's knowledge base
  • Load Existing - Load scenario files from a specified path

Sources: arksim/ui/frontend/index.html:1-50

Step 1: Simulation Execution

The simulator executes each scenario by:

  1. Loading the scenario configuration and user profiles
  2. Generating multi-turn conversations based on the scenario goals
  3. Capturing all tool calls and responses
  4. Producing a simulation.json output file

Sources: examples/integrations/dify/README.md:18-20

Agent Types

ArkSim supports multiple agent integration patterns:

| Agent Type | Configuration | Use Case |
|---|---|---|
| Custom Python Class | agent_type: custom | Full control via subclassing BaseAgent |
| Chat Completions | agent_type: chat_completions | HTTP endpoint compatible with OpenAI format |
| A2A Protocol | agent_type: a2a | Agent-to-Agent protocol endpoints |

Sources: README.md:57-68

Chat Completions Configuration

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

A2A Protocol Configuration

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

Sources: README.md:57-68

Tool Call Capture

ArkSim provides two mechanisms for capturing tool calls:

Response-Based Capture (Default)

The agent returns an AgentResponse containing explicit tool calls:

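async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    # self.agent and extract_tool_calls are placeholders: your framework's
    # runner and a helper that maps its tool records to ToolCall objects.
    run_result = self.agent.run(user_query)
    return AgentResponse(
        content=run_result.text,
        tool_calls=extract_tool_calls(run_result),
    )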

Tracing-Based Capture (Automatic)

For agents using the ArkSim tracing processor, tool calls are captured automatically without modifying the agent response:

# At module load
from arksim.simulation_engine.tracing import ArksimTracingProcessor
agent.register(ArksimTracingProcessor())

# Agent returns plain str
async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    return self.agent.run(user_query)  # Returns plain string

Sources: examples/customer-service/README.md:1-35
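
For response-based capture, the agent has to translate whatever its framework records about tool usage into `ToolCall` objects. The following is a minimal sketch of such a helper, not ArkSim code: the input dict shape (`id`, `name`, `arguments`, `output`, `error`) is a hypothetical framework format, and `ToolCall` is a dataclass stand-in mirroring the fields documented earlier.

```python
import json
import uuid
from dataclasses import dataclass, field
from typing import Optional


# Stand-in mirroring the documented ToolCall fields (assumption: the real
# model is a Pydantic class in arksim.simulation_engine.tool_types).
@dataclass
class ToolCall:
    id: str
    name: str
    arguments: dict = field(default_factory=dict)
    result: Optional[str] = None
    error: Optional[str] = None


def extract_tool_calls(raw_calls: list) -> list:
    """Convert framework-specific tool records (hypothetical dicts) to ToolCall."""
    calls = []
    for raw in raw_calls:
        args = raw.get("arguments", {})
        if isinstance(args, str):  # some frameworks emit JSON-encoded arguments
            args = json.loads(args) if args else {}
        calls.append(
            ToolCall(
                id=raw.get("id") or str(uuid.uuid4()),  # synthesize an id if absent
                name=raw["name"],
                arguments=args,
                result=None if raw.get("output") is None else str(raw["output"]),
                error=raw.get("error"),
            )
        )
    return calls


calls = extract_tool_calls([
    {"id": "call_1", "name": "lookup_order",
     "arguments": '{"order_id": "A42"}', "output": {"status": "shipped"}},
    {"name": "issue_refund", "arguments": {"order_id": "A42"}, "error": "not authorized"},
])
print([c.name for c in calls])
```

Recording failed calls with `error` set, instead of dropping them, preserves the information an evaluator would need to judge how the agent handled the failure.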

Configuration

CLI Usage

# Combined simulation and evaluation
arksim simulate-evaluate config.yaml

# Separate steps
arksim simulate config_simulate.yaml
arksim evaluate config_evaluate.yaml

Sources: examples/customer-service/README.md:8-14

Trace Receiver Configuration

trace_receiver:
  enabled: true
  wait_timeout: 5

When trace_receiver.enabled is false, ArkSim only captures tool calls from AgentResponse.

Sources: examples/customer-service/README.md:30-35

Integration Examples

ArkSim provides pre-built integrations for popular agent frameworks:

| Framework | Package | Example Path |
|---|---|---|
| LangGraph | langgraph, langchain-openai | examples/integrations/langgraph/ |
| AutoGen | autogen-agentchat, autogen-ext[openai] | examples/integrations/autogen/ |
| Claude Agent SDK | claude-agent-sdk | examples/integrations/claude-agent-sdk/ |
| CrewAI | crewai | examples/integrations/crewai/ |
| Pydantic AI | pydantic-ai | examples/integrations/pydantic-ai/ |
| LangChain | langgraph, langchain-openai | examples/integrations/langchain/ |

Sources: examples/integrations/langgraph/README.md, examples/integrations/autogen/README.md, examples/integrations/claude-agent-sdk/README.md, examples/integrations/crewai/README.md, examples/integrations/pydantic-ai/README.md, examples/integrations/langchain/README.md

Output Format

Simulation results are written to ./results/simulation/simulation.json containing:

  • Complete conversation transcripts
  • Captured tool calls with arguments and results
  • Turn-by-turn timing information
  • User profile data
  • Scenario metadata

Sources: examples/integrations/dify/README.md:18-20

Custom Metrics Support

The simulation engine supports custom quantitative and qualitative metrics through the evaluator. Custom metric files can be added to custom_metrics/ directories and referenced in the configuration.

Sources: examples/customer-service/custom_metrics.py:1-20

Forward Compatibility

Both ToolCall and AgentResponse declare extra="ignore" in their Pydantic configuration. This ensures that snapshots from future ArkSim versions containing new fields can be loaded by older versions without raising ValidationError.

model_config = ConfigDict(extra="ignore")

Sources: arksim/simulation_engine/tool_types.py:26-28, arksim/simulation_engine/tool_types.py:45-47

Evaluation System

Related topics: System Architecture Overview, Simulation Engine

The Evaluation System in ArkSim is responsible for scoring simulated conversations between users and agents. It measures goal completion, helpfulness, coherence, and other quality metrics to identify where agents succeed and where they fail.

Overview

After simulation generates conversation transcripts, the evaluator analyzes each conversation and assigns scores across multiple dimensions. The system supports both built-in metrics and custom-defined metrics.

Purpose:

  • Quantify agent performance objectively
  • Identify specific failure patterns
  • Generate actionable HTML reports for debugging

Scope:

  • Scores conversations on a 0.0-1.0 scale
  • Computes weighted final scores combining multiple metrics
  • Categorizes outcomes as done, partial failure, or failed
  • Supports LLM-based evaluation (using configured providers like OpenAI)

Sources: README.md

Architecture

graph TD
    A[Simulation Results] --> B[Evaluator]
    B --> C[Built-in Metrics]
    B --> D[Custom Metrics]
    B --> E[Tool Call Metrics]
    C --> F[Final Scores]
    D --> F
    E --> F
    F --> G[HTML Report Generator]
    G --> H[evaluation/final_report.html]

Core Components

| Component | File | Purpose |
|---|---|---|
| Evaluator | evaluator.py | Main orchestration of evaluation pipeline |
| Base Metric | base_metric.py | Abstract base classes for metrics |
| Built-in Metrics | builtin_metrics.py | Standard metrics (faithfulness, helpfulness, etc.) |
| Tool Call Metrics | tool_call_metrics.py | Evaluation of tool usage patterns |
| Trajectory Matching | trajectory_matching.py | Compare expected vs actual agent trajectories |
| Thresholds | thresholds.py | Score classification logic |
| HTML Report | generate_html_report.py | Generate visual evaluation reports |

Evaluation Workflow

sequenceDiagram
    participant S as Simulator
    participant E as Evaluator
    participant M as Metrics
    participant R as Report Generator
    
    S->>E: Simulation results (JSON)
    E->>M: For each conversation
    M->>M: Run quantitative metrics
    M->>M: Run qualitative metrics
    M->>E: Per-metric scores
    E->>E: Calculate final scores
    E->>E: Determine status
    E->>R: Score data
    R->>R: Generate HTML

Step-by-Step Process

  1. Load Simulation Data: Read conversation transcripts from simulation output
  2. Select Metrics: Determine which built-in and custom metrics to run
  3. Score Each Conversation: Apply metrics to score goal completion, behavior, etc.
  4. Compute Final Scores: Calculate weighted averages (goal_completion_weight=0.25, turn_success_ratio_weight=0.75)
  5. Classify Status: Assign done/partial_failure/failed based on thresholds
  6. Generate Report: Create HTML report with detailed breakdowns

Sources: arksim/utils/html_report/report_template.html
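
Step 4 above can be illustrated with a short calculation. This is a sketch of a plain weighted average using the two weights quoted in that step; the function name `final_score` and the input validation are assumptions, and the evaluator's actual aggregation may differ in detail.

```python
def final_score(goal_completion: float, turn_success_ratio: float,
                goal_completion_weight: float = 0.25,
                turn_success_ratio_weight: float = 0.75) -> float:
    """Weighted average of two component scores, each expected in [0.0, 1.0]."""
    for value in (goal_completion, turn_success_ratio):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range: {value}")
    total_weight = goal_completion_weight + turn_success_ratio_weight
    weighted_sum = (goal_completion * goal_completion_weight
                    + turn_success_ratio * turn_success_ratio_weight)
    return weighted_sum / total_weight


# Goal fully met, but only 4 of 5 turns judged successful: roughly 0.85.
score = final_score(goal_completion=1.0, turn_success_ratio=0.8)
print(round(score, 2))
```

With these weights, per-turn quality dominates: a conversation that reaches the goal but stumbles on several turns still loses a noticeable share of its final score.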

Scoring System

Score Ranges

| Score Type | Range | Description |
|---|---|---|
| Quantitative | 0.0 - 1.0 | Numeric scores for measurable criteria |
| Qualitative | Label-based | Categorical results (compliant, professional, pass, etc.) |
| Goal Completion | 0.0 - 1.0 | Whether agent completed user goal |
| Final Score | 0.0 - 1.0 | Weighted combination of metrics |

Status Classification

| Status | Condition | Description |
|---|---|---|
| Done | final_score == 1.0 | Perfect performance, goal completed |
| Partial Failure | 0.6 <= final_score < 1.0 | Acceptable but with some failures |
| Failed | final_score < 0.6 | Poor performance requiring attention |

Sources: arksim/utils/html_report/report_template.html:85-87
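
The classification rules in the table can be written as a small function. This is a sketch based only on the conditions listed above; the name `classify_status`, the `pass_threshold` parameter, and the use of `math.isclose` to tolerate floating-point rounding near 1.0 are assumptions, not the contents of thresholds.py.

```python
import math


def classify_status(final_score: float, pass_threshold: float = 0.6) -> str:
    """Map a final score in [0.0, 1.0] to done / partial_failure / failed."""
    if math.isclose(final_score, 1.0):  # tolerate float rounding near 1.0
        return "done"
    if final_score >= pass_threshold:
        return "partial_failure"
    return "failed"


for value in (1.0, 0.85, 0.6, 0.59):
    print(value, classify_status(value))
```

Checking the perfect-score case first matters: without it, a score of 1.0 would also satisfy the partial-failure condition.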

Built-in Metrics

The system provides seven built-in metrics that can be selected via configuration:

| Metric | Purpose |
|---|---|
| faithfulness | Did the agent provide factually accurate information? |
| helpfulness | Was the agent's response useful to the user? |
| coherence | Were responses logically connected and consistent? |
| verbosity | Did the agent maintain appropriate response length? |
| relevance | Did responses address the user's actual query? |
| goal_completion | Did the agent help the user accomplish their goal? |
| agent_behavior_failure | Did the agent exhibit any problematic behaviors? |

Metric Selection

From the frontend, users can select which built-in metrics to run:

<template x-for="m in ['faithfulness', 'helpfulness', 'coherence', 'verbosity', 'relevance', 'goal_completion', 'agent_behavior_failure']" :key="m">

If no metrics are selected, all built-in metrics run by default.

Sources: arksim/ui/frontend/index.html

Custom Metrics

Developers can define custom evaluation metrics by creating a Python module.

Creating Custom Metrics

from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
)

class MyMetric(QuantitativeMetric):
    name = "my_custom_metric"
    schema = MySchema  # Pydantic model
    
    async def evaluate(self, input: ScoreInput) -> QuantResult:
        # Evaluation logic
        return QuantResult(...)

Quantitative vs Qualitative Metrics

| Type | Base Class | Output |
|---|---|---|
| Quantitative | QuantitativeMetric | QuantResult with float value and reason |
| Qualitative | QualitativeMetric | QualResult with categorical label |

Configuration

Add custom metrics to config.yaml:

custom_metrics_file_paths:
  - /path/to/custom_metrics.py

metrics_to_run:
  - my_custom_metric  # optional; runs all if omitted

Sources: examples/customer-service/custom_metrics.py

Example: Verification Compliance

class VerificationComplianceSchema(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str

The evaluation prompt instructs the LLM to score:

  • Identity Verification: Did the agent verify customer identity before actions?
  • Action Gating: Did the agent gate sensitive actions behind verification?

Sources: examples/customer-service/custom_metrics.py:25-35

Example: E-commerce Conversion

class ConversionSchema(BaseModel):
    intent_strength: float
    conversion_outcome: float
    evidence: list[str]
    reason: str

Metrics track:

  • Intent Strength: How ready the shopper is to buy
  • Conversion Outcome: Whether the agent achieved a purchase decision

Sources: examples/e-commerce/custom_metrics.py

HTML Report

The evaluation system generates a detailed HTML report (evaluation/final_report.html) containing:

Report Sections

| Section | Content |
|---|---|
| Summary Statistics | Overall pass/fail rates, average scores |
| Conversations Table | Per-conversation scores, status badges |
| Detailed Breakdown | Goal completion, final scores, failure reasons |
| Score Reasons | LLM-generated explanations for each metric |

Report Features

  • Interactive Table: Sort and filter conversations
  • Score Details: Expandable sections showing metric-by-metric breakdown
  • Status Badges: Visual indicators (done/partial/failed)
  • Tooltip Explanations: Hover info for column headers

Score Display Logic

const POSITIVE_LABELS = ['compliant', 'professional', 'pass', 'good', 'complete', 'no failure', 'ok'];
const NEGATIVE_LABELS = ['flagged', 'unprofessional', 'fail', 'error', 'poor', 'missing', 'violated', 'partial'];

Qualitative scores are automatically classified as positive, negative, or neutral based on label matching.

Sources: arksim/utils/html_report/report_template.html:78-81
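
The label classification described above can be mirrored in Python for use outside the report, for example when post-processing scores. This sketch reuses the two label lists from the template but assumes exact matching after trimming and lowercasing; the template's real matching rule (exact, prefix, or substring) is not shown in the source, and `label_polarity` is a hypothetical name.

```python
POSITIVE_LABELS = {"compliant", "professional", "pass", "good",
                   "complete", "no failure", "ok"}
NEGATIVE_LABELS = {"flagged", "unprofessional", "fail", "error",
                   "poor", "missing", "violated", "partial"}


def label_polarity(label: str) -> str:
    """Classify a qualitative label as positive, negative, or neutral.

    Assumption: exact match on the normalized label. Exact matching avoids
    treating "no failure" as negative just because it contains "fail".
    """
    normalized = label.strip().lower()
    if normalized in POSITIVE_LABELS:
        return "positive"
    if normalized in NEGATIVE_LABELS:
        return "negative"
    return "neutral"


for label in ("Compliant", "fail", "no failure", "escalated"):
    print(label, "->", label_polarity(label))
```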

Configuration

Evaluation Configuration Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| evalProvider | string | - | LLM provider for evaluation |
| evalModel | string | - | Model to use for scoring |
| metricsToRun | list | all | Which built-in metrics to execute |
| customMetricsFilePaths | list | [] | Paths to custom metric modules |
| evalNumWorkers | int | auto | Parallel evaluation workers |

Provider Selection

The evaluator supports multiple LLM providers:

  • OpenAI
  • Azure OpenAI
  • Custom endpoints (via chat_completions type)
  • A2A protocol agents

Sources: arksim/ui/frontend/index.html

Integration with Simulation

The evaluation system is typically invoked via the CLI:

# Combined simulation and evaluation
arksim simulate-evaluate config.yaml

# Separate steps
arksim simulate config_simulate.yaml
arksim evaluate config_evaluate.yaml

Pipeline Flow

graph LR
    A[config.yaml] --> B[Simulator]
    B --> C[simulation.json]
    C --> D[Evaluator]
    D --> E[evaluation/]
    E --> F[final_report.html]
    E --> G[scores.json]

Results are written to ./results/simulation/simulation.json for simulation and ./results/evaluation/ for evaluation output.

Sources: examples/integrations/dify/README.md

Advanced Features

Tool Call Capture

ArkSim can automatically capture tool calls via tracing:

trace_receiver:
  enabled: true
  wait_timeout: 5

When enabled, tool calls are captured automatically, so the agent does not need to return them explicitly in an AgentResponse.

Trajectory Matching

The trajectory_matching.py module compares expected agent trajectories against actual behavior, useful for validating that agents follow prescribed action sequences.
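
To make the idea concrete, here is a sketch of two common ways to compare an expected tool sequence with the observed one: strict equality and in-order (subsequence) matching. Both functions are illustrative assumptions; the source does not describe which strategies trajectory_matching.py actually implements.

```python
def exact_match(expected: list, actual: list) -> bool:
    """True only if the agent called exactly the expected tools, in order."""
    return expected == actual


def in_order_match(expected: list, actual: list) -> bool:
    """True if the expected tool names appear in `actual` in the same
    relative order; extra calls in between are allowed."""
    remaining = iter(actual)
    # `name in remaining` advances the iterator until it finds `name`,
    # so each expected step must occur after the previous one.
    return all(name in remaining for name in expected)


expected = ["verify_identity", "lookup_order", "issue_refund"]
actual = ["verify_identity", "search_kb", "lookup_order", "issue_refund"]

print(exact_match(expected, actual))
print(in_order_match(expected, actual))
```

Strict equality suits tightly scripted workflows, while in-order matching tolerates harmless extra calls yet still catches steps performed out of sequence, such as a refund issued before identity verification.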

Thresholds

The thresholds.py module defines score boundaries and classification logic for determining pass/fail conditions.

Summary

The Evaluation System provides comprehensive, configurable scoring of agent conversations:

  • Flexible Metric System: Built-in metrics cover common quality dimensions; custom metrics extend evaluation to domain-specific criteria
  • LLM-based Scoring: Uses configurable language models to generate nuanced, explainable scores
  • Visual Reporting: HTML reports make results easy to understand and act upon
  • Status Classification: Automatic categorization into done/partial_failure/failed for quick assessment
  • Integration-ready: Works with Python agents, chat completions endpoints, and A2A protocol agents

Sources: README.md

Agent Types and Integration

Related topics: LLM Provider Integration, Tool Call Capture, Configuration System

ArkSim supports multiple agent connection types to accommodate different architectures and deployment scenarios. This page documents the available agent types, their configuration methods, and integration patterns.

Overview

ArkSim provides a flexible agent integration system that enables testing of various agent implementations through a unified simulation interface. The simulator communicates with agents via standardized protocols and captures responses for evaluation.

The agent integration system consists of:

  • A BaseAgent abstract class that defines the agent interface
  • Multiple client implementations for different connection types
  • A factory pattern for agent instantiation based on configuration
  • Support for tool call capture through both explicit responses and tracing

Sources: README.md

Supported Agent Types

ArkSim supports three primary agent types, each suited for different deployment scenarios.

| Agent Type | Description | Use Case |
|---|---|---|
| custom | Python class extending BaseAgent | Custom agent logic, no external server required |
| chat_completions | HTTP endpoint with OpenAI-compatible API | Existing REST APIs, external agent services |
| a2a | Agent-to-Agent protocol endpoint | Multi-agent systems, A2A-compliant agents |

Sources: arksim/cli.py:100-109

Custom Agent (Python Class)

The custom agent type is the default integration method. It requires a Python class that extends BaseAgent and implements both required async methods, get_chat_id and execute.

Implementation Pattern

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"

Sources: README.md

Returning Tool Calls

To enable tool call evaluation, return an AgentResponse object instead of a plain string. This allows the evaluator to assess whether the agent correctly invoked required tools.

from arksim.simulation_engine.tool_types import AgentResponse, ToolCall

async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    tool_calls = [
        ToolCall(
            id="call-1",  # id is a required field; use any string unique to this call
            name="search_knowledge_base",
            arguments={"query": user_query},
        )
    ]
    return AgentResponse(
        content="Found relevant information in the knowledge base.",
        tool_calls=tool_calls,
    )

Traced Agent Variant

For agents using OpenTelemetry-based instrumentation, ArkSim supports automatic tool call capture through the TracingProcessor interface. This eliminates the need to explicitly return tool calls in AgentResponse.

Simulator sets routing context -> agent.execute() runs normally
-> SDK fires TracingProcessor.on_span_end -> arksim captures -> evaluator scores

Sources: examples/customer-service/README.md

Chat Completions Agent (HTTP API)

The chat_completions agent type connects to any HTTP endpoint implementing an OpenAI-compatible chat completions interface.

Configuration

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

Sources: README.md

Request Format

ArkSim sends requests following the OpenAI chat completions format:

{
  "model": "agent-model",
  "messages": [
    {"role": "user", "content": "user query"}
  ]
}

Response Handling

The endpoint should return responses in the standard chat completions format:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "agent response"
      }
    }
  ]
}

Sources: examples/integrations/dify/README.md
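
For local testing, a stub endpoint that speaks the request and response shapes shown above can be built with the Python standard library alone. This is a minimal sketch, not part of ArkSim: `build_completion`, `ChatCompletionsHandler`, and `serve` are hypothetical names, and the stub ignores fields such as `model` that a real service would honor.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_completion(request_body: dict) -> dict:
    """Build a chat-completions-style reply to the most recent user message."""
    messages = request_body.get("messages", [])
    last_user = next(
        (m["content"] for m in reversed(messages) if m.get("role") == "user"), ""
    )
    return {
        "choices": [
            {"message": {"role": "assistant", "content": f"You asked: {last_user}"}}
        ]
    }


class ChatCompletionsHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(build_completion(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocks forever; call manually to expose the stub on localhost."""
    HTTPServer(("localhost", port), ChatCompletionsHandler).serve_forever()


reply = build_completion(
    {"model": "agent-model", "messages": [{"role": "user", "content": "user query"}]}
)
print(json.dumps(reply))
```

Calling `serve()` exposes the stub at http://localhost:8000/v1/chat/completions, which matches the endpoint used in the configuration example above.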

A2A Protocol Agent

The a2a agent type connects to agents implementing the Agent-to-Agent (A2A) protocol, enabling integration with multi-agent systems.

Configuration

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

Sources: README.md

A2A with Tool Calls

A2A agents can also surface tool calls for evaluation. The agent returns both the response content and tool call information in its A2A-formatted response.

CLI Initialization

ArkSim provides a CLI command to scaffold agent implementations:

arksim init --agent-type <type>

The --agent-type flag accepts the following values:

  • custom (default) - Generates a Python agent file
  • chat_completions - Configures HTTP endpoint connection
  • a2a - Configures A2A protocol connection

The --force flag overwrites existing files.

Sources: arksim/cli.py:94-118

Integration Architecture

The following diagram illustrates how ArkSim communicates with different agent types:

graph TD
    subgraph ArkSim["ArkSim"]
        Simulator["Simulator Engine"]
        Evaluator["Evaluator"]
    end
    
    subgraph AgentTypes["Agent Implementations"]
        CustomAgent["Custom Agent (Python)"]
        HTTPAgent["Chat Completions API"]
        A2AAgent["A2A Protocol Agent"]
    end
    
    Simulator -->|execute| CustomAgent
    Simulator -->|HTTP POST| HTTPAgent
    Simulator -->|A2A Protocol| A2AAgent
    
    CustomAgent -->|AgentResponse| Evaluator
    HTTPAgent -->|JSON Response| Evaluator
    A2AAgent -->|A2A Response| Evaluator

Agent Configuration in UI

The ArkSim web UI provides an "Agent Config" section for configuring agent connections. This section is dynamically loaded from the configuration YAML file.

<!-- Agent Config -->
<div class="t-surface rounded-xl border t-border p-5 mb-4">
    <h2 class="font-semibold t-heading mb-1">Agent Config</h2>
    <p class="text-xs t-caption mb-3">How arksim connects to your agent.</p>
</div>

Sources: arksim/ui/frontend/index.html

Framework Integrations

ArkSim includes pre-built integrations for popular agent frameworks through example projects:

| Framework | Integration File | Protocol |
|---|---|---|
| LangChain/LangGraph | custom_agent.py | Python class |
| CrewAI | custom_agent.py | Python class |
| AutoGen | custom_agent.py | Python class |
| Claude Agent SDK | custom_agent.py | Python class |
| Pydantic AI | custom_agent.py | Python class |
| LlamaIndex | custom_agent.py | Python class |
| Smolagents | custom_agent.py | Python class |
| OpenAI Agents SDK | custom_agent.py | Python class |
| Dify | custom_agent.py | HTTP API |
| Rasa | custom_agent.py | HTTP API |
| OpenClaw | config.yaml | Chat Completions |

Sources: examples/integrations/*/README.md

Running Simulations with Different Agent Types

Single Command

arksim simulate-evaluate config.yaml

Separate Steps

# Step 1: Simulate
arksim simulate config_simulate.yaml

# Step 2: Evaluate
arksim evaluate config_evaluate.yaml

Sources: examples/customer-service/README.md

Best Practices

  1. Tool Call Capture: For accurate evaluation, ensure your agent returns tool calls either explicitly via AgentResponse or implicitly through tracing instrumentation.
  2. Async Implementation: All agent implementations should use async/await patterns for proper integration with the simulation engine.
  3. Error Handling: Agents should handle errors gracefully and return meaningful error messages that can be evaluated by the system.
  4. Configuration Management: Use environment variables for sensitive configuration values like API keys and endpoints.

Sources: README.md

LLM Provider Integration

Related topics: Agent Types and Integration, Configuration System

Overview

The LLM Provider Integration system in ArkSim provides a unified abstraction layer for connecting to various Large Language Model (LLM) providers. This modular architecture enables the simulator to interact with different AI backends while maintaining a consistent interface for chat completions, streaming responses, and token usage tracking.

The system follows a provider-based pattern where each supported LLM service (OpenAI, Anthropic, Google, Azure OpenAI) implements a common interface defined in the base class. This design allows users to swap between providers without changing the core simulation logic.

Architecture

System Components

graph TD
    A[Simulation Engine] --> B[LLM Manager<br/>arksim/llms/chat/llm.py]
    B --> C[Base LLM<br/>arksim/llms/chat/base/base_llm.py]
    C --> D[OpenAI Provider<br/>providers/openai.py]
    C --> E[Anthropic Provider<br/>providers/anthropic.py]
    C --> F[Google Provider<br/>providers/google.py]
    C --> G[Azure OpenAI Provider<br/>providers/azure_openai.py]
    
    H[Configuration YAML] --> B
    I[Environment Variables] --> D
    I --> E
    I --> F
    I --> G

Provider Class Hierarchy

graph TD
    A[BaseLLM<br/>base/base_llm.py] --> B[OpenAIProvider<br/>providers/openai.py]
    A --> C[AnthropicProvider<br/>providers/anthropic.py]
    A --> D[GoogleProvider<br/>providers/google.py]
    A --> E[AzureOpenAIProvider<br/>providers/azure_openai.py]
    
    F[Provider Enum] --> A
    G[ChatMessage Model] --> A
    H[ChatCompletionResponse] --> A

Supported Providers

ArkSim supports the following LLM providers through dedicated provider implementations:

| Provider | Provider Class | API Style | Streaming Support |
|---|---|---|---|
| OpenAI | OpenAIProvider | OpenAI API | Yes |
| Anthropic | AnthropicProvider | Anthropic API | Yes |
| Google | GoogleProvider | Google AI API | Yes |
| Azure OpenAI | AzureOpenAIProvider | Azure API | Yes |

Each provider class inherits from BaseLLM and implements provider-specific API call logic while adhering to the common interface contract.

Base LLM Interface

The BaseLLM class defines the contract that all provider implementations must follow. This ensures consistent behavior across different LLM backends.

Core Methods

| Method | Purpose | Parameters |
|---|---|---|
| chat() | Send a chat completion request | messages, model, temperature, max_tokens, **kwargs |
| chat_stream() | Stream chat completion responses | messages, model, temperature, max_tokens, **kwargs |
| count_tokens() | Calculate token usage for messages | messages, model |

Data Models

The base module defines essential data structures used across all providers:

| Model | Purpose |
|---|---|
| ChatMessage | Represents a single message with role and content |
| ChatCompletionResponse | Wraps the API response from providers |
| Provider | Enum identifying supported LLM providers |
| ModelInfo | Metadata about available models per provider |
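
The contract described by the methods and models above can be sketched as an abstract base class with one offline implementation. Everything here is a stand-in under stated assumptions: the method names and parameters follow the tables, but the field names on `ChatCompletionResponse`, the default values, the synchronous signatures (the real methods may be async), and the `EchoProvider` class are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass
class ChatMessage:
    role: str
    content: str


@dataclass
class ChatCompletionResponse:
    content: str
    prompt_tokens: int       # assumed field name
    completion_tokens: int   # assumed field name


class BaseLLM(ABC):
    @abstractmethod
    def chat(self, messages, model, temperature=0.7, max_tokens=2048,
             **kwargs) -> ChatCompletionResponse: ...

    @abstractmethod
    def chat_stream(self, messages, model, temperature=0.7, max_tokens=2048,
                    **kwargs) -> Iterator[str]: ...

    @abstractmethod
    def count_tokens(self, messages, model) -> int: ...


class EchoProvider(BaseLLM):
    """Offline provider for tests: echoes the last message, one word per token."""

    def count_tokens(self, messages, model) -> int:
        return sum(len(m.content.split()) for m in messages)

    def chat(self, messages, model, temperature=0.7, max_tokens=2048, **kwargs):
        words = messages[-1].content.split()[:max_tokens]
        return ChatCompletionResponse(
            content=" ".join(words),
            prompt_tokens=self.count_tokens(messages, model),
            completion_tokens=len(words),
        )

    def chat_stream(self, messages, model, temperature=0.7, max_tokens=2048, **kwargs):
        # Stream the same reply word by word.
        yield from self.chat(messages, model, temperature, max_tokens, **kwargs).content.split()


provider = EchoProvider()
msgs = [ChatMessage(role="user", content="hello there agent")]
response = provider.chat(msgs, model="echo")
print(response.content, response.prompt_tokens)
```

Because callers depend only on the base class, an offline provider like this can replace a network-backed one in unit tests without changes to the calling code.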

Provider Implementations

OpenAI Provider

The OpenAI provider connects to OpenAI's API endpoints for chat completions. It supports both standard and streaming responses.

Configuration Requirements:

| Parameter | Source | Description |
|---|---|---|
| OPENAI_API_KEY | Environment variable | API key for authentication |
| model | Config/Parameter | Model identifier (e.g., gpt-4, gpt-4-turbo) |

API Endpoint:

POST https://api.openai.com/v1/chat/completions

Sources: arksim/llms/chat/providers/openai.py

Anthropic Provider

The Anthropic provider integrates with Anthropic's Claude models through their API. It handles the distinct message format and API conventions used by Anthropic.

Configuration Requirements:

| Parameter | Source | Description |
|---|---|---|
| ANTHROPIC_API_KEY | Environment variable | API key for authentication |
| model | Config/Parameter | Model identifier (e.g., claude-3-opus-20240229) |

API Endpoint:

POST https://api.anthropic.com/v1/messages

Sources: arksim/llms/chat/providers/anthropic.py

Google Provider

The Google provider connects to Google's Gemini models via the Google AI API.

Configuration Requirements:

| Parameter | Source | Description |
|---|---|---|
| GOOGLE_API_KEY | Environment variable | API key for authentication |
| model | Config/Parameter | Model identifier (e.g., gemini-pro) |

API Endpoint:

POST https://generativelanguage.googleapis.com/v1/models/{model}:generateContent

Sources: arksim/llms/chat/providers/google.py

Azure OpenAI Provider

The Azure OpenAI provider enables integration with Azure-hosted OpenAI models, supporting enterprise deployments with Azure-specific authentication and endpoint configuration.

Configuration Requirements:

| Parameter | Source | Description |
|---|---|---|
| AZURE_OPENAI_API_KEY | Environment variable | API key for Azure authentication |
| AZURE_OPENAI_ENDPOINT | Environment variable | Azure endpoint URL |
| AZURE_OPENAI_DEPLOYMENT | Config | Deployment name in Azure |
| AZURE_OPENAI_API_VERSION | Config | Azure API version |

Sources: arksim/llms/chat/providers/azure_openai.py

Configuration

YAML Configuration Structure

LLM providers are configured through the config.yaml file used by the simulator:

llm:
  provider: openai  # or anthropic, google, azure_openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 2048
  
agent_config:
  # Agent-specific LLM settings
  provider: anthropic
  model: claude-3-opus-20240229

Environment Variables

| Variable | Providers | Purpose |
|---|---|---|
| OPENAI_API_KEY | OpenAI, AutoGen, LangChain | OpenAI API authentication |
| ANTHROPIC_API_KEY | Claude Agent SDK | Anthropic API authentication |
| GOOGLE_API_KEY | Google | Google AI API authentication |
| AZURE_OPENAI_API_KEY | Azure OpenAI | Azure API authentication |
| AZURE_OPENAI_ENDPOINT | Azure OpenAI | Azure resource endpoint |
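
The variable requirements listed in the provider tables can be checked before a run starts. The following is a minimal sketch, assuming the provider names from the `provider:` comment in the YAML example; `REQUIRED_ENV` and `missing_env_vars` are hypothetical helper names, not part of ArkSim.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Required environment variables per provider, taken from the provider tables.
# REQUIRED_ENV and missing_env_vars are illustrative names, not ArkSim APIs.
REQUIRED_ENV: dict[str, tuple[str, ...]] = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "google": ("GOOGLE_API_KEY",),
    "azure_openai": ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"),
}


def missing_env_vars(provider: str, env: Mapping[str, str] | None = None) -> list[str]:
    """Return the required variables that are unset or empty for `provider`."""
    if provider not in REQUIRED_ENV:
        raise ValueError(f"unknown provider: {provider!r}")
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]
```

Calling `missing_env_vars("azure_openai")` with only the API key exported would report the endpoint variable as missing, which is usually a clearer failure than an authentication error in the middle of a simulation.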

Integration with Custom Agents

The LLM provider system integrates with the custom agent connector pattern used in ArkSim simulations. Custom agents can be configured to use any supported LLM provider.

graph LR
    A[Scenario JSON] --> B[Simulation Engine]
    B --> C[Custom Agent<br/>custom_agent.py]
    C --> D[LLM Provider]
    D --> E[External LLM API]

Integration Examples

ArkSim provides integration examples for various agent frameworks:

| Framework | Example Path | LLM Provider Used |
|---|---|---|
| LangChain/LangGraph | examples/integrations/langchain/ | OpenAI |
| Claude Agent SDK | examples/integrations/claude-agent-sdk/ | Anthropic |
| LlamaIndex | examples/integrations/llamaindex/ | OpenAI |
| CrewAI | examples/integrations/crewai/ | OpenAI |
| AutoGen | examples/integrations/autogen/ | OpenAI |
| Pydantic AI | examples/integrations/pydantic-ai/ | OpenAI |
| Smolagents | examples/integrations/smolagents/ | OpenAI |
| Dify | examples/integrations/dify/ | Custom HTTP |

Each integration demonstrates how to connect the framework's agent to ArkSim's simulation engine while delegating LLM calls to the appropriate provider.

Usage Flow

sequenceDiagram
    participant User
    participant Config as config.yaml
    participant LLM as LLM Manager
    participant Provider as Provider Class
    participant API as External LLM API
    
    User->>Config: Load configuration
    User->>LLM: Initialize with provider type
    LLM->>Provider: Create provider instance
    User->>LLM: chat(messages, model)
    LLM->>Provider: _chat_completion()
    Provider->>API: HTTP POST request
    API-->>Provider: Completion response
    Provider-->>LLM: Normalized response
    LLM-->>User: ChatCompletionResponse

Adding New Providers

To add support for a new LLM provider:

  1. Create a new provider class inheriting from BaseLLM
  2. Implement the required methods: chat(), chat_stream(), count_tokens()
  3. Add the provider to the Provider enum in base_llm.py
  4. Update the provider factory logic in llm.py to instantiate the new provider
  5. Add integration tests and documentation

Sources: arksim/llms/chat/base/base_llm.py, arksim/llms/chat/llm.py
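
The steps for adding a provider can be sketched as follows. This is a self-contained illustration rather than ArkSim's code: `BaseLLM` and `Provider` here are stand-ins defined locally, the method signatures are assumptions, and `EchoLLM`, `Provider.ECHO`, and `create_llm` are made-up names.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from collections.abc import Iterator
from enum import Enum


class Provider(str, Enum):
    """Stand-in for the Provider enum; ECHO is a made-up member (step 3)."""
    ECHO = "echo"


class BaseLLM(ABC):
    """Stand-in base class; the real method signatures may differ."""

    @abstractmethod
    def chat(self, messages: list[dict[str, str]]) -> str: ...

    @abstractmethod
    def chat_stream(self, messages: list[dict[str, str]]) -> Iterator[str]: ...

    @abstractmethod
    def count_tokens(self, text: str) -> int: ...


class EchoLLM(BaseLLM):
    """Toy provider (steps 1 and 2): echoes the last message, no API call."""

    def chat(self, messages):
        return messages[-1]["content"]

    def chat_stream(self, messages):
        yield from self.chat(messages).split()

    def count_tokens(self, text):
        return len(text.split())  # crude whitespace count, not a real tokenizer


# Step 4: a factory lookup keyed by the enum.
_REGISTRY: dict[Provider, type[BaseLLM]] = {Provider.ECHO: EchoLLM}


def create_llm(provider: str) -> BaseLLM:
    """Instantiate the provider class registered for `provider`."""
    return _REGISTRY[Provider(provider)]()
```

Keeping the factory as a dictionary lookup means a new provider touches only the enum and the registry, which matches the shape of steps 3 and 4.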

Error Handling

Provider implementations handle common error scenarios:

| Error Type | Handling Strategy |
|---|---|
| Authentication failures | Raise AuthenticationError with helpful message |
| Rate limiting | Implement automatic retry with backoff |
| Invalid request parameters | Raise ValidationError with parameter details |
| Network timeouts | Retry with exponential backoff |
| Model not found | Raise ModelNotFoundError listing available models |
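
Retry with exponential backoff, as listed for rate limiting and network timeouts, can be sketched in a few lines. This is an illustration of the general technique, not the providers' actual retry code; `TransientError`, `retry_with_backoff`, and the default delays are assumptions.

```python
from __future__ import annotations

import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limits, timeouts)."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on TransientError, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**attempt)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting. Non-retryable errors such as authentication failures are deliberately not caught and propagate immediately.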

Best Practices

  1. API Key Security: Store API keys in environment variables, never in configuration files committed to version control. ArkSim automatically loads keys from environment variables.
  2. Token Tracking: Use the count_tokens() method to monitor token usage and estimate costs before running large-scale simulations.
  3. Streaming for Large Responses: Enable streaming (chat_stream()) for scenarios expecting long agent responses to improve perceived responsiveness.
  4. Provider Selection: Choose Azure OpenAI for enterprise deployments requiring compliance certifications and dedicated infrastructure.
  5. Model Selection: Refer to the integration examples for recommended model configurations per provider and use case.

Sources: arksim/llms/chat/providers/openai.py

Tool Call Capture

Related topics: Evaluation System, Agent Types and Integration

Tool Call Capture is a core mechanism in ArkSim that observes and records every tool or function invocation made by an agent during a simulation. This captured data feeds directly into the evaluator, enabling metrics like tool call accuracy, error detection, and trajectory analysis.

Overview

Tool Call Capture serves as the bridge between agent execution and evaluation. Without accurate tool call capture, the evaluator cannot determine whether an agent:

  • Invoked the correct tools
  • Passed valid arguments
  • Handled errors appropriately
  • Followed the expected execution trajectory

ArkSim supports three distinct capture mechanisms:

  1. Explicit capture via AgentResponse.tool_calls
  2. Automatic capture via the ArksimTracingProcessor
  3. A2A protocol capture via task artifacts

All three methods produce the same underlying ToolCall data model, ensuring consistent evaluation regardless of how the agent is implemented.

Data Models

ToolCall

The ToolCall class represents a single tool invocation observed during a turn:

class ToolCall(BaseModel):
    model_config = ConfigDict(extra="ignore")
    
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None

| Field | Type | Description |
|---|---|---|
| id | str | Unique identifier for this tool call |
| name | str | Name of the tool/function invoked |
| arguments | dict[str, Any] | Arguments passed to the tool |
| result | str \| None | Response returned by the tool |
| error | str \| None | Error message if the call failed |
| source | ToolCallSource \| None | Origin of the capture data |

The extra="ignore" configuration ensures forward compatibility with future versions that add new fields.

ToolCallSource

The ToolCallSource enum indicates how tool call data was captured:

class ToolCallSource(str, Enum):
    AGENT_RESPONSE = "agent_response"
    TRACING_PROCESSOR = "tracing_processor"
    A2A_PROTOCOL = "a2a_protocol"
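
Because the enum mixes in `str`, its members behave like their raw string values. The short sketch below redefines the enum locally so that it runs on its own, and shows the comparison and JSON round-trip behavior this gives.

```python
import json
from enum import Enum


class ToolCallSource(str, Enum):
    AGENT_RESPONSE = "agent_response"
    TRACING_PROCESSOR = "tracing_processor"
    A2A_PROTOCOL = "a2a_protocol"


# A member compares equal to its raw string value...
assert ToolCallSource.A2A_PROTOCOL == "a2a_protocol"

# ...serializes to that string in JSON...
payload = json.dumps({"source": ToolCallSource.TRACING_PROCESSOR})

# ...and can be rebuilt from the string on load.
restored = ToolCallSource(json.loads(payload)["source"])
```

This is why a plain string such as "agent_response" is interchangeable with the enum member in saved results.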

AgentResponse

For explicit capture, agents return structured responses:

class AgentResponse(BaseModel):
    model_config = ConfigDict(extra="ignore")
    
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)

Capture Methods

Explicit Capture (AgentResponse)

The default approach requires agents to explicitly return tool calls alongside their text response. This is the standard pattern for custom agents implementing BaseAgent.

graph TD
    A[Simulator invokes agent] --> B[Agent executes user_query]
    B --> C[Agent returns AgentResponse]
    C --> D[Simulator extracts tool_calls]
    D --> E[Evaluator scores trajectory]

Implementation pattern:

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse, ToolCall

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        # Required by BaseAgent alongside execute()
        return "unique-id"

    async def execute(self, user_query: str, **kwargs) -> str | AgentResponse:
        # Agent logic here...
        tool_calls = [
            ToolCall(
                id="call_1",
                name="get_order_status",
                arguments={"order_id": "ORD-1001"},
                source="agent_response"
            )
        ]
        return AgentResponse(
            content="The order status is shipped.",
            tool_calls=tool_calls
        )

Sources: examples/customer-service/custom_agent.py

Tracing Processor (Automatic Capture)

The ArksimTracingProcessor uses the OpenAI Agents SDK's tracing interface to capture tool calls automatically, without requiring explicit return data.

graph TD
    A[Simulator sets routing context] --> B[Agent executes normally]
    B --> C[SDK fires on_span_end]
    C --> D[ArksimTracingProcessor captures]
    D --> E[Evaluator scores trajectory]

Registration pattern:

from agents import Agent as SDKAgent, add_trace_processor
from arksim.tracing.openai import ArksimTracingProcessor

# Register once at module load so the SDK forwards spans to ArkSim
_tracing_processor = ArksimTracingProcessor()
add_trace_processor(_tracing_processor)

The traced agent returns a plain str rather than AgentResponse:

async def execute(self, user_query: str, **kwargs) -> str:
    result = await Runner.run(self._sdk_agent, user_query)
    return result.final_output

Sources: examples/customer-service/traced_agent.py

A2A Protocol Capture

For agents implementing the Agent-to-Agent (A2A) protocol, tool calls are embedded in task artifacts using the A2AToolCaptureExtension.

graph TD
    A[Simulator sends task to A2A Agent] --> B[Agent processes request]
    B --> C[Agent returns Task with artifacts]
    C --> D[Simulator extracts from metadata]
    D --> E[Evaluator scores]

Tool call extraction from artifacts:

def _extract_tool_calls_from_artifact(self, artifact: Artifact) -> list[ToolCall]:
    # Artifacts without metadata carry no tool calls
    metadata = artifact.metadata or {}
    raw_calls = metadata.get("tool_calls", [])
    tool_calls = []
    for raw in raw_calls:
        name = raw.get("name")
        arguments = raw.get("arguments", {})
        # Skip malformed entries rather than failing the whole turn
        if not name or not isinstance(arguments, dict):
            continue
        tool_calls.append(
            ToolCall(
                id=raw.get("id", ""),
                name=name,
                arguments=arguments,
                result=A2AAgent._coerce_to_string(raw.get("result")),
                error=A2AAgent._coerce_to_string(raw.get("error")),
                source=ToolCallSource.A2A_PROTOCOL,
            )
        )
    return tool_calls

Sources: arksim/simulation_engine/agent/clients/a2a.py
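
To make the expected metadata shape concrete, here is a dictionary-based sketch of the same extraction that runs without the A2A types. The payload layout follows the keys read by the function above; the behavior of `coerce_to_string` is an assumption (strings and None pass through, anything else is JSON-encoded), and `extract_tool_calls` is a hypothetical name.

```python
from __future__ import annotations

import json
from typing import Any


def coerce_to_string(value: Any) -> str | None:
    """Assumed behaviour: keep None and str, JSON-encode everything else."""
    if value is None or isinstance(value, str):
        return value
    return json.dumps(value)


def extract_tool_calls(metadata: dict[str, Any] | None) -> list[dict[str, Any]]:
    """Dict-based mirror of the extraction loop, without the A2A types."""
    calls = []
    for raw in (metadata or {}).get("tool_calls", []):
        arguments = raw.get("arguments", {})
        if not raw.get("name") or not isinstance(arguments, dict):
            continue  # skip malformed entries instead of failing the turn
        calls.append(
            {
                "id": raw.get("id", ""),
                "name": raw["name"],
                "arguments": arguments,
                "result": coerce_to_string(raw.get("result")),
                "error": coerce_to_string(raw.get("error")),
                "source": "a2a_protocol",
            }
        )
    return calls


metadata = {
    "tool_calls": [
        {"id": "call_1", "name": "get_order_status",
         "arguments": {"order_id": "ORD-1001"}, "result": {"status": "shipped"}},
        {"id": "call_2", "name": "broken", "arguments": "not-a-dict"},
    ]
}
calls = extract_tool_calls(metadata)
```

Only the first entry survives: the second is dropped because its arguments are not a dictionary, and the structured result of the first is stored as a JSON string.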

A2A agent card declaration:

from arksim.simulation_engine.tool_types import A2AToolCaptureExtension

_capabilities = AgentCapabilities(
    streaming=False,
    extensions=[A2AToolCaptureExtension],
)

Workflow Diagram

The following diagram shows the complete simulation pipeline with tool call capture:

flowchart LR
    subgraph Simulation
        A[User Query] --> B[Simulator]
        B --> C{Agent Type}
        C -->|Custom| D[explicit tool_calls]
        C -->|Traced| E[TracingProcessor]
        C -->|A2A| F[Artifact metadata]
        D --> G[Captured Tool Calls]
        E --> G
        F --> G
    end
    
    subgraph Evaluation
        G --> H[Evaluator]
        H --> I[Tool Call Metrics]
        H --> J[Error Detection]
        H --> K[Trajectory Analysis]
    end
    
    G --> L[Results/Report]
    I --> L
    J --> L
    K --> L

Comparison of Capture Methods

| Aspect | Explicit (AgentResponse) | Traced (TracingProcessor) | A2A Protocol |
|---|---|---|---|
| Return type | AgentResponse | str | Via protocol |
| Implementation complexity | Medium | Low | High |
| Agent code changes | Required | Minimal | Protocol required |
| Best for | Custom Python agents | SDK-based agents | A2A-native agents |
| Tool call source field | agent_response | tracing_processor | a2a_protocol |

Sources: examples/customer-service/README.md

Configuration

Trace Receiver Settings

For traced agents, enable the trace receiver in the simulation config:

simulation:
  max_turns: 10
  
trace_receiver:
  enabled: true
  wait_timeout: 5  # seconds to wait for traces

When Tracing is Disabled

When trace_receiver.enabled is false or omitted, ArkSim captures tool calls only from AgentResponse (the standard path).

Integration with Evaluation

Captured tool calls flow into the evaluator's scoring pipeline:

  1. Trajectory matching - Compare actual tool sequence against expected
  2. Argument validation - Verify tool arguments match scenario requirements
  3. Error detection - Identify tool call failures and their handling
  4. Coverage analysis - Determine if all required tools were invoked

The evaluator uses the source field to differentiate between capture methods when analyzing behavioral patterns.
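
Trajectory matching and coverage analysis reduce to simple sequence and set comparisons over tool names. The sketch below illustrates the idea only; the evaluator's actual scoring logic is not shown in this section, and the function names and the tool names other than get_order_status are hypothetical.

```python
from __future__ import annotations


def trajectory_matches(actual: list[str], expected: list[str]) -> bool:
    """Strict match: same tools, same order."""
    return actual == expected


def is_subsequence(expected: list[str], actual: list[str]) -> bool:
    """Lenient match: expected tools appear in order, extra calls allowed."""
    remaining = iter(actual)
    return all(tool in remaining for tool in expected)


def coverage(actual: list[str], required: set[str]) -> float:
    """Fraction of required tools that were invoked at least once."""
    if not required:
        return 1.0
    return len(required & set(actual)) / len(required)


actual = ["lookup_customer", "get_order_status", "send_email"]
expected = ["lookup_customer", "get_order_status"]
```

With these inputs the strict match fails because of the extra send_email call, while the subsequence check passes. Which of the two is appropriate depends on whether additional tool calls should count against the agent.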

Forward Compatibility

Both ToolCall and AgentResponse declare extra="ignore" in their Pydantic configuration so that snapshots written by future arksim versions, which may add new fields, can still be loaded by older versions without raising a ValidationError.

Simulation results captured with a newer version therefore remain loadable by an older version of the evaluator.
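
The effect of ignoring unknown fields can be imitated with the standard library alone. In the sketch below, `ToolCallRecord` is a dataclass stand-in for ToolCall, not the real Pydantic model, and `latency_ms` is an invented field representing something a future version might add.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ToolCallRecord:
    """Stdlib stand-in for ToolCall; not the real Pydantic model."""
    id: str
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_snapshot(cls, data: dict[str, Any]) -> ToolCallRecord:
        # Keep only keys this version knows about, as extra="ignore" would.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


# "latency_ms" is a made-up field standing in for a future addition.
snapshot = {"id": "call_1", "name": "get_order_status", "latency_ms": 42}
record = ToolCallRecord.from_snapshot(snapshot)
```

The unknown key is dropped silently instead of raising, which is the trade-off this setting makes: older readers keep working, but they also will not warn about data they cannot see.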

Sources: examples/customer-service/custom_agent.py

Scenario Management

Related topics: Simulation Engine, Configuration System

Overview

Scenario Management is the core system in ArkSim for defining, loading, validating, and executing test scenarios that simulate user interactions with an AI agent. A scenario represents a structured test case that defines a user's goal, knowledge base, behavioral characteristics, and expected outcomes.

Scenarios serve as the foundation for the simulation and evaluation pipeline, enabling reproducible testing of agent behavior across diverse conversation patterns and user profiles.

Scenario Data Model

Core Scenario Structure

Each scenario in ArkSim is a JSON object containing the following primary fields:

| Field | Type | Required | Description |
|---|---|---|---|
| scenario_id | string | Yes | Unique identifier for the scenario |
| name | string | Yes | Human-readable scenario name |
| description | string | No | Detailed description of the scenario's purpose |
| user_profile | object | Yes | User persona characteristics |
| goal | object | Yes | The user's primary objective |
| knowledge | array | Yes | Contextual knowledge the user has access to |
| expected_behavior | object | No | Expected agent responses or behaviors |
| metrics | array | No | Custom evaluation criteria |

User Profile Schema

{
  "name": "string",
  "age": "number",
  "personality": "string",
  "background": "string",
  "communication_style": "string"
}

Sources: arksim/scenario/entities.py

Goal Structure

{
  "primary": "string (main user objective)",
  "secondary": ["array of secondary objectives"],
  "constraints": ["array of constraints or boundaries"],
  "success_criteria": "string"
}

Scenario Management Workflow

graph TD
    A[Create/Load Scenarios] --> B[Validate Scenario Schema]
    B --> C{Valid?}
    C -->|Yes| D[Save to scenarios.json]
    C -->|No| E[Show Validation Errors]
    E --> A
    D --> F[Configure Simulation Parameters]
    F --> G[Run Simulation]
    G --> H[Generate Results]
    H --> I[Evaluation & Reporting]

Scenario Loading and Validation

The UI provides interactive scenario management through the "Build" page:

  1. Load Existing - Users can load pre-existing scenario files via file path input
  2. Auto-generate (PRO) - Automatic scenario generation from agent knowledge base
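  3. Manual Creation - Build scenarios through the UI interface

The file path input validates on every change and on blur, and loads the file when Enter is pressed:

// Scenario file validation in UI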
@input="validateScenarioFile()"
@blur="validateScenarioFile()"
@keydown.enter="loadScenarioFile()"

Sources: arksim/ui/frontend/index.html

Validation Rules

  • Scenario files must be valid JSON
  • Required fields must be present
  • scenario_id must be unique within the file
  • user_profile must contain at minimum a name field
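
The rules above can be expressed as a small standalone checker. This is a sketch, not ArkSim's validator: it assumes the top-level "scenarios" array used in the scenario file format and the required fields from the Core Scenario Structure table, and `validate_scenarios` is a hypothetical name.

```python
from __future__ import annotations

import json

REQUIRED_FIELDS = ("scenario_id", "name", "user_profile", "goal", "knowledge")


def validate_scenarios(text: str) -> list[str]:
    """Return a list of human-readable problems; empty means the file passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(data, dict) or not isinstance(data.get("scenarios"), list):
        return ['expected a top-level object with a "scenarios" array']

    errors: list[str] = []
    seen: set[str] = set()
    for index, scenario in enumerate(data["scenarios"]):
        label = f"scenarios[{index}]"
        if not isinstance(scenario, dict):
            errors.append(f"{label}: expected an object")
            continue
        for name in REQUIRED_FIELDS:
            if name not in scenario:
                errors.append(f"{label}: missing required field {name!r}")
        scenario_id = scenario.get("scenario_id")
        if isinstance(scenario_id, str):
            if scenario_id in seen:
                errors.append(f"{label}: duplicate scenario_id {scenario_id!r}")
            seen.add(scenario_id)
        profile = scenario.get("user_profile")
        if isinstance(profile, dict) and "name" not in profile:
            errors.append(f"{label}: user_profile needs a name")
    return errors
```

Collecting every problem in one pass, rather than stopping at the first, lets a scenario author fix a file in a single edit.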

Scenario File Format

ArkSim uses JSON format for scenario definitions. See the example structure:

{
  "scenarios": [
    {
      "scenario_id": "ecommerce-return-item-001",
      "name": "Return Defective Product",
      "description": "Customer wants to return a damaged item received last week",
      "user_profile": {
        "name": "John Smith",
        "age": 35,
        "personality": "patient but firm",
        "background": "Regular online shopper",
        "communication_style": "polite and direct"
      },
      "goal": {
        "primary": "Get a full refund for a damaged product",
        "secondary": ["Understand return process", "Know timeline for refund"],
        "constraints": ["Only willing to wait up to 14 days for refund"],
        "success_criteria": "Full refund issued or replacement offered"
      },
      "knowledge": [
        "Ordered product SKU-12345 on March 1, 2024",
        "Item arrived damaged with visible scratches",
        "Has original packaging and receipt",
        "Order number: ORD-987654"
      ],
      "expected_behavior": {
        "should_mention_order_number": true,
        "should_request_photo_evidence": false
      }
    }
  ]
}

Sources: examples/e-commerce/scenarios.json, examples/bank-insurance/scenarios.json, examples/customer-service/scenarios.json

Simulation Configuration

Scenarios are executed through the simulation engine with configurable parameters:

| Parameter | Default | Description |
|---|---|---|
| num_conversations_per_scenario | 5 | Number of conversations to simulate per scenario |
| max_turns | 5 | Maximum turns per conversation |
| num_workers | 50 | Parallel workers (or 'auto') |
| model | gpt-4o | LLM model for simulation |
| provider | openai | LLM provider |

model: str = Field(default=DEFAULT_MODEL, description="LLM model for simulation")
num_conversations_per_scenario: int = Field(
    default=5, 
    description="Number of conversations per scenario to simulate"
)
max_turns: int = Field(default=5, description="Maximum turns per conversation")

Sources: arksim/simulation_engine/entities.py:18-25
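
These two parameters multiply, which matters for estimating run size and cost. The following sketch uses a plain dataclass with the defaults from the table; `SimulationSettings` and its methods are illustrative names, and the calculation assumes every scenario is run the full number of times.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationSettings:
    """Defaults copied from the parameter table; not the real Pydantic model."""
    num_conversations_per_scenario: int = 5
    max_turns: int = 5

    def total_conversations(self, num_scenarios: int) -> int:
        return num_scenarios * self.num_conversations_per_scenario

    def max_total_turns(self, num_scenarios: int) -> int:
        # Upper bound: conversations may end before reaching max_turns.
        return self.total_conversations(num_scenarios) * self.max_turns


settings = SimulationSettings()
```

For a file with 10 scenarios, the defaults give 50 conversations and at most 250 turns.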

Configuration via config.yaml

simulation:
  scenarios_file: ./scenarios.json
  num_conversations_per_scenario: 5
  max_turns: 5
  num_workers: auto
  model: gpt-4o

Built-in Scenario Templates

ArkSim provides template scenarios for common use cases:

{
  "scenarios": [
    {
      "scenario_id": "template-basic-query",
      "name": "Basic Information Query",
      "description": "User asks a simple informational question",
      "user_profile": {...},
      "goal": {...},
      "knowledge": [...]
    }
  ]
}

Sources: arksim/templates/scenarios.json

Integration with CI/CD

Scenarios can be integrated into automated testing workflows:

# GitHub Actions workflow
steps:
  - name: Run Scenario Tests
    run: arksim simulate-evaluate config.yaml

Sources: examples/ci/README.md

Required CI Setup

  1. Update TODO sections in arksim.yml (startup command and health-check URL)
  2. Create tests/arksim/config.yaml pointing to your server endpoint
  3. Create tests/arksim/scenarios.json with your test cases
  4. Add custom metrics to tests/arksim/custom_metrics/ if needed
  5. Configure OPENAI_API_KEY in GitHub secrets

Best Practices

Scenario Design

  • Distinct Goals: Each scenario should test a single, well-defined user goal
  • Realistic Profiles: User profiles should reflect actual customer demographics
  • Sufficient Knowledge: Include enough context for the simulated user to maintain coherent conversation

File Organization

project/
├── config.yaml
├── scenarios.json
├── custom_metrics/
│   └── my_metric.py
└── results/
    └── simulation/

Version Control

  • Commit scenarios.json with your agent code
  • Use descriptive scenario_id values with prefixes (e.g., ecommerce-, support-)
  • Document success criteria clearly in the goal.success_criteria field

Command Line Usage

Initialize with default scenarios

arksim init

Run simulation with scenarios

arksim simulate-evaluate config.yaml

Results are written to ./results/simulation/simulation.json.

Sources: README.md, examples/integrations/dify/README.md

Sources: arksim/scenario/entities.py

Configuration System

Related topics: Agent Types and Integration, LLM Provider Integration, Evaluation System

ArkSim provides a flexible, multi-layered configuration system that supports both YAML-based declarative configuration and programmatic Python class configuration. The system enables users to define simulation scenarios, evaluation metrics, agent connections, and runtime behavior through structured configuration files or direct Python implementations.

Architecture Overview

The configuration system is organized into several key modules that handle different aspects of simulation and evaluation:

graph TD
    A[User Configuration] --> B[YAML Files]
    A --> C[Python Classes]
    B --> D[config.yaml]
    B --> E[config_simulate.yaml]
    B --> F[config_evaluate.yaml]
    C --> G[BaseAgent Subclass]
    C --> H[Custom Metrics]
    D --> I[Config Loader]
    E --> I
    F --> I
    G --> J[Simulation Engine]
    H --> K[Evaluator]
    I --> J
    J --> K
    J --> L[Results/Reports]
    K --> L

The system separates configuration by execution mode: simulation only, evaluation only, and combined. This separation allows users to run simulations independently of evaluation, which is useful for debugging and iterative development workflows.

Configuration Files

ArkSim uses YAML configuration files to define simulation parameters, evaluation settings, and agent connections. The repository includes three primary configuration templates at the root level.

Main Configuration (config.yaml)

The main configuration file combines both simulation and evaluation settings into a single file. This is the recommended approach for simple use cases where both phases run together using the simulate-evaluate command. The file structure typically includes agent configuration, scenario definitions, metric selections, and evaluation parameters.

Separate Phase Configuration

For more complex workflows, ArkSim supports splitting configuration into separate files:

| File | Purpose | CLI Command |
|---|---|---|
| config_simulate.yaml | Simulation-only parameters | arksim simulate config_simulate.yaml |
| config_evaluate.yaml | Evaluation-only parameters | arksim evaluate config_evaluate.yaml |
| config.yaml | Combined configuration | arksim simulate-evaluate config.yaml |

Sources: examples/customer-service/README.md

This separation enables scenarios where simulation results are cached and evaluated multiple times with different metric configurations, or where the simulation phase runs in a different environment than evaluation.

Agent Configuration

The configuration system supports multiple agent connection types, defined through the agent_type field in the agent configuration section.

Supported Agent Types

| Agent Type | Description | Configuration Style |
|---|---|---|
| custom | Custom Python agent class inheriting from BaseAgent | Python class file |
| chat_completions | HTTP endpoint implementing Chat Completions API | YAML endpoint config |
| a2a | Agent-to-Agent protocol endpoint | YAML endpoint config |

Sources: arksim/cli.py

Custom Agent (Python Class)

The default agent type uses a Python class that extends BaseAgent. This approach provides maximum flexibility for integrating any agent framework. The agent class must implement two required methods:

from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"

Sources: README.md

The execute method can return either a plain string response or an AgentResponse object that includes both text content and tool calls for evaluation.
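
A caller that accepts both return types typically normalizes them to one shape before further processing. The sketch below shows that pattern with a dataclass stand-in; `AgentResponseSketch` and `normalize` are hypothetical names, and ArkSim's own handling may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentResponseSketch:
    """Stdlib stand-in for AgentResponse (content plus tool calls)."""
    content: str
    tool_calls: list[dict[str, Any]] = field(default_factory=list)


def normalize(result: str | AgentResponseSketch) -> AgentResponseSketch:
    """Wrap a plain string so downstream code sees one shape."""
    if isinstance(result, str):
        return AgentResponseSketch(content=result)
    return result


plain = normalize("agent response")
rich = normalize(
    AgentResponseSketch(
        content="The order status is shipped.",
        tool_calls=[{"id": "call_1", "name": "get_order_status"}],
    )
)
```

A plain string simply becomes a response with an empty tool call list, so agents that do not use tools need no extra code.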

Chat Completions Agent

For agents exposing an HTTP endpoint compatible with the OpenAI Chat Completions format, configuration is declarative:

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

Sources: README.md
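
An endpoint configured this way is expected to accept a Chat Completions style JSON body. The sketch below builds such a request with the standard library without sending it. The exact payload and headers ArkSim sends are not shown in this section, so the `model` default (reusing the agent_name placeholder) and the `build_chat_request` helper are assumptions.

```python
from __future__ import annotations

import json
import urllib.request


def build_chat_request(
    endpoint: str, messages: list[dict[str, str]], model: str = "my-agent"
) -> urllib.request.Request:
    """Build (but do not send) a Chat Completions style POST request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1/chat/completions",
    [{"role": "user", "content": "Where is my order ORD-1001?"}],
)
# To send it against a running server: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect what a local agent server will receive before wiring it into a simulation.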

A2A Protocol Agent

Agents implementing the Agent-to-Agent protocol are configured similarly:

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

A2A agents can also surface tool calls for evaluation through the protocol, enabling comprehensive testing of tool-using agents.

Scenario Configuration

Scenarios define the test cases that the simulator executes against the agent. The scenarios.json file contains an array of scenario objects, each representing a simulated conversation or interaction.

Scenario Structure

Each scenario typically includes:

  • Scenario ID: Unique identifier for the scenario
  • User query: The initial prompt or question presented to the agent
  • Expected behavior: Criteria for successful completion
  • Knowledge/context: Information the agent should know or have access to during the scenario

Scenario Files Location

Scenarios are referenced from the main configuration file and can be shared across different configuration files:

| Example Directory | Scenario Purpose |
|---|---|
| examples/bank-insurance/ | Financial services agent testing |
| examples/customer-service/ | Customer support scenarios |
| examples/integrations/*/ | Framework-specific testing |
| examples/ci/ | CI/CD integration testing |

Sources: examples/ci/README.md

Evaluation Metrics Configuration

The evaluator component scores agent responses against defined metrics. Configuration specifies which metrics to run and how they are calculated.

Built-in Metrics

ArkSim includes standard evaluation metrics covering common agent quality dimensions. These metrics are automatically available without additional configuration.

Custom Metrics

Users can define custom evaluation metrics by creating a Python module that implements the metric interfaces. The custom metrics file must define metrics following the provided schema:

from pydantic import BaseModel
from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
    format_chat_history,
)

Sources: examples/customer-service/custom_metrics.py

Custom metrics are referenced in the configuration file using the custom_metrics_file_paths parameter, and optionally listed in metrics_to_run to include them in the evaluation:

custom_metrics_file_paths:
  - tests/arksim/custom_metrics/my_metric.py
metrics_to_run:
  - custom_metric_name

Sources: examples/ci/README.md

Custom Metric Schema

Custom quantitative metrics should return a Pydantic model defining the score components:

class VerificationComplianceSchema(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str

Sources: examples/customer-service/custom_metrics.py

The system prompt for qualitative metrics defines the evaluation criteria that the LLM judge applies when scoring agent responses.

CLI Configuration Interface

The ArkSim CLI provides commands for initializing configurations, running simulations, and managing examples.

Initialization Command

The init command scaffolds starter files for agent testing:

arksim init --agent-type custom

| Flag | Options | Default | Description |
|---|---|---|---|
| --agent-type | custom, chat_completions, a2a | custom | Agent connection type |
| --force | boolean | false | Overwrite existing files |

Sources: arksim/cli.py

The --agent-type flag determines which template is generated:

  • custom generates a Python agent file (no server needed)
  • chat_completions generates YAML configuration for HTTP endpoints
  • a2a generates YAML configuration for Agent-to-Agent protocol

Simulation Commands

| Command | Description |
|---|---|
| arksim simulate-evaluate config.yaml | Run simulation and evaluation in sequence |
| arksim simulate config_simulate.yaml | Run simulation only |
| arksim evaluate config_evaluate.yaml | Evaluate previously saved simulation results |

Sources: examples/customer-service/README.md

Examples Command

The CLI also provides access to example projects:

arksim examples                    # Download all examples
arksim examples bank-insurance     # Download specific example
arksim examples --list            # List available examples

Integration Configuration Patterns

Each integration example follows a consistent pattern with three standard files.

File Structure

| File | Purpose |
|---|---|
| custom_agent.py | Agent implementation connecting to the target framework |
| config.yaml | Simulation and evaluation settings |
| scenarios.json | Test scenarios for the example domain |

Supported Framework Integrations

ArkSim provides example configurations for the following agent frameworks:

| Framework | Installation | Example Directory |
|---|---|---|
| LangGraph | pip install langgraph langchain-openai | examples/integrations/langgraph/ |
| LangChain | pip install langgraph langchain-openai | examples/integrations/langchain/ |
| Claude Agent SDK | pip install claude-agent-sdk | examples/integrations/claude-agent-sdk/ |
| AutoGen | pip install autogen-agentchat autogen-ext[openai] | examples/integrations/autogen/ |
| CrewAI | pip install crewai | examples/integrations/crewai/ |
| Pydantic AI | pip install pydantic-ai | examples/integrations/pydantic-ai/ |
| Smolagents | pip install smolagents | examples/integrations/smolagents/ |
| Dify | HTTP integration | examples/integrations/dify/ |
| OpenClaw | Gateway token auth | examples/openclaw/ |

Sources: examples/integrations/*/README.md

Trace Receiver Configuration

For frameworks that support tracing, ArkSim includes a trace receiver component that captures tool calls automatically without requiring explicit AgentResponse returns.

Configuration Options

trace_receiver:
  enabled: true
  wait_timeout: 5

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable trace-based tool call capture |
| wait_timeout | integer | 5 | Seconds to wait for trace data |

Sources: examples/customer-service/README.md

When enabled is false or omitted, ArkSim only captures tool calls from AgentResponse objects returned by the agent's execute method.

Workflow Summary

The following diagram illustrates the configuration-driven workflow from specification to results:

graph LR
    A[config.yaml] --> B[Scenario Loading]
    C[scenarios.json] --> B
    B --> D[Simulation Engine]
    D --> E{Agent Type?}
    E -->|custom| F[Python Agent]
    E -->|chat_completions| G[HTTP Endpoint]
    E -->|a2a| H[A2A Agent]
    F --> I[Execute Scenarios]
    G --> I
    H --> I
    I --> J[Simulation Results]
    J --> K[Evaluator]
    J --> L[Conversation Viewer]
    K --> M[Evaluation Report]
    M --> N[scores.json]
    M --> O[failures.json]

Best Practices

Environment Variables: API keys should be set as environment variables rather than hardcoded in configuration files:

export OPENAI_API_KEY="<your-key>"
export ANTHROPIC_API_KEY="<your-key>"

Separation of Concerns: Use separate config_simulate.yaml and config_evaluate.yaml files when iterating on scenarios or metrics independently.

Custom Metrics Organization: Place custom metrics in dedicated directories (e.g., tests/arksim/custom_metrics/) and reference them from the main configuration.

Scenario Versioning: Keep scenarios in version-controlled JSON files that can be reviewed and updated as agent requirements evolve.

Sources: CONTRIBUTING.md

Sources: examples/customer-service/README.md

Doramagic Pitfall Log

Doramagic extracted 15 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: v0.1.0

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v0.1.0. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.1.0

2. Configuration risk: v0.0.6

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: v0.0.6. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.0.6

3. Configuration risk: v0.3.2

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: v0.3.2. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.2

4. Configuration risk: v0.3.4

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: v0.3.4. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.4

5. Configuration risk: v0.3.5

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: v0.3.5. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.5

6. Capability assumption: v0.3.1

  • Severity: medium
  • Finding: Capability assumption is backed by a source signal: v0.3.1. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.1

7. Capability assumption: README/documentation is current enough for a first validation pass.

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: capability.assumptions | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | README/documentation is current enough for a first validation pass.

8. Maintenance risk: v0.3.3

  • Severity: medium
  • Finding: Maintenance risk is backed by a source signal: v0.3.3. Treat it as a review item until the current version is checked.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.3

9. Maintenance risk: Maintainer activity is unknown

  • Severity: medium
  • Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | last_activity_observed missing

10. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: downstream_validation.risk_items | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | no_demo; severity=medium

11. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: risks.scoring_risks | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | no_demo; severity=medium

12. Security or permission risk: v0.2.0

  • Severity: medium
  • Finding: Security or permission risk is backed by a source signal: v0.2.0. Treat it as a review item until the current version is checked.
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.2.0

Source: Doramagic discovery, validation, and Project Pack records
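
Most of the recommended checks above start with knowing which version is installed. The following sketch compares the installed version against the release tags cited in the log. It assumes the package is installed under the distribution name `arksim`, which this page does not confirm, and the helper names are hypothetical. A match only means the installed version is one of the tagged releases; the linked release notes still need to be read to judge whether the risk applies.

```python
from importlib.metadata import PackageNotFoundError, version

# Release tags cited as evidence in the pitfall log above.
FLAGGED_TAGS = (
    "v0.0.6", "v0.1.0", "v0.2.0", "v0.3.1",
    "v0.3.2", "v0.3.3", "v0.3.4", "v0.3.5",
)


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_flagged(installed, flagged_tags=FLAGGED_TAGS):
    """True if `installed` (e.g. '0.3.4') matches a flagged tag (e.g. 'v0.3.4')."""
    if installed is None:
        return False
    return installed in {tag.lstrip("v") for tag in flagged_tags}
```

For example, `is_flagged(installed_version("arksim"))` returns `False` when the package is not installed or when the installed version is not among the cited tags.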

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using arksim with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence