Doramagic Project Pack · Human Manual
arksim
Introduction to ArkSim
Related topics: Quickstart Guide, System Architecture Overview
ArkSim is a multi-turn agent evaluation and simulation platform designed to test, measure, and improve AI agent quality across conversational scenarios. It enables developers to define test scenarios, simulate user interactions with their agents, and generate comprehensive evaluation reports with quantitative and qualitative metrics.
Overview
ArkSim addresses the challenge of systematically evaluating conversational AI agents by providing:
- Scenario-based simulation: Define test cases with user profiles, knowledge bases, and expected behaviors
- Automated evaluation: Score agent responses across multiple dimensions including goal completion, helpfulness, and coherence
- Framework integration: Connect with various agent frameworks including LangGraph, LangChain, AutoGen, Claude Agent SDK, Rasa, Dify, and OpenClaw
- Custom metrics: Extend evaluation capabilities with user-defined quantitative and qualitative metrics
- Visual reporting: Generate HTML reports with conversation transcripts, failure analysis, and per-metric scores
Sources: README.md
Architecture Overview
ArkSim follows a pipeline architecture consisting of three primary stages:
graph TD
A[Configuration<br/>config.yaml] --> B[Simulation Engine]
C[Scenarios<br/>scenarios.json] --> B
D[Agent<br/>custom_agent.py] --> B
B --> E[Conversation Data]
E --> F[Evaluator]
F --> G[HTML Report<br/>final_report.html]
H[Custom Metrics<br/>custom_metrics.py] --> F
Core Components
| Component | Description |
|---|---|
| Simulation Engine | Orchestrates multi-turn conversations between simulated users and the agent |
| Evaluator | Scores agent performance using built-in and custom metrics |
| CLI Interface | Command-line interface for running simulations (arksim simulate-evaluate) |
| Report Generator | Produces HTML reports with visualizations and conversation transcripts |
Sources: examples/customer-service/README.md:1-25
Agent Integration Methods
ArkSim supports multiple agent integration patterns to accommodate different agent architectures.
Python Class Integration (BaseAgent)
The default integration method uses a Python class inheriting from BaseAgent. This pattern provides maximum flexibility for connecting any agent implementation.
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"
The agent can return either:
- A plain string response
- An AgentResponse object containing both content and tool calls
Sources: README.md:40-55
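Because execute() may return either type, code that consumes the result has to handle both. The following is a minimal sketch of that normalization. It uses stand-in dataclasses limited to the fields shown in this manual rather than the real arksim models, and the normalize_response helper is hypothetical, not part of arksim.

```python
from dataclasses import dataclass, field
from typing import Any, Union


# Stand-ins for the arksim models (assumption: only the fields shown in this manual).
@dataclass
class ToolCall:
    id: str
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class AgentResponse:
    content: str
    tool_calls: list[ToolCall] = field(default_factory=list)


def normalize_response(result: Union[str, AgentResponse]) -> AgentResponse:
    """Hypothetical helper: wrap a plain string so callers always see an AgentResponse."""
    if isinstance(result, str):
        return AgentResponse(content=result)
    return result


plain = normalize_response("agent response")
structured = normalize_response(
    AgentResponse(content="done", tool_calls=[ToolCall(id="call_1", name="lookup")])
)
print(plain.tool_calls)               # an empty list: no tool calls were reported
print(structured.tool_calls[0].name)  # the tool name carried by the structured response
```

A plain string therefore behaves like a structured response with an empty tool-call list.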
AgentResponse Data Model
The AgentResponse model encapsulates structured agent outputs:
class AgentResponse(BaseModel):
    """Structured return from agent execution, carrying both text and tool calls."""
    model_config = ConfigDict(extra="ignore")

    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)
Sources: arksim/simulation_engine/tool_types.py:32-42
ToolCall Model
Tool calls are captured using the ToolCall model:
class ToolCall(BaseModel):
    """A single tool/function call observed during a turn."""
    model_config = ConfigDict(extra="ignore")

    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None
Setting extra="ignore" makes the model drop unknown fields instead of rejecting them, so payloads from a newer version that adds fields still validate.
Sources: arksim/simulation_engine/tool_types.py:18-30
Chat Completions Endpoint Integration
For agents exposing a standard OpenAI-compatible API:
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
Sources: README.md:57-62
A2A Protocol Integration
For agents implementing the Agent-to-Agent (A2A) protocol:
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
Sources: README.md:64-70
Automatic Tool Call Capture (Tracing)
ArkSim supports automatic tool call capture through a tracing processor, eliminating the need for agents to explicitly return tool calls:
graph LR
A[Simulator sets<br/>routing context] --> B["agent.execute()"]
B --> C[SDK fires<br/>TracingProcessor.on_span_end]
C --> D[arksim captures<br/>tool calls]
D --> E[Evaluator scores]
pip install -r requirements-traced.txt
arksim simulate-evaluate config_traced.yaml
Sources: examples/customer-service/README.md:35-55
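One way to picture the "routing context" step in the diagram is a context variable that tells the tracing callback which conversation a tool call belongs to. The sketch below illustrates that idea with Python's contextvars. It is an assumption about the general technique, not a description of arksim's internals, and current_chat_id, on_span_end, and run_turn are all hypothetical stand-ins.

```python
import asyncio
import contextvars
from collections import defaultdict

# Hypothetical routing context: the conversation the current call belongs to.
current_chat_id = contextvars.ContextVar("current_chat_id")

# Captured tool-call names, grouped by conversation.
captured = defaultdict(list)


def on_span_end(tool_name: str) -> None:
    """Stand-in for a tracing callback: file the call under the active chat."""
    captured[current_chat_id.get()].append(tool_name)


async def execute(user_query: str) -> str:
    """Stand-in agent: it uses a tool but returns only text."""
    await asyncio.sleep(0)
    on_span_end("search_orders")  # a real SDK's tracer would fire this itself
    return f"Looked up: {user_query}"


async def run_turn(chat_id: str, user_query: str) -> str:
    current_chat_id.set(chat_id)  # the simulator sets the routing context
    return await execute(user_query)


async def main() -> None:
    # gather() runs each coroutine as its own task with its own context copy,
    # so concurrent conversations do not overwrite each other's chat id.
    await asyncio.gather(run_turn("chat-a", "order 1"), run_turn("chat-b", "order 2"))


asyncio.run(main())
print(dict(captured))
```

The agent never returns tool calls explicitly, yet each call ends up attributed to the right conversation.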
Configuration System
ArkSim uses YAML configuration files to define simulation and evaluation parameters.
Configuration File Structure
# Simulation settings
simulation:
  max_turns: 10
  timeout: 300
# Agent configuration
agent_config:
  agent_type: chat_completions # or 'a2a', 'custom'
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
# Trace receiver (optional)
trace_receiver:
  enabled: true
  wait_timeout: 5
# Custom metrics
custom_metrics_file_paths:
  - ./custom_metrics.py
metrics_to_run:
  - goal_completion
  - verification_compliance
Sources: examples/ci/README.md:1-25
Scenario Definition
Scenarios are defined in scenarios.json with the following structure:
| Field | Description |
|---|---|
| scenario_id | Unique identifier for the scenario |
| user_profile | Description of the simulated user |
| knowledge | Context and information available to the user |
| goals | Objectives the user aims to achieve |
| expected_behavior | Expected agent behavior patterns |
Sources: examples/integrations/dify/README.md:1-20
Evaluation Metrics
ArkSim provides built-in metrics and supports custom metric definitions.
Built-in Metrics
| Metric | Description | Weight |
|---|---|---|
| Goal Completion | Whether the agent achieved the user's objectives | 60% |
| Turn Success Ratio | Ratio of successful turns to total turns | 40% |
| Final Score | Weighted average of other metrics | - |
Sources: arksim/utils/html_report/report_template.html:80-95
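The weights in the Built-in Metrics table imply a simple weighted average. A minimal sketch of that arithmetic follows. The function name final_score is hypothetical, and both inputs are assumed to lie on a 0.0-1.0 scale.

```python
def final_score(goal_completion: float, turn_success_ratio: float) -> float:
    """Weighted average using the 60% / 40% weights from the table."""
    return 0.6 * goal_completion + 0.4 * turn_success_ratio


# An agent that fully met the goal but had 3 successful turns out of 4.
score = final_score(goal_completion=1.0, turn_success_ratio=3 / 4)
print(round(score, 2))  # 0.9
```

Because goal completion carries the larger weight, a missed goal costs more than a few unsuccessful turns.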
Custom Metrics
Custom metrics can be implemented as either quantitative (scored numerically) or qualitative (evaluated via LLM).
#### Quantitative Metric Pattern
from pydantic import BaseModel

from arksim.evaluator import (
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
    format_chat_history,
)

# Output schema for the LLM judge (base class assumed to be a Pydantic BaseModel)
class VerificationComplianceMetric(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str

VERIFICATION_COMPLIANCE_SYSTEM_PROMPT = """\
You are an impartial evaluator for a customer service agent.
Score how well the agent followed identity verification protocols..."""

def verification_compliance_metric(
    input_data: ScoreInput,
) -> QuantResult:
    # Implementation
    return QuantResult(score=0.85, reason="Verification completed")
Sources: examples/customer-service/custom_metrics.py:15-40
#### Qualitative Metric Pattern
from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    ScoreInput,
    format_chat_history,
)

def helpfulness_metric(input_data: ScoreInput) -> QualResult:
    system_prompt = """Evaluate the helpfulness of the agent response..."""
    return QualResult(
        assessment="The agent provided comprehensive assistance",
        score=0.9,
        reason="Clear explanation with actionable steps",
    )
Metric Configuration
To use custom metrics:
- Implement the metric function in a Python file
- Reference the file path in custom_metrics_file_paths in config.yaml
- Add the metric name to metrics_to_run
Sources: examples/customer-service/custom_metrics.py:1-15
CLI Usage
ArkSim provides a command-line interface for running simulations and evaluations.
Primary Commands
| Command | Description |
|---|---|
| arksim simulate-evaluate config.yaml | Run simulation and evaluation in one step |
| arksim simulate config_simulate.yaml | Run simulation only |
| arksim evaluate config_evaluate.yaml | Evaluate existing simulation results |
Sources: examples/customer-service/README.md:15-22
Workflow Examples
#### Combined Simulation and Evaluation
arksim simulate-evaluate config.yaml
#### Separate Simulation and Evaluation
# Step 1: Simulate
arksim simulate config_simulate.yaml
# Step 2: Evaluate
arksim evaluate config_evaluate.yaml
Results are written to ./results/simulation/simulation.json. The evaluation report is printed to stdout with per-scenario metric scores and failure analysis.
Sources: examples/integrations/dify/README.md:15-25
Framework Integrations
ArkSim provides integration examples for popular agent frameworks.
Integration Matrix
| Framework | Package Required | Documentation |
|---|---|---|
| LangGraph | langgraph langchain-openai | Link |
| LangChain | langgraph langchain-openai | Link |
| AutoGen | autogen-agentchat autogen-ext[openai] | Link |
| Claude Agent SDK | claude-agent-sdk | Link |
| Rasa | rasa-pro | Link |
| Dify | (custom agent) | Link |
| OpenClaw | (custom agent) | Link |
Sources: examples/integrations/langgraph/README.md, examples/integrations/autogen/README.md
General Integration Pattern
Regardless of the framework, the integration follows this pattern:
- Create a custom_agent.py file with a BaseAgent subclass
- Implement the execute() method to call the framework's agent
- Return either a string or AgentResponse with tool calls
- Configure config.yaml to use the custom agent
- Create scenarios.json with test cases
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class FrameworkAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-session-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Call your framework's agent here. your_agent is a placeholder for
        # an object you construct elsewhere (for example at module level).
        result = await your_agent.run(user_query)
        return str(result)
Report Generation
ArkSim generates comprehensive HTML evaluation reports containing:
Report Sections
| Section | Content |
|---|---|
| Summary | Overall scores, pass/fail status |
| Per-Scenario Scores | Individual metric scores for each scenario |
| Failure Categories | Grouped failure patterns with counts |
| Conversation Viewer | Full transcript of each simulated conversation |
| Score Explanations | Rationale for each score with references to conversation turns |
Sources: README.md:25-30
Report Output
The report tells you where your agent is strong and where it breaks. You get per-metric scores, categorized failures, and full conversation transcripts so you can read the exact turns where things went wrong.
Sources: README.md:30-35
Development Guidelines
Project Setup
# Fork and clone
git clone https://github.com/<your-username>/arksim.git
cd arksim
# Install in editable mode with dev dependencies
pip install -e ".[dev]"
# Create a branch
git checkout -b my-feature
Code Quality
ArkSim uses Ruff for linting and formatting:
ruff check . # Lint
ruff format . # Format
Pre-commit hooks run both automatically on commit.
Code Style
- Follow PEP 8 conventions
- Code lines: maximum 120 characters
- Comments and docstrings: maximum 80 characters
- Type hints encouraged for function signatures
- Use absolute imports over relative imports
Commit Message Format
<component>: <verb> <description>
Examples:
- evaluator: add custom metric support
- simulator: fix profile generation for empty attributes
- cli: support verbose flag for streaming output
Keep the subject line under 72 characters, use lowercase, imperative mood.
Branch Naming
<type>/<short-description>
Examples: feat/retry-logic, fix/empty-list-handling, docs/update-quickstart.
Sources: CONTRIBUTING.md:1-50
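The commit and branch conventions above can be checked mechanically. The sketch below is one possible reading of those rules, not a tool shipped with ArkSim: it assumes "lowercase" applies to the component and the first word of the subject, it cannot verify imperative mood, and the function names are hypothetical.

```python
import re

# "<component>: <verb> <description>" with a lowercase component and verb.
SUBJECT_RE = re.compile(r"^[a-z][a-z0-9_-]*: [a-z]\S*( \S+)+$")
# "<type>/<short-description>" with hyphen-separated lowercase words.
BRANCH_RE = re.compile(r"^[a-z]+/[a-z0-9]+(?:-[a-z0-9]+)*$")


def is_valid_subject(subject: str) -> bool:
    """Check the subject shape and the under-72-characters limit."""
    return len(subject) < 72 and SUBJECT_RE.match(subject) is not None


def is_valid_branch(branch: str) -> bool:
    """Check the <type>/<short-description> branch shape."""
    return BRANCH_RE.match(branch) is not None


for subject in ("evaluator: add custom metric support", "Evaluator: Added support"):
    print(subject, "->", is_valid_subject(subject))
for branch in ("feat/retry-logic", "retry_logic"):
    print(branch, "->", is_valid_branch(branch))
```

A check like this could run in a commit-msg hook alongside the existing pre-commit setup.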
CI/CD Integration
ArkSim can be integrated into GitHub Actions workflows for automated testing.
Workflow Options
| Workflow | Use Case |
|---|---|
| arksim.yml | HTTP server endpoints requiring startup/shutdown |
| arksim-pytest.yml | Python-based pytest integration |
Setup Requirements
- Update TODO sections in arksim.yml (startup command and health-check URL)
- Create tests/arksim/config.yaml pointing to your server endpoint
- Create tests/arksim/scenarios.json with test cases
- Add custom metrics to tests/arksim/custom_metrics/ if needed
- Add OPENAI_API_KEY (and optionally AGENT_API_KEY) to GitHub secrets
Sources: examples/ci/README.md:1-15
Summary
ArkSim provides a comprehensive framework for evaluating multi-turn conversational AI agents through:
- Flexible integration: Connect agents via Python classes, REST APIs, or framework-specific connectors
- Rich scenario definitions: Test agents with diverse user profiles and interaction goals
- Extensible evaluation: Implement custom metrics for domain-specific assessment
- Actionable reporting: Generate HTML reports that pinpoint exactly where and why agent behavior falls short
By standardizing agent evaluation, ArkSim enables data-driven improvements to conversational AI systems across different frameworks and architectures.
Sources: README.md
Quickstart Guide
Related topics: Introduction to ArkSim, Agent Types and Integration
This guide walks you through setting up and running your first agent simulation with ArkSim. By the end, you'll understand how to scaffold a project, connect your agent, define test scenarios, and execute simulation and evaluation workflows.
Prerequisites
Before starting, ensure you have:
- Python 3.10 or higher
- pip or pipx for package installation
- An API key for your LLM provider (OpenAI, Anthropic, etc.) if your agent requires external API calls
Installation
Install ArkSim using pip:
pip install arksim
For development with test dependencies:
pip install -e ".[dev]"
Sources: README.md
Project Initialization
The arksim init command scaffolds a starter project with all necessary configuration files. Run it from your desired working directory:
arksim init
Agent Connection Types
ArkSim supports three agent connection types via the --agent-type flag:
| Type | Description | Use Case |
|---|---|---|
| custom | Python class implementing BaseAgent | Full control, no external server needed |
| chat_completions | HTTP endpoint compatible with OpenAI format | Existing REST APIs |
| a2a | Agent-to-Agent protocol | Multi-agent systems |
Sources: arksim/cli.py:120-136
Command Options
arksim init --agent-type custom # Default: Python agent class
arksim init --agent-type chat_completions # HTTP endpoint
arksim init --agent-type a2a # A2A protocol
arksim init --agent-type custom --force # Overwrite existing files
Scaffolding generates these files:
| File | Purpose |
|---|---|
| my_agent.py | Agent implementation stub |
| config.yaml | Simulator configuration |
| scenarios.json | Test scenario definitions |
| .env | Environment variables template |
Project Structure
your-project/
├── my_agent.py          # Your agent implementation
├── config.yaml          # Simulation & evaluation settings
├── scenarios.json       # Test scenarios
└── results/             # Output directory (created on run)
    ├── simulation/
    │   └── simulation.json
    └── evaluation/
        └── evaluation.json
Connecting Your Agent
Python Class (Recommended)
Replace the generated my_agent.py with your agent logic. Your class must inherit from BaseAgent:
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Your agent logic here
        return "agent response"
Sources: README.md
The execute method supports two return types:
| Return Type | Description |
|---|---|
| str | Plain text response |
| AgentResponse | Structured response with text and tool calls |
from arksim.simulation_engine.tool_types import AgentResponse, ToolCall

async def execute(self, user_query: str, **kwargs: object) -> AgentResponse:
    tool_calls = [
        ToolCall(
            id="call_123",
            name="search_database",
            arguments={"query": user_query},
        )
    ]
    return AgentResponse(
        content="Found results for your query",
        tool_calls=tool_calls,
    )
Sources: arksim/simulation_engine/tool_types.py:45-72
Chat Completions Endpoint
For HTTP-based agents, configure config.yaml:
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
A2A Protocol
For Agent-to-Agent protocol support:
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
A2A agents can surface tool calls for evaluation via the protocol extension.
Sources: README.md
Configuring Simulation
Edit config.yaml to control simulation behavior:
simulator:
  max_turns: 20
  timeouts:
    agent_response: 30
    tool_execution: 10
  model: gpt-4o-mini
Core Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_turns | int | 20 | Maximum conversation turns per scenario |
| agent_response | int | 30 | Timeout for agent response (seconds) |
| tool_execution | int | 10 | Timeout for tool execution (seconds) |
| model | string | gpt-4o-mini | Simulator model for user simulation |
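To make the agent_response timeout concrete, the sketch below shows the general technique of bounding an awaited agent call with asyncio.wait_for. It illustrates what a per-response timeout means and is not arksim's actual implementation. The slow_agent and call_with_timeout names are hypothetical, and the durations are shortened so the example runs quickly.

```python
import asyncio
from typing import Optional


async def slow_agent(user_query: str) -> str:
    """Stand-in agent that needs 0.2 seconds to answer."""
    await asyncio.sleep(0.2)
    return "late reply"


async def call_with_timeout(user_query: str, agent_response_timeout: float) -> Optional[str]:
    """Return the agent's reply, or None if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(slow_agent(user_query), timeout=agent_response_timeout)
    except asyncio.TimeoutError:
        return None


print(asyncio.run(call_with_timeout("hello", agent_response_timeout=0.05)))  # None
print(asyncio.run(call_with_timeout("hello", agent_response_timeout=1.0)))   # late reply
```

A turn that times out this way can then be recorded as a failure instead of stalling the whole run.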
Defining Test Scenarios
Edit scenarios.json to define your test cases:
[
  {
    "id": "scenario-001",
    "name": "Customer Inquiry",
    "category": "support",
    "user_profile": {
      "name": "Alice Johnson",
      "age": 34,
      "account_type": "premium"
    },
    "knowledge": [
      "Customer has a premium account",
      "Customer is inquiring about billing"
    ],
    "conversation": [
      {
        "turn": 1,
        "user": "I noticed a charge on my account that I don't recognize",
        "goal": "Customer wants to understand and dispute the unfamiliar charge"
      }
    ]
  }
]
Scenario Schema Fields
| Field | Required | Description |
|---|---|---|
| id | Yes | Unique scenario identifier |
| name | Yes | Human-readable name |
| category | No | Grouping category for filtering |
| user_profile | No | Simulated user attributes |
| knowledge | No | Facts the simulated user knows |
| conversation | Yes | Multi-turn conversation structure |
Sources: examples/customer-service/README.md
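The required fields in the Scenario Schema Fields table can be checked before a run. The sketch below is a hypothetical pre-flight check, not an arksim API: it flags scenarios that lack a required field and, as an extra assumption based on the "unique" wording, duplicate ids.

```python
import json

# Fields marked "Yes" in the Required column of the table.
REQUIRED_FIELDS = ("id", "name", "conversation")


def validate_scenarios(raw: str) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    seen_ids: set[str] = set()
    for index, scenario in enumerate(json.loads(raw)):
        for field_name in REQUIRED_FIELDS:
            if field_name not in scenario:
                problems.append(f"scenario {index}: missing '{field_name}'")
        scenario_id = scenario.get("id")
        if scenario_id is not None:
            if scenario_id in seen_ids:
                problems.append(f"scenario {index}: duplicate id '{scenario_id}'")
            seen_ids.add(scenario_id)
    return problems


sample = json.dumps([
    {"id": "scenario-001", "name": "Customer Inquiry", "conversation": []},
    {"id": "scenario-001", "name": "Refund Request"},
])
for problem in validate_scenarios(sample):
    print(problem)
```

Here the second scenario is reported twice: once for the missing conversation field and once for reusing an id.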
Running Simulation and Evaluation
Combined Workflow
Run both simulation and evaluation in one command:
arksim simulate-evaluate config.yaml
Separate Steps
For more control, run simulation and evaluation separately:
# Step 1: Simulate
arksim simulate config_simulate.yaml
# Step 2: Evaluate
arksim evaluate config_evaluate.yaml
Sources: examples/customer-service/README.md
Command Reference
| Command | Description |
|---|---|
| arksim init | Scaffold new project |
| arksim simulate <config> | Run simulation only |
| arksim evaluate <config> | Run evaluation only |
| arksim simulate-evaluate <config> | Run both steps |
| arksim ui | Launch web UI (port 8080) |
| arksim examples | Download example projects |
| arksim prompts | List available prompts |
Sources: arksim/cli.py:89-152
Web UI
Launch the web-based control plane:
arksim ui --port 8080
The UI provides:
- Scenario management
- Real-time simulation progress
- Log viewing
- Dark/light mode toggle
Integration Examples
ArkSim supports integrations with popular agent frameworks:
| Framework | Command |
|---|---|
| LangChain/LangGraph | pip install langgraph langchain-openai |
| Claude Agent SDK | pip install claude-agent-sdk |
| Microsoft AutoGen | pip install autogen-agentchat autogen-ext[openai] |
| Dify | HTTP-based integration |
Example workflow for LangGraph:
cd examples/integrations/langgraph
pip install langgraph langchain-openai
export OPENAI_API_KEY="<your-key>"
arksim simulate-evaluate config.yaml
Sources: examples/integrations/langchain/README.md
Next Steps
- Review the Evaluation Metrics documentation for customizing scoring
- Explore the Custom Agent guide for advanced integration patterns
- Download example projects with arksim examples
Sources: README.md
System Architecture Overview
Related topics: Simulation Engine, Evaluation System, LLM Provider Integration
ArkSim is a multi-turn agent evaluation framework that simulates user conversations with agents, captures tool calls, and scores agent performance against customizable metrics. This document provides a comprehensive overview of the system's architecture, core components, and data flows.
Core Components
ArkSim's architecture is organized into three primary subsystems that work together to simulate, capture, and evaluate agent behavior.
| Component | Purpose |
|---|---|
| Simulation Engine | Orchestrates multi-turn conversations between user simulators and agents |
| Evaluator | Scores agent responses against qualitative and quantitative metrics |
| LLM Layer | Powers user simulation and metric evaluation via configurable model providers |
High-Level Architecture
graph TD
subgraph Simulation["Simulation Engine"]
CLI[CLI Interface]
Config[Config Loader]
Simulator[Simulator Core]
AgentConnector[Agent Connector]
UserSimulator[User Simulator]
end
subgraph Evaluation["Evaluator"]
Metrics[Custom Metrics]
Scoring[Scoring Engine]
Report[HTML Report Generator]
end
subgraph LLM["LLM Layer"]
ChatLLM[Chat LLM]
EvalLLM[Evaluation LLM]
end
subgraph External["External Systems"]
CustomAgent[Custom Agent]
HTTPEndpoint[HTTP Endpoint]
A2AAgent[A2A Agent]
end
CLI --> Config
Config --> Simulator
Simulator --> UserSimulator
Simulator --> AgentConnector
AgentConnector --> CustomAgent
AgentConnector --> HTTPEndpoint
AgentConnector --> A2AAgent
UserSimulator --> ChatLLM
Metrics --> EvalLLM
EvalLLM --> Scoring
Scoring --> Report
style CLI fill:#e1f5fe
style Simulator fill:#fff3e0
style Report fill:#e8f5e9
CLI Interface
The command-line interface provides multiple entry points for running simulations and evaluations. The CLI is implemented in arksim/cli.py and supports the following commands:
| Command | Description |
|---|---|
| arksim simulate-evaluate config.yaml | Run simulation and evaluation in a single pipeline |
| arksim simulate config.yaml | Run simulation only |
| arksim evaluate config.yaml | Run evaluation on existing simulation results |
| arksim init | Scaffold starter files for agent testing |
| arksim ui | Launch web UI control plane |
| arksim examples | Download example projects from GitHub |
Sources: arksim/cli.py:1-100
Simulation Modes
ArkSim supports three distinct agent integration patterns:
graph LR
subgraph AgentTypes["Agent Types"]
Custom[Custom Agent<br/>Python class extending BaseAgent]
HTTP[Chat Completions<br/>HTTP endpoint /v1/chat/completions]
A2A[A2A Protocol<br/>Agent-to-Agent standard]
end
- Custom Agent (Python class): Extend BaseAgent and implement the execute() method
- Chat Completions: Configure an HTTP endpoint that serves an OpenAI-compatible chat completions API
- A2A Protocol: Connect via the Agent-to-Agent protocol standard
Sources: README.md
Simulation Engine
The simulation engine orchestrates multi-turn conversations between a user simulator and the agent under test. Each turn consists of:
- User simulator generates a response based on conversation history and scenario knowledge
- Agent executes the user query and returns a response (optionally with tool calls)
- Simulator captures the interaction for evaluation
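The three steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the engine's real code: EchoAgent stands in for a BaseAgent subclass, simulated_user replaces the LLM-backed user simulator with a fixed script, and run_conversation is a hypothetical name.

```python
import asyncio


class EchoAgent:
    """Stand-in for a BaseAgent subclass."""

    async def execute(self, user_query: str, **kwargs: object) -> str:
        return f"You said: {user_query}"


def simulated_user(history: list[dict], knowledge: list[str]) -> str:
    """Stand-in user simulator; a real one would call an LLM with the history."""
    turn = len(history) // 2  # two history entries are stored per completed turn
    return knowledge[turn % len(knowledge)]


async def run_conversation(agent: EchoAgent, knowledge: list[str], max_turns: int) -> list[dict]:
    history: list[dict] = []
    for _ in range(max_turns):
        user_message = simulated_user(history, knowledge)  # step 1: user speaks
        agent_reply = await agent.execute(user_message)    # step 2: agent answers
        # step 3: capture the interaction for later evaluation
        history.append({"role": "user", "content": user_message})
        history.append({"role": "assistant", "content": agent_reply})
    return history


history = asyncio.run(
    run_conversation(EchoAgent(), ["Where is my order?", "It was due Monday."], max_turns=2)
)
for message in history:
    print(f"{message['role']}: {message['content']}")
```

The captured history is what an evaluator would later score turn by turn.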
Tool Call Capture
ArkSim captures tool/function calls in two ways:
| Capture Method | Mechanism | Configuration |
|---|---|---|
| AgentResponse | Agent returns structured AgentResponse with tool_calls list | Default behavior |
| TracingProcessor | SDK's TracingProcessor.on_span_end captures calls automatically | trace_receiver.enabled: true |
Sources: examples/customer-service/README.md
#### Tool Call Data Model
Tool calls are represented by the ToolCall class:
class ToolCall(BaseModel):
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None
The AgentResponse wraps both content and tool calls:
class AgentResponse(BaseModel):
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)
Sources: arksim/simulation_engine/tool_types.py:1-50
Traced Agent Flow
sequenceDiagram
participant Simulator
participant Agent
participant Tracing as ArksimTracingProcessor
participant Evaluator
Simulator->>Agent: execute(user_query)
Agent->>Tracing: on_span_end(span)
Tracing->>Simulator: captured_tool_calls
Simulator->>Agent: RunResult
Agent->>Simulator: AgentResponse
Simulator->>Evaluator: Tool calls + Conversation
Evaluator->>Simulator: Metric Scores
Sources: examples/customer-service/README.md
Evaluator
The evaluator scores agent performance using both quantitative and qualitative metrics. The evaluation framework supports custom metrics defined via Python files.
Metric Types
| Type | Description | Scoring Method |
|---|---|---|
| QuantitativeMetric | Numerical scores (0.0-1.0 scale) | Structured JSON schema validation |
| QualitativeMetric | Free-form evaluation with reasoning | LLM-generated analysis |
Sources: examples/customer-service/custom_metrics.py:1-60
Custom Metrics Structure
Custom metrics require:
- A Pydantic BaseModel defining the output schema
- A system prompt describing evaluation criteria
- Implementation of QuantitativeMetric or QualitativeMetric
class ConversionSchema(BaseModel):
    intent_strength: float
    conversion_outcome: float
    evidence: list[str]
    reason: str

CONVERSION_SYSTEM_PROMPT = """\
You are an impartial evaluator for an e-commerce shopping agent.
Your job is to score (1) the shopper's purchase intent and (2) whether the agent achieved a conversion outcome...
"""
Sources: examples/e-commerce/custom_metrics.py:1-40
Score Calculation
The evaluator computes a Final Score as a weighted average:
| Component | Weight | Description |
|---|---|---|
| Turn Success Ratio | 40% | Ratio of successful turns to total turns |
| Goal Completion Score | 60% | LLM-assessed goal achievement score |
| Status | Condition |
|---|---|
| Done | Final score = 1.0 |
| Partial Failure | 0.0 < Final score < 1.0 |
| Complete Failure | Final score = 0.0 |
Sources: arksim/utils/html_report/report_template.html:1-50
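The status table maps a final score to one of three labels. A minimal sketch of that mapping follows. The classify_status name is hypothetical, and rejecting scores outside the 0.0-1.0 range is an added assumption.

```python
def classify_status(final_score: float) -> str:
    """Map a final score in [0.0, 1.0] to the status labels in the table."""
    if not 0.0 <= final_score <= 1.0:
        raise ValueError(f"final score out of range: {final_score}")
    if final_score == 1.0:
        return "Done"
    if final_score == 0.0:
        return "Complete Failure"
    return "Partial Failure"


for value in (1.0, 0.9, 0.0):
    print(value, "->", classify_status(value))
```

Only a perfect score counts as Done, so any imperfection in either weighted component yields Partial Failure.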
Data Flow
graph LR
subgraph Input["Input"]
Config[config.yaml]
Scenarios[scenarios.json]
Metrics[custom_metrics.py]
end
subgraph Process["Processing"]
Sim[Simulation]
Eval[Evaluation]
Score[Scoring]
end
subgraph Output["Output"]
Results[simulation.json]
Report[evaluation.html]
end
Config --> Sim
Scenarios --> Sim
Sim --> Eval
Metrics --> Eval
Eval --> Score
Score --> Results
Score --> Report
Agent Connector Types
ArkSim supports multiple agent integration patterns via the configuration system:
# Option 1: Custom Python Agent
agent_config:
  agent_type: custom
  agent_name: my-agent
# Option 2: Chat Completions HTTP Endpoint
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
# Option 3: A2A Protocol
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
Sources: README.md
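The three options differ mainly in whether an endpoint is needed. The sketch below encodes that difference as a check on an already-parsed agent_config mapping. It is a hypothetical helper, not arksim's own validation, and it assumes that the custom type needs no api_config because none appears in Option 1.

```python
KNOWN_AGENT_TYPES = {"custom", "chat_completions", "a2a"}
ENDPOINT_AGENT_TYPES = {"chat_completions", "a2a"}


def check_agent_config(agent_config: dict) -> list[str]:
    """Return a list of problems found in an agent_config mapping."""
    problems: list[str] = []
    agent_type = agent_config.get("agent_type")
    if agent_type not in KNOWN_AGENT_TYPES:
        problems.append(f"unknown agent_type: {agent_type!r}")
    if not agent_config.get("agent_name"):
        problems.append("agent_name is missing")
    if agent_type in ENDPOINT_AGENT_TYPES:
        endpoint = str((agent_config.get("api_config") or {}).get("endpoint", ""))
        if not endpoint.startswith(("http://", "https://")):
            problems.append("api_config.endpoint must be an http(s) URL")
    return problems


print(check_agent_config({"agent_type": "custom", "agent_name": "my-agent"}))
print(check_agent_config({"agent_type": "a2a", "agent_name": "my-agent"}))
```

The first call reports no problems, while the second flags the missing endpoint for the A2A option.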
User Simulator
The user simulator generates realistic multi-turn conversations based on:
- Scenario definitions: Predefined conversation flows and expected behaviors
- User profiles: Demographic and behavioral attributes for persona simulation
- Knowledge bases: Domain-specific information the simulated user possesses
- Conversation history: Full context of the multi-turn interaction
The simulator uses LLM-based generation to produce contextually appropriate user responses that evolve through the conversation based on agent actions and conversation state.
HTML Report Generation
After evaluation completes, ArkSim generates an interactive HTML report containing:
| Section | Content |
|---|---|
| Summary Statistics | Overall scores, pass/fail rates |
| Per-Scenario Metrics | Individual metric scores with reasoning |
| Failure Categories | Grouped analysis of failure types |
| Conversation Transcripts | Full turn-by-turn dialogue viewer |
Sources: arksim/utils/html_report/report_template.html:1-100
Integration Examples
ArkSim provides example integrations for popular agent frameworks:
| Framework | Example Location | Connection Method |
|---|---|---|
| LangGraph | examples/integrations/langgraph/ | Custom agent connector |
| LangChain | examples/integrations/langchain/ | Custom agent connector |
| Claude Agent SDK | examples/integrations/claude-agent-sdk/ | Custom agent connector |
| AutoGen | examples/integrations/autogen/ | Custom agent connector |
| Pydantic AI | examples/integrations/pydantic-ai/ | Custom agent connector |
| Dify | examples/integrations/dify/ | HTTP client |
Sources: examples/integrations/langgraph/README.md, examples/integrations/claude-agent-sdk/README.md, examples/integrations/autogen/README.md
Configuration Schema
The main configuration file (config.yaml) controls all aspects of simulation and evaluation:
| Section | Key Options |
|---|---|
| agent_config | agent_type, agent_name, api_config.endpoint |
| scenario_file | Path to scenarios.json |
| metrics_to_run | List of metric names to execute |
| custom_metrics_file_paths | Paths to custom metric Python files |
| trace_receiver.enabled | Enable TracingProcessor capture (default: false) |
| trace_receiver.wait_timeout | Timeout for trace capture (seconds) |
Sources: examples/customer-service/README.md
Web UI
ArkSim includes a web-based control plane for managing simulations:
graph TD
subgraph UI["Web UI"]
Build[Build Scenarios]
Load[Load Existing]
Run[Run Simulation]
View[View Results]
end
subgraph Features["UI Features"]
AutoGen[Auto-generate Scenarios PRO]
Browse[File Browser]
Refresh[Refresh Results]
end
Features include:
- Scenario building and loading
- Auto-generate scenarios (PRO feature)
- File browser integration
- Results viewing and refresh
Sources: arksim/ui/frontend/index.html:1-80
Summary
ArkSim provides a comprehensive multi-turn agent evaluation framework with:
- Flexible agent integration: Support for custom Python agents, HTTP endpoints, and A2A protocol
- Tool call capture: Two mechanisms for capturing agent tool executions
- Customizable metrics: Both quantitative and qualitative evaluation approaches
- Interactive reporting: HTML-based results with conversation viewer
- CLI and UI: Command-line and web-based interfaces for running evaluations
- Framework integrations: Pre-built examples for LangGraph, LangChain, Claude SDK, AutoGen, Pydantic AI, and Dify
Sources: arksim/cli.py:1-100
Simulation Engine
Related topics: System Architecture Overview, Evaluation System, Scenario Management
The Simulation Engine is the core component of ArkSim responsible for executing multi-turn conversations between simulated users and agent systems under test. It orchestrates the entire simulation lifecycle, from scenario loading to agent execution, while capturing tool calls, responses, and conversation state for downstream evaluation.
Architecture Overview
The Simulation Engine follows a layered architecture that separates concerns between scenario management, agent execution, tool call capture, and knowledge handling.
graph TD
A[Scenarios JSON] --> B[Simulator]
C[Config YAML] --> B
B --> D[Agent Executor]
D --> E[BaseAgent]
E --> F[Custom Agent / Chat Completions / A2A]
F --> G[Tool Calls Capture]
G --> H[Simulation Results]
D --> I[Multi-Knowledge Handler]
I --> J[Knowledge Sources]
style B fill:#e1f5fe
style D fill:#fff3e0
style E fill:#e8f5e9
Core Components
| Component | File | Purpose |
|---|---|---|
| Simulator | simulator.py | Orchestrates the simulation lifecycle |
| Entities | entities.py | Data models for scenarios, profiles, and turns |
| Agent Base | agent/base.py | Abstract base class for all agent implementations |
| Tool Types | tool_types.py | Data models for tool calls and agent responses |
| Multi-Knowledge Handler | core/multi_knowledge_handling.py | Manages multiple knowledge sources for user profiles |
| Prompt Utilities | utils/prompts.py | Generates prompts for simulated users |
Data Models
ToolCall
The ToolCall class represents a single tool or function call observed during a conversation turn.
class ToolCall(BaseModel):
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None
Key characteristics:
- Declares extra="ignore" for forward compatibility with future versions
- Supports capturing arguments, results, and error states
- Tracks the source of tool calls (e.g., from response parsing or tracing)
Sources: arksim/simulation_engine/tool_types.py:26-37
AgentResponse
The AgentResponse class provides a structured return from agent execution.
class AgentResponse(BaseModel):
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)
Sources: arksim/simulation_engine/tool_types.py:45-52
BaseAgent Interface
All agent implementations must inherit from BaseAgent and implement the required async methods.
class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"
Sources: README.md:45-55
Required Methods
| Method | Return Type | Description |
|---|---|---|
| get_chat_id() | str | Returns a unique identifier for the chat session |
| execute() | str \| AgentResponse | Executes the agent with a user query and returns content with optional tool calls |
Sources: README.md:45-55
Simulation Workflow
graph LR
A[Load Config] --> B[Load Scenarios]
B --> C[Initialize Agent]
C --> D[For Each Scenario]
D --> E[Generate User Profile]
E --> F[Execute Turn]
F --> G{Capture Tool Calls}
G --> H[Record Response]
H --> I{More Turns?}
I -->|Yes| F
I -->|No| J[Save Results]
J --> K[Next Scenario]
K -->|Yes| D
K -->|No| L[Complete]
Step 0: Build Scenarios
Before running a simulation, users must create or load test scenarios. The UI provides options to:
- Auto-generate Scenarios (Pro feature) - Automatically generate realistic test scenarios from the agent's knowledge base
- Load Existing - Load scenario files from a specified path
Sources: arksim/ui/frontend/index.html:1-50
Step 1: Simulation Execution
The simulator executes each scenario by:
- Loading the scenario configuration and user profiles
- Generating multi-turn conversations based on the scenario goals
- Capturing all tool calls and responses
- Producing a simulation.json output file
Sources: examples/integrations/dify/README.md:18-20
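The execution steps above can be sketched as a simple turn loop. This is a minimal illustration under stated assumptions, not ArkSim's simulator: `EchoAgent`, `scripted_user`, `run_scenario`, and the layout of the returned record are hypothetical stand-ins. Only the `get_chat_id` and `execute` method names come from the BaseAgent interface shown earlier.

```python
import asyncio
import json


class EchoAgent:
    """Hypothetical stand-in exposing the two BaseAgent methods."""

    async def get_chat_id(self) -> str:
        return "chat-001"

    async def execute(self, user_query: str, **kwargs: object) -> str:
        return f"You said: {user_query}"


def scripted_user(goal: str, turn: int) -> str:
    """Placeholder for the LLM-driven simulated user (an assumption)."""
    return f"[turn {turn}] I need help with: {goal}"


async def run_scenario(agent, goal: str, max_turns: int = 3) -> dict:
    """Run one scenario and collect a turn-by-turn transcript."""
    chat_id = await agent.get_chat_id()
    turns = []
    for turn in range(1, max_turns + 1):
        user_message = scripted_user(goal, turn)
        reply = await agent.execute(user_message)
        turns.append({"turn": turn, "user": user_message, "agent": reply})
    return {"chat_id": chat_id, "goal": goal, "turns": turns}


result = asyncio.run(run_scenario(EchoAgent(), "resetting my password", max_turns=2))
print(json.dumps(result, indent=2))
```

A real run would replace `scripted_user` with the simulated user driven by the scenario profile, and would also record any tool calls returned by the agent.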
Agent Types
ArkSim supports multiple agent integration patterns:
| Agent Type | Configuration | Use Case |
|---|---|---|
| Custom Python Class | agent_type: custom | Full control via subclassing BaseAgent |
| Chat Completions | agent_type: chat_completions | HTTP endpoint compatible with OpenAI format |
| A2A Protocol | agent_type: a2a | Agent-to-Agent protocol endpoints |
Sources: README.md:57-68
Chat Completions Configuration
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
A2A Protocol Configuration
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
Sources: README.md:57-68
Tool Call Capture
ArkSim provides two mechanisms for capturing tool calls:
Response-Based Capture (Default)
The agent returns an AgentResponse containing explicit tool calls:
async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    run_result = self.agent.run(user_query)
    return AgentResponse(
        content=run_result.text,
        tool_calls=extract_tool_calls(run_result),
    )
Tracing-Based Capture (Automatic)
For agents using the ArkSim tracing processor, tool calls are captured automatically without modifying the agent response:
# At module load
from arksim.simulation_engine.tracing import ArksimTracingProcessor
agent.register(ArksimTracingProcessor())
# Agent returns plain str
async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    return self.agent.run(user_query)  # Returns plain string
Sources: examples/customer-service/README.md:1-35
Configuration
CLI Usage
# Combined simulation and evaluation
arksim simulate-evaluate config.yaml
# Separate steps
arksim simulate config_simulate.yaml
arksim evaluate config_evaluate.yaml
Sources: examples/customer-service/README.md:8-14
Trace Receiver Configuration
trace_receiver:
  enabled: true
  wait_timeout: 5
When trace_receiver.enabled is false, ArkSim only captures tool calls from AgentResponse.
Sources: examples/customer-service/README.md:30-35
Integration Examples
ArkSim provides pre-built integrations for popular agent frameworks:
| Framework | Package | Example Path |
|---|---|---|
| LangGraph | langgraph, langchain-openai | examples/integrations/langgraph/ |
| AutoGen | autogen-agentchat, autogen-ext[openai] | examples/integrations/autogen/ |
| Claude Agent SDK | claude-agent-sdk | examples/integrations/claude-agent-sdk/ |
| CrewAI | crewai | examples/integrations/crewai/ |
| Pydantic AI | pydantic-ai | examples/integrations/pydantic-ai/ |
| LangChain | langgraph, langchain-openai | examples/integrations/langchain/ |
Sources: examples/integrations/langgraph/README.md, examples/integrations/autogen/README.md, examples/integrations/claude-agent-sdk/README.md, examples/integrations/crewai/README.md, examples/integrations/pydantic-ai/README.md, examples/integrations/langchain/README.md
Output Format
Simulation results are written to ./results/simulation/simulation.json containing:
- Complete conversation transcripts
- Captured tool calls with arguments and results
- Turn-by-turn timing information
- User profile data
- Scenario metadata
Sources: examples/integrations/dify/README.md:18-20
Custom Metrics Support
The simulation engine supports custom quantitative and qualitative metrics through the evaluator. Custom metric files can be added to custom_metrics/ directories and referenced in the configuration.
Sources: examples/customer-service/custom_metrics.py:1-20
Forward Compatibility
Both ToolCall and AgentResponse declare extra="ignore" in their Pydantic configuration. This ensures that snapshots from future ArkSim versions containing new fields can be loaded by older versions without raising ValidationError.
model_config = ConfigDict(extra="ignore")
Sources: arksim/simulation_engine/tool_types.py:26-28, arksim/simulation_engine/tool_types.py:45-47
Evaluation System
Related topics: System Architecture Overview, Simulation Engine
The Evaluation System in ArkSim is responsible for scoring simulated conversations between users and agents. It measures goal completion, helpfulness, coherence, and other quality metrics to identify where agents succeed and where they fail.
Overview
After simulation generates conversation transcripts, the evaluator analyzes each conversation and assigns scores across multiple dimensions. The system supports both built-in metrics and custom-defined metrics.
Purpose:
- Quantify agent performance objectively
- Identify specific failure patterns
- Generate actionable HTML reports for debugging
Scope:
- Scores conversations on a 0.0-1.0 scale
- Computes weighted final scores combining multiple metrics
- Categorizes outcomes as done, partial failure, or failed
- Supports LLM-based evaluation (using configured providers like OpenAI)
Sources: README.md
Architecture
graph TD
A[Simulation Results] --> B[Evaluator]
B --> C[Built-in Metrics]
B --> D[Custom Metrics]
B --> E[Tool Call Metrics]
C --> F[Final Scores]
D --> F
E --> F
F --> G[HTML Report Generator]
G --> H[evaluation/final_report.html]
Core Components
| Component | File | Purpose |
|---|---|---|
| Evaluator | evaluator.py | Main orchestration of evaluation pipeline |
| Base Metric | base_metric.py | Abstract base classes for metrics |
| Built-in Metrics | builtin_metrics.py | Standard metrics (faithfulness, helpfulness, etc.) |
| Tool Call Metrics | tool_call_metrics.py | Evaluation of tool usage patterns |
| Trajectory Matching | trajectory_matching.py | Compare expected vs actual agent trajectories |
| Thresholds | thresholds.py | Score classification logic |
| HTML Report | generate_html_report.py | Generate visual evaluation reports |
Evaluation Workflow
sequenceDiagram
participant S as Simulator
participant E as Evaluator
participant M as Metrics
participant R as Report Generator
S->>E: Simulation results (JSON)
E->>M: For each conversation
M->>M: Run quantitative metrics
M->>M: Run qualitative metrics
M->>E: Per-metric scores
E->>E: Calculate final scores
E->>E: Determine status
E->>R: Score data
R->>R: Generate HTML
Step-by-Step Process
- Load Simulation Data: Read conversation transcripts from simulation output
- Select Metrics: Determine which built-in and custom metrics to run
- Score Each Conversation: Apply metrics to score goal completion, behavior, etc.
- Compute Final Scores: Calculate weighted averages (goal_completion_weight=0.25, turn_success_ratio_weight=0.75)
- Classify Status: Assign done/partial_failure/failed based on thresholds
- Generate Report: Create HTML report with detailed breakdowns
Sources: arksim/utils/html_report/report_template.html
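The weighted final score from the steps above can be illustrated with a short sketch. The weights (0.25 and 0.75) come from the text; combining them as a plain weighted sum, and the function name `final_score`, are assumptions made for illustration rather than a description of the evaluator's code.

```python
def final_score(
    goal_completion: float,
    turn_success_ratio: float,
    goal_completion_weight: float = 0.25,
    turn_success_ratio_weight: float = 0.75,
) -> float:
    """Weighted sum of two scores, each expected in [0.0, 1.0]."""
    for value in (goal_completion, turn_success_ratio):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0.0, 1.0]")
    return (
        goal_completion_weight * goal_completion
        + turn_success_ratio_weight * turn_success_ratio
    )


# Goal fully completed, 80% of turns judged successful.
print(round(final_score(1.0, 0.8), 2))
```

Because the two weights sum to 1.0, the result stays on the same 0.0-1.0 scale as the inputs.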
Scoring System
Score Ranges
| Score Type | Range | Description |
|---|---|---|
| Quantitative | 0.0 - 1.0 | Numeric scores for measurable criteria |
| Qualitative | Label-based | Categorical results (compliant, professional, pass, etc.) |
| Goal Completion | 0.0 - 1.0 | Whether agent completed user goal |
| Final Score | 0.0 - 1.0 | Weighted combination of metrics |
Status Classification
| Status | Condition | Description |
|---|---|---|
| Done | final_score == 1.0 | Perfect performance, goal completed |
| Partial Failure | 0.6 <= final_score < 1.0 | Acceptable but with some failures |
| Failed | final_score < 0.6 | Poor performance requiring attention |
Sources: arksim/utils/html_report/report_template.html:85-87
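The classification table can be read as a small decision function. The sketch below is illustrative only: the thresholds are taken from the table, while the function name `classify_status` and the range check are assumptions.

```python
def classify_status(final_score: float) -> str:
    """Map a final score in [0.0, 1.0] to a status label."""
    if not 0.0 <= final_score <= 1.0:
        raise ValueError(f"final_score out of range: {final_score}")
    if final_score == 1.0:
        return "done"
    if final_score >= 0.6:
        return "partial_failure"
    return "failed"


for score in (1.0, 0.85, 0.6, 0.4):
    print(score, classify_status(score))
```

Checking the perfect-score case first keeps the three outcomes mutually exclusive.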
Built-in Metrics
The system provides seven built-in metrics that can be selected via configuration:
| Metric | Purpose |
|---|---|
| faithfulness | Did the agent provide factually accurate information? |
| helpfulness | Was the agent's response useful to the user? |
| coherence | Were responses logically connected and consistent? |
| verbosity | Did the agent maintain appropriate response length? |
| relevance | Did responses address the user's actual query? |
| goal_completion | Did the agent help the user accomplish their goal? |
| agent_behavior_failure | Did the agent exhibit any problematic behaviors? |
Metric Selection
From the frontend, users can select which built-in metrics to run:
<template x-for="m in ['faithfulness', 'helpfulness', 'coherence', 'verbosity', 'relevance', 'goal_completion', 'agent_behavior_failure']" :key="m">
If no metrics are selected, all built-in metrics run by default.
Sources: arksim/ui/frontend/index.html
Custom Metrics
Developers can define custom evaluation metrics by creating a Python module.
Creating Custom Metrics
from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
)

class MyMetric(QuantitativeMetric):
    name = "my_custom_metric"
    schema = MySchema  # Pydantic model

    async def evaluate(self, input: ScoreInput) -> QuantResult:
        # Evaluation logic
        return QuantResult(...)
Quantitative vs Qualitative Metrics
| Type | Base Class | Output |
|---|---|---|
| Quantitative | QuantitativeMetric | QuantResult with float value and reason |
| Qualitative | QualitativeMetric | QualResult with categorical label |
Configuration
Add custom metrics to config.yaml:
custom_metrics_file_paths:
  - /path/to/custom_metrics.py
metrics_to_run:
  - my_custom_metric  # optional; runs all if omitted
Sources: examples/customer-service/custom_metrics.py
Example: Verification Compliance
class VerificationComplianceSchema(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str
The evaluation prompt instructs the LLM to score:
- Identity Verification: Did the agent verify customer identity before actions?
- Action Gating: Did the agent gate sensitive actions behind verification?
Sources: examples/customer-service/custom_metrics.py:25-35
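One plausible way to collapse the two sub-scores into a single metric value is an unweighted mean. The source does not show how the example combines them, so treat the averaging below as an assumption. A dataclass named `VerificationScores` stands in for the Pydantic schema to keep the sketch dependency-free, and `combined_score` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class VerificationScores:
    """Stand-in for the schema fields returned by the LLM judge."""

    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str


def combined_score(scores: VerificationScores) -> float:
    """Average the two sub-scores after validating their range."""
    for name in ("identity_verification", "action_gating"):
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
    return (scores.identity_verification + scores.action_gating) / 2


judged = VerificationScores(1.0, 0.5, "Verified identity but refunded before confirming.")
print(combined_score(judged))
```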
Example: E-commerce Conversion
class ConversionSchema(BaseModel):
    intent_strength: float
    conversion_outcome: float
    evidence: list[str]
    reason: str
reason: str
Metrics track:
- Intent Strength: How ready the shopper is to buy
- Conversion Outcome: Whether the agent achieved a purchase decision
Sources: examples/e-commerce/custom_metrics.py
HTML Report
The evaluation system generates a detailed HTML report (evaluation/final_report.html) containing:
Report Sections
| Section | Content |
|---|---|
| Summary Statistics | Overall pass/fail rates, average scores |
| Conversations Table | Per-conversation scores, status badges |
| Detailed Breakdown | Goal completion, final scores, failure reasons |
| Score Reasons | LLM-generated explanations for each metric |
Report Features
- Interactive Table: Sort and filter conversations
- Score Details: Expandable sections showing metric-by-metric breakdown
- Status Badges: Visual indicators (done/partial/failed)
- Tooltip Explanations: Hover info for column headers
Score Display Logic
const POSITIVE_LABELS = ['compliant', 'professional', 'pass', 'good', 'complete', 'no failure', 'ok'];
const NEGATIVE_LABELS = ['flagged', 'unprofessional', 'fail', 'error', 'poor', 'missing', 'violated', 'partial'];
Qualitative scores are automatically classified as positive, negative, or neutral based on label matching.
Sources: arksim/utils/html_report/report_template.html:78-81
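The label matching can be mirrored in Python as follows. The two label lists are copied from the report template; matching on the exact label after trimming and lowercasing is an assumption, since the template's comparison logic is not shown here. `classify_label` is a hypothetical name.

```python
POSITIVE_LABELS = {
    "compliant", "professional", "pass", "good", "complete", "no failure", "ok",
}
NEGATIVE_LABELS = {
    "flagged", "unprofessional", "fail", "error", "poor", "missing", "violated", "partial",
}


def classify_label(label: str) -> str:
    """Return 'positive', 'negative', or 'neutral' for a qualitative label."""
    normalized = label.strip().lower()
    if normalized in POSITIVE_LABELS:
        return "positive"
    if normalized in NEGATIVE_LABELS:
        return "negative"
    return "neutral"


for label in ("Compliant", "no failure", "partial", "needs review"):
    print(label, "->", classify_label(label))
```

Exact matching avoids a pitfall of substring matching: "no failure" contains "fail" and "unprofessional" contains "professional", so a substring check could classify either label incorrectly.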
Configuration
Evaluation Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| evalProvider | string | - | LLM provider for evaluation |
| evalModel | string | - | Model to use for scoring |
| metricsToRun | list | all | Which built-in metrics to execute |
| customMetricsFilePaths | list | [] | Paths to custom metric modules |
| evalNumWorkers | int | auto | Parallel evaluation workers |
Provider Selection
The evaluator supports multiple LLM providers:
- OpenAI
- Azure OpenAI
- Custom endpoints (via chat_completions type)
- A2A protocol agents
Sources: arksim/ui/frontend/index.html
Integration with Simulation
The evaluation system is typically invoked via the CLI:
# Combined simulation and evaluation
arksim simulate-evaluate config.yaml
# Separate steps
arksim simulate config_simulate.yaml
arksim evaluate config_evaluate.yaml
Pipeline Flow
graph LR
A[config.yaml] --> B[Simulator]
B --> C[simulation.json]
C --> D[Evaluator]
D --> E[evaluation/]
E --> F[final_report.html]
E --> F --> G[scores.json]
Results are written to ./results/simulation/simulation.json for simulation and ./results/evaluation/ for evaluation output.
Sources: examples/integrations/dify/README.md
Advanced Features
Tool Call Capture
ArkSim can automatically capture tool calls via tracing:
trace_receiver:
  enabled: true
  wait_timeout: 5
When enabled, tool calls are captured automatically without requiring explicit return in AgentResponse.
Trajectory Matching
The trajectory_matching.py module compares expected agent trajectories against actual behavior, useful for validating that agents follow prescribed action sequences.
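To make the idea concrete, here is a minimal sketch of two common ways to compare tool-call sequences: strict equality and in-order (subsequence) matching. The source does not describe which strategies trajectory_matching.py implements, so both functions and the tool names are hypothetical illustrations.

```python
def exact_match(expected: list[str], actual: list[str]) -> bool:
    """True when the agent called exactly the expected tools, in order."""
    return list(expected) == list(actual)


def in_order_match(expected: list[str], actual: list[str]) -> bool:
    """True when the expected tools appear in order, allowing extra calls."""
    remaining = iter(actual)
    # `step in remaining` advances the iterator, so order is enforced.
    return all(step in remaining for step in expected)


expected = ["verify_identity", "lookup_order", "issue_refund"]
actual = ["verify_identity", "search_kb", "lookup_order", "issue_refund"]

print(exact_match(expected, actual))
print(in_order_match(expected, actual))
```

Strict matching suits workflows where any extra call is a defect; in-order matching tolerates harmless extra lookups while still catching skipped or reordered steps.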
Thresholds
The thresholds.py module defines score boundaries and classification logic for determining pass/fail conditions.
Summary
The Evaluation System provides comprehensive, configurable scoring of agent conversations:
- Flexible Metric System: Built-in metrics cover common quality dimensions; custom metrics extend evaluation to domain-specific criteria
- LLM-based Scoring: Uses configurable language models to generate nuanced, explainable scores
- Visual Reporting: HTML reports make results easy to understand and act upon
- Status Classification: Automatic categorization into done/partial_failure/failed for quick assessment
- Integration-ready: Works with Python agents, chat completions endpoints, and A2A protocol agents
Sources: README.md
Agent Types and Integration
Related topics: LLM Provider Integration, Tool Call Capture, Configuration System
ArkSim supports multiple agent connection types to accommodate different architectures and deployment scenarios. This page documents the available agent types, their configuration methods, and integration patterns.
Overview
ArkSim provides a flexible agent integration system that enables testing of various agent implementations through a unified simulation interface. The simulator communicates with agents via standardized protocols and captures responses for evaluation.
The agent integration system consists of:
- A BaseAgent abstract class that defines the agent interface
- Multiple client implementations for different connection types
- A factory pattern for agent instantiation based on configuration
- Support for tool call capture through both explicit responses and tracing
Sources: README.md
Supported Agent Types
ArkSim supports three primary agent types, each suited for different deployment scenarios.
| Agent Type | Description | Use Case |
|---|---|---|
| custom | Python class extending BaseAgent | Custom agent logic, no external server required |
| chat_completions | HTTP endpoint with OpenAI-compatible API | Existing REST APIs, external agent services |
| a2a | Agent-to-Agent protocol endpoint | Multi-agent systems, A2A-compliant agents |
Sources: arksim/cli.py:100-109
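Dispatching on agent_type can be sketched with a small registry. This is an illustration of the factory idea only: the builder functions return descriptive strings instead of real client objects, every name here is hypothetical, and falling back to `custom` when agent_type is missing is an assumption.

```python
def build_custom(cfg: dict) -> str:
    return f"in-process Python agent '{cfg['agent_name']}'"


def build_chat_completions(cfg: dict) -> str:
    return f"HTTP POST to {cfg['api_config']['endpoint']}"


def build_a2a(cfg: dict) -> str:
    return f"A2A endpoint at {cfg['api_config']['endpoint']}"


# Registry mapping each supported agent_type to its builder.
BUILDERS = {
    "custom": build_custom,
    "chat_completions": build_chat_completions,
    "a2a": build_a2a,
}


def create_agent(agent_config: dict) -> str:
    """Pick a builder from the agent_type field of the config."""
    agent_type = agent_config.get("agent_type", "custom")
    if agent_type not in BUILDERS:
        raise ValueError(f"unsupported agent_type: {agent_type!r}")
    return BUILDERS[agent_type](agent_config)


config = {
    "agent_type": "a2a",
    "agent_name": "my-agent",
    "api_config": {"endpoint": "http://localhost:9999/agent"},
}
print(create_agent(config))
```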
Custom Agent (Python Class)
The custom agent type is the default integration method. It requires a Python class that extends BaseAgent and implements the get_chat_id and execute methods.
Implementation Pattern
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"
Sources: README.md
Returning Tool Calls
To enable tool call evaluation, return an AgentResponse object instead of a plain string. This allows the evaluator to assess whether the agent correctly invoked required tools.
from arksim.simulation_engine.tool_types import AgentResponse, ToolCall

async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    tool_calls = [
        ToolCall(
            id="call-1",  # id is a required field on ToolCall
            name="search_knowledge_base",
            arguments={"query": user_query},
        )
    ]
    return AgentResponse(
        content="Found relevant information in the knowledge base.",
        tool_calls=tool_calls,
    )
Traced Agent Variant
For agents using OpenTelemetry-based instrumentation, ArkSim supports automatic tool call capture through the TracingProcessor interface. This eliminates the need to explicitly return tool calls in AgentResponse.
Simulator sets routing context -> agent.execute() runs normally
-> SDK fires TracingProcessor.on_span_end -> arksim captures -> evaluator scores
Sources: examples/customer-service/README.md
Chat Completions Agent (HTTP API)
The chat_completions agent type connects to any HTTP endpoint implementing an OpenAI-compatible chat completions interface.
Configuration
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
Sources: README.md
Request Format
ArkSim sends requests following the OpenAI chat completions format:
{
"model": "agent-model",
"messages": [
{"role": "user", "content": "user query"}
]
}
Response Handling
The endpoint should return responses in the standard chat completions format:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "agent response"
}
}
]
}
Sources: examples/integrations/dify/README.md
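The request and response shapes above can be exercised without a live server. The sketch below builds a payload and extracts the assistant text from a response body using only the standard library. `build_request` and `extract_reply` are hypothetical helpers, and the optional `history` argument is an assumption, since the source does not state whether earlier turns are included in `messages`.

```python
import json


def build_request(user_query: str, history: tuple = (), model: str = "agent-model") -> dict:
    """Assemble a chat completions payload ending with the user's message."""
    messages = list(history) + [{"role": "user", "content": user_query}]
    return {"model": model, "messages": messages}


def extract_reply(response_body: str) -> str:
    """Pull the assistant text out of a chat completions response body."""
    data = json.loads(response_body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


payload = build_request("Where is my order?")
sample_response = json.dumps(
    {"choices": [{"message": {"role": "assistant", "content": "agent response"}}]}
)
print(json.dumps(payload))
print(extract_reply(sample_response))
```

An endpoint that returns an empty `choices` list is treated as an error here, so a malformed reply fails loudly instead of producing an empty turn.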
A2A Protocol Agent
The a2a agent type connects to agents implementing the Agent-to-Agent (A2A) protocol, enabling integration with multi-agent systems.
Configuration
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
Sources: README.md
A2A with Tool Calls
A2A agents can also surface tool calls for evaluation. The agent returns both the response content and tool call information in its A2A-formatted response.
CLI Initialization
ArkSim provides a CLI command to scaffold agent implementations:
arksim init --agent-type <type>
The --agent-type flag accepts the following values:
- custom (default) - Generates a Python agent file
- chat_completions - Configures HTTP endpoint connection
- a2a - Configures A2A protocol connection
The --force flag overwrites existing files.
Sources: arksim/cli.py:94-118
Integration Architecture
The following diagram illustrates how ArkSim communicates with different agent types:
graph TD
subgraph ArkSim["ArkSim"]
Simulator["Simulator Engine"]
Evaluator["Evaluator"]
end
subgraph AgentTypes["Agent Implementations"]
CustomAgent["Custom Agent (Python)"]
HTTPAgent["Chat Completions API"]
A2AAgent["A2A Protocol Agent"]
end
Simulator -->|execute| CustomAgent
Simulator -->|HTTP POST| HTTPAgent
Simulator -->|A2A Protocol| A2AAgent
CustomAgent -->|AgentResponse| Evaluator
HTTPAgent -->|JSON Response| Evaluator
A2AAgent -->|A2A Response| Evaluator
Agent Configuration in UI
The ArkSim web UI provides an "Agent Config" section for configuring agent connections. This section is dynamically loaded from the configuration YAML file.
<!-- Agent Config -->
<div class="t-surface rounded-xl border t-border p-5 mb-4">
<h2 class="font-semibold t-heading mb-1">Agent Config</h2>
<p class="text-xs t-caption mb-3">How arksim connects to your agent.</p>
</div>
Sources: arksim/ui/frontend/index.html
Framework Integrations
ArkSim includes pre-built integrations for popular agent frameworks through example projects:
| Framework | Integration File | Protocol |
|---|---|---|
| LangChain/LangGraph | custom_agent.py | Python class |
| CrewAI | custom_agent.py | Python class |
| AutoGen | custom_agent.py | Python class |
| Claude Agent SDK | custom_agent.py | Python class |
| Pydantic AI | custom_agent.py | Python class |
| LlamaIndex | custom_agent.py | Python class |
| Smolagents | custom_agent.py | Python class |
| OpenAI Agents SDK | custom_agent.py | Python class |
| Dify | custom_agent.py | HTTP API |
| Rasa | custom_agent.py | HTTP API |
| OpenClaw | config.yaml | Chat Completions |
Sources: examples/integrations/*/README.md
Running Simulations with Different Agent Types
Single Command
arksim simulate-evaluate config.yaml
Separate Steps
# Step 1: Simulate
arksim simulate config_simulate.yaml
# Step 2: Evaluate
arksim evaluate config_evaluate.yaml
Sources: examples/customer-service/README.md
Best Practices
- Tool Call Capture: For accurate evaluation, ensure your agent returns tool calls either explicitly via AgentResponse or implicitly through tracing instrumentation.
- Async Implementation: All agent implementations should use async/await patterns for proper integration with the simulation engine.
- Error Handling: Agents should handle errors gracefully and return meaningful error messages that can be evaluated by the system.
- Configuration Management: Use environment variables for sensitive configuration values like API keys and endpoints.
Sources: README.md
LLM Provider Integration
Related topics: Agent Types and Integration, Configuration System
Overview
The LLM Provider Integration system in ArkSim provides a unified abstraction layer for connecting to various Large Language Model (LLM) providers. This modular architecture enables the simulator to interact with different AI backends while maintaining a consistent interface for chat completions, streaming responses, and token usage tracking.
The system follows a provider-based pattern where each supported LLM service (OpenAI, Anthropic, Google, Azure OpenAI) implements a common interface defined in the base class. This design allows users to swap between providers without changing the core simulation logic.
Architecture
System Components
graph TD
A[Simulation Engine] --> B[LLM Manager<br/>arksim/llms/chat/llm.py]
B --> C[Base LLM<br/>arksim/llms/chat/base/base_llm.py]
C --> D[OpenAI Provider<br/>providers/openai.py]
C --> E[Anthropic Provider<br/>providers/anthropic.py]
C --> F[Google Provider<br/>providers/google.py]
C --> G[Azure OpenAI Provider<br/>providers/azure_openai.py]
H[Configuration YAML] --> B
I[Environment Variables] --> D
I --> E
I --> F
I --> G
Provider Class Hierarchy
graph TD
A[BaseLLM<br/>base/base_llm.py] --> B[OpenAIProvider<br/>providers/openai.py]
A --> C[AnthropicProvider<br/>providers/anthropic.py]
A --> D[GoogleProvider<br/>providers/google.py]
A --> E[AzureOpenAIProvider<br/>providers/azure_openai.py]
F[Provider Enum] --> A
G[ChatMessage Model] --> A
H[ChatCompletionResponse] --> A
Supported Providers
ArkSim supports the following LLM providers through dedicated provider implementations:
| Provider | Provider Class | API Style | Streaming Support |
|---|---|---|---|
| OpenAI | OpenAIProvider | OpenAI API | Yes |
| Anthropic | AnthropicProvider | Anthropic API | Yes |
| Google | GoogleProvider | Google AI API | Yes |
| Azure OpenAI | AzureOpenAIProvider | Azure API | Yes |
Each provider class inherits from BaseLLM and implements provider-specific API call logic while adhering to the common interface contract.
Base LLM Interface
The BaseLLM class defines the contract that all provider implementations must follow. This ensures consistent behavior across different LLM backends.
Core Methods
| Method | Purpose | Parameters |
|---|---|---|
| chat() | Send a chat completion request | messages, model, temperature, max_tokens, **kwargs |
| chat_stream() | Stream chat completion responses | messages, model, temperature, max_tokens, **kwargs |
| count_tokens() | Calculate token usage for messages | messages, model |
Data Models
The base module defines essential data structures used across all providers:
| Model | Purpose |
|---|---|
| ChatMessage | Represents a single message with role and content |
| ChatCompletionResponse | Wraps the API response from providers |
| Provider | Enum identifying supported LLM providers |
| ModelInfo | Metadata about available models per provider |
Provider Implementations
OpenAI Provider
The OpenAI provider connects to OpenAI's API endpoints for chat completions. It supports both standard and streaming responses.
Configuration Requirements:
| Parameter | Source | Description |
|---|---|---|
| OPENAI_API_KEY | Environment variable | API key for authentication |
| model | Config/Parameter | Model identifier (e.g., gpt-4, gpt-4-turbo) |
API Endpoint:
POST https://api.openai.com/v1/chat/completions
Sources: arksim/llms/chat/providers/openai.py
Anthropic Provider
The Anthropic provider integrates with Anthropic's Claude models through their API. It handles the distinct message format and API conventions used by Anthropic.
Configuration Requirements:
| Parameter | Source | Description |
|---|---|---|
| ANTHROPIC_API_KEY | Environment variable | API key for authentication |
| model | Config/Parameter | Model identifier (e.g., claude-3-opus-20240229) |
API Endpoint:
POST https://api.anthropic.com/v1/messages
Sources: arksim/llms/chat/providers/anthropic.py
Google Provider
The Google provider connects to Google's Gemini models via the Google AI API.
Configuration Requirements:
| Parameter | Source | Description |
|---|---|---|
| GOOGLE_API_KEY | Environment variable | API key for authentication |
| model | Config/Parameter | Model identifier (e.g., gemini-pro) |
API Endpoint:
POST https://generativelanguage.googleapis.com/v1/models/{model}:generateContent
Sources: arksim/llms/chat/providers/google.py
Azure OpenAI Provider
The Azure OpenAI provider enables integration with Azure-hosted OpenAI models, supporting enterprise deployments with Azure-specific authentication and endpoint configuration.
Configuration Requirements:
| Parameter | Source | Description |
|---|---|---|
| AZURE_OPENAI_API_KEY | Environment variable | API key for Azure authentication |
| AZURE_OPENAI_ENDPOINT | Environment variable | Azure endpoint URL |
| AZURE_OPENAI_DEPLOYMENT | Config | Deployment name in Azure |
| AZURE_OPENAI_API_VERSION | Config | Azure API version |
Sources: arksim/llms/chat/providers/azure_openai.py
Configuration
YAML Configuration Structure
LLM providers are configured through the config.yaml file used by the simulator:
llm:
  provider: openai  # or anthropic, google, azure_openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 2048
agent_config:
  # Agent-specific LLM settings
  provider: anthropic
  model: claude-3-opus-20240229
Environment Variables
| Variable | Providers | Purpose |
|---|---|---|
| OPENAI_API_KEY | OpenAI, AutoGen, LangChain | OpenAI API authentication |
| ANTHROPIC_API_KEY | Claude Agent SDK | Anthropic API authentication |
| GOOGLE_API_KEY | Google | Google AI API authentication |
| AZURE_OPENAI_API_KEY | Azure OpenAI | Azure API authentication |
| AZURE_OPENAI_ENDPOINT | Azure OpenAI | Azure resource endpoint |
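A quick pre-flight check can catch missing credentials before a long simulation starts. The sketch below uses the variable names from the table; the mapping `REQUIRED_ENV`, the provider keys, and the helper `missing_env` are hypothetical and not part of ArkSim.

```python
import os

# Environment variables each provider needs, per the table above.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "google": ["GOOGLE_API_KEY"],
    "azure_openai": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
}


def missing_env(provider: str, environ=None) -> list:
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


# Pass a dict to check a hypothetical environment without touching os.environ.
print(missing_env("azure_openai", {"AZURE_OPENAI_API_KEY": "example-key"}))
```

Calling `missing_env("openai")` with no second argument checks the real process environment.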
Integration with Custom Agents
The LLM provider system integrates with the custom agent connector pattern used in ArkSim simulations. Custom agents can be configured to use any supported LLM provider.
graph LR
A[Scenario JSON] --> B[Simulation Engine]
B --> C[Custom Agent<br/>custom_agent.py]
C --> D[LLM Provider]
D --> E[External LLM API]
Integration Examples
ArkSim provides integration examples for various agent frameworks:
| Framework | Example Path | LLM Provider Used |
|---|---|---|
| LangChain/LangGraph | examples/integrations/langchain/ | OpenAI |
| Claude Agent SDK | examples/integrations/claude-agent-sdk/ | Anthropic |
| LlamaIndex | examples/integrations/llamaindex/ | OpenAI |
| CrewAI | examples/integrations/crewai/ | OpenAI |
| AutoGen | examples/integrations/autogen/ | OpenAI |
| Pydantic AI | examples/integrations/pydantic-ai/ | OpenAI |
| Smolagents | examples/integrations/smolagents/ | OpenAI |
| Dify | examples/integrations/dify/ | Custom HTTP |
Each integration demonstrates how to connect the framework's agent to ArkSim's simulation engine while delegating LLM calls to the appropriate provider.
Usage Flow
sequenceDiagram
participant User
participant Config as config.yaml
participant LLM as LLM Manager
participant Provider as Provider Class
participant API as External LLM API
User->>Config: Load configuration
User->>LLM: Initialize with provider type
LLM->>Provider: Create provider instance
User->>LLM: chat(messages, model)
LLM->>Provider: _chat_completion()
Provider->>API: HTTP POST request
API-->>Provider: Completion response
Provider-->>LLM: Normalized response
LLM-->>User: ChatCompletionResponse
Adding New Providers
To add support for a new LLM provider:
- Create a new provider class inheriting from BaseLLM
- Implement the required methods: chat(), chat_stream(), count_tokens()
- Add the provider to the Provider enum in base_llm.py
- Update the provider factory logic in llm.py to instantiate the new provider
- Add integration tests and documentation
Sources: arksim/llms/chat/base/base_llm.py
Sources: arksim/llms/chat/llm.py
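The steps above can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: `BaseLLM` and `Provider` here are local stand-ins whose method signatures are guessed, and `EchoLLM`, `Provider.ECHO`, and `create_llm` are invented for the example. The real classes in base_llm.py and llm.py may differ.

```python
from abc import ABC, abstractmethod
from enum import Enum


class Provider(str, Enum):
    # Stand-in for the Provider enum; ECHO is a made-up member for this sketch.
    ECHO = "echo"


class BaseLLM(ABC):
    # Stand-in base class; the real BaseLLM signatures may differ.
    @abstractmethod
    def chat(self, messages, model): ...

    @abstractmethod
    def chat_stream(self, messages, model): ...

    @abstractmethod
    def count_tokens(self, text): ...


class EchoLLM(BaseLLM):
    """Toy provider that repeats the last message instead of calling an API."""

    def chat(self, messages, model):
        return messages[-1]["content"]

    def chat_stream(self, messages, model):
        # Yield the reply word by word to mimic streamed chunks.
        yield from self.chat(messages, model).split()

    def count_tokens(self, text):
        return len(text.split())  # crude whitespace count, not a real tokenizer


# Factory step: map each enum member to the class that implements it.
_REGISTRY = {Provider.ECHO: EchoLLM}


def create_llm(provider):
    """Instantiate the provider registered under the given name."""
    return _REGISTRY[Provider(provider)]()
```

Keeping the registry keyed by the enum means an unknown provider name fails early with a `ValueError` from `Provider(...)` rather than later during a chat call.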
Error Handling
Provider implementations handle common error scenarios:
| Error Type | Handling Strategy |
|---|---|
| Authentication failures | Raise AuthenticationError with helpful message |
| Rate limiting | Implement automatic retry with backoff |
| Invalid request parameters | Raise ValidationError with parameter details |
| Network timeouts | Retry with exponential backoff |
| Model not found | Raise ModelNotFoundError listing available models |
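The retry rows in the table can be illustrated with a generic exponential backoff helper. This is a sketch of the general technique, not ArkSim's implementation: `TransientError` and `call_with_backoff` are hypothetical names, and the delay values are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limits, timeouts)."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying TransientError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as 0.5s, 1s, 2s, ... plus up to 10% jitter
            # so that parallel workers do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Non-retryable errors such as authentication failures are deliberately not caught, since repeating the request would not change the outcome.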
Best Practices
- API Key Security: Store API keys in environment variables, never in configuration files committed to version control. ArkSim automatically loads keys from environment variables.
- Token Tracking: Use the count_tokens() method to monitor token usage and estimate costs before running large-scale simulations.
- Streaming for Large Responses: Enable streaming (chat_stream()) for scenarios expecting long agent responses to improve perceived responsiveness.
- Provider Selection: Choose Azure OpenAI for enterprise deployments requiring compliance certifications and dedicated infrastructure.
- Model Selection: Refer to the integration examples for recommended model configurations per provider and use case.
Sources: arksim/llms/chat/providers/openai.py
Tool Call Capture
Related topics: Evaluation System, Agent Types and Integration
Tool Call Capture is a core mechanism in ArkSim that observes and records every tool or function invocation made by an agent during a simulation. This captured data feeds directly into the evaluator, enabling metrics like tool call accuracy, error detection, and trajectory analysis.
Overview
Tool Call Capture serves as the bridge between agent execution and evaluation. Without accurate tool call capture, the evaluator cannot determine whether an agent:
- Invoked the correct tools
- Passed valid arguments
- Handled errors appropriately
- Followed the expected execution trajectory
ArkSim supports three distinct capture mechanisms:
- Explicit capture via AgentResponse.tool_calls
- Automatic capture via the ArksimTracingProcessor
- A2A protocol capture via task artifacts
All three methods produce the same underlying ToolCall data model, ensuring consistent evaluation regardless of how the agent is implemented.
Data Models
ToolCall
The ToolCall class represents a single tool invocation observed during a turn:
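from typing import Any

from pydantic import BaseModel, ConfigDict, Field


class ToolCall(BaseModel):
    # Unknown fields are dropped instead of raising a validation error.
    model_config = ConfigDict(extra="ignore")
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None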
| Field | Type | Description |
|---|---|---|
| id | str | Unique identifier for this tool call |
| name | str | Name of the tool/function invoked |
| arguments | dict[str, Any] | Arguments passed to the tool |
| result | `str \| None` | Response returned by the tool |
| error | `str \| None` | Error message if the call failed |
| source | `ToolCallSource \| None` | Origin of the capture data |
The extra="ignore" configuration ensures forward compatibility with future versions that add new fields.
ToolCallSource
The ToolCallSource enum indicates how tool call data was captured:
class ToolCallSource(str, Enum):
    AGENT_RESPONSE = "agent_response"
    TRACING_PROCESSOR = "tracing_processor"
    A2A_PROTOCOL = "a2a_protocol"
AgentResponse
For explicit capture, agents return structured responses:
class AgentResponse(BaseModel):
    model_config = ConfigDict(extra="ignore")
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)
Capture Methods
Explicit Capture (AgentResponse)
The default approach requires agents to explicitly return tool calls alongside their text response. This is the standard pattern for custom agents implementing BaseAgent.
graph TD
A[Simulator invokes agent] --> B[Agent executes user_query]
B --> C[Agent returns AgentResponse]
C --> D[Simulator extracts tool_calls]
D --> E[Evaluator scores trajectory]
Implementation pattern:
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse, ToolCall


class MyAgent(BaseAgent):
    async def execute(self, user_query: str, **kwargs) -> str | AgentResponse:
        # Agent logic here...
        # Report each tool invocation made during this turn.
        tool_calls = [
            ToolCall(
                id="call_1",
                name="get_order_status",
                arguments={"order_id": "ORD-1001"},
                source="agent_response",
            )
        ]
        return AgentResponse(
            content="The order status is shipped.",
            tool_calls=tool_calls,
        )
Sources: examples/customer-service/custom_agent.py
Tracing Processor (Automatic Capture)
The ArksimTracingProcessor uses the OpenAI Agents SDK's tracing interface to capture tool calls automatically, without requiring explicit return data.
graph TD
A[Simulator sets routing context] --> B[Agent executes normally]
B --> C[SDK fires on_span_end]
C --> D[ArksimTracingProcessor captures]
D --> E[Evaluator scores trajectory]
Registration pattern:
from agents import Agent as SDKAgent
from arksim.tracing.openai import ArksimTracingProcessor
# Register once at module load
_tracing_processor = ArksimTracingProcessor()
The traced agent returns a plain str rather than AgentResponse:
async def execute(self, user_query: str, **kwargs) -> str:
    result = await Runner.run(self._sdk_agent, user_query)
    return result.final_output
Sources: examples/customer-service/traced_agent.py
A2A Protocol Capture
For agents implementing the Agent-to-Agent (A2A) protocol, tool calls are embedded in task artifacts using the A2AToolCaptureExtension.
graph TD
A[Simulator sends task to A2A Agent] --> B[Agent processes request]
B --> C[Agent returns Task with artifacts]
C --> D[Simulator extracts from metadata]
D --> E[Evaluator scores]
Tool call extraction from artifacts:
def _extract_tool_calls_from_artifact(self, artifact: Artifact) -> list[ToolCall]:
    # Artifact metadata is optional, so fall back to an empty dict.
    metadata = artifact.metadata or {}
    raw_calls = metadata.get("tool_calls", [])
    tool_calls = []
    for raw in raw_calls:
        # A call without a tool name cannot be scored, so skip it.
        name = raw.get("name")
        if not name:
            continue
        # Arguments must be a mapping to fit the ToolCall model.
        arguments = raw.get("arguments", {})
        if not isinstance(arguments, dict):
            continue
        tool_calls.append(
            ToolCall(
                id=raw.get("id", ""),
                name=name,
                arguments=arguments,
                result=A2AAgent._coerce_to_string(raw.get("result")),
                error=A2AAgent._coerce_to_string(raw.get("error")),
                source=ToolCallSource.A2A_PROTOCOL,
            )
        )
    return tool_calls
Sources: arksim/simulation_engine/agent/clients/a2a.py
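The excerpt calls `A2AAgent._coerce_to_string` without showing it. One plausible behavior, given that `ToolCall.result` and `ToolCall.error` are typed `str | None`, is sketched below. This is a guess at the intent, not the actual helper: the function name `coerce_to_string` and the JSON fallback are assumptions.

```python
from __future__ import annotations

import json
from typing import Any


def coerce_to_string(value: Any) -> str | None:
    """Normalize a raw result or error value to the str | None shape ToolCall expects."""
    if value is None:
        return None
    if isinstance(value, str):
        return value
    try:
        # Structured payloads (dicts, lists, numbers) become JSON text.
        return json.dumps(value, ensure_ascii=False)
    except (TypeError, ValueError):
        return str(value)  # last resort for objects JSON cannot serialize
```

Serializing to JSON rather than calling `str()` directly keeps dictionary results machine-readable in the stored transcript.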
A2A agent card declaration:
from arksim.simulation_engine.tool_types import A2AToolCaptureExtension

_capabilities = AgentCapabilities(
    streaming=False,
    extensions=[A2AToolCaptureExtension],
)
Workflow Diagram
The following diagram shows the complete simulation pipeline with tool call capture:
flowchart LR
subgraph Simulation
A[User Query] --> B[Simulator]
B --> C{Agent Type}
C -->|Custom| D[explicit tool_calls]
C -->|Traced| E[TracingProcessor]
C -->|A2A| F[Artifact metadata]
D --> G[Captured Tool Calls]
E --> G
F --> G
end
subgraph Evaluation
G --> H[Evaluator]
H --> I[Tool Call Metrics]
H --> J[Error Detection]
H --> K[Trajectory Analysis]
end
G --> L[Results/Report]
I --> L
J --> L
K --> L
Comparison of Capture Methods
| Aspect | Explicit (AgentResponse) | Traced (TracingProcessor) | A2A Protocol |
|---|---|---|---|
| Return type | AgentResponse | str | Via protocol |
| Implementation complexity | Medium | Low | High |
| Agent code changes | Required | Minimal | Protocol required |
| Best for | Custom Python agents | SDK-based agents | A2A-native agents |
| Tool call source field | agent_response | tracing_processor | a2a_protocol |
Sources: examples/customer-service/README.md
Configuration
Trace Receiver Settings
For traced agents, enable the trace receiver in the simulation config:
simulation:
  max_turns: 10
  trace_receiver:
    enabled: true
    wait_timeout: 5  # seconds to wait for traces
When Tracing is Disabled
When trace_receiver.enabled is false or omitted, ArkSim falls back to explicit AgentResponse capture:
When trace_receiver.enabled is false or omitted, arksim only captures tool calls from AgentResponse (the standard path).
Integration with Evaluation
Captured tool calls flow into the evaluator's scoring pipeline:
- Trajectory matching - Compare actual tool sequence against expected
- Argument validation - Verify tool arguments match scenario requirements
- Error detection - Identify tool call failures and their handling
- Coverage analysis - Determine if all required tools were invoked
The evaluator uses the source field to differentiate between capture methods when analyzing behavioral patterns.
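Two of the checks listed above, trajectory matching and coverage analysis, can be illustrated on plain lists of tool names. The source does not say how the evaluator implements them, so this sketch uses one possible interpretation: the expected tools must appear in order, with extra calls allowed in between. Both function names are hypothetical.

```python
def trajectory_matches(actual, expected):
    """True if the expected tools appear in order within the actual sequence."""
    remaining = iter(actual)
    # `name in remaining` advances the iterator, so order is enforced.
    return all(name in remaining for name in expected)


def coverage(actual, required):
    """Fraction of required tools that were invoked at least once."""
    required = set(required)
    if not required:
        return 1.0
    return len(required & set(actual)) / len(required)
```

A stricter evaluator might require an exact sequence match instead; the subsequence rule is used here only because it tolerates harmless extra lookups.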
Forward Compatibility
Both ToolCall and AgentResponse declare extra="ignore" in their Pydantic configuration:
Declares extra="ignore" explicitly so snapshots from future arksim versions that add new fields can still be loaded by older arksim without raising a ValidationError.
This ensures that simulation results captured with newer versions remain loadable by older versions of the evaluator.
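The effect of ignoring extra fields can be shown without Pydantic. The sketch below is a standard-library stand-in using dataclasses, not the real model: `ToolCallRecord` and `load_ignoring_extras` are hypothetical names, and `latency_ms` is an invented example of a field a future version might add.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ToolCallRecord:
    # Stand-in for ToolCall; field names follow the model described earlier.
    id: str
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)
    result: str | None = None
    error: str | None = None


def load_ignoring_extras(raw: dict[str, Any]) -> ToolCallRecord:
    """Keep only known fields, dropping keys a newer version may have added."""
    known = {f.name for f in fields(ToolCallRecord)}
    return ToolCallRecord(**{k: v for k, v in raw.items() if k in known})
```

Loading `{"id": "call_1", "name": "get_order_status", "latency_ms": 12}` succeeds and simply drops `latency_ms`, which mirrors what `extra="ignore"` does for the Pydantic models.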
Scenario Management
Related topics: Simulation Engine, Configuration System
Overview
Scenario Management is the core system in ArkSim for defining, loading, validating, and executing test scenarios that simulate user interactions with an AI agent. A scenario represents a structured test case that defines a user's goal, knowledge base, behavioral characteristics, and expected outcomes.
Scenarios serve as the foundation for the simulation and evaluation pipeline, enabling reproducible testing of agent behavior across diverse conversation patterns and user profiles.
Scenario Data Model
Core Scenario Structure
Each scenario in ArkSim is a JSON object containing the following primary fields:
| Field | Type | Required | Description |
|---|---|---|---|
| scenario_id | string | Yes | Unique identifier for the scenario |
| name | string | Yes | Human-readable scenario name |
| description | string | No | Detailed description of the scenario's purpose |
| user_profile | object | Yes | User persona characteristics |
| goal | object | Yes | The user's primary objective |
| knowledge | array | Yes | Contextual knowledge the user has access to |
| expected_behavior | object | No | Expected agent responses or behaviors |
| metrics | array | No | Custom evaluation criteria |
User Profile Schema
{
  "name": "string",
  "age": "number",
  "personality": "string",
  "background": "string",
  "communication_style": "string"
}
Sources: arksim/scenario/entities.py
Goal Structure
{
  "primary": "string (main user objective)",
  "secondary": ["array of secondary objectives"],
  "constraints": ["array of constraints or boundaries"],
  "success_criteria": "string"
}
Scenario Management Workflow
graph TD
A[Create/Load Scenarios] --> B[Validate Scenario Schema]
B --> C{Valid?}
C -->|Yes| D[Save to scenarios.json]
C -->|No| E[Show Validation Errors]
E --> A
D --> F[Configure Simulation Parameters]
F --> G[Run Simulation]
G --> H[Generate Results]
H --> I[Evaluation & Reporting]
Scenario Loading and Validation
The UI provides interactive scenario management through the "Build" page:
- Load Existing - Users can load pre-existing scenario files via file path input
- Auto-generate (PRO) - Automatic scenario generation from agent knowledge base
- Manual Creation - Build scenarios through the UI interface
// Scenario file validation in UI
@input="validateScenarioFile()"
@blur="validateScenarioFile()"
@keydown.enter="loadScenarioFile()"
Sources: arksim/ui/frontend/index.html
Validation Rules
- Scenario files must be valid JSON
- Required fields must be present
- scenario_id must be unique within the file
- user_profile must contain at minimum a name field
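The rules above can be sketched as a small standalone checker. This is not ArkSim's validator: `validate_scenarios` and `REQUIRED_FIELDS` are hypothetical names, the required fields are taken from the "Required" column of the scenario table, and the top-level "scenarios" array follows the example file format.

```python
import json

# Fields marked as required in the scenario table.
REQUIRED_FIELDS = ("scenario_id", "name", "user_profile", "goal", "knowledge")


def validate_scenarios(text):
    """Return a list of problems found; an empty list means the file passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict) or not isinstance(data.get("scenarios"), list):
        return ['expected an object with a "scenarios" array']

    errors, seen = [], set()
    for index, scenario in enumerate(data["scenarios"]):
        if not isinstance(scenario, dict):
            errors.append(f"scenario {index}: expected an object")
            continue
        for key in REQUIRED_FIELDS:
            if key not in scenario:
                errors.append(f"scenario {index}: missing required field {key!r}")
        scenario_id = scenario.get("scenario_id")
        if isinstance(scenario_id, str):
            if scenario_id in seen:
                errors.append(f"scenario {index}: duplicate scenario_id {scenario_id!r}")
            seen.add(scenario_id)
        profile = scenario.get("user_profile")
        if isinstance(profile, dict) and "name" not in profile:
            errors.append(f"scenario {index}: user_profile needs a name")
    return errors
```

Collecting all problems in one pass, rather than stopping at the first, lets an author fix a scenario file in a single edit.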
Scenario File Format
ArkSim uses JSON format for scenario definitions. See the example structure:
{
  "scenarios": [
    {
      "scenario_id": "ecommerce-return-item-001",
      "name": "Return Defective Product",
      "description": "Customer wants to return a damaged item received last week",
      "user_profile": {
        "name": "John Smith",
        "age": 35,
        "personality": "patient but firm",
        "background": "Regular online shopper",
        "communication_style": "polite and direct"
      },
      "goal": {
        "primary": "Get a full refund for a damaged product",
        "secondary": ["Understand return process", "Know timeline for refund"],
        "constraints": ["Only willing to wait up to 14 days for refund"],
        "success_criteria": "Full refund issued or replacement offered"
      },
      "knowledge": [
        "Ordered product SKU-12345 on March 1, 2024",
        "Item arrived damaged with visible scratches",
        "Has original packaging and receipt",
        "Order number: ORD-987654"
      ],
      "expected_behavior": {
        "should_mention_order_number": true,
        "should_request_photo_evidence": false
      }
    }
  ]
}
Sources: examples/e-commerce/scenarios.json
Sources: examples/bank-insurance/scenarios.json
Sources: examples/customer-service/scenarios.json
Simulation Configuration
Scenarios are executed through the simulation engine with configurable parameters:
| Parameter | Default | Description |
|---|---|---|
| num_conversations_per_scenario | 5 | Number of conversations to simulate per scenario |
| max_turns | 5 | Maximum turns per conversation |
| num_workers | 50 | Parallel workers (or 'auto') |
| model | gpt-4o | LLM model for simulation |
| provider | openai | LLM provider |
model: str = Field(default=DEFAULT_MODEL, description="LLM model for simulation")
num_conversations_per_scenario: int = Field(
    default=5,
    description="Number of conversations per scenario to simulate"
)
max_turns: int = Field(default=5, description="Maximum turns per conversation")
Sources: arksim/simulation_engine/entities.py:18-25
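These parameters multiply, which matters when estimating run size and cost. The sketch below is a rough sizing aid, not an ArkSim class: `SimulationSettings` is a stand-in that copies the documented defaults, and the upper bound assumes one agent call per turn, which the source does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationSettings:
    # Defaults copied from the parameter table; the class itself is a stand-in.
    num_conversations_per_scenario: int = 5
    max_turns: int = 5

    def total_conversations(self, num_scenarios):
        """Conversations simulated across all scenarios."""
        return num_scenarios * self.num_conversations_per_scenario

    def max_agent_calls(self, num_scenarios):
        """Upper bound on agent calls, assuming one call per turn."""
        return self.total_conversations(num_scenarios) * self.max_turns
```

With the defaults, a file of 4 scenarios yields 20 conversations and at most 100 agent turns.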
Configuration via config.yaml
simulation:
  scenarios_file: ./scenarios.json
  num_conversations_per_scenario: 5
  max_turns: 5
  num_workers: auto
  model: gpt-4o
Built-in Scenario Templates
ArkSim provides template scenarios for common use cases:
{
"scenarios": [
{
"scenario_id": "template-basic-query",
"name": "Basic Information Query",
"description": "User asks a simple informational question",
"user_profile": {...},
"goal": {...},
"knowledge": [...]
}
]
}
Sources: arksim/templates/scenarios.json
Integration with CI/CD
Scenarios can be integrated into automated testing workflows:
# GitHub Actions workflow
steps:
  - name: Run Scenario Tests
    run: arksim simulate-evaluate config.yaml
Sources: examples/ci/README.md
Required CI Setup
- Update TODO sections in arksim.yml (startup command and health-check URL)
- Create tests/arksim/config.yaml pointing to your server endpoint
- Create tests/arksim/scenarios.json with your test cases
- Add custom metrics to tests/arksim/custom_metrics/ if needed
- Configure OPENAI_API_KEY in GitHub secrets
Best Practices
Scenario Design
- Distinct Goals: Each scenario should test a single, well-defined user goal
- Realistic Profiles: User profiles should reflect actual customer demographics
- Sufficient Knowledge: Include enough context for the simulated user to maintain coherent conversation
File Organization
project/
├── config.yaml
├── scenarios.json
├── custom_metrics/
│   └── my_metric.py
└── results/
    └── simulation/
Version Control
- Commit scenarios.json with your agent code
- Use descriptive scenario_id values with prefixes (e.g., ecommerce-, support-)
- Document success criteria clearly in the goal.success_criteria field
Command Line Usage
Initialize with default scenarios
arksim init
Run simulation with scenarios
arksim simulate-evaluate config.yaml
Results are written to ./results/simulation/simulation.json.
Sources: README.md
Sources: examples/integrations/dify/README.md
Sources: arksim/scenario/entities.py
Configuration System
Related topics: Agent Types and Integration, LLM Provider Integration, Evaluation System
ArkSim provides a flexible, multi-layered configuration system that supports both YAML-based declarative configuration and programmatic Python class configuration. The system enables users to define simulation scenarios, evaluation metrics, agent connections, and runtime behavior through structured configuration files or direct Python implementations.
Architecture Overview
The configuration system is organized into several key modules that handle different aspects of simulation and evaluation:
graph TD
A[User Configuration] --> B[YAML Files]
A --> C[Python Classes]
B --> D[config.yaml]
B --> E[config_simulate.yaml]
B --> F[config_evaluate.yaml]
C --> G[BaseAgent Subclass]
C --> H[Custom Metrics]
D --> I[Config Loader]
E --> I
F --> I
G --> J[Simulation Engine]
H --> K[Evaluator]
I --> J
J --> K
J --> L[Results/Reports]
K --> L
The system separates configuration into three distinct phases: simulation, evaluation, and combined execution. This separation allows users to run simulations independently from evaluation, which is useful for debugging and iterative development workflows.
Configuration Files
ArkSim uses YAML configuration files to define simulation parameters, evaluation settings, and agent connections. The repository includes three primary configuration templates at the root level.
Main Configuration (config.yaml)
The main configuration file combines both simulation and evaluation settings into a single file. This is the recommended approach for simple use cases where both phases run together using the simulate-evaluate command. The file structure typically includes agent configuration, scenario definitions, metric selections, and evaluation parameters.
Separate Phase Configuration
For more complex workflows, ArkSim supports splitting configuration into separate files:
| File | Purpose | CLI Command |
|---|---|---|
| config_simulate.yaml | Simulation-only parameters | arksim simulate config_simulate.yaml |
| config_evaluate.yaml | Evaluation-only parameters | arksim evaluate config_evaluate.yaml |
| config.yaml | Combined configuration | arksim simulate-evaluate config.yaml |
Sources: examples/customer-service/README.md
This separation enables scenarios where simulation results are cached and evaluated multiple times with different metric configurations, or where the simulation phase runs in a different environment than evaluation.
Agent Configuration
The configuration system supports multiple agent connection types, defined through the agent_type field in the agent configuration section.
Supported Agent Types
| Agent Type | Description | Configuration Style |
|---|---|---|
| custom | Custom Python agent class inheriting from BaseAgent | Python class file |
| chat_completions | HTTP endpoint implementing Chat Completions API | YAML endpoint config |
| a2a | Agent-to-Agent protocol endpoint | YAML endpoint config |
Sources: arksim/cli.py
Custom Agent (Python Class)
The default agent type uses a Python class that extends BaseAgent. This approach provides maximum flexibility for integrating any agent framework. The agent class must implement two required methods:
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse


class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"
Sources: README.md
The execute method can return either a plain string response or an AgentResponse object that includes both text content and tool calls for evaluation.
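Because execute may return either type, code that consumes the result has to handle both. One way to do that is to wrap plain strings so everything downstream sees the same shape. The sketch below illustrates the idea with a dataclass stand-in; `AgentResponseStub` and `normalize` are hypothetical names, and the source does not say whether ArkSim normalizes results this way.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentResponseStub:
    # Stand-in for AgentResponse: text content plus captured tool calls.
    content: str
    tool_calls: list[dict] = field(default_factory=list)


def normalize(result: str | AgentResponseStub) -> AgentResponseStub:
    """Wrap a plain string so callers always get content and tool_calls."""
    if isinstance(result, str):
        return AgentResponseStub(content=result)
    return result
```

A plain string therefore becomes a response with an empty tool call list, which matches the behavior described for agents that do not report tool calls explicitly.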
Chat Completions Agent
For agents exposing an HTTP endpoint compatible with the OpenAI Chat Completions format, configuration is declarative:
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
Sources: README.md
A2A Protocol Agent
Agents implementing the Agent-to-Agent protocol are configured similarly:
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
A2A agents can also surface tool calls for evaluation through the protocol, enabling comprehensive testing of tool-using agents.
Scenario Configuration
Scenarios define the test cases that the simulator executes against the agent. The scenarios.json file contains an array of scenario objects, each representing a simulated conversation or interaction.
Scenario Structure
Each scenario typically includes:
- Scenario ID: Unique identifier for the scenario
- User query: The initial prompt or question presented to the agent
- Expected behavior: Criteria for successful completion
- Knowledge/context: Information the agent should know or have access to during the scenario
Scenario Files Location
Scenarios are referenced from the main configuration file and can be shared across different configuration files:
| Example Directory | Scenario Purpose |
|---|---|
| examples/bank-insurance/ | Financial services agent testing |
| examples/customer-service/ | Customer support scenarios |
| examples/integrations/*/ | Framework-specific testing |
| examples/ci/ | CI/CD integration testing |
Sources: examples/ci/README.md
Evaluation Metrics Configuration
The evaluator component scores agent responses against defined metrics. Configuration specifies which metrics to run and how they are calculated.
Built-in Metrics
ArkSim includes standard evaluation metrics covering common agent quality dimensions. These metrics are automatically available without additional configuration.
Custom Metrics
Users can define custom evaluation metrics by creating a Python module that implements the metric interfaces. The custom metrics file must define metrics following the provided schema:
from pydantic import BaseModel
from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
    format_chat_history,
)
Sources: examples/customer-service/custom_metrics.py
Custom metrics are referenced in the configuration file using the custom_metrics_file_paths parameter, and optionally listed in metrics_to_run to include them in the evaluation:
custom_metrics_file_paths:
- tests/arksim/custom_metrics/my_metric.py
metrics_to_run:
- custom_metric_name
Sources: examples/ci/README.md
Custom Metric Schema
Custom quantitative metrics should return a Pydantic model defining the score components:
class VerificationComplianceSchema(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str
Sources: examples/customer-service/custom_metrics.py
The system prompt for qualitative metrics defines the evaluation criteria that the LLM judge applies when scoring agent responses.
CLI Configuration Interface
The ArkSim CLI provides commands for initializing configurations, running simulations, and managing examples.
Initialization Command
The init command scaffolds starter files for agent testing:
arksim init --agent-type custom
| Flag | Options | Default | Description |
|---|---|---|---|
| --agent-type | custom, chat_completions, a2a | custom | Agent connection type |
| --force | boolean | false | Overwrite existing files |
Sources: arksim/cli.py
The --agent-type flag determines which template is generated:
- custom generates a Python agent file (no server needed)
- chat_completions generates YAML configuration for HTTP endpoints
- a2a generates YAML configuration for Agent-to-Agent protocol
Simulation Commands
| Command | Description |
|---|---|
| arksim simulate-evaluate config.yaml | Run simulation and evaluation in sequence |
| arksim simulate config_simulate.yaml | Run simulation only |
| arksim evaluate config_evaluate.yaml | Evaluate previously saved simulation results |
Sources: examples/customer-service/README.md
Examples Command
The CLI also provides access to example projects:
arksim examples # Download all examples
arksim examples bank-insurance # Download specific example
arksim examples --list # List available examples
Integration Configuration Patterns
Each integration example follows a consistent pattern with three standard files.
File Structure
| File | Purpose |
|---|---|
| custom_agent.py | Agent implementation connecting to the target framework |
| config.yaml | Simulation and evaluation settings |
| scenarios.json | Test scenarios for the example domain |
Supported Framework Integrations
ArkSim provides example configurations for the following agent frameworks:
| Framework | Installation | Example Directory |
|---|---|---|
| LangGraph | pip install langgraph langchain-openai | examples/integrations/langgraph/ |
| LangChain | pip install langgraph langchain-openai | examples/integrations/langchain/ |
| Claude Agent SDK | pip install claude-agent-sdk | examples/integrations/claude-agent-sdk/ |
| AutoGen | pip install autogen-agentchat autogen-ext[openai] | examples/integrations/autogen/ |
| CrewAI | pip install crewai | examples/integrations/crewai/ |
| Pydantic AI | pip install pydantic-ai | examples/integrations/pydantic-ai/ |
| Smolagents | pip install smolagents | examples/integrations/smolagents/ |
| Dify | HTTP integration | examples/integrations/dify/ |
| OpenClaw | Gateway token auth | examples/openclaw/ |
Sources: examples/integrations/*/README.md
Trace Receiver Configuration
For frameworks that support tracing, ArkSim includes a trace receiver component that captures tool calls automatically without requiring explicit AgentResponse returns.
Configuration Options
trace_receiver:
enabled: true
wait_timeout: 5
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable trace-based tool call capture |
| wait_timeout | integer | 5 | Seconds to wait for trace data |
Sources: examples/customer-service/README.md
When enabled is false or omitted, ArkSim only captures tool calls from AgentResponse objects returned by the agent's execute method.
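The wait_timeout setting implies a bounded wait for trace data after each turn. The sketch below shows that general pattern with asyncio; it is an assumption about the mechanism, not ArkSim's receiver. `collect_traced_calls` and the use of an `asyncio.Queue` as the delivery channel are both hypothetical.

```python
import asyncio


async def collect_traced_calls(queue, wait_timeout):
    """Wait up to wait_timeout seconds for one batch of traced tool calls."""
    try:
        return await asyncio.wait_for(queue.get(), timeout=wait_timeout)
    except asyncio.TimeoutError:
        # No trace arrived in time; treat the turn as having no traced calls.
        return []
```

A longer timeout reduces the chance of missing late traces at the cost of slower turns, so the value is worth tuning against how quickly the traced framework flushes its spans.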
Workflow Summary
The following diagram illustrates the configuration-driven workflow from specification to results:
graph LR
A[config.yaml] --> B[Scenario Loading]
C[scenarios.json] --> B
B --> D[Simulation Engine]
D --> E{Agent Type?}
E -->|custom| F[Python Agent]
E -->|chat_completions| G[HTTP Endpoint]
E -->|a2a| H[A2A Agent]
F --> I[Execute Scenarios]
G --> I
H --> I
I --> J[Simulation Results]
J --> K[Evaluator]
J --> L[Conversation Viewer]
K --> M[Evaluation Report]
M --> N[scores.json]
M --> O[failures.json]
Best Practices
Environment Variables: API keys should be set as environment variables rather than hardcoded in configuration files:
export OPENAI_API_KEY="<your-key>"
export ANTHROPIC_API_KEY="<your-key>"
Separation of Concerns: Use separate config_simulate.yaml and config_evaluate.yaml files when iterating on scenarios or metrics independently.
Custom Metrics Organization: Place custom metrics in dedicated directories (e.g., tests/arksim/custom_metrics/) and reference them from the main configuration.
Scenario Versioning: Keep scenarios in version-controlled JSON files that can be reviewed and updated as agent requirements evolve.
Sources: CONTRIBUTING.md
Sources: examples/customer-service/README.md
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Doramagic extracted 15 source-linked risk signals. Review them before installing or handing real data to the project.
1. Installation risk: v0.1.0
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.1.0. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.1.0
2. Configuration risk: v0.0.6
- Severity: medium
- Finding: Configuration risk is backed by a source signal: v0.0.6. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.0.6
3. Configuration risk: v0.3.2
- Severity: medium
- Finding: Configuration risk is backed by a source signal: v0.3.2. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.2
4. Configuration risk: v0.3.4
- Severity: medium
- Finding: Configuration risk is backed by a source signal: v0.3.4. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.4
5. Configuration risk: v0.3.5
- Severity: medium
- Finding: Configuration risk is backed by a source signal: v0.3.5. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.5
6. Capability assumption: v0.3.1
- Severity: medium
- Finding: Capability assumption is backed by a source signal: v0.3.1. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.1
7. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | README/documentation is current enough for a first validation pass.
8. Maintenance risk: v0.3.3
- Severity: medium
- Finding: Maintenance risk is backed by a source signal: v0.3.3. Treat it as a review item until the current version is checked.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.3.3
9. Maintenance risk: Maintainer activity is unknown
- Severity: medium
- Finding: Maintainer activity is unknown because the source record has no last_activity_observed value. Treat it as a review item until recent activity is checked.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | last_activity_observed missing
10. Security or permission risk: no_demo
- Severity: medium
- Finding: The downstream validation record carries the flag no_demo (no demo was recorded for the project).
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: downstream_validation.risk_items | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | no_demo; severity=medium
11. Security or permission risk: no_demo
- Severity: medium
- Finding: The scoring-risk record carries the flag no_demo (no demo was recorded for the project).
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: risks.scoring_risks | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | no_demo; severity=medium
12. Security or permission risk: v0.2.0
- Severity: medium
- Finding: Security or permission risk is backed by a source signal: v0.2.0. Treat it as a review item until the current version is checked.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/arklexai/arksim/releases/tag/v0.2.0
Source: Doramagic discovery, validation, and Project Pack records
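The recommended check repeated in the items above (confirm whether a release-tagged signal still applies to the current version) can be partly triaged by comparing the flagged tags with the version you actually run. The following is a minimal sketch, not part of arksim: the helper names parse_tag and triage are hypothetical, the tag list covers only the release tags visible in the items above, and it assumes the tags are plain dotted numbers with an optional leading "v".

```python
import re

# Release tags named in the risk items above (hypothetical constant name).
FLAGGED_TAGS = ["v0.0.6", "v0.3.2", "v0.3.4", "v0.3.5", "v0.3.1", "v0.3.3", "v0.2.0"]


def parse_tag(tag: str) -> tuple[int, ...]:
    """Turn 'v0.3.2' or '0.3.2' into (0, 3, 2) so versions compare numerically."""
    match = re.fullmatch(r"v?(\d+(?:\.\d+)*)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def triage(installed: str, flagged: list[str] = FLAGGED_TAGS) -> dict[str, list[str]]:
    """Group flagged tags by whether they are older than, equal to, or newer
    than the installed version. Older tags still need a manual read of the
    release notes; this only orders the review."""
    current = parse_tag(installed)
    groups: dict[str, list[str]] = {"older": [], "same": [], "newer": []}
    for tag in sorted(flagged, key=parse_tag):
        release = parse_tag(tag)
        if release < current:
            groups["older"].append(tag)
        elif release == current:
            groups["same"].append(tag)
        else:
            groups["newer"].append(tag)
    return groups


if __name__ == "__main__":
    # Example: reviewing against an installed 0.3.3.
    for group, tags in triage("0.3.3").items():
        print(group, tags)
```

The installed version could be read with importlib.metadata.version("arksim"), assuming the distribution is installed under that name. A tag in the "older" group is not automatically resolved; the grouping only tells you which release notes to open first.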
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using arksim with real data or production workflows.
- v0.3.7 - github / github_release
- v0.3.6 - github / github_release
- v0.3.5 - github / github_release
- v0.3.4 - github / github_release
- v0.3.3 - github / github_release
- v0.3.2 - github / github_release
- v0.3.1 - github / github_release
- v0.3.0 - github / github_release
- v0.2.0 - github / github_release
- v0.1.0 - github / github_release
- v0.0.6 - github / github_release
- Zhou Yu (@Zhou_Yu_AI) / Highlights / X - x / searxng_indexed
Source: Project Pack community evidence and pitfall evidence
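The release entries above are listed by tag only. As a convenience for the review step, the sketch below rebuilds a link for each tag using the URL pattern that appears in the evidence lines of the risk items (https://github.com/arklexai/arksim/releases/tag/<tag>). It is an assumption that every listed tag follows the same pattern; the names RELEASE_URL, RELEASE_TAGS, and release_links are hypothetical.

```python
# URL pattern copied from the evidence lines in the risk items; assumed to
# hold for every tag in the list below.
RELEASE_URL = "https://github.com/arklexai/arksim/releases/tag/{tag}"

# Release tags from the community evidence list, newest first.
RELEASE_TAGS = [
    "v0.3.7", "v0.3.6", "v0.3.5", "v0.3.4", "v0.3.3", "v0.3.2",
    "v0.3.1", "v0.3.0", "v0.2.0", "v0.1.0", "v0.0.6",
]


def release_links(tags: list[str] = RELEASE_TAGS) -> list[str]:
    """Return one release-page URL per tag, preserving the input order."""
    return [RELEASE_URL.format(tag=tag) for tag in tags]


if __name__ == "__main__":
    # Print a checklist of pages to open before using the pack.
    for url in release_links():
        print(url)
```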