# https://github.com/arklexai/arksim Project Documentation

Generated: 2026-05-15 12:11:39 UTC

## Table of Contents

- [Introduction to ArkSim](#intro)
- [Quickstart Guide](#quickstart)
- [System Architecture Overview](#arch-overview)
- [Simulation Engine](#simulation-engine)
- [Evaluation System](#evaluation-system)
- [Agent Types and Integration](#agent-types)
- [LLM Provider Integration](#llm-providers)
- [Tool Call Capture](#tool-call-capture)
- [Scenario Management](#scenario-management)
- [Configuration System](#configuration)

<a id='intro'></a>

## Introduction to ArkSim

### Related Pages

Related topics: [Quickstart Guide](#quickstart), [System Architecture Overview](#arch-overview)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/arklexai/arksim/blob/main/README.md)
- [CONTRIBUTING.md](https://github.com/arklexai/arksim/blob/main/CONTRIBUTING.md)
- [CLAUDE.md](https://github.com/arklexai/arksim/blob/main/CLAUDE.md)
- [arksim/simulation_engine/tool_types.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/tool_types.py)
- [examples/customer-service/custom_metrics.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_metrics.py)
- [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)
- [examples/integrations/langgraph/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/langgraph/README.md)
- [examples/integrations/autogen/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/autogen/README.md)
- [examples/ci/README.md](https://github.com/arklexai/arksim/blob/main/examples/ci/README.md)
</details>

# Introduction to ArkSim

ArkSim is a multi-turn agent evaluation and simulation platform designed to test, measure, and improve AI agent quality across conversational scenarios. It enables developers to define test scenarios, simulate user interactions with their agents, and generate comprehensive evaluation reports with quantitative and qualitative metrics.

## Overview

ArkSim addresses the challenge of systematically evaluating conversational AI agents by providing:

- **Scenario-based simulation**: Define test cases with user profiles, knowledge bases, and expected behaviors
- **Automated evaluation**: Score agent responses across multiple dimensions including goal completion, helpfulness, and coherence
- **Framework integration**: Connect with various agent frameworks including LangGraph, LangChain, AutoGen, Claude Agent SDK, Rasa, Dify, and OpenClaw
- **Custom metrics**: Extend evaluation capabilities with user-defined quantitative and qualitative metrics
- **Visual reporting**: Generate HTML reports with conversation transcripts, failure analysis, and per-metric scores

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

## Architecture Overview

ArkSim follows a pipeline architecture with three stages: simulation, evaluation, and report generation:

```mermaid
graph TD
    A[Configuration<br/>config.yaml] --> B[Simulation Engine]
    C[Scenarios<br/>scenarios.json] --> B
    D[Agent<br/>custom_agent.py] --> B
    B --> E[Conversation Data]
    E --> F[Evaluator]
    F --> G[HTML Report<br/>final_report.html]
    H[Custom Metrics<br/>custom_metrics.py] --> F
```

### Core Components

| Component | Description |
|-----------|-------------|
| **Simulation Engine** | Orchestrates multi-turn conversations between simulated users and the agent |
| **Evaluator** | Scores agent performance using built-in and custom metrics |
| **CLI Interface** | Command-line interface for running simulations (`arksim simulate-evaluate`) |
| **Report Generator** | Produces HTML reports with visualizations and conversation transcripts |

Source: [examples/customer-service/README.md:1-25](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

## Agent Integration Methods

ArkSim supports multiple agent integration patterns to accommodate different agent architectures.

### Python Class Integration (BaseAgent)

The default integration method uses a Python class inheriting from `BaseAgent`. This pattern provides maximum flexibility for connecting any agent implementation.

```python
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"
```

The agent can return either:
- A plain string response
- An `AgentResponse` object containing both content and tool calls

Source: [README.md:40-55](https://github.com/arklexai/arksim/blob/main/README.md)
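
Because `execute()` may return either form, code that consumes the result has to handle both. The sketch below uses stand-in dataclasses (not the real arksim models) and a hypothetical `normalize` helper to show one way of reducing both cases to a single shape. How ArkSim handles this internally is not shown in the source.

```python
from dataclasses import dataclass, field
from typing import Any


# Stand-ins for arksim's ToolCall / AgentResponse so the sketch runs on its own.
@dataclass
class ToolCall:
    id: str
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class AgentResponse:
    content: str
    tool_calls: list[ToolCall] = field(default_factory=list)


def normalize(result: "str | AgentResponse") -> AgentResponse:
    """Hypothetical helper: wrap a plain string so callers see one type."""
    if isinstance(result, AgentResponse):
        return result
    return AgentResponse(content=result)


plain = normalize("agent response")
rich = normalize(
    AgentResponse(
        content="Order found",
        tool_calls=[ToolCall(id="call_1", name="lookup_order", arguments={"order_id": "A1"})],
    )
)
print(plain.content, len(plain.tool_calls))
print(rich.content, rich.tool_calls[0].name)
```

A plain string ends up with an empty `tool_calls` list, so downstream code can iterate over tool calls without checking the return type first.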

### AgentResponse Data Model

The `AgentResponse` model encapsulates structured agent outputs:

```python
class AgentResponse(BaseModel):
    """Structured return from agent execution, carrying both text and tool calls."""
    model_config = ConfigDict(extra="ignore")
    
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)
```

Source: [arksim/simulation_engine/tool_types.py:32-42](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/tool_types.py)

### ToolCall Model

Tool calls are captured using the `ToolCall` model:

```python
class ToolCall(BaseModel):
    """A single tool/function call observed during a turn."""
    model_config = ConfigDict(extra="ignore")
    
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None
```

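The `extra="ignore"` setting tells Pydantic to silently drop unknown fields instead of raising a validation error, so payloads that carry additional fields (for example, from a newer ArkSim version) still parse.

Source: [arksim/simulation_engine/tool_types.py:18-30](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/tool_types.py)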

### Chat Completions Endpoint Integration

For agents exposing a standard OpenAI-compatible API:

```yaml
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
```

Source: [README.md:57-62](https://github.com/arklexai/arksim/blob/main/README.md)
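
An endpoint of this type is expected to accept the OpenAI chat completions request format: a JSON body with a `model` name and a `messages` list of role/content pairs. The sketch below only builds such a body and sends nothing over the network. The model name and the helper `build_request_body` are illustrative assumptions, and the exact fields ArkSim sends are not documented in the source.

```python
import json


def build_request_body(history: list[dict[str, str]], user_query: str, model: str = "my-agent") -> str:
    """Hypothetical helper: serialize a conversation in chat completions format."""
    # Build a new list so the caller's history is left untouched.
    messages = history + [{"role": "user", "content": user_query}]
    return json.dumps({"model": model, "messages": messages})


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello, how can I help?"},
]
body = build_request_body(history, "Where is my order?")
print(body)
```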

### A2A Protocol Integration

For agents implementing the Agent-to-Agent (A2A) protocol:

```yaml
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
```

Source: [README.md:64-70](https://github.com/arklexai/arksim/blob/main/README.md)

### Automatic Tool Call Capture (Tracing)

ArkSim supports automatic tool call capture through a tracing processor, eliminating the need for agents to explicitly return tool calls:

```mermaid
graph LR
    A[Simulator sets<br/>routing context] --> B["agent.execute()"]
    B --> C[SDK fires<br/>TracingProcessor.on_span_end]
    C --> D[arksim captures<br/>tool calls]
    D --> E[Evaluator scores]
```

```bash
pip install -r requirements-traced.txt
arksim simulate-evaluate config_traced.yaml
```

Source: [examples/customer-service/README.md:35-55](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

## Configuration System

ArkSim uses YAML configuration files to define simulation and evaluation parameters.

### Configuration File Structure

```yaml
# Simulation settings
simulation:
  max_turns: 10
  timeout: 300

# Agent configuration
agent_config:
  agent_type: chat_completions  # or 'a2a', 'custom'
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

# Trace receiver (optional)
trace_receiver:
  enabled: true
  wait_timeout: 5

# Custom metrics
custom_metrics_file_paths:
  - ./custom_metrics.py

metrics_to_run:
  - goal_completion
  - verification_compliance
```

Source: [examples/ci/README.md:1-25](https://github.com/arklexai/arksim/blob/main/examples/ci/README.md)

### Scenario Definition

Scenarios are defined in `scenarios.json` with the following structure:

| Field | Description |
|-------|-------------|
| `scenario_id` | Unique identifier for the scenario |
| `user_profile` | Description of the simulated user |
| `knowledge` | Context and information available to the user |
| `goals` | Objectives the user aims to achieve |
| `expected_behavior` | Expected agent behavior patterns |

Source: [examples/integrations/dify/README.md:1-20](https://github.com/arklexai/arksim/blob/main/examples/integrations/dify/README.md)

## Evaluation Metrics

ArkSim provides built-in metrics and supports custom metric definitions.

### Built-in Metrics

| Metric | Description | Weight |
|--------|-------------|--------|
| Goal Completion | Whether the agent achieved the user's objectives | 60% |
| Turn Success Ratio | Ratio of successful turns to total turns | 40% |
| Final Score | Weighted average of other metrics | - |

Source: [arksim/utils/html_report/report_template.html:80-95](https://github.com/arklexai/arksim/blob/main/arksim/utils/html_report/report_template.html)
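
Read as a formula, the weights in the table give a final score of `0.6 × goal completion + 0.4 × turn success ratio`. The sketch below works through that arithmetic. The function name and the assumption that both inputs lie between 0.0 and 1.0 are illustrative, not taken from the ArkSim source.

```python
GOAL_WEIGHT = 0.6          # "Goal Completion" weight from the table
TURN_SUCCESS_WEIGHT = 0.4  # "Turn Success Ratio" weight from the table


def final_score(goal_completion: float, turn_success_ratio: float) -> float:
    """Weighted average of the two built-in metrics (inputs assumed in 0.0-1.0)."""
    return GOAL_WEIGHT * goal_completion + TURN_SUCCESS_WEIGHT * turn_success_ratio


# Example: goal fully achieved, 3 of 4 turns successful.
score = final_score(goal_completion=1.0, turn_success_ratio=3 / 4)
print(round(score, 2))  # 0.9
```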

### Custom Metrics

Custom metrics can be implemented as either quantitative (scored numerically) or qualitative (evaluated via LLM).

#### Quantitative Metric Pattern

```python
from pydantic import BaseModel

from arksim.evaluator import (
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
    format_chat_history,
)


# Output schema for the LLM judge (assumed to be a Pydantic model).
class VerificationComplianceMetric(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str

VERIFICATION_COMPLIANCE_SYSTEM_PROMPT = """\
You are an impartial evaluator for a customer service agent.
Score how well the agent followed identity verification protocols..."""

def verification_compliance_metric(
    input_data: ScoreInput,
) -> QuantResult:
    # Implementation
    return QuantResult(score=0.85, reason="Verification completed")
```

Source: [examples/customer-service/custom_metrics.py:15-40](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_metrics.py)

#### Qualitative Metric Pattern

```python
from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    ScoreInput,
    format_chat_history,
)

def helpfulness_metric(input_data: ScoreInput) -> QualResult:
    system_prompt = """Evaluate the helpfulness of the agent response..."""
    return QualResult(
        assessment="The agent provided comprehensive assistance",
        score=0.9,
        reason="Clear explanation with actionable steps"
    )
```

### Metric Configuration

To use custom metrics:

1. Implement the metric function in a Python file
2. Reference the file path in `custom_metrics_file_paths` in config.yaml
3. Add the metric name to `metrics_to_run`

Source: [examples/customer-service/custom_metrics.py:1-15](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_metrics.py)

## CLI Usage

ArkSim provides a command-line interface for running simulations and evaluations.

### Primary Commands

| Command | Description |
|---------|-------------|
| `arksim simulate-evaluate config.yaml` | Run simulation and evaluation in one step |
| `arksim simulate config_simulate.yaml` | Run simulation only |
| `arksim evaluate config_evaluate.yaml` | Evaluate existing simulation results |

Source: [examples/customer-service/README.md:15-22](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

### Workflow Examples

#### Combined Simulation and Evaluation

```bash
arksim simulate-evaluate config.yaml
```

#### Separate Simulation and Evaluation

```bash
# Step 1: Simulate
arksim simulate config_simulate.yaml

# Step 2: Evaluate
arksim evaluate config_evaluate.yaml
```

Results are written to `./results/simulation/simulation.json`. The evaluation report is printed to stdout with per-scenario metric scores and failure analysis.

Source: [examples/integrations/dify/README.md:15-25](https://github.com/arklexai/arksim/blob/main/examples/integrations/dify/README.md)

## Framework Integrations

ArkSim provides integration examples for popular agent frameworks.

### Integration Matrix

| Framework | Package Required | Documentation |
|-----------|-----------------|---------------|
| LangGraph | `langgraph langchain-openai` | [Link](https://github.com/arklexai/arksim/blob/main/examples/integrations/langgraph/README.md) |
| LangChain | `langgraph langchain-openai` | [Link](https://github.com/arklexai/arksim/blob/main/examples/integrations/langchain/README.md) |
| AutoGen | `autogen-agentchat autogen-ext[openai]` | [Link](https://github.com/arklexai/arksim/blob/main/examples/integrations/autogen/README.md) |
| Claude Agent SDK | `claude-agent-sdk` | [Link](https://github.com/arklexai/arksim/blob/main/examples/integrations/claude-agent-sdk/README.md) |
| Rasa | `rasa-pro` | [Link](https://github.com/arklexai/arksim/blob/main/examples/integrations/rasa/README.md) |
| Dify | (custom agent) | [Link](https://github.com/arklexai/arksim/blob/main/examples/integrations/dify/README.md) |
| OpenClaw | (custom agent) | [Link](https://github.com/arklexai/arksim/blob/main/examples/integrations/openclaw/README.md) |

Sources: [examples/integrations/langgraph/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/langgraph/README.md), [examples/integrations/autogen/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/autogen/README.md)

### General Integration Pattern

Regardless of the framework, the integration follows this pattern:

1. Create a `custom_agent.py` file with a `BaseAgent` subclass
2. Implement the `execute()` method to call the framework's agent
3. Return either a string or `AgentResponse` with tool calls
4. Configure `config.yaml` to use the custom agent
5. Create `scenarios.json` with test cases

```python
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class FrameworkAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-session-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Call your framework's agent here
        result = await your_agent.run(user_query)
        return str(result)
```

## Report Generation

ArkSim generates an HTML evaluation report organized into the sections listed here.

### Report Sections

| Section | Content |
|---------|---------|
| **Summary** | Overall scores, pass/fail status |
| **Per-Scenario Scores** | Individual metric scores for each scenario |
| **Failure Categories** | Grouped failure patterns with counts |
| **Conversation Viewer** | Full transcript of each simulated conversation |
| **Score Explanations** | Rationale for each score with references to conversation turns |

Source: [README.md:25-30](https://github.com/arklexai/arksim/blob/main/README.md)

### Report Output

The report tells you where your agent is strong and where it breaks. You get per-metric scores, categorized failures, and full conversation transcripts so you can read the exact turns where things went wrong.

Source: [README.md:30-35](https://github.com/arklexai/arksim/blob/main/README.md)

## Development Guidelines

### Project Setup

```bash
# Fork and clone
git clone https://github.com/<your-username>/arksim.git
cd arksim

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Create a branch
git checkout -b my-feature
```

### Code Quality

ArkSim uses Ruff for linting and formatting:

```bash
ruff check .    # Lint
ruff format .   # Format
```

Pre-commit hooks run both automatically on commit.

### Code Style

- Follow PEP 8 conventions
- Code lines: maximum 120 characters
- Comments and docstrings: maximum 80 characters
- Type hints encouraged for function signatures
- Use absolute imports over relative imports

### Commit Message Format

```
<component>: <verb> <description>
```

Examples:
- `evaluator: add custom metric support`
- `simulator: fix profile generation for empty attributes`
- `cli: support verbose flag for streaming output`

Keep the subject line under 72 characters and write it in lowercase, imperative mood.
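
The format rules above can be checked mechanically. The sketch below is a hypothetical helper, not part of ArkSim's tooling: it verifies the `<component>: <verb> <description>` shape, the length limit, and lowercase, but it cannot check imperative mood. The regular expression is one reasonable reading of the format and is an assumption.

```python
import re

# <component>: <verb> <description>, where the remainder has at least two words.
SUBJECT_PATTERN = re.compile(r"^[a-z][a-z0-9_-]*: \S+ \S.*$")
MAX_SUBJECT_LENGTH = 72


def is_valid_subject(subject: str) -> bool:
    """Hypothetical check for shape, length, and lowercase; mood is not checked."""
    return (
        len(subject) < MAX_SUBJECT_LENGTH
        and subject == subject.lower()
        and SUBJECT_PATTERN.match(subject) is not None
    )


print(is_valid_subject("evaluator: add custom metric support"))  # True
print(is_valid_subject("Fixed the evaluator"))  # False
```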

### Branch Naming

```
<type>/<short-description>
```

Examples: `feat/retry-logic`, `fix/empty-list-handling`, `docs/update-quickstart`.

Source: [CONTRIBUTING.md:1-50](https://github.com/arklexai/arksim/blob/main/CONTRIBUTING.md)

## CI/CD Integration

ArkSim can be integrated into GitHub Actions workflows for automated testing.

### Workflow Options

| Workflow | Use Case |
|----------|----------|
| `arksim.yml` | HTTP server endpoints requiring startup/shutdown |
| `arksim-pytest.yml` | Python-based pytest integration |

### Setup Requirements

1. Update `TODO` sections in `arksim.yml` (startup command and health-check URL)
2. Create `tests/arksim/config.yaml` pointing to your server endpoint
3. Create `tests/arksim/scenarios.json` with test cases
4. Add custom metrics to `tests/arksim/custom_metrics/` if needed
5. Add `OPENAI_API_KEY` (and optionally `AGENT_API_KEY`) to GitHub secrets

Source: [examples/ci/README.md:1-15](https://github.com/arklexai/arksim/blob/main/examples/ci/README.md)

## Summary

ArkSim provides a comprehensive framework for evaluating multi-turn conversational AI agents through:

1. **Flexible integration**: Connect agents via Python classes, REST APIs, or framework-specific connectors
2. **Rich scenario definitions**: Test agents with diverse user profiles and interaction goals
3. **Extensible evaluation**: Implement custom metrics for domain-specific assessment
4. **Actionable reporting**: Generate HTML reports that pinpoint exactly where and why agent behavior falls short

By standardizing agent evaluation, ArkSim enables data-driven improvements to conversational AI systems across different frameworks and architectures.

---

<a id='quickstart'></a>

## Quickstart Guide

### Related Pages

Related topics: [Introduction to ArkSim](#intro), [Agent Types and Integration](#agent-types)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [arksim/cli.py](https://github.com/arklexai/arksim/blob/main/arksim/cli.py)
- [arksim/simulation_engine/agent/base.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/agent/base.py)
- [arksim/simulation_engine/tool_types.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/tool_types.py)
- [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)
- [examples/integrations/langchain/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/langchain/README.md)
- [README.md](https://github.com/arklexai/arksim/blob/main/README.md)
</details>

# Quickstart Guide

This guide walks you through setting up and running your first agent simulation with ArkSim. By the end, you'll understand how to scaffold a project, connect your agent, define test scenarios, and execute simulation and evaluation workflows.

## Prerequisites

Before starting, ensure you have:

- Python 3.10 or higher
- pip or pipx for package installation
- An API key for your LLM provider (OpenAI, Anthropic, etc.) if your agent requires external API calls

## Installation

Install ArkSim using pip:

```bash
pip install arksim
```

For development with test dependencies:

```bash
pip install -e ".[dev]"
```

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

## Project Initialization

The `arksim init` command scaffolds a starter project with all necessary configuration files. Run it from your desired working directory:

```bash
arksim init
```

### Agent Connection Types

ArkSim supports three agent connection types via the `--agent-type` flag:

| Type | Description | Use Case |
|------|-------------|----------|
| `custom` | Python class implementing `BaseAgent` | Full control, no external server needed |
| `chat_completions` | HTTP endpoint compatible with OpenAI format | Existing REST APIs |
| `a2a` | Agent-to-Agent protocol | Multi-agent systems |

Source: [arksim/cli.py:120-136](https://github.com/arklexai/arksim/blob/main/arksim/cli.py)

### Command Options

```bash
arksim init --agent-type custom    # Default: Python agent class
arksim init --agent-type chat_completions  # HTTP endpoint
arksim init --agent-type a2a       # A2A protocol
arksim init --agent-type custom --force  # Overwrite existing files
```

Scaffolding generates these files:

| File | Purpose |
|------|---------|
| `my_agent.py` | Agent implementation stub |
| `config.yaml` | Simulator configuration |
| `scenarios.json` | Test scenario definitions |
| `.env` | Environment variables template |

## Project Structure

```
your-project/
├── my_agent.py        # Your agent implementation
├── config.yaml         # Simulation & evaluation settings
├── scenarios.json     # Test scenarios
└── results/            # Output directory (created on run)
    ├── simulation/
    │   └── simulation.json
    └── evaluation/
        └── evaluation.json
```

## Connecting Your Agent

### Python Class (Recommended)

Replace the generated `my_agent.py` with your agent logic. Your class must inherit from `BaseAgent`:

```python
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Your agent logic here
        return "agent response"
```

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

The `execute` method supports two return types:

| Return Type | Description |
|-------------|-------------|
| `str` | Plain text response |
| `AgentResponse` | Structured response with text and tool calls |

```python
from arksim.simulation_engine.tool_types import AgentResponse, ToolCall

async def execute(self, user_query: str, **kwargs: object) -> AgentResponse:
    tool_calls = [
        ToolCall(
            id="call_123",
            name="search_database",
            arguments={"query": user_query}
        )
    ]
    return AgentResponse(
        content="Found results for your query",
        tool_calls=tool_calls
    )
```

Source: [arksim/simulation_engine/tool_types.py:45-72](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/tool_types.py)

### Chat Completions Endpoint

For HTTP-based agents, configure `config.yaml`:

```yaml
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
```

### A2A Protocol

For Agent-to-Agent protocol support:

```yaml
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
```

A2A agents can surface tool calls for evaluation via the protocol extension.

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

## Configuring Simulation

Edit `config.yaml` to control simulation behavior:

```yaml
simulator:
  max_turns: 20
  timeouts:
    agent_response: 30
    tool_execution: 10
  model: gpt-4o-mini
```

### Core Configuration Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `max_turns` | int | 20 | Maximum conversation turns per scenario |
| `timeouts.agent_response` | int | 30 | Timeout for agent response (seconds) |
| `timeouts.tool_execution` | int | 10 | Timeout for tool execution (seconds) |
| `model` | string | gpt-4o-mini | Simulator model for user simulation |

## Defining Test Scenarios

Edit `scenarios.json` to define your test cases:

```json
[
  {
    "id": "scenario-001",
    "name": "Customer Inquiry",
    "category": "support",
    "user_profile": {
      "name": "Alice Johnson",
      "age": 34,
      "account_type": "premium"
    },
    "knowledge": [
      "Customer has a premium account",
      "Customer is inquiring about billing"
    ],
    "conversation": [
      {
        "turn": 1,
        "user": "I noticed a charge on my account that I don't recognize",
        "goal": "Customer wants to understand and dispute the unfamiliar charge"
      }
    ]
  }
]
```

### Scenario Schema Fields

| Field | Required | Description |
|-------|----------|-------------|
| `id` | Yes | Unique scenario identifier |
| `name` | Yes | Human-readable name |
| `category` | No | Grouping category for filtering |
| `user_profile` | No | Simulated user attributes |
| `knowledge` | No | Facts the simulated user knows |
| `conversation` | Yes | Multi-turn conversation structure |

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)
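
Before running a simulation, it can be useful to confirm that every scenario carries the fields marked as required in the table. The sketch below is a hypothetical pre-flight check, not an ArkSim feature. The helper `missing_fields` and the inline sample data are illustrative.

```python
import json

REQUIRED_FIELDS = ("id", "name", "conversation")  # marked "Yes" in the table


def missing_fields(scenario: dict) -> list[str]:
    """Hypothetical helper: list required fields absent from one scenario."""
    return [name for name in REQUIRED_FIELDS if name not in scenario]


# Inline sample; in practice this would be the contents of scenarios.json.
scenarios = json.loads("""
[
  {"id": "scenario-001", "name": "Customer Inquiry", "conversation": []},
  {"id": "scenario-002", "category": "support"}
]
""")

for scenario in scenarios:
    print(scenario["id"], missing_fields(scenario))
```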

## Running Simulation and Evaluation

### Combined Workflow

Run both simulation and evaluation in one command:

```bash
arksim simulate-evaluate config.yaml
```

### Separate Steps

For more control, run simulation and evaluation separately:

```bash
# Step 1: Simulate
arksim simulate config_simulate.yaml

# Step 2: Evaluate
arksim evaluate config_evaluate.yaml
```

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

## Command Reference

| Command | Description |
|---------|-------------|
| `arksim init` | Scaffold new project |
| `arksim simulate <config>` | Run simulation only |
| `arksim evaluate <config>` | Run evaluation only |
| `arksim simulate-evaluate <config>` | Run both steps |
| `arksim ui` | Launch web UI (port 8080) |
| `arksim examples` | Download example projects |
| `arksim prompts` | List available prompts |

Source: [arksim/cli.py:89-152](https://github.com/arklexai/arksim/blob/main/arksim/cli.py)

## Web UI

Launch the web-based control plane:

```bash
arksim ui --port 8080
```

The UI provides:
- Scenario management
- Real-time simulation progress
- Log viewing
- Dark/light mode toggle

## Integration Examples

ArkSim supports integrations with popular agent frameworks:

| Framework | Installation |
|-----------|---------|
| LangChain/LangGraph | `pip install langgraph langchain-openai` |
| Claude Agent SDK | `pip install claude-agent-sdk` |
| Microsoft AutoGen | `pip install autogen-agentchat autogen-ext[openai]` |
| Dify | HTTP-based integration |

Example workflow for LangGraph:

```bash
cd examples/integrations/langgraph
pip install langgraph langchain-openai
export OPENAI_API_KEY="<your-key>"
arksim simulate-evaluate config.yaml
```

Source: [examples/integrations/langchain/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/langchain/README.md)

## Next Steps

- Review the [Evaluation System](#evaluation-system) documentation for customizing scoring
- Explore the [Agent Types and Integration](#agent-types) guide for advanced integration patterns
- Download example projects with `arksim examples`

---

<a id='arch-overview'></a>

## System Architecture Overview

### Related Pages

Related topics: [Simulation Engine](#simulation-engine), [Evaluation System](#evaluation-system), [LLM Provider Integration](#llm-providers)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [arksim/simulation_engine/tool_types.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/tool_types.py)
- [arksim/cli.py](https://github.com/arklexai/arksim/blob/main/arksim/cli.py)
- [README.md](https://github.com/arklexai/arksim/blob/main/README.md)
- [examples/customer-service/custom_metrics.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_metrics.py)
- [examples/e-commerce/custom_metrics.py](https://github.com/arklexai/arksim/blob/main/examples/e-commerce/custom_metrics.py)
- [arksim/utils/html_report/report_template.html](https://github.com/arklexai/arksim/blob/main/arksim/utils/html_report/report_template.html)
</details>

# System Architecture Overview

ArkSim is a multi-turn agent evaluation framework that simulates user conversations with agents, captures tool calls, and scores agent performance against customizable metrics. This document provides a comprehensive overview of the system's architecture, core components, and data flows.

## Core Components

ArkSim's architecture is organized into three primary subsystems that work together to simulate, capture, and evaluate agent behavior.

| Component | Purpose |
|-----------|---------|
| Simulation Engine | Orchestrates multi-turn conversations between user simulators and agents |
| Evaluator | Scores agent responses against qualitative and quantitative metrics |
| LLM Layer | Powers user simulation and metric evaluation via configurable model providers |

## High-Level Architecture

```mermaid
graph TD
    subgraph Simulation["Simulation Engine"]
        CLI[CLI Interface]
        Config[Config Loader]
        Simulator[Simulator Core]
        AgentConnector[Agent Connector]
        UserSimulator[User Simulator]
    end

    subgraph Evaluation["Evaluator"]
        Metrics[Custom Metrics]
        Scoring[Scoring Engine]
        Report[HTML Report Generator]
    end

    subgraph LLM["LLM Layer"]
        ChatLLM[Chat LLM]
        EvalLLM[Evaluation LLM]
    end

    subgraph External["External Systems"]
        CustomAgent[Custom Agent]
        HTTPEndpoint[HTTP Endpoint]
        A2AAgent[A2A Agent]
    end

    CLI --> Config
    Config --> Simulator
    Simulator --> UserSimulator
    Simulator --> AgentConnector
    AgentConnector --> CustomAgent
    AgentConnector --> HTTPEndpoint
    AgentConnector --> A2AAgent
    UserSimulator --> ChatLLM
    Metrics --> EvalLLM
    EvalLLM --> Scoring
    Scoring --> Report

    style CLI fill:#e1f5fe
    style Simulator fill:#fff3e0
    style Report fill:#e8f5e9
```

## CLI Interface

The command-line interface provides multiple entry points for running simulations and evaluations. The CLI is implemented in `arksim/cli.py` and supports the following commands:

| Command | Description |
|---------|-------------|
| `arksim simulate-evaluate config.yaml` | Run simulation and evaluation in a single pipeline |
| `arksim simulate config.yaml` | Run simulation only |
| `arksim evaluate config.yaml` | Run evaluation on existing simulation results |
| `arksim init` | Scaffold starter files for agent testing |
| `arksim ui` | Launch web UI control plane |
| `arksim examples` | Download example projects from GitHub |

Source: [arksim/cli.py:1-100](https://github.com/arklexai/arksim/blob/main/arksim/cli.py)

### Simulation Modes

ArkSim supports three distinct agent integration patterns:

```mermaid
graph LR
    subgraph AgentTypes["Agent Types"]
        Custom[Custom Agent<br/>Python class extending BaseAgent]
        HTTP[Chat Completions<br/>HTTP endpoint /v1/chat/completions]
        A2A[A2A Protocol<br/>Agent-to-Agent standard]
    end
```

1. **Custom Agent (Python class)**: Extend `BaseAgent` and implement the `execute()` method
2. **Chat Completions**: Configure an HTTP endpoint for OpenAI-compatible chat completions API
3. **A2A Protocol**: Connect via the Agent-to-Agent protocol standard

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

## Simulation Engine

The simulation engine orchestrates multi-turn conversations between a user simulator and the agent under test. Each turn consists of:

1. The user simulator generates the next user message from the conversation history and scenario knowledge
2. The agent processes that message and returns a response (optionally with tool calls)
3. The simulator records the exchange for evaluation
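
The three steps above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not ArkSim's implementation: `EchoAgent` stands in for a `BaseAgent` subclass, `simulated_user_message` stands in for the LLM-backed user simulator, and `run_conversation` is a hypothetical name.

```python
import asyncio


class EchoAgent:
    """Stand-in for a BaseAgent subclass; returns a canned reply."""

    async def execute(self, user_query: str, **kwargs: object) -> str:
        return f"You said: {user_query}"


def simulated_user_message(history: list[dict[str, str]], knowledge: list[str]) -> str:
    """Stand-in for the LLM-backed user simulator (hypothetical)."""
    turn = len(history) // 2  # two history entries are recorded per turn
    return knowledge[turn % len(knowledge)]


async def run_conversation(agent: EchoAgent, knowledge: list[str], max_turns: int) -> list[dict[str, str]]:
    history: list[dict[str, str]] = []
    for _ in range(max_turns):
        # Step 1: the user simulator produces the next message.
        user_message = simulated_user_message(history, knowledge)
        # Step 2: the agent under test answers it.
        reply = await agent.execute(user_message)
        # Step 3: both sides of the turn are recorded for evaluation.
        history.append({"role": "user", "content": user_message})
        history.append({"role": "assistant", "content": reply})
    return history


transcript = asyncio.run(
    run_conversation(EchoAgent(), ["I was double charged", "Please refund me"], max_turns=2)
)
for message in transcript:
    print(f"{message['role']}: {message['content']}")
```

The recorded transcript is what a later evaluation step would score. A real run would stop early once the simulated user's goal is met, which this sketch omits.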

### Tool Call Capture

ArkSim captures tool/function calls in two ways:

| Capture Method | Mechanism | Configuration |
|----------------|-----------|---------------|
| AgentResponse | Agent returns structured `AgentResponse` with `tool_calls` list | Default behavior |
| TracingProcessor | SDK's `TracingProcessor.on_span_end` captures calls automatically | `trace_receiver.enabled: true` |

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

#### Tool Call Data Model

Tool calls are represented by the `ToolCall` class:

```python
class ToolCall(BaseModel):
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None
```

The `AgentResponse` wraps both content and tool calls:

```python
class AgentResponse(BaseModel):
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)
```

Source: [arksim/simulation_engine/tool_types.py:1-50](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/tool_types.py)

### Traced Agent Flow

```mermaid
sequenceDiagram
    participant Simulator
    participant Agent
    participant Tracing as ArksimTracingProcessor
    participant Evaluator

    Simulator->>Agent: execute(user_query)
    Agent->>Tracing: on_span_end(span)
    Tracing->>Simulator: captured_tool_calls
    Simulator->>Agent: RunResult
    Agent->>Simulator: AgentResponse
    Simulator->>Evaluator: Tool calls + Conversation
    Evaluator->>Simulator: Metric Scores
```

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

## Evaluator

The evaluator scores agent performance using both quantitative and qualitative metrics. The evaluation framework supports custom metrics defined via Python files.

### Metric Types

| Type | Description | Scoring Method |
|------|-------------|----------------|
| QuantitativeMetric | Numerical scores (0.0-1.0 scale) | Structured JSON schema validation |
| QualitativeMetric | Free-form evaluation with reasoning | LLM-generated analysis |

Source: [examples/customer-service/custom_metrics.py:1-60](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_metrics.py)

### Custom Metrics Structure

Custom metrics require:

1. A Pydantic `BaseModel` defining the output schema
2. A system prompt describing evaluation criteria
3. Implementation of `QuantitativeMetric` or `QualitativeMetric`

```python
class ConversionSchema(BaseModel):
    intent_strength: float
    conversion_outcome: float
    evidence: list[str]
    reason: str

CONVERSION_SYSTEM_PROMPT = """\
You are an impartial evaluator for an e-commerce shopping agent.
Your job is to score (1) the shopper's purchase intent and (2) whether the agent achieved a conversion outcome...
"""
```

Source: [examples/e-commerce/custom_metrics.py:1-40](https://github.com/arklexai/arksim/blob/main/examples/e-commerce/custom_metrics.py)
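
To show why an output schema is useful, here is a hedged sketch of checking a judge model's JSON reply against the `ConversionSchema` fields. It uses only the standard library instead of Pydantic; `ConversionScores`, `parse_conversion_output`, and the sample reply are hypothetical, and the 0.0-1.0 range check is an assumption based on the quantitative scale described earlier.

```python
import json
from dataclasses import dataclass


@dataclass
class ConversionScores:
    """Stdlib stand-in for the ConversionSchema fields shown above."""
    intent_strength: float
    conversion_outcome: float
    evidence: list[str]
    reason: str


def parse_conversion_output(raw: str) -> ConversionScores:
    """Parse a judge reply and reject scores outside the 0.0-1.0 range."""
    data = json.loads(raw)
    scores = ConversionScores(
        intent_strength=float(data["intent_strength"]),
        conversion_outcome=float(data["conversion_outcome"]),
        evidence=[str(item) for item in data.get("evidence", [])],
        reason=str(data["reason"]),
    )
    for name in ("intent_strength", "conversion_outcome"):
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0.0, 1.0], got {value}")
    return scores


# Invented reply, shaped like the schema above.
sample_reply = (
    '{"intent_strength": 0.8, "conversion_outcome": 1.0, '
    '"evidence": ["added to cart"], "reason": "Checkout completed."}'
)
parsed = parse_conversion_output(sample_reply)
```

Rejecting malformed replies early keeps a single bad judge response from silently skewing aggregate scores.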

### Score Calculation

The evaluator computes a **Final Score** as a weighted average:

| Component | Weight | Description |
|-----------|--------|-------------|
| Turn Success Ratio | 40% | Ratio of successful turns to total turns |
| Goal Completion Score | 60% | LLM-assessed goal achievement score |

| Status | Condition |
|--------|-----------|
| Done | Final score = 1.0 |
| Partial Failure | 0.0 < Final score < 1.0 |
| Complete Failure | Final score = 0.0 |

Source: [arksim/utils/html_report/report_template.html:1-50](https://github.com/arklexai/arksim/blob/main/arksim/utils/html_report/report_template.html)
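
The arithmetic behind the two tables above can be sketched as follows. The weights and status conditions are taken from those tables; the function names are hypothetical, and the weights are exposed as parameters because this is an illustration rather than ArkSim's own scoring code.

```python
def final_score(
    turn_success_ratio: float,
    goal_completion: float,
    turn_weight: float = 0.4,
    goal_weight: float = 0.6,
) -> float:
    """Weighted average of the two components from the table above."""
    return turn_weight * turn_success_ratio + goal_weight * goal_completion


def status_for(score: float) -> str:
    """Map a final score to the status labels from the table above."""
    if score >= 1.0:
        return "Done"
    if score <= 0.0:
        return "Complete Failure"
    return "Partial Failure"


# Example: 3 of 4 turns succeeded and the goal was judged half complete.
example_score = final_score(turn_success_ratio=3 / 4, goal_completion=0.5)
example_status = status_for(example_score)
```

With these inputs the score lands at roughly 0.6, which falls in the "Partial Failure" band.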

## Data Flow

```mermaid
graph LR
    subgraph Input["Input"]
        Config[config.yaml]
        Scenarios[scenarios.json]
        Metrics[custom_metrics.py]
    end

    subgraph Process["Processing"]
        Sim[Simulation]
        Eval[Evaluation]
        Score[Scoring]
    end

    subgraph Output["Output"]
        Results[simulation.json]
        Report[evaluation.html]
    end

    Config --> Sim
    Scenarios --> Sim
    Sim --> Eval
    Metrics --> Eval
    Eval --> Score
    Score --> Results
    Score --> Report
```

## Agent Connector Types

ArkSim supports multiple agent integration patterns via the configuration system:

```yaml
# Option 1: Custom Python Agent
agent_config:
  agent_type: custom
  agent_name: my-agent

# Option 2: Chat Completions HTTP Endpoint
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions

# Option 3: A2A Protocol
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
```

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

## User Simulator

The user simulator generates realistic multi-turn conversations based on:

- **Scenario definitions**: Predefined conversation flows and expected behaviors
- **User profiles**: Demographic and behavioral attributes for persona simulation
- **Knowledge bases**: Domain-specific information the simulated user possesses
- **Conversation history**: Full context of the multi-turn interaction

The simulator uses LLM-based generation to produce contextually appropriate user responses that evolve through the conversation based on agent actions and conversation state.
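
One way to picture how these four inputs combine is a prompt builder like the sketch below. It is purely illustrative: `SimulatedUser`, `build_user_prompt`, and the prompt wording are assumptions, not the prompts ArkSim ships in `utils/prompts.py`.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedUser:
    """Hypothetical bundle of scenario goal, profile, and knowledge."""
    goal: str
    profile: dict[str, str] = field(default_factory=dict)
    knowledge: list[str] = field(default_factory=list)


def build_user_prompt(user: SimulatedUser, history: list[tuple[str, str]]) -> str:
    """Assemble the text an LLM would receive to produce the next user turn."""
    lines = [f"You are role-playing a user whose goal is: {user.goal}"]
    if user.profile:
        traits = ", ".join(f"{key}={value}" for key, value in sorted(user.profile.items()))
        lines.append(f"Profile: {traits}")
    if user.knowledge:
        lines.append("Known facts:")
        lines.extend(f"- {fact}" for fact in user.knowledge)
    lines.append("Conversation so far:")
    lines.extend(f"{speaker}: {text}" for speaker, text in history)
    lines.append("Reply with the user's next message only.")
    return "\n".join(lines)


prompt = build_user_prompt(
    SimulatedUser(
        goal="get a refund for order A123",
        profile={"tone": "impatient"},
        knowledge=["Order A123 arrived damaged"],
    ),
    history=[("user", "Hi, I need help."), ("agent", "Sure, what is your order number?")],
)
```

Because the full history is included on every call, the simulated user can react to what the agent actually said in earlier turns.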

## HTML Report Generation

After evaluation completes, ArkSim generates an interactive HTML report containing:

| Section | Content |
|---------|---------|
| Summary Statistics | Overall scores, pass/fail rates |
| Per-Scenario Metrics | Individual metric scores with reasoning |
| Failure Categories | Grouped analysis of failure types |
| Conversation Transcripts | Full turn-by-turn dialogue viewer |

Source: [arksim/utils/html_report/report_template.html:1-100](https://github.com/arklexai/arksim/blob/main/arksim/utils/html_report/report_template.html)

## Integration Examples

ArkSim provides example integrations for popular agent frameworks:

| Framework | Example Location | Connection Method |
|-----------|-----------------|-------------------|
| LangGraph | `examples/integrations/langgraph/` | Custom agent connector |
| LangChain | `examples/integrations/langchain/` | Custom agent connector |
| Claude Agent SDK | `examples/integrations/claude-agent-sdk/` | Custom agent connector |
| AutoGen | `examples/integrations/autogen/` | Custom agent connector |
| Pydantic AI | `examples/integrations/pydantic-ai/` | Custom agent connector |
| Dify | `examples/integrations/dify/` | HTTP client |

Sources: [examples/integrations/langgraph/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/langgraph/README.md), [examples/integrations/claude-agent-sdk/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/claude-agent-sdk/README.md), [examples/integrations/autogen/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/autogen/README.md)

## Configuration Schema

The main configuration file (`config.yaml`) controls all aspects of simulation and evaluation:

| Section | Key Options |
|---------|-------------|
| `agent_config` | `agent_type`, `agent_name`, `api_config.endpoint` |
| `scenario_file` | Path to scenarios.json |
| `metrics_to_run` | List of metric names to execute |
| `custom_metrics_file_paths` | Paths to custom metric Python files |
| `trace_receiver.enabled` | Enable TracingProcessor capture (default: false) |
| `trace_receiver.wait_timeout` | Timeout for trace capture (seconds) |

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

## Web UI

ArkSim includes a web-based control plane for managing simulations:

```mermaid
graph TD
    subgraph UI["Web UI"]
        Build[Build Scenarios]
        Load[Load Existing]
        Run[Run Simulation]
        View[View Results]
    end

    subgraph Features["UI Features"]
        AutoGen[Auto-generate Scenarios PRO]
        Browse[File Browser]
        Refresh[Refresh Results]
    end
```

Features include:
- Scenario building and loading
- Auto-generate scenarios (PRO feature)
- File browser integration
- Results viewing and refresh

Source: [arksim/ui/frontend/index.html:1-80](https://github.com/arklexai/arksim/blob/main/arksim/ui/frontend/index.html)

## Summary

ArkSim provides a comprehensive multi-turn agent evaluation framework with:

- **Flexible agent integration**: Support for custom Python agents, HTTP endpoints, and A2A protocol
- **Tool call capture**: Two mechanisms for capturing agent tool executions
- **Customizable metrics**: Both quantitative and qualitative evaluation approaches
- **Interactive reporting**: HTML-based results with conversation viewer
- **CLI and UI**: Command-line and web-based interfaces for running evaluations
- **Framework integrations**: Pre-built examples for LangGraph, LangChain, Claude SDK, AutoGen, Pydantic AI, and Dify

---

<a id='simulation-engine'></a>

## Simulation Engine

### Related Pages

Related topics: [System Architecture Overview](#arch-overview), [Evaluation System](#evaluation-system), [Scenario Management](#scenario-management)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [arksim/simulation_engine/simulator.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/simulator.py)
- [arksim/simulation_engine/entities.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/entities.py)
- [arksim/simulation_engine/core/multi_knowledge_handling.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/core/multi_knowledge_handling.py)
- [arksim/simulation_engine/agent/base.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/agent/base.py)
- [arksim/simulation_engine/tool_types.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/tool_types.py)
- [arksim/simulation_engine/utils/prompts.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/utils/prompts.py)
</details>

# Simulation Engine

The Simulation Engine is the core component of ArkSim responsible for executing multi-turn conversations between simulated users and agent systems under test. It orchestrates the entire simulation lifecycle, from scenario loading to agent execution, while capturing tool calls, responses, and conversation state for downstream evaluation.

## Architecture Overview

The Simulation Engine follows a layered architecture that separates concerns between scenario management, agent execution, tool call capture, and knowledge handling.

```mermaid
graph TD
    A[Scenarios JSON] --> B[Simulator]
    C[Config YAML] --> B
    B --> D[Agent Executor]
    D --> E[BaseAgent]
    E --> F[Custom Agent / Chat Completions / A2A]
    F --> G[Tool Calls Capture]
    G --> H[Simulation Results]
    D --> I[Multi-Knowledge Handler]
    I --> J[Knowledge Sources]
    
    style B fill:#e1f5fe
    style D fill:#fff3e0
    style E fill:#e8f5e9
```

### Core Components

| Component | File | Purpose |
|-----------|------|---------|
| Simulator | `simulator.py` | Orchestrates the simulation lifecycle |
| Entities | `entities.py` | Data models for scenarios, profiles, and turns |
| Agent Base | `agent/base.py` | Abstract base class for all agent implementations |
| Tool Types | `tool_types.py` | Data models for tool calls and agent responses |
| Multi-Knowledge Handler | `core/multi_knowledge_handling.py` | Manages multiple knowledge sources for user profiles |
| Prompt Utilities | `utils/prompts.py` | Generates prompts for simulated users |

## Data Models

### ToolCall

The `ToolCall` class represents a single tool or function call observed during a conversation turn.

```python
class ToolCall(BaseModel):
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None
```

Key characteristics:
- Declares `extra="ignore"` for forward compatibility with future versions
- Supports capturing arguments, results, and error states
- Tracks the source of tool calls (e.g., from response parsing or tracing)

Source: [arksim/simulation_engine/tool_types.py:26-37]()

### AgentResponse

The `AgentResponse` class provides a structured return from agent execution.

```python
class AgentResponse(BaseModel):
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)
```

Source: [arksim/simulation_engine/tool_types.py:45-52]()

## BaseAgent Interface

All agent implementations must inherit from `BaseAgent` and implement the required async methods.

```python
class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"
```

Source: [README.md:45-55]()

### Required Methods

| Method | Return Type | Description |
|--------|-------------|-------------|
| `get_chat_id()` | `str` | Returns a unique identifier for the chat session |
| `execute()` | `str \| AgentResponse` | Executes the agent with a user query and returns content with optional tool calls |

Source: [README.md:45-55]()

## Simulation Workflow

```mermaid
graph LR
    A[Load Config] --> B[Load Scenarios]
    B --> C[Initialize Agent]
    C --> D[For Each Scenario]
    D --> E[Generate User Profile]
    E --> F[Execute Turn]
    F --> G[Capture Tool Calls]
    G --> H[Record Response]
    H --> I{More Turns?}
    I -->|Yes| F
    I -->|No| J[Save Results]
    J --> K{More Scenarios?}
    K -->|Yes| D
    K -->|No| L[Complete]
```

### Step 0: Build Scenarios

Before running a simulation, users must create or load test scenarios. The UI provides options to:

- **Auto-generate Scenarios** (Pro feature) - Automatically generate realistic test scenarios from the agent's knowledge base
- **Load Existing** - Load scenario files from a specified path

Source: [arksim/ui/frontend/index.html:1-50]()

### Step 1: Simulation Execution

The simulator executes each scenario by:

1. Loading the scenario configuration and user profiles
2. Generating multi-turn conversations based on the scenario goals
3. Capturing all tool calls and responses
4. Producing a `simulation.json` output file

Source: [examples/integrations/dify/README.md:18-20]()

## Agent Types

ArkSim supports multiple agent integration patterns:

| Agent Type | Configuration | Use Case |
|------------|---------------|----------|
| Custom Python Class | `agent_type: custom` | Full control via subclassing `BaseAgent` |
| Chat Completions | `agent_type: chat_completions` | HTTP endpoint compatible with OpenAI format |
| A2A Protocol | `agent_type: a2a` | Agent-to-Agent protocol endpoints |

Source: [README.md:57-68]()

### Chat Completions Configuration

```yaml
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
```

### A2A Protocol Configuration

```yaml
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
```

Source: [README.md:57-68]()

## Tool Call Capture

ArkSim provides two mechanisms for capturing tool calls:

### Response-Based Capture (Default)

The agent returns an `AgentResponse` containing explicit tool calls:

```python
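from arksim.simulation_engine.tool_types import AgentResponse

async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    # `self.agent` and `extract_tool_calls` are placeholders for your own
    # framework's runner and its tool-call conversion helper.
    run_result = self.agent.run(user_query)
    return AgentResponse(
        content=run_result.text,
        tool_calls=extract_tool_calls(run_result),
    )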
```

### Tracing-Based Capture (Automatic)

For agents using the ArkSim tracing processor, tool calls are captured automatically without modifying the agent response:

```python
# At module load
from arksim.simulation_engine.tracing import ArksimTracingProcessor
agent.register(ArksimTracingProcessor())

# Agent returns plain str
async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    return self.agent.run(user_query)  # Returns plain string
```

Source: [examples/customer-service/README.md:1-35]()

## Configuration

### CLI Usage

```bash
# Combined simulation and evaluation
arksim simulate-evaluate config.yaml

# Separate steps
arksim simulate config_simulate.yaml
arksim evaluate config_evaluate.yaml
```

Source: [examples/customer-service/README.md:8-14]()

### Trace Receiver Configuration

```yaml
trace_receiver:
  enabled: true
  wait_timeout: 5
```

When `trace_receiver.enabled` is false, ArkSim only captures tool calls from `AgentResponse`.

Source: [examples/customer-service/README.md:30-35]()

## Integration Examples

ArkSim provides pre-built integrations for popular agent frameworks:

| Framework | Package | Example Path |
|-----------|---------|--------------|
| LangGraph | `langgraph`, `langchain-openai` | `examples/integrations/langgraph/` |
| AutoGen | `autogen-agentchat`, `autogen-ext[openai]` | `examples/integrations/autogen/` |
| Claude Agent SDK | `claude-agent-sdk` | `examples/integrations/claude-agent-sdk/` |
| CrewAI | `crewai` | `examples/integrations/crewai/` |
| Pydantic AI | `pydantic-ai` | `examples/integrations/pydantic-ai/` |
| LangChain | `langgraph`, `langchain-openai` | `examples/integrations/langchain/` |

Sources: [examples/integrations/langgraph/README.md](), [examples/integrations/autogen/README.md](), [examples/integrations/claude-agent-sdk/README.md](), [examples/integrations/crewai/README.md](), [examples/integrations/pydantic-ai/README.md](), [examples/integrations/langchain/README.md]()

## Output Format

Simulation results are written to `./results/simulation/simulation.json` containing:

- Complete conversation transcripts
- Captured tool calls with arguments and results
- Turn-by-turn timing information
- User profile data
- Scenario metadata

Source: [examples/integrations/dify/README.md:18-20]()

## Custom Metrics Support

The simulation engine supports custom quantitative and qualitative metrics through the evaluator. Custom metric files can be added to `custom_metrics/` directories and referenced in the configuration.

Source: [examples/customer-service/custom_metrics.py:1-20]()

## Forward Compatibility

Both `ToolCall` and `AgentResponse` declare `extra="ignore"` in their Pydantic configuration. This ensures that snapshots from future ArkSim versions containing new fields can be loaded by older versions without raising `ValidationError`.

```python
model_config = ConfigDict(extra="ignore")
```

Sources: [arksim/simulation_engine/tool_types.py:26-28](), [arksim/simulation_engine/tool_types.py:45-47]()
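
The effect of ignoring extra fields can be imitated with the standard library, which may help clarify what the setting buys. In this sketch `ToolCallRecord` and `load_ignoring_extras` are hypothetical, and `latency_ms` stands in for a field that a newer version might add; Pydantic performs the equivalent filtering itself when `extra="ignore"` is set.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ToolCallRecord:
    """Stand-in for an older model that knows only three fields."""
    id: str
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def load_ignoring_extras(payload: dict[str, Any]) -> ToolCallRecord:
    """Build the record from known keys and silently drop the rest."""
    known = {f.name for f in fields(ToolCallRecord)}
    return ToolCallRecord(**{key: value for key, value in payload.items() if key in known})


# `latency_ms` is an invented field that this older model has never seen.
snapshot = {"id": "call-1", "name": "lookup_order", "latency_ms": 42}
record = load_ignoring_extras(snapshot)
```

Without the filtering step, constructing the record from the newer snapshot would raise an error on the unexpected key.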

---

<a id='evaluation-system'></a>

## Evaluation System

### Related Pages

Related topics: [System Architecture Overview](#arch-overview), [Simulation Engine](#simulation-engine)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [arksim/evaluator/evaluator.py](https://github.com/arklexai/arksim/blob/main/arksim/evaluator/evaluator.py)
- [arksim/evaluator/builtin_metrics.py](https://github.com/arklexai/arksim/blob/main/arksim/evaluator/builtin_metrics.py)
- [arksim/evaluator/base_metric.py](https://github.com/arklexai/arksim/blob/main/arksim/evaluator/base_metric.py)
- [arksim/evaluator/tool_call_metrics.py](https://github.com/arklexai/arksim/blob/main/arksim/evaluator/tool_call_metrics.py)
- [arksim/evaluator/trajectory_matching.py](https://github.com/arklexai/arksim/blob/main/arksim/evaluator/trajectory_matching.py)
- [arksim/evaluator/thresholds.py](https://github.com/arklexai/arksim/blob/main/arksim/evaluator/thresholds.py)
- [arksim/utils/html_report/generate_html_report.py](https://github.com/arklexai/arksim/blob/main/arksim/utils/html_report/generate_html_report.py)
</details>

# Evaluation System

The Evaluation System in ArkSim is responsible for scoring simulated conversations between users and agents. It measures goal completion, helpfulness, coherence, and other quality metrics to identify where agents succeed and where they fail.

## Overview

After simulation generates conversation transcripts, the evaluator analyzes each conversation and assigns scores across multiple dimensions. The system supports both built-in metrics and custom-defined metrics.

**Purpose:**
- Quantify agent performance objectively
- Identify specific failure patterns
- Generate actionable HTML reports for debugging

**Scope:**
- Scores conversations on a 0.0-1.0 scale
- Computes weighted final scores combining multiple metrics
- Categorizes outcomes as done, partial failure, or failed
- Supports LLM-based evaluation (using configured providers like OpenAI)

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

## Architecture

```mermaid
graph TD
    A[Simulation Results] --> B[Evaluator]
    B --> C[Built-in Metrics]
    B --> D[Custom Metrics]
    B --> E[Tool Call Metrics]
    C --> F[Final Scores]
    D --> F
    E --> F
    F --> G[HTML Report Generator]
    G --> H[evaluation/final_report.html]
```

### Core Components

| Component | File | Purpose |
|-----------|------|---------|
| Evaluator | `evaluator.py` | Main orchestration of evaluation pipeline |
| Base Metric | `base_metric.py` | Abstract base classes for metrics |
| Built-in Metrics | `builtin_metrics.py` | Standard metrics (faithfulness, helpfulness, etc.) |
| Tool Call Metrics | `tool_call_metrics.py` | Evaluation of tool usage patterns |
| Trajectory Matching | `trajectory_matching.py` | Compare expected vs actual agent trajectories |
| Thresholds | `thresholds.py` | Score classification logic |
| HTML Report | `generate_html_report.py` | Generate visual evaluation reports |

## Evaluation Workflow

```mermaid
sequenceDiagram
    participant S as Simulator
    participant E as Evaluator
    participant M as Metrics
    participant R as Report Generator
    
    S->>E: Simulation results (JSON)
    E->>M: For each conversation
    M->>M: Run quantitative metrics
    M->>M: Run qualitative metrics
    M->>E: Per-metric scores
    E->>E: Calculate final scores
    E->>E: Determine status
    E->>R: Score data
    R->>R: Generate HTML
```

### Step-by-Step Process

1. **Load Simulation Data**: Read conversation transcripts from simulation output
2. **Select Metrics**: Determine which built-in and custom metrics to run
3. **Score Each Conversation**: Apply metrics to score goal completion, behavior, etc.
4. **Compute Final Scores**: Calculate weighted averages (goal_completion_weight=0.25, turn_success_ratio_weight=0.75)
5. **Classify Status**: Assign done/partial_failure/failed based on thresholds
6. **Generate Report**: Create HTML report with detailed breakdowns

Source: [arksim/utils/html_report/report_template.html](https://github.com/arklexai/arksim/blob/main/arksim/utils/html_report/report_template.html)

## Scoring System

### Score Ranges

| Score Type | Range | Description |
|------------|-------|-------------|
| Quantitative | 0.0 - 1.0 | Numeric scores for measurable criteria |
| Qualitative | Label-based | Categorical results (compliant, professional, pass, etc.) |
| Goal Completion | 0.0 - 1.0 | Whether agent completed user goal |
| Final Score | 0.0 - 1.0 | Weighted combination of metrics |

### Status Classification

| Status | Condition | Description |
|--------|-----------|-------------|
| Done | final_score == 1.0 | Perfect performance, goal completed |
| Partial Failure | 0.6 <= final_score < 1.0 | Acceptable but with some failures |
| Failed | final_score < 0.6 | Poor performance requiring attention |

Source: [arksim/utils/html_report/report_template.html:85-87](https://github.com/arklexai/arksim/blob/main/arksim/utils/html_report/report_template.html)
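
The classification table above translates into a short function. This is a sketch built from the table's boundaries, not the contents of `thresholds.py`; `classify_status` and the sample scores are hypothetical.

```python
from collections import Counter


def classify_status(final_score: float, pass_threshold: float = 0.6) -> str:
    """Map a final score to a status label using the table's boundaries."""
    if final_score == 1.0:
        return "done"
    if final_score >= pass_threshold:
        return "partial_failure"
    return "failed"


# Hypothetical final scores for four evaluated conversations.
scores = [1.0, 0.85, 0.6, 0.4]
status_counts = Counter(classify_status(score) for score in scores)
```

Tallying statuses this way gives the kind of summary counts a report header would show.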

## Built-in Metrics

The system provides seven built-in metrics that can be selected via configuration:

| Metric | Purpose |
|--------|---------|
| `faithfulness` | Did the agent provide factually accurate information? |
| `helpfulness` | Was the agent's response useful to the user? |
| `coherence` | Were responses logically connected and consistent? |
| `verbosity` | Did the agent maintain appropriate response length? |
| `relevance` | Did responses address the user's actual query? |
| `goal_completion` | Did the agent help the user accomplish their goal? |
| `agent_behavior_failure` | Did the agent exhibit any problematic behaviors? |

### Metric Selection

From the frontend, users can select which built-in metrics to run:

```html
<template x-for="m in ['faithfulness', 'helpfulness', 'coherence', 'verbosity', 'relevance', 'goal_completion', 'agent_behavior_failure']" :key="m">
```

If no metrics are selected, all built-in metrics run by default.

Source: [arksim/ui/frontend/index.html](https://github.com/arklexai/arksim/blob/main/arksim/ui/frontend/index.html)

## Custom Metrics

Developers can define custom evaluation metrics by creating a Python module.

### Creating Custom Metrics

```python
from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
)

class MyMetric(QuantitativeMetric):
    name = "my_custom_metric"
    schema = MySchema  # Pydantic model
    
    async def evaluate(self, input: ScoreInput) -> QuantResult:
        # Evaluation logic
        return QuantResult(...)
```

### Quantitative vs Qualitative Metrics

| Type | Base Class | Output |
|------|------------|--------|
| Quantitative | `QuantitativeMetric` | `QuantResult` with float value and reason |
| Qualitative | `QualitativeMetric` | `QualResult` with categorical label |

### Configuration

Add custom metrics to `config.yaml`:

```yaml
custom_metrics_file_paths:
  - /path/to/custom_metrics.py

metrics_to_run:
  - my_custom_metric  # optional; runs all if omitted
```

Source: [examples/customer-service/custom_metrics.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_metrics.py)

### Example: Verification Compliance

```python
class VerificationComplianceSchema(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str
```

The evaluation prompt instructs the LLM to score:
- **Identity Verification**: Did the agent verify customer identity before actions?
- **Action Gating**: Did the agent gate sensitive actions behind verification?

Source: [examples/customer-service/custom_metrics.py:25-35](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_metrics.py)

### Example: E-commerce Conversion

```python
class ConversionSchema(BaseModel):
    intent_strength: float
    conversion_outcome: float
    evidence: list[str]
    reason: str
```

Metrics track:
- **Intent Strength**: How ready the shopper is to buy
- **Conversion Outcome**: Whether the agent achieved a purchase decision

Source: [examples/e-commerce/custom_metrics.py](https://github.com/arklexai/arksim/blob/main/examples/e-commerce/custom_metrics.py)

## HTML Report

The evaluation system generates a detailed HTML report (`evaluation/final_report.html`) containing:

### Report Sections

| Section | Content |
|---------|---------|
| Summary Statistics | Overall pass/fail rates, average scores |
| Conversations Table | Per-conversation scores, status badges |
| Detailed Breakdown | Goal completion, final scores, failure reasons |
| Score Reasons | LLM-generated explanations for each metric |

### Report Features

- **Interactive Table**: Sort and filter conversations
- **Score Details**: Expandable sections showing metric-by-metric breakdown
- **Status Badges**: Visual indicators (done/partial/failed)
- **Tooltip Explanations**: Hover info for column headers

### Score Display Logic

```javascript
const POSITIVE_LABELS = ['compliant', 'professional', 'pass', 'good', 'complete', 'no failure', 'ok'];
const NEGATIVE_LABELS = ['flagged', 'unprofessional', 'fail', 'error', 'poor', 'missing', 'violated', 'partial'];
```

Qualitative scores are automatically classified as positive, negative, or neutral based on label matching.

Source: [arksim/utils/html_report/report_template.html:78-81](https://github.com/arklexai/arksim/blob/main/arksim/utils/html_report/report_template.html)
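
A Python rendering of the same idea is sketched below. The label lists are copied from the snippet above; `label_sentiment` is hypothetical, and it assumes a case-insensitive exact match, since the report's actual matching rule is not shown here.

```python
POSITIVE_LABELS = {"compliant", "professional", "pass", "good", "complete", "no failure", "ok"}
NEGATIVE_LABELS = {"flagged", "unprofessional", "fail", "error", "poor", "missing", "violated", "partial"}


def label_sentiment(label: str) -> str:
    """Classify a qualitative label as positive, negative, or neutral."""
    normalized = label.strip().lower()
    if normalized in POSITIVE_LABELS:
        return "positive"
    if normalized in NEGATIVE_LABELS:
        return "negative"
    return "neutral"


sentiments = [label_sentiment(label) for label in ("Compliant", "partial", "escalated")]
```

Exact matching is chosen here because substring matching would be ambiguous: "no failure" contains "fail", and "unprofessional" contains "professional".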

## Configuration

### Evaluation Configuration Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `evalProvider` | string | - | LLM provider for evaluation |
| `evalModel` | string | - | Model to use for scoring |
| `metricsToRun` | list | all | Which built-in metrics to execute |
| `customMetricsFilePaths` | list | [] | Paths to custom metric modules |
| `evalNumWorkers` | int | auto | Parallel evaluation workers |

### Provider Selection

The evaluator supports multiple LLM providers:
- OpenAI
- Azure OpenAI
- Custom endpoints (via `chat_completions` type)
- A2A protocol agents

Source: [arksim/ui/frontend/index.html](https://github.com/arklexai/arksim/blob/main/arksim/ui/frontend/index.html)

## Integration with Simulation

The evaluation system is typically invoked via the CLI:

```bash
# Combined simulation and evaluation
arksim simulate-evaluate config.yaml

# Separate steps
arksim simulate config_simulate.yaml
arksim evaluate config_evaluate.yaml
```

### Pipeline Flow

```mermaid
graph LR
    A[config.yaml] --> B[Simulator]
    B --> C[simulation.json]
    C --> D[Evaluator]
    D --> E[evaluation/]
    E --> F[final_report.html]
    E --> G[scores.json]
```

Results are written to `./results/simulation/simulation.json` for simulation and `./results/evaluation/` for evaluation output.

Source: [examples/integrations/dify/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/dify/README.md)

## Advanced Features

### Tool Call Capture

ArkSim can automatically capture tool calls via tracing:

```yaml
trace_receiver:
  enabled: true
  wait_timeout: 5
```

When enabled, tool calls are captured automatically without requiring explicit return in `AgentResponse`.

### Trajectory Matching

The `trajectory_matching.py` module compares expected agent trajectories against actual behavior, useful for validating that agents follow prescribed action sequences.
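
To make the idea concrete, the sketch below compares an expected sequence of tool names with the sequence an agent actually produced, in two modes. Both functions and the tool names are hypothetical, and the matching modes are assumptions; the strategies implemented in `trajectory_matching.py` may differ.

```python
from collections.abc import Sequence


def exact_match(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """True when the agent called exactly the expected tools in order."""
    return list(expected) == list(actual)


def in_order_match(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """True when the expected tools appear in order, allowing extra calls between them."""
    remaining = iter(actual)
    # `step in remaining` advances the iterator, so later steps must come later.
    return all(step in remaining for step in expected)


expected_calls = ["verify_identity", "issue_refund"]
actual_calls = ["verify_identity", "lookup_order", "issue_refund"]

strict_ok = exact_match(expected_calls, actual_calls)
relaxed_ok = in_order_match(expected_calls, actual_calls)
```

The relaxed mode tolerates an extra `lookup_order` call, while the strict mode treats any deviation from the prescribed sequence as a mismatch.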

### Thresholds

The `thresholds.py` module defines score boundaries and classification logic for determining pass/fail conditions.

## Summary

The Evaluation System provides comprehensive, configurable scoring of agent conversations:

- **Flexible Metric System**: Built-in metrics cover common quality dimensions; custom metrics extend evaluation to domain-specific criteria
- **LLM-based Scoring**: Uses configurable language models to generate nuanced, explainable scores
- **Visual Reporting**: HTML reports make results easy to understand and act upon
- **Status Classification**: Automatic categorization into done/partial_failure/failed for quick assessment
- **Integration-ready**: Works with Python agents, chat completions endpoints, and A2A protocol agents

---

<a id='agent-types'></a>

## Agent Types and Integration

### Related Pages

Related topics: [LLM Provider Integration](#llm-providers), [Tool Call Capture](#tool-call-capture), [Configuration System](#configuration)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/arklexai/arksim/blob/main/README.md) - Main documentation with agent type examples
- [arksim/cli.py](https://github.com/arklexai/arksim/blob/main/arksim/cli.py) - CLI commands for agent initialization
- [arksim/ui/frontend/index.html](https://github.com/arklexai/arksim/blob/main/arksim/ui/frontend/index.html) - UI Agent Config section
- [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md) - Custom agent implementation example
- [examples/integrations/dify/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/dify/README.md) - HTTP agent integration example
- [examples/integrations/rasa/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/rasa/README.md) - External agent server integration
</details>

# Agent Types and Integration

ArkSim supports multiple agent connection types to accommodate different architectures and deployment scenarios. This page documents the available agent types, their configuration methods, and integration patterns.

## Overview

ArkSim provides a flexible agent integration system that enables testing of various agent implementations through a unified simulation interface. The simulator communicates with agents via standardized protocols and captures responses for evaluation.

The agent integration system consists of:
- A `BaseAgent` abstract class that defines the agent interface
- Multiple client implementations for different connection types
- A factory pattern for agent instantiation based on configuration
- Support for tool call capture through both explicit responses and tracing

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)
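
The factory pattern listed above can be pictured roughly as follows. This is a minimal sketch, not ArkSim's actual code: `AGENT_REGISTRY`, `create_agent`, and the three stub classes are hypothetical names introduced for illustration. Only the `agent_type` values and the `agent_config` keys come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubAgent:
    """Hypothetical stand-in for a concrete agent client."""
    name: str
    endpoint: Optional[str] = None


class CustomStub(StubAgent):
    """Would wrap a user-supplied Python class."""


class ChatCompletionsStub(StubAgent):
    """Would POST to an OpenAI-compatible endpoint."""


class A2AStub(StubAgent):
    """Would speak the A2A protocol."""


# Hypothetical registry keyed by the agent_type values listed on this page.
AGENT_REGISTRY = {
    "custom": CustomStub,
    "chat_completions": ChatCompletionsStub,
    "a2a": A2AStub,
}


def create_agent(agent_config: dict) -> StubAgent:
    """Pick a client class from agent_type and build it from the config."""
    agent_type = agent_config.get("agent_type", "custom")
    if agent_type not in AGENT_REGISTRY:
        raise ValueError(f"Unknown agent_type: {agent_type!r}")
    endpoint = agent_config.get("api_config", {}).get("endpoint")
    return AGENT_REGISTRY[agent_type](
        name=agent_config.get("agent_name", "agent"), endpoint=endpoint
    )
```

A dictionary lookup keeps the dispatch in one place, so supporting another connection type would only mean adding a registry entry.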

## Supported Agent Types

ArkSim supports three primary agent types, each suited for different deployment scenarios.

| Agent Type | Description | Use Case |
|------------|-------------|----------|
| `custom` | Python class extending BaseAgent | Custom agent logic, no external server required |
| `chat_completions` | HTTP endpoint with OpenAI-compatible API | Existing REST APIs, external agent services |
| `a2a` | Agent-to-Agent protocol endpoint | Multi-agent systems, A2A-compliant agents |

Source: [arksim/cli.py:100-109](https://github.com/arklexai/arksim/blob/main/arksim/cli.py)

## Custom Agent (Python Class)

The `custom` agent type is the default integration method. It requires implementing a Python class that extends `BaseAgent` and implements the `execute` method.

### Implementation Pattern

```python
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"
```

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

### Returning Tool Calls

To enable tool call evaluation, return an `AgentResponse` object instead of a plain string. This allows the evaluator to assess whether the agent correctly invoked required tools.

```python
from arksim.simulation_engine.tool_types import AgentResponse, ToolCall

async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
    tool_calls = [
        ToolCall(
            id="call_1",  # required by the ToolCall model
            name="search_knowledge_base",
            arguments={"query": user_query}
        )
    ]
    return AgentResponse(
        content="Found relevant information in the knowledge base.",
        tool_calls=tool_calls
    )
```

### Traced Agent Variant

For agents built on an SDK that exposes tracing hooks (the example uses the OpenAI Agents SDK), ArkSim can capture tool calls automatically through the `TracingProcessor` interface, so the agent does not need to return them explicitly in `AgentResponse`.

```
Simulator sets routing context -> agent.execute() runs normally
-> SDK fires TracingProcessor.on_span_end -> arksim captures -> evaluator scores
```

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

## Chat Completions Agent (HTTP API)

The `chat_completions` agent type connects to any HTTP endpoint implementing an OpenAI-compatible chat completions interface.

### Configuration

```yaml
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
```

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

### Request Format

ArkSim sends requests following the OpenAI chat completions format:

```json
{
  "model": "agent-model",
  "messages": [
    {"role": "user", "content": "user query"}
  ]
}
```

### Response Handling

The endpoint should return responses in the standard chat completions format:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "agent response"
      }
    }
  ]
}
```

Source: [examples/integrations/dify/README.md](https://github.com/arklexai/arksim/blob/main/examples/integrations/dify/README.md)
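
To make the two payload shapes concrete, the sketch below builds a request body and pulls the assistant text out of a response using only the standard library. `build_request` and `extract_reply` are hypothetical helper names, and whether ArkSim sends the full conversation history on every turn is not stated here.

```python
import json


def build_request(history: list, model: str = "agent-model") -> dict:
    """Turn (role, content) pairs into a chat completions request body."""
    return {
        "model": model,
        "messages": [{"role": role, "content": content} for role, content in history],
    }


def extract_reply(response: dict) -> str:
    """Return the assistant text from the first choice."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"].get("content") or ""


body = build_request([("user", "user query")])
raw = '{"choices": [{"message": {"role": "assistant", "content": "agent response"}}]}'
reply = extract_reply(json.loads(raw))
```

An endpoint that returns at least one choice with a `message.content` string should satisfy a parser of this shape.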

## A2A Protocol Agent

The `a2a` agent type connects to agents implementing the Agent-to-Agent (A2A) protocol, enabling integration with multi-agent systems.

### Configuration

```yaml
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
```

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

### A2A with Tool Calls

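A2A agents can also surface tool calls for evaluation. As described under [Tool Call Capture](#tool-call-capture), the agent embeds tool call records in the metadata of its task artifacts and declares the `A2AToolCaptureExtension` in its agent card; ArkSim extracts them alongside the response content.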

## CLI Initialization

ArkSim provides a CLI command to scaffold agent implementations:

```bash
arksim init --agent-type <type>
```

The `--agent-type` flag accepts the following values:
- `custom` (default) - Generates a Python agent file
- `chat_completions` - Configures HTTP endpoint connection
- `a2a` - Configures A2A protocol connection

The `--force` flag overwrites existing files.

Source: [arksim/cli.py:94-118](https://github.com/arklexai/arksim/blob/main/arksim/cli.py)
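
The flag behavior described above can be reproduced with a few lines of `argparse`. This is only an illustration of the documented interface; it is an assumption that the CLI looks like this internally, and ArkSim may use a different argument-parsing library.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the `arksim init` interface."""
    parser = argparse.ArgumentParser(prog="arksim")
    sub = parser.add_subparsers(dest="command", required=True)
    init = sub.add_parser("init", help="Scaffold agent files")
    init.add_argument(
        "--agent-type",
        choices=["custom", "chat_completions", "a2a"],
        default="custom",
    )
    init.add_argument("--force", action="store_true", help="Overwrite existing files")
    return parser


args = build_parser().parse_args(["init", "--agent-type", "a2a", "--force"])
defaults = build_parser().parse_args(["init"])
```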

## Integration Architecture

The following diagram illustrates how ArkSim communicates with different agent types:

```mermaid
graph TD
    subgraph ArkSim["ArkSim"]
        Simulator["Simulator Engine"]
        Evaluator["Evaluator"]
    end
    
    subgraph AgentTypes["Agent Implementations"]
        CustomAgent["Custom Agent (Python)"]
        HTTPAgent["Chat Completions API"]
        A2AAgent["A2A Protocol Agent"]
    end
    
    Simulator -->|execute| CustomAgent
    Simulator -->|HTTP POST| HTTPAgent
    Simulator -->|A2A Protocol| A2AAgent
    
    CustomAgent -->|AgentResponse| Evaluator
    HTTPAgent -->|JSON Response| Evaluator
    A2AAgent -->|A2A Response| Evaluator
```

## Agent Configuration in UI

The ArkSim web UI provides an "Agent Config" section for configuring agent connections. This section is dynamically loaded from the configuration YAML file.

```html
<!-- Agent Config -->
<div class="t-surface rounded-xl border t-border p-5 mb-4">
    <h2 class="font-semibold t-heading mb-1">Agent Config</h2>
    <p class="text-xs t-caption mb-3">How arksim connects to your agent.</p>
</div>
```

Source: [arksim/ui/frontend/index.html](https://github.com/arklexai/arksim/blob/main/arksim/ui/frontend/index.html)

## Framework Integrations

ArkSim includes pre-built integrations for popular agent frameworks through example projects:

| Framework | Integration File | Protocol |
|-----------|------------------|----------|
| LangChain/LangGraph | `custom_agent.py` | Python class |
| CrewAI | `custom_agent.py` | Python class |
| AutoGen | `custom_agent.py` | Python class |
| Claude Agent SDK | `custom_agent.py` | Python class |
| Pydantic AI | `custom_agent.py` | Python class |
| LlamaIndex | `custom_agent.py` | Python class |
| Smolagents | `custom_agent.py` | Python class |
| OpenAI Agents SDK | `custom_agent.py` | Python class |
| Dify | `custom_agent.py` | HTTP API |
| Rasa | `custom_agent.py` | HTTP API |
| OpenClaw | `config.yaml` | Chat Completions |

Source: [examples/integrations/*/README.md](https://github.com/arklexai/arksim/tree/main/examples/integrations)

## Running Simulations with Different Agent Types

### Single Command

```bash
arksim simulate-evaluate config.yaml
```

### Separate Steps

```bash
# Step 1: Simulate
arksim simulate config_simulate.yaml

# Step 2: Evaluate
arksim evaluate config_evaluate.yaml
```

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

## Best Practices

1. **Tool Call Capture**: For accurate evaluation, ensure your agent returns tool calls either explicitly via `AgentResponse` or implicitly through tracing instrumentation.

2. **Async Implementation**: All agent implementations should use async/await patterns for proper integration with the simulation engine.

3. **Error Handling**: Agents should handle errors gracefully and return meaningful error messages that can be evaluated by the system.

4. **Configuration Management**: Use environment variables for sensitive configuration values like API keys and endpoints.
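
Practices 2 to 4 can be combined in one small sketch: an async entry point that reads its endpoint from the environment and turns failures into a reply the evaluator can still score. `AGENT_ENDPOINT` and `call_backend` are hypothetical names, and the function is shown standalone rather than as a `BaseAgent` method so that it runs without ArkSim installed.

```python
import asyncio
import os

# Hypothetical variable name; the fallback is the endpoint used earlier on this page.
AGENT_ENDPOINT = os.environ.get(
    "AGENT_ENDPOINT", "http://localhost:8000/v1/chat/completions"
)


async def call_backend(user_query: str) -> str:
    """Placeholder for real agent logic; rejects empty input."""
    if not user_query.strip():
        raise ValueError("empty query")
    return f"echo: {user_query}"


async def execute(user_query: str) -> str:
    """Return the backend reply, or a readable error message on failure."""
    try:
        return await call_backend(user_query)
    except Exception as exc:
        return f"Sorry, I could not process that request ({type(exc).__name__}: {exc})."


ok_reply = asyncio.run(execute("hi"))
error_reply = asyncio.run(execute("  "))
```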

---

<a id='llm-providers'></a>

## LLM Provider Integration

### Related Pages

Related topics: [Agent Types and Integration](#agent-types), [Configuration System](#configuration)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [arksim/llms/chat/providers/openai.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/providers/openai.py)
- [arksim/llms/chat/providers/anthropic.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/providers/anthropic.py)
- [arksim/llms/chat/providers/google.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/providers/google.py)
- [arksim/llms/chat/providers/azure_openai.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/providers/azure_openai.py)
- [arksim/llms/chat/llm.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/llm.py)
- [arksim/llms/chat/base/base_llm.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/base/base_llm.py)
</details>

# LLM Provider Integration

## Overview

The LLM Provider Integration system in ArkSim provides a unified abstraction layer for connecting to various Large Language Model (LLM) providers. This modular architecture enables the simulator to interact with different AI backends while maintaining a consistent interface for chat completions, streaming responses, and token usage tracking.

The system follows a provider-based pattern where each supported LLM service (OpenAI, Anthropic, Google, Azure OpenAI) implements a common interface defined in the base class. This design allows users to swap between providers without changing the core simulation logic.

## Architecture

### System Components

```mermaid
graph TD
    A[Simulation Engine] --> B[LLM Manager<br/>arksim/llms/chat/llm.py]
    B --> C[Base LLM<br/>arksim/llms/chat/base/base_llm.py]
    C --> D[OpenAI Provider<br/>providers/openai.py]
    C --> E[Anthropic Provider<br/>providers/anthropic.py]
    C --> F[Google Provider<br/>providers/google.py]
    C --> G[Azure OpenAI Provider<br/>providers/azure_openai.py]
    
    H[Configuration YAML] --> B
    I[Environment Variables] --> D
    I --> E
    I --> F
    I --> G
```

### Provider Class Hierarchy

```mermaid
graph TD
    A[BaseLLM<br/>base/base_llm.py] --> B[OpenAIProvider<br/>providers/openai.py]
    A --> C[AnthropicProvider<br/>providers/anthropic.py]
    A --> D[GoogleProvider<br/>providers/google.py]
    A --> E[AzureOpenAIProvider<br/>providers/azure_openai.py]
    
    F[Provider Enum] --> A
    G[ChatMessage Model] --> A
    H[ChatCompletionResponse] --> A
```

## Supported Providers

ArkSim supports the following LLM providers through dedicated provider implementations:

| Provider | Provider Class | API Style | Streaming Support |
|----------|----------------|-----------|-------------------|
| OpenAI | `OpenAIProvider` | OpenAI API | Yes |
| Anthropic | `AnthropicProvider` | Anthropic API | Yes |
| Google | `GoogleProvider` | Google AI API | Yes |
| Azure OpenAI | `AzureOpenAIProvider` | Azure API | Yes |

Each provider class inherits from `BaseLLM` and implements provider-specific API call logic while adhering to the common interface contract.

## Base LLM Interface

The `BaseLLM` class defines the contract that all provider implementations must follow. This ensures consistent behavior across different LLM backends.

### Core Methods

| Method | Purpose | Parameters |
|--------|---------|------------|
| `chat()` | Send a chat completion request | `messages`, `model`, `temperature`, `max_tokens`, `**kwargs` |
| `chat_stream()` | Stream chat completion responses | `messages`, `model`, `temperature`, `max_tokens`, `**kwargs` |
| `count_tokens()` | Calculate token usage for messages | `messages`, `model` |

### Data Models

The base module defines essential data structures used across all providers:

| Model | Purpose |
|-------|---------|
| `ChatMessage` | Represents a single message with role and content |
| `ChatCompletionResponse` | Wraps the API response from providers |
| `Provider` | Enum identifying supported LLM providers |
| `ModelInfo` | Metadata about available models per provider |
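
The contract described in the two tables above might look roughly like this. It is a reduced sketch under stated assumptions: the method and model names come from this page, but the exact signatures, the default values, the synchronous style, and the toy `EchoProvider` are illustrative and not taken from `base_llm.py`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass
class ChatMessage:
    role: str
    content: str


@dataclass
class ChatCompletionResponse:
    content: str
    model: str


class BaseLLM(ABC):
    """Assumed shape of the common provider interface."""

    @abstractmethod
    def chat(self, messages, model, temperature=0.7, max_tokens=2048, **kwargs) -> ChatCompletionResponse:
        ...

    @abstractmethod
    def chat_stream(self, messages, model, temperature=0.7, max_tokens=2048, **kwargs) -> Iterator[str]:
        ...

    @abstractmethod
    def count_tokens(self, messages, model) -> int:
        ...


class EchoProvider(BaseLLM):
    """Toy provider that repeats the last message and makes no network calls."""

    def chat(self, messages, model, temperature=0.7, max_tokens=2048, **kwargs):
        return ChatCompletionResponse(content=messages[-1].content, model=model)

    def chat_stream(self, messages, model, temperature=0.7, max_tokens=2048, **kwargs):
        # Yield the reply one whitespace-separated chunk at a time.
        yield from messages[-1].content.split()

    def count_tokens(self, messages, model):
        # Whitespace word count, a crude stand-in for a real tokenizer.
        return sum(len(m.content.split()) for m in messages)


llm = EchoProvider()
msgs = [ChatMessage("user", "hello there world")]
```

Because callers only depend on `BaseLLM`, swapping `EchoProvider` for a real provider would not change the calling code.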

## Provider Implementations

### OpenAI Provider

The OpenAI provider connects to OpenAI's API endpoints for chat completions. It supports both standard and streaming responses.

**Configuration Requirements:**

| Parameter | Source | Description |
|-----------|--------|-------------|
| `OPENAI_API_KEY` | Environment variable | API key for authentication |
| `model` | Config/Parameter | Model identifier (e.g., `gpt-4`, `gpt-4-turbo`) |

**API Endpoint:**
```
POST https://api.openai.com/v1/chat/completions
```

Source: [arksim/llms/chat/providers/openai.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/providers/openai.py)

### Anthropic Provider

The Anthropic provider integrates with Anthropic's Claude models through their API. It handles the distinct message format and API conventions used by Anthropic.

**Configuration Requirements:**

| Parameter | Source | Description |
|-----------|--------|-------------|
| `ANTHROPIC_API_KEY` | Environment variable | API key for authentication |
| `model` | Config/Parameter | Model identifier (e.g., `claude-3-opus-20240229`) |

**API Endpoint:**
```
POST https://api.anthropic.com/v1/messages
```

Source: [arksim/llms/chat/providers/anthropic.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/providers/anthropic.py)

### Google Provider

The Google provider connects to Google's Gemini models via the Google AI API.

**Configuration Requirements:**

| Parameter | Source | Description |
|-----------|--------|-------------|
| `GOOGLE_API_KEY` | Environment variable | API key for authentication |
| `model` | Config/Parameter | Model identifier (e.g., `gemini-pro`) |

**API Endpoint:**
```
POST https://generativelanguage.googleapis.com/v1/models/{model}:generateContent
```

Source: [arksim/llms/chat/providers/google.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/providers/google.py)

### Azure OpenAI Provider

The Azure OpenAI provider enables integration with Azure-hosted OpenAI models, supporting enterprise deployments with Azure-specific authentication and endpoint configuration.

**Configuration Requirements:**

| Parameter | Source | Description |
|-----------|--------|-------------|
| `AZURE_OPENAI_API_KEY` | Environment variable | API key for Azure authentication |
| `AZURE_OPENAI_ENDPOINT` | Environment variable | Azure endpoint URL |
| `AZURE_OPENAI_DEPLOYMENT` | Config | Deployment name in Azure |
| `AZURE_OPENAI_API_VERSION` | Config | Azure API version |

Source: [arksim/llms/chat/providers/azure_openai.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/providers/azure_openai.py)

## Configuration

### YAML Configuration Structure

LLM providers are configured through the `config.yaml` file used by the simulator:

```yaml
llm:
  provider: openai  # or anthropic, google, azure_openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 2048
  
agent_config:
  # Agent-specific LLM settings
  provider: anthropic
  model: claude-3-opus-20240229
```

### Environment Variables

| Variable | Used by | Purpose |
|----------|---------|---------|
| `OPENAI_API_KEY` | OpenAI provider; AutoGen and LangChain examples | OpenAI API authentication |
| `ANTHROPIC_API_KEY` | Anthropic provider; Claude Agent SDK example | Anthropic API authentication |
| `GOOGLE_API_KEY` | Google | Google AI API authentication |
| `AZURE_OPENAI_API_KEY` | Azure OpenAI | Azure API authentication |
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI | Azure resource endpoint |
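
A quick pre-flight check can catch a missing key before a long simulation starts. The mapping below restates the table above; `REQUIRED_ENV` and `missing_env` are hypothetical helpers, not part of ArkSim.

```python
import os

# Provider identifiers follow the values shown in the YAML example above.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "google": ["GOOGLE_API_KEY"],
    "azure_openai": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
}


def missing_env(provider: str, env=None) -> list:
    """Return the required variables that are unset or empty for a provider."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
missing = missing_env("azure_openai", {"AZURE_OPENAI_API_KEY": "example"})
```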

## Integration with Custom Agents

The LLM provider system integrates with the custom agent connector pattern used in ArkSim simulations. Custom agents can be configured to use any supported LLM provider.

```mermaid
graph LR
    A[Scenario JSON] --> B[Simulation Engine]
    B --> C[Custom Agent<br/>custom_agent.py]
    C --> D[LLM Provider]
    D --> E[External LLM API]
```

### Integration Examples

ArkSim provides integration examples for various agent frameworks:

| Framework | Example Path | LLM Provider Used |
|-----------|--------------|-------------------|
| LangChain/LangGraph | `examples/integrations/langchain/` | OpenAI |
| Claude Agent SDK | `examples/integrations/claude-agent-sdk/` | Anthropic |
| LlamaIndex | `examples/integrations/llamaindex/` | OpenAI |
| CrewAI | `examples/integrations/crewai/` | OpenAI |
| AutoGen | `examples/integrations/autogen/` | OpenAI |
| Pydantic AI | `examples/integrations/pydantic-ai/` | OpenAI |
| Smolagents | `examples/integrations/smolagents/` | OpenAI |
| Dify | `examples/integrations/dify/` | Custom HTTP |

Each integration demonstrates how to connect the framework's agent to ArkSim's simulation engine while delegating LLM calls to the appropriate provider.

## Usage Flow

```mermaid
sequenceDiagram
    participant User
    participant Config as config.yaml
    participant LLM as LLM Manager
    participant Provider as Provider Class
    participant API as External LLM API
    
    User->>Config: Load configuration
    User->>LLM: Initialize with provider type
    LLM->>Provider: Create provider instance
    User->>LLM: chat(messages, model)
    LLM->>Provider: _chat_completion()
    Provider->>API: HTTP POST request
    API-->>Provider: Completion response
    Provider-->>LLM: Normalized response
    LLM-->>User: ChatCompletionResponse
```

## Adding New Providers

To add support for a new LLM provider:

1. Create a new provider class inheriting from `BaseLLM`
2. Implement the required methods: `chat()`, `chat_stream()`, `count_tokens()`
3. Add the provider to the `Provider` enum in `base_llm.py`
4. Update the provider factory logic in `llm.py` to instantiate the new provider
5. Add integration tests and documentation

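Sources: [arksim/llms/chat/base/base_llm.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/base/base_llm.py), [arksim/llms/chat/llm.py](https://github.com/arklexai/arksim/blob/main/arksim/llms/chat/llm.py)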
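
Steps 1 to 4 can be sketched in miniature as follows. Everything here is a simplified assumption: `MyProvider`, `MY_PROVIDER`, `_FACTORY`, and `get_llm` are hypothetical, the base class is reduced to a single method, and the real factory logic in `llm.py` may be organized differently.

```python
from enum import Enum


class Provider(str, Enum):
    OPENAI = "openai"
    MY_PROVIDER = "my_provider"  # step 3: hypothetical new enum member


class BaseLLM:
    """Reduced stand-in for the real base class."""

    def chat(self, messages, model, **kwargs):
        raise NotImplementedError


class MyProvider(BaseLLM):  # steps 1-2: subclass and implement the methods
    def chat(self, messages, model, **kwargs):
        # A real provider would call its HTTP API here.
        return f"[{model}] {messages[-1]['content']}"


# Step 4: hypothetical factory table mapping enum members to classes.
_FACTORY = {Provider.MY_PROVIDER: MyProvider}


def get_llm(provider: str) -> BaseLLM:
    key = Provider(provider)  # ValueError if the name is not in the enum
    if key not in _FACTORY:
        raise NotImplementedError(f"No provider class registered for {key.value}")
    return _FACTORY[key]()


answer = get_llm("my_provider").chat([{"role": "user", "content": "hi"}], model="m")
```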

## Error Handling

Provider implementations handle common error scenarios:

| Error Type | Handling Strategy |
|------------|-------------------|
| Authentication failures | Raise `AuthenticationError` with helpful message |
| Rate limiting | Implement automatic retry with backoff |
| Invalid request parameters | Raise `ValidationError` with parameter details |
| Network timeouts | Retry with exponential backoff |
| Model not found | Raise `ModelNotFoundError` listing available models |
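
The retry rows in the table describe a standard exponential backoff loop. The sketch below shows one common way to write it; `RateLimitError` and `with_backoff` are hypothetical names, and the actual retry policy used by the providers (attempt count, delays, jitter) is not documented here.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error type signalling a retryable failure."""


def with_backoff(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying retryable errors with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except (RateLimitError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration: fail twice, then succeed. Delays are recorded instead of slept.
delays = []
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("slow down")
    return "ok"


result = with_backoff(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.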

## Best Practices

1. **API Key Security**: Store API keys in environment variables, never in configuration files committed to version control. ArkSim automatically loads keys from environment variables.

2. **Token Tracking**: Use the `count_tokens()` method to monitor token usage and estimate costs before running large-scale simulations.

3. **Streaming for Large Responses**: Enable streaming (`chat_stream()`) for scenarios expecting long agent responses to improve perceived responsiveness.

4. **Provider Selection**: Choose Azure OpenAI for enterprise deployments requiring compliance certifications and dedicated infrastructure.

5. **Model Selection**: Refer to the integration examples for recommended model configurations per provider and use case.

---

<a id='tool-call-capture'></a>

## Tool Call Capture

### Related Pages

Related topics: [Evaluation System](#evaluation-system), [Agent Types and Integration](#agent-types)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [arksim/simulation_engine/tool_types.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/tool_types.py)
- [arksim/simulation_engine/agent/clients/a2a.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/agent/clients/a2a.py)
- [arksim/tracing/openai.py](https://github.com/arklexai/arksim/blob/main/arksim/tracing/openai.py)
- [examples/customer-service/custom_agent.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_agent.py)
- [examples/customer-service/traced_agent.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/traced_agent.py)
- [examples/customer-service/a2a_server/agent.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/a2a_server/agent.py)
</details>

# Tool Call Capture

Tool Call Capture is a core mechanism in ArkSim that observes and records every tool or function invocation made by an agent during a simulation. This captured data feeds directly into the evaluator, enabling metrics like tool call accuracy, error detection, and trajectory analysis.

## Overview

Tool Call Capture serves as the bridge between agent execution and evaluation. Without accurate tool call capture, the evaluator cannot determine whether an agent:

- Invoked the correct tools
- Passed valid arguments
- Handled errors appropriately
- Followed the expected execution trajectory

ArkSim supports three distinct capture mechanisms:

1. **Explicit capture** via `AgentResponse.tool_calls`
2. **Automatic capture** via the `ArksimTracingProcessor`
3. **A2A protocol capture** via task artifacts

All three methods produce the same underlying `ToolCall` data model, ensuring consistent evaluation regardless of how the agent is implemented.

## Data Models

### ToolCall

The `ToolCall` class represents a single tool invocation observed during a turn:

```python
class ToolCall(BaseModel):
    model_config = ConfigDict(extra="ignore")
    
    id: str
    name: str
    arguments: dict[str, Any] = Field(default_factory=dict)
    result: str | None = None
    error: str | None = None
    source: ToolCallSource | None = None
```

| Field | Type | Description |
|-------|------|-------------|
| `id` | `str` | Unique identifier for this tool call |
| `name` | `str` | Name of the tool/function invoked |
| `arguments` | `dict[str, Any]` | Arguments passed to the tool |
| `result` | `str \| None` | Response returned by the tool |
| `error` | `str \| None` | Error message if the call failed |
| `source` | `ToolCallSource \| None` | Origin of the capture data |

The `extra="ignore"` configuration ensures forward compatibility with future versions that add new fields.

### ToolCallSource

The `ToolCallSource` enum indicates how tool call data was captured:

```python
class ToolCallSource(str, Enum):
    AGENT_RESPONSE = "agent_response"
    TRACING_PROCESSOR = "tracing_processor"
    A2A_PROTOCOL = "a2a_protocol"
```

### AgentResponse

For explicit capture, agents return structured responses:

```python
class AgentResponse(BaseModel):
    model_config = ConfigDict(extra="ignore")
    
    content: str
    tool_calls: list[ToolCall] = Field(default_factory=list)
```

## Capture Methods

### Explicit Capture (AgentResponse)

The default approach requires agents to explicitly return tool calls alongside their text response. This is the standard pattern for custom agents implementing `BaseAgent`.

```mermaid
graph TD
    A[Simulator invokes agent] --> B[Agent executes user_query]
    B --> C[Agent returns AgentResponse]
    C --> D[Simulator extracts tool_calls]
    D --> E[Evaluator scores trajectory]
```

**Implementation pattern:**

```python
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse, ToolCall

class MyAgent(BaseAgent):
    async def execute(self, user_query: str, **kwargs) -> str | AgentResponse:
        # Agent logic here...
        tool_calls = [
            ToolCall(
                id="call_1",
                name="get_order_status",
                arguments={"order_id": "ORD-1001"},
                source="agent_response"
            )
        ]
        return AgentResponse(
            content="The order status is shipped.",
            tool_calls=tool_calls
        )
```

Source: [examples/customer-service/custom_agent.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_agent.py)

### Tracing Processor (Automatic Capture)

The `ArksimTracingProcessor` uses the OpenAI Agents SDK's tracing interface to capture tool calls automatically, without requiring explicit return data.

```mermaid
graph TD
    A[Simulator sets routing context] --> B[Agent executes normally]
    B --> C[SDK fires on_span_end]
    C --> D[ArksimTracingProcessor captures]
    D --> E[Evaluator scores trajectory]
```

**Registration pattern:**

```python
from agents import Agent as SDKAgent
from arksim.tracing.openai import ArksimTracingProcessor

# Register once at module load
_tracing_processor = ArksimTracingProcessor()
```

The traced agent returns a plain `str` rather than `AgentResponse`:

```python
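from agents import Runner

async def execute(self, user_query: str, **kwargs) -> str:
    # Run the SDK agent; tool calls are picked up by the tracing processor,
    # so only the final text is returned here.
    result = await Runner.run(self._sdk_agent, user_query)
    return result.final_output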
```

Source: [examples/customer-service/traced_agent.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/traced_agent.py)
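
The idea behind the routing context can be illustrated with `contextvars`. This toy model is an assumption about the general mechanism, not ArkSim's implementation: `ToyTracingProcessor`, `current_conversation`, and the dict-shaped spans are hypothetical, and real SDK spans are objects rather than dictionaries.

```python
import contextvars
from collections import defaultdict

# Hypothetical routing context: which conversation the running turn belongs to.
current_conversation = contextvars.ContextVar("current_conversation", default=None)


class ToyTracingProcessor:
    def __init__(self):
        self.captured = defaultdict(list)

    def on_span_end(self, span: dict) -> None:
        """Record function-call spans under the active conversation id."""
        conv_id = current_conversation.get()
        if conv_id is None or span.get("type") != "function":
            return
        self.captured[conv_id].append(
            {"name": span["name"], "arguments": span.get("input", {})}
        )


processor = ToyTracingProcessor()


def run_turn(conv_id: str, spans: list) -> None:
    """Set the routing context, let the 'SDK' emit spans, then restore it."""
    token = current_conversation.set(conv_id)
    try:
        for span in spans:  # stands in for the SDK firing on_span_end
            processor.on_span_end(span)
    finally:
        current_conversation.reset(token)


run_turn("conv-1", [
    {"type": "function", "name": "get_order_status", "input": {"order_id": "ORD-1001"}},
    {"type": "generation", "name": "llm"},
])
```

Keying captures by a context variable is what would let several conversations run concurrently without their tool calls being mixed up.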

### A2A Protocol Capture

For agents implementing the Agent-to-Agent (A2A) protocol, tool calls are embedded in task artifacts using the `A2AToolCaptureExtension`.

```mermaid
graph TD
    A[Simulator sends task to A2A Agent] --> B[Agent processes request]
    B --> C[Agent returns Task with artifacts]
    C --> D[Simulator extracts from metadata]
    D --> E[Evaluator scores]
```

**Tool call extraction from artifacts:**

```python
def _extract_tool_calls_from_artifact(self, artifact: Artifact) -> list[ToolCall]:
    # Artifacts without metadata carry no tool calls.
    metadata = artifact.metadata or {}
    raw_calls = metadata.get("tool_calls", [])
    tool_calls = []
    for raw in raw_calls:
        # Assumption: the tool name is read from the raw entry;
        # entries without one are skipped.
        name = raw.get("name")
        if not name:
            continue
        arguments = raw.get("arguments", {})
        # Skip malformed entries whose arguments are not a mapping.
        if not isinstance(arguments, dict):
            continue
        tool_calls.append(
            ToolCall(
                id=raw.get("id", ""),
                name=name,
                arguments=arguments,
                result=A2AAgent._coerce_to_string(raw.get("result")),
                error=A2AAgent._coerce_to_string(raw.get("error")),
                source=ToolCallSource.A2A_PROTOCOL,
            )
        )
    return tool_calls
```

Source: [arksim/simulation_engine/agent/clients/a2a.py](https://github.com/arklexai/arksim/blob/main/arksim/simulation_engine/agent/clients/a2a.py)

**A2A agent card declaration:**

```python
from a2a.types import AgentCapabilities
from arksim.simulation_engine.tool_types import A2AToolCaptureExtension

_capabilities = AgentCapabilities(
    streaming=False,
    extensions=[A2AToolCaptureExtension],
)
```
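
On the agent side, the extraction code above implies artifact metadata shaped like the dictionary below. The `coerce_to_string` function is a guess at what a helper such as `A2AAgent._coerce_to_string` might do (keep `None` and strings, JSON-encode everything else); the real helper may behave differently.

```python
import json


def coerce_to_string(value):
    """Hypothetical stand-in: keep None and str, JSON-encode anything else."""
    if value is None or isinstance(value, str):
        return value
    return json.dumps(value, sort_keys=True)


# Metadata an A2A agent could attach to a task artifact (illustrative values).
artifact_metadata = {
    "tool_calls": [
        {
            "id": "call_1",
            "name": "get_order_status",
            "arguments": {"order_id": "ORD-1001"},
            "result": {"status": "shipped"},
            "error": None,
        }
    ]
}

# Normalize result and error so both fit the str | None fields of ToolCall.
normalized = [
    {
        **call,
        "result": coerce_to_string(call.get("result")),
        "error": coerce_to_string(call.get("error")),
    }
    for call in artifact_metadata["tool_calls"]
]
```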

## Workflow Diagram

The following diagram shows the complete simulation pipeline with tool call capture:

```mermaid
flowchart LR
    subgraph Simulation
        A[User Query] --> B[Simulator]
        B --> C{Agent Type}
        C -->|Custom| D[explicit tool_calls]
        C -->|Traced| E[TracingProcessor]
        C -->|A2A| F[Artifact metadata]
        D --> G[Captured Tool Calls]
        E --> G
        F --> G
    end
    
    subgraph Evaluation
        G --> H[Evaluator]
        H --> I[Tool Call Metrics]
        H --> J[Error Detection]
        H --> K[Trajectory Analysis]
    end
    
    G --> L[Results/Report]
    I --> L
    J --> L
    K --> L
```

## Comparison of Capture Methods

| Aspect | Explicit (AgentResponse) | Traced (TracingProcessor) | A2A Protocol |
|--------|--------------------------|---------------------------|--------------|
| Return type | `AgentResponse` | `str` | Via protocol |
| Implementation complexity | Medium | Low | High |
| Agent code changes | Required | Minimal | Protocol required |
| Best for | Custom Python agents | SDK-based agents | A2A-native agents |
| Tool call source field | `agent_response` | `tracing_processor` | `a2a_protocol` |

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

## Configuration

### Trace Receiver Settings

For traced agents, enable the trace receiver in the simulation config:

```yaml
simulation:
  max_turns: 10
  
trace_receiver:
  enabled: true
  wait_timeout: 5  # seconds to wait for traces
```

### When Tracing is Disabled

With the trace receiver off, explicit `AgentResponse` capture is the only active path:

> When `trace_receiver.enabled` is false or omitted, arksim only captures tool calls from `AgentResponse` (the standard path).

## Integration with Evaluation

Captured tool calls flow into the evaluator's scoring pipeline:

1. **Trajectory matching** - Compare actual tool sequence against expected
2. **Argument validation** - Verify tool arguments match scenario requirements
3. **Error detection** - Identify tool call failures and their handling
4. **Coverage analysis** - Determine if all required tools were invoked

The evaluator uses the `source` field to differentiate between capture methods when analyzing behavioral patterns.
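
Two of the checks listed above, trajectory matching and coverage analysis, reduce to simple sequence and set operations. The sketch below shows one possible reading; `trajectory_matches` and `coverage` are hypothetical helpers, and ArkSim's evaluator may score these differently.

```python
def trajectory_matches(actual: list, expected: list) -> bool:
    """True when the expected tool names occur in order within actual.

    Extra calls in between are allowed (subsequence match).
    """
    remaining = iter(actual)
    # `name in remaining` advances the iterator, which enforces ordering.
    return all(name in remaining for name in expected)


def coverage(actual: list, required: set) -> float:
    """Fraction of required tools that were invoked at least once."""
    if not required:
        return 1.0
    return len(required & set(actual)) / len(required)


calls = ["lookup_customer", "get_order_status", "issue_refund"]
in_order = trajectory_matches(calls, ["lookup_customer", "issue_refund"])
out_of_order = trajectory_matches(calls, ["issue_refund", "lookup_customer"])
covered = coverage(calls, {"get_order_status", "send_email"})
```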

## Forward Compatibility

Both `ToolCall` and `AgentResponse` declare `extra="ignore"` in their Pydantic configuration:

> Declares `extra="ignore"` explicitly so snapshots from future arksim versions that add new fields can still be loaded by older arksim without raising a `ValidationError`.

This ensures that simulation results captured with newer versions remain loadable by older versions of the evaluator.
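
The effect of `extra="ignore"` can be imitated with the standard library to show why it helps. This sketch deliberately uses a dataclass rather than Pydantic so that it runs without dependencies; `ToolCallRecord` and `from_snapshot` are hypothetical names.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ToolCallRecord:
    id: str
    name: str
    arguments: dict = field(default_factory=dict)
    result: Optional[str] = None
    error: Optional[str] = None

    @classmethod
    def from_snapshot(cls, data: dict) -> "ToolCallRecord":
        """Build a record, silently dropping keys this version does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


# A snapshot written by a newer version with an extra, unknown field.
record = ToolCallRecord.from_snapshot(
    {"id": "call_1", "name": "get_order_status", "latency_ms": 12}
)
```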

---

<a id='scenario-management'></a>

## Scenario Management

### Related Pages

Related topics: [Simulation Engine](#simulation-engine), [Configuration System](#configuration)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [arksim/scenario/entities.py](https://github.com/arklexai/arksim/blob/main/arksim/scenario/entities.py)
- [arksim/templates/scenarios.json](https://github.com/arklexai/arksim/blob/main/arksim/templates/scenarios.json)
- [examples/e-commerce/scenarios.json](https://github.com/arklexai/arksim/blob/main/examples/e-commerce/scenarios.json)
- [examples/bank-insurance/scenarios.json](https://github.com/arklexai/arksim/blob/main/examples/bank-insurance/scenarios.json)
- [examples/customer-service/scenarios.json](https://github.com/arklexai/arksim/blob/main/examples/customer-service/scenarios.json)
</details>

# Scenario Management

## Overview

Scenario Management is the core system in ArkSim for defining, loading, validating, and executing test scenarios that simulate user interactions with an AI agent. A scenario represents a structured test case that defines a user's goal, knowledge base, behavioral characteristics, and expected outcomes.

Scenarios serve as the foundation for the simulation and evaluation pipeline, enabling reproducible testing of agent behavior across diverse conversation patterns and user profiles.

## Scenario Data Model

### Core Scenario Structure

Each scenario in ArkSim is a JSON object containing the following primary fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `scenario_id` | string | Yes | Unique identifier for the scenario |
| `name` | string | Yes | Human-readable scenario name |
| `description` | string | No | Detailed description of the scenario's purpose |
| `user_profile` | object | Yes | User persona characteristics |
| `goal` | object | Yes | The user's primary objective |
| `knowledge` | array | Yes | Contextual knowledge the user has access to |
| `expected_behavior` | object | No | Expected agent responses or behaviors |
| `metrics` | array | No | Custom evaluation criteria |

### User Profile Schema

```json
{
  "name": "string",
  "age": "number",
  "personality": "string",
  "background": "string",
  "communication_style": "string"
}
```

Source: [arksim/scenario/entities.py](https://github.com/arklexai/arksim/blob/main/arksim/scenario/entities.py)

### Goal Structure

```json
{
  "primary": "string (main user objective)",
  "secondary": ["array of secondary objectives"],
  "constraints": ["array of constraints or boundaries"],
  "success_criteria": "string"
}
```

## Scenario Management Workflow

```mermaid
graph TD
    A[Create/Load Scenarios] --> B[Validate Scenario Schema]
    B --> C{Valid?}
    C -->|Yes| D[Save to scenarios.json]
    C -->|No| E[Show Validation Errors]
    E --> A
    D --> F[Configure Simulation Parameters]
    F --> G[Run Simulation]
    G --> H[Generate Results]
    H --> I[Evaluation & Reporting]
```

## Scenario Loading and Validation

The UI provides interactive scenario management through the "Build" page:

1. **Load Existing** - Users can load pre-existing scenario files via file path input
2. **Auto-generate (PRO)** - Automatic scenario generation from agent knowledge base
3. **Manual Creation** - Build scenarios through the UI interface

```html
<!-- Scenario file validation in the UI: event bindings on the file path input -->
<input
    @input="validateScenarioFile()"
    @blur="validateScenarioFile()"
    @keydown.enter="loadScenarioFile()"
>
```

Source: [arksim/ui/frontend/index.html](https://github.com/arklexai/arksim/blob/main/arksim/ui/frontend/index.html)

### Validation Rules

- Scenario files must be valid JSON
- Required fields must be present
- `scenario_id` must be unique within the file
- `user_profile` must contain at minimum a `name` field
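
The rules above can be checked outside the UI as well. The following is a minimal sketch, not ArkSim's own validator: the function name `validate_scenarios` and the error message wording are hypothetical, and the list of required fields is taken from the field table earlier in this section.

```python
import json


def validate_scenarios(text: str) -> list[str]:
    """Return human-readable validation errors (an empty list means valid)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    scenarios = data.get("scenarios") if isinstance(data, dict) else None
    if not isinstance(scenarios, list):
        return ["top-level 'scenarios' array is missing"]

    # Fields marked "Required" in the scenario field table.
    required = ("scenario_id", "name", "user_profile", "goal", "knowledge")
    errors: list[str] = []
    seen_ids: set[str] = set()

    for index, scenario in enumerate(scenarios):
        if not isinstance(scenario, dict):
            errors.append(f"scenario[{index}]: must be an object")
            continue
        for field in required:
            if field not in scenario:
                errors.append(f"scenario[{index}]: missing required field '{field}'")

        # scenario_id must be unique within the file.
        scenario_id = scenario.get("scenario_id")
        if isinstance(scenario_id, str):
            if scenario_id in seen_ids:
                errors.append(f"scenario[{index}]: duplicate scenario_id '{scenario_id}'")
            seen_ids.add(scenario_id)

        # user_profile must contain at least a name.
        profile = scenario.get("user_profile")
        if profile is not None and not (isinstance(profile, dict) and "name" in profile):
            errors.append(f"scenario[{index}]: user_profile needs a 'name' field")

    return errors
```

Running this over the text of a scenario file before committing it gives the same kind of feedback as the "Build" page, under the assumptions stated above.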

## Scenario File Format

ArkSim scenario definitions are JSON files with a top-level `scenarios` array, for example:

```json
{
  "scenarios": [
    {
      "scenario_id": "ecommerce-return-item-001",
      "name": "Return Defective Product",
      "description": "Customer wants to return a damaged item received last week",
      "user_profile": {
        "name": "John Smith",
        "age": 35,
        "personality": "patient but firm",
        "background": "Regular online shopper",
        "communication_style": "polite and direct"
      },
      "goal": {
        "primary": "Get a full refund for a damaged product",
        "secondary": ["Understand return process", "Know timeline for refund"],
        "constraints": ["Only willing to wait up to 14 days for refund"],
        "success_criteria": "Full refund issued or replacement offered"
      },
      "knowledge": [
        "Ordered product SKU-12345 on March 1, 2024",
        "Item arrived damaged with visible scratches",
        "Has original packaging and receipt",
        "Order number: ORD-987654"
      ],
      "expected_behavior": {
        "should_mention_order_number": true,
        "should_request_photo_evidence": false
      }
    }
  ]
}
```

Sources: [examples/e-commerce/scenarios.json](), [examples/bank-insurance/scenarios.json](), [examples/customer-service/scenarios.json]()
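
When many similar scenarios are needed, it can be easier to generate the file than to write it by hand. The sketch below builds scenario objects with only the required fields and serializes them under the `scenarios` key. The helper names `make_scenario` and `dump_scenarios` are hypothetical and are not part of ArkSim.

```python
import json


def make_scenario(scenario_id, name, user_name, primary_goal, knowledge, description=None):
    """Build a scenario dict containing the required fields from the field table."""
    scenario = {
        "scenario_id": scenario_id,
        "name": name,
        "user_profile": {"name": user_name},
        "goal": {"primary": primary_goal},
        "knowledge": list(knowledge),
    }
    if description is not None:
        scenario["description"] = description
    return scenario


def dump_scenarios(scenarios):
    """Serialize scenarios under the top-level "scenarios" key, rejecting duplicate IDs."""
    scenarios = list(scenarios)
    ids = [s["scenario_id"] for s in scenarios]
    if len(ids) != len(set(ids)):
        raise ValueError("scenario_id values must be unique within a file")
    return json.dumps({"scenarios": scenarios}, indent=2)
```

The returned string can be written to `scenarios.json` with `pathlib.Path("scenarios.json").write_text(...)`.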

## Simulation Configuration

Scenarios are executed through the simulation engine with configurable parameters:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `num_conversations_per_scenario` | 5 | Number of conversations to simulate per scenario |
| `max_turns` | 5 | Maximum turns per conversation |
| `num_workers` | 50 | Parallel workers (or 'auto') |
| `model` | gpt-4o | LLM model for simulation |
| `provider` | openai | LLM provider |

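```python
# Excerpt: field declarations from the simulation settings model (Pydantic).
# DEFAULT_MODEL is a constant defined elsewhere in the source; it is not shown here.
model: str = Field(default=DEFAULT_MODEL, description="LLM model for simulation")
num_conversations_per_scenario: int = Field(
    default=5,
    description="Number of conversations per scenario to simulate"
)
max_turns: int = Field(default=5, description="Maximum turns per conversation")
```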

Source: [arksim/simulation_engine/entities.py:18-25]()

### Configuration via config.yaml

```yaml
simulation:
  scenarios_file: ./scenarios.json
  num_conversations_per_scenario: 5
  max_turns: 5
  num_workers: auto
  model: gpt-4o
```
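
These parameters multiply, so it helps to estimate the size of a run before starting it. The following sketch computes the number of conversations and an upper bound on the total number of turns. `SimulationSettings` is a hypothetical stand-in, not ArkSim's own model; its defaults are copied from the parameter table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationSettings:
    """Stand-in for the simulation parameters; defaults copied from the table above."""

    num_conversations_per_scenario: int = 5
    max_turns: int = 5


def conversation_count(num_scenarios: int, settings: SimulationSettings) -> int:
    """Total conversations simulated across all scenarios."""
    return num_scenarios * settings.num_conversations_per_scenario


def max_total_turns(num_scenarios: int, settings: SimulationSettings) -> int:
    """Upper bound on turns: every conversation runs to max_turns."""
    return conversation_count(num_scenarios, settings) * settings.max_turns
```

With the defaults, a file of 3 scenarios yields 15 conversations and at most 75 turns. Conversations that end early use fewer.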

## Built-in Scenario Templates

ArkSim provides template scenarios for common use cases:

```json
{
  "scenarios": [
    {
      "scenario_id": "template-basic-query",
      "name": "Basic Information Query",
      "description": "User asks a simple informational question",
      "user_profile": {...},
      "goal": {...},
      "knowledge": [...]
    }
  ]
}
```

Source: [arksim/templates/scenarios.json]()

## Integration with CI/CD

Scenarios can be integrated into automated testing workflows:

```yaml
# GitHub Actions workflow
steps:
  - name: Run Scenario Tests
    run: arksim simulate-evaluate config.yaml
```

Source: [examples/ci/README.md]()

### Required CI Setup

1. Update `TODO` sections in `arksim.yml` (startup command and health-check URL)
2. Create `tests/arksim/config.yaml` pointing to your server endpoint
3. Create `tests/arksim/scenarios.json` with your test cases
4. Add custom metrics to `tests/arksim/custom_metrics/` if needed
5. Configure `OPENAI_API_KEY` in GitHub secrets
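
Steps 2 and 3 can be verified with a small preflight check before the workflow invokes ArkSim. This is a sketch only: the function name `missing_ci_files` is hypothetical, and the file paths are the ones listed in the steps above.

```python
from pathlib import Path

# Files the CI setup steps above say must exist.
REQUIRED_FILES = ("tests/arksim/config.yaml", "tests/arksim/scenarios.json")


def missing_ci_files(repo_root=".") -> list[str]:
    """Return the required files that are absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

A CI step could call `missing_ci_files()` and fail fast with a clear message when the returned list is not empty.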

## Best Practices

### Scenario Design

- **Distinct Goals**: Each scenario should test a single, well-defined user goal
- **Realistic Profiles**: User profiles should reflect actual customer demographics
- **Sufficient Knowledge**: Include enough context for the simulated user to maintain coherent conversation

### File Organization

```
project/
├── config.yaml
├── scenarios.json
├── custom_metrics/
│   └── my_metric.py
└── results/
    └── simulation/
```

### Version Control

- Commit `scenarios.json` with your agent code
- Use descriptive `scenario_id` values with prefixes (e.g., `ecommerce-`, `support-`)
- Document success criteria clearly in the `goal.success_criteria` field

## Command Line Usage

### Initialize with default scenarios

```bash
arksim init
```

### Run simulation with scenarios

```bash
arksim simulate-evaluate config.yaml
```

Results are written to `./results/simulation/simulation.json`.

Sources: [README.md](), [examples/integrations/dify/README.md]()

## See Also

- [Simulation Engine Documentation](./simulation-engine.md)
- [Custom Agent Integration](./custom-agents.md)
- [Evaluation Metrics](./evaluation-metrics.md)
- [HTML Report Template](./html-report.md)

---

<a id='configuration'></a>

## Configuration System

### Related Pages

Related topics: [Agent Types and Integration](#agent-types), [LLM Provider Integration](#llm-providers), [Evaluation System](#evaluation-system)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [arksim/config/types.py](https://github.com/arklexai/arksim/blob/main/arksim/config/types.py)
- [arksim/config/core/agent.py](https://github.com/arklexai/arksim/blob/main/arksim/config/core/agent.py)
- [arksim/config/utils.py](https://github.com/arklexai/arksim/blob/main/arksim/config/utils.py)
- [arksim/config.yaml](https://github.com/arklexai/arksim/blob/main/arksim/config.yaml)
- [arksim/config_simulate.yaml](https://github.com/arklexai/arksim/blob/main/arksim/config_simulate.yaml)
- [arksim/config_evaluate.yaml](https://github.com/arklexai/arksim/blob/main/arksim/config_evaluate.yaml)
- [arksim/cli.py](https://github.com/arklexai/arksim/blob/main/arksim/cli.py)
- [examples/customer-service/custom_metrics.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_metrics.py)
</details>

# Configuration System

ArkSim provides a flexible, multi-layered configuration system that supports both YAML-based declarative configuration and programmatic Python class configuration. The system enables users to define simulation scenarios, evaluation metrics, agent connections, and runtime behavior through structured configuration files or direct Python implementations.

## Architecture Overview

The configuration system is organized into several key modules that handle different aspects of simulation and evaluation:

```mermaid
graph TD
    A[User Configuration] --> B[YAML Files]
    A --> C[Python Classes]
    B --> D[config.yaml]
    B --> E[config_simulate.yaml]
    B --> F[config_evaluate.yaml]
    C --> G[BaseAgent Subclass]
    C --> H[Custom Metrics]
    D --> I[Config Loader]
    E --> I
    F --> I
    G --> J[Simulation Engine]
    H --> K[Evaluator]
    I --> J
    J --> K
    J --> L[Results/Reports]
    K --> L
```

The system supports three execution modes: simulation only, evaluation only, and combined execution. Running simulation independently of evaluation is useful for debugging and iterative development workflows.

## Configuration Files

ArkSim uses YAML configuration files to define simulation parameters, evaluation settings, and agent connections. The repository includes three primary configuration templates in the `arksim/` package directory: `config.yaml`, `config_simulate.yaml`, and `config_evaluate.yaml`.

### Main Configuration (config.yaml)

The main configuration file combines both simulation and evaluation settings into a single file. This is the recommended approach for simple use cases where both phases run together using the `simulate-evaluate` command. The file structure typically includes agent configuration, scenario definitions, metric selections, and evaluation parameters.

### Separate Phase Configuration

For more complex workflows, ArkSim supports splitting configuration into separate files:

| File | Purpose | CLI Command |
|------|---------|-------------|
| `config_simulate.yaml` | Simulation-only parameters | `arksim simulate config_simulate.yaml` |
| `config_evaluate.yaml` | Evaluation-only parameters | `arksim evaluate config_evaluate.yaml` |
| `config.yaml` | Combined configuration | `arksim simulate-evaluate config.yaml` |

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

This separation enables scenarios where simulation results are cached and evaluated multiple times with different metric configurations, or where the simulation phase runs in a different environment than evaluation.
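
The mapping in the table above can be scripted when the phases are driven from Python rather than from a shell. The sketch below assembles the CLI invocation for each mode. The helper names `build_command` and `run_phase` are hypothetical; the subcommands and default file names are the ones listed in the table.

```python
import subprocess

# Default configuration file for each CLI subcommand, as listed in the table above.
DEFAULT_CONFIGS = {
    "simulate": "config_simulate.yaml",
    "evaluate": "config_evaluate.yaml",
    "simulate-evaluate": "config.yaml",
}


def build_command(mode: str, config_path=None) -> list[str]:
    """Return the argv list for `arksim <mode> <config>`."""
    if mode not in DEFAULT_CONFIGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["arksim", mode, config_path or DEFAULT_CONFIGS[mode]]


def run_phase(mode: str, config_path=None) -> int:
    """Run one phase and return the process exit code (requires arksim on PATH)."""
    return subprocess.run(build_command(mode, config_path), check=False).returncode
```

For example, `run_phase("simulate")` could be called once, followed by several `run_phase("evaluate", path)` calls that each point at a different evaluation config.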

## Agent Configuration

The configuration system supports multiple agent connection types, defined through the `agent_type` field in the agent configuration section.

### Supported Agent Types

| Agent Type | Description | Configuration Style |
|------------|-------------|---------------------|
| `custom` | Custom Python agent class inheriting from BaseAgent | Python class file |
| `chat_completions` | HTTP endpoint implementing Chat Completions API | YAML endpoint config |
| `a2a` | Agent-to-Agent protocol endpoint | YAML endpoint config |

Source: [arksim/cli.py](https://github.com/arklexai/arksim/blob/main/arksim/cli.py)

### Custom Agent (Python Class)

The default agent type uses a Python class that extends `BaseAgent`. This approach provides maximum flexibility for integrating any agent framework. The agent class must implement two required methods:

```python
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse

class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"
```

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)

The `execute` method can return either a plain string response or an `AgentResponse` object that includes both text content and tool calls for evaluation.
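
To show the shape of the two-method interface without requiring ArkSim to be installed, the sketch below uses a local stand-in base class. `BaseAgentStub` and `EchoAgent` are hypothetical names; the real base class is `BaseAgent`, and its constructor arguments are not documented here. The example returns plain strings only and assumes that one agent instance serves one conversation.

```python
import asyncio
import uuid
from abc import ABC, abstractmethod


class BaseAgentStub(ABC):
    """Local stand-in for ArkSim's BaseAgent, mirroring the two required methods."""

    @abstractmethod
    async def get_chat_id(self) -> str: ...

    @abstractmethod
    async def execute(self, user_query: str, **kwargs: object) -> str: ...


class EchoAgent(BaseAgentStub):
    """Toy agent that echoes the user and records the exchange."""

    def __init__(self) -> None:
        self._chat_id = str(uuid.uuid4())  # stable ID for this instance
        self.history: list[tuple[str, str]] = []

    async def get_chat_id(self) -> str:
        return self._chat_id

    async def execute(self, user_query: str, **kwargs: object) -> str:
        reply = f"You said: {user_query}"
        self.history.append((user_query, reply))
        return reply


async def demo() -> list[str]:
    """Drive two turns against one agent instance."""
    agent = EchoAgent()
    return [await agent.execute("hello"), await agent.execute("I need a refund")]
```

Calling `asyncio.run(demo())` drives two turns. In a real integration the body of `execute` would call into the agent framework being tested.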

### Chat Completions Agent

For agents exposing an HTTP endpoint compatible with the OpenAI Chat Completions format, configuration is declarative:

```yaml
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
```

Source: [README.md](https://github.com/arklexai/arksim/blob/main/README.md)
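
For local experiments, an endpoint of this kind can be mocked with the Python standard library. The sketch below serves `/v1/chat/completions` and echoes the last user message in an OpenAI-style response body. It is an illustration under assumptions: the exact request fields ArkSim sends are not documented here, so the handler reads only `messages` and `model`, and the names `build_completion`, `ChatHandler`, and `serve` are hypothetical.

```python
import json
import time
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_completion(request_body: dict) -> dict:
    """Build a Chat Completions-style response that echoes the last user message."""
    messages = request_body.get("messages", [])
    last_user = next(
        (m.get("content", "") for m in reversed(messages) if m.get("role") == "user"),
        "",
    )
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request_body.get("model", "echo-agent"),
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": f"Echo: {last_user}"},
                "finish_reason": "stop",
            }
        ],
    }


class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            body = None
        if not isinstance(body, dict):
            self.send_error(400, "request body must be a JSON object")
            return
        payload = json.dumps(build_completion(body)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Blocking call: serve the mock endpoint on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ChatHandler).serve_forever()
```

Calling `serve()` exposes the mock at `http://localhost:8000/v1/chat/completions`, which matches the `endpoint` value in the YAML above.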

### A2A Protocol Agent

Agents implementing the Agent-to-Agent protocol are configured similarly:

```yaml
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
```

A2A agents can also surface tool calls for evaluation through the protocol, enabling comprehensive testing of tool-using agents.

## Scenario Configuration

Scenarios define the test cases that the simulator executes against the agent. The `scenarios.json` file contains a top-level `scenarios` array of scenario objects, each representing a simulated conversation or interaction.

### Scenario Structure

Each scenario typically includes:

- **Scenario ID**: Unique identifier for the scenario
- **User query**: The initial prompt or question presented to the agent
- **Expected behavior**: Criteria for successful completion
- **Knowledge/context**: Contextual knowledge available to the simulated user during the scenario

### Scenario Files Location

Scenarios are referenced from the main configuration file and can be shared across different configuration files:

| Example Directory | Scenario Purpose |
|------------------|------------------|
| `examples/bank-insurance/` | Financial services agent testing |
| `examples/customer-service/` | Customer support scenarios |
| `examples/integrations/*/` | Framework-specific testing |
| `examples/ci/` | CI/CD integration testing |

Source: [examples/ci/README.md](https://github.com/arklexai/arksim/blob/main/examples/ci/README.md)

## Evaluation Metrics Configuration

The evaluator component scores agent responses against defined metrics. Configuration specifies which metrics to run and how they are calculated.

### Built-in Metrics

ArkSim includes standard evaluation metrics covering common agent quality dimensions. These metrics are automatically available without additional configuration.

### Custom Metrics

Users can define custom evaluation metrics by creating a Python module that implements the metric interfaces. The custom metrics file must define metrics following the provided schema:

```python
from pydantic import BaseModel
from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
    format_chat_history,
)
```

Source: [examples/customer-service/custom_metrics.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_metrics.py)

Custom metrics are referenced in the configuration file using the `custom_metrics_file_paths` parameter, and optionally listed in `metrics_to_run` to include them in the evaluation:

```yaml
custom_metrics_file_paths:
  - tests/arksim/custom_metrics/my_metric.py
metrics_to_run:
  - custom_metric_name
```

Source: [examples/ci/README.md](https://github.com/arklexai/arksim/blob/main/examples/ci/README.md)

### Custom Metric Schema

Custom quantitative metrics should return a Pydantic model defining the score components:

```python
class VerificationComplianceSchema(BaseModel):
    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str
```

Source: [examples/customer-service/custom_metrics.py](https://github.com/arklexai/arksim/blob/main/examples/customer-service/custom_metrics.py)
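
The `0.0-1.0` ranges in the schema comments are easy to violate when an LLM judge fills in the values. The sketch below shows one way to enforce them and collapse the two components into a single number. It uses a standard-library dataclass as a stand-in for the Pydantic model, and the unweighted average in `overall` is an assumption for illustration, not ArkSim's scoring rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationCompliance:
    """Dataclass stand-in for the schema above, with range checks."""

    identity_verification: float  # 0.0-1.0
    action_gating: float  # 0.0-1.0
    reason: str

    def __post_init__(self) -> None:
        # Reject component scores outside the documented range.
        for name in ("identity_verification", "action_gating"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within 0.0-1.0, got {value}")

    def overall(self) -> float:
        """Unweighted mean of the two components (an illustrative choice)."""
        return (self.identity_verification + self.action_gating) / 2
```

A weighted combination would work equally well if one component matters more for the agent under test.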

The system prompt for qualitative metrics defines the evaluation criteria that the LLM judge applies when scoring agent responses.

## CLI Configuration Interface

The ArkSim CLI provides commands for initializing configurations, running simulations, and managing examples.

### Initialization Command

The `init` command scaffolds starter files for agent testing:

```bash
arksim init --agent-type custom
```

| Flag | Options | Default | Description |
|------|---------|---------|-------------|
| `--agent-type` | `custom`, `chat_completions`, `a2a` | `custom` | Agent connection type |
| `--force` | boolean | `false` | Overwrite existing files |

Source: [arksim/cli.py](https://github.com/arklexai/arksim/blob/main/arksim/cli.py)

The `--agent-type` flag determines which template is generated:
- `custom` generates a Python agent file (no server needed)
- `chat_completions` generates YAML configuration for HTTP endpoints
- `a2a` generates YAML configuration for Agent-to-Agent protocol

### Simulation Commands

| Command | Description |
|---------|-------------|
| `arksim simulate-evaluate config.yaml` | Run simulation and evaluation in sequence |
| `arksim simulate config_simulate.yaml` | Run simulation only |
| `arksim evaluate config_evaluate.yaml` | Evaluate previously saved simulation results |

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

### Examples Command

The CLI also provides access to example projects:

```bash
arksim examples                    # Download all examples
arksim examples bank-insurance     # Download specific example
arksim examples --list             # List available examples
```

## Integration Configuration Patterns

Each integration example follows a consistent pattern with three standard files.

### File Structure

| File | Purpose |
|------|---------|
| `custom_agent.py` | Agent implementation connecting to the target framework |
| `config.yaml` | Simulation and evaluation settings |
| `scenarios.json` | Test scenarios for the example domain |

### Supported Framework Integrations

ArkSim provides example configurations for the following agent frameworks:

| Framework | Installation | Example Directory |
|-----------|-------------|-------------------|
| LangGraph | `pip install langgraph langchain-openai` | `examples/integrations/langgraph/` |
| LangChain | `pip install langgraph langchain-openai` | `examples/integrations/langchain/` |
| Claude Agent SDK | `pip install claude-agent-sdk` | `examples/integrations/claude-agent-sdk/` |
| AutoGen | `pip install autogen-agentchat autogen-ext[openai]` | `examples/integrations/autogen/` |
| CrewAI | `pip install crewai` | `examples/integrations/crewai/` |
| Pydantic AI | `pip install pydantic-ai` | `examples/integrations/pydantic-ai/` |
| Smolagents | `pip install smolagents` | `examples/integrations/smolagents/` |
| Dify | HTTP integration | `examples/integrations/dify/` |
| OpenClaw | Gateway token auth | `examples/openclaw/` |

Source: [examples/integrations/*/README.md](https://github.com/arklexai/arksim/tree/main/examples/integrations)

## Trace Receiver Configuration

For frameworks that support tracing, ArkSim includes a trace receiver component that captures tool calls automatically without requiring explicit `AgentResponse` returns.

### Configuration Options

```yaml
trace_receiver:
  enabled: true
  wait_timeout: 5
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `enabled` | boolean | `false` | Enable trace-based tool call capture |
| `wait_timeout` | integer | `5` | Seconds to wait for trace data |

Source: [examples/customer-service/README.md](https://github.com/arklexai/arksim/blob/main/examples/customer-service/README.md)

When `enabled` is `false` or omitted, ArkSim only captures tool calls from `AgentResponse` objects returned by the agent's `execute` method.
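
The default behavior described above can be expressed as a small resolution step over an already-parsed configuration dictionary. This is a sketch of the documented defaults only: `TraceReceiverSettings` and `resolve_trace_receiver` are hypothetical names and do not describe ArkSim's loader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceReceiverSettings:
    """Defaults copied from the parameter table above."""

    enabled: bool = False
    wait_timeout: int = 5  # seconds


def resolve_trace_receiver(config: dict) -> TraceReceiverSettings:
    """Apply defaults when the trace_receiver section or its keys are omitted."""
    section = config.get("trace_receiver") or {}
    return TraceReceiverSettings(
        enabled=bool(section.get("enabled", False)),
        wait_timeout=int(section.get("wait_timeout", 5)),
    )
```

An empty configuration resolves to `enabled=False`, which corresponds to capturing tool calls only from `AgentResponse` objects.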

## Workflow Summary

The following diagram illustrates the configuration-driven workflow from specification to results:

```mermaid
graph LR
    A[config.yaml] --> B[Scenario Loading]
    C[scenarios.json] --> B
    B --> D[Simulation Engine]
    D --> E{Agent Type?}
    E -->|custom| F[Python Agent]
    E -->|chat_completions| G[HTTP Endpoint]
    E -->|a2a| H[A2A Agent]
    F --> I[Execute Scenarios]
    G --> I
    H --> I
    I --> J[Simulation Results]
    J --> K[Evaluator]
    J --> L[Conversation Viewer]
    K --> M[Evaluation Report]
    M --> N[scores.json]
    M --> O[failures.json]
```

## Best Practices

**Environment Variables**: API keys should be set as environment variables rather than hardcoded in configuration files:

```bash
export OPENAI_API_KEY="<your-key>"
export ANTHROPIC_API_KEY="<your-key>"
```
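
A run that fails halfway because a key is missing wastes time, so a script can check the environment first. The helper below is a minimal sketch with a hypothetical name, `missing_api_keys`. Which keys are required depends on the configured provider.

```python
import os


def missing_api_keys(required, env=None) -> list[str]:
    """Return the names in `required` that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

For example, `missing_api_keys(["OPENAI_API_KEY"])` returns an empty list when the key is exported as shown above.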

**Separation of Concerns**: Use separate `config_simulate.yaml` and `config_evaluate.yaml` files when iterating on scenarios or metrics independently.

**Custom Metrics Organization**: Place custom metrics in dedicated directories (e.g., `tests/arksim/custom_metrics/`) and reference them from the main configuration.

**Scenario Versioning**: Keep scenarios in version-controlled JSON files that can be reviewed and updated as agent requirements evolve.

Source: [CONTRIBUTING.md](https://github.com/arklexai/arksim/blob/main/CONTRIBUTING.md)

---


## Doramagic Pitfall Log

Project: arklexai/arksim

Summary: 15 potential pitfalls found, 0 of them high/blocking; highest priority: Installation pitfall - source evidence: v0.1.0.

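## 1. Installation pitfall · Source evidence: v0.1.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified installation-related issue in this project: v0.1.0
- User impact: May affect upgrades, migration, or version selection.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_d55b6435222f4142bf979809f0cf0794 | https://github.com/arklexai/arksim/releases/tag/v0.1.0 | Unverified usage condition exposed by source type github_release.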

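## 2. Configuration pitfall · Source evidence: v0.0.6

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified configuration-related issue in this project: v0.0.6
- User impact: May raise the cost of trial use for new users and of production integration.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_57b2e0dc0cf94da8b4fc97ea744ae7e9 | https://github.com/arklexai/arksim/releases/tag/v0.0.6 | Unverified usage condition exposed by source type github_release.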

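## 3. Configuration pitfall · Source evidence: v0.3.2

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified configuration-related issue in this project: v0.3.2
- User impact: May raise the cost of trial use for new users and of production integration.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_3a2981d341cb4512a410e7d75e03baf0 | https://github.com/arklexai/arksim/releases/tag/v0.3.2 | Unverified usage condition exposed by source type github_release.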

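## 4. Configuration pitfall · Source evidence: v0.3.4

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified configuration-related issue in this project: v0.3.4
- User impact: May block installation or the first run.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_f949170a7f5d440cb0559c40fb441178 | https://github.com/arklexai/arksim/releases/tag/v0.3.4 | Unverified usage condition exposed by source type github_release.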

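## 5. Configuration pitfall · Source evidence: v0.3.5

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified configuration-related issue in this project: v0.3.5
- User impact: May raise the cost of trial use for new users and of production integration.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_314def1c94614699b6c9ad728c36e9ab | https://github.com/arklexai/arksim/releases/tag/v0.3.5 | Unverified usage condition exposed by source type github_release.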

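## 6. Capability pitfall · Source evidence: v0.3.1

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified issue related to understanding the project's capabilities: v0.3.1
- User impact: May raise the cost of trial use for new users and of production integration.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_c3e7552e9af84b95b2aec63c73dc7c26 | https://github.com/arklexai/arksim/releases/tag/v0.3.1 | Unverified usage condition exposed by source type github_release.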

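## 7. Capability pitfall · Capability assessment relies on assumptions

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: If the assumption does not hold, users will not get the promised capability.
- Suggested check: Turn the assumption into a downstream verification checklist.
- Safeguard: Assumptions must become verification items; they must not be stated as fact before verification results exist.
- Evidence: capability.assumptions | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | README/documentation is current enough for a first validation pass.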

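## 8. Maintenance pitfall · Source evidence: v0.3.3

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified maintenance/version-related issue in this project: v0.3.3
- User impact: May raise the cost of trial use for new users and of production integration.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_03b0d8ce6f8a4bc2bc8e33a7636aa07c | https://github.com/arklexai/arksim/releases/tag/v0.3.3 | Unverified usage condition exposed by source type github_release.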

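## 9. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: last_activity_observed is not recorded.
- User impact: New, abandoned, and active projects become indistinguishable, which lowers confidence in the recommendation.
- Suggested check: Add signals for recent GitHub commits, releases, and issue/PR responses.
- Safeguard: While maintenance activity is unknown, the recommendation must not be marked high-trust.
- Evidence: evidence.maintainer_signals | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | last_activity_observed missing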

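## 10. Security/permissions pitfall · Downstream validation found risk items

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: Downstream has already requested a review; this must not be downplayed on the page.
- Suggested check: Add to the security/permissions governance review queue.
- Safeguard: While downstream risk exists, the review/recommendation must remain downgraded.
- Evidence: downstream_validation.risk_items | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | no_demo; severity=medium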

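## 11. Security/permissions pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: The risk affects whether the project is suitable for ordinary users to install.
- Suggested check: Record the risk in the boundary card and confirm whether manual review is needed.
- Safeguard: Scoring risks must appear in the boundary card, not only as an internal score.
- Evidence: risks.scoring_risks | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | no_demo; severity=medium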

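## 12. Security/permissions pitfall · Source evidence: v0.2.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified security/permissions-related issue in this project: v0.2.0
- User impact: May affect upgrades, migration, or version selection.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_3c70cc242a1242b98d50cfe5078d6752 | https://github.com/arklexai/arksim/releases/tag/v0.2.0 | The source discussion mentions python-related conditions; review before installation or trial.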

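## 13. Security/permissions pitfall · Source evidence: v0.3.0

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence indicates an unverified security/permissions-related issue in this project: v0.3.0
- User impact: May affect upgrades, migration, or version selection.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_69bd9f91a5064aeeb87f325f2f56f115 | https://github.com/arklexai/arksim/releases/tag/v0.3.0 | Unverified usage condition exposed by source type github_release.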

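## 14. Maintenance pitfall · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: Users cannot tell whether anyone will respond when they hit a problem.
- Suggested check: Sample recent issues/PRs to see whether they go unhandled for long periods.
- Safeguard: While issue/PR responsiveness is unknown, a maintenance risk notice is required.
- Evidence: evidence.maintainer_signals | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | issue_or_pr_quality=unknown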

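## 15. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: Install commands and docs may lag behind the code, raising the chance that users hit problems.
- Suggested check: Confirm that the latest release/tag matches the README install commands.
- Safeguard: When release cadence is unknown or stale, the installation instructions must note possible drift.
- Evidence: evidence.maintainer_signals | art_e3064c2689144cfb89a534e0544c6bfc | https://github.com/arklexai/arksim#readme | release_recency=unknown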

<!-- canonical_name: arklexai/arksim; human_manual_source: deepwiki_human_wiki -->
