# https://github.com/confident-ai/deepeval Project Manual

Generated at: 2026-06-12 04:15:44 UTC

## Table of Contents

- [DeepEval Overview and Core Architecture](#page-1)
- [Tracing, Observability and Framework Integrations](#page-2)
- [Evaluation Engine, Metrics and Synthetic Data](#page-3)
- [CLI, Tooling, Extensibility and TypeScript](#page-4)

<a id='page-1'></a>

## DeepEval Overview and Core Architecture

### Related Pages

Related topics: [Tracing, Observability and Framework Integrations](#page-2), [Evaluation Engine, Metrics and Synthetic Data](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/confident-ai/deepeval/blob/main/README.md)
- [deepeval/integrations/README.md](https://github.com/confident-ai/deepeval/blob/main/deepeval/integrations/README.md)
- [deepeval/models/llms/gateway_model.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/gateway_model.py)
- [deepeval/models/llms/constants.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/constants.py)
- [deepeval/cli/utils.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/cli/utils.py)
- [deepeval/cli/generate/command.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/cli/generate/command.py)
- [skills/README.md](https://github.com/confident-ai/deepeval/blob/main/skills/README.md)
- [typescript/README.md](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md)
- [typescript/package.json](https://github.com/confident-ai/deepeval/blob/main/typescript/package.json)
- [typescript/src/models/base-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/base-model.ts)
- [typescript/src/models/openai-compatible-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/openai-compatible-model.ts)
- [typescript/src/models/providers/anthropic-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/providers/anthropic-model.ts)
- [typescript/src/models/providers/ai-sdk-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/providers/ai-sdk-model.ts)
- [typescript/src/tracing/tracing.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/tracing.ts)
- [typescript/src/prompt/index.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/prompt/index.ts)
</details>

# DeepEval Overview and Core Architecture

## Purpose and Scope

DeepEval is an LLM evaluation framework that combines a Python package, a CLI, a companion TypeScript SDK, and a managed cloud platform (Confident AI) into a single workflow for testing, tracing, and benchmarking LLM applications. As stated in the main [README.md](https://github.com/confident-ai/deepeval/blob/main/README.md), it ships with metrics for RAG, agents, conversational systems, and tool use, plus integrations for popular frameworks such as OpenAI, LangChain, LangGraph, Pydantic AI, CrewAI, Anthropic, AWS AgentCore, and LlamaIndex.

The framework targets three primary audiences:

1. **Application developers** who need unit-test-style evals inside `pytest` (`deepeval test run`).
2. **Agent builders** who need tracing, span-level metrics, and the `@observe` decorator (`vibe-coder` workflow described in the README).
3. **Platform users** who manage datasets, prompts, and reporting in Confident AI from either Python or TypeScript.

## High-Level Architecture

DeepEval is layered so that a stable set of metric and tracing abstractions sits above provider-specific gateways and a CLI/SDK surface. The diagram below shows how user code reaches the platform.

```mermaid
flowchart TB
    User[User Code / Agent] --> Py[deepeval Python package]
    User --> TS[TypeScript SDK]
    Py --> Metric[Metrics: RAG, Agentic, Tool, Multi-turn]
    Py --> Trace[Tracing: @observe, spans, OTEL]
    Py --> CLI[deepeval CLI: test run, generate, login]
    Py --> Model[LLM Gateway: GatewayModel]
    Model --> Provider[(Provider: OpenAI, Anthropic, Ollama, AI Gateway)]
    Trace --> CA[(Confident AI: datasets, traces, reports)]
    CLI --> CA
    TS --> CA
    Model --> CA
```

The metric layer, tracing layer, and model gateway are decoupled; a single test run can mix any metric with any model, and trace spans automatically pick up cost and token metadata from the model that produced them. Source: [deepeval/models/llms/gateway_model.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/gateway_model.py) and [README.md](https://github.com/confident-ai/deepeval/blob/main/README.md).

## Model Gateway and Provider Coverage

The Python package exposes a unified `GatewayModel` abstraction. Each subclass sets a `PROVIDER_SLUG` and a human-readable `PROVIDER_LABEL`, builds a per-instance retry decorator via `create_retry_decorator(self.PROVIDER_SLUG)`, and implements the inherited abstract `load_model` plus `_generate` / `_a_generate`. The public `generate` / `a_generate` wrap those implementations with the centralized retry decorator so retries are consistent across all gateways. Source: [deepeval/models/llms/gateway_model.py:1-50](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/gateway_model.py).

Cost and capability metadata for every supported preset is encoded in [deepeval/models/llms/constants.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/constants.py), where each entry declares `supports_log_probs`, `supports_multimodal`, `supports_structured_outputs`, `supports_json`, plus `input_price` and `output_price` used by the cost calculator. Newer presets such as `claude-opus-4-8` are added there with multimodal and structured-output support and updated pricing.

The TypeScript SDK mirrors this design. `DeepEvalBaseLLM` in [typescript/src/models/base-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/base-model.ts) defines the contract: an abstract `generate(prompt, schema)` returning `{ output, cost }`, plus `supportsMultimodal`, `supportsStructuredOutputs`, and `supportsLogProbs` capability flags. Concrete implementations include `DeepEvalOpenAICompatibleModel` (shared base for any OpenAI Chat Completions-compatible provider, see [typescript/src/models/openai-compatible-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/openai-compatible-model.ts)), `AnthropicModel` (see [typescript/src/models/providers/anthropic-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/providers/anthropic-model.ts)), and `AISDKModel` for Vercel AI SDK users (see [typescript/src/models/providers/ai-sdk-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/providers/ai-sdk-model.ts)).

## CLI, Test Runs, and Synthetic Data

The CLI is built on Typer and is the canonical entry point for `deepeval test run`, `deepeval login`, and `deepeval generate`. Shared utilities live in [deepeval/cli/utils.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/cli/utils.py), which defines a UTM-append helper `with_utm(url, *, medium, content)` and a `_CONFIDENT_UTM_HOSTS` allow-list of browser-clickable Confident AI properties. Programmatic hosts (`api.*`, `deepeval.*`, `otel.*`) are intentionally excluded so generated links land users on dashboards, not API endpoints. Source: [deepeval/cli/utils.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/cli/utils.py).

The `generate` subcommand unifies single-turn and multi-turn golden generation. It dispatches on `GenerationMethod` (`DOCUMENTS`, `CONTEXTS`, `SCRATCH`, `GOLDENS`) and `GoldenVariation` (`SINGLE_TURN` vs. conversational), calling `generate_goldens_from_docs`, `generate_goldens_from_contexts`, `generate_goldens_from_scratch`, or the corresponding `generate_conversational_goldens_*` variants. Source: [deepeval/cli/generate/command.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/cli/generate/command.py).

A long-standing community request to add a `--only-failed` flag for `deepeval test run` (useful in Git pre-push hooks, see issue [#1235](https://github.com/confident-ai/deepeval/issues/1235)) is still open; the current CLI surfaces all test results.

## Tracing, Integrations, and Skills

Tracing is provided through the `@observe` decorator and span types (`LlmSpan`, `ToolSpan`, `RetrieverSpan`, `AgentSpan`). The `update_current_trace(...)` / `update_current_span(...)` helpers work anywhere on the call stack and emit a single REST POST per trace, avoiding the UUID reconciliation overhead of raw OpenTelemetry exporters. The [deepeval/integrations/README.md](https://github.com/confident-ai/deepeval/blob/main/deepeval/integrations/README.md) describes a four-row integration matrix (`Bare`, `@observe`/`with trace(...)`, `evals_iterator`, `deepeval test run`) and explains that `ContextAwareSpanProcessor` flips OTel-mode integrations to REST routing automatically when `trace_manager.is_evaluating` is `True`. Some community-reported gaps remain: the `llm.token_count.*` attributes emitted via raw OpenInference OTLP spans do not always populate the `tokens` field in the Confident UI (issue [#2746](https://github.com/confident-ai/deepeval/issues/2746)), and the pydantic-ai integration with `OpenAIResponsesModel` returns `None` for `tools_called`, `expected_tools`, and `actual_output` under `ConfidentInstrumentationSettings` (issue [#2508](https://github.com/confident-ai/deepeval/issues/2508)). A related enhancement request to track cached input tokens in `LlmSpan` cost calculation is tracked in issue [#2741](https://github.com/confident-ai/deepeval/issues/2741).

The [skills/README.md](https://github.com/confident-ai/deepeval/blob/main/skills/README.md) ships agent "Skills" (a Claude.ai/Cursor concept) that teach coding agents how to add evals, generate datasets, enable tracing, and iterate on failures. The three published skills are `deepeval` (core eval workflow), `deepeval-otel` (raw OpenTelemetry export with no Python dependency), and `deepeval-tracing` (native `@observe`-based instrumentation).

## TypeScript SDK Surface

The TypeScript SDK lives under `typescript/` and is published as the `deepeval` package (see [typescript/package.json](https://github.com/confident-ai/deepeval/blob/main/typescript/package.json)). Subpath exports include `deepeval/dataset`, `deepeval/testCase`, `deepeval/tracing`, `deepeval/confident`, `deepeval/openai`, and `deepeval/integrations/ai-sdk`. The [typescript/README.md](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md) explains that the initial release focuses on the Confident AI API surface — dataset push/pull, evaluation reporting, prompt CRUD — while local execution features (LLM-as-a-judge metrics, NLP models, fully local eval) remain Python-only. The stated milestone is 80% parity on the Confident AI surface by the end of July, including shared prompt templates consumed by both languages. The tracing layer is implemented in [typescript/src/tracing/tracing.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/tracing.ts), and prompt CRUD is in [typescript/src/prompt/index.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/prompt/index.ts).

## Known Limitations and Community Context

- **OTLP token mapping**: Raw OpenInference `llm.token_count.*` attributes do not surface as the `tokens` field in the Confident UI ([#2746](https://github.com/confident-ai/deepeval/issues/2746)).
- **Pydantic AI instrumentation**: `actual_output`, `tools_called`, and `expected_tools` are `None` for `OpenAIResponsesModel` ([#2508](https://github.com/confident-ai/deepeval/issues/2508)).
- **Cached input tokens**: `LlmSpan` exposes only `input_token_count`, `output_token_count`, `cost_per_input_token`, `cost_per_output_token`; cached-token reporting is not yet supported ([#2741](https://github.com/confident-ai/deepeval/issues/2741)).
- **Contextual Precision overlap**: Near-duplicate overlapping retrieval chunks are penalized as distinct low-quality retrievals, under-reporting retrieval quality for chunked-document RAG ([#2594](https://github.com/confident-ai/deepeval/issues/2594)).
- **CLI verbosity**: No `--only-failed` option for `deepeval test run` ([#1235](https://github.com/confident-ai/deepeval/issues/1235)).
- **Security reporting**: The repository has no `SECURITY.md` and the GitHub Private Vulnerability Reporting endpoint returns 404 ([#2744](https://github.com/confident-ai/deepeval/issues/2744)).
- **TypeScript scope**: Local metrics and NLP models remain Python-only while the TS SDK reaches parity ([#2734](https://github.com/confident-ai/deepeval/issues/2734)).

## See Also

- [Confident AI Platform Integration](https://www.confident-ai.com)
- [DeepEval Metrics Documentation](https://deepeval.com/docs/metrics-introduction)
- [Integrations Matrix](https://github.com/confident-ai/deepeval/blob/main/deepeval/integrations/README.md)
- [TypeScript SDK README](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md)
- [Agent Skills](https://github.com/confident-ai/deepeval/blob/main/skills/README.md)

---

<a id='page-2'></a>

## Tracing, Observability and Framework Integrations

### Related Pages

Related topics: [DeepEval Overview and Core Architecture](#page-1), [Evaluation Engine, Metrics and Synthetic Data](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [deepeval/integrations/README.md](https://github.com/confident-ai/deepeval/blob/main/deepeval/integrations/README.md)
- [deepeval/models/llms/gateway_model.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/gateway_model.py)
- [deepeval/models/llms/constants.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/constants.py)
- [README.md](https://github.com/confident-ai/deepeval/blob/main/README.md)
- [typescript/src/tracing/tracing.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/tracing.ts)
- [typescript/src/tracing/index.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/index.ts)
- [typescript/src/integrations/openinference/processor.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/integrations/openinference/processor.ts)
- [typescript/src/openai/extractor.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/openai/extractor.ts)
- [typescript/src/openai/types.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/openai/types.ts)
- [typescript/src/models/base-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/base-model.ts)
- [typescript/src/models/openai-compatible-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/openai-compatible-model.ts)
- [typescript/src/index.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/index.ts)
- [typescript/README.md](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md)
- [typescript/package.json](https://github.com/confident-ai/deepeval/blob/main/typescript/package.json)
</details>

# Tracing, Observability and Framework Integrations

## Overview and Scope

DeepEval provides a unified tracing and observability layer that records LLM-application execution as structured traces and ships them to the Confident AI platform. The system supports two transports — a native REST path (`api.confident-ai.com/v1/traces`) and an OpenTelemetry OTLP path (`otel.confident-ai.com/v1/traces`) — so teams can pick the route that fits their stack. Framework integrations such as LangChain, LangGraph, Pydantic AI, CrewAI, Anthropic, LlamaIndex, and AWS AgentCore all plug into this layer using one of four documented mechanisms ([deepeval/integrations/README.md](https://github.com/confident-ai/deepeval/blob/main/deepeval/integrations/README.md), [README.md](https://github.com/confident-ai/deepeval/blob/main/README.md)).

Each integration is graded against four capability axes. **Bare** means calling the framework directly (no enclosing decorator) still produces a trace. **`@observe` / `with trace(...)`** means wrapping a call merges framework spans into the native trace context, enabling `update_current_trace(...)` and `update_current_span(...)` anywhere in the call stack with a single REST POST per trace and no UUID reconciliation. **`evals_iterator`** works inside `dataset.evals_iterator(...)` both end-to-end and component-level. **`deepeval test run`** works under the pytest tracing-eval entry point. For OTel-mode integrations, `ContextAwareSpanProcessor` flips automatically to REST routing when `trace_manager.is_evaluating` is `True` ([deepeval/integrations/README.md](https://github.com/confident-ai/deepeval/blob/main/deepeval/integrations/README.md)).

## Tracing Architecture

The native trace model defines typed spans (`AGENT`, `LLM`, `TOOL`, `RETRIEVER`, `CHAIN`, `CUSTOM`) plus a trace root that aggregates them. Each span carries metadata, tags, timing, status (`SUCCESS` / `ERROR`), and optional I/O. The TypeScript SDK exposes this surface through `traceManager`, `getCurrentSpan`, `getCurrentTrace`, the `observe` decorator, and the span mutators `updateCurrentSpan`, `updateLlmSpan`, and `updateRetrieverSpan` ([typescript/src/tracing/index.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/index.ts)). The conversion layer in `convertSpanToApiSpan` normalizes `Date` objects to ISO strings, coerces missing outputs (defaulting to `""` or `[]` for retriever spans), and picks `endTime` from `startTime` when not provided or earlier than the start ([typescript/src/tracing/tracing.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/tracing.ts)).

For community-maintained OpenInference instrumentors, `deepeval/integrations/openinference/` registers a `TracerProvider` and an `OpenInferenceSpanInterceptor` that translates semantic-convention attributes — `openinference.span.kind`, `llm.input_messages.{idx}`, `llm.output_messages.{idx}`, `tool.name`, `llm.token_count.*` — into the internal `BaseSpan` / `LlmSpan` / `ToolSpan` representation and routes them through `ContextAwareSpanProcessor` ([deepeval/integrations/README.md](https://github.com/confident-ai/deepeval/blob/main/deepeval/integrations/README.md)). The TypeScript equivalent maintains an `OI_KIND_TO_SPAN_TYPE` mapping (`AGENT → SpanType.AGENT`, `CHAIN → SpanType.AGENT`, `LLM → SpanType.LLM`, `TOOL → SpanType.TOOL`, `RETRIEVER → SpanType.RETRIEVER`) and walks flattened `llm.input_messages.{i}.message.content` keys to reconstruct input/output text ([typescript/src/integrations/openinference/processor.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/integrations/openinference/processor.ts)).

## Framework Integrations and Model Wrappers

The TypeScript SDK ships first-class wrappers for popular providers. The OpenAI integration extracts token usage and tool calls from both the Chat Completions API and the Responses API. `extractOutputParametersFromCompletionAPI` parses `usage.prompt_tokens`, `usage.completion_tokens`, and `choices[0].message.tool_calls` into the internal `OutputParameters` shape, building a `toolsCalled` array from either `function` or `custom` tool calls and resolving human-readable descriptions via `inputParameters.toolDescriptions` ([typescript/src/openai/extractor.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/openai/extractor.ts)). The companion type definitions declare these fields explicitly: `output`, `promptTokens`, `completionTokens`, and `toolsCalled?` ([typescript/src/openai/types.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/openai/types.ts)).

For non-OpenAI providers, the TypeScript SDK defines `DeepEvalBaseLLM` as the abstract base class with a single contract: `generate(prompt, schema?)` returns `{ output, cost }`, where `cost` is the USD price computed from `costPerInputToken` and `costPerOutputToken` ([typescript/src/models/base-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/base-model.ts)). OpenAI-compatible providers extend `DeepEvalOpenAICompatibleModel`, a shared base that handles client construction, JSON-mode structured output via `toJsonSchema`, and token→cost computation ([typescript/src/models/openai-compatible-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/openai-compatible-model.ts)). The Python side implements the same pattern in `gateway_model.py`: provider subclasses expose `PROVIDER_SLUG` and `PROVIDER_LABEL`, implement `_generate` / `_a_generate`, and let a centralized `create_retry_decorator(PROVIDER_SLUG)` wrap the public `generate` / `a_generate` so retries are consistent across every gateway ([deepeval/models/llms/gateway_model.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/gateway_model.py)). Pricing metadata for each model preset lives in `constants.py` (`make_model_data(...)` declares `supports_log_probs`, `supports_multimodal`, `supports_structured_outputs`, `supports_json`, `input_price`, `output_price`) ([deepeval/models/llms/constants.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/constants.py)).

| Capability | Python | TypeScript (v0.1.28) |
|---|---|---|
| Push/pull datasets | Yes | Yes |
| Trace LLM apps | Yes | Yes |
| Run evaluations via Confident AI | Yes | Yes |
| Read/write prompts and versions | Yes | Yes |
| LLM-as-a-judge metrics | Yes | Not yet (roadmap) |
| Local NLP models | Yes | Not yet (roadmap) |

Source: [typescript/README.md](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md), [typescript/package.json](https://github.com/confident-ai/deepeval/blob/main/typescript/package.json)

## Known Issues and Limitations

The community has surfaced several limitations in the tracing and integration layer:

- **Tokens blank over OTLP** — Issue [#2746](https://github.com/confident-ai/deepeval/issues/2746) reports that `llm.token_count.*` attributes sent via OTLP to `otel.confident-ai.com` are not surfaced as the `tokens` field in the trace UI. The OpenInference processor does translate the attribute, but the rendering path requires the corresponding Confident span attributes to be populated by the receiving endpoint ([typescript/src/integrations/openinference/processor.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/integrations/openinference/processor.ts)).
- **Pydantic AI + `OpenAIResponsesModel`** — Issue [#2508](https://github.com/confident-ai/deepeval/issues/2508) notes that `tools_called`, `expected_tools`, and `actual_output` are all `None` under `ConfidentInstrumentationSettings`. The Chat Completions path parses `tool_calls` correctly, but the Responses-API path (`extractOutputParametersFromResponseAPI`) is handled separately and may not surface tool data identically ([typescript/src/openai/extractor.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/openai/extractor.ts)).
- **Cached input tokens** — Issue [#2741](https://github.com/confident-ai/deepeval/issues/2741) requests a `cached_input_token_count` field (and matching price). The current span model only supports `input_token_count`, `output_token_count`, `cost_per_input_token`, and `cost_per_output_token`.
- **TypeScript parity** — Issue [#2734](https://github.com/confident-ai/deepeval/issues/2734) tracks the broader TypeScript roadmap. The current `typescript/` package is a Confident AI client wrapper that explicitly defers local execution features to Python until 80% feature parity is reached ([typescript/README.md](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md)).

## See Also

- [Metrics Library](metrics.md) — LLM-as-a-judge, RAG, agentic, and multi-turn metrics
- [Datasets and Synthetic Data Generation](datasets.md) — Goldens, conversational goldens, and `evals_iterator`
- [Confident AI Platform Integration](confident-ai-platform.md) — Hosted datasets, traces, and reports
- [Model Providers and Gateways](model-providers.md) — Gateway model registry and pricing metadata

---

<a id='page-3'></a>

## Evaluation Engine, Metrics and Synthetic Data

### Related Pages

Related topics: [DeepEval Overview and Core Architecture](#page-1), [CLI, Tooling, Extensibility and TypeScript](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/confident-ai/deepeval/blob/main/README.md)
- [deepeval/models/llms/gateway_model.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/gateway_model.py)
- [deepeval/models/llms/constants.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/constants.py)
- [deepeval/cli/generate/command.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/cli/generate/command.py)
- [deepeval/cli/utils.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/cli/utils.py)
- [deepeval/integrations/README.md](https://github.com/confident-ai/deepeval/blob/main/deepeval/integrations/README.md)
- [typescript/src/models/base-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/base-model.ts)
- [typescript/src/tracing/tracing.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/tracing.ts)
- [typescript/src/tracing/api.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/api.ts)
- [typescript/src/prompt/index.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/prompt/index.ts)
- [typescript/README.md](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md)
- [typescript/package.json](https://github.com/confident-ai/deepeval/blob/main/typescript/package.json)
</details>

# Evaluation Engine, Metrics and Synthetic Data

## Overview

DeepEval's evaluation stack is built around three cooperating layers: an **evaluation engine** that drives end-to-end, trace-based, and component-level test runs; a **metrics catalogue** covering RAG, multi-turn, agentic, and tool-use quality dimensions; and a **synthetic data** pipeline that bootstraps goldens from documents, contexts, scratch, or pre-existing datasets. The Python package is the canonical implementation; the TypeScript SDK ([typescript/README.md](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md)) currently focuses on the Confident AI surface area (datasets, prompts, evaluation reporting) while LLM-as-a-judge metrics remain Python-only.

```mermaid
flowchart LR
    A[Goldens / Datasets] --> B[Evaluation Engine]
    C[Live Traces] --> B
    D[Metrics Catalogue] --> B
    B --> E[Console Report]
    B --> F[Confident API]
    G[CLI: deepeval generate] --> A
    H[Model Registry + Cost Data] --> D
```

## Evaluation Engine

### Contract for "judge" models

Every judge model that participates in metric scoring inherits from `DeepEvalBaseLLM`. The `GatewayModel` base class in [deepeval/models/llms/gateway_model.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/gateway_model.py) standardizes two responsibilities: a unified retry policy and a `(output, cost)` contract.

- Subclasses declare `PROVIDER_SLUG` and `PROVIDER_LABEL`; `__init__` builds a per-instance retry decorator via `create_retry_decorator(self.PROVIDER_SLUG)`.
- `_run` / `_arun` wrap provider implementations so retries are consistent across gateways.
- Subclasses implement `load_model` plus `_generate` / `_a_generate`, returning a `(output, cost)` tuple; public `generate` / `a_generate` are the wrapped, retry-aware entry points.

The TypeScript counterpart in [typescript/src/models/base-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/base-model.ts) exposes the same shape:

```ts
interface GenerationResult<T = string> { output: T; cost: number | null; }
abstract generate<T = string>(prompt: string, schema?: ZodType<T>): Promise<GenerationResult<T>>;
```

### Trace-based and end-to-end execution

Spans produced by the integrations layer are serialized through `convertSpanToApiSpan` in [typescript/src/tracing/tracing.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/tracing.ts), which normalizes `input` / `output`, converts `Date` fields to ISO strings, and groups spans by `SpanType` (`AGENT`, `LLM`, `RETRIEVER`, `TOOL`) into the `TraceApi` payload defined in [typescript/src/tracing/api.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/api.ts). The same payload carries per-span `metrics`, `metricCollection`, `llmTestCase`, `toolsCalled`, and `expectedTools` so that trace-level evaluation can be replayed server-side without re-running the application.

The integration matrix in [deepeval/integrations/README.md](https://github.com/confident-ai/deepeval/blob/main/deepeval/integrations/README.md) describes how four execution modes interact with the engine:

- **Bare** (no enclosing trace): each integration auto-creates a trace on first activity.
- **`@observe` / `with trace(...)`**: spans flow into the native trace context so `update_current_trace(...)` / `update_current_span(...)` work anywhere in the call stack.
- **`evals_iterator`**: runs inside `dataset.evals_iterator(...)` either end-to-end (`metrics=[...]`) or component-level (`@observe(metrics=[...])`).
- **`deepeval test run`**: pytest entry point with `@assert_test`, `@generate_trace_json`, and `@assert_*` decorators.

### Cost and token accounting

Pricing and capability data for individual judge models live in [deepeval/models/llms/constants.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/constants.py), keyed by model name. Each entry records `supports_log_probs`, `max_log_probs`, `supports_multimodal`, `supports_structured_outputs`, `supports_json`, `supports_temperature`, and `input_price` / `output_price` per token. The registry covers OpenAI families (`gpt-4o`, `gpt-4.1`, `gpt-5.x`, `o1`), Anthropic (`claude-3-opus-20240229` and newer), Ollama-tagged models (`phi3`, `llava`, `gemma2/3`, `qwen2.5`, `deepseek-r1`), and others. The judge layer in `gateway_model.py` uses this data to translate `(input_tokens, output_tokens, cost_per_input_token, cost_per_output_token)` into the `cost` field returned from `generate`.

## Metrics Catalogue

The metrics surface advertised in [README.md](https://github.com/confident-ai/deepeval/blob/main/README.md) groups into several families:

| Family | Representative metrics | Purpose |
|---|---|---|
| RAG | Answer Relevancy, Faithfulness, Contextual Recall, Contextual Precision, Contextual Relevancy, RAGAS | Quality of retrieval and grounded generation |
| Multi-Turn | Knowledge Retention, … | Conversational memory and consistency |
| Tool Use | Tool Correctness, Argument Correctness | Quality of agent tool calls and arguments |
| Agentic | Task Completion, plus trace-derived scores | End-to-end task success over a trace |

Community evidence (release notes for `v3.9.9`, `v4.0.2`, and `v4.0.5`) indicates that the agentic family in particular is now trace-driven, with `Task Completion` judging whether the agent *actually completes the intended task* rather than merely producing a plausible final answer.

A documented limitation: `ContextualPrecisionMetric` can over-penalize near-duplicate overlapping chunks in financial-document RAG pipelines (see [issue #2594](https://github.com/confident-ai/deepeval/issues/2594)). When chunking strategies rely on overlap to preserve context around table/section boundaries, retrieval quality may be under-reported.

## Synthetic Data Generation

The CLI command in [deepeval/cli/generate/command.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/cli/generate/command.py) is the entry point for bootstrapping goldens. It dispatches on two orthogonal axes:

- **Variation**: `GoldenVariation.SINGLE_TURN` vs. conversational.
- **Method**: `GenerationMethod.DOCS`, `GenerationMethod.CONTEXTS`, `GenerationMethod.SCRATCH`, or `GenerationMethod.GOLDENS`.

```python
if method == GenerationMethod.DOCS:
    synthesizer.generate_goldens_from_docs(
        document_paths=document_paths,
        include_expected_output=include_expected,
        max_goldens_per_context=max_goldens_per_context,
        context_construction_config=context_construction_config,
    )
```

For `DOCS`, callers tune `min_contexts_per_document`, `chunk_size`, `chunk_overlap`, `context_quality_threshold`, `context_similarity_threshold`, and `max_retries`. The conversational branch (`generate_conversational_goldens_from_docs` / `_from_contexts` / `_from_scratch`) reuses the same configuration but emits multi-turn goldens suitable for the multi-turn metric family. A long-standing CLI ergonomics request ([issue #1235](https://github.com/confident-ai/deepeval/issues/1235)) is to display only failed tests in the runner; until shipped, pre-push hooks need to post-process the full output.

## Known Limitations and Community Issues

- **Token / cost visibility**: `LlmSpan` only tracks `input_token_count`, `output_token_count`, `cost_per_input_token`, `cost_per_output_token`; cached input tokens are not yet representable ([issue #2741](https://github.com/confident-ai/deepeval/issues/2741)). The same gap is reported on the OpenTelemetry / OpenInference ingestion path, where the trace UI shows blank `tokens` despite valid `llm.token_count.*` attributes ([issue #2746](https://github.com/confident-ai/deepeval/issues/2746)).
- **pydantic-ai instrumentation**: When `ConfidentInstrumentationSettings` is used with pydantic-ai's `OpenAIResponsesModel`, `tools_called`, `expected_tools`, and `actual_output` come through as `None` ([issue #2508](https://github.com/confident-ai/deepeval/issues/2508)).
- **TypeScript surface area**: The `/typescript` package is currently a Confident AI client wrapper; LLM-as-a-judge metrics and local evaluation remain Python-only ([issue #2734](https://github.com/confident-ai/deepeval/issues/2734)). The Confident AI platform surface (datasets, prompts, evaluation reporting) is functional, with shared prompt templates planned ([typescript/README.md](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md)).
- **Security reporting**: There is no `SECURITY.md` and GitHub's private vulnerability reporting endpoint returns 404 for the repo ([issue #2744](https://github.com/confident-ai/deepeval/issues/2744)). Reporters should follow the maintainers' guidance until a disclosure process is published.

## See Also

- Tracing and Integrations
- Model Registry and Cost Tracking
- Synthetic Data and Datasets
- Confident AI Platform API

---

<a id='page-4'></a>

## CLI, Tooling, Extensibility and TypeScript

### Related Pages

Related topics: [DeepEval Overview and Core Architecture](#page-1), [Evaluation Engine, Metrics and Synthetic Data](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/confident-ai/deepeval/blob/main/README.md)
- [typescript/package.json](https://github.com/confident-ai/deepeval/blob/main/typescript/package.json)
- [typescript/README.md](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md)
- [typescript/src/confident/index.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/confident/index.ts)
- [typescript/src/prompt/index.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/prompt/index.ts)
- [typescript/src/prompt/types.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/prompt/types.ts)
- [typescript/src/openai/types.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/openai/types.ts)
- [typescript/src/openai/extractor.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/openai/extractor.ts)
- [typescript/src/models/base-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/base-model.ts)
- [typescript/src/models/openai-compatible-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/openai-compatible-model.ts)
- [typescript/src/models/providers/anthropic-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/providers/anthropic-model.ts)
- [typescript/src/models/providers/ollama-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/providers/ollama-model.ts)
- [typescript/src/dataset/golden.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/dataset/golden.ts)
- [typescript/src/tracing/tracing.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/tracing.ts)
- [deepeval/models/llms/gateway_model.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/gateway_model.py)
- [deepeval/models/llms/constants.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/constants.py)
</details>

# CLI, Tooling, Extensibility and TypeScript

This page documents the user-facing tooling layer of DeepEval: its Python CLI, the supporting TUI for trace inspection, the TypeScript SDK entry points, and the provider-pluggable model architecture that drives extensibility across both language ecosystems. The material is anchored in the source files listed above and the active community discussions that surfaced during the v4.0 release cycle.

## CLI and Operator Tooling

DeepEval is operated through a Python entry point that exposes a top-level command line, an HTTP inspection server, and a TUI for navigating traces captured from instrumented LLM applications. The integration catalog in [README.md](https://github.com/confident-ai/deepeval/blob/main/README.md) lists one-line integrations with LangChain, LangGraph, Pydantic AI, CrewAI, Anthropic, AWS AgentCore, and LlamaIndex, indicating that the CLI is the surface where these integrations are usually registered or invoked.

The TUI for trace inspection is described in the DeepEval 4.0 release notes as a day-one feature aimed at "rapid debugging" of agent traces. Community request [#1235](https://github.com/confident-ai/deepeval/issues/1235) further asks for a CLI option to display only failed tests, motivated by running DeepEval inside a Git pre-push hook. This is consistent with the design intent of the CLI being scriptable in CI contexts where terse, failure-only output is preferred.

## TypeScript SDK Overview

The TypeScript package is published from the [`typescript/`](https://github.com/confident-ai/deepeval/tree/main/typescript) directory and is configured for multiple sub-entry points to keep surface areas independent. Per [typescript/package.json](https://github.com/confident-ai/deepeval/blob/main/typescript/package.json), the package exposes:

| Sub-entry | Purpose |
|---|---|
| `deepeval/dataset` | Dataset and `Golden` objects |
| `deepeval/testCase` | Test case construction |
| `deepeval/tracing` | Trace collection and OTLP export |
| `deepeval/confident` | Confident AI platform API client |
| `deepeval/openai` | OpenAI wrapper utilities |
| `deepeval/integrations/*` | Framework integrations (AI SDK, LangChain, OpenAI Agents) |

The TypeScript SDK targets JavaScript and TypeScript teams building on the Confident AI platform. The [typescript/README.md](https://github.com/confident-ai/deepeval/blob/main/typescript/README.md) explicitly states: "Local execution features, such as LLM-as-a-judge metrics, NLP models, and fully local evaluation, currently remain in the Python package while we expand TypeScript support." The roadmap commits to 80% feature parity on the Confident AI integration surface by the end of July, including shared prompt templates consumed by both Python and TypeScript.

A typical TypeScript flow wires a client against the Confident API, pushes or pulls prompts, and reports evaluation runs. The module surface is documented by [typescript/src/confident/index.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/confident/index.ts), which re-exports the API helpers, the shared `types`, and the top-level `evaluate` function. The `evaluate` export is the user-facing entry point for reporting results back to the platform.

## Prompt Management in TypeScript

Prompt management is one of the more mature TypeScript surfaces. The `Prompt` class in [typescript/src/prompt/index.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/prompt/index.ts) provides an object-oriented wrapper around the Confident AI prompt endpoints, supporting `text` and `messages` payloads (mutually exclusive), `interpolationType`, `modelSettings`, `outputType`, `outputSchema`, and `tools`. Tool definitions are normalized via `ToolDataSchema` and `NormalizedToolDataSchema` defined in [typescript/src/prompt/types.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/prompt/types.ts), with `ToolMode` (auto / required / none) controlling tool selection semantics.

The push method explicitly refuses to send both `text` and `messages` simultaneously, throwing a `TypeError` to prevent ambiguous prompt payloads on the server. Pull options support `version`, `label`, `hash`, and `branch` selectors, mirroring the versioned commit graph used in the Python package. This is the foundation for the "shared prompt templates" feature mentioned in the TypeScript roadmap.

## Provider Model Architecture and Extensibility

Extensibility on the Python side is driven by a unified `DeepEvalBaseLLM` contract and a `gateway_model.py` retry-and-cost wrapper documented in [deepeval/models/llms/gateway_model.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/gateway_model.py). Subclasses set `PROVIDER_SLUG`, `PROVIDER_LABEL`, and per-token cost fields, then implement `load_model`, `_generate`, and `_a_generate`. The base class builds a per-instance retry decorator from the provider slug, so retries honor runtime configuration without re-instantiating.

Model capability metadata is centralized in [deepeval/models/llms/constants.py](https://github.com/confident-ai/deepeval/blob/main/deepeval/models/llms/constants.py), where each model preset declares `supports_log_probs`, `supports_multimodal`, `supports_structured_outputs`, `supports_json`, and pricing fields. This registry is the single source of truth referenced when new model IDs are added (e.g., the `claude-opus-4-8` preset in release v4.0.5).

The TypeScript equivalent is mirrored in [typescript/src/models/base-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/base-model.ts), which defines the `DeepEvalBaseLLM` abstract class with `generate`, `getModelName`, and capability accessors. Concrete providers extend this base. [typescript/src/models/openai-compatible-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/openai-compatible-model.ts) is the shared base for every OpenAI Chat-Completions-speaking provider; subclasses only resolve model name, base URL, env-var API keys, and default headers, then delegate generation to the base. [typescript/src/models/providers/anthropic-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/providers/anthropic-model.ts) and [typescript/src/models/providers/ollama-model.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/models/providers/ollama-model.ts) demonstrate the pattern: each is a thin class that defers to the base for token→cost computation and structured-output handling. Importantly, both providers leave `temperature` unset unless explicitly provided, because reasoning models reject the parameter.

```mermaid
flowchart LR
  A[DeepEvalBaseLLM] --> B[OpenAICompatibleModel]
  A --> C[AnthropicModel]
  A --> D[OllamaModel]
  B --> E[Gateway / Provider]
  C --> F[Anthropic API]
  D --> G[Local Ollama]
  E --> H[Cost + Retry]
  F --> H
  G --> H
  H --> I[GenerationResult]
```

OpenAI token and tool extraction lives in [typescript/src/openai/extractor.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/openai/extractor.ts), with input and output payload shapes declared in [typescript/src/openai/types.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/openai/types.ts). The extractor splits the Chat Completions and Responses APIs and walks `tool_calls` to produce a normalized `ToolCall[]` used by tracing and evaluation.

## Tracing, Datasets, and Community-Driven Gaps

The tracing module, surfaced via [`typescript/src/tracing/tracing.ts`](https://github.com/confident-ai/deepeval/blob/main/typescript/src/tracing/tracing.ts), converts in-memory `BaseSpan` objects into API-shaped `BaseApiSpan` objects, normalizes `Date` instances to ISO strings, and collapses undefined inputs/outputs to safe defaults (empty string for `LLM` spans, empty array for `RETRIEVER` spans). The `Golden` class in [typescript/src/dataset/golden.ts](https://github.com/confident-ai/deepeval/blob/main/typescript/src/dataset/golden.ts) is the dataset-side counterpart, supporting both `actualOutput` and `expectedOutput`, retrieval context, tool call tracking, and Confident AI bookkeeping fields (`_datasetRank`, `_datasetAlias`, `_datasetId`).

Several open issues highlight tooling gaps that this architecture must absorb:

- [#2746](https://github.com/confident-ai/deepeval/issues/2746): OTLP-exported LLM spans carry `llm.token_count.*` attributes but the Confident AI trace UI shows blank tokens, indicating a normalization gap in the OTEL ingestion path.
- [#2741](https://github.com/confident-ai/deepeval/issues/2741): `LlmSpan` cost tracking lacks fields for cached input tokens, even though providers return them.
- [#2508](https://github.com/confident-ai/deepeval/issues/2508): The Pydantic AI integration drops `actual_output`, `tools_called`, and `expected_tools` when paired with `OpenAIResponsesModel`, pointing to a coverage gap in the integration shim.

These gaps sit at the seam between framework integrations and the tracing/dataset modules documented above, and they map directly onto the v4.0 focus on "1-line integrations" and an agent-native evaluation workflow.

## See Also

- [Tracing and Observability](./Tracing-and-Observability.md)
- [Model Providers and Pricing](./Model-Providers-and-Pricing.md)
- [Confident AI Platform Integration](./Confident-AI-Platform-Integration.md)

---

<!-- evidence_pipeline_checked: true -->
<!-- evidence_injected: true -->

---

## Pitfall Log

Project: confident-ai/deepeval

Summary: Found 28 structured pitfall item(s), including 4 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/1235

## 2. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2508

## 3. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Developers should check this security_permissions risk before relying on the project: Security: request for a submitting security vulnerabilities.
- User impact: Developers may expose sensitive permissions or credentials: Security: request for a submitting security vulnerabilities.
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2744

## 4. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2594

## 5. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.host_targets | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 6. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: 🎉 New Interfaces, Reduce ETL Code < 50%!
- User impact: Upgrade or migration may change expected behavior: 🎉 New Interfaces, Reduce ETL Code < 50%!
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.7.2

## 7. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!
- User impact: Upgrade or migration may change expected behavior: 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v4.0.2

## 8. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 9. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: ConfidentInstrumentationSettings with pydantic-ai: tools_called, expected_tools, and actual_output are all None when using OpenAIResponsesModel
- User impact: Developers may hit a documented source-backed failure mode: ConfidentInstrumentationSettings with pydantic-ai: tools_called, expected_tools, and actual_output are all None when using OpenAIResponsesModel
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2508

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this migration risk before relying on the project: 🎉 New Decision Graph Logic for Granular Simulation Control
- User impact: Upgrade or migration may change expected behavior: 🎉 New Decision Graph Logic for Granular Simulation Control
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v4.0.3

## 11. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 12. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 13. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 14. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2741

## 15. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2746

## 16. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: CLI improvement: option to display only failed tests
- User impact: Developers may hit a documented source-backed failure mode: CLI improvement: option to display only failed tests
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/1235

## 17. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Feature: support cached input tokens in LLM span cost tracking
- User impact: Developers may hit a documented source-backed failure mode: Feature: support cached input tokens in LLM span cost tracking
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2741

## 18. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this conceptual risk before relying on the project: DeepEval for Typescript
- User impact: Developers may hit a documented source-backed failure mode: DeepEval for Typescript
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2734

## 19. Runtime risk - Runtime risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: Contextual Precision over-penalizes overlapping chunks in financial-document RAG
- User impact: Developers may hit a documented source-backed failure mode: Contextual Precision over-penalizes overlapping chunks in financial-document RAG
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2594

## 20. Runtime risk - Runtime risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: LLM tokens not displayed when using custom OpenTelemetry / OpenInference OTLP export
- User impact: Developers may hit a documented source-backed failure mode: LLM tokens not displayed when using custom OpenTelemetry / OpenInference OTLP export
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2746

## 21. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown。
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 22. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown。
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 23. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: Opus 4.8: Day 0 Support
- User impact: Upgrade or migration may change expected behavior: Opus 4.8: Day 0 Support
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v4.0.5

## 24. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 🎉 Metrics for AI agents, multi-turn synthetic data generation, and more!
- User impact: Upgrade or migration may change expected behavior: 🎉 Metrics for AI agents, multi-turn synthetic data generation, and more!
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.9.9

## 25. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 🎉 New Arena GEval Metric, for Pairwise Comparisons
- User impact: Upgrade or migration may change expected behavior: 🎉 New Arena GEval Metric, for Pairwise Comparisons
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.1.9

## 26. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 🎉 New Conversational Evaluation, LiteLLM Integration
- User impact: Upgrade or migration may change expected behavior: 🎉 New Conversational Evaluation, LiteLLM Integration
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.0.8

## 27. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 🎉 New Multimodal Metrics, with Platform Support
- User impact: Upgrade or migration may change expected behavior: 🎉 New Multimodal Metrics, with Platform Support
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.1.5

## 28. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 🎉 Renewed datasets, single vs multi-turn
- User impact: Upgrade or migration may change expected behavior: 🎉 Renewed datasets, single vs multi-turn
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.2.6

<!-- canonical_name: confident-ai/deepeval; human_manual_source: deepwiki_human_wiki -->
