deepeval Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

deepeval

The LLM Evaluation Framework

DeepEval Overview and Core Architecture

Related topics: Tracing, Observability and Framework Integrations, Evaluation Engine, Metrics and Synthetic Data

Section Related Pages

Continue reading this section for the full explanation and source context.

DeepEval Overview and Core Architecture

Purpose and Scope

DeepEval is an LLM evaluation framework that combines a Python package, a CLI, a companion TypeScript SDK, and a managed cloud platform (Confident AI) into a single workflow for testing, tracing, and benchmarking LLM applications. As stated in the main README.md, it ships with metrics for RAG, agents, conversational systems, and tool use, plus integrations for popular frameworks such as OpenAI, LangChain, LangGraph, Pydantic AI, CrewAI, Anthropic, AWS AgentCore, and LlamaIndex.

The framework targets three primary audiences:

Application developers who need unit-test-style evals inside pytest (deepeval test run).
Agent builders who need tracing, span-level metrics, and the @observe decorator (vibe-coder workflow described in the README).
Platform users who manage datasets, prompts, and reporting in Confident AI from either Python or TypeScript.

High-Level Architecture

DeepEval is layered so that a stable set of metric and tracing abstractions sits above provider-specific gateways and a CLI/SDK surface. The diagram below shows how user code reaches the platform.

flowchart TB
    User[User Code / Agent] --> Py[deepeval Python package]
    User --> TS[TypeScript SDK]
    Py --> Metric[Metrics: RAG, Agentic, Tool, Multi-turn]
    Py --> Trace[Tracing: @observe, spans, OTEL]
    Py --> CLI[deepeval CLI: test run, generate, login]
    Py --> Model[LLM Gateway: GatewayModel]
    Model --> Provider[(Provider: OpenAI, Anthropic, Ollama, AI Gateway)]
    Trace --> CA[(Confident AI: datasets, traces, reports)]
    CLI --> CA
    TS --> CA
    Model --> CA

The metric layer, tracing layer, and model gateway are decoupled; a single test run can mix any metric with any model, and trace spans automatically pick up cost and token metadata from the model that produced them. Source: deepeval/models/llms/gateway_model.py and README.md.

Model Gateway and Provider Coverage

The Python package exposes a unified GatewayModel abstraction. Each subclass sets a PROVIDER_SLUG and a human-readable PROVIDER_LABEL, builds a per-instance retry decorator via create_retry_decorator(self.PROVIDER_SLUG), and implements the inherited abstract load_model plus _generate / _a_generate. The public generate / a_generate wrap those implementations with the centralized retry decorator so retries are consistent across all gateways. Source: deepeval/models/llms/gateway_model.py:1-50.

Cost and capability metadata for every supported preset is encoded in deepeval/models/llms/constants.py, where each entry declares supports_log_probs, supports_multimodal, supports_structured_outputs, supports_json, plus input_price and output_price used by the cost calculator. Newer presets such as claude-opus-4-8 are added there with multimodal and structured-output support and updated pricing.

The TypeScript SDK mirrors this design. DeepEvalBaseLLM in typescript/src/models/base-model.ts defines the contract: an abstract generate(prompt, schema) returning { output, cost }, plus supportsMultimodal, supportsStructuredOutputs, and supportsLogProbs capability flags. Concrete implementations include DeepEvalOpenAICompatibleModel (shared base for any OpenAI Chat Completions-compatible provider, see typescript/src/models/openai-compatible-model.ts), AnthropicModel (see typescript/src/models/providers/anthropic-model.ts), and AISDKModel for Vercel AI SDK users (see typescript/src/models/providers/ai-sdk-model.ts).

CLI, Test Runs, and Synthetic Data

The CLI is built on Typer and is the canonical entry point for deepeval test run, deepeval login, and deepeval generate. Shared utilities live in deepeval/cli/utils.py, which defines a UTM-append helper with_utm(url, *, medium, content) and a _CONFIDENT_UTM_HOSTS allow-list of browser-clickable Confident AI properties. Programmatic hosts (api.*, deepeval.*, otel.*) are intentionally excluded so generated links land users on dashboards, not API endpoints. Source: deepeval/cli/utils.py.

The generate subcommand unifies single-turn and multi-turn golden generation. It dispatches on GenerationMethod (DOCUMENTS, CONTEXTS, SCRATCH, GOLDENS) and GoldenVariation (SINGLE_TURN vs. conversational), calling generate_goldens_from_docs, generate_goldens_from_contexts, generate_goldens_from_scratch, or the corresponding generate_conversational_goldens_* variants. Source: deepeval/cli/generate/command.py.

A long-standing community request to add a --only-failed flag for deepeval test run (useful in Git pre-push hooks, see issue #1235) is still open; the current CLI surfaces all test results.

Tracing, Integrations, and Skills

Tracing is provided through the @observe decorator and span types (LlmSpan, ToolSpan, RetrieverSpan, AgentSpan). The update_current_trace(...) / update_current_span(...) helpers work anywhere on the call stack and emit a single REST POST per trace, avoiding the UUID reconciliation overhead of raw OpenTelemetry exporters. The deepeval/integrations/README.md describes a four-row integration matrix (Bare, @observe/with trace(...), evals_iterator, deepeval test run) and explains that ContextAwareSpanProcessor flips OTel-mode integrations to REST routing automatically when trace_manager.is_evaluating is True. Some community-reported gaps remain: the llm.token_count.* attributes emitted via raw OpenInference OTLP spans do not always populate the tokens field in the Confident UI (issue #2746), and the pydantic-ai integration with OpenAIResponsesModel returns None for tools_called, expected_tools, and actual_output under ConfidentInstrumentationSettings (issue #2508). A related enhancement request to track cached input tokens in LlmSpan cost calculation is tracked in issue #2741.

The skills/README.md ships agent "Skills" (a Claude.ai/Cursor concept) that teach coding agents how to add evals, generate datasets, enable tracing, and iterate on failures. The three published skills are deepeval (core eval workflow), deepeval-otel (raw OpenTelemetry export with no Python dependency), and deepeval-tracing (native @observe-based instrumentation).

TypeScript SDK Surface

The TypeScript SDK lives under typescript/ and is published as the deepeval package (see typescript/package.json). Subpath exports include deepeval/dataset, deepeval/testCase, deepeval/tracing, deepeval/confident, deepeval/openai, and deepeval/integrations/ai-sdk. The typescript/README.md explains that the initial release focuses on the Confident AI API surface — dataset push/pull, evaluation reporting, prompt CRUD — while local execution features (LLM-as-a-judge metrics, NLP models, fully local eval) remain Python-only. The stated milestone is 80% parity on the Confident AI surface by the end of July, including shared prompt templates consumed by both languages. The tracing layer is implemented in typescript/src/tracing/tracing.ts, and prompt CRUD is in typescript/src/prompt/index.ts.

Known Limitations and Community Context

OTLP token mapping: Raw OpenInference llm.token_count.* attributes do not surface as the tokens field in the Confident UI (#2746).
Pydantic AI instrumentation: actual_output, tools_called, and expected_tools are None for OpenAIResponsesModel (#2508).
Cached input tokens: LlmSpan exposes only input_token_count, output_token_count, cost_per_input_token, cost_per_output_token; cached-token reporting is not yet supported (#2741).
Contextual Precision overlap: Near-duplicate overlapping retrieval chunks are penalized as distinct low-quality retrievals, under-reporting retrieval quality for chunked-document RAG (#2594).
CLI verbosity: No --only-failed option for deepeval test run (#1235).
Security reporting: The repository has no SECURITY.md and the GitHub Private Vulnerability Reporting endpoint returns 404 (#2744).
TypeScript scope: Local metrics and NLP models remain Python-only while the TS SDK reaches parity (#2734).

Tracing, Observability and Framework Integrations

Related topics: DeepEval Overview and Core Architecture, Evaluation Engine, Metrics and Synthetic Data

Section Related Pages

Continue reading this section for the full explanation and source context.

Tracing, Observability and Framework Integrations

Overview and Scope

DeepEval provides a unified tracing and observability layer that records LLM-application execution as structured traces and ships them to the Confident AI platform. The system supports two transports — a native REST path (api.confident-ai.com/v1/traces) and an OpenTelemetry OTLP path (otel.confident-ai.com/v1/traces) — so teams can pick the route that fits their stack. Framework integrations such as LangChain, LangGraph, Pydantic AI, CrewAI, Anthropic, LlamaIndex, and AWS AgentCore all plug into this layer using one of four documented mechanisms (deepeval/integrations/README.md, README.md).

Each integration is graded against four capability axes. Bare means calling the framework directly (no enclosing decorator) still produces a trace. @observe / with trace(...) means wrapping a call merges framework spans into the native trace context, enabling update_current_trace(...) and update_current_span(...) anywhere in the call stack with a single REST POST per trace and no UUID reconciliation. evals_iterator works inside dataset.evals_iterator(...) both end-to-end and component-level. deepeval test run works under the pytest tracing-eval entry point. For OTel-mode integrations, ContextAwareSpanProcessor flips automatically to REST routing when trace_manager.is_evaluating is True (deepeval/integrations/README.md).

Tracing Architecture

The native trace model defines typed spans (AGENT, LLM, TOOL, RETRIEVER, CHAIN, CUSTOM) plus a trace root that aggregates them. Each span carries metadata, tags, timing, status (SUCCESS / ERROR), and optional I/O. The TypeScript SDK exposes this surface through traceManager, getCurrentSpan, getCurrentTrace, the observe decorator, and the span mutators updateCurrentSpan, updateLlmSpan, and updateRetrieverSpan (typescript/src/tracing/index.ts). The conversion layer in convertSpanToApiSpan normalizes Date objects to ISO strings, coerces missing outputs (defaulting to "" or [] for retriever spans), and picks endTime from startTime when not provided or earlier than the start (typescript/src/tracing/tracing.ts).

For community-maintained OpenInference instrumentors, deepeval/integrations/openinference/ registers a TracerProvider and an OpenInferenceSpanInterceptor that translates semantic-convention attributes — openinference.span.kind, llm.input_messages.{idx}, llm.output_messages.{idx}, tool.name, llm.token_count.* — into the internal BaseSpan / LlmSpan / ToolSpan representation and routes them through ContextAwareSpanProcessor (deepeval/integrations/README.md). The TypeScript equivalent maintains an OI_KIND_TO_SPAN_TYPE mapping (AGENT → SpanType.AGENT, CHAIN → SpanType.AGENT, LLM → SpanType.LLM, TOOL → SpanType.TOOL, RETRIEVER → SpanType.RETRIEVER) and walks flattened llm.input_messages.{i}.message.content keys to reconstruct input/output text (typescript/src/integrations/openinference/processor.ts).

Framework Integrations and Model Wrappers

The TypeScript SDK ships first-class wrappers for popular providers. The OpenAI integration extracts token usage and tool calls from both the Chat Completions API and the Responses API. extractOutputParametersFromCompletionAPI parses usage.prompt_tokens, usage.completion_tokens, and choices[0].message.tool_calls into the internal OutputParameters shape, building a toolsCalled array from either function or custom tool calls and resolving human-readable descriptions via inputParameters.toolDescriptions (typescript/src/openai/extractor.ts). The companion type definitions declare these fields explicitly: output, promptTokens, completionTokens, and toolsCalled? (typescript/src/openai/types.ts).

For non-OpenAI providers, the TypeScript SDK defines DeepEvalBaseLLM as the abstract base class with a single contract: generate(prompt, schema?) returns { output, cost }, where cost is the USD price computed from costPerInputToken and costPerOutputToken (typescript/src/models/base-model.ts). OpenAI-compatible providers extend DeepEvalOpenAICompatibleModel, a shared base that handles client construction, JSON-mode structured output via toJsonSchema, and token→cost computation (typescript/src/models/openai-compatible-model.ts). The Python side implements the same pattern in gateway_model.py: provider subclasses expose PROVIDER_SLUG and PROVIDER_LABEL, implement _generate / _a_generate, and let a centralized create_retry_decorator(PROVIDER_SLUG) wrap the public generate / a_generate so retries are consistent across every gateway (deepeval/models/llms/gateway_model.py). Pricing metadata for each model preset lives in constants.py (make_model_data(...) declares supports_log_probs, supports_multimodal, supports_structured_outputs, supports_json, input_price, output_price) (deepeval/models/llms/constants.py).

Capability	Python	TypeScript (v0.1.28)
Push/pull datasets	Yes	Yes
Trace LLM apps	Yes	Yes
Run evaluations via Confident AI	Yes	Yes
Read/write prompts and versions	Yes	Yes
LLM-as-a-judge metrics	Yes	Not yet (roadmap)
Local NLP models	Yes	Not yet (roadmap)

Source: typescript/README.md, typescript/package.json

Known Issues and Limitations

The community has surfaced several limitations in the tracing and integration layer:

Tokens blank over OTLP — Issue #2746 reports that llm.token_count.* attributes sent via OTLP to otel.confident-ai.com are not surfaced as the tokens field in the trace UI. The OpenInference processor does translate the attribute, but the rendering path requires the corresponding Confident span attributes to be populated by the receiving endpoint (typescript/src/integrations/openinference/processor.ts).
Pydantic AI + OpenAIResponsesModel — Issue #2508 notes that tools_called, expected_tools, and actual_output are all None under ConfidentInstrumentationSettings. The Chat Completions path parses tool_calls correctly, but the Responses-API path (extractOutputParametersFromResponseAPI) is handled separately and may not surface tool data identically (typescript/src/openai/extractor.ts).
Cached input tokens — Issue #2741 requests a cached_input_token_count field (and matching price). The current span model only supports input_token_count, output_token_count, cost_per_input_token, and cost_per_output_token.
TypeScript parity — Issue #2734 tracks the broader TypeScript roadmap. The current typescript/ package is a Confident AI client wrapper that explicitly defers local execution features to Python until 80% feature parity is reached (typescript/README.md).

Evaluation Engine, Metrics and Synthetic Data

Related topics: DeepEval Overview and Core Architecture, CLI, Tooling, Extensibility and TypeScript

Section Related Pages

Continue reading this section for the full explanation and source context.

Section Contract for "judge" models

Continue reading this section for the full explanation and source context.

Section Trace-based and end-to-end execution

Continue reading this section for the full explanation and source context.

Section Cost and token accounting

Continue reading this section for the full explanation and source context.

Evaluation Engine, Metrics and Synthetic Data

Overview

DeepEval's evaluation stack is built around three cooperating layers: an evaluation engine that drives end-to-end, trace-based, and component-level test runs; a metrics catalogue covering RAG, multi-turn, agentic, and tool-use quality dimensions; and a synthetic data pipeline that bootstraps goldens from documents, contexts, scratch, or pre-existing datasets. The Python package is the canonical implementation; the TypeScript SDK (typescript/README.md) currently focuses on the Confident AI surface area (datasets, prompts, evaluation reporting) while LLM-as-a-judge metrics remain Python-only.

flowchart LR
    A[Goldens / Datasets] --> B[Evaluation Engine]
    C[Live Traces] --> B
    D[Metrics Catalogue] --> B
    B --> E[Console Report]
    B --> F[Confident API]
    G[CLI: deepeval generate] --> A
    H[Model Registry + Cost Data] --> D

Evaluation Engine

Contract for "judge" models

Every judge model that participates in metric scoring inherits from DeepEvalBaseLLM. The GatewayModel base class in deepeval/models/llms/gateway_model.py standardizes two responsibilities: a unified retry policy and a (output, cost) contract.

Subclasses declare PROVIDER_SLUG and PROVIDER_LABEL; __init__ builds a per-instance retry decorator via create_retry_decorator(self.PROVIDER_SLUG).
_run / _arun wrap provider implementations so retries are consistent across gateways.
Subclasses implement load_model plus _generate / _a_generate, returning a (output, cost) tuple; public generate / a_generate are the wrapped, retry-aware entry points.

The TypeScript counterpart in typescript/src/models/base-model.ts exposes the same shape:

interface GenerationResult<T = string> { output: T; cost: number | null; }
abstract generate<T = string>(prompt: string, schema?: ZodType<T>): Promise<GenerationResult<T>>;

Trace-based and end-to-end execution

Spans produced by the integrations layer are serialized through convertSpanToApiSpan in typescript/src/tracing/tracing.ts, which normalizes input / output, converts Date fields to ISO strings, and groups spans by SpanType (AGENT, LLM, RETRIEVER, TOOL) into the TraceApi payload defined in typescript/src/tracing/api.ts. The same payload carries per-span metrics, metricCollection, llmTestCase, toolsCalled, and expectedTools so that trace-level evaluation can be replayed server-side without re-running the application.

The integration matrix in deepeval/integrations/README.md describes how four execution modes interact with the engine:

Bare (no enclosing trace): each integration auto-creates a trace on first activity.
@observe / with trace(...): spans flow into the native trace context so update_current_trace(...) / update_current_span(...) work anywhere in the call stack.
evals_iterator: runs inside dataset.evals_iterator(...) either end-to-end (metrics=[...]) or component-level (@observe(metrics=[...])).
deepeval test run: pytest entry point with @assert_test, @generate_trace_json, and @assert_* decorators.

Cost and token accounting

Pricing and capability data for individual judge models live in deepeval/models/llms/constants.py, keyed by model name. Each entry records supports_log_probs, max_log_probs, supports_multimodal, supports_structured_outputs, supports_json, supports_temperature, and input_price / output_price per token. The registry covers OpenAI families (gpt-4o, gpt-4.1, gpt-5.x, o1), Anthropic (claude-3-opus-20240229 and newer), Ollama-tagged models (phi3, llava, gemma2/3, qwen2.5, deepseek-r1), and others. The judge layer in gateway_model.py uses this data to translate (input_tokens, output_tokens, cost_per_input_token, cost_per_output_token) into the cost field returned from generate.

Metrics Catalogue

The metrics surface advertised in README.md groups into several families:

Family	Representative metrics	Purpose
RAG	Answer Relevancy, Faithfulness, Contextual Recall, Contextual Precision, Contextual Relevancy, RAGAS	Quality of retrieval and grounded generation
Multi-Turn	Knowledge Retention, …	Conversational memory and consistency
Tool Use	Tool Correctness, Argument Correctness	Quality of agent tool calls and arguments
Agentic	Task Completion, plus trace-derived scores	End-to-end task success over a trace

Community evidence (release notes for v3.9.9, v4.0.2, and v4.0.5) indicates that the agentic family in particular is now trace-driven, with Task Completion judging whether the agent *actually completes the intended task* rather than merely producing a plausible final answer.

A documented limitation: ContextualPrecisionMetric can over-penalize near-duplicate overlapping chunks in financial-document RAG pipelines (see issue #2594). When chunking strategies rely on overlap to preserve context around table/section boundaries, retrieval quality may be under-reported.

Synthetic Data Generation

The CLI command in deepeval/cli/generate/command.py is the entry point for bootstrapping goldens. It dispatches on two orthogonal axes:

Variation: GoldenVariation.SINGLE_TURN vs. conversational.
Method: GenerationMethod.DOCS, GenerationMethod.CONTEXTS, GenerationMethod.SCRATCH, or GenerationMethod.GOLDENS.

if method == GenerationMethod.DOCS:
    synthesizer.generate_goldens_from_docs(
        document_paths=document_paths,
        include_expected_output=include_expected,
        max_goldens_per_context=max_goldens_per_context,
        context_construction_config=context_construction_config,
    )

For DOCS, callers tune min_contexts_per_document, chunk_size, chunk_overlap, context_quality_threshold, context_similarity_threshold, and max_retries. The conversational branch (generate_conversational_goldens_from_docs / _from_contexts / _from_scratch) reuses the same configuration but emits multi-turn goldens suitable for the multi-turn metric family. A long-standing CLI ergonomics request (issue #1235) is to display only failed tests in the runner; until shipped, pre-push hooks need to post-process the full output.

Known Limitations and Community Issues

Token / cost visibility: LlmSpan only tracks input_token_count, output_token_count, cost_per_input_token, cost_per_output_token; cached input tokens are not yet representable (issue #2741). The same gap is reported on the OpenTelemetry / OpenInference ingestion path, where the trace UI shows blank tokens despite valid llm.token_count.* attributes (issue #2746).
pydantic-ai instrumentation: When ConfidentInstrumentationSettings is used with pydantic-ai's OpenAIResponsesModel, tools_called, expected_tools, and actual_output come through as None (issue #2508).
TypeScript surface area: The /typescript package is currently a Confident AI client wrapper; LLM-as-a-judge metrics and local evaluation remain Python-only (issue #2734). The Confident AI platform surface (datasets, prompts, evaluation reporting) is functional, with shared prompt templates planned (typescript/README.md).
Security reporting: There is no SECURITY.md and GitHub's private vulnerability reporting endpoint returns 404 for the repo (issue #2744). Reporters should follow the maintainers' guidance until a disclosure process is published.

CLI, Tooling, Extensibility and TypeScript

Related topics: DeepEval Overview and Core Architecture, Evaluation Engine, Metrics and Synthetic Data

Section Related Pages

Continue reading this section for the full explanation and source context.

CLI, Tooling, Extensibility and TypeScript

This page documents the user-facing tooling layer of DeepEval: its Python CLI, the supporting TUI for trace inspection, the TypeScript SDK entry points, and the provider-pluggable model architecture that drives extensibility across both language ecosystems. The material is anchored in the source files listed above and the active community discussions that surfaced during the v4.0 release cycle.

CLI and Operator Tooling

DeepEval is operated through a Python entry point that exposes a top-level command line, an HTTP inspection server, and a TUI for navigating traces captured from instrumented LLM applications. The integration catalog in README.md lists one-line integrations with LangChain, LangGraph, Pydantic AI, CrewAI, Anthropic, AWS AgentCore, and LlamaIndex, indicating that the CLI is the surface where these integrations are usually registered or invoked.

The TUI for trace inspection is described in the DeepEval 4.0 release notes as a day-one feature aimed at "rapid debugging" of agent traces. Community request #1235 further asks for a CLI option to display only failed tests, motivated by running DeepEval inside a Git pre-push hook. This is consistent with the design intent of the CLI being scriptable in CI contexts where terse, failure-only output is preferred.

TypeScript SDK Overview

The TypeScript package is published from the typescript/ directory and is configured for multiple sub-entry points to keep surface areas independent. Per typescript/package.json, the package exposes:

Sub-entry	Purpose
`deepeval/dataset`	Dataset and `Golden` objects
`deepeval/testCase`	Test case construction
`deepeval/tracing`	Trace collection and OTLP export
`deepeval/confident`	Confident AI platform API client
`deepeval/openai`	OpenAI wrapper utilities
`deepeval/integrations/*`	Framework integrations (AI SDK, LangChain, OpenAI Agents)

The TypeScript SDK targets JavaScript and TypeScript teams building on the Confident AI platform. The typescript/README.md explicitly states: "Local execution features, such as LLM-as-a-judge metrics, NLP models, and fully local evaluation, currently remain in the Python package while we expand TypeScript support." The roadmap commits to 80% feature parity on the Confident AI integration surface by the end of July, including shared prompt templates consumed by both Python and TypeScript.

A typical TypeScript flow wires a client against the Confident API, pushes or pulls prompts, and reports evaluation runs. The module surface is documented by typescript/src/confident/index.ts, which re-exports the API helpers, the shared types, and the top-level evaluate function. The evaluate export is the user-facing entry point for reporting results back to the platform.

Prompt Management in TypeScript

Prompt management is one of the more mature TypeScript surfaces. The Prompt class in typescript/src/prompt/index.ts provides an object-oriented wrapper around the Confident AI prompt endpoints, supporting text and messages payloads (mutually exclusive), interpolationType, modelSettings, outputType, outputSchema, and tools. Tool definitions are normalized via ToolDataSchema and NormalizedToolDataSchema defined in typescript/src/prompt/types.ts, with ToolMode (auto / required / none) controlling tool selection semantics.

The push method explicitly refuses to send both text and messages simultaneously, throwing a TypeError to prevent ambiguous prompt payloads on the server. Pull options support version, label, hash, and branch selectors, mirroring the versioned commit graph used in the Python package. This is the foundation for the "shared prompt templates" feature mentioned in the TypeScript roadmap.

Provider Model Architecture and Extensibility

Extensibility on the Python side is driven by a unified DeepEvalBaseLLM contract and a gateway_model.py retry-and-cost wrapper documented in deepeval/models/llms/gateway_model.py. Subclasses set PROVIDER_SLUG, PROVIDER_LABEL, and per-token cost fields, then implement load_model, _generate, and _a_generate. The base class builds a per-instance retry decorator from the provider slug, so retries honor runtime configuration without re-instantiating.

Model capability metadata is centralized in deepeval/models/llms/constants.py, where each model preset declares supports_log_probs, supports_multimodal, supports_structured_outputs, supports_json, and pricing fields. This registry is the single source of truth referenced when new model IDs are added (e.g., the claude-opus-4-8 preset in release v4.0.5).

The TypeScript equivalent is mirrored in typescript/src/models/base-model.ts, which defines the DeepEvalBaseLLM abstract class with generate, getModelName, and capability accessors. Concrete providers extend this base. typescript/src/models/openai-compatible-model.ts is the shared base for every OpenAI Chat-Completions-speaking provider; subclasses only resolve model name, base URL, env-var API keys, and default headers, then delegate generation to the base. typescript/src/models/providers/anthropic-model.ts and typescript/src/models/providers/ollama-model.ts demonstrate the pattern: each is a thin class that defers to the base for token→cost computation and structured-output handling. Importantly, both providers leave temperature unset unless explicitly provided, because reasoning models reject the parameter.

flowchart LR
  A[DeepEvalBaseLLM] --> B[OpenAICompatibleModel]
  A --> C[AnthropicModel]
  A --> D[OllamaModel]
  B --> E[Gateway / Provider]
  C --> F[Anthropic API]
  D --> G[Local Ollama]
  E --> H[Cost + Retry]
  F --> H
  G --> H
  H --> I[GenerationResult]

OpenAI token and tool extraction lives in typescript/src/openai/extractor.ts, with input and output payload shapes declared in typescript/src/openai/types.ts. The extractor splits the Chat Completions and Responses APIs and walks tool_calls to produce a normalized ToolCall[] used by tracing and evaluation.

Tracing, Datasets, and Community-Driven Gaps

The tracing module, surfaced via typescript/src/tracing/tracing.ts, converts in-memory BaseSpan objects into API-shaped BaseApiSpan objects, normalizes Date instances to ISO strings, and collapses undefined inputs/outputs to safe defaults (empty string for LLM spans, empty array for RETRIEVER spans). The Golden class in typescript/src/dataset/golden.ts is the dataset-side counterpart, supporting both actualOutput and expectedOutput, retrieval context, tool call tracking, and Confident AI bookkeeping fields (_datasetRank, _datasetAlias, _datasetId).

Several open issues highlight tooling gaps that this architecture must absorb:

#2746: OTLP-exported LLM spans carry llm.token_count.* attributes but the Confident AI trace UI shows blank tokens, indicating a normalization gap in the OTEL ingestion path.
#2741: LlmSpan cost tracking lacks fields for cached input tokens, even though providers return them.
#2508: The Pydantic AI integration drops actual_output, tools_called, and expected_tools when paired with OpenAIResponsesModel, pointing to a coverage gap in the integration shim.

These gaps sit at the seam between framework integrations and the tracing/dataset modules documented above, and they map directly onto the v4.0 focus on "1-line integrations" and an agent-native evaluation workflow.

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

high Installation risk requires verification

May increase setup, validation, or first-run risk for the user.

high Installation risk requires verification

May increase setup, validation, or first-run risk for the user.

high Security or permission risk requires verification

Developers may expose sensitive permissions or credentials: Security: request for a submitting security vulnerabilities.

high Security or permission risk requires verification

May increase setup, validation, or first-run risk for the user.

Doramagic Pitfall Log

Found 28 structured pitfall item(s), including 4 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

Severity: high
Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/1235

2. Installation risk: Installation risk requires verification

Severity: high
Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2508

3. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Developers should check this security_permissions risk before relying on the project: Security: request for a submitting security vulnerabilities.
User impact: Developers may expose sensitive permissions or credentials: Security: request for a submitting security vulnerabilities.
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Security: request for a submitting security vulnerabilities.. Context: Source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2744

4. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2594

5. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.host_targets | github_repo:676829188 | https://github.com/confident-ai/deepeval

6. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Developers should check this configuration risk before relying on the project: 🎉 New Interfaces, Reduce ETL Code < 50%!
User impact: Upgrade or migration may change expected behavior: 🎉 New Interfaces, Reduce ETL Code < 50%!
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 🎉 New Interfaces, Reduce ETL Code < 50%!. Context: Observed when using python
Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.7.2

7. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Developers should check this configuration risk before relying on the project: 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!
User impact: Upgrade or migration may change expected behavior: 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!. Context: Observed when using python
Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v4.0.2

8. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | github_repo:676829188 | https://github.com/confident-ai/deepeval

9. Runtime risk: Runtime risk requires verification

Severity: medium
Finding: Developers should check this runtime risk before relying on the project: ConfidentInstrumentationSettings with pydantic-ai: tools_called, expected_tools, and actual_output are all None when using OpenAIResponsesModel
User impact: Developers may hit a documented source-backed failure mode: ConfidentInstrumentationSettings with pydantic-ai: tools_called, expected_tools, and actual_output are all None when using OpenAIResponsesModel
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: ConfidentInstrumentationSettings with pydantic-ai: tools_called, expected_tools, and actual_output are all None when using OpenAIResponsesModel. Context: Observed when using python
Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2508

10. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Developers should check this migration risk before relying on the project: 🎉 New Decision Graph Logic for Granular Simulation Control
User impact: Upgrade or migration may change expected behavior: 🎉 New Decision Graph Logic for Granular Simulation Control
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 🎉 New Decision Graph Logic for Granular Simulation Control. Context: Source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v4.0.3

11. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | github_repo:676829188 | https://github.com/confident-ai/deepeval

12. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | github_repo:676829188 | https://github.com/confident-ai/deepeval

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources 12

Count of project-level external discussion links exposed on this manual page.

Use Review before install

Open the linked issues or discussions before treating the pack as ready for your environment.

Community Discussion Evidence

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using deepeval with real data or production workflows.

LLM tokens not displayed when using custom OpenTelemetry / OpenInference - github / github_issue
ConfidentInstrumentationSettings with pydantic-ai: tools_called, expecte - github / github_issue
Security: request for a submitting security vulnerabilities. - github / github_issue
Feature: support cached input tokens in LLM span cost tracking - github / github_issue
CLI improvement: option to display only failed tests - github / github_issue
Contextual Precision over-penalizes overlapping chunks in financial-docu - github / github_issue
DeepEval for Typescript - github / github_issue
Opus 4.8: Day 0 Support - github / github_release
🎉 New Decision Graph Logic for Granular Simulation Control - github / github_release
🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI - github / github_release
🎉 Metrics for AI agents, multi-turn synthetic data generation, and more! - github / github_release
🎉 New Interfaces, Reduce ETL Code < 50%! - github / github_release

Source: Project Pack community evidence and pitfall evidence

deepeval

DeepEval Overview and Core Architecture

Related Pages

DeepEval Overview and Core Architecture

Purpose and Scope

High-Level Architecture

Model Gateway and Provider Coverage

CLI, Test Runs, and Synthetic Data

Tracing, Integrations, and Skills

TypeScript SDK Surface

Known Limitations and Community Context

See Also

Tracing, Observability and Framework Integrations

Related Pages

Tracing, Observability and Framework Integrations

Overview and Scope

Tracing Architecture

Framework Integrations and Model Wrappers

Known Issues and Limitations

See Also

Evaluation Engine, Metrics and Synthetic Data

Related Pages

Evaluation Engine, Metrics and Synthetic Data

Overview

Evaluation Engine

Contract for "judge" models

Trace-based and end-to-end execution

Cost and token accounting

Metrics Catalogue

Synthetic Data Generation

Known Limitations and Community Issues

See Also

CLI, Tooling, Extensibility and TypeScript

Related Pages

CLI, Tooling, Extensibility and TypeScript

CLI and Operator Tooling

TypeScript SDK Overview

Prompt Management in TypeScript

Provider Model Architecture and Extensibility

Tracing, Datasets, and Community-Driven Gaps

See Also

Doramagic Pitfall Log

Doramagic Pitfall Log

1. Installation risk: Installation risk requires verification

2. Installation risk: Installation risk requires verification

3. Security or permission risk: Security or permission risk requires verification

4. Security or permission risk: Security or permission risk requires verification

5. Configuration risk: Configuration risk requires verification

6. Configuration risk: Configuration risk requires verification

7. Configuration risk: Configuration risk requires verification

8. Capability evidence risk: Capability evidence risk requires verification

9. Runtime risk: Runtime risk requires verification

10. Maintenance risk: Maintenance risk requires verification

11. Maintenance risk: Maintenance risk requires verification

12. Security or permission risk: Security or permission risk requires verification

Community Discussion Evidence

Community Discussion Evidence