Doramagic Project Pack · Human Manual
deepeval
The LLM Evaluation Framework
DeepEval Overview and Core Architecture
Related topics: Tracing, Observability and Framework Integrations, Evaluation Engine, Metrics and Synthetic Data
Purpose and Scope
DeepEval is an LLM evaluation framework that combines a Python package, a CLI, a companion TypeScript SDK, and a managed cloud platform (Confident AI) into a single workflow for testing, tracing, and benchmarking LLM applications. As stated in the main README.md, it ships with metrics for RAG, agents, conversational systems, and tool use, plus integrations for popular frameworks such as OpenAI, LangChain, LangGraph, Pydantic AI, CrewAI, Anthropic, AWS AgentCore, and LlamaIndex.
The framework targets three primary audiences:
- Application developers who need unit-test-style evals inside pytest (deepeval test run).
- Agent builders who need tracing, span-level metrics, and the @observe decorator (vibe-coder workflow described in the README).
- Platform users who manage datasets, prompts, and reporting in Confident AI from either Python or TypeScript.
High-Level Architecture
DeepEval is layered so that a stable set of metric and tracing abstractions sits above provider-specific gateways and a CLI/SDK surface. The diagram below shows how user code reaches the platform.
flowchart TB
User[User Code / Agent] --> Py[deepeval Python package]
User --> TS[TypeScript SDK]
Py --> Metric[Metrics: RAG, Agentic, Tool, Multi-turn]
Py --> Trace[Tracing: @observe, spans, OTEL]
Py --> CLI[deepeval CLI: test run, generate, login]
Py --> Model[LLM Gateway: GatewayModel]
Model --> Provider[(Provider: OpenAI, Anthropic, Ollama, AI Gateway)]
Trace --> CA[(Confident AI: datasets, traces, reports)]
CLI --> CA
TS --> CA
Model --> CA
The metric layer, tracing layer, and model gateway are decoupled; a single test run can mix any metric with any model, and trace spans automatically pick up cost and token metadata from the model that produced them. Source: deepeval/models/llms/gateway_model.py and README.md.
Model Gateway and Provider Coverage
The Python package exposes a unified GatewayModel abstraction. Each subclass sets a PROVIDER_SLUG and a human-readable PROVIDER_LABEL, builds a per-instance retry decorator via create_retry_decorator(self.PROVIDER_SLUG), and implements the inherited abstract load_model plus _generate / _a_generate. The public generate / a_generate wrap those implementations with the centralized retry decorator so retries are consistent across all gateways. Source: deepeval/models/llms/gateway_model.py:1-50.
Cost and capability metadata for every supported preset is encoded in deepeval/models/llms/constants.py, where each entry declares supports_log_probs, supports_multimodal, supports_structured_outputs, supports_json, plus input_price and output_price used by the cost calculator. Newer presets such as claude-opus-4-8 are added there with multimodal and structured-output support and updated pricing.
The TypeScript SDK mirrors this design. DeepEvalBaseLLM in typescript/src/models/base-model.ts defines the contract: an abstract generate(prompt, schema) returning { output, cost }, plus supportsMultimodal, supportsStructuredOutputs, and supportsLogProbs capability flags. Concrete implementations include DeepEvalOpenAICompatibleModel (shared base for any OpenAI Chat Completions-compatible provider, see typescript/src/models/openai-compatible-model.ts), AnthropicModel (see typescript/src/models/providers/anthropic-model.ts), and AISDKModel for Vercel AI SDK users (see typescript/src/models/providers/ai-sdk-model.ts).
CLI, Test Runs, and Synthetic Data
The CLI is built on Typer and is the canonical entry point for deepeval test run, deepeval login, and deepeval generate. Shared utilities live in deepeval/cli/utils.py, which defines a UTM-append helper with_utm(url, *, medium, content) and a _CONFIDENT_UTM_HOSTS allow-list of browser-clickable Confident AI properties. Programmatic hosts (api.*, deepeval.*, otel.*) are intentionally excluded so generated links land users on dashboards, not API endpoints. Source: deepeval/cli/utils.py.
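The allow-list behavior can be sketched as follows. This is a minimal illustration, not the code in deepeval/cli/utils.py: only the with_utm(url, *, medium, content) signature comes from the text above, while the host names in CONFIDENT_UTM_HOSTS and the utm_source value are assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allow-list; the real _CONFIDENT_UTM_HOSTS may contain other hosts.
CONFIDENT_UTM_HOSTS = frozenset(
    {"confident-ai.com", "www.confident-ai.com", "app.confident-ai.com"}
)


def with_utm(url: str, *, medium: str, content: str) -> str:
    """Append UTM parameters to browser-facing links only."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in CONFIDENT_UTM_HOSTS:
        # Programmatic hosts such as api.* or otel.* are returned untouched.
        return url
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    # setdefault keeps any UTM value the caller already put on the URL.
    query.setdefault("utm_source", "deepeval")  # assumed source tag
    query.setdefault("utm_medium", medium)
    query.setdefault("utm_content", content)
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Checking the host against an explicit set, rather than a suffix match, is what keeps API and OTLP endpoints free of tracking parameters in this sketch.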
The generate subcommand unifies single-turn and multi-turn golden generation. It dispatches on GenerationMethod (DOCS, CONTEXTS, SCRATCH, GOLDENS) and GoldenVariation (SINGLE_TURN vs. conversational), calling generate_goldens_from_docs, generate_goldens_from_contexts, generate_goldens_from_scratch, or the corresponding generate_conversational_goldens_* variants. Source: deepeval/cli/generate/command.py.
A long-standing community request to add a --only-failed flag for deepeval test run (useful in Git pre-push hooks, see issue #1235) is still open; the current CLI surfaces all test results.
Tracing, Integrations, and Skills
Tracing is provided through the @observe decorator and span types (LlmSpan, ToolSpan, RetrieverSpan, AgentSpan). The update_current_trace(...) / update_current_span(...) helpers work anywhere on the call stack and emit a single REST POST per trace, avoiding the UUID reconciliation overhead of raw OpenTelemetry exporters. The deepeval/integrations/README.md describes a four-row integration matrix (Bare, @observe/with trace(...), evals_iterator, deepeval test run) and explains that ContextAwareSpanProcessor flips OTel-mode integrations to REST routing automatically when trace_manager.is_evaluating is True. Some community-reported gaps remain: the llm.token_count.* attributes emitted via raw OpenInference OTLP spans do not always populate the tokens field in the Confident UI (issue #2746), and the pydantic-ai integration with OpenAIResponsesModel returns None for tools_called, expected_tools, and actual_output under ConfidentInstrumentationSettings (issue #2508). A related enhancement request to track cached input tokens in LlmSpan cost calculation is tracked in issue #2741.
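One way helpers like these can work "anywhere on the call stack" is with context variables. The sketch below is not DeepEval's implementation: Span, finished_traces, and the local observe and update_current_span functions are simplified stand-ins that only reuse the names from the text, and the list append stands in for the single POST per trace.

```python
import contextvars
import functools
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Span:
    name: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


# Holds the span that is active for the current call stack (or async task).
_current_span = contextvars.ContextVar("current_span", default=None)
finished_traces: List[Span] = []  # stand-in for "one POST per trace"


def observe(func: Callable) -> Callable:
    """Stand-in decorator: open a span around func and nest it under its parent."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parent = _current_span.get()
        span = Span(name=func.__name__)
        if parent is not None:
            parent.children.append(span)
        token = _current_span.set(span)
        try:
            return func(*args, **kwargs)
        finally:
            _current_span.reset(token)
            if parent is None:
                # Root span closed: the whole tree is emitted once.
                finished_traces.append(span)

    return wrapper


def update_current_span(**metadata: Any) -> None:
    """Attach metadata to whichever span is active, however deep the caller is."""
    span = _current_span.get()
    if span is not None:
        span.metadata.update(metadata)


def helper() -> None:
    # Not decorated, yet it can still reach the enclosing span.
    update_current_span(retrieved_chunks=3)


@observe
def retrieve(query: str) -> str:
    helper()
    return "context for " + query


@observe
def agent(query: str) -> str:
    return retrieve(query)
```

Calling agent("q") leaves one entry in finished_traces whose child span carries the metadata set by the undecorated helper.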
The skills/README.md ships agent "Skills" (a Claude.ai/Cursor concept) that teach coding agents how to add evals, generate datasets, enable tracing, and iterate on failures. The three published skills are deepeval (core eval workflow), deepeval-otel (raw OpenTelemetry export with no Python dependency), and deepeval-tracing (native @observe-based instrumentation).
TypeScript SDK Surface
The TypeScript SDK lives under typescript/ and is published as the deepeval package (see typescript/package.json). Subpath exports include deepeval/dataset, deepeval/testCase, deepeval/tracing, deepeval/confident, deepeval/openai, and deepeval/integrations/ai-sdk. The typescript/README.md explains that the initial release focuses on the Confident AI API surface — dataset push/pull, evaluation reporting, prompt CRUD — while local execution features (LLM-as-a-judge metrics, NLP models, fully local eval) remain Python-only. The stated milestone is 80% parity on the Confident AI surface by the end of July, including shared prompt templates consumed by both languages. The tracing layer is implemented in typescript/src/tracing/tracing.ts, and prompt CRUD is in typescript/src/prompt/index.ts.
Known Limitations and Community Context
- OTLP token mapping: Raw OpenInference llm.token_count.* attributes do not surface as the tokens field in the Confident UI (#2746).
- Pydantic AI instrumentation: actual_output, tools_called, and expected_tools are None for OpenAIResponsesModel (#2508).
- Cached input tokens: LlmSpan exposes only input_token_count, output_token_count, cost_per_input_token, cost_per_output_token; cached-token reporting is not yet supported (#2741).
- Contextual Precision overlap: Near-duplicate overlapping retrieval chunks are penalized as distinct low-quality retrievals, under-reporting retrieval quality for chunked-document RAG (#2594).
- CLI verbosity: No --only-failed option for deepeval test run (#1235).
- Security reporting: The repository has no SECURITY.md and the GitHub Private Vulnerability Reporting endpoint returns 404 (#2744).
- TypeScript scope: Local metrics and NLP models remain Python-only while the TS SDK reaches parity (#2734).
Source: https://github.com/confident-ai/deepeval / Human Manual
Tracing, Observability and Framework Integrations
Related topics: DeepEval Overview and Core Architecture, Evaluation Engine, Metrics and Synthetic Data
Overview and Scope
DeepEval provides a unified tracing and observability layer that records LLM-application execution as structured traces and ships them to the Confident AI platform. The system supports two transports — a native REST path (api.confident-ai.com/v1/traces) and an OpenTelemetry OTLP path (otel.confident-ai.com/v1/traces) — so teams can pick the route that fits their stack. Framework integrations such as LangChain, LangGraph, Pydantic AI, CrewAI, Anthropic, LlamaIndex, and AWS AgentCore all plug into this layer using one of four documented mechanisms (deepeval/integrations/README.md, README.md).
Each integration is graded against four capability axes. Bare means calling the framework directly (no enclosing decorator) still produces a trace. @observe / with trace(...) means wrapping a call merges framework spans into the native trace context, enabling update_current_trace(...) and update_current_span(...) anywhere in the call stack with a single REST POST per trace and no UUID reconciliation. evals_iterator works inside dataset.evals_iterator(...) both end-to-end and component-level. deepeval test run works under the pytest tracing-eval entry point. For OTel-mode integrations, ContextAwareSpanProcessor flips automatically to REST routing when trace_manager.is_evaluating is True (deepeval/integrations/README.md).
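The routing switch described for OTel-mode integrations reduces to one decision per finished span. The sketch below shows only that decision; RoutingSpanProcessor and its two lists are hypothetical stand-ins, not the ContextAwareSpanProcessor API, and the flag callable plays the role of trace_manager.is_evaluating.

```python
from typing import Any, Callable, Dict, List


class RoutingSpanProcessor:
    """Hypothetical stand-in showing only the routing decision."""

    def __init__(self, is_evaluating: Callable[[], bool]) -> None:
        self._is_evaluating = is_evaluating
        self.rest_spans: List[Dict[str, Any]] = []
        self.otlp_spans: List[Dict[str, Any]] = []

    def on_end(self, span: Dict[str, Any]) -> None:
        # The flag is read per span, so routing flips as soon as an eval starts.
        if self._is_evaluating():
            self.rest_spans.append(span)
        else:
            self.otlp_spans.append(span)


state = {"is_evaluating": False}
processor = RoutingSpanProcessor(lambda: state["is_evaluating"])
processor.on_end({"name": "production-call"})
state["is_evaluating"] = True
processor.on_end({"name": "eval-call"})
```

Reading the flag at span end, rather than at construction time, is the design choice that lets the same instrumented application serve both production tracing and evaluation runs.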
Tracing Architecture
The native trace model defines typed spans (AGENT, LLM, TOOL, RETRIEVER, CHAIN, CUSTOM) plus a trace root that aggregates them. Each span carries metadata, tags, timing, status (SUCCESS / ERROR), and optional I/O. The TypeScript SDK exposes this surface through traceManager, getCurrentSpan, getCurrentTrace, the observe decorator, and the span mutators updateCurrentSpan, updateLlmSpan, and updateRetrieverSpan (typescript/src/tracing/index.ts). The conversion layer in convertSpanToApiSpan normalizes Date objects to ISO strings, coerces missing outputs to a default ("" in general, [] for retriever spans), and falls back to startTime for endTime when endTime is missing or earlier than the start (typescript/src/tracing/tracing.ts).
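The three normalization rules can be restated in a few lines. The sketch below is written in Python for brevity, whereas the source is TypeScript; span_to_api and the dictionary keys are illustrative names, not the actual BaseApiSpan fields.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def span_to_api(
    span_type: str,
    start_time: datetime,
    end_time: Optional[datetime] = None,
    output: Any = None,
) -> Dict[str, Any]:
    """Normalize one span into a JSON-friendly dict (illustrative field names)."""
    # A missing end time, or one before the start, collapses to the start time.
    if end_time is None or end_time < start_time:
        end_time = start_time
    if output is None:
        # Retriever spans default to an empty list, everything else to "".
        output = [] if span_type == "RETRIEVER" else ""
    return {
        "type": span_type,
        "startTime": start_time.isoformat(),
        "endTime": end_time.isoformat(),
        "output": output,
    }


start = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
retriever_span = span_to_api("RETRIEVER", start)
```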
For community-maintained OpenInference instrumentors, deepeval/integrations/openinference/ registers a TracerProvider and an OpenInferenceSpanInterceptor that translates semantic-convention attributes — openinference.span.kind, llm.input_messages.{idx}, llm.output_messages.{idx}, tool.name, llm.token_count.* — into the internal BaseSpan / LlmSpan / ToolSpan representation and routes them through ContextAwareSpanProcessor (deepeval/integrations/README.md). The TypeScript equivalent maintains an OI_KIND_TO_SPAN_TYPE mapping (AGENT → SpanType.AGENT, CHAIN → SpanType.AGENT, LLM → SpanType.LLM, TOOL → SpanType.TOOL, RETRIEVER → SpanType.RETRIEVER) and walks flattened llm.input_messages.{i}.message.content keys to reconstruct input/output text (typescript/src/integrations/openinference/processor.ts).
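The kind mapping and the flattened-key walk can be sketched as below. The mapping values follow the text above, with plain strings standing in for SpanType members; the function names, the regular expression, and the CUSTOM fallback for unknown kinds are assumptions for illustration, and the code is Python although the processor described is TypeScript.

```python
import re
from typing import Dict, List

# Mirrors the mapping described above; strings stand in for SpanType members.
OI_KIND_TO_SPAN_TYPE = {
    "AGENT": "AGENT",
    "CHAIN": "AGENT",
    "LLM": "LLM",
    "TOOL": "TOOL",
    "RETRIEVER": "RETRIEVER",
}

_MESSAGE_KEY = re.compile(r"^llm\.input_messages\.(\d+)\.message\.content$")


def input_messages(attributes: Dict[str, object]) -> List[str]:
    """Rebuild ordered message contents from flattened span attributes."""
    indexed = {}
    for key, value in attributes.items():
        match = _MESSAGE_KEY.match(key)
        if match:
            indexed[int(match.group(1))] = str(value)
    # Sort numerically so index 10 follows index 9 rather than index 1.
    return [indexed[i] for i in sorted(indexed)]


def span_type(attributes: Dict[str, object], default: str = "CUSTOM") -> str:
    """Map the OpenInference span kind to an internal span type."""
    kind = str(attributes.get("openinference.span.kind", "")).upper()
    return OI_KIND_TO_SPAN_TYPE.get(kind, default)  # fallback is an assumption


attrs = {
    "openinference.span.kind": "CHAIN",
    "llm.input_messages.1.message.content": "Hi",
    "llm.input_messages.0.message.content": "You are helpful.",
    "llm.token_count.prompt": 12,
}
```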
Framework Integrations and Model Wrappers
The TypeScript SDK ships first-class wrappers for popular providers. The OpenAI integration extracts token usage and tool calls from both the Chat Completions API and the Responses API. extractOutputParametersFromCompletionAPI parses usage.prompt_tokens, usage.completion_tokens, and choices[0].message.tool_calls into the internal OutputParameters shape, building a toolsCalled array from either function or custom tool calls and resolving human-readable descriptions via inputParameters.toolDescriptions (typescript/src/openai/extractor.ts). The companion type definitions declare these fields explicitly: output, promptTokens, completionTokens, and toolsCalled? (typescript/src/openai/types.ts).
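The Chat Completions branch of this extraction can be sketched against a plain dictionary shaped like the API response. This is a Python illustration rather than the TypeScript extractor: extract_output_parameters and the output keys are hypothetical names, only function-type tool calls are handled, and description lookup via toolDescriptions is omitted.

```python
import json
from typing import Any, Dict, List


def extract_output_parameters(completion: Dict[str, Any]) -> Dict[str, Any]:
    """Pull text, token usage, and tool calls out of a Chat Completions-shaped dict."""
    usage = completion.get("usage") or {}
    message = completion["choices"][0]["message"]
    tools_called: List[Dict[str, Any]] = []
    for call in message.get("tool_calls") or []:
        function = call.get("function") or {}
        raw_args = function.get("arguments") or "{}"
        tools_called.append(
            {
                "name": function.get("name"),
                # The API returns arguments as a JSON-encoded string.
                "input_parameters": json.loads(raw_args),
            }
        )
    return {
        "output": message.get("content") or "",
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "tools_called": tools_called,
    }


completion = {
    "choices": [
        {
            "message": {
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_1",
                        "type": "function",
                        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
                    }
                ],
            }
        }
    ],
    "usage": {"prompt_tokens": 42, "completion_tokens": 7, "total_tokens": 49},
}
params = extract_output_parameters(completion)
```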
For non-OpenAI providers, the TypeScript SDK defines DeepEvalBaseLLM as the abstract base class with a single contract: generate(prompt, schema?) returns { output, cost }, where cost is the USD price computed from costPerInputToken and costPerOutputToken (typescript/src/models/base-model.ts). OpenAI-compatible providers extend DeepEvalOpenAICompatibleModel, a shared base that handles client construction, JSON-mode structured output via toJsonSchema, and token→cost computation (typescript/src/models/openai-compatible-model.ts). The Python side implements the same pattern in gateway_model.py: provider subclasses expose PROVIDER_SLUG and PROVIDER_LABEL, implement _generate / _a_generate, and let a centralized create_retry_decorator(PROVIDER_SLUG) wrap the public generate / a_generate so retries are consistent across every gateway (deepeval/models/llms/gateway_model.py). Pricing metadata for each model preset lives in constants.py (make_model_data(...) declares supports_log_probs, supports_multimodal, supports_structured_outputs, supports_json, input_price, output_price) (deepeval/models/llms/constants.py).
| Capability | Python | TypeScript (v0.1.28) |
|---|---|---|
| Push/pull datasets | Yes | Yes |
| Trace LLM apps | Yes | Yes |
| Run evaluations via Confident AI | Yes | Yes |
| Read/write prompts and versions | Yes | Yes |
| LLM-as-a-judge metrics | Yes | Not yet (roadmap) |
| Local NLP models | Yes | Not yet (roadmap) |
Source: typescript/README.md, typescript/package.json
Known Issues and Limitations
The community has surfaced several limitations in the tracing and integration layer:
- Tokens blank over OTLP: Issue #2746 reports that llm.token_count.* attributes sent via OTLP to otel.confident-ai.com are not surfaced as the tokens field in the trace UI. The OpenInference processor does translate the attribute, but the rendering path requires the corresponding Confident span attributes to be populated by the receiving endpoint (typescript/src/integrations/openinference/processor.ts).
- Pydantic AI + OpenAIResponsesModel: Issue #2508 notes that tools_called, expected_tools, and actual_output are all None under ConfidentInstrumentationSettings. The Chat Completions path parses tool_calls correctly, but the Responses-API path (extractOutputParametersFromResponseAPI) is handled separately and may not surface tool data identically (typescript/src/openai/extractor.ts).
- Cached input tokens: Issue #2741 requests a cached_input_token_count field (and matching price). The current span model only supports input_token_count, output_token_count, cost_per_input_token, and cost_per_output_token.
- TypeScript parity: Issue #2734 tracks the broader TypeScript roadmap. The current typescript/ package is a Confident AI client wrapper that explicitly defers local execution features to Python until 80% feature parity is reached (typescript/README.md).
See Also
- Metrics Library: LLM-as-a-judge, RAG, agentic, and multi-turn metrics
- Datasets and Synthetic Data Generation: Goldens, conversational goldens, and evals_iterator
- Confident AI Platform Integration: Hosted datasets, traces, and reports
- Model Providers and Gateways: Gateway model registry and pricing metadata
Evaluation Engine, Metrics and Synthetic Data
Related topics: DeepEval Overview and Core Architecture, CLI, Tooling, Extensibility and TypeScript
Overview
DeepEval's evaluation stack is built around three cooperating layers: an evaluation engine that drives end-to-end, trace-based, and component-level test runs; a metrics catalogue covering RAG, multi-turn, agentic, and tool-use quality dimensions; and a synthetic data pipeline that bootstraps goldens from documents, contexts, scratch, or pre-existing datasets. The Python package is the canonical implementation; the TypeScript SDK (typescript/README.md) currently focuses on the Confident AI surface area (datasets, prompts, evaluation reporting) while LLM-as-a-judge metrics remain Python-only.
flowchart LR
A[Goldens / Datasets] --> B[Evaluation Engine]
C[Live Traces] --> B
D[Metrics Catalogue] --> B
B --> E[Console Report]
B --> F[Confident API]
G[CLI: deepeval generate] --> A
H[Model Registry + Cost Data] --> D
Evaluation Engine
Contract for "judge" models
Every judge model that participates in metric scoring inherits from DeepEvalBaseLLM. The GatewayModel base class in deepeval/models/llms/gateway_model.py standardizes two responsibilities: a unified retry policy and an (output, cost) contract.
- Subclasses declare PROVIDER_SLUG and PROVIDER_LABEL; __init__ builds a per-instance retry decorator via create_retry_decorator(self.PROVIDER_SLUG).
- _run / _arun wrap provider implementations so retries are consistent across gateways.
- Subclasses implement load_model plus _generate / _a_generate, returning an (output, cost) tuple; public generate / a_generate are the wrapped, retry-aware entry points.
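The pattern in the list above can be sketched without importing DeepEval. Everything here is a simplified stand-in: the local create_retry_decorator retries a fixed number of times with no backoff, GatewaySketch and FlakyProvider are hypothetical classes, and load_model and the async path are left out.

```python
import functools
from abc import ABC, abstractmethod
from typing import Callable, Tuple


def create_retry_decorator(provider_slug: str, max_attempts: int = 3) -> Callable:
    """Stand-in policy: retry a call up to max_attempts times, without backoff."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except ConnectionError as error:  # only transient errors are retried
                    last_error = error
            raise RuntimeError(f"{provider_slug}: retries exhausted") from last_error

        return wrapper

    return decorator


class GatewaySketch(ABC):
    PROVIDER_SLUG = "base"
    PROVIDER_LABEL = "Base"

    def __init__(self) -> None:
        # Built per instance, keyed by the subclass's provider slug.
        self._retry = create_retry_decorator(self.PROVIDER_SLUG)

    @abstractmethod
    def _generate(self, prompt: str) -> Tuple[str, float]:
        """Provider-specific call returning (output, cost)."""

    def generate(self, prompt: str) -> Tuple[str, float]:
        # The public entry point is the retry-wrapped private one.
        return self._retry(self._generate)(prompt)


class FlakyProvider(GatewaySketch):
    PROVIDER_SLUG = "flaky"
    PROVIDER_LABEL = "Flaky Demo Provider"

    def __init__(self) -> None:
        super().__init__()
        self.calls = 0

    def _generate(self, prompt: str) -> Tuple[str, float]:
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("transient failure")
        return prompt.upper(), 0.0


model = FlakyProvider()
result = model.generate("hi")
```

Because the retry logic lives in the base class, a new provider in this sketch only has to supply a slug, a label, and _generate.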
The TypeScript counterpart in typescript/src/models/base-model.ts exposes the same shape:
interface GenerationResult<T = string> { output: T; cost: number | null; }
abstract generate<T = string>(prompt: string, schema?: ZodType<T>): Promise<GenerationResult<T>>;
Trace-based and end-to-end execution
Spans produced by the integrations layer are serialized through convertSpanToApiSpan in typescript/src/tracing/tracing.ts, which normalizes input / output, converts Date fields to ISO strings, and groups spans by SpanType (AGENT, LLM, RETRIEVER, TOOL) into the TraceApi payload defined in typescript/src/tracing/api.ts. The same payload carries per-span metrics, metricCollection, llmTestCase, toolsCalled, and expectedTools so that trace-level evaluation can be replayed server-side without re-running the application.
The integration matrix in deepeval/integrations/README.md describes how four execution modes interact with the engine:
- Bare (no enclosing trace): each integration auto-creates a trace on first activity.
- @observe / with trace(...): spans flow into the native trace context so update_current_trace(...) / update_current_span(...) work anywhere in the call stack.
- evals_iterator: runs inside dataset.evals_iterator(...) either end-to-end (metrics=[...]) or component-level (@observe(metrics=[...])).
- deepeval test run: pytest entry point with @assert_test, @generate_trace_json, and @assert_* decorators.
Cost and token accounting
Pricing and capability data for individual judge models live in deepeval/models/llms/constants.py, keyed by model name. Each entry records supports_log_probs, max_log_probs, supports_multimodal, supports_structured_outputs, supports_json, supports_temperature, and input_price / output_price per token. The registry covers OpenAI families (gpt-4o, gpt-4.1, gpt-5.x, o1), Anthropic (claude-3-opus-20240229 and newer), Ollama-tagged models (phi3, llava, gemma2/3, qwen2.5, deepseek-r1), and others. The judge layer in gateway_model.py uses this data to translate (input_tokens, output_tokens, cost_per_input_token, cost_per_output_token) into the cost field returned from generate.
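The token-to-cost translation is a linear combination of the two per-token prices. The sketch below assumes a registry keyed by model name, as described above; MODEL_PRICES, calculate_cost, the demo-model entry, and its prices are invented for illustration and are not values from constants.py.

```python
from typing import Dict, Optional

# Illustrative prices only (USD per token); not taken from constants.py.
MODEL_PRICES: Dict[str, Dict[str, float]] = {
    "demo-model": {"input_price": 2.0e-06, "output_price": 8.0e-06},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Return the USD cost of one call, or None when the model has no price entry."""
    prices = MODEL_PRICES.get(model)
    if prices is None:
        return None
    return input_tokens * prices["input_price"] + output_tokens * prices["output_price"]


# 1000 input tokens and 500 output tokens: 0.002 + 0.004 USD.
cost = calculate_cost("demo-model", 1000, 500)
```

Returning None for an unknown model, rather than 0, keeps "free" and "unpriced" distinguishable, which matches the nullable cost in the TypeScript GenerationResult shape.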
Metrics Catalogue
The metrics surface advertised in README.md groups into several families:
| Family | Representative metrics | Purpose |
|---|---|---|
| RAG | Answer Relevancy, Faithfulness, Contextual Recall, Contextual Precision, Contextual Relevancy, RAGAS | Quality of retrieval and grounded generation |
| Multi-Turn | Knowledge Retention, … | Conversational memory and consistency |
| Tool Use | Tool Correctness, Argument Correctness | Quality of agent tool calls and arguments |
| Agentic | Task Completion, plus trace-derived scores | End-to-end task success over a trace |
Community evidence (release notes for v3.9.9, v4.0.2, and v4.0.5) indicates that the agentic family in particular is now trace-driven, with Task Completion judging whether the agent *actually completes the intended task* rather than merely producing a plausible final answer.
A documented limitation: ContextualPrecisionMetric can over-penalize near-duplicate overlapping chunks in financial-document RAG pipelines (see issue #2594). When chunking strategies rely on overlap to preserve context around table/section boundaries, retrieval quality may be under-reported.
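A small worked example shows how this plays out numerically. It assumes the metric uses the weighted cumulative precision form, in which the precision at each rank holding a relevant node is averaged over the number of relevant nodes, and it assumes the judge marks near-duplicate chunks as not relevant, as the issue describes; contextual_precision is a hypothetical helper, not the ContextualPrecisionMetric class.

```python
from typing import Sequence


def contextual_precision(verdicts: Sequence[bool]) -> float:
    """Weighted cumulative precision over ranked relevance verdicts."""
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    score = 0.0
    relevant_so_far = 0
    for k, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at rank k, counted only where a relevant node sits.
            score += relevant_so_far / k
    return score / relevant_total


# Four retrieved chunks; ranks 2 and 3 overlap heavily with rank 1.
all_relevant = contextual_precision([True, True, True, True])
duplicates_rejected = contextual_precision([True, False, False, True])
```

Under these assumptions the same retrieval scores 1.0 when the overlapping chunks count as relevant and 0.75 when they are rejected, because the rejected duplicates push the last relevant chunk down to rank 4.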
Synthetic Data Generation
The CLI command in deepeval/cli/generate/command.py is the entry point for bootstrapping goldens. It dispatches on two orthogonal axes:
- Variation: GoldenVariation.SINGLE_TURN vs. conversational.
- Method: GenerationMethod.DOCS, GenerationMethod.CONTEXTS, GenerationMethod.SCRATCH, or GenerationMethod.GOLDENS.
if method == GenerationMethod.DOCS:
    synthesizer.generate_goldens_from_docs(
        document_paths=document_paths,
        include_expected_output=include_expected,
        max_goldens_per_context=max_goldens_per_context,
        context_construction_config=context_construction_config,
    )
For DOCS, callers tune min_contexts_per_document, chunk_size, chunk_overlap, context_quality_threshold, context_similarity_threshold, and max_retries. The conversational branch (generate_conversational_goldens_from_docs / _from_contexts / _from_scratch) reuses the same configuration but emits multi-turn goldens suitable for the multi-turn metric family. A long-standing CLI ergonomics request (issue #1235) is to display only failed tests in the runner; until shipped, pre-push hooks need to post-process the full output.
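The two-axis dispatch can be sketched as a lookup that resolves a synthesizer method name. The enums below are local re-declarations for illustration: the member string values and the MULTI_TURN name are assumptions, and GOLDENS is deliberately left unmapped because the text does not name its functions.

```python
from enum import Enum
from typing import Dict


class GoldenVariation(str, Enum):
    SINGLE_TURN = "single-turn"
    MULTI_TURN = "multi-turn"  # hypothetical name for the conversational branch


class GenerationMethod(str, Enum):
    DOCS = "docs"
    CONTEXTS = "contexts"
    SCRATCH = "scratch"
    GOLDENS = "goldens"


_SUFFIX: Dict[GenerationMethod, str] = {
    GenerationMethod.DOCS: "from_docs",
    GenerationMethod.CONTEXTS: "from_contexts",
    GenerationMethod.SCRATCH: "from_scratch",
}


def synthesizer_method_name(variation: GoldenVariation, method: GenerationMethod) -> str:
    """Resolve the synthesizer method for a (variation, method) pair."""
    if method not in _SUFFIX:
        # GOLDENS is left out: the text does not name its functions.
        raise NotImplementedError(f"no mapping sketched for {method.value}")
    prefix = (
        "generate_goldens"
        if variation is GoldenVariation.SINGLE_TURN
        else "generate_conversational_goldens"
    )
    return f"{prefix}_{_SUFFIX[method]}"


single = synthesizer_method_name(GoldenVariation.SINGLE_TURN, GenerationMethod.DOCS)
multi = synthesizer_method_name(GoldenVariation.MULTI_TURN, GenerationMethod.SCRATCH)
```

Keeping the axes independent means a new generation method needs one table entry rather than a new branch per variation.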
Known Limitations and Community Issues
- Token / cost visibility: LlmSpan only tracks input_token_count, output_token_count, cost_per_input_token, cost_per_output_token; cached input tokens are not yet representable (issue #2741). The same gap is reported on the OpenTelemetry / OpenInference ingestion path, where the trace UI shows blank tokens despite valid llm.token_count.* attributes (issue #2746).
- pydantic-ai instrumentation: When ConfidentInstrumentationSettings is used with pydantic-ai's OpenAIResponsesModel, tools_called, expected_tools, and actual_output come through as None (issue #2508).
- TypeScript surface area: The /typescript package is currently a Confident AI client wrapper; LLM-as-a-judge metrics and local evaluation remain Python-only (issue #2734). The Confident AI platform surface (datasets, prompts, evaluation reporting) is functional, with shared prompt templates planned (typescript/README.md).
- Security reporting: There is no SECURITY.md and GitHub's private vulnerability reporting endpoint returns 404 for the repo (issue #2744). Reporters should follow the maintainers' guidance until a disclosure process is published.
See Also
- Tracing and Integrations
- Model Registry and Cost Tracking
- Synthetic Data and Datasets
- Confident AI Platform API
CLI, Tooling, Extensibility and TypeScript
Related topics: DeepEval Overview and Core Architecture, Evaluation Engine, Metrics and Synthetic Data
This page documents the user-facing tooling layer of DeepEval: its Python CLI, the supporting TUI for trace inspection, the TypeScript SDK entry points, and the provider-pluggable model architecture that drives extensibility across both language ecosystems. The material is anchored in the source files listed above and the active community discussions that surfaced during the v4.0 release cycle.
CLI and Operator Tooling
DeepEval is operated through a Python entry point that exposes a top-level command line, an HTTP inspection server, and a TUI for navigating traces captured from instrumented LLM applications. The integration catalog in README.md lists one-line integrations with LangChain, LangGraph, Pydantic AI, CrewAI, Anthropic, AWS AgentCore, and LlamaIndex, indicating that the CLI is the surface where these integrations are usually registered or invoked.
The TUI for trace inspection is described in the DeepEval 4.0 release notes as a day-one feature aimed at "rapid debugging" of agent traces. Community request #1235 further asks for a CLI option to display only failed tests, motivated by running DeepEval inside a Git pre-push hook. This is consistent with the design intent of the CLI being scriptable in CI contexts where terse, failure-only output is preferred.
TypeScript SDK Overview
The TypeScript package is published from the typescript/ directory and is configured for multiple sub-entry points to keep surface areas independent. Per typescript/package.json, the package exposes:
| Sub-entry | Purpose |
|---|---|
| deepeval/dataset | Dataset and Golden objects |
| deepeval/testCase | Test case construction |
| deepeval/tracing | Trace collection and OTLP export |
| deepeval/confident | Confident AI platform API client |
| deepeval/openai | OpenAI wrapper utilities |
| deepeval/integrations/* | Framework integrations (AI SDK, LangChain, OpenAI Agents) |
The TypeScript SDK targets JavaScript and TypeScript teams building on the Confident AI platform. The typescript/README.md explicitly states: "Local execution features, such as LLM-as-a-judge metrics, NLP models, and fully local evaluation, currently remain in the Python package while we expand TypeScript support." The roadmap commits to 80% feature parity on the Confident AI integration surface by the end of July, including shared prompt templates consumed by both Python and TypeScript.
A typical TypeScript flow wires a client against the Confident API, pushes or pulls prompts, and reports evaluation runs. The module surface is documented by typescript/src/confident/index.ts, which re-exports the API helpers, the shared types, and the top-level evaluate function. The evaluate export is the user-facing entry point for reporting results back to the platform.
Prompt Management in TypeScript
Prompt management is one of the more mature TypeScript surfaces. The Prompt class in typescript/src/prompt/index.ts provides an object-oriented wrapper around the Confident AI prompt endpoints, supporting text and messages payloads (mutually exclusive), interpolationType, modelSettings, outputType, outputSchema, and tools. Tool definitions are normalized via ToolDataSchema and NormalizedToolDataSchema defined in typescript/src/prompt/types.ts, with ToolMode (auto / required / none) controlling tool selection semantics.
The push method explicitly refuses to send both text and messages simultaneously, throwing a TypeError to prevent ambiguous prompt payloads on the server. Pull options support version, label, hash, and branch selectors, mirroring the versioned commit graph used in the Python package. This is the foundation for the "shared prompt templates" feature mentioned in the TypeScript roadmap.
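The mutual-exclusivity guard is simple to state in code. The sketch below is a Python illustration of the rule only, not the TypeScript Prompt.push method; build_push_payload and the payload keys are hypothetical.

```python
from typing import Any, Dict, List, Optional


def build_push_payload(
    alias: str,
    text: Optional[str] = None,
    messages: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, Any]:
    """Build a prompt payload, refusing text and messages together."""
    if text is not None and messages is not None:
        # Fail before any request is made so the server never sees both forms.
        raise TypeError("Pass either text or messages, not both.")
    payload: Dict[str, Any] = {"alias": alias}
    if text is not None:
        payload["text"] = text
    if messages is not None:
        payload["messages"] = messages
    return payload


payload = build_push_payload("greeting", text="Hello {name}")
```

Raising on the client side keeps an ambiguous payload from ever reaching the prompt endpoint.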
Provider Model Architecture and Extensibility
Extensibility on the Python side is driven by a unified DeepEvalBaseLLM contract and a gateway_model.py retry-and-cost wrapper documented in deepeval/models/llms/gateway_model.py. Subclasses set PROVIDER_SLUG, PROVIDER_LABEL, and per-token cost fields, then implement load_model, _generate, and _a_generate. The base class builds a per-instance retry decorator from the provider slug, so retries honor runtime configuration without re-instantiating.
Model capability metadata is centralized in deepeval/models/llms/constants.py, where each model preset declares supports_log_probs, supports_multimodal, supports_structured_outputs, supports_json, and pricing fields. This registry is the single source of truth referenced when new model IDs are added (e.g., the claude-opus-4-8 preset in release v4.0.5).
The TypeScript equivalent is mirrored in typescript/src/models/base-model.ts, which defines the DeepEvalBaseLLM abstract class with generate, getModelName, and capability accessors. Concrete providers extend this base. typescript/src/models/openai-compatible-model.ts is the shared base for every OpenAI Chat-Completions-speaking provider; subclasses only resolve model name, base URL, env-var API keys, and default headers, then delegate generation to the base. typescript/src/models/providers/anthropic-model.ts and typescript/src/models/providers/ollama-model.ts demonstrate the pattern: each is a thin class that defers to the base for token→cost computation and structured-output handling. Importantly, both providers leave temperature unset unless explicitly provided, because reasoning models reject the parameter.
flowchart LR
A[DeepEvalBaseLLM] --> B[OpenAICompatibleModel]
A --> C[AnthropicModel]
A --> D[OllamaModel]
B --> E[Gateway / Provider]
C --> F[Anthropic API]
D --> G[Local Ollama]
E --> H[Cost + Retry]
F --> H
G --> H
H --> I[GenerationResult]
OpenAI token and tool extraction lives in typescript/src/openai/extractor.ts, with input and output payload shapes declared in typescript/src/openai/types.ts. The extractor splits the Chat Completions and Responses APIs and walks tool_calls to produce a normalized ToolCall[] used by tracing and evaluation.
Tracing, Datasets, and Community-Driven Gaps
The tracing module, surfaced via typescript/src/tracing/tracing.ts, converts in-memory BaseSpan objects into API-shaped BaseApiSpan objects, normalizes Date instances to ISO strings, and collapses undefined inputs/outputs to safe defaults (empty string for LLM spans, empty array for RETRIEVER spans). The Golden class in typescript/src/dataset/golden.ts is the dataset-side counterpart, supporting both actualOutput and expectedOutput, retrieval context, tool call tracking, and Confident AI bookkeeping fields (_datasetRank, _datasetAlias, _datasetId).
Several open issues highlight tooling gaps that this architecture must absorb:
- #2746: OTLP-exported LLM spans carry llm.token_count.* attributes but the Confident AI trace UI shows blank tokens, indicating a normalization gap in the OTEL ingestion path.
- #2741: LlmSpan cost tracking lacks fields for cached input tokens, even though providers return them.
- #2508: The Pydantic AI integration drops actual_output, tools_called, and expected_tools when paired with OpenAIResponsesModel, pointing to a coverage gap in the integration shim.
These gaps sit at the seam between framework integrations and the tracing/dataset modules documented above, and they map directly onto the v4.0 focus on "1-line integrations" and an agent-native evaluation workflow.
See Also
- Tracing and Observability
- Model Providers and Pricing
- Confident AI Platform Integration
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 28 structured pitfall item(s), including 4 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/1235
2. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2508
3. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Developers should check this security_permissions risk before relying on the project: Security: request for a submitting security vulnerabilities.
- User impact: Developers may expose sensitive permissions or credentials: Security: request for a submitting security vulnerabilities.
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Security: request for a submitting security vulnerabilities. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2744
4. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2594
5. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.host_targets | github_repo:676829188 | https://github.com/confident-ai/deepeval
6. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: 🎉 New Interfaces, Reduce ETL Code < 50%!
- User impact: Upgrade or migration may change expected behavior: 🎉 New Interfaces, Reduce ETL Code < 50%!
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 🎉 New Interfaces, Reduce ETL Code < 50%!. Context: Observed when using python
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.7.2
7. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!
- User impact: Upgrade or migration may change expected behavior: 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!. Context: Observed when using python
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v4.0.2
8. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | github_repo:676829188 | https://github.com/confident-ai/deepeval
9. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: ConfidentInstrumentationSettings with pydantic-ai: tools_called, expected_tools, and actual_output are all None when using OpenAIResponsesModel
- User impact: Developers may hit a documented source-backed failure mode: ConfidentInstrumentationSettings with pydantic-ai: tools_called, expected_tools, and actual_output are all None when using OpenAIResponsesModel
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: ConfidentInstrumentationSettings with pydantic-ai: tools_called, expected_tools, and actual_output are all None when using OpenAIResponsesModel. Context: Observed when using python
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2508
10. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Developers should check this migration risk before relying on the project: 🎉 New Decision Graph Logic for Granular Simulation Control
- User impact: Upgrade or migration may change expected behavior: 🎉 New Decision Graph Logic for Granular Simulation Control
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 🎉 New Decision Graph Logic for Granular Simulation Control. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v4.0.3
11. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:676829188 | https://github.com/confident-ai/deepeval
12. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | github_repo:676829188 | https://github.com/confident-ai/deepeval
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using deepeval with real data or production workflows.
- LLM tokens not displayed when using custom OpenTelemetry / OpenInference - github / github_issue
- ConfidentInstrumentationSettings with pydantic-ai: tools_called, expecte - github / github_issue
- Security: request for a submitting security vulnerabilities. - github / github_issue
- Feature: support cached input tokens in LLM span cost tracking - github / github_issue
- CLI improvement: option to display only failed tests - github / github_issue
- Contextual Precision over-penalizes overlapping chunks in financial-docu - github / github_issue
- DeepEval for Typescript - github / github_issue
- Opus 4.8: Day 0 Support - github / github_release
- 🎉 New Decision Graph Logic for Granular Simulation Control - github / github_release
- 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI - github / github_release
- 🎉 Metrics for AI agents, multi-turn synthetic data generation, and more! - github / github_release
- 🎉 New Interfaces, Reduce ETL Code < 50%! - github / github_release
Source: Project Pack community evidence and pitfall evidence