Doramagic Project Pack · Human Manual
evidently
Architecture Overview
Related topics: Core Components, Data Management and Data Flow
Introduction
Evidently is an open-source Python framework designed to evaluate, test, and monitor ML and LLM-powered systems. The architecture follows a modular design pattern that separates concerns between core evaluation logic, SDK APIs, user interface components, and configuration management.
Sources: README.md:1-30
High-Level Architecture
The Evidently platform consists of several interconnected layers:
graph TB
subgraph "User Interface Layer"
UI["UI Components<br/>(React/TypeScript)"]
end
subgraph "SDK Layer"
SDK["Python SDK<br/>(evidently.sdk)"]
API["Cloud Config API"]
end
subgraph "Core Layer"
CORE["Core Modules<br/>(metrics, descriptors, reports)"]
LEGACY["Legacy Module<br/>(evidently.legacy)"]
FUTURE["Future Module<br/>(evidently.future)"]
end
subgraph "Data Layer"
WS["Workspace Abstraction"]
CFG["Config System"]
ADP["Adapters"]
end
UI --> SDK
SDK --> API
SDK --> WS
SDK --> CFG
SDK --> ADP
ADP --> WS
CFG --> WS
Core Module Structure
The main evidently package provides the primary user-facing APIs for ML evaluation:
| Component | Purpose |
|---|---|
Report | Generate evaluation reports combining multiple metrics |
Dataset | Container for evaluation data with descriptors |
DataDefinition | Schema definition for dataset structure |
Metrics | Individual evaluation metrics |
Descriptors | Row-level evaluators (e.g., Sentiment, TextLength) |
Presets | Pre-configured evaluation suites (e.g., TextEvals) |
Sources: README.md:30-55
Key Imports for LLM Evals
from evidently import Report
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals
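Together with the imports above, a minimal end-to-end sketch looks like the following; the dataframe and column names are illustrative, and descriptor signatures follow the definitions later in this manual:
import pandas as pd
df = pd.DataFrame({"answer": ["The order ships within 3 days.", "I cannot help with that."]})
# Wrap the raw data and attach row-level descriptors
dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(),
    descriptors=[Sentiment("answer"), TextLength("answer")],
)
# Run a pre-configured text evaluation preset over the dataset
report = Report([TextEvals()])
snapshot = report.run(dataset, None)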
SDK Architecture
The SDK layer provides programmatic access to remote configurations and cloud features.
CloudConfigAPI
The CloudConfigAPI class manages remote configuration operations:
| Method | Purpose |
|---|---|
create_config() | Create a new remote configuration |
update_config() | Update an existing configuration |
get_config() | Retrieve a configuration by ID |
list_configs() | List all configurations for a project |
delete_config() | Remove a configuration |
create_version() | Create a new version of a configuration |
list_versions() | List all versions of a configuration |
get_version() | Get a specific version |
Sources: src/evidently/sdk/configs.py:1-50
Config Version Management
The configuration system uses a ConfigVersion model to track changes:
classDiagram
class ConfigMetadata {
+str created_at
+str updated_at
+str author
+str description
}
class ConfigVersion {
+str id
+str artifact_id
+int version
+Any content
+ConfigVersionMetadata metadata
}
class ConfigVersionMetadata {
+str created_at
+str updated_at
+str author
+str comment
}
ConfigVersion --> ConfigVersionMetadata
ConfigVersionMetadata --|> ConfigMetadata
Sources: src/evidently/sdk/configs.py:50-150
Adapter Pattern
The SDK uses adapters to convert between internal config models and domain objects:
graph LR
A["ConfigVersion"] -->|Adapter| B["ArtifactVersion"]
A -->|Adapter| C["Descriptor"]
A -->|Adapter| D["Prompt"]| Adapter Class | Source Config | Target Domain Object |
|---|---|---|
ArtifactAdapter | ConfigVersion | ArtifactVersion |
DescriptorAdapter | ConfigVersion | Descriptor |
PromptAdapter | ConfigVersion | Prompt |
Sources: src/evidently/sdk/adapters.py:1-100
Workspace Architecture
The workspace abstraction provides a unified interface for project management:
Abstract Base Class
classDiagram
class Workspace {
<<abstract>>
+create_project(name, description, org_id) Project
+add_project(project, org_id) Project
+get_project(project_id) Optional~Project~
+delete_project(project_id)
+list_projects(org_id) Sequence~Project~
}
Core Workspace Methods
| Method | Parameters | Returns | Description |
|---|---|---|---|
create_project | name: str, description: str, org_id: Optional[OrgID] | Project | Creates and adds a new project |
add_project | project: ProjectModel, org_id: Optional[OrgID] | Project | Adds an existing project model |
get_project | project_id: STR_UUID | Optional[Project] | Retrieves project by UUID |
delete_project | project_id: STR_UUID | None | Removes project from workspace |
list_projects | org_id: Optional[OrgID] | Sequence[Project] | Lists projects with optional filtering |
Sources: src/evidently/ui/workspace.py:1-100
Project Model
Projects are stored using a ProjectModel data structure:
ProjectModel(
name=str,
description=str,
org_id=Optional[OrgID]
)
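A brief sketch of the workspace methods from the table above. The Workspace.create() constructor for a local workspace directory is an assumption, and the project name is illustrative:
from evidently.ui.workspace import Workspace
ws = Workspace.create("./evidently_workspace")                 # local workspace path (assumed constructor)
project = ws.create_project("demo-project", description="Example project")
print([p.name for p in ws.list_projects()])                    # list projects, optionally filtered by org_id
ws.delete_project(project.id)                                  # projects are addressed by UUID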
UI Component Architecture
The frontend is built with React and TypeScript, organized as separate packages:
graph TD
subgraph "ui/packages/evidently-ui-lib"
WC["Widgets"]
CC["Components"]
FP["Forms"]
TB["Tables"]
end
subgraph "Components"
DT["Dashboard"]
TR["Traces"]
PR["Prompts"]
DS["Descriptors"]
end
CC --> DT
CC --> TR
CC --> PR
CC --> DS
CC --> FP
CC --> TB
Component Categories
| Package Path | Purpose |
|---|---|
src/widgets/ | Dashboard widgets and test suite components |
src/components/Dashboard/ | Dashboard-specific UI elements |
src/components/Traces/ | Trace viewing and management |
src/components/Prompts/ | Prompt template management |
src/components/Descriptors/ | Descriptor configuration forms |
src/components/Utils/ | Shared utility components |
Template and Link Generation
The UI layer uses template functions to generate HTML for external links and reports:
| Template | Purpose |
|---|---|
HTML_LINK_WITH_ID_TEMPLATE | Report links with button and ID display |
FILE_LINK_WITH_ID_TEMPLATE | File links with ID metadata |
RUNNING_SERVICE_LINK_TEMPLATE | Service endpoint links |
EVIDENTLY_STYLES_COMMON | Shared CSS styles for links |
Sources: src/evidently/ui/utils.py:1-60
Template Structure
<div class="evidently-links container">
<a target="_blank" href="{button_url}">{button_title}</a>
<p><b>{id_title}:</b> <span>{id}</span></p>
</div>
Data Flow Architecture
graph TB
A["User Code"] --> B["Dataset Creation"]
B --> C["Descriptor Application"]
C --> D["Report Generation"]
D --> E["Workspace Storage"]
F["Cloud API"] --> G["Config Management"]
G --> E
E --> H["UI Display"]API Reference Documentation
The project includes automated API documentation generation:
| Command | Description |
|---|---|
./api-reference/generate.py --local-source-code | Generate docs from local source |
./api-reference/generate.py --git-revision <ref> | Generate docs from git revision |
./api-reference/generate.py --additional-modules | Include extra modules |
Documentation is output to api-reference/dist/ organized by revision.
Sources: api-reference/README.md:1-50
Summary
The Evidently architecture is organized into three main layers:
- Core Layer - Python packages for evaluation logic (evidently, evidently.core)
- SDK Layer - Remote configuration and cloud integration (evidently.sdk)
- UI Layer - React/TypeScript web interface for visualization and management
The modular design allows users to:
- Use the Python SDK directly for programmatic evaluation
- Store and version configurations in the cloud
- Access results through the web UI
- Extend functionality through the descriptor and metric system
Sources: README.md:1-30
Core Components
Related topics: Architecture Overview, Reports and Test Suites, Custom Metrics and Extensibility
Overview
The Core Components form the foundational architecture of the Evidently framework, providing the essential building blocks for evaluation, testing, and monitoring of ML and LLM-powered systems. These components establish the base type system, metric definitions, reporting mechanisms, and registry patterns that enable the framework's extensibility and modularity.
The Core Components encompass several interconnected modules that work together to provide a unified approach to ML evaluation. At the heart of this system is a registration-based architecture that allows metrics, tests, and components to be dynamically discovered, configured, and executed within the Evidently ecosystem.
This architecture follows a plugin-like pattern where individual components can be registered, versioned, and managed through centralized registries. This design enables users to extend the framework's functionality by implementing custom metrics and tests while maintaining compatibility with the existing evaluation pipeline.
Base Types Module
Purpose and Scope
The base types module defines the fundamental data structures and abstractions that underpin all Evidently components. These types establish common interfaces and data models used throughout the framework, ensuring consistency and interoperability between different modules.
The base types primarily focus on defining what data flows through the evaluation pipeline, including data definitions, dataset representations, and result containers. This module serves as the foundation upon which all higher-level abstractions are built.
Key Data Structures
The base types module defines several critical classes and interfaces that form the backbone of the Evidently type system. These structures provide a standardized way to represent input data, evaluation results, and configuration options across all components.
The module establishes the DataDefinition class, which serves as a schema for describing the structure of datasets being evaluated. This includes column definitions, data types, and metadata that describe the characteristics of the data being analyzed. Data definitions enable Evidently to understand the structure of incoming data and apply appropriate transformations and evaluations.
The module also defines base classes for Dataset objects, which encapsulate reference and current data batches used in evaluations. These dataset representations include functionality for data validation, transformation, and slicing based on various criteria.
Metric Types Module
Metric Architecture
The metric types module defines the core abstraction for all metrics within Evidently. Metrics are the primary mechanism for evaluating data quality, model performance, and statistical properties of datasets. The module establishes a class hierarchy that enables both simple single-value metrics and complex multi-dimensional metric calculations.
Metrics in Evidently follow a consistent pattern where each metric is associated with specific columns or data subsets and produces standardized result objects. This design enables metrics to be composed, combined, and analyzed together within reports and dashboards.
Metric Execution Model
The metric execution model defines how metrics are calculated, what inputs they receive, and how results are structured. Each metric implementation receives a data context containing the relevant dataset columns and produces a MetricResult object that encapsulates the calculated values and metadata.
The execution model supports both eager and lazy evaluation strategies. Some metrics compute their results immediately upon execution, while others may defer computation until results are actually needed. This flexibility enables optimizations in scenarios where multiple metrics share intermediate calculations or where memory efficiency is a concern.
Metrics can be configured with parameters that control their behavior, including aggregation methods, thresholds, and visualization options. The parameter system uses type hints and validation to ensure configuration correctness while providing clear documentation through the API reference.
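For instance, parameters are supplied as keyword arguments when a metric is constructed. A sketch using metrics that appear later in this manual; the column name is illustrative:
from evidently import Report
from evidently.metrics import ValueStats, ErrorPercentile
# Each metric validates its own parameters at construction time
report = Report([
    ValueStats(column="target"),        # column-scoped metric
    ErrorPercentile(percentile=95),     # numeric threshold parameter
])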
Report System
Report Architecture
The Report class serves as the primary interface for executing evaluations in Evidently. Reports orchestrate the execution of metrics and tests, collect results, and generate visualizations and summaries. The Report system provides a flexible framework for combining multiple evaluation components into cohesive analysis workflows.
Reports maintain state throughout their lifecycle, tracking which metrics and tests have been executed, their results, and any errors or warnings encountered during execution. This state management enables features like incremental recalculation, where only changed components need to be re-evaluated.
graph TD
A[Create Report] --> B[Add Metrics]
B --> C[Add Tests]
C --> D[Run Report]
D --> E[Collect Results]
E --> F[Generate Visualizations]
F --> G[Export Report]
H[Reference Data] --> D
I[Current Data] --> D
E --> J[Metric Results]
E --> K[Test Results]
Report Configuration
Reports support extensive configuration options that control execution behavior, result aggregation, and output formatting. Configuration options include parallel execution settings, result caching policies, and visualization preferences.
The Report class provides methods for both synchronous and asynchronous execution, enabling integration with various application architectures. Asynchronous execution is particularly useful for long-running evaluations or when reports are generated as part of batch processing workflows.
Report Export
The Report system includes export functionality that generates output in various formats including HTML, JSON, and Python dictionary structures. Export options enable customization of included visualizations, result detail levels, and styling preferences.
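A sketch of the export surface described above; the method names follow the output-format table in the ML Model Evaluation section, and save_html is an assumption:
snapshot = report.run(current_dataset, reference_dataset)
result_dict = snapshot.dict()            # programmatic access to metric results
result_json = snapshot.json()            # serialized form for APIs or CI pipelines
snapshot.save_html("evaluation.html")    # standalone HTML report (assumed helper)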
Testing Framework
Test Definition
The testing module extends Evidently's evaluation capabilities to include pass/fail style assertions. Tests in Evidently are designed to validate specific conditions about data or model behavior, returning boolean results that indicate whether defined criteria are met.
Tests can be defined inline within metrics or as standalone components that check specific conditions. This dual approach provides flexibility in how validation logic is structured and reused across different evaluation scenarios.
Test Execution
Tests execute alongside metrics within the Report framework, allowing unified execution of both evaluation and validation logic. The test execution model mirrors the metric execution model, receiving data contexts and producing structured results that include pass/fail status and diagnostic information.
Test results include detailed failure messages that explain why a test did not pass, helping users understand data quality issues or model behavior problems. These diagnostic messages reference specific rows, values, or statistical properties that contributed to the test failure.
Component Registry System
Registry Architecture
The registry system provides a centralized mechanism for managing and discovering Evidently components. Registries maintain mappings between component identifiers and their implementations, enabling dynamic resolution of components at runtime.
Evidently uses registries extensively for metrics, tests, descriptors, and other extensible components. This architecture allows the framework to support third-party extensions while maintaining a consistent interface for component discovery and execution.
Component Registration
Components are registered with their respective registries using decorators or explicit registration calls. The registration process captures metadata about each component, including its name, version, parameters, and dependencies. This metadata enables automatic documentation generation and configuration interfaces.
Registration supports versioning, allowing multiple versions of a component to coexist and enabling rollback to previous versions when needed. Version management is particularly important for production deployments where stability and reproducibility are critical.
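A generic illustration of decorator-based registration; this is not Evidently's actual registration API, and all names here are hypothetical:
from typing import Callable, Dict, Type
_METRICS: Dict[str, Type] = {}
def register_metric(name: str) -> Callable[[Type], Type]:
    """Hypothetical decorator that records a metric class under a stable name."""
    def wrapper(cls: Type) -> Type:
        _METRICS[name] = cls
        return cls
    return wrapper
@register_metric("my_custom_metric")
class MyCustomMetric:
    pass
assert _METRICS["my_custom_metric"] is MyCustomMetric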
Metrics Registry
The metrics registry extends the component registry pattern specifically for metrics. It provides specialized functionality for metric discovery, parameter validation, and result aggregation.
graph TD
A[Metrics Registry] --> B[Metric Base Class]
A --> C[Metric Parameters]
A --> D[Metric Results]
B --> E[Column Metrics]
B --> F[Dataset Metrics]
B --> G[Statistical Metrics]
D --> H[Visualization Configs]
D --> I[Aggregation Methods]
Metric Tests Registry
The metric tests registry manages test definitions that validate metric results or data conditions. This registry enables tests to reference metrics by name and access their results for validation purposes.
Tests registered in this registry can be automatically discovered and included in reports based on configuration. This discovery mechanism enables declarative specification of validation requirements without requiring explicit test instantiation.
Integration Patterns
SDK Integration
The Core Components integrate with Evidently's SDK layer, which provides higher-level APIs for cloud and remote deployments. The SDK layer uses the Core Components as its foundation while adding capabilities for remote execution, result storage, and collaborative features.
The CloudConfigAPI class demonstrates this integration, providing methods for managing project configurations, descriptor configs, and artifact versions. These high-level APIs delegate core functionality to the Core Components while handling network communication and serialization.
Adapter Pattern
Evidently uses adapters to bridge different execution contexts and storage backends. Adapters transform between internal data structures and external representations, enabling Evidently to work with various data sources and deployment environments.
The adapter pattern is particularly important for cloud deployments where configurations and results are stored remotely. Adapters handle serialization, deserialization, and API communication while maintaining compatibility with the Core Component interfaces.
Configuration Management
Config Versioning
The Core Components support configuration versioning through specialized config classes. Each versioned configuration maintains a history of changes, enabling audit trails and rollback capabilities.
Configurations are organized by project, with separate namespaces for different config types like metrics, tests, and descriptors. This organization enables clear separation of concerns while maintaining relationships between related configurations.
Remote Configuration
For deployments requiring centralized configuration management, Evidently supports remote configuration storage through the CloudConfigAPI. Remote configurations can be fetched, updated, and versioned through the SDK, with changes automatically synchronized to connected clients.
Summary
The Core Components provide the essential infrastructure for Evidently's evaluation capabilities. Through a combination of base types, metric abstractions, reporting mechanisms, and registry systems, these components establish a flexible and extensible framework for ML evaluation.
The modular design enables users to leverage individual components for specific use cases or combine them into comprehensive evaluation pipelines. The registration-based architecture ensures extensibility while maintaining consistency across the framework.
Understanding these core concepts is essential for effectively using Evidently and for extending the framework with custom metrics, tests, and integrations.
Source: https://github.com/evidentlyai/evidently / Human Manual
Data Management and Data Flow
Related topics: Architecture Overview, ML Model Evaluation, LLM Evaluation and Judging
Overview
Data Management and Data Flow in Evidently encompasses the mechanisms by which data is ingested, transformed, stored, and processed throughout the evaluation lifecycle. The system provides a unified abstraction layer that handles multiple data sources (pandas DataFrames, CSV files, Parquet files) and integrates with the broader evaluation pipeline including Reports, Test Suites, and LLM-specific processing like RAG (Retrieval-Augmented Generation) systems.
The architecture separates concerns between core data structures (Dataset, Container), SDK-level abstractions for cloud/local deployment, and specialized LLM data processing components. This modularity allows users to work with familiar Python data structures while benefiting from Evidently's evaluation and monitoring capabilities.
Sources: src/evidently/core/datasets.py:1-50
Core Data Abstraction
The Dataset Class
The Dataset class serves as the primary data container throughout Evidently. It wraps various data sources into a standardized interface that supports:
- Direct creation from pandas DataFrames
- Automatic type inference via DataDefinition
- Descriptor-based feature computation
- Metadata and tagging support
# Basic Dataset creation from pandas
from evidently import Dataset, DataDefinition
dataset = Dataset.from_pandas(
dataframe,
data_definition=DataDefinition(),
metadata={"source": "production"},
tags=["eval", "2024"]
)
Sources: src/evidently/core/datasets.py:30-45
Dataset Factory Methods
| Method | Purpose | Input Types |
|---|---|---|
from_pandas() | Create Dataset from pandas DataFrame | pd.DataFrame |
from_any() | Convert various types to Dataset | pd.DataFrame, Dataset |
as_dataframe() | Extract underlying DataFrame | N/A (output) |
column() | Retrieve specific column as DatasetColumn | column name |
subdataset() | Filter dataset by column value | column name, label |
Sources: src/evidently/core/datasets.py:50-80
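A short sketch of the accessor methods listed above; argument names follow the table, and column and label values are illustrative:
dataset = Dataset.from_pandas(dataframe, data_definition=DataDefinition())
df_back = dataset.as_dataframe()                          # recover the underlying DataFrame
answers = dataset.column("answer")                        # a single column as a DatasetColumn
prod_only = dataset.subdataset("source", "production")    # filter rows by column value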
The from_any() static method implements a factory pattern that handles type conversion automatically:
@staticmethod
def from_any(dataset: PossibleDatasetTypes) -> "Dataset":
if isinstance(dataset, Dataset):
return dataset
if isinstance(dataset, pd.DataFrame):
return Dataset.from_pandas(dataset)
raise ValueError(f"Unsupported dataset type: {type(dataset)}")
Sources: src/evidently/core/datasets.py:60-70
Data Flow Architecture
graph TD
A[Input Data: pd.DataFrame / CSV / Parquet] --> B[Dataset.from_any]
B --> C{Data Type Check}
C -->|pd.DataFrame| D[Dataset.from_pandas]
C -->|Already Dataset| E[Return as-is]
D --> F[DataDefinition Type Inference]
F --> G[Add Descriptors]
G --> H[Dataset Object]
H --> I[Report.run / TestSuite.run]
I --> J[Snapshot with Results]
J --> K[Export: JSON / HTML / Dict]
Container Architecture
The Container class provides the underlying storage and management mechanism for datasets within the Evidently ecosystem. Containers maintain references to data assets and provide CRUD operations for dataset lifecycle management.
Key container responsibilities include:
- Storing dataset references and metadata
- Managing dataset versioning
- Providing query and filtering capabilities
- Handling persistence to storage backends
Sources: src/evidently/core/container.py:1-30
Container-Dataset Relationship
graph LR
A[Container] -->|manages| B[Dataset Registry]
B -->|references| C[Dataset v1]
B -->|references| D[Dataset v2]
B -->|references| E[Dataset vN]
C -->|wraps| F[pd.DataFrame]
D -->|wraps| G[pd.DataFrame]
E -->|wraps| H[pd.DataFrame]
SDK-Level Data Management
The SDK layer provides deployment-agnostic data handling through the CloudConfigAPI and local storage adapters. This abstraction enables consistent data operations whether running locally or connected to Evidently Cloud.
Sources: src/evidently/sdk/datasets.py:1-50
SDK Dataset Operations
| Operation | Method Signature | Description |
|---|---|---|
| Create | add_dataset(project_id, dataset, name, description, link) | Add dataset to project |
| Read | load_dataset(dataset_id) | Retrieve dataset by UUID |
| List | list_datasets(project, origins) | List all datasets in project |
| Update | update_dataset(dataset_id, dataset) | Modify existing dataset |
| Delete | delete_dataset(dataset_id) | Remove dataset from storage |
Sources: src/evidently/ui/workspace.py:50-100
LLM-Specific Data Processing
RAG Data Pipeline
For LLM applications using Retrieval-Augmented Generation, Evidently provides specialized data processing components that handle document ingestion, chunking, and indexing.
Sources: src/evidently/llm/rag/index.py:1-50
graph TD
A[Raw Documents] --> B[RAG Splitter]
B --> C[Document Chunks]
C --> D[RAG Index]
D --> E[Vector Store]
E --> F[Retrieval Query]
F --> G[Context + Query]
G --> H[LLM Response]
Document Splitting
The splitter.py module handles text chunking with configurable parameters:
- Chunk size: Target size for each text segment
- Overlap: Amount of overlap between consecutive chunks
- Separators: Text boundaries for splitting priority
Sources: src/evidently/llm/rag/splitter.py:1-40
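To make the chunking parameters concrete, here is a generic character-based splitter with overlap. This is an illustration only, not Evidently's splitter API:
def split_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list:
    """Split text into fixed-size chunks with overlap between neighbours."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks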
RAG Indexing
The index module manages the vector storage and retrieval:
- Embedding generation for document chunks
- Vector store integration (FAISS, ChromaDB, etc.)
- Similarity search capabilities
- Metadata preservation for filtering
Sources: src/evidently/llm/rag/index.py:50-100
Data Generation for LLM Evals
The datagen/base.py module provides infrastructure for synthetic data generation, enabling users to create evaluation datasets programmatically.
Sources: src/evidently/llm/datagen/base.py:1-50
Data Generation Workflow
graph LR
A[Seed Data / Templates] --> B[LLM Generator]
B --> C[Generated Samples]
C --> D[Quality Validation]
D -->|Pass| E[Evaluation Dataset]
D -->|Fail| F[Regeneration Loop]
F --> B
DataGeneratorBase Class
The base class defines the contract for data generation:
| Method | Purpose |
|---|---|
generate() | Create new data samples |
validate() | Check generated data quality |
expand() | Augment existing datasets |
Sources: src/evidently/llm/datagen/base.py:30-60
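A hypothetical sketch of the contract in the table above; the class and method bodies are illustrative, not the actual base class:
from abc import ABC, abstractmethod
from typing import List
class DataGeneratorBase(ABC):
    """Illustrative contract: generate samples, validate them, expand a seed set."""
    @abstractmethod
    def generate(self, n: int) -> List[dict]: ...
    @abstractmethod
    def validate(self, samples: List[dict]) -> List[dict]: ...
    def expand(self, seed: List[dict], n: int) -> List[dict]:
        # augment an existing dataset with freshly generated, validated samples
        return seed + self.validate(self.generate(n))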
Data Flow in Report Execution
Reports represent the primary consumer of dataset objects within Evidently. The run() method orchestrates the complete data flow from input to output.
def run(
self,
current_data,
reference_data=None,
additional_data=None,
timestamp=None,
metadata=None,
tags=None,
name=None,
) -> Snapshot:
# Data validation
if isinstance(current_data, pd.DataFrame) and current_data.empty:
raise ValueError("current_data must contain at least one column...")
# Convert to Dataset objects
current_dataset = Dataset.from_any(current_data)
reference_dataset = Dataset.from_any(reference_data) if reference_data else None
# Execute metrics and generate snapshot
...
Sources: src/evidently/core/report.py:100-130
Data Validation Rules
| Condition | Error Raised |
|---|---|
| Empty current_data DataFrame | ValueError: current_data must contain at least one column; received an empty DataFrame |
| Empty reference_data DataFrame | ValueError: reference_data must contain at least one column; received an empty DataFrame |
| Unsupported dataset type | ValueError: Unsupported dataset type: {type(dataset)} |
Sources: src/evidently/core/datasets.py:65-70
Workspace Dataset Management
The Workspace class provides high-level dataset management capabilities for organizing evaluations across projects.
class Workspace:
def add_dataset(
self,
project_id: STR_UUID,
dataset: Dataset,
name: str,
description: Optional[str] = None,
link: Optional[SnapshotLink] = None,
) -> DatasetID:
"""Add a dataset to a project."""
return self.datasets.add(project_id=project_id, dataset=dataset, ...)
def load_dataset(self, dataset_id: DatasetID) -> Dataset:
"""Load a dataset by ID."""
return self.datasets.load(dataset_id)
def list_datasets(
self,
project: STR_UUID,
origins: Optional[List[str]] = None,
) -> DatasetList:
"""List all datasets in a project."""
return self.datasets.list(project, origins=origins)
Sources: src/evidently/ui/workspace.py:50-80
Dataset Metadata Schema
| Field | Type | Required | Description |
|---|---|---|---|
name | str | Yes | Human-readable dataset name |
description | str | No | Detailed description |
link | SnapshotLink | No | Associated snapshot reference |
created_at | datetime | Auto | Creation timestamp |
author | str | Auto | Creator identifier |
Summary
Data Management in Evidently follows a consistent pattern across the platform:
- Ingestion: Data enters through Dataset.from_pandas() or Dataset.from_any() factory methods
- Validation: Empty DataFrames and unsupported types are rejected early
- Processing: Descriptors and metrics operate on the standardized Dataset interface
- Output: Results are packaged into Snapshots with multiple export formats
- Persistence: Datasets can be stored and retrieved through Workspace and SDK APIs
The separation between core data structures, SDK abstractions, and specialized LLM components enables flexible deployment while maintaining a simple user-facing API based on pandas DataFrames.
Sources: src/evidently/core/datasets.py:1-100, src/evidently/core/report.py:100-150, src/evidently/sdk/datasets.py:1-50, src/evidently/llm/rag/index.py:1-100, src/evidently/llm/rag/splitter.py:1-50, src/evidently/llm/datagen/base.py:1-60
ML Model Evaluation
Related topics: LLM Evaluation and Judging, Descriptors and Features System, Presets and Metric Presets
Overview
ML Model Evaluation in Evidently is a comprehensive module that enables data scientists and ML engineers to assess, test, and monitor machine learning models throughout their lifecycle—from experiments to production. The evaluation system supports both predictive tasks (classification, regression) and recommendation systems.
The evaluation framework is built around the concept of Metrics and Reports, where metrics are individual evaluation components that can be composed into reports for comprehensive model assessment.
Sources: src/evidently/core/report.py:1-50
Core Architecture
graph TD
A[Dataset] --> B[Report]
A --> C[Test Suite]
B --> D[Interactive Report]
B --> E[JSON/Dict Output]
C --> F[Pass/Fail Results]
G[Metrics] --> B
H[Presets] --> B
G --> C
Supported Evaluation Types
| Evaluation Type | Purpose | Key Metrics |
|---|---|---|
| Classification | Evaluate classifier performance | Accuracy, Precision, Recall, F1, ROC-AUC |
| Regression | Evaluate regression models | MAE, MSE, RMSE, R2 |
| Data Quality | Assess input data integrity | Missing values, duplicates, correlations |
| Data Drift | Detect distribution changes | PSI, KS test, L Infinity |
| Recommendation Systems | Evaluate recsys models | Precision@K, Recall@K, NDCG |
| Embeddings | Evaluate embedding quality | Cosine similarity, drift detection |
Classification Metrics
Purpose and Scope
Classification metrics evaluate the performance of classification models by comparing predicted labels against actual labels. Evidently supports both binary and multiclass classification evaluation scenarios.
Sources: src/evidently/metrics/classification.py:1-100
Available Metrics
| Metric | Description | Applicable Task Types |
|---|---|---|
Accuracy | Proportion of correct predictions | Binary, Multiclass |
Precision | Positive predictive value | Binary, Multiclass |
Recall | Sensitivity, true positive rate | Binary, Multiclass |
F1Score | Harmonic mean of precision and recall | Binary, Multiclass |
RocAuc | Area under the ROC curve | Binary, Multiclass |
ConfusionMatrix | Cross-tabulation of predictions vs actuals | Binary, Multiclass |
PrecisionRecallCurve | Precision-Recall tradeoff visualization | Binary |
ClassRepresentation | Distribution of classes in predictions | Multiclass |
Usage Example
from evidently import Report
from evidently.metrics import Accuracy, Precision, Recall, F1Score, RocAuc
report = Report([
Accuracy(),
Precision(),
Recall(),
F1Score(),
RocAuc()
])
result = report.run(current_data=current_dataset, reference_data=reference_dataset)
Legacy Classification Performance
The legacy module provides comprehensive classification evaluation with detailed visualizations.
Sources: src/evidently/legacy/metrics/classification_performance/__init__.py:1-50
Key components include:
- ClassificationPerformanceMetrics: Core metrics container
- Visualization widgets: Confusion matrix plots, ROC curves, PR curves
- Per-class analysis: Detailed breakdown for multiclass scenarios
Regression Metrics
Purpose and Scope
Regression metrics assess the performance of regression models by computing various error measures and statistical properties of prediction residuals.
Sources: src/evidently/metrics/regression.py:1-100
Available Metrics
| Metric | Description | Unit |
|---|---|---|
MeanError | Average prediction error | Same as target |
MeanAbsoluteError | MAE - average absolute error | Same as target |
MeanSquaredError | MSE - average squared error | Squared target units |
RootMeanSquaredError | RMSE - square root of MSE | Same as target |
ErrorStd | Standard deviation of errors | Same as target |
ErrorNormality | Shapiro-Wilk test for residual normality | p-value |
R2Score | Coefficient of determination | Dimensionless |
ErrorPercentile | Percentile-based error analysis | Same as target |
Usage Example
from evidently import Report
from evidently.metrics import (
MeanAbsoluteError,
MeanSquaredError,
R2Score,
ErrorPercentile
)
report = Report([
MeanAbsoluteError(),
MeanSquaredError(),
R2Score(),
ErrorPercentile(percentile=95)
])
result = report.run(current_data=current_dataset, reference_data=reference_dataset)
Legacy Regression Performance
The legacy regression performance module provides time-series analysis of predictions.
Sources: src/evidently/legacy/metrics/regression_performance/__init__.py:1-50
Features include:
- Predicted vs Actual plots: Time-series visualization
- Residual distribution analysis: Histograms and QQ plots
- Error distribution by feature: Feature-level error breakdown
graph LR
A[Current Data] --> B[RegressionReport]
C[Reference Data] --> B
B --> D[Predicted vs Actual]
B --> E[Error Distribution]
B --> F[Performance Metrics]
Data Quality Metrics
Purpose and Scope
Data quality metrics evaluate the integrity and quality of input data. These metrics are essential for understanding whether model predictions are reliable and for identifying data pipeline issues.
Sources: src/evidently/metrics/data_quality.py:1-100
Available Metrics
| Metric | Description |
|---|---|
ColumnCount | Number of columns in dataset |
RowCount | Number of rows in dataset |
ValueStats | Statistical summary (min, max, mean, std) |
MissingValuesMetric | Count and percentage of missing values |
UniqueValuesCount | Number of unique values per column |
DataInconsistencyMetric | Detection of data inconsistencies |
TextStats | Statistics for text columns (length, word count) |
Usage Example
from evidently import Report, Dataset
from evidently.metrics import ColumnCount, ValueStats, MissingValuesMetric
dataset = Dataset.from_pandas(df, data_definition=DataDefinition())
report = Report([
ColumnCount(),
ValueStats(column="target"),
MissingValuesMetric()
])
result = report.run(dataset, None)
Dataset Statistics
Purpose and Scope
Dataset statistics provide comprehensive descriptive statistics about datasets, enabling quick overview and comparison between reference and current data distributions.
Sources: src/evidently/metrics/dataset_statistics.py:1-100
Available Presets
| Preset | Description |
|---|---|
DataSummaryPreset | Complete dataset overview |
DataDriftPreset | Distribution comparison between reference and current |
TargetDriftPreset | Target variable drift analysis |
NumTargetDriftPreset | Numeric target drift analysis |
CatTargetDriftPreset | Categorical target drift analysis |
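Presets drop into a Report like individual metrics. A sketch, assuming the datasets have been prepared as shown earlier:
from evidently import Report
from evidently.presets import DataDriftPreset, DataSummaryPreset
report = Report([DataSummaryPreset(), DataDriftPreset()])
snapshot = report.run(current_dataset, reference_dataset)   # reference data enables drift comparison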
Statistical Tests for Drift Detection
Sources: src/evidently/legacy/calculations/data_drift.py:1-100
| Test | Applicable Data Type | Description |
|---|---|---|
KS | Numerical | Kolmogorov-Smirnov test |
ZScore | Numerical | Z-score based drift detection |
TTest | Numerical | Two-sample t-test for mean comparison |
ChiSquare | Categorical | Chi-square test for category distribution |
PSI | Both | Population Stability Index |
L Infinity | Both | Max absolute difference |
Data Drift Workflow
graph TD
A[Reference Dataset] --> F[Drift Detection Engine]
B[Current Dataset] --> F
F --> C[Statistical Tests]
F --> D[Distribution Comparison]
C --> E[Drift Report]
D --> E
E --> G{Drift Detected?}
G -->|Yes| H[Alert / Action]
G -->|No| I[Continue Monitoring]
Recommendation System Metrics
Purpose and Scope
Recommendation system (recsys) metrics evaluate the quality of recommendations generated by recommendation models.
Sources: src/evidently/metrics/recsys.py:1-100
Available Metrics
| Metric | Description | K-Parameter |
|---|---|---|
PrecisionAtK | Precision at top-K recommendations | Required |
RecallAtK | Recall at top-K recommendations | Required |
NDCGAtK | Normalized Discounted Cumulative Gain at K | Required |
HitRateAtK | Hit rate at top-K recommendations | Optional |
MRRAtK | Mean Reciprocal Rank at K | Required |
Usage Example
from evidently import Report
from evidently.metrics import PrecisionAtK, RecallAtK, NDCGAtK
report = Report([
PrecisionAtK(k=10),
RecallAtK(k=10),
NDCGAtK(k=10)
])
result = report.run(current_data=recs_dataset, reference_data=ref_recs_dataset)
Embeddings Metrics
Purpose and Scope
Embeddings metrics evaluate the quality and drift of vector embeddings, which are critical for LLM and semantic search applications.
Sources: src/evidently/metrics/embeddings.py:1-100
Available Metrics
| Metric | Description |
|---|---|
Embedding Drift | Detect drift in embedding distributions |
Cosine Similarity | Average cosine similarity between embeddings |
Retrieval Metrics | Evaluate RAG and retrieval system quality |
Key Features
- Semantic Drift Detection: Identify changes in embedding space distribution
- Cluster Analysis: Analyze embedding cluster stability
- Pairwise Comparison: Compare individual embedding pairs
Report Configuration
Creating a Report
from evidently import Report
from evidently.presets import DataDriftPreset
# Basic report with preset
report = Report([DataDriftPreset()])
# Report with custom metadata
report = Report(
metrics=[...],
metadata={"model_version": "v1.2.3"},
tags=["production", "monthly"]
)
# Run with reference data for comparison
snapshot = report.run(current_dataset, reference_dataset)
Report Output Formats
| Format | Method | Use Case |
|---|---|---|
| Interactive | Direct notebook display | Exploration, debugging |
| JSON | .json() | API responses, CI/CD |
| Python Dict | .dict() | Programmatic access |
| HTML | Export functionality | Shareable reports |
Test Suite Integration
Reports can be converted to Test Suites by adding pass/fail conditions, enabling automated regression testing.
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValue
suite = TestSuite([
TestColumnValue(column="accuracy", gt=0.95),
])
suite.run(current_data=dataset, reference_data=reference)
Best Practices
1. Always Use Reference Data for Comparison
For meaningful evaluation, maintain a reference dataset representing expected data distribution or model performance.
2. Choose Appropriate Metrics by Task Type
| Task Type | Recommended Metrics |
|---|---|
| Binary Classification | Accuracy, Precision, Recall, F1, ROC-AUC |
| Multiclass Classification | Per-class F1, Confusion Matrix, Macro/Micro Avg |
| Regression | MAE, RMSE, R2, Error Percentiles |
| Recommendation | Precision@K, Recall@K, NDCG@K |
| Data Quality | Missing values, duplicates, consistency |
3. Set Up Monitoring Schedules
For production models, schedule regular evaluations to detect performance degradation and data drift early.
4. Use Auto-generated Test Conditions
Evidently can auto-generate test thresholds from reference data:
from evidently.presets import DataDriftPreset
suite = TestSuite.from_reference(reference_data, [DataDriftPreset()])
Summary
Evidently's ML Model Evaluation module provides a comprehensive, modular framework for evaluating machine learning models across multiple dimensions:
- Classification: Binary and multiclass classification metrics with visualizations
- Regression: Comprehensive error metrics and residual analysis
- Data Quality: Input data integrity checks
- Data Drift: Distribution comparison and statistical tests
- Recommendation Systems: Top-K recommendation quality metrics
- Embeddings: Semantic drift detection for LLM applications
The system supports both one-off evaluations and continuous monitoring, with seamless integration into CI/CD pipelines through Test Suites.
Sources: src/evidently/core/report.py:1-50
LLM Evaluation and Judging
Related topics: ML Model Evaluation, Descriptors and Features System
Overview
Evidently provides a comprehensive framework for evaluating Large Language Model (LLM) powered systems through specialized descriptors called LLM Judges. These judges leverage LLM APIs to automatically assess and score various quality dimensions of LLM outputs, including relevance, coherence, factual accuracy, and task completion.
LLM Judges serve as evaluators that can be integrated into monitoring pipelines to continuously assess LLM performance without requiring manual human evaluation.
Architecture
graph TD
A[Dataset with LLM Inputs/Outputs] --> B[LLM Judge Descriptors]
B --> C[Prompt Templates]
C --> D[LLM Provider API]
D --> E[Parsed Evaluation Results]
E --> F[Feature Descriptors]
F --> G[Metrics & Tests]
G --> H[Evidently Reports/Monitors]
B1[ContextRelevanceLLMEval] --> B
B2[CompletenessLLMEval] --> B
B3[ContextQualityLLMEval] --> B
B4[GroundednessLLMEval] --> B
B5[RelevanceLLMEval] --> B
C1[System Prompt] --> C
C2[User Query] --> C
C3[Evaluation Criteria] --> C
Core Components
LLM Judge Types
Evidently implements multiple specialized LLM judges for different evaluation dimensions:
| Judge Type | Purpose | Output |
|---|---|---|
ContextRelevanceLLMEval | Measures relevance of retrieved context to the query | Score 0-1, reasoning |
CompletenessLLMEval | Evaluates completeness of response coverage | Score 0-1, reasoning |
ContextQualityLLMEval | Assesses quality of context for answering | Score 0-1, reasoning |
GroundednessLLMEval | Checks if response is grounded in provided context | Score 0-1, reasoning |
RelevanceLLMEval | Evaluates semantic relevance of response to query | Score 0-1, reasoning |
Sources: src/evidently/descriptors/generated_descriptors.py:1-200
Feature Descriptor Pattern
LLM Judges follow the FeatureDescriptor pattern, wrapping LLM evaluation features with optional aliasing and test definitions:
FeatureDescriptor(
feature=feature,
alias=alias, # Custom display name
tests=tests # Optional validation tests
)
Sources: src/evidently/descriptors/generated_descriptors.py:30-35
Configuration Options
Common Parameters
All LLM Judge functions share a common parameter structure:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
column_name | str | Yes | - | Name of the text column to evaluate |
provider | str | No | "openai" | LLM provider (openai, anthropic, etc.) |
model | str | No | "gpt-4o-mini" | Model identifier |
additional_columns | Dict[str, str] | No | None | Extra context columns for evaluation |
include_category | bool | No | None | Include categorical classification in output |
include_score | bool | No | None | Include numeric score in output |
include_reasoning | bool | No | None | Include explanation in output |
uncertainty | Uncertainty | No | None | Uncertainty handling strategy |
alias | str | No | None | Custom name for the feature |
tests | List | No | None | Tests to apply to the feature |
Sources: src/evidently/descriptors/generated_descriptors.py:50-70
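A sketch combining the parameters above for one of the judges listed earlier. The import path assumes the generated_descriptors module referenced in the Sources, and the column names are illustrative:
from evidently.descriptors import GroundednessLLMEval
judge = GroundednessLLMEval(
    "response",                                            # column_name holding the LLM output
    provider="openai",
    model="gpt-4o-mini",
    additional_columns={"context": "retrieved_context"},   # extra context column used by the judge
    include_score=True,
    include_reasoning=True,
    alias="Groundedness",
)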
Uncertainty Handling
The uncertainty parameter enables robust evaluation by detecting when the LLM is uncertain about its assessment:
uncertainty: Optional[Uncertainty] = None
This allows the system to flag low-confidence evaluations rather than returning potentially incorrect scores.
Sources: src/evidently/descriptors/generated_descriptors.py:58
Prompt System
Prompt Templates
Evidently uses structured prompt templates for LLM evaluation:
@dataclasses.dataclass
class LLMJudgePrompts:
system: str
user: str
Prompts are rendered with dynamic variables including:
- Query text
- Response text
- Reference context
- Evaluation criteria
Sources: src/evidently/llm/prompts/__init__.py
Prompt Rendering
The prompt rendering system processes templates with type-safe variable substitution:
def render_prompt(
template: PromptTemplate,
**kwargs: Any
) -> str:
Sources: src/evidently/llm/utils/prompt_render.py
Output Parsing
Response Parsing
Evaluation results are parsed using structured output extraction:
| Parser | Purpose |
|---|---|
parse_json_response() | Extract JSON-structured evaluations |
parse_score() | Extract numeric scores |
parse_reasoning() | Extract explanation text |
parse_category() | Extract categorical labels |
Sources: src/evidently/llm/utils/parsing.py
Score Calculation
Scores are calculated based on the evaluation criteria defined in each judge:
- 0.0: Does not meet criteria
- 0.5: Partially meets criteria
- 1.0: Fully meets criteria
The score can be combined with reasoning and category outputs for comprehensive evaluation reports.
Integration with Evidently Reports
Usage in Reports
LLM Judges can be added to Evidently reports:
from evidently.llm import RelevanceLLMEval
report = Report(metrics=[
RelevanceLLMEval(
column_name="response",
include_score=True,
include_reasoning=True
)
])
Usage in Monitoring
For continuous monitoring, LLM Judges are integrated into dashboard widgets:
from evidently.dashboard import Dashboard
dashboard = Dashboard([
RelevanceLLMEval(column_name="response")
])
Sources: ui/packages/evidently-ui-lib/src/components/Descriptors/Features/LLMJudge/template.tsx
UI Components
LLMJudgeTemplate Component
The frontend provides visualization for LLM judge configurations:
| Property | Type | Description |
|---|---|---|
state | LLMJudgeState | Current judge configuration |
errors | FormErrors | Validation errors |
availableTags | string[] | Available categorization tags |
Sources: ui/packages/evidently-ui-lib/src/components/Descriptors/Features/LLMJudge/template.tsx
Visualization States
The UI renders different states for judge configurations:
- Uncertainty Configuration: Shows uncertainty handling options
- Multiclass Classification: Displays class criteria for classification modes
- Criteria Preview: Shows evaluation criteria as formatted text
- Output Options: Displays configured output fields (reasoning, category, score)
Legacy Components
Legacy LLM Judges
The legacy implementation in evidently.legacy provides backward compatibility:
from evidently.legacy.descriptors.llm_judges import (
CompletenessLLMEval as CompletenessLLMEvalV1
)
Sources: src/evidently/legacy/descriptors/llm_judges.py
Sentiment Descriptor
Specialized descriptor for sentiment analysis:
from evidently.legacy.descriptors.sentiment_descriptor import SentimentDescriptor
Features include:
- Sentiment polarity detection (positive, negative, neutral)
- Confidence scoring
- Integration with reporting pipeline
Sources: src/evidently/legacy/descriptors/sentiment_descriptor.py
Scorers and Optimization
LLM Scorers
Scorers provide optimized evaluation functions:
from evidently.llm.optimization.scorers import LLMScorer
Scorers support:
- Batch evaluation
- Caching of results
- Configurable thresholds
Sources: src/evidently/llm/optimization/scorers.py
Best Practices
1. Choosing Evaluation Dimensions
| Use Case | Recommended Judges |
|---|---|
| RAG Systems | ContextRelevance, Groundedness, Relevance |
| Summarization | Completeness, Relevance |
| Q&A Systems | ContextQuality, Groundedness |
| General Chat | All dimensions |
2. Score Interpretation
- 0.0 - 0.3: Significant issues detected
- 0.4 - 0.6: Partial criteria met
- 0.7 - 1.0: Criteria well satisfied
3. Uncertainty Handling
Enable uncertainty handling when:
- Operating on edge cases
- Using smaller models
- Evaluating ambiguous inputs
Sources: src/evidently/descriptors/generated_descriptors.py:1-200
Descriptors and Features System
Related topics: LLM Evaluation and Judging, ML Model Evaluation
Descriptors and Features System
Overview
The Descriptors and Features System is a core component of the Evidently framework that provides extensible ways to extract, compute, and evaluate characteristics from data columns. This system enables users to define custom descriptors that wrap features with validation tests, making it easy to assess data quality, text properties, and model behavior in a unified evaluation pipeline.
Descriptors serve as wrappers around feature implementations, adding metadata, display names, and optional test configurations. The system supports both legacy v1 features and modern descriptor implementations, providing backward compatibility while enabling new functionality.
Architecture
The system follows a layered architecture with clear separation between descriptor definitions, feature implementations, and registry management.
graph TD
subgraph "Public API Layer"
PD[Public Descriptors<br/>TextMatch, HuggingFace, etc.]
CD[Custom Descriptors<br/>FeatureDescriptor]
end
subgraph "Registry Layer"
REG[DescriptorRegistry]
end
subgraph "Implementation Layer"
LEG[Legacy Features V1<br/>hf_feature, text_length]
NEW[New Features<br/>_text_length, text_match]
end
subgraph "Evaluation Layer"
EVAL[Evidently Metrics<br/>and Tests]
end
PD --> REG
CD --> REG
REG --> LEG
REG --> NEW
LEG --> EVAL
NEW --> EVAL
style PD fill:#90EE90
style CD fill:#87CEEB
style REG fill:#FFD700
style LEG fill:#FFA07A
style NEW fill:#DDA0DD
Directory Structure
src/evidently/
├── descriptors/
│ ├── __init__.py # Public API exports
│ ├── _text_length.py # Text length descriptor implementation
│ ├── _custom_descriptors.py # FeatureDescriptor class
│ ├── generated_descriptors.py # Generated descriptor functions
│ └── text_match.py # Text matching descriptor
├── core/
│ └── registries/
│ └── descriptors.py # Descriptor registry implementation
└── legacy/
├── descriptors/ # Legacy descriptor implementations
└── features/ # Legacy feature implementations
Core Components
FeatureDescriptor
The FeatureDescriptor class is the central abstraction that wraps a feature with additional metadata and optional tests.
# Source: src/evidently/descriptors/_custom_descriptors.py
class FeatureDescriptor(BaseDescriptor):
"""Descriptor that wraps a feature with tests and metadata."""
def __init__(
self,
feature: AnyFeatureType,
alias: Optional[str] = None,
tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
self.feature = feature
self.alias = alias
self.tests = tests or []
Parameters:
| Parameter | Type | Description |
|---|---|---|
feature | AnyFeatureType | The underlying feature implementation |
alias | Optional[str] | Display name for the descriptor |
tests | Optional[List] | Optional list of tests to apply |
DescriptorRegistry
The registry manages descriptor registration and lookup, enabling dynamic descriptor discovery.
# Source: src/evidently/core/registries/descriptors.py
class DescriptorRegistry:
"""Registry for managing descriptor instances."""
def __init__(self):
self._descriptors: Dict[str, Descriptor] = {}
def register(self, name: str, descriptor: Descriptor) -> None:
"""Register a descriptor with a given name."""
self._descriptors[name] = descriptor
def get(self, name: str) -> Optional[Descriptor]:
"""Retrieve a descriptor by name."""
return self._descriptors.get(name)
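Usage follows directly from the class above. A sketch: the descriptor instance is assumed to be any object satisfying the Descriptor interface:
registry = DescriptorRegistry()
registry.register("review_length", text_length_descriptor)   # store under a stable name
descriptor = registry.get("review_length")                    # later lookup by name
if descriptor is None:
    raise KeyError("descriptor 'review_length' is not registered")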
Available Descriptors
Text Length Descriptor
Computes text length statistics for string columns, including character count, word count, and sentence statistics.
Source: src/evidently/descriptors/_text_length.py
def TextLength(
column_name: str,
alias: Optional[str] = None,
mode: str = "chars",
tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
"""Compute text length metrics for a column.
Args:
column_name: Name of the text column to analyze.
alias: Display name for the descriptor.
mode: Measurement mode - "chars" or "words".
tests: Optional list of tests to apply.
"""
from evidently.legacy.features.text_length import TextLength as TextLengthV1
feature = TextLengthV1(column_name=column_name, mode=mode, display_name=alias)
return FeatureDescriptor(feature=feature, alias=alias, tests=tests)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
column_name | str | Required | Column to analyze |
alias | Optional[str] | None | Custom display name |
mode | str | "chars" | "chars" for characters, "words" for word count |
tests | Optional[List] | None | Tests to attach |
Text Match Descriptor
Matches text content against patterns or lists, useful for validation and filtering.
Source: src/evidently/descriptors/text_match.py
def TextMatch(
column_name: str,
alias: str,
words_list: List[str],
mode: str = "match",
lemmatize: bool = False,
tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
"""Match text against a word list.
Args:
column_name: Text column to match against.
alias: Name for the descriptor.
words_list: List of words/patterns to match.
mode: Matching mode - "match", "any", "all".
lemmatize: Whether to apply lemmatization.
tests: Optional test list.
"""
from evidently.legacy.features.text_features import TextMatch as TextMatchV1
feature = TextMatchV1(
column_name=column_name,
words_list=words_list,
mode=mode,
lemmatize=lemmatize,
display_name=alias
)
return FeatureDescriptor(feature=feature, alias=alias, tests=tests)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
column_name | str | Required | Target text column |
alias | str | Required | Descriptor name |
words_list | List[str] | Required | Words to match |
mode | str | "match" | "match", "any", or "all" |
lemmatize | bool | False | Enable lemmatization |
tests | Optional[List] | None | Tests to apply |
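A brief usage sketch based on the signature above; the import path assumes the public exports listed in the directory structure, and the column name and word list are illustrative:
from evidently.descriptors import TextMatch
refund_mentions = TextMatch(
    column_name="support_reply",
    alias="Mentions Refund",
    words_list=["refund", "reimburse", "money back"],
    mode="any",          # match if any word from the list appears
    lemmatize=True,      # normalize word forms before matching
)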
HuggingFace Descriptor
Applies HuggingFace models to text columns for various NLP tasks including toxicity detection.
Source: src/evidently/descriptors/generated_descriptors.py
def HuggingFace(
column_name: str,
model: str,
params: dict,
alias: str,
tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
"""Apply a HuggingFace model to text column.
Args:
column_name: Name of the text column to process.
model: HuggingFace model name or path.
params: Additional parameters for the model.
alias: Alias for the descriptor.
tests: Optional list of tests to apply.
"""
from evidently.legacy.features.hf_feature import HuggingFaceFeature
feature = HuggingFaceFeature(
column_name=column_name,
model=model,
params=params,
display_name=alias
)
return FeatureDescriptor(feature=feature, alias=alias, tests=tests)
HuggingFace Toxicity Descriptor
Specialized descriptor for detecting toxic content using HuggingFace models.
Source: src/evidently/descriptors/generated_descriptors.py:39-56
def HuggingFaceToxicity(
column_name: str,
alias: str,
model: Optional[str] = None,
toxic_label: Optional[str] = None,
tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
"""Detect toxicity in text using HuggingFace models.
Args:
column_name: Name of the text column to check.
alias: Alias for the descriptor.
model: HuggingFace model name or path. If None, uses default.
toxic_label: Label for toxic content.
tests: Optional list of tests to apply.
"""
Integration with Evidently Metrics
Descriptors are designed to integrate seamlessly with Evidently's metric and test evaluation system. When a descriptor is included in an evaluation, the underlying feature is computed and the results are automatically formatted for display.
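A compact sketch of that flow, reusing the Dataset and Report patterns shown elsewhere in this manual; the eval_df DataFrame and column name are illustrative:
import pandas as pd
from evidently import Report, Dataset, DataDefinition
from evidently.descriptors import TextLength
from evidently.presets import TextEvals

eval_df = pd.DataFrame({"answer": ["Tokyo is the capital of Japan.", "I'm sorry, I can't help with that."]})

# The descriptor is attached to the Dataset; the Report computes and renders it
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[TextLength("answer", alias="Length")],
)
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
report.save_html("descriptor_report.html")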
graph LR
A[Input Data] --> B[Descriptor Definition]
B --> C[Feature Computation]
C --> D[Test Evaluation]
D --> E[Metric Results]
C --> F[Visualization Data]
F --> G[HTML Widgets]
style A fill:#87CEEB
style E fill:#90EE90
style G fill:#FFD700
Legacy Features System
The legacy features system (evidently.legacy.features) provides backward compatibility for existing integrations. These features are wrapped by modern descriptors but can also be used directly.
Source: src/evidently/legacy/features/__init__.py
Available Legacy Features
| Feature | Module | Purpose |
|---|---|---|
TextLength | text_length | Text length statistics |
TextMatch | text_features | Pattern matching in text |
HuggingFaceFeature | hf_feature | HuggingFace model integration |
Usage Example
from evidently.descriptors import TextLength, TextMatch, HuggingFaceToxicity
# Create descriptors for evaluation
text_length_desc = TextLength(
column_name="review_text",
alias="Review Length",
mode="words"
)
toxicity_desc = HuggingFaceToxicity(
column_name="review_text",
alias="Toxicity Score",
toxic_label="toxic"
)
# Attach the descriptors to a Dataset and evaluate them in a Report
# (reviews_df is an illustrative pandas DataFrame with a "review_text" column)
from evidently import Report, Dataset, DataDefinition
from evidently.presets import TextEvals

eval_dataset = Dataset.from_pandas(
    reviews_df,
    data_definition=DataDefinition(),
    descriptors=[text_length_desc, toxicity_desc]
)
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
Descriptor Tests Integration
Descriptors can be associated with tests that validate the computed values:
from evidently.descriptors import TextLength
from evidently.test_suite import ColumnTestSuite
# GreaterThan and LessThan are assumed to be importable from Evidently's tests module
test_suite = ColumnTestSuite(
tests=[
TextLength(
column_name="description",
alias="Description Length",
tests=[
# Test within the descriptor
GreaterThan("Description Length", threshold=10),
LessThan("Description Length", threshold=500)
]
)
]
)
Summary
The Descriptors and Features System provides a flexible, extensible mechanism for computing and evaluating column characteristics within Evidently. Key aspects include:
- Wrapper Pattern: FeatureDescriptor wraps features with metadata and tests
- Registry Pattern: Centralized descriptor management via DescriptorRegistry
- Backward Compatibility: Legacy features wrapped for v1 compatibility
- Test Integration: Native support for attaching tests to descriptors
- Visualization: Automatic integration with Evidently's HTML rendering pipeline
This architecture enables users to define custom descriptors for any data transformation or evaluation need while maintaining consistency with the Evidently evaluation framework.
Source: https://github.com/evidentlyai/evidently / Human Manual
Reports and Test Suites
Related topics: Presets and Metric Presets, Custom Metrics and Extensibility, Core Components
Overview
Evidently provides two primary evaluation constructs for ML and LLM-powered systems: Reports and Test Suites. Both are designed to evaluate, analyze, and monitor data and model quality, but they serve different purposes and use cases.
| Component | Purpose | Use Case |
|---|---|---|
| Report | Generate comprehensive analysis with metrics, visualizations, and insights | Exploratory analysis, documentation, stakeholder communication |
| Test Suite | Run structured checks with pass/fail outcomes | CI/CD pipelines, automated quality gates, regression testing |
Sources: src/evidently/legacy/report/report.py
Architecture
graph TD
subgraph "Core Evaluation Engine"
A[Dataset / DataDefinition] --> B[Report / Test Suite]
C[Descriptors / Metrics] --> B
D[Test Cases] --> B
end
subgraph "Report Output"
B --> E[Metric Results]
B --> F[Visualizations]
B --> G[JSON/HTML Export]
end
subgraph "Test Suite Output"
B --> H[Test Results]
H --> I[Pass / Fail Status]
H --> J[Failure Details]
end
Core Components
The evaluation system is built on several key components:
| Component | File Location | Role |
|---|---|---|
Report | src/evidently/core/report.py | Main class for generating analytical reports |
TestSuite | src/evidently/core/tests.py | Orchestrates test execution |
Dataset | Core data structure | Holds data with schema definitions |
DataDefinition | Schema definition | Defines column types and expectations |
Descriptor | Feature extraction | Row-level evaluators (e.g., Sentiment, TextLength) |
Metric | Metric calculation | Computes statistical measures |
Sources: src/evidently/core/report.py
Datasets and Data Definition
Dataset Structure
Evidently uses a Dataset object to wrap pandas DataFrames with additional metadata:
import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
eval_dataset = Dataset.from_pandas(
pd.DataFrame(eval_df),
data_definition=DataDefinition(),
descriptors=[
Sentiment("answer", alias="Sentiment"),
TextLength("answer", alias="Length"),
Contains("answer", items=["sorry", "apologize"], mode="any")
]
)
Sources: README.md
DataDefinition
The DataDefinition class defines the schema for the dataset:
| Property | Type | Description |
|---|---|---|
column_schema | dict | Maps column names to data types |
relationships | list | Defines relationships between datasets |
timestamp_column | str | Column containing datetime values |
Descriptors
Descriptors are row-level evaluators that extract features or apply checks to individual rows. They can be used within datasets or standalone.
Available Descriptors
| Descriptor | Purpose | Parameters |
|---|---|---|
Sentiment | Detect sentiment in text | column_name, alias |
TextLength | Measure text length | column_name, alias |
Contains | Check for substring presence | column_name, items, mode, case_sensitive |
BeginsWith | Verify text prefix | column_name, prefix, case_sensitive |
HuggingFace | Apply HF models | column_name, model, params |
HuggingFaceToxicity | Detect toxic content | column_name, model, toxic_label |
Sources: src/evidently/descriptors/generated_descriptors.py
Descriptor Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
column_name | str | Yes | Name of the column to evaluate |
alias | str | No | Display name for the result |
tests | list | No | Optional test cases to apply |
mode | str | No | Matching mode: "any" or "all" |
case_sensitive | bool | No | Whether comparison is case-sensitive |
Reports
Creating a Report
Reports generate comprehensive HTML or JSON output with metrics and visualizations:
from evidently import Report
from evidently.presets import TextEvals
report = Report(
metrics=[
TextEvals(),
]
)
result = report.run(reference_data=reference_df, current_data=current_df)
report.save_html("report.html")
Sources: src/evidently/legacy/report/report.py
Report Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
metrics | list | Required | List of metrics to compute |
timestamp | datetime | Now | Report generation time |
include_tests | bool | False | Include test results in report |
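A sketch combining these options; include_tests is taken from the table above and treated as a constructor argument (assumption), and eval_dataset is assumed to be built as shown earlier:
from evidently import Report
from evidently.presets import TextEvals

report = Report(
    metrics=[TextEvals()],
    include_tests=True,  # surface attached descriptor/metric tests in the report output
)
result = report.run(reference_data=None, current_data=eval_dataset)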
Report Output Formats
| Format | Method | Use Case |
|---|---|---|
| HTML | save_html(path) | Interactive visualization |
| JSON | save_json(path) or as_dict() | Machine parsing, APIs |
| Python dict | as_dict() | Programmatic access |
Sources: src/evidently/core/serialization.py
Test Suites
Creating a Test Suite
Test Suites run structured checks and return pass/fail status:
from evidently.test_suite import TestSuite
suite = TestSuite(tests=[
# Test definitions
])
result = suite.run(reference_data=reference_df, current_data=current_df)
suite.save("test_results.json")
Sources: src/evidently/legacy/test_suite/test_suite.py
Test Suite Workflow
graph LR
A[Input Data] --> B[Test Suite]
B --> C{Execute Tests}
C --> D[Test 1]
C --> E[Test 2]
C --> F[Test N]
D --> G[Test Results]
E --> G
F --> G
G --> H{Pass All?}
H -->|Yes| I[Success]
H -->|No| J[Failure Report]
Test Results Structure
| Field | Type | Description |
|---|---|---|
status | str | "PASSED", "FAILED", "WARNING" |
name | str | Test identifier |
group | str | Test category |
details | dict | Additional context and metrics |
timestamp | datetime | When the test was run |
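A small sketch of consuming these fields from the saved result file; the exact JSON layout (a top-level "tests" list) is an assumption based on the table above:
import json

with open("test_results.json") as f:
    results = json.load(f)

# Count failures and print their identifiers and details
failed = [t for t in results.get("tests", []) if t.get("status") == "FAILED"]
for test in failed:
    print(test.get("name"), test.get("group"), test.get("details"))
print(f"{len(failed)} failed test(s)")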
LLM Evaluation Features
LLM-Powered Judgments
Evidently supports LLM-based evaluation through specialized descriptors:
from evidently.descriptors import (
ContextQualityLLMEval,
CompletenessLLMEval,
)
| Descriptor | Purpose | Key Parameters |
|---|---|---|
ContextQualityLLMEval | Evaluate context relevance | question, provider, model |
CompletenessLLMEval | Check response completeness | context, provider, model, include_reasoning |
Sources: src/evidently/descriptors/generated_descriptors.py
LLM Provider Configuration
| Provider | Model Default | Configuration |
|---|---|---|
openai | gpt-4o-mini | API key required |
anthropic | Claude models | API key required |
azure | Configurable | Endpoint + key |
Evaluation Options
| Parameter | Type | Description |
|---|---|---|
include_category | bool | Include categorical classification |
include_score | bool | Include numerical score |
include_reasoning | bool | Include LLM reasoning |
uncertainty | Uncertainty | Strategy for handling uncertainty |
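A hedged example of configuring an LLM-based descriptor with these options; the parameter names mirror the tables above and may differ in the installed version:
from evidently.descriptors import CompletenessLLMEval

completeness = CompletenessLLMEval(
    "answer",
    context="context",        # column holding the retrieved context
    provider="openai",
    model="gpt-4o-mini",
    include_reasoning=True,   # ask the LLM judge to explain its verdict
)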
Metrics and Presets
Preset Configurations
Evidently provides pre-configured metric sets:
| Preset | Description | File |
|---|---|---|
TextEvals | Text quality metrics | src/evidently/presets |
DataDrift | Drift detection | Legacy metrics |
DataQuality | Quality checks | Legacy metrics |
Sources: src/evidently/metrics/_legacy.py
Legacy vs. Current API
Evidently maintains both legacy and current APIs for backward compatibility:
Legacy API Structure
src/evidently/legacy/
├── report/
│ └── report.py # Legacy Report class
└── test_suite/
└── test_suite.py # Legacy TestSuite class
Current API Structure
src/evidently/
├── core/
│ ├── report.py # Current Report class
│ ├── tests.py # Current test infrastructure
│ ├── compare.py # Data comparison utilities
│ └── serialization.py # Output serialization
Sources: src/evidently/metrics/_legacy.py
Usage Example: LLM Evaluation Pipeline
import pandas as pd
from evidently import Report, Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals
# 1. Prepare data
eval_df = pd.DataFrame([
["What is the capital of Japan?", "The capital of Japan is Tokyo."],
["Who painted the Mona Lisa?", "Leonardo da Vinci."],
["Can you write an essay?", "I'm sorry, but I can't assist with homework."]],
columns=["question", "answer"])
# 2. Create dataset with descriptors
eval_dataset = Dataset.from_pandas(
pd.DataFrame(eval_df),
data_definition=DataDefinition(),
descriptors=[
Sentiment("answer", alias="Sentiment"),
TextLength("answer", alias="Length"),
Contains("answer", items=["sorry", "apologize"], mode="any")
]
)
# 3. Run report
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
report.save_html("llm_eval_report.html")
Sources: README.md
Comparison: Report vs. Test Suite
| Aspect | Report | Test Suite |
|---|---|---|
| Output | Metrics, visualizations, insights | Pass/fail status, error details |
| Use Case | Exploratory analysis | Automated quality gates |
| Integration | Dashboards, documentation | CI/CD, monitoring alerts |
| Thresholds | Configurable, informative | Strict pass/fail boundaries |
| Exit Code | Always 0 (informational) | 0 (pass) or 1 (fail) |
Sources: src/evidently/core/tests.py
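A minimal CI-gate sketch built on the exit-code semantics above, reusing the suite from the Test Suites section; the as_dict() layout is an assumption:
import sys

suite.run(reference_data=reference_df, current_data=current_df)
summary = suite.as_dict()
failed = [t for t in summary.get("tests", []) if t.get("status") == "FAILED"]
sys.exit(1 if failed else 0)  # non-zero exit fails the pipeline step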
Serialization
Supported Formats
| Format | Extension | Use Case |
|---|---|---|
| JSON | .json | APIs, automation, storage |
| HTML | .html | Human review, sharing |
| Parquet | .parquet | Large-scale data storage |
Serialization Options
# Save as JSON
report.save_json("report.json")
# Save as HTML
report.save_html("report.html")
# Export as dictionary
data = report.as_dict()
Sources: src/evidently/core/serialization.py
Summary
Reports and Test Suites in Evidently provide complementary approaches to ML and LLM evaluation:
- Reports excel at providing comprehensive, visual analysis suitable for exploration and documentation
- Test Suites provide structured, automated quality checks ideal for CI/CD integration
- Both share the same underlying dataset and descriptor infrastructure
- LLM-powered evaluations are available through dedicated descriptors like ContextQualityLLMEval and CompletenessLLMEval
- Output can be serialized to JSON, HTML, or accessed programmatically via Python dictionaries
Sources: [src/evidently/legacy/report/report.py](https://github.com/evidentlyai/evidently/blob/main/src/evidently/legacy/report/report.py)
Presets and Metric Presets
Related topics: Reports and Test Suites, Custom Metrics and Extensibility
Overview
Presets in Evidently are pre-configured collections of metrics and evaluations designed to simplify the process of assessing machine learning models and LLM-powered systems. They provide ready-to-use evaluation templates that bundle relevant metrics, thresholds, and reporting components into cohesive units. This abstraction layer allows users to perform comprehensive model assessment without manually configuring individual metrics.
The preset system follows a modular architecture where each preset type addresses a specific use case domain. Users can instantiate presets and pass them directly to the Report or TestSuite objects, which execute the bundled evaluations and generate structured results.
Preset Architecture
graph TD
A[Evidently Presets] --> B[Standard Presets]
A --> C[Legacy Presets]
B --> B1[DataDriftPreset]
B --> B2[TargetDriftPreset]
B --> B3[ClassificationPreset]
B --> B4[RegressionPreset]
B --> B5[RecSysPreset]
B --> B6[TextEvals]
C --> C1[MetricPreset]
C --> C2[TestPreset]
D[Report / TestSuite] --> E[Integrates Presets]
E --> B
E --> C
Preset Types
Standard Presets (src/evidently/presets/)
The modern preset implementation resides in the src/evidently/presets/ directory. These presets integrate directly with the current Evidently reporting API.
#### DataDriftPreset
The DataDriftPreset evaluates feature-level drift between reference and current datasets. It calculates drift scores for individual features and provides aggregate drift statistics. This preset is particularly useful for monitoring data pipeline changes and detecting distribution shifts that may impact model performance.
from evidently import Report
from evidently.presets import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
#### TargetDriftPreset
The TargetDriftPreset specifically monitors drift in the target variable distribution. It is essential for supervised learning scenarios where changes in label distribution can signal underlying data quality issues or concept drift.
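Usage mirrors DataDriftPreset; a sketch, assuming the preset is importable from evidently.presets as listed above:
from evidently import Report
from evidently.presets import TargetDriftPreset

report = Report(metrics=[TargetDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)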
#### ClassificationPreset
The ClassificationPreset bundles metrics relevant to classification model evaluation. It includes accuracy metrics, confusion matrix analysis, class-level performance indicators, and probability calibration assessments. This preset supports both binary and multiclass classification scenarios.
#### RegressionPreset
The RegressionPreset provides comprehensive regression model evaluation including error distribution analysis, quantile-based metrics, and residual diagnostics. It helps identify heteroscedasticity, non-linearity, and systematic prediction biases.
#### RecSysPreset
The RecSysPreset is specialized for recommendation system evaluation. It includes ranking metrics, coverage indicators, and user-item interaction analysis tailored to collaborative filtering and content-based recommendation approaches.
#### TextEvals
The TextEvals preset targets LLM and NLP model evaluation. It encompasses text-specific metrics such as sentiment analysis, text length distributions, and content quality indicators. This preset integrates with the descriptors system for row-level text evaluations.
import pandas as pd
from evidently import Report, Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals

eval_dataset = Dataset.from_pandas(
    pd.DataFrame(eval_df),
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        Contains("answer", items=["sorry", "cannot"], alias="Denial")
    ]
)
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
Legacy Presets (src/evidently/legacy/)
The legacy preset system provided foundational abstractions for metric and test evaluation. These classes served as the original implementation pattern before the current preset architecture.
#### MetricPreset
The MetricPreset class defines the base interface for metric collection presets. It maintains a list of metrics and provides methods for execution and result aggregation.
#### TestPreset
The TestPreset class extends the preset concept to testing scenarios, enabling batch execution of statistical tests against model outputs. It provides assertions and threshold-based pass/fail criteria.
Preset Composition
Presets can be combined within a single report to create comprehensive evaluation suites:
from evidently import Report
from evidently.presets import DataDriftPreset, ClassificationPreset
report = Report(
metrics=[
DataDriftPreset(),
ClassificationPreset()
]
)
Integration with Report API
Presets integrate seamlessly with the Evidently Report class:
| Component | Role | Integration |
|---|---|---|
Report | Execution container | Accepts presets as metrics parameter |
Dataset | Data wrapper | Provides reference and current data |
DataDefinition | Schema definition | Describes columns and their roles |
| Preset | Metric bundle | Contains metric instances |
Descriptor System
The preset system works in conjunction with the descriptor system for row-level evaluations. Descriptors apply per-row transformations and evaluations:
from evidently.descriptors import Sentiment, TextLength, Contains
Descriptors included in the Dataset configuration are evaluated alongside preset metrics, enabling both aggregate and granular assessment within a single report.
Use Cases
Model Validation
Before deploying a model, use presets to validate performance on held-out data:
from evidently import Report
from evidently.presets import RegressionPreset

report = Report(metrics=[RegressionPreset()])
report.run(reference_data=training_df, current_data=validation_df)
Production Monitoring
Continuously monitor deployed models for performance degradation:
from evidently import Report
from evidently.presets import DataDriftPreset, ClassificationPreset

report = Report(metrics=[DataDriftPreset(), ClassificationPreset()])
report.run(reference_data=baseline_df, current_data=current_df)
LLM Evaluation
Assess LLM responses using text-specific presets:
from evidently.presets import TextEvals
from evidently import Dataset, DataDefinition
eval_dataset = Dataset.from_pandas(
    response_df,
    data_definition=DataDefinition(),
    descriptors=[...]  # row-level descriptors attach to the Dataset, not the Report
)
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
Summary Table: Preset Comparison
| Preset | Domain | Key Metrics | Use Case |
|---|---|---|---|
DataDriftPreset | Data | Drift scores, feature importance | Data pipeline monitoring |
TargetDriftPreset | Data | Target distribution, PSI | Concept drift detection |
ClassificationPreset | ML | Accuracy, F1, ROC-AUC | Classifier evaluation |
RegressionPreset | ML | MAE, RMSE, R² | Regression analysis |
RecSysPreset | ML | Ranking metrics, coverage | Recommendation systems |
TextEvals | NLP/LLM | Sentiment, length, content | Text model evaluation |
Conclusion
The preset system in Evidently provides a powerful abstraction for organizing and executing model evaluations. By bundling related metrics into coherent units, presets reduce boilerplate and enable consistent evaluation practices across different model types and deployment scenarios.
Source: https://github.com/evidentlyai/evidently / Human Manual
Custom Metrics and Extensibility
Related topics: Reports and Test Suites, Presets and Metric Presets
Overview
Evidently provides a comprehensive extensibility system that allows users to create custom metrics, features, and test configurations. The framework follows a plugin-like architecture where custom implementations can be registered and discovered at runtime through the registry system.
The extensibility model encompasses several key areas:
- Custom Metrics: User-defined metrics that compute specific evaluation logic
- Custom Features: Row-level evaluators that extract or compute values from data
- Generators: Automatic metric and feature generation from column specifications
- Bound Tests: Test configurations attached to metrics with threshold definitions
This architecture enables users to extend the framework's built-in capabilities while maintaining consistency with the existing evaluation pipeline.
Registry System Architecture
The registry system serves as the central mechanism for discovering and managing extensions within Evidently. It provides a unified interface for registering, retrieving, and invoking custom implementations.
graph TD
A[User Code] --> B[Registry API]
B --> C[MetricRegistry]
B --> D[FeatureRegistry]
B --> E[TestRegistry]
C --> F[Metric Implementations]
D --> G[Feature Implementations]
E --> H[Test Configurations]
Core Registry Components
| Component | Purpose | File Location |
|---|---|---|
| MetricRegistry | Manages custom metric registrations | src/evidently/core/registries/__init__.py |
| ConfigRegistry | Stores configuration metadata | src/evidently/core/registries/configs.py |
| BoundTestRegistry | Handles test bindings and thresholds | src/evidently/core/registries/bound_tests.py |
The registry pattern follows a key-based lookup system where each extension is identified by a unique name. This allows for dynamic discovery and late binding of implementations at runtime.
Custom Metrics
Custom metrics extend the base Metric class to implement domain-specific evaluation logic. The framework provides both legacy and modern approaches for metric creation.
Metric Structure
A custom metric typically consists of:
- Configuration: Defines the metric's parameters and metadata
- Calculation Logic: Implements the calculate method to produce results
- Visualization: Provides rendering information through widget definitions
# Metric, Context, and MetricResult are assumed to come from Evidently's core metric API
class CustomMetric(Metric):
def __init__(self, column: str, threshold: float = 0.5):
self.column = column
self.threshold = threshold
def calculate(self, context: Context) -> MetricResult:
# Custom calculation logic
pass
Sources: src/evidently/legacy/metrics/custom_metric.py:1-50
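Once defined, the custom metric can be passed to a Report like any built-in metric; a sketch assuming the CustomMetric class above and a previously built eval_dataset:
from evidently import Report

report = Report(metrics=[CustomMetric(column="response_score", threshold=0.8)])
result = report.run(reference_data=None, current_data=eval_dataset)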
Metric Container System
The MetricContainer abstract base class provides the foundation for grouping related metrics. It implements a caching mechanism to avoid redundant metric generation.
graph LR
A[Context] --> B{Container Fingerprint}
B -->|Cache Hit| C[Return Cached Metrics]
B -->|Cache Miss| D[generate_metrics]
D --> E[Store in Context]
E --> C
The generate_metrics method must be implemented by subclasses to define the metrics to be computed:
def generate_metrics(self, context: "Context") -> Sequence[MetricOrContainer]:
"""Generate metrics based on the container configuration.
Args:
context: Context containing datasets and configuration.
Returns:
Sequence of Metric or MetricContainer objects to compute.
"""
raise NotImplementedError()
Sources: src/evidently/core/container.py:1-80
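An illustrative subclass, assuming the MetricContainer base from evidently.core.container cited above and the CustomMetric sketch from earlier; the container simply fans one metric out per column:
from typing import Sequence

class PerColumnContainer(MetricContainer):
    """Generate the same custom metric for every configured column (illustrative)."""

    def __init__(self, columns: Sequence[str], threshold: float = 0.5):
        self.columns = columns
        self.threshold = threshold

    def generate_metrics(self, context: "Context") -> Sequence["MetricOrContainer"]:
        # One CustomMetric instance per target column
        return [CustomMetric(column=c, threshold=self.threshold) for c in self.columns]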
Rendering and Widgets
Metrics can contribute visualization widgets through the render method. The widget system supports hierarchical composition where parent containers aggregate widgets from child metrics:
def render(
self,
context: "Context",
child_widgets: Optional[List[Tuple[Optional[MetricId], List[BaseWidgetInfo]]]] = None,
) -> List[BaseWidgetInfo]:
"""Render visualization widgets for this container.
Combines widgets from all child metrics/containers.
"""
Sources: src/evidently/core/container.py:80-120
Custom Features
Custom features extend the Descriptor base class to provide row-level evaluation capabilities. They can be applied to individual data points during dataset processing.
Feature Implementation Pattern
Features implement the descriptor pattern, wrapping column values with evaluation logic:
import pandas as pd

class CustomFeature(Descriptor):
    def __init__(self, column: str, alias: str = None):
        self.column = column
        self.alias = alias or column

    def apply(self, data: pd.Series) -> pd.Series:
        # Custom feature calculation; character count per row is shown as an illustration
        return data.astype(str).str.len()
Sources: src/evidently/legacy/features/custom_feature.py:1-40
Integration with Descriptors
The descriptor system integrates with the data definition framework, allowing features to be declared as part of a dataset configuration:
eval_dataset = Dataset.from_pandas(
pd.DataFrame(eval_df),
data_definition=DataDefinition(),
descriptors=[
Sentiment("answer", alias="Sentiment"),
TextLength("answer", alias="Length"),
Contains("answer", ["denial", "refuse"], alias="ContainsDenial")
]
)
Sources: README.md:1-50
Generators System
Generators provide an automated way to create metrics and features based on column specifications. This reduces boilerplate when working with large numbers of similar evaluations.
Column Generator
The column generator creates standardized metrics for numeric and categorical columns:
generator = ColumnGenerator(
columns=["feature_1", "feature_2", "feature_3"],
generators=[
MeanGenerator(),
StdGenerator(),
NullCountGenerator()
]
)
Sources: src/evidently/generators/column.py:1-60
Generator Configuration
Generators can be configured with:
| Parameter | Type | Description |
|---|---|---|
columns | List[str] | Target columns for generation |
include_missing | bool | Include columns with missing values |
exclude_patterns | List[str] | Regex patterns for column exclusion |
generators | List[Generator] | Specific generators to apply |
Sources: src/evidently/generators/__init__.py:1-50
Bound Tests and Thresholds
Bound tests attach threshold configurations to metrics, enabling automated pass/fail determinations based on metric results.
Test Binding Pattern
Tests are bound to metrics through the registry system:
class BoundTest:
def __init__(self, metric: MetricId, threshold: TestThreshold):
self.metric = metric
self.threshold = threshold
def evaluate(self, metric_result: MetricResult) -> TestResult:
# Compare metric value against threshold
pass
Sources: src/evidently/core/registries/bound_tests.py:1-50
Threshold Types
The framework supports multiple threshold configuration types:
| Threshold Type | Description | Use Case |
|---|---|---|
absolute | Fixed value comparison | Fixed acceptable ranges |
relative | Percentage-based comparison | Drift detection |
sigma | Standard deviation bounds | Anomaly detection |
quantile | Percentile-based thresholds | Distribution extremes |
Sources: src/evidently/core/registries/bound_tests.py:50-100
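A standalone sketch of how these threshold families differ; this is not Evidently's API, only the comparison logic implied by the table above:
def within_threshold(value, kind, bound, reference=None, sigma=None):
    """Illustrative threshold check for the four families in the table above."""
    if kind == "absolute":
        return value <= bound
    if kind == "relative":
        return reference is not None and abs(value - reference) / abs(reference) <= bound
    if kind == "sigma":
        return reference is not None and sigma is not None and abs(value - reference) <= bound * sigma
    if kind == "quantile":
        # bound is a precomputed percentile cut-off here
        return value <= bound
    raise ValueError(f"unknown threshold kind: {kind}")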
Configuration and Versioning
The extensibility system includes robust configuration management with versioning support for metrics and descriptors.
Config Version Model
Configurations are versioned to track changes over time:
from dataclasses import dataclass
from typing import Any
# STR_UUID and ConfigVersionMetadata are Evidently SDK types (see src/evidently/sdk/configs.py)

@dataclass
class ConfigVersion:
id: STR_UUID
artifact_id: STR_UUID
version: int
content: Any
metadata: ConfigVersionMetadata
Sources: src/evidently/sdk/configs.py:1-80
Metadata Structure
| Field | Type | Description |
|---|---|---|
created_at | datetime | Version creation timestamp |
updated_at | datetime | Last modification timestamp |
author | str | User who created/modified |
comment | str | Change description |
Sources: src/evidently/sdk/adapters.py:1-100
Workflow Diagram
The complete extensibility workflow from definition to execution:
graph TD
A[Define Custom Metric/Feature] --> B[Register in Registry]
B --> C[Create Dataset with Descriptors]
C --> D[Generate Metrics from Container]
D --> E[Calculate Results]
E --> F[Bind Tests with Thresholds]
F --> G[Evaluate Pass/Fail]
G --> H[Render Widgets]
H --> I[Display in Report/Dashboard]
Best Practices
Performance Considerations
- Caching: Use the container fingerprinting mechanism to cache generated metrics and avoid redundant calculations
- Lazy Evaluation: Implement metrics using lazy evaluation patterns when possible to defer expensive computations
- Batch Processing: Design features to operate on pandas Series rather than individual values for vectorized performance
Extensibility Guidelines
- Consistent Naming: Follow the established naming conventions for custom implementations
- Type Hints: Include comprehensive type hints for all public interfaces
- Documentation: Document parameters and return values using docstring conventions
- Testing: Create unit tests for custom metric calculation logic
Integration Points
Custom extensions integrate with the framework through several standardized interfaces:
- Metric Protocol: Implement the calculate(self, context) method
- Descriptor Protocol: Implement the apply(self, data) method
- Widget Protocol: Return List[BaseWidgetInfo] from the render() method
- Test Protocol: Implement the evaluate(self, result) method for bound tests
API Reference
Key Classes and Functions
| Class/Function | Module | Purpose |
|---|---|---|
MetricContainer | evidently.core.container | Base class for metric containers |
Descriptor | evidently.legacy.features | Base class for features |
ColumnGenerator | evidently.generators.column | Column-based metric generation |
BoundTest | evidently.core.registries | Threshold-bound test configuration |
ConfigVersion | evidently.sdk.configs | Versioned configuration storage |
Context Management
The Context object provides access to datasets and configuration during metric calculation:
class Context:
def metrics_container(self, fingerprint: str) -> Optional[List[MetricOrContainer]]:
"""Retrieve cached metrics for container fingerprint."""
def set_metric_container_data(
self,
fingerprint: str,
metrics: List[MetricOrContainer]
) -> None:
"""Store generated metrics in cache."""
Sources: src/evidently/core/container.py:50-80
Summary
Evidently's extensibility system provides a comprehensive framework for customizing evaluation logic. The registry-based architecture enables dynamic discovery of custom implementations while maintaining consistency with built-in features. Custom metrics and features extend the base classes to implement domain-specific logic, and the generator system automates repetitive metric definitions. The bound test mechanism attaches configurable thresholds to metrics for automated validation, making the system suitable for both exploratory analysis and production monitoring scenarios.
Sources: src/evidently/legacy/metrics/custom_metric.py:1-50
UI Service Backend
The Evidently UI Service Backend is a FastAPI-based REST API layer that powers the Evidently web interface, providing endpoints for managing projects, datasets, prompts, and trace data. It serves as the bridge between the React-based frontend and the Evidently SDK's core functionality.
Architecture Overview
The UI Service Backend follows a layered architecture:
graph TD
A[Frontend: React/TypeScript] --> B[UI Service Backend: FastAPI]
B --> C[Evidently SDK]
C --> D[Workspace Abstraction]
C --> E[Cloud Config API]
D --> F[(Local Storage)]
E --> G[(Remote Backend)]
Component Stack
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React, TypeScript, MUI | User interface components |
| Backend API | FastAPI/Python | REST API endpoints |
| Business Logic | Evidently SDK | Core evaluation logic |
| Data Layer | Local FS / Remote API | Persistence |
Core Data Models
Project Model
Projects are the primary organizational unit in the UI. The ProjectModel represents:
ProjectModel:
- name: str
- description: Optional[str]
- org_id: Optional[OrgID]
Sources: src/evidently/ui/workspace.py:50-55
Config API Abstraction
The ConfigAPI class provides a generic interface for managing versioned configurations:
| Method | Purpose |
|---|---|
create_config() | Create a new configuration |
get_config() | Retrieve a configuration by ID |
list_configs() | List all configurations |
update_config() | Update an existing configuration |
delete_config() | Delete a configuration |
add_version() | Add a new version to a config |
get_version() | Get a specific version |
Sources: src/evidently/sdk/configs.py:95-145
Descriptor Configuration
The Descriptor type is used to store and version descriptor configurations:
DescriptorConfigAPI:
- add_descriptor() -> ConfigVersion
- get_descriptor() -> Descriptor
Sources: src/evidently/sdk/configs.py:165-180
Workspace Abstraction
The Workspace abstract class defines the contract for project management:
graph TD
A[Workspace] --> B[LocalWorkspace]
A --> C[CloudWorkspace]
A --> D[RemoteWorkspace]
Abstract Methods
| Method | Parameters | Returns | Description |
|---|---|---|---|
create_project() | name, description, org_id | Project | Creates a new project |
add_project() | project: ProjectModel, org_id | Project | Adds project to workspace |
get_project() | project_id: STR_UUID | Optional[Project] | Retrieves project by ID |
delete_project() | project_id: STR_UUID | None | Removes project from workspace |
list_projects() | org_id: Optional | Sequence[Project] | Lists all projects |
Sources: src/evidently/ui/workspace.py:40-80
Frontend Components
Project Card Component
The ProjectCard component handles project display and editing:
ProjectCardProps:
- project: Project
- disabled?: boolean
- onEditProject: (args: { name: string; description: string }) => void
- LinkToProject: ComponentType
Features:
- Toggle between view and edit modes
- Uses EditProjectInfoForm for inline editing
- Displays ProjectInfoCard in view mode
Sources: ui/packages/evidently-ui-lib/src/components/Project/ProjectCard.tsx:60-80
Prompts Table
The PromptsTable component renders a sortable, paginated list of prompts:
| Column | Render | Features |
|---|---|---|
| ID | ID with copy button | TextWithCopyIcon component |
| Name | Truncated text (max 200px) | Typography component |
| Created at | Date formatted | dayjs locale formatting |
| Actions | Link + Delete button | Edit/delete operations |
Sources: ui/packages/evidently-ui-lib/src/components/Prompts/PromptsTable.tsx:40-75
Traces Table
The TracesTable component displays trace data with extended metadata:
| Column | Features |
|---|---|
| Tags | HidedTags component, 250px min width |
| Metadata | JsonViewThemed with clipboard support |
| Type | Chip showing trace origin |
| Created at | Sortable date column |
| Actions | Dataset link, edit dialog, delete |
Sources: ui/packages/evidently-ui-lib/src/components/Traces/TracesTable.tsx:50-90
API Endpoint Structure
Projects API
/projects
├── GET / - List all projects
├── POST / - Create new project
└── GET /{project_id} - Get project details
/projects/{project_id}/
├── prompts/
│ ├── GET / - List prompts
│ ├── POST / - Create prompt
│ └── GET /{id} - Get prompt details
├── datasets/
│ ├── GET / - List datasets
│ └── POST / - Create dataset
└── traces/
├── GET / - List traces
└── POST / - Create trace
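A hedged client-side sketch of calling these endpoints; the base URL, port, and response field names are assumptions about a locally running UI service:
import requests

BASE = "http://localhost:8000"  # assumed local UI service address

# List projects, then list prompts for the first project returned
projects = requests.get(f"{BASE}/projects").json()
if projects:
    project_id = projects[0]["id"]  # field name assumed
    prompts = requests.get(f"{BASE}/projects/{project_id}/prompts").json()
    print(f"{len(prompts)} prompt(s) in project {project_id}")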
Prompts API
| Endpoint | Method | Purpose |
|---|---|---|
/prompts | GET | List all prompts for a project |
/prompts | POST | Create a new prompt |
/prompts/{id} | GET | Get prompt details |
Sources: ui/service/src/routes/.../index-prompts-list/index-prompts-list-main.tsx:40-60
Configuration Management
Artifact Version Management
The ArtifactConfigAPI handles versioned artifact storage:
ArtifactConfigAPI:
- create_version() # Create new artifact version
- list_versions() # List all versions
- get_version() # Get specific version
- get_version_by_id() # Get by version ID
Sources: src/evidently/sdk/adapters.py:45-70
Version Conversion
Bidirectional conversion between SDK and API models:
graph LR
A[ArtifactVersion] -->|convert| B[ConfigVersion]
B -->|convert| A
Methods:
- _artifact_version_to_config_version() - SDK to API
- _config_version_to_artifact_version() - API to SDK
UI Utilities
HTML Link Templates
The utils.py module provides HTML templates for dashboard rendering:
HTML_LINK_WITH_ID_TEMPLATE # Link with button and ID display
FILE_LINK_WITH_ID_TEMPLATE # File link with ID
RUNNING_SERVICE_LINK_TEMPLATE # Service link with label
Sources: src/evidently/ui/utils.py:30-55
Workflow: Project Lifecycle
graph TD
A[Create Project] --> B[Add Datasets]
B --> C[Create Prompts]
C --> D[Run Evals]
D --> E[Store Traces]
E --> F[View Results]
F --> G[Monitor]
A -.->|via| H[Workspace.add_project]
C -.->|via| I[Prompts API]
E -.->|via| J[Traces API]
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
org_id | UUID | None | Organization identifier |
project_id | UUID | Auto | Project identifier |
version | int | "latest" | Config version selector |
Key SDK Classes
| Class | File | Responsibility |
|---|---|---|
ConfigAPI | configs.py | Generic config CRUD operations |
DescriptorConfigAPI | configs.py | Descriptor-specific operations |
CloudConfigAPI | configs.py | Remote backend communication |
Workspace | workspace.py | Abstract workspace interface |
ProjectModel | workspace.py | Project data structure |
Technology Stack
| Component | Technology | Version |
|---|---|---|
| Backend Framework | FastAPI | - |
| SDK Core | Python | 3.11+ |
| Frontend Framework | React | - |
| UI Components | MUI | - |
| State Management | React Hooks | - |
| Date Handling | dayjs | - |
Development Workflow
Running the Service
# In ui/service folder
pnpm dev
Code Quality
# In ui folder
pnpm code-check # Format, sort imports, lint
pnpm code-check --fix # Apply fixes automatically
Sources: ui/README.md:15-25
Building
pnpm build
Summary
The Evidently UI Service Backend provides:
- REST API layer for frontend-backend communication
- Project management via Workspace abstraction
- Versioned configuration storage using ConfigAPI
- Tracing support for evaluation history
- Prompts management for LLM-powered evaluations
The architecture cleanly separates concerns between the FastAPI backend, Evidently SDK business logic, and React frontend components, enabling modular development and testing.
Sources: src/evidently/ui/workspace.py:50-55
Doramagic Pitfall Log
Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.
1. Installation risk: Update scikit-learn version requirement to support v1.6.0
- Severity: high
- Finding: Installation risk is backed by a source signal: Update scikit-learn version requirement to support v1.6.0. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1407
2. Configuration risk: PromptOptimizer throws OpenAIError when using Vertex AI judge
- Severity: high
- Finding: Configuration risk is backed by a source signal: PromptOptimizer throws OpenAIError when using Vertex AI judge. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1856
3. Project risk: IndexError in infer_column_type when column contains only null values
- Severity: high
- Finding: Project risk is backed by a source signal: IndexError in infer_column_type when column contains only null values. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1764
4. Security or permission risk: Update evidently hashlib usage for FIPS-Compliant Systems and Security Best Practices
- Severity: high
- Finding: Security or permission risk is backed by a source signal: Update evidently hashlib usage for FIPS-Compliant Systems and Security Best Practices. Treat it as a review item until the current version is checked.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1410
5. Installation risk: Numpy 2.x support?
- Severity: medium
- Finding: Installation risk is backed by a source signal: Numpy 2.x support?. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1557
6. Installation risk: v0.7.12
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.7.12. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.12
7. Installation risk: v0.7.15
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.7.15. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.15
8. Installation risk: v0.7.20
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.7.20. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.20
9. Capability assumption: v0.7.19
- Severity: medium
- Finding: Capability assumption is backed by a source signal: v0.7.19. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.19
10. Capability assumption: v0.7.21
- Severity: medium
- Finding: Capability assumption is backed by a source signal: v0.7.21. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.21
11. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | github_repo:315977578 | https://github.com/evidentlyai/evidently | README/documentation is current enough for a first validation pass.
12. Maintenance risk: Maintainer activity is unknown
- Severity: medium
- Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | github_repo:315977578 | https://github.com/evidentlyai/evidently | last_activity_observed missing
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using evidently with real data or production workflows.
- IndexError in infer_column_type when column contains only null values - github / github_issue
- Protect this repo from AI-generated PRs - github / github_issue
- Numpy 2.x support? - github / github_issue
- PromptOptimizer throws OpenAIError when using Vertex AI judge - github / github_issue
- Update scikit-learn version requirement to support v1.6.0 - github / github_issue
- Update evidently hashlib usage for FIPS-Compliant Systems and Security B - github / github_issue
- v0.7.21 - github / github_release
- v0.7.20 - github / github_release
- v0.7.19 - github / github_release
- v0.7.18 - github / github_release
- v0.7.17 - github / github_release
- v0.7.16 - github / github_release
Source: Project Pack community evidence and pitfall evidence