Doramagic Project Pack · Human Manual

evidently

Related topics: Core Components, Data Management and Data Flow

Architecture Overview


Introduction

Evidently is an open-source Python framework designed to evaluate, test, and monitor ML and LLM-powered systems. The architecture follows a modular design pattern that separates concerns between core evaluation logic, SDK APIs, user interface components, and configuration management.

Sources: README.md:1-30

High-Level Architecture

The Evidently platform consists of several interconnected layers:

graph TB
    subgraph "User Interface Layer"
        UI["UI Components<br/>(React/TypeScript)"]
    end
    
    subgraph "SDK Layer"
        SDK["Python SDK<br/>(evidently.sdk)"]
        API["Cloud Config API"]
    end
    
    subgraph "Core Layer"
        CORE["Core Modules<br/>(metrics, descriptors, reports)"]
        LEGACY["Legacy Module<br/>(evidently.legacy)"]
        FUTURE["Future Module<br/>(evidently.future)"]
    end
    
    subgraph "Data Layer"
        WS["Workspace Abstraction"]
        CFG["Config System"]
        ADP["Adapters"]
    end
    
    UI --> SDK
    SDK --> API
    SDK --> WS
    SDK --> CFG
    SDK --> ADP
    ADP --> WS
    CFG --> WS

Core Module Structure

The main evidently package provides the primary user-facing APIs for ML evaluation:

| Component | Purpose |
| --- | --- |
| Report | Generate evaluation reports combining multiple metrics |
| Dataset | Container for evaluation data with descriptors |
| DataDefinition | Schema definition for dataset structure |
| Metrics | Individual evaluation metrics |
| Descriptors | Row-level evaluators (e.g., Sentiment, TextLength) |
| Presets | Pre-configured evaluation suites (e.g., TextEvals) |

Sources: README.md:30-55

Key Imports for LLM Evals

from evidently import Report
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals

SDK Architecture

The SDK layer provides programmatic access to remote configurations and cloud features.

CloudConfigAPI

The CloudConfigAPI class manages remote configuration operations:

| Method | Purpose |
| --- | --- |
| create_config() | Create a new remote configuration |
| update_config() | Update an existing configuration |
| get_config() | Retrieve a configuration by ID |
| list_configs() | List all configurations for a project |
| delete_config() | Remove a configuration |
| create_version() | Create a new version of a configuration |
| list_versions() | List all versions of a configuration |
| get_version() | Get a specific version |

Sources: src/evidently/sdk/configs.py:1-50
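
The manual lists only the method names, so the sketch below is hedged: the import path follows the Sources note above, while the constructor arguments, parameter names, and attribute access are illustrative assumptions rather than exact signatures.

# Minimal sketch of CloudConfigAPI usage. Method names come from the table above;
# constructor arguments, parameters, and attribute access are illustrative assumptions.
from evidently.sdk.configs import CloudConfigAPI  # import path assumed from the Sources note

api = CloudConfigAPI()  # in practice obtained from an authenticated cloud client

config = api.create_config(project_id="my-project-id", name="drift-config")  # hypothetical arguments
versions = api.list_versions(config.id)
latest = api.get_version(config.id, versions[-1].version)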

Config Version Management

The configuration system uses a ConfigVersion model to track changes:

classDiagram
    class ConfigMetadata {
        +str created_at
        +str updated_at
        +str author
        +str description
    }
    
    class ConfigVersion {
        +str id
        +str artifact_id
        +int version
        +Any content
        +ConfigVersionMetadata metadata
    }
    
    class ConfigVersionMetadata {
        +str created_at
        +str updated_at
        +str author
        +str comment
    }
    
    ConfigVersion --> ConfigVersionMetadata
    ConfigVersionMetadata --|> ConfigMetadata

Sources: src/evidently/sdk/configs.py:50-150

Adapter Pattern

The SDK uses adapters to convert between internal config models and domain objects:

graph LR
    A["ConfigVersion"] -->|Adapter| B["ArtifactVersion"]
    A -->|Adapter| C["Descriptor"]
    A -->|Adapter| D["Prompt"]

| Adapter Class | Source Config | Target Domain Object |
| --- | --- | --- |
| ArtifactAdapter | ConfigVersion | ArtifactVersion |
| DescriptorAdapter | ConfigVersion | Descriptor |
| PromptAdapter | ConfigVersion | Prompt |

Sources: src/evidently/sdk/adapters.py:1-100
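
The adapter classes themselves are not reproduced in this manual. The sketch below only illustrates the general shape of the pattern under the assumption that each adapter maps the content payload of a ConfigVersion to a typed domain object; the field and method names are hypothetical.

# Illustrative sketch of the adapter idea, not the actual classes in
# src/evidently/sdk/adapters.py. Field and method names are hypothetical.
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Prompt:  # hypothetical domain object
    name: str
    template: str

class PromptAdapter:  # hypothetical adapter mirroring the table above
    def from_config(self, content: Dict[str, Any]) -> Prompt:
        # content stands in for the ConfigVersion.content payload
        return Prompt(name=content["name"], template=content["template"])

    def to_config(self, prompt: Prompt) -> Dict[str, Any]:
        return {"name": prompt.name, "template": prompt.template}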

Workspace Architecture

The workspace abstraction provides a unified interface for project management:

Abstract Base Class

classDiagram
    class Workspace {
        <<abstract>>
        +create_project(name, description, org_id) Project
        +add_project(project, org_id) Project
        +get_project(project_id) Optional~Project~
        +delete_project(project_id)
        +list_projects(org_id) Sequence~Project~
    }

Core Workspace Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| create_project | name: str, description: str, org_id: Optional[OrgID] | Project | Creates and adds a new project |
| add_project | project: ProjectModel, org_id: Optional[OrgID] | Project | Adds an existing project model |
| get_project | project_id: STR_UUID | Optional[Project] | Retrieves project by UUID |
| delete_project | project_id: STR_UUID | None | Removes project from workspace |
| list_projects | org_id: Optional[OrgID] | Sequence[Project] | Lists projects with optional filtering |

Sources: src/evidently/ui/workspace.py:1-100
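
A short sketch of the interface in use follows. The method names and parameters come from the table above; how a concrete workspace instance is obtained depends on the deployment and is left out here.

# Sketch of the Workspace interface in use; method names and parameters follow
# the table above, and the concrete workspace instance is assumed to exist.
from evidently.ui.workspace import Workspace  # module path from the Sources note

def create_and_list(workspace: Workspace) -> list:
    workspace.create_project(name="demo", description="example project", org_id=None)
    # Project objects are assumed to expose a name attribute, as in ProjectModel below
    return [project.name for project in workspace.list_projects(org_id=None)]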

Project Model

Projects are stored using a ProjectModel data structure:

# Fields of the ProjectModel data structure (illustrative values, types in comments)
ProjectModel(
    name="my-project",            # str
    description="Demo project",   # str
    org_id=None,                  # Optional[OrgID]
)

UI Component Architecture

The frontend is built with React and TypeScript, organized as separate packages:

graph TD
    subgraph "ui/packages/evidently-ui-lib"
        WC["Widgets"]
        CC["Components"]
        FP["Forms"]
        TB["Tables"]
    end
    
    subgraph "Components"
        DT["Dashboard"]
        TR["Traces"]
        PR["Prompts"]
        DS["Descriptors"]
    end
    
    CC --> DT
    CC --> TR
    CC --> PR
    CC --> DS
    CC --> FP
    CC --> TB

Component Categories

| Package Path | Purpose |
| --- | --- |
| src/widgets/ | Dashboard widgets and test suite components |
| src/components/Dashboard/ | Dashboard-specific UI elements |
| src/components/Traces/ | Trace viewing and management |
| src/components/Prompts/ | Prompt template management |
| src/components/Descriptors/ | Descriptor configuration forms |
| src/components/Utils/ | Shared utility components |

The UI layer uses template functions to generate HTML for external links and reports:

| Template | Purpose |
| --- | --- |
| HTML_LINK_WITH_ID_TEMPLATE | Report links with button and ID display |
| FILE_LINK_WITH_ID_TEMPLATE | File links with ID metadata |
| RUNNING_SERVICE_LINK_TEMPLATE | Service endpoint links |
| EVIDENTLY_STYLES_COMMON | Shared CSS styles for links |

Sources: src/evidently/ui/utils.py:1-60

Template Structure

<div class="evidently-links container">
    <a target="_blank" href="{button_url}">{button_title}</a>
    <p><b>{id_title}:</b> <span>{id}</span></p>
</div>
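
The placeholders in this structure look like standard Python format fields. Under that assumption, rendering the report-link template could look like the sketch below; the import path and the format call are assumptions, while the field names come from the structure above.

# Assumes the template is a plain str.format string; field names are taken from
# the structure above, the import path from the Sources note.
from evidently.ui.utils import HTML_LINK_WITH_ID_TEMPLATE

html = HTML_LINK_WITH_ID_TEMPLATE.format(
    button_url="https://example.evidently.host/report/abc-123",
    button_title="Open report",
    id_title="Report ID",
    id="abc-123",
)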

Data Flow Architecture

graph TB
    A["User Code"] --> B["Dataset Creation"]
    B --> C["Descriptor Application"]
    C --> D["Report Generation"]
    D --> E["Workspace Storage"]
    
    F["Cloud API"] --> G["Config Management"]
    G --> E
    
    E --> H["UI Display"]

API Reference Documentation

The project includes automated API documentation generation:

| Command | Description |
| --- | --- |
| ./api-reference/generate.py --local-source-code | Generate docs from local source |
| ./api-reference/generate.py --git-revision <ref> | Generate docs from git revision |
| ./api-reference/generate.py --additional-modules | Include extra modules |

Documentation is output to api-reference/dist/ organized by revision.

Sources: api-reference/README.md:1-50

Summary

The Evidently architecture is organized into three main layers:

  1. Core Layer - Python packages for evaluation logic (evidently, evidently.core)
  2. SDK Layer - Remote configuration and cloud integration (evidently.sdk)
  3. UI Layer - React/TypeScript web interface for visualization and management

The modular design allows users to:

  • Use the Python SDK directly for programmatic evaluation
  • Store and version configurations in the cloud
  • Access results through the web UI
  • Extend functionality through the descriptor and metric system

Sources: README.md:1-30

Core Components

Related topics: Architecture Overview, Reports and Test Suites, Custom Metrics and Extensibility


Overview

The Core Components form the foundational architecture of the Evidently framework, providing the essential building blocks for evaluation, testing, and monitoring of ML and LLM-powered systems. These components establish the base type system, metric definitions, reporting mechanisms, and registry patterns that enable the framework's extensibility and modularity.

The Core Components encompass several interconnected modules that work together to provide a unified approach to ML evaluation. At the heart of this system is a registration-based architecture that allows metrics, tests, and components to be dynamically discovered, configured, and executed within the Evidently ecosystem.

This architecture follows a plugin-like pattern where individual components can be registered, versioned, and managed through centralized registries. This design enables users to extend the framework's functionality by implementing custom metrics and tests while maintaining compatibility with the existing evaluation pipeline.

Base Types Module

Purpose and Scope

The base types module defines the fundamental data structures and abstractions that underpin all Evidently components. These types establish common interfaces and data models used throughout the framework, ensuring consistency and interoperability between different modules.

The base types primarily focus on defining what data flows through the evaluation pipeline, including data definitions, dataset representations, and result containers. This module serves as the foundation upon which all higher-level abstractions are built.

Key Data Structures

The base types module defines several critical classes and interfaces that form the backbone of the Evidently type system. These structures provide a standardized way to represent input data, evaluation results, and configuration options across all components.

The module establishes the DataDefinition class, which serves as a schema for describing the structure of datasets being evaluated. This includes column definitions, data types, and metadata that describe the characteristics of the data being analyzed. Data definitions enable Evidently to understand the structure of incoming data and apply appropriate transformations and evaluations.

The module also defines base classes for Dataset objects, which encapsulate reference and current data batches used in evaluations. These dataset representations include functionality for data validation, transformation, and slicing based on various criteria.
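
A minimal example of these base types in use, mirroring the public API shown in the Data Management section of this manual:

# DataDefinition describes the dataset schema; Dataset.from_pandas wraps a
# DataFrame with it (same public API as in the Data Management section).
import pandas as pd
from evidently import Dataset, DataDefinition

df = pd.DataFrame({"question": ["What is drift?"], "answer": ["A shift in the data distribution."]})
dataset = Dataset.from_pandas(df, data_definition=DataDefinition())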

Metric Types Module

Metric Architecture

The metric types module defines the core abstraction for all metrics within Evidently. Metrics are the primary mechanism for evaluating data quality, model performance, and statistical properties of datasets. The module establishes a class hierarchy that enables both simple single-value metrics and complex multi-dimensional metric calculations.

Metrics in Evidently follow a consistent pattern where each metric is associated with specific columns or data subsets and produces standardized result objects. This design enables metrics to be composed, combined, and analyzed together within reports and dashboards.

Metric Execution Model

The metric execution model defines how metrics are calculated, what inputs they receive, and how results are structured. Each metric implementation receives a data context containing the relevant dataset columns and produces a MetricResult object that encapsulates the calculated values and metadata.

The execution model supports both eager and lazy evaluation strategies. Some metrics compute their results immediately upon execution, while others may defer computation until results are actually needed. This flexibility enables optimizations in scenarios where multiple metrics share intermediate calculations or where memory efficiency is a concern.

Metrics can be configured with parameters that control their behavior, including aggregation methods, thresholds, and visualization options. The parameter system uses type hints and validation to ensure configuration correctness while providing clear documentation through the API reference.

Report System

Report Architecture

The Report class serves as the primary interface for executing evaluations in Evidently. Reports orchestrate the execution of metrics and tests, collect results, and generate visualizations and summaries. The Report system provides a flexible framework for combining multiple evaluation components into cohesive analysis workflows.

Reports maintain state throughout their lifecycle, tracking which metrics and tests have been executed, their results, and any errors or warnings encountered during execution. This state management enables features like incremental recalculation, where only changed components need to be re-evaluated.

graph TD
    A[Create Report] --> B[Add Metrics]
    B --> C[Add Tests]
    C --> D[Run Report]
    D --> E[Collect Results]
    E --> F[Generate Visualizations]
    F --> G[Export Report]
    
    H[Reference Data] --> D
    I[Current Data] --> D
    
    E --> J[Metric Results]
    E --> K[Test Results]

Report Configuration

Reports support extensive configuration options that control execution behavior, result aggregation, and output formatting. Configuration options include parallel execution settings, result caching policies, and visualization preferences.

The Report class provides methods for both synchronous and asynchronous execution, enabling integration with various application architectures. Asynchronous execution is particularly useful for long-running evaluations or when reports are generated as part of batch processing workflows.

Report Export

The Report system includes export functionality that generates output in various formats including HTML, JSON, and Python dictionary structures. Export options enable customization of included visualizations, result detail levels, and styling preferences.
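
A sketch of the export step is shown below. The .json() and .dict() methods appear in the output-format table later in this manual; the HTML export helper name used here is an assumption.

# Export sketch: .json() and .dict() are listed in the output-format table later
# in this manual; save_html is an assumed name for the HTML export helper.
from evidently import Dataset, Report
from evidently.presets import TextEvals

def export_results(current: Dataset, reference: Dataset) -> dict:
    report = Report([TextEvals()])
    snapshot = report.run(current, reference)
    snapshot.save_html("report.html")  # assumed HTML export helper
    print(snapshot.json())             # serialized results
    return snapshot.dict()             # programmatic access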

Testing Framework

Test Definition

The testing module extends Evidently's evaluation capabilities to include pass/fail style assertions. Tests in Evidently are designed to validate specific conditions about data or model behavior, returning boolean results that indicate whether defined criteria are met.

Tests can be defined inline within metrics or as standalone components that check specific conditions. This dual approach provides flexibility in how validation logic is structured and reused across different evaluation scenarios.

Test Execution

Tests execute alongside metrics within the Report framework, allowing unified execution of both evaluation and validation logic. The test execution model mirrors the metric execution model, receiving data contexts and producing structured results that include pass/fail status and diagnostic information.

Test results include detailed failure messages that explain why a test did not pass, helping users understand data quality issues or model behavior problems. These diagnostic messages reference specific rows, values, or statistical properties that contributed to the test failure.

Component Registry System

Registry Architecture

The registry system provides a centralized mechanism for managing and discovering Evidently components. Registries maintain mappings between component identifiers and their implementations, enabling dynamic resolution of components at runtime.

Evidently uses registries extensively for metrics, tests, descriptors, and other extensible components. This architecture allows the framework to support third-party extensions while maintaining a consistent interface for component discovery and execution.

Component Registration

Components are registered with their respective registries using decorators or explicit registration calls. The registration process captures metadata about each component, including its name, version, parameters, and dependencies. This metadata enables automatic documentation generation and configuration interfaces.

Registration supports versioning, allowing multiple versions of a component to coexist and enabling rollback to previous versions when needed. Version management is particularly important for production deployments where stability and reproducibility are critical.
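
The registry internals are not reproduced here; the following generic sketch only illustrates the decorator-based registration idea described above and is not Evidently's actual registry code.

# Generic illustration of decorator-based registration, not Evidently's registry:
# the decorator records each class under an identifier for later lookup.
from typing import Dict, Type

METRIC_REGISTRY: Dict[str, Type] = {}

def register_metric(name: str):
    def wrapper(cls: Type) -> Type:
        METRIC_REGISTRY[name] = cls
        return cls
    return wrapper

@register_metric("row_count")
class RowCountMetric:  # hypothetical component
    def calculate(self, data) -> int:
        return len(data)

metric_cls = METRIC_REGISTRY["row_count"]  # dynamic resolution by identifier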

Metrics Registry

The metrics registry extends the component registry pattern specifically for metrics. It provides specialized functionality for metric discovery, parameter validation, and result aggregation.

graph TD
    A[Metrics Registry] --> B[Metric Base Class]
    A --> C[Metric Parameters]
    A --> D[Metric Results]
    
    B --> E[Column Metrics]
    B --> F[Dataset Metrics]
    B --> G[Statistical Metrics]
    
    D --> H[Visualization Configs]
    D --> I[Aggregation Methods]

Metric Tests Registry

The metric tests registry manages test definitions that validate metric results or data conditions. This registry enables tests to reference metrics by name and access their results for validation purposes.

Tests registered in this registry can be automatically discovered and included in reports based on configuration. This discovery mechanism enables declarative specification of validation requirements without requiring explicit test instantiation.

Integration Patterns

SDK Integration

The Core Components integrate with Evidently's SDK layer, which provides higher-level APIs for cloud and remote deployments. The SDK layer uses the Core Components as its foundation while adding capabilities for remote execution, result storage, and collaborative features.

The CloudConfigAPI class demonstrates this integration, providing methods for managing project configurations, descriptor configs, and artifact versions. These high-level APIs delegate core functionality to the Core Components while handling network communication and serialization.

Adapter Pattern

Evidently uses adapters to bridge different execution contexts and storage backends. Adapters transform between internal data structures and external representations, enabling Evidently to work with various data sources and deployment environments.

The adapter pattern is particularly important for cloud deployments where configurations and results are stored remotely. Adapters handle serialization, deserialization, and API communication while maintaining compatibility with the Core Component interfaces.

Configuration Management

Config Versioning

The Core Components support configuration versioning through specialized config classes. Each versioned configuration maintains a history of changes, enabling audit trails and rollback capabilities.

Configurations are organized by project, with separate namespaces for different config types like metrics, tests, and descriptors. This organization enables clear separation of concerns while maintaining relationships between related configurations.

Remote Configuration

For deployments requiring centralized configuration management, Evidently supports remote configuration storage through the CloudConfigAPI. Remote configurations can be fetched, updated, and versioned through the SDK, with changes automatically synchronized to connected clients.

Summary

The Core Components provide the essential infrastructure for Evidently's evaluation capabilities. Through a combination of base types, metric abstractions, reporting mechanisms, and registry systems, these components establish a flexible and extensible framework for ML evaluation.

The modular design enables users to leverage individual components for specific use cases or combine them into comprehensive evaluation pipelines. The registration-based architecture ensures extensibility while maintaining consistency across the framework.

Understanding these core concepts is essential for effectively using Evidently and for extending the framework with custom metrics, tests, and integrations.

Source: https://github.com/evidentlyai/evidently / Human Manual

Data Management and Data Flow

Related topics: Architecture Overview, ML Model Evaluation, LLM Evaluation and Judging


Overview

Data Management and Data Flow in Evidently encompasses the mechanisms by which data is ingested, transformed, stored, and processed throughout the evaluation lifecycle. The system provides a unified abstraction layer that handles multiple data sources (pandas DataFrames, CSV files, Parquet files) and integrates with the broader evaluation pipeline including Reports, Test Suites, and LLM-specific processing like RAG (Retrieval-Augmented Generation) systems.

The architecture separates concerns between core data structures (Dataset, Container), SDK-level abstractions for cloud/local deployment, and specialized LLM data processing components. This modularity allows users to work with familiar Python data structures while benefiting from Evidently's evaluation and monitoring capabilities.

Sources: src/evidently/core/datasets.py:1-50

Core Data Abstraction

The Dataset Class

The Dataset class serves as the primary data container throughout Evidently. It wraps various data sources into a standardized interface that supports:

  • Direct creation from pandas DataFrames
  • Automatic type inference via DataDefinition
  • Descriptor-based feature computation
  • Metadata and tagging support

# Basic Dataset creation from pandas
from evidently import Dataset, DataDefinition

dataset = Dataset.from_pandas(
    dataframe,
    data_definition=DataDefinition(),
    metadata={"source": "production"},
    tags=["eval", "2024"]
)

Sources: src/evidently/core/datasets.py:30-45

Dataset Factory Methods

| Method | Purpose | Input Types |
| --- | --- | --- |
| from_pandas() | Create Dataset from pandas DataFrame | pd.DataFrame |
| from_any() | Convert various types to Dataset | pd.DataFrame, Dataset |
| as_dataframe() | Extract underlying DataFrame | N/A (output) |
| column() | Retrieve specific column as DatasetColumn | column name |
| subdataset() | Filter dataset by column value | column name, label |

Sources: src/evidently/core/datasets.py:50-80

The from_any() static method implements a factory pattern that handles type conversion automatically:

@staticmethod
def from_any(dataset: PossibleDatasetTypes) -> "Dataset":
    if isinstance(dataset, Dataset):
        return dataset
    if isinstance(dataset, pd.DataFrame):
        return Dataset.from_pandas(dataset)
    raise ValueError(f"Unsupported dataset type: {type(dataset)}")

Sources: src/evidently/core/datasets.py:60-70

Data Flow Architecture

graph TD
    A[Input Data: pd.DataFrame / CSV / Parquet] --> B[Dataset.from_any]
    B --> C{Data Type Check}
    C -->|pd.DataFrame| D[Dataset.from_pandas]
    C -->|Already Dataset| E[Return as-is]
    D --> F[DataDefinition Type Inference]
    F --> G[Add Descriptors]
    G --> H[Dataset Object]
    H --> I[Report.run / TestSuite.run]
    I --> J[Snapshot with Results]
    J --> K[Export: JSON / HTML / Dict]

Container Architecture

The Container class provides the underlying storage and management mechanism for datasets within the Evidently ecosystem. Containers maintain references to data assets and provide CRUD operations for dataset lifecycle management.

Key container responsibilities include:

  • Storing dataset references and metadata
  • Managing dataset versioning
  • Providing query and filtering capabilities
  • Handling persistence to storage backends

Sources: src/evidently/core/container.py:1-30

Container-Dataset Relationship

graph LR
    A[Container] -->|manages| B[Dataset Registry]
    B -->|references| C[Dataset v1]
    B -->|references| D[Dataset v2]
    B -->|references| E[Dataset vN]
    C -->|wraps| F[pd.DataFrame]
    D -->|wraps| G[pd.DataFrame]
    E -->|wraps| H[pd.DataFrame]

SDK-Level Data Management

The SDK layer provides deployment-agnostic data handling through the CloudConfigAPI and local storage adapters. This abstraction enables consistent data operations whether running locally or connected to Evidently Cloud.

Sources: src/evidently/sdk/datasets.py:1-50

SDK Dataset Operations

| Operation | Method Signature | Description |
| --- | --- | --- |
| Create | add_dataset(project_id, dataset, name, description, link) | Add dataset to project |
| Read | load_dataset(dataset_id) | Retrieve dataset by UUID |
| List | list_datasets(project, origins) | List all datasets in project |
| Update | update_dataset(dataset_id, dataset) | Modify existing dataset |
| Delete | delete_dataset(dataset_id) | Remove dataset from storage |

Sources: src/evidently/ui/workspace.py:50-100

LLM-Specific Data Processing

RAG Data Pipeline

For LLM applications using Retrieval-Augmented Generation, Evidently provides specialized data processing components that handle document ingestion, chunking, and indexing.

Sources: src/evidently/llm/rag/index.py:1-50

graph TD
    A[Raw Documents] --> B[RAG Splitter]
    B --> C[Document Chunks]
    C --> D[RAG Index]
    D --> E[Vector Store]
    E --> F[Retrieval Query]
    F --> G[Context + Query]
    G --> H[LLM Response]

Document Splitting

The splitter.py module handles text chunking with configurable parameters:

  • Chunk size: Target size for each text segment
  • Overlap: Amount of overlap between consecutive chunks
  • Separators: Text boundaries for splitting priority

Sources: src/evidently/llm/rag/splitter.py:1-40
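
The sketch below illustrates the chunk size and overlap parameters with a generic character-based splitter; it is not the implementation in splitter.py.

# Generic character-based chunking sketch (not the splitter.py implementation):
# each chunk is chunk_size characters long and overlaps the previous one by overlap characters.
from typing import List

def split_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]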

RAG Indexing

The index module manages the vector storage and retrieval:

  • Embedding generation for document chunks
  • Vector store integration (FAISS, ChromaDB, etc.)
  • Similarity search capabilities
  • Metadata preservation for filtering

Sources: src/evidently/llm/rag/index.py:50-100

Data Generation for LLM Evals

The datagen/base.py module provides infrastructure for synthetic data generation, enabling users to create evaluation datasets programmatically.

Sources: src/evidently/llm/datagen/base.py:1-50

Data Generation Workflow

graph LR
    A[Seed Data / Templates] --> B[LLM Generator]
    B --> C[Generated Samples]
    C --> D[Quality Validation]
    D -->|Pass| E[Evaluation Dataset]
    D -->|Fail| F[Regeneration Loop]
    F --> B

DataGeneratorBase Class

The base class defines the contract for data generation:

| Method | Purpose |
| --- | --- |
| generate() | Create new data samples |
| validate() | Check generated data quality |
| expand() | Augment existing datasets |

Sources: src/evidently/llm/datagen/base.py:30-60

Data Flow in Report Execution

Reports represent the primary consumer of dataset objects within Evidently. The run() method orchestrates the complete data flow from input to output.

def run(
    self,
    current_data,
    reference_data=None,
    additional_data=None,
    timestamp=None,
    metadata=None,
    tags=None,
    name=None,
) -> Snapshot:
    # Data validation
    if isinstance(current_data, pd.DataFrame) and current_data.empty:
        raise ValueError("current_data must contain at least one column...")
    
    # Convert to Dataset objects
    current_dataset = Dataset.from_any(current_data)
    reference_dataset = Dataset.from_any(reference_data) if reference_data else None
    
    # Execute metrics and generate snapshot
    ...

Sources: src/evidently/core/report.py:100-130

Data Validation Rules

| Condition | Error Raised |
| --- | --- |
| Empty current_data DataFrame | ValueError: current_data must contain at least one column; received an empty DataFrame |
| Empty reference_data DataFrame | ValueError: reference_data must contain at least one column; received an empty DataFrame |
| Unsupported dataset type | ValueError: Unsupported dataset type: {type(dataset)} |

Sources: src/evidently/core/datasets.py:65-70

Workspace Dataset Management

The Workspace class provides high-level dataset management capabilities for organizing evaluations across projects.

class Workspace:
    def add_dataset(
        self,
        project_id: STR_UUID,
        dataset: Dataset,
        name: str,
        description: Optional[str] = None,
        link: Optional[SnapshotLink] = None,
    ) -> DatasetID:
        """Add a dataset to a project."""
        return self.datasets.add(project_id=project_id, dataset=dataset, ...)
    
    def load_dataset(self, dataset_id: DatasetID) -> Dataset:
        """Load a dataset by ID."""
        return self.datasets.load(dataset_id)
    
    def list_datasets(
        self,
        project: STR_UUID,
        origins: Optional[List[str]] = None,
    ) -> DatasetList:
        """List all datasets in a project."""
        return self.datasets.list(project, origins=origins)

Sources: src/evidently/ui/workspace.py:50-80

Dataset Metadata Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | Yes | Human-readable dataset name |
| description | str | No | Detailed description |
| link | SnapshotLink | No | Associated snapshot reference |
| created_at | datetime | Auto | Creation timestamp |
| author | str | Auto | Creator identifier |

Summary

Data Management in Evidently follows a consistent pattern across the platform:

  1. Ingestion: Data enters through Dataset.from_pandas() or Dataset.from_any() factory methods
  2. Validation: Empty DataFrames and unsupported types are rejected early
  3. Processing: Descriptors and metrics operate on the standardized Dataset interface
  4. Output: Results are packaged into Snapshots with multiple export formats
  5. Persistence: Datasets can be stored and retrieved through Workspace and SDK APIs

The separation between core data structures, SDK abstractions, and specialized LLM components enables flexible deployment while maintaining a simple user-facing API based on pandas DataFrames.

Sources: src/evidently/core/datasets.py:1-100, src/evidently/core/report.py:100-150, src/evidently/sdk/datasets.py:1-50, src/evidently/llm/rag/index.py:1-100, src/evidently/llm/rag/splitter.py:1-50, src/evidently/llm/datagen/base.py:1-60

ML Model Evaluation

Related topics: LLM Evaluation and Judging, Descriptors and Features System, Presets and Metric Presets


Overview

ML Model Evaluation in Evidently is a comprehensive module that enables data scientists and ML engineers to assess, test, and monitor machine learning models throughout their lifecycle—from experiments to production. The evaluation system supports both predictive tasks (classification, regression) and recommendation systems.

The evaluation framework is built around the concept of Metrics and Reports, where metrics are individual evaluation components that can be composed into reports for comprehensive model assessment.

Sources: src/evidently/core/report.py:1-50

Core Architecture

graph TD
    A[Dataset] --> B[Report]
    A --> C[Test Suite]
    B --> D[Interactive Report]
    B --> E[JSON/Dict Output]
    C --> F[Pass/Fail Results]
    G[Metrics] --> B
    H[Presets] --> B
    G --> C

Supported Evaluation Types

| Evaluation Type | Purpose | Key Metrics |
| --- | --- | --- |
| Classification | Evaluate classifier performance | Accuracy, Precision, Recall, F1, ROC-AUC |
| Regression | Evaluate regression models | MAE, MSE, RMSE, R2 |
| Data Quality | Assess input data integrity | Missing values, duplicates, correlations |
| Data Drift | Detect distribution changes | PSI, KS test, L Infinity |
| Recommendation Systems | Evaluate recsys models | Precision@K, Recall@K, NDCG |
| Embeddings | Evaluate embedding quality | Cosine similarity, drift detection |

Classification Metrics

Purpose and Scope

Classification metrics evaluate the performance of classification models by comparing predicted labels against actual labels. Evidently supports both binary and multiclass classification evaluation scenarios.

Sources: src/evidently/metrics/classification.py:1-100

Available Metrics

| Metric | Description | Applicable Task Types |
| --- | --- | --- |
| Accuracy | Proportion of correct predictions | Binary, Multiclass |
| Precision | Positive predictive value | Binary, Multiclass |
| Recall | Sensitivity, true positive rate | Binary, Multiclass |
| F1Score | Harmonic mean of precision and recall | Binary, Multiclass |
| RocAuc | Area under the ROC curve | Binary, Multiclass |
| ConfusionMatrix | Cross-tabulation of predictions vs actuals | Binary, Multiclass |
| PrecisionRecallCurve | Precision-Recall tradeoff visualization | Binary |
| ClassRepresentation | Distribution of classes in predictions | Multiclass |

Usage Example

from evidently import Report
from evidently.metrics import Accuracy, Precision, Recall, F1Score, RocAuc

report = Report([
    Accuracy(),
    Precision(),
    Recall(),
    F1Score(),
    RocAuc()
])

result = report.run(current_data=current_dataset, reference_data=reference_dataset)

Legacy Classification Performance

The legacy module provides comprehensive classification evaluation with detailed visualizations.

Sources: src/evidently/legacy/metrics/classification_performance/__init__.py:1-50

Key components include:

  • ClassificationPerformanceMetrics: Core metrics container
  • Visualization widgets: Confusion matrix plots, ROC curves, PR curves
  • Per-class analysis: Detailed breakdown for multiclass scenarios

Regression Metrics

Purpose and Scope

Regression metrics assess the performance of regression models by computing various error measures and statistical properties of prediction residuals.

Sources: src/evidently/metrics/regression.py:1-100

Available Metrics

| Metric | Description | Unit |
| --- | --- | --- |
| MeanError | Average prediction error | Same as target |
| MeanAbsoluteError | MAE, average absolute error | Same as target |
| MeanSquaredError | MSE, average squared error | Squared target units |
| RootMeanSquaredError | RMSE, square root of MSE | Same as target |
| ErrorStd | Standard deviation of errors | Same as target |
| ErrorNormality | Shapiro-Wilk test for residual normality | p-value |
| R2Score | Coefficient of determination | Dimensionless |
| ErrorPercentile | Percentile-based error analysis | Same as target |

Usage Example

from evidently import Report
from evidently.metrics import (
    MeanAbsoluteError,
    MeanSquaredError,
    R2Score,
    ErrorPercentile
)

report = Report([
    MeanAbsoluteError(),
    MeanSquaredError(),
    R2Score(),
    ErrorPercentile(percentile=95)
])

result = report.run(current_data=current_dataset, reference_data=reference_dataset)

Legacy Regression Performance

The legacy regression performance module provides time-series analysis of predictions.

Sources: src/evidently/legacy/metrics/regression_performance/__init__.py:1-50

Features include:

  • Predicted vs Actual plots: Time-series visualization
  • Residual distribution analysis: Histograms and QQ plots
  • Error distribution by feature: Feature-level error breakdown

graph LR
    A[Current Data] --> B[RegressionReport]
    C[Reference Data] --> B
    B --> D[Predicted vs Actual]
    B --> E[Error Distribution]
    B --> F[Performance Metrics]

Data Quality Metrics

Purpose and Scope

Data quality metrics evaluate the integrity and quality of input data. These metrics are essential for understanding whether model predictions are reliable and for identifying data pipeline issues.

Sources: src/evidently/metrics/data_quality.py:1-100

Available Metrics

| Metric | Description |
| --- | --- |
| ColumnCount | Number of columns in dataset |
| RowCount | Number of rows in dataset |
| ValueStats | Statistical summary (min, max, mean, std) |
| MissingValuesMetric | Count and percentage of missing values |
| UniqueValuesCount | Number of unique values per column |
| DataInconsistencyMetric | Detection of data inconsistencies |
| TextStats | Statistics for text columns (length, word count) |

Usage Example

from evidently import Report, Dataset
from evidently.metrics import ColumnCount, ValueStats, MissingValuesMetric

dataset = Dataset.from_pandas(df, data_definition=DataDefinition())

report = Report([
    ColumnCount(),
    ValueStats(column="target"),
    MissingValuesMetric()
])

result = report.run(dataset, None)

Dataset Statistics

Purpose and Scope

Dataset statistics provide comprehensive descriptive statistics about datasets, enabling quick overview and comparison between reference and current data distributions.

Sources: src/evidently/metrics/dataset_statistics.py:1-100

Available Presets

| Preset | Description |
| --- | --- |
| DataSummaryPreset | Complete dataset overview |
| DataDriftPreset | Distribution comparison between reference and current |
| TargetDriftPreset | Target variable drift analysis |
| NumTargetDriftPreset | Numeric target drift analysis |
| CatTargetDriftPreset | Categorical target drift analysis |

Statistical Tests for Drift Detection

Sources: src/evidently/legacy/calculations/data_drift.py:1-100

| Test | Applicable Data Type | Description |
| --- | --- | --- |
| KS | Numerical | Kolmogorov-Smirnov test |
| ZScore | Numerical | Z-score based drift detection |
| TTest | Numerical | Two-sample t-test for mean comparison |
| ChiSquare | Categorical | Chi-square test for category distribution |
| PSI | Both | Population Stability Index |
| L Infinity | Both | Max absolute difference |

Data Drift Workflow

graph TD
    A[Reference Dataset] --> F[Drift Detection Engine]
    B[Current Dataset] --> F
    F --> C[Statistical Tests]
    F --> D[Distribution Comparison]
    C --> E[Drift Report]
    D --> E
    E --> G{Drift Detected?}
    G -->|Yes| H[Alert / Action]
    G -->|No| I[Continue Monitoring]

Recommendation System Metrics

Purpose and Scope

Recommendation system (recsys) metrics evaluate the quality of recommendations generated by recommendation models.

Sources: src/evidently/metrics/recsys.py:1-100

Available Metrics

| Metric | Description | K-Parameter |
| --- | --- | --- |
| PrecisionAtK | Precision at top-K recommendations | Required |
| RecallAtK | Recall at top-K recommendations | Required |
| NDCGAtK | Normalized Discounted Cumulative Gain at K | Required |
| HitRateAtK | Hit rate at top-K recommendations | Optional |
| MRRAtK | Mean Reciprocal Rank at K | Required |

Usage Example

from evidently import Report
from evidently.metrics import PrecisionAtK, RecallAtK, NDCGAtK

report = Report([
    PrecisionAtK(k=10),
    RecallAtK(k=10),
    NDCGAtK(k=10)
])

result = report.run(current_data=recs_dataset, reference_data=ref_recs_dataset)

Embeddings Metrics

Purpose and Scope

Embeddings metrics evaluate the quality and drift of vector embeddings, which are critical for LLM and semantic search applications.

Sources: src/evidently/metrics/embeddings.py:1-100

Available Metrics

| Metric | Description |
| --- | --- |
| Embedding Drift | Detect drift in embedding distributions |
| Cosine Similarity | Average cosine similarity between embeddings |
| Retrieval Metrics | Evaluate RAG and retrieval system quality |

Key Features

  • Semantic Drift Detection: Identify changes in embedding space distribution
  • Cluster Analysis: Analyze embedding cluster stability
  • Pairwise Comparison: Compare individual embedding pairs

Report Configuration

Creating a Report

from evidently import Report
from evidently.presets import DataDriftPreset

# Basic report with preset
report = Report([DataDriftPreset()])

# Report with custom metadata
report = Report(
    metrics=[...],
    metadata={"model_version": "v1.2.3"},
    tags=["production", "monthly"]
)

# Run with reference data for comparison
snapshot = report.run(current_dataset, reference_dataset)

Report Output Formats

| Format | Method | Use Case |
| --- | --- | --- |
| Interactive | Direct notebook display | Exploration, debugging |
| JSON | .json() | API responses, CI/CD |
| Python Dict | .dict() | Programmatic access |
| HTML | Export functionality | Shareable reports |

Test Suite Integration

Reports can be converted to Test Suites by adding pass/fail conditions, enabling automated regression testing.

from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValue

suite = TestSuite([
    TestColumnValue(column="accuracy", gt=0.95),
])

suite.run(current_data=dataset, reference_data=reference)

Best Practices

1. Always Use Reference Data for Comparison

For meaningful evaluation, maintain a reference dataset representing expected data distribution or model performance.

2. Choose Appropriate Metrics by Task Type

| Task Type | Recommended Metrics |
| --- | --- |
| Binary Classification | Accuracy, Precision, Recall, F1, ROC-AUC |
| Multiclass Classification | Per-class F1, Confusion Matrix, Macro/Micro Avg |
| Regression | MAE, RMSE, R2, Error Percentiles |
| Recommendation | Precision@K, Recall@K, NDCG@K |
| Data Quality | Missing values, duplicates, consistency |

3. Set Up Monitoring Schedules

For production models, schedule regular evaluations to detect performance degradation and data drift early.

4. Use Auto-generated Test Conditions

Evidently can auto-generate test thresholds from reference data:

from evidently.presets import DataDriftPreset

suite = TestSuite.from_reference(reference_data, [DataDriftPreset()])

Summary

Evidently's ML Model Evaluation module provides a comprehensive, modular framework for evaluating machine learning models across multiple dimensions:

  • Classification: Binary and multiclass classification metrics with visualizations
  • Regression: Comprehensive error metrics and residual analysis
  • Data Quality: Input data integrity checks
  • Data Drift: Distribution comparison and statistical tests
  • Recommendation Systems: Top-K recommendation quality metrics
  • Embeddings: Semantic drift detection for LLM applications

The system supports both one-off evaluations and continuous monitoring, with seamless integration into CI/CD pipelines through Test Suites.

Sources: src/evidently/core/report.py:1-50

LLM Evaluation and Judging

Related topics: ML Model Evaluation, Descriptors and Features System


Overview

Evidently provides a comprehensive framework for evaluating Large Language Model (LLM) powered systems through specialized descriptors called LLM Judges. These judges leverage LLM APIs to automatically assess and score various quality dimensions of LLM outputs, including relevance, coherence, factual accuracy, and task completion.

LLM Judges serve as evaluators that can be integrated into monitoring pipelines to continuously assess LLM performance without requiring manual human evaluation.

Architecture

graph TD
    A[Dataset with LLM Inputs/Outputs] --> B[LLM Judge Descriptors]
    B --> C[Prompt Templates]
    C --> D[LLM Provider API]
    D --> E[Parsed Evaluation Results]
    E --> F[Feature Descriptors]
    F --> G[Metrics & Tests]
    G --> H[Evidently Reports/Monitors]
    
    B1[ContextRelevanceLLMEval] --> B
    B2[CompletenessLLMEval] --> B
    B3[ContextQualityLLMEval] --> B
    B4[GroundednessLLMEval] --> B
    B5[RelevanceLLMEval] --> B
    
    C1[System Prompt] --> C
    C2[User Query] --> C
    C3[Evaluation Criteria] --> C

Core Components

LLM Judge Types

Evidently implements multiple specialized LLM judges for different evaluation dimensions:

| Judge Type | Purpose | Output |
| --- | --- | --- |
| ContextRelevanceLLMEval | Measures relevance of retrieved context to the query | Score 0-1, reasoning |
| CompletenessLLMEval | Evaluates completeness of response coverage | Score 0-1, reasoning |
| ContextQualityLLMEval | Assesses quality of context for answering | Score 0-1, reasoning |
| GroundednessLLMEval | Checks if response is grounded in provided context | Score 0-1, reasoning |
| RelevanceLLMEval | Evaluates semantic relevance of response to query | Score 0-1, reasoning |

Sources: src/evidently/descriptors/generated_descriptors.py:1-200

Feature Descriptor Pattern

LLM Judges follow the FeatureDescriptor pattern, wrapping LLM evaluation features with optional aliasing and test definitions:

FeatureDescriptor(
    feature=feature,
    alias=alias,  # Custom display name
    tests=tests   # Optional validation tests
)

Sources: src/evidently/descriptors/generated_descriptors.py:30-35

Configuration Options

Common Parameters

All LLM Judge functions share a common parameter structure:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| column_name | str | Yes | - | Name of the text column to evaluate |
| provider | str | No | "openai" | LLM provider (openai, anthropic, etc.) |
| model | str | No | "gpt-4o-mini" | Model identifier |
| additional_columns | Dict[str, str] | No | None | Extra context columns for evaluation |
| include_category | bool | No | None | Include categorical classification in output |
| include_score | bool | No | None | Include numeric score in output |
| include_reasoning | bool | No | None | Include explanation in output |
| uncertainty | Uncertainty | No | None | Uncertainty handling strategy |
| alias | str | No | None | Custom name for the feature |
| tests | List | No | None | Tests to apply to the feature |

Sources: src/evidently/descriptors/generated_descriptors.py:50-70
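
A hedged example that combines these parameters with one of the judges from the table above; the import path and the mapping semantics of additional_columns are assumptions, and the column names are illustrative.

# Hedged example of configuring one judge with the parameters above. The import
# path and the additional_columns mapping semantics are assumptions.
from evidently.descriptors import GroundednessLLMEval

groundedness = GroundednessLLMEval(
    column_name="response",
    provider="openai",
    model="gpt-4o-mini",
    additional_columns={"context": "retrieved_context"},  # illustrative mapping
    include_score=True,
    include_reasoning=True,
    alias="Groundedness",
)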

Uncertainty Handling

The uncertainty parameter enables robust evaluation by detecting when the LLM is uncertain about its assessment:

uncertainty: Optional[Uncertainty] = None

This allows the system to flag low-confidence evaluations rather than returning potentially incorrect scores.

Sources: src/evidently/descriptors/generated_descriptors.py:58

Prompt System

Prompt Templates

Evidently uses structured prompt templates for LLM evaluation:

@dataclasses.dataclass
class LLMJudgePrompts:
    system: str
    user: str

Prompts are rendered with dynamic variables including:

  • Query text
  • Response text
  • Reference context
  • Evaluation criteria

Sources: src/evidently/llm/prompts/__init__.py

Prompt Rendering

The prompt rendering system processes templates with type-safe variable substitution:

def render_prompt(
    template: PromptTemplate,
    **kwargs: Any
) -> str:
    ...  # body omitted; renders the template with the given variables

Sources: src/evidently/llm/utils/prompt_render.py
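
As a plain illustration of the idea (not the actual prompt_render module), filling a system/user prompt pair with the dynamic variables listed above could look like this:

# Plain-Python illustration of prompt rendering, not the prompt_render module.
system_template = "You are an evaluator. Criteria: {criteria}"
user_template = "Query: {query}\nResponse: {response}\nReturn JSON with 'score' and 'reasoning'."

system_prompt = system_template.format(criteria="Is the response grounded in the context?")
user_prompt = user_template.format(
    query="What is data drift?",
    response="A change in the input data distribution.",
)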

Output Parsing

Response Parsing

Evaluation results are parsed using structured output extraction:

| Parser | Purpose |
| --- | --- |
| parse_json_response() | Extract JSON-structured evaluations |
| parse_score() | Extract numeric scores |
| parse_reasoning() | Extract explanation text |
| parse_category() | Extract categorical labels |

Sources: src/evidently/llm/utils/parsing.py
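
The functions in the table are the actual entry points; the generic sketch below only illustrates the underlying idea of pulling a structured evaluation out of a raw LLM reply.

# Generic illustration of structured-output extraction, not the parsing module:
# find the JSON object in the raw reply and read the score and reasoning fields.
import json
import re
from typing import Any, Dict

def extract_evaluation(raw_response: str) -> Dict[str, Any]:
    match = re.search(r"\{.*\}", raw_response, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in LLM response")
    payload = json.loads(match.group(0))
    return {"score": payload.get("score"), "reasoning": payload.get("reasoning")}

result = extract_evaluation('Evaluation: {"score": 0.5, "reasoning": "Partially grounded."}')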

Score Calculation

Scores are calculated based on the evaluation criteria defined in each judge:

  • 0.0: Does not meet criteria
  • 0.5: Partially meets criteria
  • 1.0: Fully meets criteria

The score can be combined with reasoning and category outputs for comprehensive evaluation reports.

Integration with Evidently Reports

Usage in Reports

LLM Judges can be added to Evidently reports:

from evidently.llm import RelevanceLLMEval

report = Report(metrics=[
    RelevanceLLMEval(
        column_name="response",
        include_score=True,
        include_reasoning=True
    )
])

Usage in Monitoring

For continuous monitoring, LLM Judges are integrated into dashboard widgets:

from evidently.dashboard import Dashboard

dashboard = Dashboard([
    RelevanceLLMEval(column_name="response")
])

Sources: ui/packages/evidently-ui-lib/src/components/Descriptors/Features/LLMJudge/template.tsx

UI Components

LLMJudgeTemplate Component

The frontend provides visualization for LLM judge configurations:

| Property | Type | Description |
| --- | --- | --- |
| state | LLMJudgeState | Current judge configuration |
| errors | FormErrors | Validation errors |
| availableTags | string[] | Available categorization tags |

Sources: ui/packages/evidently-ui-lib/src/components/Descriptors/Features/LLMJudge/template.tsx

Visualization States

The UI renders different states for judge configurations:

  1. Uncertainty Configuration: Shows uncertainty handling options
  2. Multiclass Classification: Displays class criteria for classification modes
  3. Criteria Preview: Shows evaluation criteria as formatted text
  4. Output Options: Displays configured output fields (reasoning, category, score)

Legacy Components

Legacy LLM Judges

The legacy implementation in evidently.legacy provides backward compatibility:

from evidently.legacy.descriptors.llm_judges import (
    CompletenessLLMEval as CompletenessLLMEvalV1
)

Sources: src/evidently/legacy/descriptors/llm_judges.py

Sentiment Descriptor

Specialized descriptor for sentiment analysis:

from evidently.legacy.descriptors.sentiment_descriptor import SentimentDescriptor

Features include:

  • Sentiment polarity detection (positive, negative, neutral)
  • Confidence scoring
  • Integration with reporting pipeline

Sources: src/evidently/legacy/descriptors/sentiment_descriptor.py

Scorers and Optimization

LLM Scorers

Scorers provide optimized evaluation functions:

from evidently.llm.optimization.scorers import LLMScorer

Scorers support:

  • Batch evaluation
  • Caching of results
  • Configurable thresholds

Sources: src/evidently/llm/optimization/scorers.py

Best Practices

1. Choosing Evaluation Dimensions

| Use Case | Recommended Judges |
| --- | --- |
| RAG Systems | ContextRelevance, Groundedness, Relevance |
| Summarization | Completeness, Relevance |
| Q&A Systems | ContextQuality, Groundedness |
| General Chat | All dimensions |

2. Score Interpretation

  • 0.0 - 0.3: Significant issues detected
  • 0.4 - 0.6: Partial criteria met
  • 0.7 - 1.0: Criteria well satisfied

3. Uncertainty Handling

Enable uncertainty handling when:

  • Operating on edge cases
  • Using smaller models
  • Evaluating ambiguous inputs


Sources: src/evidently/descriptors/generated_descriptors.py:1-200

Descriptors and Features System

Related topics: LLM Evaluation and Judging, ML Model Evaluation


Overview

The Descriptors and Features System is a core component of the Evidently framework that provides extensible ways to extract, compute, and evaluate characteristics from data columns. This system enables users to define custom descriptors that wrap features with validation tests, making it easy to assess data quality, text properties, and model behavior in a unified evaluation pipeline.

Descriptors serve as wrappers around feature implementations, adding metadata, display names, and optional test configurations. The system supports both legacy v1 features and modern descriptor implementations, providing backward compatibility while enabling new functionality.

Architecture

The system follows a layered architecture with clear separation between descriptor definitions, feature implementations, and registry management.

graph TD
    subgraph "Public API Layer"
        PD[Public Descriptors<br/>TextMatch, HuggingFace, etc.]
        CD[Custom Descriptors<br/>FeatureDescriptor]
    end
    
    subgraph "Registry Layer"
        REG[DescriptorRegistry]
    end
    
    subgraph "Implementation Layer"
        LEG[Legacy Features V1<br/>hf_feature, text_length]
        NEW[New Features<br/>_text_length, text_match]
    end
    
    subgraph "Evaluation Layer"
        EVAL[Evidently Metrics<br/>and Tests]
    end
    
    PD --> REG
    CD --> REG
    REG --> LEG
    REG --> NEW
    LEG --> EVAL
    NEW --> EVAL
    
    style PD fill:#90EE90
    style CD fill:#87CEEB
    style REG fill:#FFD700
    style LEG fill:#FFA07A
    style NEW fill:#DDA0DD

Directory Structure

src/evidently/
├── descriptors/
│   ├── __init__.py              # Public API exports
│   ├── _text_length.py          # Text length descriptor implementation
│   ├── _custom_descriptors.py   # FeatureDescriptor class
│   ├── generated_descriptors.py # Generated descriptor functions
│   └── text_match.py            # Text matching descriptor
├── core/
│   └── registries/
│       └── descriptors.py       # Descriptor registry implementation
└── legacy/
    ├── descriptors/             # Legacy descriptor implementations
    └── features/                # Legacy feature implementations

Core Components

FeatureDescriptor

The FeatureDescriptor class is the central abstraction that wraps a feature with additional metadata and optional tests.

# Source: src/evidently/descriptors/_custom_descriptors.py
class FeatureDescriptor(BaseDescriptor):
    """Descriptor that wraps a feature with tests and metadata."""
    
    def __init__(
        self,
        feature: AnyFeatureType,
        alias: Optional[str] = None,
        tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
    ):
        self.feature = feature
        self.alias = alias
        self.tests = tests or []

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| feature | AnyFeatureType | The underlying feature implementation |
| alias | Optional[str] | Display name for the descriptor |
| tests | Optional[List] | Optional list of tests to apply |

DescriptorRegistry

The registry manages descriptor registration and lookup, enabling dynamic descriptor discovery.

# Source: src/evidently/core/registries/descriptors.py
class DescriptorRegistry:
    """Registry for managing descriptor instances."""
    
    def __init__(self):
        self._descriptors: Dict[str, Descriptor] = {}
    
    def register(self, name: str, descriptor: Descriptor) -> None:
        """Register a descriptor with a given name."""
        self._descriptors[name] = descriptor
    
    def get(self, name: str) -> Optional[Descriptor]:
        """Retrieve a descriptor by name."""
        return self._descriptors.get(name)
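
A usage sketch for the registry shown above; the registered descriptor and its name are illustrative.

# Usage sketch for DescriptorRegistry; the registered descriptor and name are illustrative.
from evidently.descriptors import TextLength

registry = DescriptorRegistry()
registry.register("review_length", TextLength(column_name="review_text", alias="Review Length"))
descriptor = registry.get("review_length")  # returns None if the name is unknown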

Available Descriptors

Text Length Descriptor

Computes text length statistics for string columns, including character count, word count, and sentence statistics.

Source: src/evidently/descriptors/_text_length.py

def TextLength(
    column_name: str,
    alias: Optional[str] = None,
    mode: str = "chars",
    tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
    """Compute text length metrics for a column.
    
    Args:
        column_name: Name of the text column to analyze.
        alias: Display name for the descriptor.
        mode: Measurement mode - "chars" or "words".
        tests: Optional list of tests to apply.
    """
    from evidently.legacy.features.text_length import TextLength as TextLengthV1
    
    feature = TextLengthV1(column_name=column_name, mode=mode, display_name=alias)
    return FeatureDescriptor(feature=feature, alias=alias, tests=tests)

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| column_name | str | Required | Column to analyze |
| alias | Optional[str] | None | Custom display name |
| mode | str | "chars" | "chars" for characters, "words" for word count |
| tests | Optional[List] | None | Tests to attach |

Text Match Descriptor

Matches text content against patterns or lists, useful for validation and filtering.

Source: src/evidently/descriptors/text_match.py

def TextMatch(
    column_name: str,
    alias: str,
    words_list: List[str],
    mode: str = "match",
    lemmatize: bool = False,
    tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
    """Match text against a word list.
    
    Args:
        column_name: Text column to match against.
        alias: Name for the descriptor.
        words_list: List of words/patterns to match.
        mode: Matching mode - "match", "any", "all".
        lemmatize: Whether to apply lemmatization.
        tests: Optional test list.
    """
    from evidently.legacy.features.text_features import TextMatch as TextMatchV1
    
    feature = TextMatchV1(
        column_name=column_name,
        words_list=words_list,
        mode=mode,
        lemmatize=lemmatize,
        display_name=alias
    )
    return FeatureDescriptor(feature=feature, alias=alias, tests=tests)

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| column_name | str | Required | Target text column |
| alias | str | Required | Descriptor name |
| words_list | List[str] | Required | Words to match |
| mode | str | "match" | "match", "any", or "all" |
| lemmatize | bool | False | Enable lemmatization |
| tests | Optional[List] | None | Tests to apply |
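
A small sketch built from the signature above, flagging answers that contain any refusal phrase:

from evidently.descriptors import TextMatch

# True for rows whose "answer" contains any of the listed phrases.
refusal_match = TextMatch(
    column_name="answer",
    alias="Refusal Match",
    words_list=["sorry", "cannot", "unable"],
    mode="any",
)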

HuggingFace Descriptor

Applies HuggingFace models to text columns for various NLP tasks including toxicity detection.

Source: src/evidently/descriptors/generated_descriptors.py

def HuggingFace(
    column_name: str,
    model: str,
    params: dict,
    alias: str,
    tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
    """Apply a HuggingFace model to text column.
    
    Args:
        column_name: Name of the text column to process.
        model: HuggingFace model name or path.
        params: Additional parameters for the model.
        alias: Alias for the descriptor.
        tests: Optional list of tests to apply.
    """
    from evidently.legacy.features.hf_feature import HuggingFaceFeature
    
    feature = HuggingFaceFeature(
        column_name=column_name,
        model=model,
        params=params,
        display_name=alias
    )
    return FeatureDescriptor(feature=feature, alias=alias, tests=tests)
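
A hedged usage sketch; the model name is an arbitrary public checkpoint and the empty params dict is an assumption, not a shipped default.

from evidently.descriptors import HuggingFace

# Score each answer with an external HuggingFace model (example checkpoint).
hf_sentiment = HuggingFace(
    column_name="answer",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    params={},
    alias="HF Sentiment",
)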

HuggingFace Toxicity Descriptor

Specialized descriptor for detecting toxic content using HuggingFace models.

Source: src/evidently/descriptors/generated_descriptors.py:39-56

def HuggingFaceToxicity(
    column_name: str,
    alias: str,
    model: Optional[str] = None,
    toxic_label: Optional[str] = None,
    tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
    """Detect toxicity in text using HuggingFace models.
    
    Args:
        column_name: Name of the text column to check.
        alias: Alias for the descriptor.
        model: HuggingFace model name or path. If None, uses default.
        toxic_label: Label for toxic content.
        tests: Optional list of tests to apply.
    """

Integration with Evidently Metrics

Descriptors are designed to integrate seamlessly with Evidently's metric and test evaluation system. When a descriptor is included in an evaluation, the underlying feature is computed and the results are automatically formatted for display.

graph LR
    A[Input Data] --> B[Descriptor Definition]
    B --> C[Feature Computation]
    C --> D[Test Evaluation]
    D --> E[Metric Results]
    C --> F[Visualization Data]
    F --> G[HTML Widgets]
    
    style A fill:#87CEEB
    style E fill:#90EE90
    style G fill:#FFD700

Legacy Features System

The legacy features system (evidently.legacy.features) provides backward compatibility for existing integrations. These features are wrapped by modern descriptors but can also be used directly.

Source: src/evidently/legacy/features/__init__.py

Available Legacy Features

| Feature | Module | Purpose |
|---|---|---|
| TextLength | text_length | Text length statistics |
| TextMatch | text_features | Pattern matching in text |
| HuggingFaceFeature | hf_feature | HuggingFace model integration |

Usage Example

import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import TextLength, TextMatch, HuggingFaceToxicity
from evidently.presets import TextEvals

# Create descriptors for evaluation
text_length_desc = TextLength(
    column_name="review_text",
    alias="Review Length",
    mode="words"
)

toxicity_desc = HuggingFaceToxicity(
    column_name="review_text",
    alias="Toxicity Score",
    toxic_label="toxic"
)

# Attach the descriptors to a Dataset and evaluate them with a Report
# (reviews_df: a pandas DataFrame with a "review_text" column)
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(reviews_df),
    data_definition=DataDefinition(),
    descriptors=[
        text_length_desc,
        toxicity_desc
    ]
)

report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)

Descriptor Tests Integration

Descriptors can be associated with tests that validate the computed values:

from evidently.descriptors import TextLength

# Attach tests directly to the descriptor. GreaterThan and LessThan stand in for
# whatever DescriptorTest / GenericTest implementations your Evidently version provides.
description_length = TextLength(
    column_name="description",
    alias="Description Length",
    tests=[
        # Tests evaluated against the computed "Description Length" values
        GreaterThan("Description Length", threshold=10),
        LessThan("Description Length", threshold=500)
    ]
)

Summary

The Descriptors and Features System provides a flexible, extensible mechanism for computing and evaluating column characteristics within Evidently. Key aspects include:

  • Wrapper Pattern: FeatureDescriptor wraps features with metadata and tests
  • Registry Pattern: Centralized descriptor management via DescriptorRegistry
  • Backward Compatibility: Legacy features wrapped for v1 compatibility
  • Test Integration: Native support for attaching tests to descriptors
  • Visualization: Automatic integration with Evidently's HTML rendering pipeline

This architecture enables users to define custom descriptors for any data transformation or evaluation need while maintaining consistency with the Evidently evaluation framework.

Source: https://github.com/evidentlyai/evidently / Human Manual

Reports and Test Suites

Related topics: Presets and Metric Presets, Custom Metrics and Extensibility, Core Components


Overview

Evidently provides two primary evaluation constructs for ML and LLM-powered systems: Reports and Test Suites. Both are designed to evaluate, analyze, and monitor data and model quality, but they serve different purposes and use cases.

| Component | Purpose | Use Case |
|---|---|---|
| Report | Generate comprehensive analysis with metrics, visualizations, and insights | Exploratory analysis, documentation, stakeholder communication |
| Test Suite | Run structured checks with pass/fail outcomes | CI/CD pipelines, automated quality gates, regression testing |

Sources: src/evidently/legacy/report/report.py

Architecture

graph TD
    subgraph "Core Evaluation Engine"
        A[Dataset / DataDefinition] --> B[Report / Test Suite]
        C[Descriptors / Metrics] --> B
        D[Test Cases] --> B
    end
    
    subgraph "Report Output"
        B --> E[Metric Results]
        B --> F[Visualizations]
        B --> G[JSON/HTML Export]
    end
    
    subgraph "Test Suite Output"
        B --> H[Test Results]
        H --> I[Pass / Fail Status]
        H --> J[Failure Details]
    end

Core Components

The evaluation system is built on several key components:

| Component | File Location | Role |
|---|---|---|
| Report | src/evidently/core/report.py | Main class for generating analytical reports |
| TestSuite | src/evidently/core/tests.py | Orchestrates test execution |
| Dataset | Core data structure | Holds data with schema definitions |
| DataDefinition | Schema definition | Defines column types and expectations |
| Descriptor | Feature extraction | Row-level evaluators (e.g., Sentiment, TextLength) |
| Metric | Metric calculation | Computes statistical measures |

Sources: src/evidently/core/report.py

Datasets and Data Definition

Dataset Structure

Evidently uses a Dataset object to wrap pandas DataFrames with additional metadata:

from evidently import Dataset, DataDefinition

eval_dataset = Dataset.from_pandas(
    pd.DataFrame(eval_df),
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        Contains("answer", items=["sorry", "apologize"], mode="any")
    ]
)

Sources: README.md

DataDefinition

The DataDefinition class defines the schema for the dataset:

| Property | Type | Description |
|---|---|---|
| column_schema | dict | Maps column names to data types |
| relationships | list | Defines relationships between datasets |
| timestamp_column | str | Column containing datetime values |
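
The examples on this page mostly rely on the default DataDefinition() and let Evidently infer column types. Declaring columns explicitly is also possible; the text_columns argument below reflects the current Evidently API rather than the property names in the table above, so treat it as an assumption to verify against your installed version.

from evidently import DataDefinition

# Explicitly mark the free-text columns instead of relying on type inference.
definition = DataDefinition(text_columns=["question", "answer"])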

Descriptors

Descriptors are row-level evaluators that extract features or apply checks to individual rows. They can be used within datasets or standalone.

Available Descriptors

| Descriptor | Purpose | Parameters |
|---|---|---|
| Sentiment | Detect sentiment in text | column_name, alias |
| TextLength | Measure text length | column_name, alias |
| Contains | Check for substring presence | column_name, items, mode, case_sensitive |
| BeginsWith | Verify text prefix | column_name, prefix, case_sensitive |
| HuggingFace | Apply HF models | column_name, model, params |
| HuggingFaceToxicity | Detect toxic content | column_name, model, toxic_label |

Sources: src/evidently/descriptors/generated_descriptors.py

Descriptor Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| column_name | str | Yes | Name of the column to evaluate |
| alias | str | No | Display name for the result |
| tests | list | No | Optional test cases to apply |
| mode | str | No | Matching mode: "any" or "all" |
| case_sensitive | bool | No | Whether comparison is case-sensitive |
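
Combining several of these parameters in one descriptor (mirroring the dataset example above):

from evidently.descriptors import Contains

# Flag answers that contain any apology phrase, ignoring case.
apology_flag = Contains(
    "answer",
    items=["sorry", "apologize"],
    mode="any",
    case_sensitive=False,
    alias="Apology",
)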

Reports

Creating a Report

Reports generate comprehensive HTML or JSON output with metrics and visualizations:

from evidently import Report
from evidently.presets import TextEvals

report = Report(
    metrics=[
        TextEvals(),
    ]
)

result = report.run(reference_data=reference_df, current_data=current_df)
report.save_html("report.html")

Sources: src/evidently/legacy/report/report.py

Report Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| metrics | list | Required | List of metrics to compute |
| timestamp | datetime | Now | Report generation time |
| include_tests | bool | False | Include test results in report |

Report Output Formats

| Format | Method | Use Case |
|---|---|---|
| HTML | save_html(path) | Interactive visualization |
| JSON | save_json(path) or as_dict() | Machine parsing, APIs |
| Python dict | as_dict() | Programmatic access |

Sources: src/evidently/core/serialization.py

Test Suites

Creating a Test Suite

Test Suites run structured checks and return pass/fail status:

from evidently.test_suite import TestSuite

suite = TestSuite(tests=[
    # Test definitions
])

result = suite.run(reference_data=reference_df, current_data=current_df)
suite.save("test_results.json")

Sources: src/evidently/legacy/test_suite/test_suite.py

Test Suite Workflow

graph LR
    A[Input Data] --> B[Test Suite]
    B --> C{Execute Tests}
    C --> D[Test 1]
    C --> E[Test 2]
    C --> F[Test N]
    D --> G[Test Results]
    E --> G
    F --> G
    G --> H{Pass All?}
    H -->|Yes| I[Success]
    H -->|No| J[Failure Report]

Test Results Structure

| Field | Type | Description |
|---|---|---|
| status | str | "PASSED", "FAILED", "WARNING" |
| name | str | Test identifier |
| group | str | Test category |
| details | dict | Additional context and metrics |
| timestamp | datetime | When the test was run |
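
A hedged sketch of consuming these fields after running the suite from the example above; the as_dict() layout with a top-level "tests" list, and the exact status strings, are assumptions to verify against the installed version.

# Collect the names of failing tests from the suite run above.
result_dict = suite.as_dict()

failed = [
    test.get("name")
    for test in result_dict.get("tests", [])
    if test.get("status") == "FAILED"
]
print(f"{len(failed)} failed test(s): {failed}")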

LLM Evaluation Features

LLM-Powered Judgments

Evidently supports LLM-based evaluation through specialized descriptors:

from evidently.descriptors import (
    ContextQualityLLMEval,
    CompletenessLLMEval,
)

| Descriptor | Purpose | Key Parameters |
|---|---|---|
| ContextQualityLLMEval | Evaluate context relevance | question, provider, model |
| CompletenessLLMEval | Check response completeness | context, provider, model, include_reasoning |

Sources: src/evidently/descriptors/generated_descriptors.py

LLM Provider Configuration

| Provider | Model Default | Configuration |
|---|---|---|
| openai | gpt-4o-mini | API key required |
| anthropic | Claude models | API key required |
| azure | Configurable | Endpoint + key |

Evaluation Options

| Parameter | Type | Description |
|---|---|---|
| include_category | bool | Include categorical classification |
| include_score | bool | Include numerical score |
| include_reasoning | bool | Include LLM reasoning |
| uncertainty | Uncertainty | Strategy for handling uncertainty |
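
A sketch combining the descriptor and option tables above; the column names, provider, and model are placeholders, and an API key for the chosen provider must be configured.

from evidently.descriptors import CompletenessLLMEval

# LLM-as-judge check that the answer is complete with respect to the retrieved context.
completeness = CompletenessLLMEval(
    "answer",
    context="context",
    provider="openai",
    model="gpt-4o-mini",
    include_reasoning=True,
    alias="Completeness",
)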

Metrics and Presets

Preset Configurations

Evidently provides pre-configured metric sets:

| Preset | Description | File |
|---|---|---|
| TextEvals | Text quality metrics | src/evidently/presets |
| DataDrift | Drift detection | Legacy metrics |
| DataQuality | Quality checks | Legacy metrics |

Sources: src/evidently/metrics/_legacy.py

Legacy vs. Current API

Evidently maintains both legacy and current APIs for backward compatibility:

Legacy API Structure

src/evidently/legacy/
├── report/
│   └── report.py          # Legacy Report class
└── test_suite/
    └── test_suite.py      # Legacy TestSuite class

Current API Structure

src/evidently/
├── core/
│   ├── report.py          # Current Report class
│   ├── tests.py           # Current test infrastructure
│   ├── compare.py         # Data comparison utilities
│   └── serialization.py   # Output serialization

Sources: src/evidently/metrics/_legacy.py

Usage Example: LLM Evaluation Pipeline

import pandas as pd
from evidently import Report, Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals

# 1. Prepare data
eval_df = pd.DataFrame([
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Who painted the Mona Lisa?", "Leonardo da Vinci."],
    ["Can you write an essay?", "I'm sorry, but I can't assist with homework."]],
    columns=["question", "answer"])

# 2. Create dataset with descriptors
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(eval_df),
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        Contains("answer", items=["sorry", "apologize"], mode="any")
    ]
)

# 3. Run report
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
report.save_html("llm_eval_report.html")

Sources: README.md

Comparison: Report vs. Test Suite

| Aspect | Report | Test Suite |
|---|---|---|
| Output | Metrics, visualizations, insights | Pass/fail status, error details |
| Use Case | Exploratory analysis | Automated quality gates |
| Integration | Dashboards, documentation | CI/CD, monitoring alerts |
| Thresholds | Configurable, informative | Strict pass/fail boundaries |
| Exit Code | Always 0 (informational) | 0 (pass) or 1 (fail) |

Sources: src/evidently/core/tests.py

Serialization

Supported Formats

| Format | Extension | Use Case |
|---|---|---|
| JSON | .json | APIs, automation, storage |
| HTML | .html | Human review, sharing |
| Parquet | .parquet | Large-scale data storage |

Serialization Options

# Save as JSON
report.save_json("report.json")

# Save as HTML
report.save_html("report.html")

# Export as dictionary
data = report.as_dict()

Sources: src/evidently/core/serialization.py

Summary

Reports and Test Suites in Evidently provide complementary approaches to ML and LLM evaluation:

  • Reports excel at providing comprehensive, visual analysis suitable for exploration and documentation
  • Test Suites provide structured, automated quality checks ideal for CI/CD integration
  • Both share the same underlying dataset and descriptor infrastructure
  • LLM-powered evaluations are available through dedicated descriptors like ContextQualityLLMEval and CompletenessLLMEval
  • Output can be serialized to JSON, HTML, or accessed programmatically via Python dictionaries

Sources: [src/evidently/legacy/report/report.py](https://github.com/evidentlyai/evidently/blob/main/src/evidently/legacy/report/report.py)

Presets and Metric Presets

Related topics: Reports and Test Suites, Custom Metrics and Extensibility


Overview

Presets in Evidently are pre-configured collections of metrics and evaluations designed to simplify the process of assessing machine learning models and LLM-powered systems. They provide ready-to-use evaluation templates that bundle relevant metrics, thresholds, and reporting components into cohesive units. This abstraction layer allows users to perform comprehensive model assessment without manually configuring individual metrics.

The preset system follows a modular architecture where each preset type addresses a specific use case domain. Users can instantiate presets and pass them directly to the Report or TestSuite objects, which execute the bundled evaluations and generate structured results.

Preset Architecture

graph TD
    A[Evidently Presets] --> B[Standard Presets]
    A --> C[Legacy Presets]
    
    B --> B1[DataDriftPreset]
    B --> B2[TargetDriftPreset]
    B --> B3[ClassificationPreset]
    B --> B4[RegressionPreset]
    B --> B5[RecSysPreset]
    B --> B6[TextEvals]
    
    C --> C1[MetricPreset]
    C --> C2[TestPreset]
    
    D[Report / TestSuite] --> E[Integrates Presets]
    E --> B
    E --> C

Preset Types

Standard Presets (src/evidently/presets/)

The modern preset implementation resides in the src/evidently/presets/ directory. These presets integrate directly with the current Evidently reporting API.

#### DataDriftPreset

The DataDriftPreset evaluates feature-level drift between reference and current datasets. It calculates drift scores for individual features and provides aggregate drift statistics. This preset is particularly useful for monitoring data pipeline changes and detecting distribution shifts that may impact model performance.

from evidently import Report
from evidently.presets import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)

#### TargetDriftPreset

The TargetDriftPreset specifically monitors drift in the target variable distribution. It is essential for supervised learning scenarios where changes in label distribution can signal underlying data quality issues or concept drift.

#### ClassificationPreset

The ClassificationPreset bundles metrics relevant to classification model evaluation. It includes accuracy metrics, confusion matrix analysis, class-level performance indicators, and probability calibration assessments. This preset supports both binary and multiclass classification scenarios.

#### RegressionPreset

The RegressionPreset provides comprehensive regression model evaluation including error distribution analysis, quantile-based metrics, and residual diagnostics. It helps identify heteroscedasticity, non-linearity, and systematic prediction biases.

#### RecSysPreset

The RecSysPreset is specialized for recommendation system evaluation. It includes ranking metrics, coverage indicators, and user-item interaction analysis tailored to collaborative filtering and content-based recommendation approaches.

#### TextEvals

The TextEvals preset targets LLM and NLP model evaluation. It encompasses text-specific metrics such as sentiment analysis, text length distributions, and content quality indicators. This preset integrates with the descriptors system for row-level text evaluations.

import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals

eval_dataset = Dataset.from_pandas(
    pd.DataFrame(eval_df),
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        Contains("answer", items=["sorry", "cannot"], alias="Denial")
    ]
)

report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)

Legacy Presets (src/evidently/legacy/)

The legacy preset system provided foundational abstractions for metric and test evaluation. These classes served as the original implementation pattern before the current preset architecture.

#### MetricPreset

The MetricPreset class defines the base interface for metric collection presets. It maintains a list of metrics and provides methods for execution and result aggregation.

#### TestPreset

The TestPreset class extends the preset concept to testing scenarios, enabling batch execution of statistical tests against model outputs. It provides assertions and threshold-based pass/fail criteria.

Preset Composition

Presets can be combined within a single report to create comprehensive evaluation suites:

from evidently import Report
from evidently.presets import DataDriftPreset, ClassificationPreset

report = Report(
    metrics=[
        DataDriftPreset(),
        ClassificationPreset()
    ]
)

Integration with Report API

Presets integrate seamlessly with the Evidently Report class:

| Component | Role | Integration |
|---|---|---|
| Report | Execution container | Accepts presets as metrics parameter |
| Dataset | Data wrapper | Provides reference and current data |
| DataDefinition | Schema definition | Describes columns and their roles |
| Preset | Metric bundle | Contains metric instances |

Descriptor System

The preset system works in conjunction with the descriptor system for row-level evaluations. Descriptors apply per-row transformations and evaluations:

from evidently.descriptors import Sentiment, TextLength, Contains

Descriptors included in the Dataset configuration are evaluated alongside preset metrics, enabling both aggregate and granular assessment within a single report.

Use Cases

Model Validation

Before deploying a model, use presets to validate performance on held-out data:

from evidently import Report
from evidently.presets import RegressionPreset

report = Report(metrics=[RegressionPreset()])
report.run(reference_data=training_df, current_data=validation_df)

Production Monitoring

Continuously monitor deployed models for performance degradation:

from evidently import Report
from evidently.presets import DataDriftPreset, ClassificationPreset

report = Report(metrics=[DataDriftPreset(), ClassificationPreset()])
report.run(reference_data=baseline_df, current_data=current_df)

LLM Evaluation

Assess LLM responses using text-specific presets:

from evidently.presets import TextEvals
from evidently import Dataset, DataDefinition

eval_dataset = Dataset.from_pandas(
    response_df,
    data_definition=DataDefinition(),
    descriptors=[...]  # row-level evaluators such as Sentiment or TextLength
)
report = Report(metrics=[TextEvals()])

Summary Table: Preset Comparison

| Preset | Domain | Key Metrics | Use Case |
|---|---|---|---|
| DataDriftPreset | Data | Drift scores, feature importance | Data pipeline monitoring |
| TargetDriftPreset | Data | Target distribution, PSI | Concept drift detection |
| ClassificationPreset | ML | Accuracy, F1, ROC-AUC | Classifier evaluation |
| RegressionPreset | ML | MAE, RMSE, R² | Regression analysis |
| RecSysPreset | ML | Ranking metrics, coverage | Recommendation systems |
| TextEvals | NLP/LLM | Sentiment, length, content | Text model evaluation |

Conclusion

The preset system in Evidently provides a powerful abstraction for organizing and executing model evaluations. By bundling related metrics into coherent units, presets reduce boilerplate and enable consistent evaluation practices across different model types and deployment scenarios.

Source: https://github.com/evidentlyai/evidently / Human Manual

Custom Metrics and Extensibility

Related topics: Reports and Test Suites, Presets and Metric Presets


Overview

Evidently provides a comprehensive extensibility system that allows users to create custom metrics, features, and test configurations. The framework follows a plugin-like architecture where custom implementations can be registered and discovered at runtime through the registry system.

The extensibility model encompasses several key areas:

  • Custom Metrics: User-defined metrics that compute specific evaluation logic
  • Custom Features: Row-level evaluators that extract or compute values from data
  • Generators: Automatic metric and feature generation from column specifications
  • Bound Tests: Test configurations attached to metrics with threshold definitions

This architecture enables users to extend the framework's built-in capabilities while maintaining consistency with the existing evaluation pipeline.

Registry System Architecture

The registry system serves as the central mechanism for discovering and managing extensions within Evidently. It provides a unified interface for registering, retrieving, and invoking custom implementations.

graph TD
    A[User Code] --> B[Registry API]
    B --> C[MetricRegistry]
    B --> D[FeatureRegistry]
    B --> E[TestRegistry]
    C --> F[Metric Implementations]
    D --> G[Feature Implementations]
    E --> H[Test Configurations]

Core Registry Components

| Component | Purpose | File Location |
|---|---|---|
| MetricRegistry | Manages custom metric registrations | src/evidently/core/registries/__init__.py |
| ConfigRegistry | Stores configuration metadata | src/evidently/core/registries/configs.py |
| BoundTestRegistry | Handles test bindings and thresholds | src/evidently/core/registries/bound_tests.py |

The registry pattern follows a key-based lookup system where each extension is identified by a unique name. This allows for dynamic discovery and late binding of implementations at runtime.

Custom Metrics

Custom metrics extend the base Metric class to implement domain-specific evaluation logic. The framework provides both legacy and modern approaches for metric creation.

Metric Structure

A custom metric typically consists of:

  1. Configuration: Defines the metric's parameters and metadata
  2. Calculation Logic: Implements the calculate method to produce results
  3. Visualization: Provides rendering information through widget definitions

class CustomMetric(Metric):
    def __init__(self, column: str, threshold: float = 0.5):
        self.column = column
        self.threshold = threshold
    
    def calculate(self, context: Context) -> MetricResult:
        # Custom calculation logic
        pass

Sources: src/evidently/legacy/metrics/custom_metric.py:1-50
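
Once defined, a custom metric can be passed to a Report like any built-in metric. The sketch below reuses the illustrative CustomMetric class above and assumes eval_dataset is a Dataset built as in the earlier examples.

from evidently import Report

# Mix the custom metric with any other metrics or presets.
report = Report(metrics=[CustomMetric(column="answer", threshold=0.8)])
result = report.run(reference_data=None, current_data=eval_dataset)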

Metric Container System

The MetricContainer abstract base class provides the foundation for grouping related metrics. It implements a caching mechanism to avoid redundant metric generation.

graph LR
    A[Context] --> B{Container Fingerprint}
    B -->|Cache Hit| C[Return Cached Metrics]
    B -->|Cache Miss| D[generate_metrics]
    D --> E[Store in Context]
    E --> C

The generate_metrics method must be implemented by subclasses to define the metrics to be computed:

def generate_metrics(self, context: "Context") -> Sequence[MetricOrContainer]:
    """Generate metrics based on the container configuration.
    
    Args:
        context: Context containing datasets and configuration.
    
    Returns:
        Sequence of Metric or MetricContainer objects to compute.
    """
    raise NotImplementedError()

Sources: src/evidently/core/container.py:1-80
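
A minimal container subclass might look like the sketch below; the bundled CustomMetric is the illustrative class from the previous section, standing in for whatever Metric implementations the container should group.

from typing import Sequence

class AnswerQualityContainer(MetricContainer):
    """Illustrative container bundling a fixed set of metrics for one column."""

    def __init__(self, column: str):
        self.column = column

    def generate_metrics(self, context: "Context") -> Sequence[MetricOrContainer]:
        # Return the metrics to compute; the framework handles caching by fingerprint.
        return [CustomMetric(column=self.column, threshold=0.5)]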

Rendering and Widgets

Metrics can contribute visualization widgets through the render method. The widget system supports hierarchical composition where parent containers aggregate widgets from child metrics:

def render(
    self,
    context: "Context",
    child_widgets: Optional[List[Tuple[Optional[MetricId], List[BaseWidgetInfo]]]] = None,
) -> List[BaseWidgetInfo]:
    """Render visualization widgets for this container.
    
    Combines widgets from all child metrics/containers.
    """

Sources: src/evidently/core/container.py:80-120

Custom Features

Custom features extend the Descriptor base class to provide row-level evaluation capabilities. They can be applied to individual data points during dataset processing.

Feature Implementation Pattern

Features implement the descriptor pattern, wrapping column values with evaluation logic:

class CustomFeature(Descriptor):
    def __init__(self, column: str, alias: str = None):
        self.column = column
        self.alias = alias or column
    
    def apply(self, data: pd.Series) -> pd.Series:
        # Custom feature calculation, e.g. character count per row
        return data.str.len()

Sources: src/evidently/legacy/features/custom_feature.py:1-40
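
Applied directly, the sketch above behaves like a vectorized per-row transformation (assuming Descriptor is importable in your environment):

import pandas as pd

feature = CustomFeature(column="answer", alias="Answer length")
lengths = feature.apply(pd.Series(["short", "a much longer answer"]))
print(lengths.tolist())  # [5, 20]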

Integration with Descriptors

The descriptor system integrates with the data definition framework, allowing features to be declared as part of a dataset configuration:

eval_dataset = Dataset.from_pandas(
    pd.DataFrame(eval_df),
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        Contains("answer", ["denial", "refuse"], alias="ContainsDenial")
    ]
)

Sources: README.md:1-50

Generators System

Generators provide an automated way to create metrics and features based on column specifications. This reduces boilerplate when working with large numbers of similar evaluations.

Column Generator

The column generator creates standardized metrics for numeric and categorical columns:

generator = ColumnGenerator(
    columns=["feature_1", "feature_2", "feature_3"],
    generators=[
        MeanGenerator(),
        StdGenerator(),
        NullCountGenerator()
    ]
)

Sources: src/evidently/generators/column.py:1-60

Generator Configuration

Generators can be configured with:

| Parameter | Type | Description |
|---|---|---|
| columns | List[str] | Target columns for generation |
| include_missing | bool | Include columns with missing values |
| exclude_patterns | List[str] | Regex patterns for column exclusion |
| generators | List[Generator] | Specific generators to apply |

Sources: src/evidently/generators/__init__.py:1-50

Bound Tests and Thresholds

Bound tests attach threshold configurations to metrics, enabling automated pass/fail determinations based on metric results.

Test Binding Pattern

Tests are bound to metrics through the registry system:

class BoundTest:
    def __init__(self, metric: MetricId, threshold: TestThreshold):
        self.metric = metric
        self.threshold = threshold
    
    def evaluate(self, metric_result: MetricResult) -> TestResult:
        # Compare metric value against threshold
        pass

Sources: src/evidently/core/registries/bound_tests.py:1-50

Threshold Types

The framework supports multiple threshold configuration types:

| Threshold Type | Description | Use Case |
|---|---|---|
| absolute | Fixed value comparison | Fixed acceptable ranges |
| relative | Percentage-based comparison | Drift detection |
| sigma | Standard deviation bounds | Anomaly detection |
| quantile | Percentile-based thresholds | Distribution extremes |

Sources: src/evidently/core/registries/bound_tests.py:50-100

Configuration and Versioning

The extensibility system includes robust configuration management with versioning support for metrics and descriptors.

Config Version Model

Configurations are versioned to track changes over time:

from dataclasses import dataclass
from typing import Any

@dataclass
class ConfigVersion:
    id: STR_UUID
    artifact_id: STR_UUID
    version: int
    content: Any
    metadata: ConfigVersionMetadata

Sources: src/evidently/sdk/configs.py:1-80

Metadata Structure

| Field | Type | Description |
|---|---|---|
| created_at | datetime | Version creation timestamp |
| updated_at | datetime | Last modification timestamp |
| author | str | User who created/modified |
| comment | str | Change description |

Sources: src/evidently/sdk/adapters.py:1-100

Workflow Diagram

The complete extensibility workflow from definition to execution:

graph TD
    A[Define Custom Metric/Feature] --> B[Register in Registry]
    B --> C[Create Dataset with Descriptors]
    C --> D[Generate Metrics from Container]
    D --> E[Calculate Results]
    E --> F[Bind Tests with Thresholds]
    F --> G[Evaluate Pass/Fail]
    G --> H[Render Widgets]
    H --> I[Display in Report/Dashboard]

Best Practices

Performance Considerations

  • Caching: Use the container fingerprinting mechanism to cache generated metrics and avoid redundant calculations
  • Lazy Evaluation: Implement metrics using lazy evaluation patterns when possible to defer expensive computations
  • Batch Processing: Design features to operate on pandas Series rather than individual values for vectorized performance

Extensibility Guidelines

  1. Consistent Naming: Follow the established naming conventions for custom implementations
  2. Type Hints: Include comprehensive type hints for all public interfaces
  3. Documentation: Document parameters and return values using docstring conventions
  4. Testing: Create unit tests for custom metric calculation logic

Integration Points

Custom extensions integrate with the framework through several standardized interfaces:

  • Metric Protocol: Implement the calculate(self, context) method
  • Descriptor Protocol: Implement the apply(self, data) method
  • Widget Protocol: Return List[BaseWidgetInfo] from render() method
  • Test Protocol: Implement the evaluate(self, result) method for bound tests

API Reference

Key Classes and Functions

| Class/Function | Module | Purpose |
|---|---|---|
| MetricContainer | evidently.core.container | Base class for metric containers |
| Descriptor | evidently.legacy.features | Base class for features |
| ColumnGenerator | evidently.generators.column | Column-based metric generation |
| BoundTest | evidently.core.registries | Threshold-bound test configuration |
| ConfigVersion | evidently.sdk.configs | Versioned configuration storage |

Context Management

The Context object provides access to datasets and configuration during metric calculation:

class Context:
    def metrics_container(self, fingerprint: str) -> Optional[List[MetricOrContainer]]:
        """Retrieve cached metrics for container fingerprint."""
    
    def set_metric_container_data(
        self, 
        fingerprint: str, 
        metrics: List[MetricOrContainer]
    ) -> None:
        """Store generated metrics in cache."""

Sources: src/evidently/core/container.py:50-80

Summary

Evidently's extensibility system provides a comprehensive framework for customizing evaluation logic. The registry-based architecture enables dynamic discovery of custom implementations while maintaining consistency with built-in features. Custom metrics and features extend the base classes to implement domain-specific logic, and the generator system automates repetitive metric definitions. The bound test mechanism attaches configurable thresholds to metrics for automated validation, making the system suitable for both exploratory analysis and production monitoring scenarios.

Sources: [src/evidently/legacy/metrics/custom_metric.py:1-50]()

UI Service Backend


The Evidently UI Service Backend is a FastAPI-based REST API layer that powers the Evidently web interface, providing endpoints for managing projects, datasets, prompts, and trace data. It serves as the bridge between the React-based frontend and the Evidently SDK's core functionality.

Architecture Overview

The UI Service Backend follows a layered architecture:

graph TD
    A[Frontend: React/TypeScript] --> B[UI Service Backend: FastAPI]
    B --> C[Evidently SDK]
    C --> D[Workspace Abstraction]
    C --> E[Cloud Config API]
    D --> F[(Local Storage)]
    E --> G[(Remote Backend)]

Component Stack

| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React, TypeScript, MUI | User interface components |
| Backend API | FastAPI/Python | REST API endpoints |
| Business Logic | Evidently SDK | Core evaluation logic |
| Data Layer | Local FS / Remote API | Persistence |

Core Data Models

Project Model

Projects are the primary organizational unit in the UI. The ProjectModel represents:

ProjectModel:
    - name: str
    - description: Optional[str]
    - org_id: Optional[OrgID]

Sources: src/evidently/ui/workspace.py:50-55

Config API Abstraction

The ConfigAPI class provides a generic interface for managing versioned configurations:

| Method | Purpose |
|---|---|
| create_config() | Create a new configuration |
| get_config() | Retrieve a configuration by ID |
| list_configs() | List all configurations |
| update_config() | Update an existing configuration |
| delete_config() | Delete a configuration |
| add_version() | Add a new version to a config |
| get_version() | Get a specific version |

Sources: src/evidently/sdk/configs.py:95-145

Descriptor Configuration

The Descriptor type is used to store and version descriptor configurations:

DescriptorConfigAPI:
    - add_descriptor() -> ConfigVersion
    - get_descriptor() -> Descriptor

Sources: src/evidently/sdk/configs.py:165-180

Workspace Abstraction

The Workspace abstract class defines the contract for project management:

graph TD
    A[Workspace] --> B[LocalWorkspace]
    A --> C[CloudWorkspace]
    A --> D[RemoteWorkspace]

Abstract Methods

| Method | Parameters | Returns | Description |
|---|---|---|---|
| create_project() | name, description, org_id | Project | Creates a new project |
| add_project() | project: ProjectModel, org_id | Project | Adds project to workspace |
| get_project() | project_id: STR_UUID | Optional[Project] | Retrieves project by ID |
| delete_project() | project_id: STR_UUID | None | Removes project from workspace |
| list_projects() | org_id: Optional | Sequence[Project] | Lists all projects |

Sources: src/evidently/ui/workspace.py:40-80
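
A hedged sketch of using this contract against a local, file-based workspace; the Workspace.create helper and the exact keyword names should be checked against the installed Evidently version.

from evidently.ui.workspace import Workspace

# Open (or create) a file-based workspace and register a project in it.
ws = Workspace.create("./evidently_workspace")
project = ws.create_project(name="llm-evals", description="LLM answer quality checks")

for existing in ws.list_projects():
    print(existing.name)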

Frontend Components

Project Card Component

The ProjectCard component handles project display and editing:

ProjectCardProps:
    - project: Project
    - disabled?: boolean
    - onEditProject: (args: { name: string; description: string }) => void
    - LinkToProject: ComponentType

Features:

  • Toggle between view and edit modes
  • Uses EditProjectInfoForm for inline editing
  • Displays ProjectInfoCard in view mode

Sources: ui/packages/evidently-ui-lib/src/components/Project/ProjectCard.tsx:60-80

Prompts Table

The PromptsTable component renders a sortable, paginated list of prompts:

| Column | Render | Features |
|---|---|---|
| ID | ID with copy button | TextWithCopyIcon component |
| Name | Truncated text (max 200px) | Typography component |
| Created at | Date formatted | dayjs locale formatting |
| Actions | Link + Delete button | Edit/delete operations |

Sources: ui/packages/evidently-ui-lib/src/components/Prompts/PromptsTable.tsx:40-75

Traces Table

The TracesTable component displays trace data with extended metadata:

| Column | Features |
|---|---|
| Tags | HidedTags component, 250px min width |
| Metadata | JsonViewThemed with clipboard support |
| Type | Chip showing trace origin |
| Created at | Sortable date column |
| Actions | Dataset link, edit dialog, delete |

Sources: ui/packages/evidently-ui-lib/src/components/Traces/TracesTable.tsx:50-90

API Endpoint Structure

Projects API

/projects
  ├── GET    /              - List all projects
  ├── POST   /              - Create new project
  └── GET    /{project_id}  - Get project details

/projects/{project_id}/
  ├── prompts/
  │   ├── GET    /          - List prompts
  │   ├── POST   /          - Create prompt
  │   └── GET    /{id}      - Get prompt details
  ├── datasets/
  │   ├── GET    /          - List datasets
  │   └── POST   /          - Create dataset
  └── traces/
      ├── GET    /          - List traces
      └── POST   /          - Create trace
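
A sketch of calling these endpoints from a client; the base URL, any required API prefix, and authentication are deployment-specific assumptions.

import requests

BASE_URL = "http://localhost:8000"  # assumed address of a locally running UI service

# List projects, then list prompts for the first project returned.
projects = requests.get(f"{BASE_URL}/projects", timeout=10).json()
if projects:
    project_id = projects[0]["id"]
    prompts = requests.get(f"{BASE_URL}/projects/{project_id}/prompts", timeout=10).json()
    print(f"{len(prompts)} prompt(s) in project {project_id}")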

Prompts API

| Endpoint | Method | Purpose |
|---|---|---|
| /prompts | GET | List all prompts for a project |
| /prompts | POST | Create a new prompt |
| /prompts/{id} | GET | Get prompt details |

Sources: ui/service/src/routes/.../index-prompts-list/index-prompts-list-main.tsx:40-60

Configuration Management

Artifact Version Management

The ArtifactConfigAPI handles versioned artifact storage:

ArtifactConfigAPI:
    - create_version()      # Create new artifact version
    - list_versions()       # List all versions
    - get_version()         # Get specific version
    - get_version_by_id()   # Get by version ID

Sources: src/evidently/sdk/adapters.py:45-70

Version Conversion

Bidirectional conversion between SDK and API models:

graph LR
    A[ArtifactVersion] -->|convert| B[ConfigVersion]
    B -->|convert| A

Methods:

  • _artifact_version_to_config_version() - SDK to API
  • _config_version_to_artifact_version() - API to SDK

UI Utilities

The utils.py module provides HTML templates for dashboard rendering:

HTML_LINK_WITH_ID_TEMPLATE   # Link with button and ID display
FILE_LINK_WITH_ID_TEMPLATE   # File link with ID
RUNNING_SERVICE_LINK_TEMPLATE # Service link with label

Sources: src/evidently/ui/utils.py:30-55

Workflow: Project Lifecycle

graph TD
    A[Create Project] --> B[Add Datasets]
    B --> C[Create Prompts]
    C --> D[Run Evals]
    D --> E[Store Traces]
    E --> F[View Results]
    F --> G[Monitor]
    
    A -.->|via| H[Workspace.add_project]
    C -.->|via| I[Prompts API]
    E -.->|via| J[Traces API]

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| org_id | UUID | None | Organization identifier |
| project_id | UUID | Auto | Project identifier |
| version | int | "latest" | Config version selector |

Key SDK Classes

| Class | File | Responsibility |
|---|---|---|
| ConfigAPI | configs.py | Generic config CRUD operations |
| DescriptorConfigAPI | configs.py | Descriptor-specific operations |
| CloudConfigAPI | configs.py | Remote backend communication |
| Workspace | workspace.py | Abstract workspace interface |
| ProjectModel | workspace.py | Project data structure |

Technology Stack

| Component | Technology | Version |
|---|---|---|
| Backend Framework | FastAPI | - |
| SDK Core | Python | 3.11+ |
| Frontend Framework | React | - |
| UI Components | MUI | - |
| State Management | React Hooks | - |
| Date Handling | dayjs | - |

Development Workflow

Running the Service

# In ui/service folder
pnpm dev

Code Quality

# In ui folder
pnpm code-check          # Format, sort imports, lint
pnpm code-check --fix    # Apply fixes automatically

Sources: ui/README.md:15-25

Building

pnpm build

Summary

The Evidently UI Service Backend provides:

  1. REST API layer for frontend-backend communication
  2. Project management via Workspace abstraction
  3. Versioned configuration storage using ConfigAPI
  4. Tracing support for evaluation history
  5. Prompts management for LLM-powered evaluations

The architecture cleanly separates concerns between the FastAPI backend, Evidently SDK business logic, and React frontend components, enabling modular development and testing.

Sources: src/evidently/ui/workspace.py:50-55


Doramagic Pitfall Log

Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: Update scikit-learn version requirement to support v1.6.0

  • Severity: high
  • Finding: Installation risk is backed by a source signal: Update scikit-learn version requirement to support v1.6.0. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1407

2. Configuration risk: PromptOptimizer throws OpenAIError when using Vertex AI judge

  • Severity: high
  • Finding: Configuration risk is backed by a source signal: PromptOptimizer throws OpenAIError when using Vertex AI judge. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1856

3. Project risk: IndexError in infer_column_type when column contains only null values

  • Severity: high
  • Finding: Project risk is backed by a source signal: IndexError in infer_column_type when column contains only null values. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1764

4. Security or permission risk: Update evidently hashlib usage for FIPS-Compliant Systems and Security Best Practices

  • Severity: high
  • Finding: Security or permission risk is backed by a source signal: Update evidently hashlib usage for FIPS-Compliant Systems and Security Best Practices. Treat it as a review item until the current version is checked.
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1410

5. Installation risk: Numpy 2.x support?

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: Numpy 2.x support?. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1557

6. Installation risk: v0.7.12

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v0.7.12. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.12

7. Installation risk: v0.7.15

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v0.7.15. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.15

8. Installation risk: v0.7.20

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v0.7.20. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.20

9. Capability assumption: v0.7.19

  • Severity: medium
  • Finding: Capability assumption is backed by a source signal: v0.7.19. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.19

10. Capability assumption: v0.7.21

  • Severity: medium
  • Finding: Capability assumption is backed by a source signal: v0.7.21. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.21

11. Capability assumption: README/documentation is current enough for a first validation pass.

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: capability.assumptions | github_repo:315977578 | https://github.com/evidentlyai/evidently | README/documentation is current enough for a first validation pass.

12. Maintenance risk: Maintainer activity is unknown

  • Severity: medium
  • Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | github_repo:315977578 | https://github.com/evidentlyai/evidently | last_activity_observed missing

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

  • Sources: 12 project-level external discussion links are exposed on this manual page.
  • Use: review before install. Open the linked issues or discussions before treating the pack as ready for your environment.


Doramagic exposes project-level community discussion separately from official documentation. Review these links before using evidently with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence