Doramagic Project Pack · Human Manual
evidently
Architecture Overview
Related topics: Core Components, Data Management and Data Flow
Introduction
Evidently is an open-source Python framework designed to evaluate, test, and monitor ML and LLM-powered systems. The architecture follows a modular design pattern that separates concerns between core evaluation logic, SDK APIs, user interface components, and configuration management.
Sources: README.md:1-30
High-Level Architecture
The Evidently platform consists of several interconnected layers:
graph TB
subgraph "User Interface Layer"
UI["UI Components<br/>(React/TypeScript)"]
end
subgraph "SDK Layer"
SDK["Python SDK<br/>(evidently.sdk)"]
API["Cloud Config API"]
end
subgraph "Core Layer"
CORE["Core Modules<br/>(metrics, descriptors, reports)"]
LEGACY["Legacy Module<br/>(evidently.legacy)"]
FUTURE["Future Module<br/>(evidently.future)"]
end
subgraph "Data Layer"
WS["Workspace Abstraction"]
CFG["Config System"]
ADP["Adapters"]
end
UI --> SDK
SDK --> API
SDK --> WS
SDK --> CFG
SDK --> ADP
ADP --> WS
CFG --> WS
Core Module Structure
The main evidently package provides the primary user-facing APIs for ML evaluation:
| Component | Purpose |
|---|---|
Report | Generate evaluation reports combining multiple metrics |
Dataset | Container for evaluation data with descriptors |
DataDefinition | Schema definition for dataset structure |
Metrics | Individual evaluation metrics |
Descriptors | Row-level evaluators (e.g., Sentiment, TextLength) |
Presets | Pre-configured evaluation suites (e.g., TextEvals) |
Sources: README.md:30-55
Key Imports for LLM Evals
from evidently import Report
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals
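Together with the imports above, a minimal end-to-end sketch looks like the following; the dataframe and column names are illustrative, and descriptor signatures follow the definitions later in this manual:
import pandas as pd
df = pd.DataFrame({"answer": ["The order ships within 3 days.", "I cannot help with that."]})
# Wrap the raw data and attach row-level descriptors
dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(),
    descriptors=[Sentiment("answer"), TextLength("answer")],
)
# Run a pre-configured text evaluation preset over the dataset
report = Report([TextEvals()])
snapshot = report.run(dataset, None)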
SDK Architecture
The SDK layer provides programmatic access to remote configurations and cloud features.
CloudConfigAPI
The CloudConfigAPI class manages remote configuration operations:
| Method | Purpose |
|---|---|
create_config() | Create a new remote configuration |
update_config() | Update an existing configuration |
get_config() | Retrieve a configuration by ID |
list_configs() | List all configurations for a project |
delete_config() | Remove a configuration |
create_version() | Create a new version of a configuration |
list_versions() | List all versions of a configuration |
get_version() | Get a specific version |
Sources: src/evidently/sdk/configs.py:1-50
Config Version Management
The configuration system uses a ConfigVersion model to track changes:
classDiagram
class ConfigMetadata {
+str created_at
+str updated_at
+str author
+str description
}
class ConfigVersion {
+str id
+str artifact_id
+int version
+Any content
+ConfigVersionMetadata metadata
}
class ConfigVersionMetadata {
+str created_at
+str updated_at
+str author
+str comment
}
ConfigVersion --> ConfigVersionMetadata
ConfigVersionMetadata --|> ConfigMetadata
Sources: src/evidently/sdk/configs.py:50-150
Adapter Pattern
The SDK uses adapters to convert between internal config models and domain objects:
graph LR
A["ConfigVersion"] -->|Adapter| B["ArtifactVersion"]
A -->|Adapter| C["Descriptor"]
A -->|Adapter| D["Prompt"]| Adapter Class | Source Config | Target Domain Object |
|---|---|---|
ArtifactAdapter | ConfigVersion | ArtifactVersion |
DescriptorAdapter | ConfigVersion | Descriptor |
PromptAdapter | ConfigVersion | Prompt |
Sources: src/evidently/sdk/adapters.py:1-100
Workspace Architecture
The workspace abstraction provides a unified interface for project management:
Abstract Base Class
classDiagram
class Workspace {
<<abstract>>
+create_project(name, description, org_id) Project
+add_project(project, org_id) Project
+get_project(project_id) Optional~Project~
+delete_project(project_id)
+list_projects(org_id) Sequence~Project~
}
Core Workspace Methods
| Method | Parameters | Returns | Description |
|---|---|---|---|
create_project | name: str, description: str, org_id: Optional[OrgID] | Project | Creates and adds a new project |
add_project | project: ProjectModel, org_id: Optional[OrgID] | Project | Adds an existing project model |
get_project | project_id: STR_UUID | Optional[Project] | Retrieves project by UUID |
delete_project | project_id: STR_UUID | None | Removes project from workspace |
list_projects | org_id: Optional[OrgID] | Sequence[Project] | Lists projects with optional filtering |
Sources: src/evidently/ui/workspace.py:1-100
Project Model
Projects are stored using a ProjectModel data structure:
ProjectModel(
name=str,
description=str,
org_id=Optional[OrgID]
)
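A brief sketch of the workspace methods from the table above. The Workspace.create() constructor for a local workspace directory is an assumption, and the project name is illustrative:
from evidently.ui.workspace import Workspace
ws = Workspace.create("./evidently_workspace")                 # local workspace path (assumed constructor)
project = ws.create_project("demo-project", description="Example project")
print([p.name for p in ws.list_projects()])                    # list projects, optionally filtered by org_id
ws.delete_project(project.id)                                  # projects are addressed by UUID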
UI Component Architecture
The frontend is built with React and TypeScript, organized as separate packages:
graph TD
subgraph "ui/packages/evidently-ui-lib"
WC["Widgets"]
CC["Components"]
FP["Forms"]
TB["Tables"]
end
subgraph "Components"
DT["Dashboard"]
TR["Traces"]
PR["Prompts"]
DS["Descriptors"]
end
CC --> DT
CC --> TR
CC --> PR
CC --> DS
CC --> FP
CC --> TB
Component Categories
| Package Path | Purpose |
|---|---|
src/widgets/ | Dashboard widgets and test suite components |
src/components/Dashboard/ | Dashboard-specific UI elements |
src/components/Traces/ | Trace viewing and management |
src/components/Prompts/ | Prompt template management |
src/components/Descriptors/ | Descriptor configuration forms |
src/components/Utils/ | Shared utility components |
Template and Link Generation
The UI layer uses template functions to generate HTML for external links and reports:
| Template | Purpose |
|---|---|
HTML_LINK_WITH_ID_TEMPLATE | Report links with button and ID display |
FILE_LINK_WITH_ID_TEMPLATE | File links with ID metadata |
RUNNING_SERVICE_LINK_TEMPLATE | Service endpoint links |
EVIDENTLY_STYLES_COMMON | Shared CSS styles for links |
Sources: src/evidently/ui/utils.py:1-60
Template Structure
<div class="evidently-links container">
<a target="_blank" href="{button_url}">{button_title}</a>
<p><b>{id_title}:</b> <span>{id}</span></p>
</div>
Data Flow Architecture
graph TB
A["User Code"] --> B["Dataset Creation"]
B --> C["Descriptor Application"]
C --> D["Report Generation"]
D --> E["Workspace Storage"]
F["Cloud API"] --> G["Config Management"]
G --> E
E --> H["UI Display"]API Reference Documentation
The project includes automated API documentation generation:
| Command | Description |
|---|---|
./api-reference/generate.py --local-source-code | Generate docs from local source |
./api-reference/generate.py --git-revision <ref> | Generate docs from git revision |
./api-reference/generate.py --additional-modules | Include extra modules |
Documentation is output to api-reference/dist/ organized by revision.
Sources: api-reference/README.md:1-50
Summary
The Evidently architecture is organized into three main layers:
- Core Layer - Python packages for evaluation logic (evidently, evidently.core)
- SDK Layer - Remote configuration and cloud integration (evidently.sdk)
- UI Layer - React/TypeScript web interface for visualization and management
The modular design allows users to:
- Use the Python SDK directly for programmatic evaluation
- Store and version configurations in the cloud
- Access results through the web UI
- Extend functionality through the descriptor and metric system
Sources: README.md:1-30
Core Components
Related topics: Architecture Overview, Reports and Test Suites, Custom Metrics and Extensibility
Overview
The Core Components form the foundational architecture of the Evidently framework, providing the essential building blocks for evaluation, testing, and monitoring of ML and LLM-powered systems. These components establish the base type system, metric definitions, reporting mechanisms, and registry patterns that enable the framework's extensibility and modularity.
The Core Components encompass several interconnected modules that work together to provide a unified approach to ML evaluation. At the heart of this system is a registration-based architecture that allows metrics, tests, and components to be dynamically discovered, configured, and executed within the Evidently ecosystem.
This architecture follows a plugin-like pattern where individual components can be registered, versioned, and managed through centralized registries. This design enables users to extend the framework's functionality by implementing custom metrics and tests while maintaining compatibility with the existing evaluation pipeline.
Base Types Module
Purpose and Scope
The base types module defines the fundamental data structures and abstractions that underpin all Evidently components. These types establish common interfaces and data models used throughout the framework, ensuring consistency and interoperability between different modules.
The base types primarily focus on defining what data flows through the evaluation pipeline, including data definitions, dataset representations, and result containers. This module serves as the foundation upon which all higher-level abstractions are built.
Key Data Structures
The base types module defines several critical classes and interfaces that form the backbone of the Evidently type system. These structures provide a standardized way to represent input data, evaluation results, and configuration options across all components.
The module establishes the DataDefinition class, which serves as a schema for describing the structure of datasets being evaluated. This includes column definitions, data types, and metadata that describe the characteristics of the data being analyzed. Data definitions enable Evidently to understand the structure of incoming data and apply appropriate transformations and evaluations.
The module also defines base classes for Dataset objects, which encapsulate reference and current data batches used in evaluations. These dataset representations include functionality for data validation, transformation, and slicing based on various criteria.
Metric Types Module
Metric Architecture
The metric types module defines the core abstraction for all metrics within Evidently. Metrics are the primary mechanism for evaluating data quality, model performance, and statistical properties of datasets. The module establishes a class hierarchy that enables both simple single-value metrics and complex multi-dimensional metric calculations.
Metrics in Evidently follow a consistent pattern where each metric is associated with specific columns or data subsets and produces standardized result objects. This design enables metrics to be composed, combined, and analyzed together within reports and dashboards.
Metric Execution Model
The metric execution model defines how metrics are calculated, what inputs they receive, and how results are structured. Each metric implementation receives a data context containing the relevant dataset columns and produces a MetricResult object that encapsulates the calculated values and metadata.
The execution model supports both eager and lazy evaluation strategies. Some metrics compute their results immediately upon execution, while others may defer computation until results are actually needed. This flexibility enables optimizations in scenarios where multiple metrics share intermediate calculations or where memory efficiency is a concern.
Metrics can be configured with parameters that control their behavior, including aggregation methods, thresholds, and visualization options. The parameter system uses type hints and validation to ensure configuration correctness while providing clear documentation through the API reference.
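For instance, parameters are supplied as keyword arguments when a metric is constructed. A sketch using metrics that appear later in this manual; the column name is illustrative:
from evidently import Report
from evidently.metrics import ValueStats, ErrorPercentile
# Each metric validates its own parameters at construction time
report = Report([
    ValueStats(column="target"),        # column-scoped metric
    ErrorPercentile(percentile=95),     # numeric threshold parameter
])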
Report System
Report Architecture
The Report class serves as the primary interface for executing evaluations in Evidently. Reports orchestrate the execution of metrics and tests, collect results, and generate visualizations and summaries. The Report system provides a flexible framework for combining multiple evaluation components into cohesive analysis workflows.
Reports maintain state throughout their lifecycle, tracking which metrics and tests have been executed, their results, and any errors or warnings encountered during execution. This state management enables features like incremental recalculation, where only changed components need to be re-evaluated.
graph TD
A[Create Report] --> B[Add Metrics]
B --> C[Add Tests]
C --> D[Run Report]
D --> E[Collect Results]
E --> F[Generate Visualizations]
F --> G[Export Report]
H[Reference Data] --> D
I[Current Data] --> D
E --> J[Metric Results]
E --> K[Test Results]
Report Configuration
Reports support extensive configuration options that control execution behavior, result aggregation, and output formatting. Configuration options include parallel execution settings, result caching policies, and visualization preferences.
The Report class provides methods for both synchronous and asynchronous execution, enabling integration with various application architectures. Asynchronous execution is particularly useful for long-running evaluations or when reports are generated as part of batch processing workflows.
Report Export
The Report system includes export functionality that generates output in various formats including HTML, JSON, and Python dictionary structures. Export options enable customization of included visualizations, result detail levels, and styling preferences.
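A sketch of the export surface described above; the method names follow the output-format table in the ML Model Evaluation section, and save_html is an assumption:
snapshot = report.run(current_dataset, reference_dataset)
result_dict = snapshot.dict()            # programmatic access to metric results
result_json = snapshot.json()            # serialized form for APIs or CI pipelines
snapshot.save_html("evaluation.html")    # standalone HTML report (assumed helper)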
Testing Framework
Test Definition
The testing module extends Evidently's evaluation capabilities to include pass/fail style assertions. Tests in Evidently are designed to validate specific conditions about data or model behavior, returning boolean results that indicate whether defined criteria are met.
Tests can be defined inline within metrics or as standalone components that check specific conditions. This dual approach provides flexibility in how validation logic is structured and reused across different evaluation scenarios.
Test Execution
Tests execute alongside metrics within the Report framework, allowing unified execution of both evaluation and validation logic. The test execution model mirrors the metric execution model, receiving data contexts and producing structured results that include pass/fail status and diagnostic information.
Test results include detailed failure messages that explain why a test did not pass, helping users understand data quality issues or model behavior problems. These diagnostic messages reference specific rows, values, or statistical properties that contributed to the test failure.
Component Registry System
Registry Architecture
The registry system provides a centralized mechanism for managing and discovering Evidently components. Registries maintain mappings between component identifiers and their implementations, enabling dynamic resolution of components at runtime.
Evidently uses registries extensively for metrics, tests, descriptors, and other extensible components. This architecture allows the framework to support third-party extensions while maintaining a consistent interface for component discovery and execution.
Component Registration
Components are registered with their respective registries using decorators or explicit registration calls. The registration process captures metadata about each component, including its name, version, parameters, and dependencies. This metadata enables automatic documentation generation and configuration interfaces.
Registration supports versioning, allowing multiple versions of a component to coexist and enabling rollback to previous versions when needed. Version management is particularly important for production deployments where stability and reproducibility are critical.
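A generic illustration of decorator-based registration; this is not Evidently's actual registration API, and all names here are hypothetical:
from typing import Callable, Dict, Type
_METRICS: Dict[str, Type] = {}
def register_metric(name: str) -> Callable[[Type], Type]:
    """Hypothetical decorator that records a metric class under a stable name."""
    def wrapper(cls: Type) -> Type:
        _METRICS[name] = cls
        return cls
    return wrapper
@register_metric("my_custom_metric")
class MyCustomMetric:
    pass
assert _METRICS["my_custom_metric"] is MyCustomMetric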
Metrics Registry
The metrics registry extends the component registry pattern specifically for metrics. It provides specialized functionality for metric discovery, parameter validation, and result aggregation.
graph TD
A[Metrics Registry] --> B[Metric Base Class]
A --> C[Metric Parameters]
A --> D[Metric Results]
B --> E[Column Metrics]
B --> F[Dataset Metrics]
B --> G[Statistical Metrics]
D --> H[Visualization Configs]
D --> I[Aggregation Methods]
Metric Tests Registry
The metric tests registry manages test definitions that validate metric results or data conditions. This registry enables tests to reference metrics by name and access their results for validation purposes.
Tests registered in this registry can be automatically discovered and included in reports based on configuration. This discovery mechanism enables declarative specification of validation requirements without requiring explicit test instantiation.
Integration Patterns
SDK Integration
The Core Components integrate with Evidently's SDK layer, which provides higher-level APIs for cloud and remote deployments. The SDK layer uses the Core Components as its foundation while adding capabilities for remote execution, result storage, and collaborative features.
The CloudConfigAPI class demonstrates this integration, providing methods for managing project configurations, descriptor configs, and artifact versions. These high-level APIs delegate core functionality to the Core Components while handling network communication and serialization.
Adapter Pattern
Evidently uses adapters to bridge different execution contexts and storage backends. Adapters transform between internal data structures and external representations, enabling Evidently to work with various data sources and deployment environments.
The adapter pattern is particularly important for cloud deployments where configurations and results are stored remotely. Adapters handle serialization, deserialization, and API communication while maintaining compatibility with the Core Component interfaces.
Configuration Management
Config Versioning
The Core Components support configuration versioning through specialized config classes. Each versioned configuration maintains a history of changes, enabling audit trails and rollback capabilities.
Configurations are organized by project, with separate namespaces for different config types like metrics, tests, and descriptors. This organization enables clear separation of concerns while maintaining relationships between related configurations.
Remote Configuration
For deployments requiring centralized configuration management, Evidently supports remote configuration storage through the CloudConfigAPI. Remote configurations can be fetched, updated, and versioned through the SDK, with changes automatically synchronized to connected clients.
Summary
The Core Components provide the essential infrastructure for Evidently's evaluation capabilities. Through a combination of base types, metric abstractions, reporting mechanisms, and registry systems, these components establish a flexible and extensible framework for ML evaluation.
The modular design enables users to leverage individual components for specific use cases or combine them into comprehensive evaluation pipelines. The registration-based architecture ensures extensibility while maintaining consistency across the framework.
Understanding these core concepts is essential for effectively using Evidently and for extending the framework with custom metrics, tests, and integrations.
Source: https://github.com/evidentlyai/evidently / Human Manual
Data Management and Data Flow
Related topics: Architecture Overview, ML Model Evaluation, LLM Evaluation and Judging
Overview
Data Management and Data Flow in Evidently encompasses the mechanisms by which data is ingested, transformed, stored, and processed throughout the evaluation lifecycle. The system provides a unified abstraction layer that handles multiple data sources (pandas DataFrames, CSV files, Parquet files) and integrates with the broader evaluation pipeline including Reports, Test Suites, and LLM-specific processing like RAG (Retrieval-Augmented Generation) systems.
The architecture separates concerns between core data structures (Dataset, Container), SDK-level abstractions for cloud/local deployment, and specialized LLM data processing components. This modularity allows users to work with familiar Python data structures while benefiting from Evidently's evaluation and monitoring capabilities.
Sources: src/evidently/core/datasets.py:1-50
Core Data Abstraction
The Dataset Class
The Dataset class serves as the primary data container throughout Evidently. It wraps various data sources into a standardized interface that supports:
- Direct creation from pandas DataFrames
- Automatic type inference via DataDefinition
- Descriptor-based feature computation
- Metadata and tagging support
# Basic Dataset creation from pandas
from evidently import Dataset, DataDefinition
dataset = Dataset.from_pandas(
dataframe,
data_definition=DataDefinition(),
metadata={"source": "production"},
tags=["eval", "2024"]
)
Sources: src/evidently/core/datasets.py:30-45
Dataset Factory Methods
| Method | Purpose | Input Types |
|---|---|---|
from_pandas() | Create Dataset from pandas DataFrame | pd.DataFrame |
from_any() | Convert various types to Dataset | pd.DataFrame, Dataset |
as_dataframe() | Extract underlying DataFrame | N/A (output) |
column() | Retrieve specific column as DatasetColumn | column name |
subdataset() | Filter dataset by column value | column name, label |
Sources: src/evidently/core/datasets.py:50-80
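A short sketch of the accessor methods listed above; argument names follow the table, and column and label values are illustrative:
dataset = Dataset.from_pandas(dataframe, data_definition=DataDefinition())
df_back = dataset.as_dataframe()                          # recover the underlying DataFrame
answers = dataset.column("answer")                        # a single column as a DatasetColumn
prod_only = dataset.subdataset("source", "production")    # filter rows by column value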
The from_any() static method implements a factory pattern that handles type conversion automatically:
@staticmethod
def from_any(dataset: PossibleDatasetTypes) -> "Dataset":
if isinstance(dataset, Dataset):
return dataset
if isinstance(dataset, pd.DataFrame):
return Dataset.from_pandas(dataset)
raise ValueError(f"Unsupported dataset type: {type(dataset)}")
Sources: src/evidently/core/datasets.py:60-70
Data Flow Architecture
graph TD
A[Input Data: pd.DataFrame / CSV / Parquet] --> B[Dataset.from_any]
B --> C{Data Type Check}
C -->|pd.DataFrame| D[Dataset.from_pandas]
C -->|Already Dataset| E[Return as-is]
D --> F[DataDefinition Type Inference]
F --> G[Add Descriptors]
G --> H[Dataset Object]
H --> I[Report.run / TestSuite.run]
I --> J[Snapshot with Results]
J --> K[Export: JSON / HTML / Dict]
Container Architecture
The Container class provides the underlying storage and management mechanism for datasets within the Evidently ecosystem. Containers maintain references to data assets and provide CRUD operations for dataset lifecycle management.
Key container responsibilities include:
- Storing dataset references and metadata
- Managing dataset versioning
- Providing query and filtering capabilities
- Handling persistence to storage backends
Sources: src/evidently/core/container.py:1-30
Container-Dataset Relationship
graph LR
A[Container] -->|manages| B[Dataset Registry]
B -->|references| C[Dataset v1]
B -->|references| D[Dataset v2]
B -->|references| E[Dataset vN]
C -->|wraps| F[pd.DataFrame]
D -->|wraps| G[pd.DataFrame]
E -->|wraps| H[pd.DataFrame]
SDK-Level Data Management
The SDK layer provides deployment-agnostic data handling through the CloudConfigAPI and local storage adapters. This abstraction enables consistent data operations whether running locally or connected to Evidently Cloud.
Sources: src/evidently/sdk/datasets.py:1-50
SDK Dataset Operations
| Operation | Method Signature | Description |
|---|---|---|
| Create | add_dataset(project_id, dataset, name, description, link) | Add dataset to project |
| Read | load_dataset(dataset_id) | Retrieve dataset by UUID |
| List | list_datasets(project, origins) | List all datasets in project |
| Update | update_dataset(dataset_id, dataset) | Modify existing dataset |
| Delete | delete_dataset(dataset_id) | Remove dataset from storage |
Sources: src/evidently/ui/workspace.py:50-100
LLM-Specific Data Processing
RAG Data Pipeline
For LLM applications using Retrieval-Augmented Generation, Evidently provides specialized data processing components that handle document ingestion, chunking, and indexing.
Sources: src/evidently/llm/rag/index.py:1-50
graph TD
A[Raw Documents] --> B[RAG Splitter]
B --> C[Document Chunks]
C --> D[RAG Index]
D --> E[Vector Store]
E --> F[Retrieval Query]
F --> G[Context + Query]
G --> H[LLM Response]
Document Splitting
The splitter.py module handles text chunking with configurable parameters:
- Chunk size: Target size for each text segment
- Overlap: Amount of overlap between consecutive chunks
- Separators: Text boundaries for splitting priority
Sources: src/evidently/llm/rag/splitter.py:1-40
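To make the chunking parameters concrete, here is a generic character-based splitter with overlap. This is an illustration only, not Evidently's splitter API:
def split_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list:
    """Split text into fixed-size chunks with overlap between neighbours."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks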
RAG Indexing
The index module manages the vector storage and retrieval:
- Embedding generation for document chunks
- Vector store integration (FAISS, ChromaDB, etc.)
- Similarity search capabilities
- Metadata preservation for filtering
Sources: src/evidently/llm/rag/index.py:50-100
Data Generation for LLM Evals
The datagen/base.py module provides infrastructure for synthetic data generation, enabling users to create evaluation datasets programmatically.
Sources: src/evidently/llm/datagen/base.py:1-50
Data Generation Workflow
graph LR
A[Seed Data / Templates] --> B[LLM Generator]
B --> C[Generated Samples]
C --> D[Quality Validation]
D -->|Pass| E[Evaluation Dataset]
D -->|Fail| F[Regeneration Loop]
F --> B
DataGeneratorBase Class
The base class defines the contract for data generation:
| Method | Purpose |
|---|---|
generate() | Create new data samples |
validate() | Check generated data quality |
expand() | Augment existing datasets |
Sources: src/evidently/llm/datagen/base.py:30-60
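A hypothetical sketch of the contract in the table above; the class and method bodies are illustrative, not the actual base class:
from abc import ABC, abstractmethod
from typing import List
class DataGeneratorBase(ABC):
    """Illustrative contract: generate samples, validate them, expand a seed set."""
    @abstractmethod
    def generate(self, n: int) -> List[dict]: ...
    @abstractmethod
    def validate(self, samples: List[dict]) -> List[dict]: ...
    def expand(self, seed: List[dict], n: int) -> List[dict]:
        # augment an existing dataset with freshly generated, validated samples
        return seed + self.validate(self.generate(n))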
Data Flow in Report Execution
Reports represent the primary consumer of dataset objects within Evidently. The run() method orchestrates the complete data flow from input to output.
def run(
self,
current_data,
reference_data=None,
additional_data=None,
timestamp=None,
metadata=None,
tags=None,
name=None,
) -> Snapshot:
# Data validation
if isinstance(current_data, pd.DataFrame) and current_data.empty:
raise ValueError("current_data must contain at least one column...")
# Convert to Dataset objects
current_dataset = Dataset.from_any(current_data)
reference_dataset = Dataset.from_any(reference_data) if reference_data else None
# Execute metrics and generate snapshot
...
Sources: src/evidently/core/report.py:100-130
Data Validation Rules
| Condition | Error Raised |
|---|---|
| Empty current_data DataFrame | ValueError: current_data must contain at least one column; received an empty DataFrame |
| Empty reference_data DataFrame | ValueError: reference_data must contain at least one column; received an empty DataFrame |
| Unsupported dataset type | ValueError: Unsupported dataset type: {type(dataset)} |
Sources: src/evidently/core/datasets.py:65-70
Workspace Dataset Management
The Workspace class provides high-level dataset management capabilities for organizing evaluations across projects.
class Workspace:
def add_dataset(
self,
project_id: STR_UUID,
dataset: Dataset,
name: str,
description: Optional[str] = None,
link: Optional[SnapshotLink] = None,
) -> DatasetID:
"""Add a dataset to a project."""
return self.datasets.add(project_id=project_id, dataset=dataset, ...)
def load_dataset(self, dataset_id: DatasetID) -> Dataset:
"""Load a dataset by ID."""
return self.datasets.load(dataset_id)
def list_datasets(
self,
project: STR_UUID,
origins: Optional[List[str]] = None,
) -> DatasetList:
"""List all datasets in a project."""
return self.datasets.list(project, origins=origins)
Sources: src/evidently/ui/workspace.py:50-80
Dataset Metadata Schema
| Field | Type | Required | Description |
|---|---|---|---|
name | str | Yes | Human-readable dataset name |
description | str | No | Detailed description |
link | SnapshotLink | No | Associated snapshot reference |
created_at | datetime | Auto | Creation timestamp |
author | str | Auto | Creator identifier |
Summary
Data Management in Evidently follows a consistent pattern across the platform:
- Ingestion: Data enters through Dataset.from_pandas() or Dataset.from_any() factory methods
- Validation: Empty DataFrames and unsupported types are rejected early
- Processing: Descriptors and metrics operate on the standardized Dataset interface
- Output: Results are packaged into Snapshots with multiple export formats
- Persistence: Datasets can be stored and retrieved through Workspace and SDK APIs
The separation between core data structures, SDK abstractions, and specialized LLM components enables flexible deployment while maintaining a simple user-facing API based on pandas DataFrames.
Sources: src/evidently/core/datasets.py:1-100, src/evidently/core/report.py:100-150, src/evidently/sdk/datasets.py:1-50, src/evidently/llm/rag/index.py:1-100, src/evidently/llm/rag/splitter.py:1-50, src/evidently/llm/datagen/base.py:1-60
ML Model Evaluation
Related topics: LLM Evaluation and Judging, Descriptors and Features System, Presets and Metric Presets
Overview
ML Model Evaluation in Evidently is a comprehensive module that enables data scientists and ML engineers to assess, test, and monitor machine learning models throughout their lifecycle—from experiments to production. The evaluation system supports both predictive tasks (classification, regression) and recommendation systems.
The evaluation framework is built around the concept of Metrics and Reports, where metrics are individual evaluation components that can be composed into reports for comprehensive model assessment.
Sources: src/evidently/core/report.py:1-50
Core Architecture
graph TD
A[Dataset] --> B[Report]
A --> C[Test Suite]
B --> D[Interactive Report]
B --> E[JSON/Dict Output]
C --> F[Pass/Fail Results]
G[Metrics] --> B
H[Presets] --> B
G --> C
Supported Evaluation Types
| Evaluation Type | Purpose | Key Metrics |
|---|---|---|
| Classification | Evaluate classifier performance | Accuracy, Precision, Recall, F1, ROC-AUC |
| Regression | Evaluate regression models | MAE, MSE, RMSE, R2 |
| Data Quality | Assess input data integrity | Missing values, duplicates, correlations |
| Data Drift | Detect distribution changes | PSI, KS test, L Infinity |
| Recommendation Systems | Evaluate recsys models | Precision@K, Recall@K, NDCG |
| Embeddings | Evaluate embedding quality | Cosine similarity, drift detection |
Classification Metrics
Purpose and Scope
Classification metrics evaluate the performance of classification models by comparing predicted labels against actual labels. Evidently supports both binary and multiclass classification evaluation scenarios.
Sources: src/evidently/metrics/classification.py:1-100
Available Metrics
| Metric | Description | Applicable Task Types |
|---|---|---|
Accuracy | Proportion of correct predictions | Binary, Multiclass |
Precision | Positive predictive value | Binary, Multiclass |
Recall | Sensitivity, true positive rate | Binary, Multiclass |
F1Score | Harmonic mean of precision and recall | Binary, Multiclass |
RocAuc | Area under the ROC curve | Binary, Multiclass |
ConfusionMatrix | Cross-tabulation of predictions vs actuals | Binary, Multiclass |
PrecisionRecallCurve | Precision-Recall tradeoff visualization | Binary |
ClassRepresentation | Distribution of classes in predictions | Multiclass |
Usage Example
from evidently import Report
from evidently.metrics import Accuracy, Precision, Recall, F1Score, RocAuc
report = Report([
Accuracy(),
Precision(),
Recall(),
F1Score(),
RocAuc()
])
result = report.run(current_data=current_dataset, reference_data=reference_dataset)
Legacy Classification Performance
The legacy module provides comprehensive classification evaluation with detailed visualizations.
Sources: src/evidently/legacy/metrics/classification_performance/__init__.py:1-50
Key components include:
- ClassificationPerformanceMetrics: Core metrics container
- Visualization widgets: Confusion matrix plots, ROC curves, PR curves
- Per-class analysis: Detailed breakdown for multiclass scenarios
Regression Metrics
Purpose and Scope
Regression metrics assess the performance of regression models by computing various error measures and statistical properties of prediction residuals.
Sources: src/evidently/metrics/regression.py:1-100
Available Metrics
| Metric | Description | Unit |
|---|---|---|
MeanError | Average prediction error | Same as target |
MeanAbsoluteError | MAE - average absolute error | Same as target |
MeanSquaredError | MSE - average squared error | Squared target units |
RootMeanSquaredError | RMSE - square root of MSE | Same as target |
ErrorStd | Standard deviation of errors | Same as target |
ErrorNormality | Shapiro-Wilk test for residual normality | p-value |
R2Score | Coefficient of determination | Dimensionless |
ErrorPercentile | Percentile-based error analysis | Same as target |
Usage Example
from evidently import Report
from evidently.metrics import (
MeanAbsoluteError,
MeanSquaredError,
R2Score,
ErrorPercentile
)
report = Report([
MeanAbsoluteError(),
MeanSquaredError(),
R2Score(),
ErrorPercentile(percentile=95)
])
result = report.run(current_data=current_dataset, reference_data=reference_dataset)
Legacy Regression Performance
The legacy regression performance module provides time-series analysis of predictions.
Sources: src/evidently/legacy/metrics/regression_performance/__init__.py:1-50
Features include:
- Predicted vs Actual plots: Time-series visualization
- Residual distribution analysis: Histograms and QQ plots
- Error distribution by feature: Feature-level error breakdown
graph LR
A[Current Data] --> B[RegressionReport]
C[Reference Data] --> B
B --> D[Predicted vs Actual]
B --> E[Error Distribution]
B --> F[Performance Metrics]
Data Quality Metrics
Purpose and Scope
Data quality metrics evaluate the integrity and quality of input data. These metrics are essential for understanding whether model predictions are reliable and for identifying data pipeline issues.
Sources: src/evidently/metrics/data_quality.py:1-100
Available Metrics
| Metric | Description |
|---|---|
ColumnCount | Number of columns in dataset |
RowCount | Number of rows in dataset |
ValueStats | Statistical summary (min, max, mean, std) |
MissingValuesMetric | Count and percentage of missing values |
UniqueValuesCount | Number of unique values per column |
DataInconsistencyMetric | Detection of data inconsistencies |
TextStats | Statistics for text columns (length, word count) |
Usage Example
from evidently import Report, Dataset
from evidently.metrics import ColumnCount, ValueStats, MissingValuesMetric
dataset = Dataset.from_pandas(df, data_definition=DataDefinition())
report = Report([
ColumnCount(),
ValueStats(column="target"),
MissingValuesMetric()
])
result = report.run(dataset, None)
Dataset Statistics
Purpose and Scope
Dataset statistics provide comprehensive descriptive statistics about datasets, enabling quick overview and comparison between reference and current data distributions.
Sources: src/evidently/metrics/dataset_statistics.py:1-100
Available Presets
| Preset | Description |
|---|---|
DataSummaryPreset | Complete dataset overview |
DataDriftPreset | Distribution comparison between reference and current |
TargetDriftPreset | Target variable drift analysis |
NumTargetDriftPreset | Numeric target drift analysis |
CatTargetDriftPreset | Categorical target drift analysis |
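Presets drop into a Report like individual metrics. A sketch, assuming the datasets have been prepared as shown earlier:
from evidently import Report
from evidently.presets import DataDriftPreset, DataSummaryPreset
report = Report([DataSummaryPreset(), DataDriftPreset()])
snapshot = report.run(current_dataset, reference_dataset)   # reference data enables drift comparison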
Statistical Tests for Drift Detection
Sources: src/evidently/legacy/calculations/data_drift.py:1-100
| Test | Applicable Data Type | Description |
|---|---|---|
KS | Numerical | Kolmogorov-Smirnov test |
ZScore | Numerical | Z-score based drift detection |
TTest | Numerical | Two-sample t-test for mean comparison |
ChiSquare | Categorical | Chi-square test for category distribution |
PSI | Both | Population Stability Index |
L Infinity | Both | Max absolute difference |
Data Drift Workflow
graph TD
A[Reference Dataset] --> F[Drift Detection Engine]
B[Current Dataset] --> F
F --> C[Statistical Tests]
F --> D[Distribution Comparison]
C --> E[Drift Report]
D --> E
E --> G{Drift Detected?}
G -->|Yes| H[Alert / Action]
G -->|No| I[Continue Monitoring]
Recommendation System Metrics
Purpose and Scope
Recommendation system (recsys) metrics evaluate the quality of recommendations generated by recommendation models.
Sources: src/evidently/metrics/recsys.py:1-100
Available Metrics
| Metric | Description | K-Parameter |
|---|---|---|
PrecisionAtK | Precision at top-K recommendations | Required |
RecallAtK | Recall at top-K recommendations | Required |
NDCGAtK | Normalized Discounted Cumulative Gain at K | Required |
HitRateAtK | Hit rate at top-K recommendations | Optional |
MRRAtK | Mean Reciprocal Rank at K | Required |
Usage Example
from evidently import Report
from evidently.metrics import PrecisionAtK, RecallAtK, NDCGAtK
report = Report([
PrecisionAtK(k=10),
RecallAtK(k=10),
NDCGAtK(k=10)
])
result = report.run(current_data=recs_dataset, reference_data=ref_recs_dataset)
Embeddings Metrics
Purpose and Scope
Embeddings metrics evaluate the quality and drift of vector embeddings, which are critical for LLM and semantic search applications.
Sources: src/evidently/metrics/embeddings.py:1-100
Available Metrics
| Metric | Description |
|---|---|
Embedding Drift | Detect drift in embedding distributions |
Cosine Similarity | Average cosine similarity between embeddings |
Retrieval Metrics | Evaluate RAG and retrieval system quality |
Key Features
- Semantic Drift Detection: Identify changes in embedding space distribution
- Cluster Analysis: Analyze embedding cluster stability
- Pairwise Comparison: Compare individual embedding pairs
Report Configuration
Creating a Report
from evidently import Report
from evidently.presets import DataDriftPreset
# Basic report with preset
report = Report([DataDriftPreset()])
# Report with custom metadata
report = Report(
metrics=[...],
metadata={"model_version": "v1.2.3"},
tags=["production", "monthly"]
)
# Run with reference data for comparison
snapshot = report.run(current_dataset, reference_dataset)
Report Output Formats
| Format | Method | Use Case |
|---|---|---|
| Interactive | Direct notebook display | Exploration, debugging |
| JSON | .json() | API responses, CI/CD |
| Python Dict | .dict() | Programmatic access |
| HTML | Export functionality | Shareable reports |
Test Suite Integration
Reports can be converted to Test Suites by adding pass/fail conditions, enabling automated regression testing.
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValue
suite = TestSuite([
TestColumnValue(column="accuracy", gt=0.95),
])
suite.run(current_data=dataset, reference_data=reference)
Best Practices
1. Always Use Reference Data for Comparison
For meaningful evaluation, maintain a reference dataset representing expected data distribution or model performance.
2. Choose Appropriate Metrics by Task Type
| Task Type | Recommended Metrics |
|---|---|
| Binary Classification | Accuracy, Precision, Recall, F1, ROC-AUC |
| Multiclass Classification | Per-class F1, Confusion Matrix, Macro/Micro Avg |
| Regression | MAE, RMSE, R2, Error Percentiles |
| Recommendation | Precision@K, Recall@K, NDCG@K |
| Data Quality | Missing values, duplicates, consistency |
3. Set Up Monitoring Schedules
For production models, schedule regular evaluations to detect performance degradation and data drift early.
4. Use Auto-generated Test Conditions
Evidently can auto-generate test thresholds from reference data:
from evidently.presets import DataDriftPreset
suite = TestSuite.from_reference(reference_data, [DataDriftPreset()])
Summary
Evidently's ML Model Evaluation module provides a comprehensive, modular framework for evaluating machine learning models across multiple dimensions:
- Classification: Binary and multiclass classification metrics with visualizations
- Regression: Comprehensive error metrics and residual analysis
- Data Quality: Input data integrity checks
- Data Drift: Distribution comparison and statistical tests
- Recommendation Systems: Top-K recommendation quality metrics
- Embeddings: Semantic drift detection for LLM applications
The system supports both one-off evaluations and continuous monitoring, with seamless integration into CI/CD pipelines through Test Suites.
Sources: src/evidently/core/report.py:1-50
LLM Evaluation and Judging
Related topics: ML Model Evaluation, Descriptors and Features System
Overview
Evidently provides a comprehensive framework for evaluating Large Language Model (LLM) powered systems through specialized descriptors called LLM Judges. These judges leverage LLM APIs to automatically assess and score various quality dimensions of LLM outputs, including relevance, coherence, factual accuracy, and task completion.
LLM Judges serve as evaluators that can be integrated into monitoring pipelines to continuously assess LLM performance without requiring manual human evaluation.
Architecture
graph TD
A[Dataset with LLM Inputs/Outputs] --> B[LLM Judge Descriptors]
B --> C[Prompt Templates]
C --> D[LLM Provider API]
D --> E[Parsed Evaluation Results]
E --> F[Feature Descriptors]
F --> G[Metrics & Tests]
G --> H[Evidently Reports/Monitors]
B1[ContextRelevanceLLMEval] --> B
B2[CompletenessLLMEval] --> B
B3[ContextQualityLLMEval] --> B
B4[GroundednessLLMEval] --> B
B5[RelevanceLLMEval] --> B
C1[System Prompt] --> C
C2[User Query] --> C
C3[Evaluation Criteria] --> C
Core Components
LLM Judge Types
Evidently implements multiple specialized LLM judges for different evaluation dimensions:
| Judge Type | Purpose | Output |
|---|---|---|
ContextRelevanceLLMEval | Measures relevance of retrieved context to the query | Score 0-1, reasoning |
CompletenessLLMEval | Evaluates completeness of response coverage | Score 0-1, reasoning |
ContextQualityLLMEval | Assesses quality of context for answering | Score 0-1, reasoning |
GroundednessLLMEval | Checks if response is grounded in provided context | Score 0-1, reasoning |
RelevanceLLMEval | Evaluates semantic relevance of response to query | Score 0-1, reasoning |
Sources: src/evidently/descriptors/generated_descriptors.py:1-200
Feature Descriptor Pattern
LLM Judges follow the FeatureDescriptor pattern, wrapping LLM evaluation features with optional aliasing and test definitions:
FeatureDescriptor(
feature=feature,
alias=alias, # Custom display name
tests=tests # Optional validation tests
)
Sources: src/evidently/descriptors/generated_descriptors.py:30-35
Configuration Options
Common Parameters
All LLM Judge functions share a common parameter structure:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
column_name | str | Yes | - | Name of the text column to evaluate |
provider | str | No | "openai" | LLM provider (openai, anthropic, etc.) |
model | str | No | "gpt-4o-mini" | Model identifier |
additional_columns | Dict[str, str] | No | None | Extra context columns for evaluation |
include_category | bool | No | None | Include categorical classification in output |
include_score | bool | No | None | Include numeric score in output |
include_reasoning | bool | No | None | Include explanation in output |
uncertainty | Uncertainty | No | None | Uncertainty handling strategy |
alias | str | No | None | Custom name for the feature |
tests | List | No | None | Tests to apply to the feature |
Sources: src/evidently/descriptors/generated_descriptors.py:50-70
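A sketch combining the parameters above for one of the judges listed earlier. The import path assumes the generated_descriptors module referenced in the Sources, and the column names are illustrative:
from evidently.descriptors import GroundednessLLMEval
judge = GroundednessLLMEval(
    "response",                                            # column_name holding the LLM output
    provider="openai",
    model="gpt-4o-mini",
    additional_columns={"context": "retrieved_context"},   # extra context column used by the judge
    include_score=True,
    include_reasoning=True,
    alias="Groundedness",
)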
Uncertainty Handling
The uncertainty parameter enables robust evaluation by detecting when the LLM is uncertain about its assessment:
uncertainty: Optional[Uncertainty] = None
This allows the system to flag low-confidence evaluations rather than returning potentially incorrect scores.
Sources: src/evidently/descriptors/generated_descriptors.py:58
Prompt System
Prompt Templates
Evidently uses structured prompt templates for LLM evaluation:
@dataclasses.dataclass
class LLMJudgePrompts:
system: str
user: str
Prompts are rendered with dynamic variables including:
- Query text
- Response text
- Reference context
- Evaluation criteria
Sources: src/evidently/llm/prompts/__init__.py
Prompt Rendering
The prompt rendering system processes templates with type-safe variable substitution:
def render_prompt(
template: PromptTemplate,
**kwargs: Any
) -> str:
Sources: src/evidently/llm/utils/prompt_render.py
Output Parsing
Response Parsing
Evaluation results are parsed using structured output extraction:
| Parser | Purpose |
|---|---|
parse_json_response() | Extract JSON-structured evaluations |
parse_score() | Extract numeric scores |
parse_reasoning() | Extract explanation text |
parse_category() | Extract categorical labels |
Sources: src/evidently/llm/utils/parsing.py
Score Calculation
Scores are calculated based on the evaluation criteria defined in each judge:
- 0.0: Does not meet criteria
- 0.5: Partially meets criteria
- 1.0: Fully meets criteria
The score can be combined with reasoning and category outputs for comprehensive evaluation reports.
Integration with Evidently Reports
Usage in Reports
LLM Judges can be added to Evidently reports:
from evidently.llm import RelevanceLLMEval
report = Report(metrics=[
RelevanceLLMEval(
column_name="response",
include_score=True,
include_reasoning=True
)
])
Usage in Monitoring
For continuous monitoring, LLM Judges are integrated into dashboard widgets:
from evidently.dashboard import Dashboard
dashboard = Dashboard([
RelevanceLLMEval(column_name="response")
])
Sources: ui/packages/evidently-ui-lib/src/components/Descriptors/Features/LLMJudge/template.tsx
UI Components
LLMJudgeTemplate Component
The frontend provides visualization for LLM judge configurations:
| Property | Type | Description |
|---|---|---|
state | LLMJudgeState | Current judge configuration |
errors | FormErrors | Validation errors |
availableTags | string[] | Available categorization tags |
Sources: ui/packages/evidently-ui-lib/src/components/Descriptors/Features/LLMJudge/template.tsx
Visualization States
The UI renders different states for judge configurations:
- Uncertainty Configuration: Shows uncertainty handling options
- Multiclass Classification: Displays class criteria for classification modes
- Criteria Preview: Shows evaluation criteria as formatted text
- Output Options: Displays configured output fields (reasoning, category, score)
Legacy Components
Legacy LLM Judges
The legacy implementation in evidently.legacy provides backward compatibility:
from evidently.legacy.descriptors.llm_judges import (
CompletenessLLMEval as CompletenessLLMEvalV1
)
Sources: src/evidently/legacy/descriptors/llm_judges.py
Sentiment Descriptor
Specialized descriptor for sentiment analysis:
from evidently.legacy.descriptors.sentiment_descriptor import SentimentDescriptor
Features include:
- Sentiment polarity detection (positive, negative, neutral)
- Confidence scoring
- Integration with reporting pipeline
Sources: src/evidently/legacy/descriptors/sentiment_descriptor.py
Scorers and Optimization
LLM Scorers
Scorers provide optimized evaluation functions:
from evidently.llm.optimization.scorers import LLMScorer
Scorers support:
- Batch evaluation
- Caching of results
- Configurable thresholds
Sources: src/evidently/llm/optimization/scorers.py
Best Practices
1. Choosing Evaluation Dimensions
| Use Case | Recommended Judges |
|---|---|
| RAG Systems | ContextRelevance, Groundedness, Relevance |
| Summarization | Completeness, Relevance |
| Q&A Systems | ContextQuality, Groundedness |
| General Chat | All dimensions |
2. Score Interpretation
- 0.0 - 0.3: Significant issues detected
- 0.4 - 0.6: Partial criteria met
- 0.7 - 1.0: Criteria well satisfied
3. Uncertainty Handling
Enable uncertainty handling when:
- Operating on edge cases
- Using smaller models
- Evaluating ambiguous inputs
Sources: src/evidently/descriptors/generated_descriptors.py:1-200
Descriptors and Features System
Related topics: LLM Evaluation and Judging, ML Model Evaluation
Descriptors and Features System
Overview
The Descriptors and Features System is a core component of the Evidently framework that provides extensible ways to extract, compute, and evaluate characteristics from data columns. This system enables users to define custom descriptors that wrap features with validation tests, making it easy to assess data quality, text properties, and model behavior in a unified evaluation pipeline.
Descriptors serve as wrappers around feature implementations, adding metadata, display names, and optional test configurations. The system supports both legacy v1 features and modern descriptor implementations, providing backward compatibility while enabling new functionality.
Architecture
The system follows a layered architecture with clear separation between descriptor definitions, feature implementations, and registry management.
graph TD
subgraph "Public API Layer"
PD[Public Descriptors<br/>TextMatch, HuggingFace, etc.]
CD[Custom Descriptors<br/>FeatureDescriptor]
end
subgraph "Registry Layer"
REG[DescriptorRegistry]
end
subgraph "Implementation Layer"
LEG[Legacy Features V1<br/>hf_feature, text_length]
NEW[New Features<br/>_text_length, text_match]
end
subgraph "Evaluation Layer"
EVAL[Evidently Metrics<br/>and Tests]
end
PD --> REG
CD --> REG
REG --> LEG
REG --> NEW
LEG --> EVAL
NEW --> EVAL
style PD fill:#90EE90
style CD fill:#87CEEB
style REG fill:#FFD700
style LEG fill:#FFA07A
style NEW fill:#DDA0DD
Directory Structure
src/evidently/
├── descriptors/
│ ├── __init__.py # Public API exports
│ ├── _text_length.py # Text length descriptor implementation
│ ├── _custom_descriptors.py # FeatureDescriptor class
│ ├── generated_descriptors.py # Generated descriptor functions
│ └── text_match.py # Text matching descriptor
├── core/
│ └── registries/
│ └── descriptors.py # Descriptor registry implementation
└── legacy/
├── descriptors/ # Legacy descriptor implementations
└── features/ # Legacy feature implementations
Core Components
FeatureDescriptor
The FeatureDescriptor class is the central abstraction that wraps a feature with additional metadata and optional tests.
# Source: src/evidently/descriptors/_custom_descriptors.py
class FeatureDescriptor(BaseDescriptor):
"""Descriptor that wraps a feature with tests and metadata."""
def __init__(
self,
feature: AnyFeatureType,
alias: Optional[str] = None,
tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
self.feature = feature
self.alias = alias
self.tests = tests or []
Parameters:
| Parameter | Type | Description |
|---|---|---|
feature | AnyFeatureType | The underlying feature implementation |
alias | Optional[str] | Display name for the descriptor |
tests | Optional[List] | Optional list of tests to apply |
DescriptorRegistry
The registry manages descriptor registration and lookup, enabling dynamic descriptor discovery.
# Source: src/evidently/core/registries/descriptors.py
class DescriptorRegistry:
"""Registry for managing descriptor instances."""
def __init__(self):
self._descriptors: Dict[str, Descriptor] = {}
def register(self, name: str, descriptor: Descriptor) -> None:
"""Register a descriptor with a given name."""
self._descriptors[name] = descriptor
def get(self, name: str) -> Optional[Descriptor]:
"""Retrieve a descriptor by name."""
return self._descriptors.get(name)
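Usage follows directly from the class above. A sketch: the descriptor instance is assumed to be any object satisfying the Descriptor interface:
registry = DescriptorRegistry()
registry.register("review_length", text_length_descriptor)   # store under a stable name
descriptor = registry.get("review_length")                    # later lookup by name
if descriptor is None:
    raise KeyError("descriptor 'review_length' is not registered")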
Available Descriptors
Text Length Descriptor
Computes text length statistics for string columns, including character count, word count, and sentence statistics.
Source: src/evidently/descriptors/_text_length.py
def TextLength(
column_name: str,
alias: Optional[str] = None,
mode: str = "chars",
tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
"""Compute text length metrics for a column.
Args:
column_name: Name of the text column to analyze.
alias: Display name for the descriptor.
mode: Measurement mode - "chars" or "words".
tests: Optional list of tests to apply.
"""
from evidently.legacy.features.text_length import TextLength as TextLengthV1
feature = TextLengthV1(column_name=column_name, mode=mode, display_name=alias)
return FeatureDescriptor(feature=feature, alias=alias, tests=tests)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
column_name | str | Required | Column to analyze |
alias | Optional[str] | None | Custom display name |
mode | str | "chars" | "chars" for characters, "words" for word count |
tests | Optional[List] | None | Tests to attach |
Text Match Descriptor
Matches text content against patterns or lists, useful for validation and filtering.
Source: src/evidently/descriptors/text_match.py
def TextMatch(
column_name: str,
alias: str,
words_list: List[str],
mode: str = "match",
lemmatize: bool = False,
tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
"""Match text against a word list.
Args:
column_name: Text column to match against.
alias: Name for the descriptor.
words_list: List of words/patterns to match.
mode: Matching mode - "match", "any", "all".
lemmatize: Whether to apply lemmatization.
tests: Optional test list.
"""
from evidently.legacy.features.text_features import TextMatch as TextMatchV1
feature = TextMatchV1(
column_name=column_name,
words_list=words_list,
mode=mode,
lemmatize=lemmatize,
display_name=alias
)
return FeatureDescriptor(feature=feature, alias=alias, tests=tests)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
column_name | str | Required | Target text column |
alias | str | Required | Descriptor name |
words_list | List[str] | Required | Words to match |
mode | str | "match" | "match", "any", or "all" |
lemmatize | bool | False | Enable lemmatization |
tests | Optional[List] | None | Tests to apply |
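A brief usage sketch based on the signature above; the import path assumes the public exports listed in the directory structure, and the column name and word list are illustrative:
from evidently.descriptors import TextMatch
refund_mentions = TextMatch(
    column_name="support_reply",
    alias="Mentions Refund",
    words_list=["refund", "reimburse", "money back"],
    mode="any",          # match if any word from the list appears
    lemmatize=True,      # normalize word forms before matching
)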
HuggingFace Descriptor
Applies HuggingFace models to text columns for various NLP tasks including toxicity detection.
Source: src/evidently/descriptors/generated_descriptors.py
def HuggingFace(
column_name: str,
model: str,
params: dict,
alias: str,
tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
"""Apply a HuggingFace model to text column.
Args:
column_name: Name of the text column to process.
model: HuggingFace model name or path.
params: Additional parameters for the model.
alias: Alias for the descriptor.
tests: Optional list of tests to apply.
"""
from evidently.legacy.features.hf_feature import HuggingFaceFeature
feature = HuggingFaceFeature(
column_name=column_name,
model=model,
params=params,
display_name=alias
)
return FeatureDescriptor(feature=feature, alias=alias, tests=tests)
HuggingFace Toxicity Descriptor
Specialized descriptor for detecting toxic content using HuggingFace models.
Source: src/evidently/descriptors/generated_descriptors.py:39-56
def HuggingFaceToxicity(
column_name: str,
alias: str,
model: Optional[str] = None,
toxic_label: Optional[str] = None,
tests: Optional[List[Union["DescriptorTest", "GenericTest"]]] = None,
):
"""Detect toxicity in text using HuggingFace models.
Args:
column_name: Name of the text column to check.
alias: Alias for the descriptor.
model: HuggingFace model name or path. If None, uses default.
toxic_label: Label for toxic content.
tests: Optional list of tests to apply.
"""
Integration with Evidently Metrics
Descriptors are designed to integrate seamlessly with Evidently's metric and test evaluation system. When a descriptor is included in an evaluation, the underlying feature is computed and the results are automatically formatted for display.
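A compact sketch of that flow, reusing the Dataset and Report patterns shown elsewhere in this manual; the eval_df DataFrame and column name are illustrative:
import pandas as pd
from evidently import Report, Dataset, DataDefinition
from evidently.descriptors import TextLength
from evidently.presets import TextEvals

eval_df = pd.DataFrame({"answer": ["Tokyo is the capital of Japan.", "I'm sorry, I can't help with that."]})

# The descriptor is attached to the Dataset; the Report computes and renders it
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[TextLength("answer", alias="Length")],
)
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
report.save_html("descriptor_report.html")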
graph LR
A[Input Data] --> B[Descriptor Definition]
B --> C[Feature Computation]
C --> D[Test Evaluation]
D --> E[Metric Results]
C --> F[Visualization Data]
F --> G[HTML Widgets]
style A fill:#87CEEB
style E fill:#90EE90
style G fill:#FFD700
Legacy Features System
The legacy features system (evidently.legacy.features) provides backward compatibility for existing integrations. These features are wrapped by modern descriptors but can also be used directly.
Source: src/evidently/legacy/features/__init__.py
Available Legacy Features
| Feature | Module | Purpose |
|---|---|---|
TextLength | text_length | Text length statistics |
TextMatch | text_features | Pattern matching in text |
HuggingFaceFeature | hf_feature | HuggingFace model integration |
Usage Example
from evidently.descriptors import TextLength, TextMatch, HuggingFaceToxicity
# Create descriptors for evaluation
text_length_desc = TextLength(
column_name="review_text",
alias="Review Length",
mode="words"
)
toxicity_desc = HuggingFaceToxicity(
column_name="review_text",
alias="Toxicity Score",
toxic_label="toxic"
)
# Attach the descriptors to a Dataset and evaluate them in a Report
# (reviews_df is an illustrative pandas DataFrame with a "review_text" column)
from evidently import Report, Dataset, DataDefinition
from evidently.presets import TextEvals

eval_dataset = Dataset.from_pandas(
    reviews_df,
    data_definition=DataDefinition(),
    descriptors=[text_length_desc, toxicity_desc]
)
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
Descriptor Tests Integration
Descriptors can be associated with tests that validate the computed values:
from evidently.descriptors import TextLength
from evidently.test_suite import ColumnTestSuite
# GreaterThan and LessThan are assumed to be importable from Evidently's tests module
test_suite = ColumnTestSuite(
tests=[
TextLength(
column_name="description",
alias="Description Length",
tests=[
# Test within the descriptor
GreaterThan("Description Length", threshold=10),
LessThan("Description Length", threshold=500)
]
)
]
)
Summary
The Descriptors and Features System provides a flexible, extensible mechanism for computing and evaluating column characteristics within Evidently. Key aspects include:
- Wrapper Pattern: FeatureDescriptor wraps features with metadata and tests
- Registry Pattern: Centralized descriptor management via DescriptorRegistry
- Backward Compatibility: Legacy features wrapped for v1 compatibility
- Test Integration: Native support for attaching tests to descriptors
- Visualization: Automatic integration with Evidently's HTML rendering pipeline
This architecture enables users to define custom descriptors for any data transformation or evaluation need while maintaining consistency with the Evidently evaluation framework.
Source: https://github.com/evidentlyai/evidently / Human Manual
Reports and Test Suites
Related topics: Presets and Metric Presets, Custom Metrics and Extensibility, Core Components
Overview
Evidently provides two primary evaluation constructs for ML and LLM-powered systems: Reports and Test Suites. Both are designed to evaluate, analyze, and monitor data and model quality, but they serve different purposes and use cases.
| Component | Purpose | Use Case |
|---|---|---|
| Report | Generate comprehensive analysis with metrics, visualizations, and insights | Exploratory analysis, documentation, stakeholder communication |
| Test Suite | Run structured checks with pass/fail outcomes | CI/CD pipelines, automated quality gates, regression testing |
Sources: src/evidently/legacy/report/report.py
Architecture
graph TD
subgraph "Core Evaluation Engine"
A[Dataset / DataDefinition] --> B[Report / Test Suite]
C[Descriptors / Metrics] --> B
D[Test Cases] --> B
end
subgraph "Report Output"
B --> E[Metric Results]
B --> F[Visualizations]
B --> G[JSON/HTML Export]
end
subgraph "Test Suite Output"
B --> H[Test Results]
H --> I[Pass / Fail Status]
H --> J[Failure Details]
end
Core Components
The evaluation system is built on several key components:
| Component | File Location | Role |
|---|---|---|
Report | src/evidently/core/report.py | Main class for generating analytical reports |
TestSuite | src/evidently/core/tests.py | Orchestrates test execution |
Dataset | Core data structure | Holds data with schema definitions |
DataDefinition | Schema definition | Defines column types and expectations |
Descriptor | Feature extraction | Row-level evaluators (e.g., Sentiment, TextLength) |
Metric | Metric calculation | Computes statistical measures |
Sources: src/evidently/core/report.py
Datasets and Data Definition
Dataset Structure
Evidently uses a Dataset object to wrap pandas DataFrames with additional metadata:
import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
eval_dataset = Dataset.from_pandas(
pd.DataFrame(eval_df),
data_definition=DataDefinition(),
descriptors=[
Sentiment("answer", alias="Sentiment"),
TextLength("answer", alias="Length"),
Contains("answer", items=["sorry", "apologize"], mode="any")
]
)
Sources: README.md
DataDefinition
The DataDefinition class defines the schema for the dataset:
| Property | Type | Description |
|---|---|---|
column_schema | dict | Maps column names to data types |
relationships | list | Defines relationships between datasets |
timestamp_column | str | Column containing datetime values |
Descriptors
Descriptors are row-level evaluators that extract features or apply checks to individual rows. They can be used within datasets or standalone.
Available Descriptors
| Descriptor | Purpose | Parameters |
|---|---|---|
Sentiment | Detect sentiment in text | column_name, alias |
TextLength | Measure text length | column_name, alias |
Contains | Check for substring presence | column_name, items, mode, case_sensitive |
BeginsWith | Verify text prefix | column_name, prefix, case_sensitive |
HuggingFace | Apply HF models | column_name, model, params |
HuggingFaceToxicity | Detect toxic content | column_name, model, toxic_label |
Sources: src/evidently/descriptors/generated_descriptors.py
Descriptor Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
column_name | str | Yes | Name of the column to evaluate |
alias | str | No | Display name for the result |
tests | list | No | Optional test cases to apply |
mode | str | No | Matching mode: "any" or "all" |
case_sensitive | bool | No | Whether comparison is case-sensitive |
Reports
Creating a Report
Reports generate comprehensive HTML or JSON output with metrics and visualizations:
from evidently import Report
from evidently.presets import TextEvals
report = Report(
metrics=[
TextEvals(),
]
)
result = report.run(reference_data=reference_df, current_data=current_df)
report.save_html("report.html")
Sources: src/evidently/legacy/report/report.py
Report Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
metrics | list | Required | List of metrics to compute |
timestamp | datetime | Now | Report generation time |
include_tests | bool | False | Include test results in report |
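A sketch combining these options; include_tests is taken from the table above and treated as a constructor argument (assumption), and eval_dataset is assumed to be built as shown earlier:
from evidently import Report
from evidently.presets import TextEvals

report = Report(
    metrics=[TextEvals()],
    include_tests=True,  # surface attached descriptor/metric tests in the report output
)
result = report.run(reference_data=None, current_data=eval_dataset)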
Report Output Formats
| Format | Method | Use Case |
|---|---|---|
| HTML | save_html(path) | Interactive visualization |
| JSON | save_json(path) or as_dict() | Machine parsing, APIs |
| Python dict | as_dict() | Programmatic access |
Sources: src/evidently/core/serialization.py
Test Suites
Creating a Test Suite
Test Suites run structured checks and return pass/fail status:
from evidently.test_suite import TestSuite
suite = TestSuite(tests=[
# Test definitions
])
result = suite.run(reference_data=reference_df, current_data=current_df)
suite.save("test_results.json")
Sources: src/evidently/legacy/test_suite/test_suite.py
Test Suite Workflow
graph LR
A[Input Data] --> B[Test Suite]
B --> C{Execute Tests}
C --> D[Test 1]
C --> E[Test 2]
C --> F[Test N]
D --> G[Test Results]
E --> G
F --> G
G --> H{Pass All?}
H -->|Yes| I[Success]
H -->|No| J[Failure Report]
Test Results Structure
| Field | Type | Description |
|---|---|---|
status | str | "PASSED", "FAILED", "WARNING" |
name | str | Test identifier |
group | str | Test category |
details | dict | Additional context and metrics |
timestamp | datetime | When the test was run |
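A small sketch of consuming these fields from the saved result file; the exact JSON layout (a top-level "tests" list) is an assumption based on the table above:
import json

with open("test_results.json") as f:
    results = json.load(f)

# Count failures and print their identifiers and details
failed = [t for t in results.get("tests", []) if t.get("status") == "FAILED"]
for test in failed:
    print(test.get("name"), test.get("group"), test.get("details"))
print(f"{len(failed)} failed test(s)")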
LLM Evaluation Features
LLM-Powered Judgments
Evidently supports LLM-based evaluation through specialized descriptors:
from evidently.descriptors import (
ContextQualityLLMEval,
CompletenessLLMEval,
)
| Descriptor | Purpose | Key Parameters |
|---|---|---|
ContextQualityLLMEval | Evaluate context relevance | question, provider, model |
CompletenessLLMEval | Check response completeness | context, provider, model, include_reasoning |
Sources: src/evidently/descriptors/generated_descriptors.py
LLM Provider Configuration
| Provider | Model Default | Configuration |
|---|---|---|
openai | gpt-4o-mini | API key required |
anthropic | Claude models | API key required |
azure | Configurable | Endpoint + key |
Evaluation Options
| Parameter | Type | Description |
|---|---|---|
include_category | bool | Include categorical classification |
include_score | bool | Include numerical score |
include_reasoning | bool | Include LLM reasoning |
uncertainty | Uncertainty | Strategy for handling uncertainty |
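A hedged example of configuring an LLM-based descriptor with these options; the parameter names mirror the tables above and may differ in the installed version:
from evidently.descriptors import CompletenessLLMEval

completeness = CompletenessLLMEval(
    "answer",
    context="context",        # column holding the retrieved context
    provider="openai",
    model="gpt-4o-mini",
    include_reasoning=True,   # ask the LLM judge to explain its verdict
)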
Metrics and Presets
Preset Configurations
Evidently provides pre-configured metric sets:
| Preset | Description | File |
|---|---|---|
TextEvals | Text quality metrics | src/evidently/presets |
DataDrift | Drift detection | Legacy metrics |
DataQuality | Quality checks | Legacy metrics |
Sources: src/evidently/metrics/_legacy.py
Legacy vs. Current API
Evidently maintains both legacy and current APIs for backward compatibility:
Legacy API Structure
src/evidently/legacy/
├── report/
│ └── report.py # Legacy Report class
└── test_suite/
└── test_suite.py # Legacy TestSuite class
Current API Structure
src/evidently/
├── core/
│ ├── report.py # Current Report class
│ ├── tests.py # Current test infrastructure
│ ├── compare.py # Data comparison utilities
│ └── serialization.py # Output serialization
Sources: src/evidently/metrics/_legacy.py
Usage Example: LLM Evaluation Pipeline
import pandas as pd
from evidently import Report, Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals
# 1. Prepare data
eval_df = pd.DataFrame([
["What is the capital of Japan?", "The capital of Japan is Tokyo."],
["Who painted the Mona Lisa?", "Leonardo da Vinci."],
["Can you write an essay?", "I'm sorry, but I can't assist with homework."]],
columns=["question", "answer"])
# 2. Create dataset with descriptors
eval_dataset = Dataset.from_pandas(
pd.DataFrame(eval_df),
data_definition=DataDefinition(),
descriptors=[
Sentiment("answer", alias="Sentiment"),
TextLength("answer", alias="Length"),
Contains("answer", items=["sorry", "apologize"], mode="any")
]
)
# 3. Run report
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
report.save_html("llm_eval_report.html")
Sources: README.md
Comparison: Report vs. Test Suite
| Aspect | Report | Test Suite |
|---|---|---|
| Output | Metrics, visualizations, insights | Pass/fail status, error details |
| Use Case | Exploratory analysis | Automated quality gates |
| Integration | Dashboards, documentation | CI/CD, monitoring alerts |
| Thresholds | Configurable, informative | Strict pass/fail boundaries |
| Exit Code | Always 0 (informational) | 0 (pass) or 1 (fail) |
Sources: src/evidently/core/tests.py
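A minimal CI-gate sketch built on the exit-code semantics above, reusing the suite from the Test Suites section; the as_dict() layout is an assumption:
import sys

suite.run(reference_data=reference_df, current_data=current_df)
summary = suite.as_dict()
failed = [t for t in summary.get("tests", []) if t.get("status") == "FAILED"]
sys.exit(1 if failed else 0)  # non-zero exit fails the pipeline step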
Serialization
Supported Formats
| Format | Extension | Use Case |
|---|---|---|
| JSON | .json | APIs, automation, storage |
| HTML | .html | Human review, sharing |
| Parquet | .parquet | Large-scale data storage |
Serialization Options
# Save as JSON
report.save_json("report.json")
# Save as HTML
report.save_html("report.html")
# Export as dictionary
data = report.as_dict()
Sources: src/evidently/core/serialization.py
Summary
Reports and Test Suites in Evidently provide complementary approaches to ML and LLM evaluation:
- Reports excel at providing comprehensive, visual analysis suitable for exploration and documentation
- Test Suites provide structured, automated quality checks ideal for CI/CD integration
- Both share the same underlying dataset and descriptor infrastructure
- LLM-powered evaluations are available through dedicated descriptors like ContextQualityLLMEval and CompletenessLLMEval
- Output can be serialized to JSON, HTML, or accessed programmatically via Python dictionaries
Sources: [src/evidently/legacy/report/report.py](https://github.com/evidentlyai/evidently/blob/main/src/evidently/legacy/report/report.py)
Presets and Metric Presets
Related topics: Reports and Test Suites, Custom Metrics and Extensibility
Overview
Presets in Evidently are pre-configured collections of metrics and evaluations designed to simplify the process of assessing machine learning models and LLM-powered systems. They provide ready-to-use evaluation templates that bundle relevant metrics, thresholds, and reporting components into cohesive units. This abstraction layer allows users to perform comprehensive model assessment without manually configuring individual metrics.
The preset system follows a modular architecture where each preset type addresses a specific use case domain. Users can instantiate presets and pass them directly to the Report or TestSuite objects, which execute the bundled evaluations and generate structured results.
Preset Architecture
graph TD
A[Evidently Presets] --> B[Standard Presets]
A --> C[Legacy Presets]
B --> B1[DataDriftPreset]
B --> B2[TargetDriftPreset]
B --> B3[ClassificationPreset]
B --> B4[RegressionPreset]
B --> B5[RecSysPreset]
B --> B6[TextEvals]
C --> C1[MetricPreset]
C --> C2[TestPreset]
D[Report / TestSuite] --> E[Integrates Presets]
E --> B
E --> C
Preset Types
Standard Presets (src/evidently/presets/)
The modern preset implementation resides in the src/evidently/presets/ directory. These presets integrate directly with the current Evidently reporting API.
#### DataDriftPreset
The DataDriftPreset evaluates feature-level drift between reference and current datasets. It calculates drift scores for individual features and provides aggregate drift statistics. This preset is particularly useful for monitoring data pipeline changes and detecting distribution shifts that may impact model performance.
from evidently import Report
from evidently.presets import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
#### TargetDriftPreset
The TargetDriftPreset specifically monitors drift in the target variable distribution. It is essential for supervised learning scenarios where changes in label distribution can signal underlying data quality issues or concept drift.
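Usage mirrors DataDriftPreset; a sketch, assuming the preset is importable from evidently.presets as listed above:
from evidently import Report
from evidently.presets import TargetDriftPreset

report = Report(metrics=[TargetDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)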
#### ClassificationPreset
The ClassificationPreset bundles metrics relevant to classification model evaluation. It includes accuracy metrics, confusion matrix analysis, class-level performance indicators, and probability calibration assessments. This preset supports both binary and multiclass classification scenarios.
#### RegressionPreset
The RegressionPreset provides comprehensive regression model evaluation including error distribution analysis, quantile-based metrics, and residual diagnostics. It helps identify heteroscedasticity, non-linearity, and systematic prediction biases.
#### RecSysPreset
The RecSysPreset is specialized for recommendation system evaluation. It includes ranking metrics, coverage indicators, and user-item interaction analysis tailored to collaborative filtering and content-based recommendation approaches.
#### TextEvals
The TextEvals preset targets LLM and NLP model evaluation. It encompasses text-specific metrics such as sentiment analysis, text length distributions, and content quality indicators. This preset integrates with the descriptors system for row-level text evaluations.
import pandas as pd
from evidently import Report, Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals

eval_dataset = Dataset.from_pandas(
    pd.DataFrame(eval_df),
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        Contains("answer", items=["sorry", "cannot"], alias="Denial")
    ]
)
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
Legacy Presets (src/evidently/legacy/)
The legacy preset system provided foundational abstractions for metric and test evaluation. These classes served as the original implementation pattern before the current preset architecture.
#### MetricPreset
The MetricPreset class defines the base interface for metric collection presets. It maintains a list of metrics and provides methods for execution and result aggregation.
#### TestPreset
The TestPreset class extends the preset concept to testing scenarios, enabling batch execution of statistical tests against model outputs. It provides assertions and threshold-based pass/fail criteria.
Preset Composition
Presets can be combined within a single report to create comprehensive evaluation suites:
from evidently import Report
from evidently.presets import DataDriftPreset, ClassificationPreset
report = Report(
metrics=[
DataDriftPreset(),
ClassificationPreset()
]
)
Integration with Report API
Presets integrate seamlessly with the Evidently Report class:
| Component | Role | Integration |
|---|---|---|
Report | Execution container | Accepts presets as metrics parameter |
Dataset | Data wrapper | Provides reference and current data |
DataDefinition | Schema definition | Describes columns and their roles |
| Preset | Metric bundle | Contains metric instances |
Descriptor System
The preset system works in conjunction with the descriptor system for row-level evaluations. Descriptors apply per-row transformations and evaluations:
from evidently.descriptors import Sentiment, TextLength, Contains
Descriptors included in the Dataset configuration are evaluated alongside preset metrics, enabling both aggregate and granular assessment within a single report.
Use Cases
Model Validation
Before deploying a model, use presets to validate performance on held-out data:
from evidently import Report
from evidently.presets import RegressionPreset

report = Report(metrics=[RegressionPreset()])
report.run(reference_data=training_df, current_data=validation_df)
Production Monitoring
Continuously monitor deployed models for performance degradation:
from evidently import Report
from evidently.presets import DataDriftPreset, ClassificationPreset

report = Report(metrics=[DataDriftPreset(), ClassificationPreset()])
report.run(reference_data=baseline_df, current_data=current_df)
LLM Evaluation
Assess LLM responses using text-specific presets:
from evidently.presets import TextEvals
from evidently import Dataset, DataDefinition
eval_dataset = Dataset.from_pandas(
    response_df,
    data_definition=DataDefinition(),
    descriptors=[...]  # row-level descriptors attach to the Dataset, not the Report
)
report = Report(metrics=[TextEvals()])
result = report.run(reference_data=None, current_data=eval_dataset)
Summary Table: Preset Comparison
| Preset | Domain | Key Metrics | Use Case |
|---|---|---|---|
DataDriftPreset | Data | Drift scores, feature importance | Data pipeline monitoring |
TargetDriftPreset | Data | Target distribution, PSI | Concept drift detection |
ClassificationPreset | ML | Accuracy, F1, ROC-AUC | Classifier evaluation |
RegressionPreset | ML | MAE, RMSE, R² | Regression analysis |
RecSysPreset | ML | Ranking metrics, coverage | Recommendation systems |
TextEvals | NLP/LLM | Sentiment, length, content | Text model evaluation |
Conclusion
The preset system in Evidently provides a powerful abstraction for organizing and executing model evaluations. By bundling related metrics into coherent units, presets reduce boilerplate and enable consistent evaluation practices across different model types and deployment scenarios.
Source: https://github.com/evidentlyai/evidently / Human Manual
Custom Metrics and Extensibility
Related topics: Reports and Test Suites, Presets and Metric Presets
Overview
Evidently provides a comprehensive extensibility system that allows users to create custom metrics, features, and test configurations. The framework follows a plugin-like architecture where custom implementations can be registered and discovered at runtime through the registry system.
The extensibility model encompasses several key areas:
- Custom Metrics: User-defined metrics that compute specific evaluation logic
- Custom Features: Row-level evaluators that extract or compute values from data
- Generators: Automatic metric and feature generation from column specifications
- Bound Tests: Test configurations attached to metrics with threshold definitions
This architecture enables users to extend the framework's built-in capabilities while maintaining consistency with the existing evaluation pipeline.
Registry System Architecture
The registry system serves as the central mechanism for discovering and managing extensions within Evidently. It provides a unified interface for registering, retrieving, and invoking custom implementations.
graph TD
A[User Code] --> B[Registry API]
B --> C[MetricRegistry]
B --> D[FeatureRegistry]
B --> E[TestRegistry]
C --> F[Metric Implementations]
D --> G[Feature Implementations]
E --> H[Test Configurations]
Core Registry Components
| Component | Purpose | File Location |
|---|---|---|
| MetricRegistry | Manages custom metric registrations | src/evidently/core/registries/__init__.py |
| ConfigRegistry | Stores configuration metadata | src/evidently/core/registries/configs.py |
| BoundTestRegistry | Handles test bindings and thresholds | src/evidently/core/registries/bound_tests.py |
The registry pattern follows a key-based lookup system where each extension is identified by a unique name. This allows for dynamic discovery and late binding of implementations at runtime.
Custom Metrics
Custom metrics extend the base Metric class to implement domain-specific evaluation logic. The framework provides both legacy and modern approaches for metric creation.
Metric Structure
A custom metric typically consists of:
- Configuration: Defines the metric's parameters and metadata
- Calculation Logic: Implements the calculate method to produce results
- Visualization: Provides rendering information through widget definitions
# Metric, Context, and MetricResult are assumed to come from Evidently's core metric API
class CustomMetric(Metric):
def __init__(self, column: str, threshold: float = 0.5):
self.column = column
self.threshold = threshold
def calculate(self, context: Context) -> MetricResult:
# Custom calculation logic
pass
Sources: src/evidently/legacy/metrics/custom_metric.py:1-50
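Once defined, the custom metric can be passed to a Report like any built-in metric; a sketch assuming the CustomMetric class above and a previously built eval_dataset:
from evidently import Report

report = Report(metrics=[CustomMetric(column="response_score", threshold=0.8)])
result = report.run(reference_data=None, current_data=eval_dataset)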
Metric Container System
The MetricContainer abstract base class provides the foundation for grouping related metrics. It implements a caching mechanism to avoid redundant metric generation.
graph LR
A[Context] --> B{Container Fingerprint}
B -->|Cache Hit| C[Return Cached Metrics]
B -->|Cache Miss| D[generate_metrics]
D --> E[Store in Context]
E --> C
The generate_metrics method must be implemented by subclasses to define the metrics to be computed:
def generate_metrics(self, context: "Context") -> Sequence[MetricOrContainer]:
"""Generate metrics based on the container configuration.
Args:
context: Context containing datasets and configuration.
Returns:
Sequence of Metric or MetricContainer objects to compute.
"""
raise NotImplementedError()
Sources: src/evidently/core/container.py:1-80
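An illustrative subclass, assuming the MetricContainer base from evidently.core.container cited above and the CustomMetric sketch from earlier; the container simply fans one metric out per column:
from typing import Sequence

class PerColumnContainer(MetricContainer):
    """Generate the same custom metric for every configured column (illustrative)."""

    def __init__(self, columns: Sequence[str], threshold: float = 0.5):
        self.columns = columns
        self.threshold = threshold

    def generate_metrics(self, context: "Context") -> Sequence["MetricOrContainer"]:
        # One CustomMetric instance per target column
        return [CustomMetric(column=c, threshold=self.threshold) for c in self.columns]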
Rendering and Widgets
Metrics can contribute visualization widgets through the render method. The widget system supports hierarchical composition where parent containers aggregate widgets from child metrics:
def render(
self,
context: "Context",
child_widgets: Optional[List[Tuple[Optional[MetricId], List[BaseWidgetInfo]]]] = None,
) -> List[BaseWidgetInfo]:
"""Render visualization widgets for this container.
Combines widgets from all child metrics/containers.
"""
Sources: src/evidently/core/container.py:80-120
Custom Features
Custom features extend the Descriptor base class to provide row-level evaluation capabilities. They can be applied to individual data points during dataset processing.
Feature Implementation Pattern
Features implement the descriptor pattern, wrapping column values with evaluation logic:
import pandas as pd

class CustomFeature(Descriptor):
    def __init__(self, column: str, alias: str = None):
        self.column = column
        self.alias = alias or column

    def apply(self, data: pd.Series) -> pd.Series:
        # Custom feature calculation; character count per row is shown as an illustration
        return data.astype(str).str.len()
Sources: src/evidently/legacy/features/custom_feature.py:1-40
Integration with Descriptors
The descriptor system integrates with the data definition framework, allowing features to be declared as part of a dataset configuration:
eval_dataset = Dataset.from_pandas(
pd.DataFrame(eval_df),
data_definition=DataDefinition(),
descriptors=[
Sentiment("answer", alias="Sentiment"),
TextLength("answer", alias="Length"),
Contains("answer", ["denial", "refuse"], alias="ContainsDenial")
]
)
Sources: README.md:1-50
Generators System
Generators provide an automated way to create metrics and features based on column specifications. This reduces boilerplate when working with large numbers of similar evaluations.
Column Generator
The column generator creates standardized metrics for numeric and categorical columns:
generator = ColumnGenerator(
columns=["feature_1", "feature_2", "feature_3"],
generators=[
MeanGenerator(),
StdGenerator(),
NullCountGenerator()
]
)
Sources: src/evidently/generators/column.py:1-60
Generator Configuration
Generators can be configured with:
| Parameter | Type | Description |
|---|---|---|
columns | List[str] | Target columns for generation |
include_missing | bool | Include columns with missing values |
exclude_patterns | List[str] | Regex patterns for column exclusion |
generators | List[Generator] | Specific generators to apply |
Sources: src/evidently/generators/__init__.py:1-50
Bound Tests and Thresholds
Bound tests attach threshold configurations to metrics, enabling automated pass/fail determinations based on metric results.
Test Binding Pattern
Tests are bound to metrics through the registry system:
class BoundTest:
def __init__(self, metric: MetricId, threshold: TestThreshold):
self.metric = metric
self.threshold = threshold
def evaluate(self, metric_result: MetricResult) -> TestResult:
# Compare metric value against threshold
pass
Sources: src/evidently/core/registries/bound_tests.py:1-50
Threshold Types
The framework supports multiple threshold configuration types:
| Threshold Type | Description | Use Case |
|---|---|---|
absolute | Fixed value comparison | Fixed acceptable ranges |
relative | Percentage-based comparison | Drift detection |
sigma | Standard deviation bounds | Anomaly detection |
quantile | Percentile-based thresholds | Distribution extremes |
Sources: src/evidently/core/registries/bound_tests.py:50-100
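A standalone sketch of how these threshold families differ; this is not Evidently's API, only the comparison logic implied by the table above:
def within_threshold(value, kind, bound, reference=None, sigma=None):
    """Illustrative threshold check for the four families in the table above."""
    if kind == "absolute":
        return value <= bound
    if kind == "relative":
        return reference is not None and abs(value - reference) / abs(reference) <= bound
    if kind == "sigma":
        return reference is not None and sigma is not None and abs(value - reference) <= bound * sigma
    if kind == "quantile":
        # bound is a precomputed percentile cut-off here
        return value <= bound
    raise ValueError(f"unknown threshold kind: {kind}")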
Configuration and Versioning
The extensibility system includes robust configuration management with versioning support for metrics and descriptors.
Config Version Model
Configurations are versioned to track changes over time:
from dataclasses import dataclass
from typing import Any
# STR_UUID and ConfigVersionMetadata are Evidently SDK types (see src/evidently/sdk/configs.py)

@dataclass
class ConfigVersion:
id: STR_UUID
artifact_id: STR_UUID
version: int
content: Any
metadata: ConfigVersionMetadata
Sources: src/evidently/sdk/configs.py:1-80
Metadata Structure
| Field | Type | Description |
|---|---|---|
created_at | datetime | Version creation timestamp |
updated_at | datetime | Last modification timestamp |
author | str | User who created/modified |
comment | str | Change description |
Sources: src/evidently/sdk/adapters.py:1-100
Workflow Diagram
The complete extensibility workflow from definition to execution:
graph TD
A[Define Custom Metric/Feature] --> B[Register in Registry]
B --> C[Create Dataset with Descriptors]
C --> D[Generate Metrics from Container]
D --> E[Calculate Results]
E --> F[Bind Tests with Thresholds]
F --> G[Evaluate Pass/Fail]
G --> H[Render Widgets]
H --> I[Display in Report/Dashboard]
Best Practices
Performance Considerations
- Caching: Use the container fingerprinting mechanism to cache generated metrics and avoid redundant calculations
- Lazy Evaluation: Implement metrics using lazy evaluation patterns when possible to defer expensive computations
- Batch Processing: Design features to operate on pandas Series rather than individual values for vectorized performance
Extensibility Guidelines
- Consistent Naming: Follow the established naming conventions for custom implementations
- Type Hints: Include comprehensive type hints for all public interfaces
- Documentation: Document parameters and return values using docstring conventions
- Testing: Create unit tests for custom metric calculation logic
Integration Points
Custom extensions integrate with the framework through several standardized interfaces:
- Metric Protocol: Implement the calculate(self, context) method
- Descriptor Protocol: Implement the apply(self, data) method
- Widget Protocol: Return List[BaseWidgetInfo] from the render() method
- Test Protocol: Implement the evaluate(self, result) method for bound tests
API Reference
Key Classes and Functions
| Class/Function | Module | Purpose |
|---|---|---|
MetricContainer | evidently.core.container | Base class for metric containers |
Descriptor | evidently.legacy.features | Base class for features |
ColumnGenerator | evidently.generators.column | Column-based metric generation |
BoundTest | evidently.core.registries | Threshold-bound test configuration |
ConfigVersion | evidently.sdk.configs | Versioned configuration storage |
Context Management
The Context object provides access to datasets and configuration during metric calculation:
class Context:
def metrics_container(self, fingerprint: str) -> Optional[List[MetricOrContainer]]:
"""Retrieve cached metrics for container fingerprint."""
def set_metric_container_data(
self,
fingerprint: str,
metrics: List[MetricOrContainer]
) -> None:
"""Store generated metrics in cache."""
Sources: src/evidently/core/container.py:50-80
Summary
Evidently's extensibility system provides a comprehensive framework for customizing evaluation logic. The registry-based architecture enables dynamic discovery of custom implementations while maintaining consistency with built-in features. Custom metrics and features extend the base classes to implement domain-specific logic, and the generator system automates repetitive metric definitions. The bound test mechanism attaches configurable thresholds to metrics for automated validation, making the system suitable for both exploratory analysis and production monitoring scenarios.
Sources: src/evidently/legacy/metrics/custom_metric.py:1-50
UI Service Backend
The Evidently UI Service Backend is a FastAPI-based REST API layer that powers the Evidently web interface, providing endpoints for managing projects, datasets, prompts, and trace data. It serves as the bridge between the React-based frontend and the Evidently SDK's core functionality.
Architecture Overview
The UI Service Backend follows a layered architecture:
graph TD
A[Frontend: React/TypeScript] --> B[UI Service Backend: FastAPI]
B --> C[Evidently SDK]
C --> D[Workspace Abstraction]
C --> E[Cloud Config API]
D --> F[(Local Storage)]
E --> G[(Remote Backend)]
Component Stack
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React, TypeScript, MUI | User interface components |
| Backend API | FastAPI/Python | REST API endpoints |
| Business Logic | Evidently SDK | Core evaluation logic |
| Data Layer | Local FS / Remote API | Persistence |
Core Data Models
Project Model
Projects are the primary organizational unit in the UI. The ProjectModel represents:
ProjectModel:
- name: str
- description: Optional[str]
- org_id: Optional[OrgID]
Sources: src/evidently/ui/workspace.py:50-55
Config API Abstraction
The ConfigAPI class provides a generic interface for managing versioned configurations:
| Method | Purpose |
|---|---|
create_config() | Create a new configuration |
get_config() | Retrieve a configuration by ID |
list_configs() | List all configurations |
update_config() | Update an existing configuration |
delete_config() | Delete a configuration |
add_version() | Add a new version to a config |
get_version() | Get a specific version |
Sources: src/evidently/sdk/configs.py:95-145
Descriptor Configuration
The Descriptor type is used to store and version descriptor configurations:
DescriptorConfigAPI:
- add_descriptor() -> ConfigVersion
- get_descriptor() -> Descriptor
Sources: src/evidently/sdk/configs.py:165-180
Workspace Abstraction
The Workspace abstract class defines the contract for project management:
graph TD
A[Workspace] --> B[LocalWorkspace]
A --> C[CloudWorkspace]
A --> D[RemoteWorkspace]
Abstract Methods
| Method | Parameters | Returns | Description |
|---|---|---|---|
create_project() | name, description, org_id | Project | Creates a new project |
add_project() | project: ProjectModel, org_id | Project | Adds project to workspace |
get_project() | project_id: STR_UUID | Optional[Project] | Retrieves project by ID |
delete_project() | project_id: STR_UUID | None | Removes project from workspace |
list_projects() | org_id: Optional | Sequence[Project] | Lists all projects |
Sources: src/evidently/ui/workspace.py:40-80
Frontend Components
Project Card Component
The ProjectCard component handles project display and editing:
ProjectCardProps:
- project: Project
- disabled?: boolean
- onEditProject: (args: { name: string; description: string }) => void
- LinkToProject: ComponentType
Features:
- Toggle between view and edit modes
- Uses EditProjectInfoForm for inline editing
- Displays ProjectInfoCard in view mode
Sources: ui/packages/evidently-ui-lib/src/components/Project/ProjectCard.tsx:60-80
Prompts Table
The PromptsTable component renders a sortable, paginated list of prompts:
| Column | Render | Features |
|---|---|---|
| ID | ID with copy button | TextWithCopyIcon component |
| Name | Truncated text (max 200px) | Typography component |
| Created at | Date formatted | dayjs locale formatting |
| Actions | Link + Delete button | Edit/delete operations |
Sources: ui/packages/evidently-ui-lib/src/components/Prompts/PromptsTable.tsx:40-75
Traces Table
The TracesTable component displays trace data with extended metadata:
| Column | Features |
|---|---|
| Tags | HidedTags component, 250px min width |
| Metadata | JsonViewThemed with clipboard support |
| Type | Chip showing trace origin |
| Created at | Sortable date column |
| Actions | Dataset link, edit dialog, delete |
Sources: ui/packages/evidently-ui-lib/src/components/Traces/TracesTable.tsx:50-90
API Endpoint Structure
Projects API
/projects
├── GET / - List all projects
├── POST / - Create new project
└── GET /{project_id} - Get project details
/projects/{project_id}/
├── prompts/
│ ├── GET / - List prompts
│ ├── POST / - Create prompt
│ └── GET /{id} - Get prompt details
├── datasets/
│ ├── GET / - List datasets
│ └── POST / - Create dataset
└── traces/
├── GET / - List traces
└── POST / - Create trace
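A hedged client-side sketch of calling these endpoints; the base URL, port, and response field names are assumptions about a locally running UI service:
import requests

BASE = "http://localhost:8000"  # assumed local UI service address

# List projects, then list prompts for the first project returned
projects = requests.get(f"{BASE}/projects").json()
if projects:
    project_id = projects[0]["id"]  # field name assumed
    prompts = requests.get(f"{BASE}/projects/{project_id}/prompts").json()
    print(f"{len(prompts)} prompt(s) in project {project_id}")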
Prompts API
| Endpoint | Method | Purpose |
|---|---|---|
/prompts | GET | List all prompts for a project |
/prompts | POST | Create a new prompt |
/prompts/{id} | GET | Get prompt details |
Sources: ui/service/src/routes/.../index-prompts-list/index-prompts-list-main.tsx:40-60
Configuration Management
Artifact Version Management
The ArtifactConfigAPI handles versioned artifact storage:
ArtifactConfigAPI:
- create_version() # Create new artifact version
- list_versions() # List all versions
- get_version() # Get specific version
- get_version_by_id() # Get by version ID
Sources: src/evidently/sdk/adapters.py:45-70
Version Conversion
Bidirectional conversion between SDK and API models:
graph LR
A[ArtifactVersion] -->|convert| B[ConfigVersion]
B -->|convert| A
Methods:
- _artifact_version_to_config_version() - SDK to API
- _config_version_to_artifact_version() - API to SDK
UI Utilities
HTML Link Templates
The utils.py module provides HTML templates for dashboard rendering:
HTML_LINK_WITH_ID_TEMPLATE # Link with button and ID display
FILE_LINK_WITH_ID_TEMPLATE # File link with ID
RUNNING_SERVICE_LINK_TEMPLATE # Service link with label
Sources: src/evidently/ui/utils.py:30-55
Workflow: Project Lifecycle
graph TD
A[Create Project] --> B[Add Datasets]
B --> C[Create Prompts]
C --> D[Run Evals]
D --> E[Store Traces]
E --> F[View Results]
F --> G[Monitor]
A -.->|via| H[Workspace.add_project]
C -.->|via| I[Prompts API]
E -.->|via| J[Traces API]
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
org_id | UUID | None | Organization identifier |
project_id | UUID | Auto | Project identifier |
version | int | "latest" | Config version selector |
Key SDK Classes
| Class | File | Responsibility |
|---|---|---|
ConfigAPI | configs.py | Generic config CRUD operations |
DescriptorConfigAPI | configs.py | Descriptor-specific operations |
CloudConfigAPI | configs.py | Remote backend communication |
Workspace | workspace.py | Abstract workspace interface |
ProjectModel | workspace.py | Project data structure |
Technology Stack
| Component | Technology | Version |
|---|---|---|
| Backend Framework | FastAPI | - |
| SDK Core | Python | 3.11+ |
| Frontend Framework | React | - |
| UI Components | MUI | - |
| State Management | React Hooks | - |
| Date Handling | dayjs | - |
Development Workflow
Running the Service
# In ui/service folder
pnpm dev
Code Quality
# In ui folder
pnpm code-check # Format, sort imports, lint
pnpm code-check --fix # Apply fixes automatically
Sources: ui/README.md:15-25
Building
pnpm build
Summary
The Evidently UI Service Backend provides:
- REST API layer for frontend-backend communication
- Project management via Workspace abstraction
- Versioned configuration storage using ConfigAPI
- Tracing support for evaluation history
- Prompts management for LLM-powered evaluations
The architecture cleanly separates concerns between the FastAPI backend, Evidently SDK business logic, and React frontend components, enabling modular development and testing.
Sources: src/evidently/ui/workspace.py:50-55
Doramagic Pitfall Log
Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.
1. Installation risk: Update scikit-learn version requirement to support v1.6.0
- Severity: high
- Finding: Installation risk is backed by a source signal: Update scikit-learn version requirement to support v1.6.0. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1407
2. Configuration risk: PromptOptimizer throws OpenAIError when using Vertex AI judge
- Severity: high
- Finding: Configuration risk is backed by a source signal: PromptOptimizer throws OpenAIError when using Vertex AI judge. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1856
3. Project risk: IndexError in infer_column_type when column contains only null values
- Severity: high
- Finding: Project risk is backed by a source signal: IndexError in infer_column_type when column contains only null values. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1764
4. Security or permission risk: Update evidently hashlib usage for FIPS-Compliant Systems and Security Best Practices
- Severity: high
- Finding: Security or permission risk is backed by a source signal: Update evidently hashlib usage for FIPS-Compliant Systems and Security Best Practices. Treat it as a review item until the current version is checked.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1410
5. Installation risk: Numpy 2.x support?
- Severity: medium
- Finding: Installation risk is backed by a source signal: Numpy 2.x support?. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/issues/1557
6. Installation risk: v0.7.12
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.7.12. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.12
7. Installation risk: v0.7.15
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.7.15. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.15
8. Installation risk: v0.7.20
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.7.20. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.20
9. Capability assumption: v0.7.19
- Severity: medium
- Finding: Capability assumption is backed by a source signal: v0.7.19. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.19
10. Capability assumption: v0.7.21
- Severity: medium
- Finding: Capability assumption is backed by a source signal: v0.7.21. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/evidentlyai/evidently/releases/tag/v0.7.21
11. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | github_repo:315977578 | https://github.com/evidentlyai/evidently | README/documentation is current enough for a first validation pass.
12. Maintenance risk: Maintainer activity is unknown
- Severity: medium
- Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | github_repo:315977578 | https://github.com/evidentlyai/evidently | last_activity_observed missing
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using evidently with real data or production workflows.
- IndexError in infer_column_type when column contains only null values - github / github_issue
- Protect this repo from AI-generated PRs - github / github_issue
- Numpy 2.x support? - github / github_issue
- PromptOptimizer throws OpenAIError when using Vertex AI judge - github / github_issue
- Update scikit-learn version requirement to support v1.6.0 - github / github_issue
- Update evidently hashlib usage for FIPS-Compliant Systems and Security B - github / github_issue
- v0.7.21 - github / github_release
- v0.7.20 - github / github_release
- v0.7.19 - github / github_release
- v0.7.18 - github / github_release
- v0.7.17 - github / github_release
- v0.7.16 - github / github_release
Source: Project Pack community evidence and pitfall evidence