ragas Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

ragas

Supercharge Your LLM Application Evaluations 🚀

Overview, Installation, and Architecture

Related topics: Evaluation Metrics, Collections, and Cost Tracking, Testset Generation, Datasets, and Storage Backends, LLM Adapters, Integrations, Prompt Optimization, and Troubleshooting

Ragas is a toolkit for evaluating and optimizing Large Language Model (LLM) applications. It provides objective metrics, intelligent test data generation, and integration hooks for popular LLM frameworks, observability platforms, and agent runtimes. According to the project README, Ragas aims to replace time-consuming subjective assessments with data-driven evaluation workflows, and can also produce production-aligned test sets when no evaluation data is available. Source: README.md:1-18

Core Capabilities

The repository exposes four high-level feature pillars that frame the rest of the project:

Pillar	Purpose
Objective Metrics	Evaluate LLM applications using LLM-based and traditional metrics
Test Data Generation	Automatically create comprehensive test datasets
Seamless Integrations	LangChain, LlamaIndex, observability tools, agent frameworks
Feedback Loops	Use production data to continually improve LLM applications

Source: README.md:11-15

Installation

Ragas is distributed on PyPI and can be installed via pip. Two installation paths are documented in the README:

# PyPI install
pip install ragas

# Install directly from the source repository
pip install git+https://github.com/vibrantlabsai/ragas

Source: README.md:20-30

The most common setup authenticates through the OPENAI_API_KEY environment variable; the README explicitly notes that this key must be set before running the examples. Source: README.md:97-101
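
As a minimal sketch of that precondition, using only the standard library, the check below fails fast when the key is missing. The helper name require_openai_key is hypothetical and is not part of Ragas.

```python
import os
from typing import Mapping, Optional


def require_openai_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OpenAI API key or raise a clear error (hypothetical helper)."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the examples."
        )
    return key
```

Calling require_openai_key() at the top of a script surfaces a missing key before any LLM request is attempted.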

High-Level Architecture

Ragas is organized as a small, layered core surrounded by a rich ecosystem of pluggable adapters and integrations. The architecture can be summarized as follows:

flowchart TB
    User[User / Application] --> CLI["ragas CLI<br/>(quickstart templates)"]
    User --> SDK["ragas SDK<br/>(Python API)"]
    SDK --> Metrics["Metrics<br/>(DiscreteMetric, AspectCritic)"]
    SDK --> Testset["Testset Generation<br/>(synthesizers + transforms)"]
    SDK --> Prompts["Prompt Engine<br/>(PydanticPrompt, MultiModalPrompt)"]
    Metrics --> LLM["LLM Adapters<br/>(llm_factory)"]
    Testset --> LLM
    Prompts --> LLM
    LLM --> Providers[(OpenAI, Bedrock,<br/>Mistral, Gemini, ...)]
    SDK --> Backends["Backend Registry<br/>(pluggable storage)"]
    SDK --> Integrations["Integrations<br/>(Langfuse, MLflow,<br/>LangSmith, Helicone,<br/>LangChain, LlamaIndex,<br/>LangGraph, Griptape,<br/>Swarm, AG-UI)"]
    Integrations --> Observability[(Observability Platforms)]
    Backends --> Storage[(Local / Remote Storage)]

This diagram reflects how user code enters the system through either the CLI or the Python SDK, and how the SDK funnels evaluation, generation, and prompting through a unified LLM adapter layer and an extensible backend / integration layer.

Key Modules

LLM Adapters

A central abstraction is StructuredOutputAdapter, which defines a single contract for adapters that produce structured outputs from any backend (e.g., Instructor, LiteLLM). Source: src/ragas/llms/adapters/base.py:4-32 This design lets llm_factory build a RagasLLM from a pre-initialized client and a model name, abstracting over the underlying provider. Community issue #2774 highlights that adapter coverage is still uneven: creating an adapter for the mistralai.Mistral client currently raises a ValueError, so users must ensure their provider is supported before calling llm_factory. Source: src/ragas/llms/adapters/base.py:18-32

Prompt Engine

The prompt engine is built on top of PydanticPrompt, a Pydantic-model-driven prompt abstraction that handles input/output typing, example storage, and persistence. Source: src/ragas/prompt/pydantic_prompt.py:1-20 The is_langchain_llm helper detects whether the active LLM is a LangChain or a Ragas LLM, with direct LangChain usage explicitly marked as deprecated in favor of the Ragas LLM interface. Source: src/ragas/prompt/pydantic_prompt.py:33-60

A companion MultiModalPrompt class extends this design to multi-modal inputs, and routes calls to either agenerate_prompt (LangChain) or generate (Ragas) depending on the LLM type. Source: src/ragas/prompt/multi_modal_prompt.py:115-148 Prompts are serialized to JSON (optionally gzip-compressed) following the format documented in prompt-formats.md, including support for DynamicFewShotPrompt with embedding-backed example selection. Source: src/ragas/prompt/prompt-formats.md:1-58

Testset Generation

The testset generation pipeline transforms raw documents into a knowledge graph, applies a configurable set of transforms, and then synthesizes queries and answers. The entry point TestsetGenerator.generate builds a KnowledgeGraph from input documents, calls apply_transforms, and dispatches to the synthesizer with a RunConfig. Source: src/ragas/testset/synthesizers/generate.py:18-78

The default transform pipeline (default_transforms) inspects document length distribution and conditionally attaches extractors such as HeadlinesExtractor, summaries, keyphrases, titles, and embeddings. Source: src/ragas/testset/transforms/default.py:1-60 LLMBasedNodeFilter subclasses — for example CustomNodeFilter — score each node for "question potential" so low-value chunks can be dropped before synthesis. Source: src/ragas/testset/transforms/filters.py:1-66

The HeadlinesExtractorPrompt and NERPrompt in llm_based.py illustrate the project's standard pattern for extractors: a PydanticPrompt with a typed input, a typed output (Headlines or NEROutput), and a small set of in-context examples. Source: src/ragas/testset/transforms/extractors/llm_based.py:1-110 The single-hop synthesizer uses a parallel QueryCondition → GeneratedQueryAnswer prompt, optionally conditioned on an llm_context to guide question style. Source: src/ragas/testset/synthesizers/single_hop/prompts.py:1-58

Backends

Backends are managed through a singleton BackendRegistry that supports both built-in and plugin-discovered backends, with alias resolution and validation enforced on registration. Source: src/ragas/backends/registry.py:1-66 The shared MemorableNames utility generates human-friendly, Docker-style adjective/noun identifiers for experiments and datasets. Source: src/ragas/backends/utils.py:1-46
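
The idea behind adjective/name identifiers can be sketched in a few lines. This is an illustration only: the class NameGenerator, its method, and both word lists are hypothetical and do not reproduce the MemorableNames implementation.

```python
import random
from typing import Optional, Set

# Illustrative word lists; the real utility ships its own vocabulary.
ADJECTIVES = ["brave", "calm", "eager", "gentle", "keen"]
SCIENTISTS = ["curie", "darwin", "hopper", "lovelace", "turing"]


class NameGenerator:
    """Hypothetical stand-in for a MemorableNames-like helper."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self._used: Set[str] = set()

    def unique_name(self) -> str:
        # Retry a bounded number of times, then disambiguate with a counter.
        for _ in range(100):
            name = f"{self._rng.choice(ADJECTIVES)}_{self._rng.choice(SCIENTISTS)}"
            if name not in self._used:
                self._used.add(name)
                return name
        name = f"{name}_{len(self._used)}"
        self._used.add(name)
        return name
```

Tracking used names keeps identifiers unique within one generator instance, which matters once the small vocabulary starts to repeat.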

Integrations

The integrations layer ships first-class adapters for tracing (Langfuse, MLflow), frameworks (LangChain, LlamaIndex, Griptape, LangGraph), observability (Helicone, LangSmith, Opik), platforms (Amazon Bedrock, R2R), agent systems (Swarm), and protocols (AG-UI). Source: src/ragas/integrations/__init__.py:1-20 Tracing integrations are imported lazily to keep optional dependencies optional, and the LangSmith helper upload_dataset converts a Testset into a pandas DataFrame before pushing it to a LangSmith dataset. Source: src/ragas/integrations/langsmith.py:1-56

Common Failure Modes and Known Issues

Several community-reported issues map directly to the modules above and are worth noting for new users:

Token usage tracking — get_token_usage_for_bedrock always returns 0 because it reads the wrong response_metadata keys for langchain-aws ChatBedrock / ChatBedrockConverse (issue #2779). When using Bedrock, token totals from the CostCallbackHandler cannot be trusted until the keys are corrected.
Model compatibility — AspectCritic raises runtime errors when the evaluator model is switched from gpt-4o to o3 (issue #2067), reflecting brittleness in metric prompts against newer reasoning models.
Non-LLM context metrics — NonLLMContextRecall and NonLLMContextPrecisionWithReference use inconsistent strict vs. non-strict comparisons (> vs >=) when thresholding string-similarity scores into binary relevance (issue #2777).
Deprecated import paths — The deprecation warning for LLMContextPrecisionWithoutReference points users to a class that does not exist in ragas.metrics.collections (issue #2748).
Multilingual extraction — Running HeadlinesExtractor on non-English content can fail with 'headlines' property not found in this node (issue #1775), so the default transformer pipeline may need to be customized for non-English corpora.
Optional dependencies — python-diskcache is imported inline but currently installed as a hard dependency; the community has requested it be made optional (issue #2622).
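
To see why the strict versus non-strict comparison in issue #2777 matters, consider a score that lands exactly on the threshold. The sketch below is a generic illustration; the function binarize and the sample scores are hypothetical and are not taken from the Ragas metrics.

```python
from typing import List


def binarize(scores: List[float], threshold: float, strict: bool) -> List[int]:
    """Turn similarity scores into 0/1 relevance flags."""
    if strict:
        return [1 if s > threshold else 0 for s in scores]
    return [1 if s >= threshold else 0 for s in scores]


scores = [0.4, 0.5, 0.9]
strict_flags = binarize(scores, threshold=0.5, strict=True)
inclusive_flags = binarize(scores, threshold=0.5, strict=False)
print(strict_flags)     # [0, 0, 1]
print(inclusive_flags)  # [0, 1, 1]
```

The middle item flips from irrelevant to relevant depending on the operator, so two metrics that disagree on it can report different results for the same retrieval.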

Evaluation Metrics, Collections, and Cost Tracking

Related topics: Overview, Installation, and Architecture, LLM Adapters, Integrations, Prompt Optimization, and Troubleshooting

Overview

Ragas provides an evaluation framework for Large Language Model (LLM) applications, with a strong emphasis on objective metrics, automated test data generation, and per-run cost tracking. The README introduces the framework as a "toolkit for evaluating and optimizing Large Language Model (LLM) applications" that supports both LLM-based and traditional metrics, integrations with popular LLM frameworks, and feedback-loop driven improvement.

The evaluation surface area in ragas is split into three coordinated concerns:

Metrics — pre-built and customizable scoring functions (e.g., faithfulness, answer correctness, noise sensitivity, aspect critique).
Collections — the structured module that exposes LLM-backed and non-LLM metrics through a unified import surface and that drives testset generation.
Cost Tracking — the TokenUsage / TokenUsageParser abstraction that allows per-run accounting of tokens and spend across providers (with provider-specific extractors such as get_token_usage_for_bedrock).

Metrics Module

The metrics package re-exports its prompt assets through a single entry point. As shown in src/ragas/prompt/metrics/__init__.py, ragas ships the following built-in metric prompts:

correctness_classifier_prompt — used by the answer correctness metric to label a response against a reference.
answer_relevancy_prompt — used by the answer relevance metric to score how pertinent an answer is to the question.
nli_statement_prompt and statement_generator_prompt — shared, low-level prompts exposed via common.py and used by composition-based metrics (for example, faithfulness and noise sensitivity decompose a response into atomic statements and judge them with an NLI step).

Source: src/ragas/prompt/metrics/__init__.py:5-16

Beyond these, ragas exposes higher-level metrics used in user code. DiscreteMetric (shown in the README quickstart) lets users define a custom evaluator with a fixed allowed_values vocabulary, while AspectCritic and NoiseSensitivity address aspect-based evaluation and robustness against irrelevant context respectively. The noise_sensitivity.py prompt demonstrates the canonical pattern: a context + list of statements input, decomposed into per-statement verdicts of 0 (irrelevant) or 1 (relevant), each accompanied by a reason.

Source: src/ragas/prompt/metrics/noise_sensitivity.py

LLM Interaction Pattern

All metric prompts ultimately subclass PydanticPrompt[InputModel, OutputModel], which in turn handles JSON-typed IO, response parsing, and LangChain/Ragas LLM dispatch. pydantic_prompt.py defines is_langchain_llm() to detect whether the supplied model is a legacy LangChain LLM or a Ragas LLM. multi_modal_prompt.py shows the dual code path: agenerate_prompt for LangChain LLMs, and ragas_llm.generate(...) for Ragas LLMs.

Source: src/ragas/prompt/multi_modal_prompt.py and src/ragas/prompt/pydantic_prompt.py

This split is the reason community issue #2774 reports ValueError for the mistralai.Mistral client — providers must be wired through llm_factory so the metric can dispatch correctly.

Collections and Testset Generation

Ragas organizes metrics and synthesizers under "collections" so that testset generation, evaluation, and prompt optimization share a common knowledge-graph and transform pipeline. The generate.py testset generator accepts a list of documents, applies a default set of transforms (extracting summaries, keyphrases, titles, headlines, and embeddings, and then building similarity edges), and runs a query_distribution of synthesizers to produce TestsetSample records.

Source: src/ragas/testset/synthesizers/generate.py

The generator's signature includes token_usage_parser: Optional[TokenUsageParser], which directly couples the collections surface to the cost tracking subsystem. This is the parameter users reach for when they want to know how many tokens testset generation consumed — a recurring community question (see issue #540, "Tokens Usage for evaluations and testset generations").

Cost Tracking and Token Usage

Cost tracking in ragas is built on two primitives:

Component	Role
`TokenUsageParser`	Accepts an `LLMResult` and returns a `TokenUsage` object; provider-agnostic entry point.
`get_token_usage_for_<provider>` (e.g., `get_token_usage_for_bedrock`)	Provider-specific extractor that reads `response_metadata` and returns the correct token counts.

Issue #2779 documents a real failure mode: get_token_usage_for_bedrock "always returns 0" because it reads the wrong response_metadata keys for langchain-aws ChatBedrock / ChatBedrockConverse. The fix requires inspecting the actual metadata emitted by langchain-aws and matching the right key paths. The same pattern recurs for every provider — the TokenUsageParser is the dispatch point, but the per-provider key extraction is where bugs surface.
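
One defensive way to write a per-provider extractor is to try several candidate key paths and to fail loudly instead of reporting zero. The sketch below assumes nothing about Ragas internals: the Usage class, extract_usage, and the key paths in CANDIDATES are hypothetical and must be checked against the metadata your provider actually returns.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Usage:
    """Stand-in for a token-usage record (not the Ragas TokenUsage class)."""
    input_tokens: int
    output_tokens: int


def _dig(data: Dict[str, Any], path: Sequence[str]) -> Optional[Any]:
    """Follow a key path through nested dicts, or return None."""
    current: Any = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical candidate key paths: (input path, output path).
CANDIDATES: Tuple[Tuple[Sequence[str], Sequence[str]], ...] = (
    (("usage", "prompt_tokens"), ("usage", "completion_tokens")),
    (("usage", "input_tokens"), ("usage", "output_tokens")),
)


def extract_usage(metadata: Dict[str, Any]) -> Usage:
    """Return the first usage pair found; raise instead of reporting zero."""
    for in_path, out_path in CANDIDATES:
        tokens_in, tokens_out = _dig(metadata, in_path), _dig(metadata, out_path)
        if tokens_in is not None and tokens_out is not None:
            return Usage(int(tokens_in), int(tokens_out))
    raise KeyError(f"no known usage keys in metadata: {sorted(metadata)}")
```

Raising on unknown metadata turns a silent zero into a visible error, which is the failure mode the issue describes.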

The wider cost module also publishes a CostCallbackHandler (referenced in the same community thread) that attaches to a run to collect usage in real time.

Integrations and Tracing

The integration layer in src/ragas/integrations/__init__.py documents the available surfaces: tracing (Langfuse, MLflow), frameworks (LangChain, LlamaIndex, Griptape, LangGraph), observability (Helicone, LangSmith, Opik), platforms (Amazon Bedrock, R2R), AI systems (Swarm), and protocols (AG-UI).

Source: src/ragas/integrations/__init__.py

Tracing imports are lazy: src/ragas/integrations/tracing/__init__.py only resolves observe, sync_trace, LangfuseTrace, and MLflowTrace on attribute access, so projects that do not install langfuse or mlflow still import ragas cleanly. Community issue #2622 ("Make python-diskcache dependency optional") asks for comparable treatment of diskcache, which is imported inline but is still installed as a hard dependency.
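
The attribute-access pattern relies on module-level __getattr__ (PEP 562). The sketch below is a generic, single-file illustration and not the Ragas source: the module name lazy_demo and the _LAZY table are hypothetical, and standard-library functions stand in for optional third-party packages.

```python
import importlib
import types

# Map public attribute names to (module, attribute) pairs resolved on demand.
# The standard-library targets here are placeholders for optional packages.
_LAZY = {
    "dumps": ("json", "dumps"),
    "sqrt": ("math", "sqrt"),
}


def _lazy_getattr(name: str):
    """Module-level __getattr__ (PEP 562): import only when first accessed."""
    if name not in _LAZY:
        raise AttributeError(f"module 'lazy_demo' has no attribute {name!r}")
    module_name, attr = _LAZY[name]
    return getattr(importlib.import_module(module_name), attr)


# Build a throwaway module so the sketch runs in a single file; in a real
# package the function would simply be named __getattr__ in __init__.py.
lazy_demo = types.ModuleType("lazy_demo")
lazy_demo.__getattr__ = _lazy_getattr

print(lazy_demo.sqrt(9.0))  # 3.0
```

Because the import happens inside the hook, a missing optional package only raises when its attribute is actually requested.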

Prompt Persistence

src/ragas/prompt/prompt-formats.md documents the on-disk JSON (optionally gzip-compressed) format used to save and load both Prompt and DynamicFewShotPrompt objects. The format includes a format_version field, a type discriminator, an instruction template, an examples array, and a response_model_info block. DynamicFewShotPrompt extends this with an embedding model reference and max_similar_examples / similarity_threshold fields, enabling example retrieval at inference time.

Source: src/ragas/prompt/prompt-formats.md
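
A round trip through plain and gzip-compressed JSON can be sketched with the standard library alone. The field names in example_payload follow the description above; the values, and the helper names save_prompt and load_prompt, are hypothetical and do not mirror the Ragas save/load API.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict


def save_prompt(payload: Dict[str, Any], path: Path) -> None:
    """Write JSON, gzip-compressed when the file name ends in .gz."""
    text = json.dumps(payload, indent=2, ensure_ascii=False)
    if path.suffix == ".gz":
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write(text)
    else:
        path.write_text(text, encoding="utf-8")


def load_prompt(path: Path) -> Dict[str, Any]:
    """Read a payload written by save_prompt."""
    if path.suffix == ".gz":
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            return json.load(handle)
    return json.loads(path.read_text(encoding="utf-8"))


# Field names follow the description above; the values are made up.
example_payload = {
    "format_version": "1.0",
    "type": "Prompt",
    "instruction": "Summarize the passage in one sentence.",
    "examples": [],
    "response_model_info": None,
}
```

Choosing compression from the file suffix keeps a single code path for both variants.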

Common Failure Modes

Wrong provider metadata keys — get_token_usage_for_bedrock returns zero because the keys differ between ChatBedrock and ChatBedrockConverse (issue #2779).
LLM factory dispatch — passing an unsupported client (e.g., raw mistralai.Mistral) to llm_factory raises ValueError (issue #2774).
AspectCritic model compatibility — switching the evaluator model to o3 causes runtime errors in the prompt/parser pipeline (issue #2067).
Threshold inconsistencies — non-LLM context metrics use mixed > vs >= comparisons when binarizing similarity (issue #2777).
Stale deprecation names — deprecation warnings may reference class names that no longer exist in ragas.metrics.collections (issue #2748).

Testset Generation, Datasets, and Storage Backends

Related topics: Overview, Installation, and Architecture, Evaluation Metrics, Collections, and Cost Tracking

Overview and Role

Ragas treats LLM-application evaluation as a two-sided problem: you need metrics to score outputs, and you need data to score against. The ragas.testset module provides the second half by synthesizing realistic Testset objects from your source documents, while ragas.backends provides pluggable storage for persisting both datasets and experiment results across local, cloud, and third-party systems.

The public surface is intentionally small. src/ragas/testset/__init__.py exports only TestsetGenerator, Testset, and TestsetSample. Storage backends follow a parallel minimalism: BaseBackend in src/ragas/backends/base.py declares six abstract methods (load, save, and list operations for datasets and for experiments), and the ragas.backends entry-point group lets third parties register new backends without forking the library (see src/ragas/backends/README.md).

Testset Generation Pipeline

The `TestsetGenerator` Class

The entry point is TestsetGenerator in src/ragas/testset/synthesizers/generate.py. It exposes two generation paths:

generate_with_langchain_docs(...) — accepts LangChain Document objects (or raw strings), constructs a KnowledgeGraph of Node(type=NodeType.CHUNK) entries from each page_content + metadata, runs apply_transforms(...), then delegates to generate(...).
generate(...) — the lower-level entry that operates on an already-populated KnowledgeGraph and supports a configurable QueryDistribution (single-hop, multi-hop, etc.).

Both methods accept the same core options: testset_size, query_distribution, run_config, callbacks, token_usage_parser, with_debugging_logs, raise_exceptions, and return_executor. When return_executor=True, the call returns the Executor instance instead of running to completion, allowing the caller to call executor.cancel() mid-run and later collect partial results via executor.results().

The pipeline is gated by an explicit precondition — both self.llm and transforms_llm must be set or a ValueError is raised.

Transforms and the Knowledge Graph

Before any query is generated, the source chunks are enriched by a pipeline of Transforms. The default pipeline, defined in src/ragas/testset/transforms/default.py, bins documents by token length and conditionally adds:

HeadlinesExtractor (and the Headlines / NER Pydantic prompts in src/ragas/testset/transforms/extractors/llm_based.py) when ≥25% of documents exceed 500 tokens,
SummaryExtractor and KeyphrasesExtractor for shorter corpora,
Embedding-based similarity edges between nodes.
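
The length-based gate can be illustrated as follows. This is a simplified sketch rather than the default_transforms code: the function names are hypothetical, and whitespace splitting is only an assumed stand-in for real token counting.

```python
from typing import List

LONG_DOC_TOKENS = 500     # threshold quoted in the text above
LONG_DOC_FRACTION = 0.25  # share of long documents that enables headlines


def approx_token_count(text: str) -> int:
    """Crude whitespace tokenizer, used here only as a stand-in."""
    return len(text.split())


def wants_headlines(documents: List[str]) -> bool:
    """True when at least 25% of documents exceed 500 (approximate) tokens."""
    if not documents:
        return False
    long_docs = sum(
        1 for doc in documents if approx_token_count(doc) > LONG_DOC_TOKENS
    )
    return long_docs / len(documents) >= LONG_DOC_FRACTION
```

With one 600-word document among four, the fraction is exactly 0.25 and the headline path is selected; with one among five it is not.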

Filters drop low-utility nodes. CustomNodeFilter in src/ragas/testset/transforms/filters.py scores each chunk via the QuestionPotentialPrompt (1–5) and drops nodes scoring below min_score=2, falling back to a parent-document summary when one exists. If a chunk has no summary at all, the filter logs a warning and skips it.

flowchart LR
    A[LangChain Documents / Strings] --> B[KnowledgeGraph: Chunk Nodes]
    B --> C[apply_transforms: Headlines / Summary / Keyphrases / Embeddings]
    C --> D[CustomNodeFilter: QuestionPotential]
    D --> E[QueryDistribution Synthesizers]
    E --> F[Testset samples: query + reference + contexts]
    F --> G{return_executor?}
    G -- false --> H[Testset]
    G -- true --> I[Executor .results / .cancel]

Query Synthesis Prompts

The QueryCondition → GeneratedQueryAnswer pairs defined in src/ragas/testset/synthesizers/single_hop/prompts.py drive single-hop generation: persona, term, query style, query length, and an optional llm_context constrain the LLM so generated questions stay grounded in the source chunks.

Datasets and Schema

A generated test set is a list of TestsetSample records aggregated in a Testset container (see src/ragas/testset/synthesizers/testset_schema.py). Each sample carries the synthesized query, a reference answer, the contexts used to produce it, and the originating node references for traceability.

The prompt layer underneath, PydanticPrompt in src/ragas/prompt/pydantic_prompt.py together with the MultiModalPrompt variant in src/ragas/prompt/multi_modal_prompt.py, dispatches to either LangChain agenerate_prompt or Ragas-native BaseRagasLLM.generate and parses the output through a RagasOutputParser, raising RagasOutputParserException on schema failure. This indirection is what lets a single TestsetGenerator work across OpenAI, Bedrock, Vertex, and LlamaIndex backends (the adapter contract is defined in src/ragas/llms/adapters/base.py).

LangSmith Integration

src/ragas/integrations/langsmith.py exposes upload_dataset(testset, dataset_name, dataset_desc=""). The function first queries LangSmith to confirm that the dataset name is not already taken, then converts the Testset to a pandas DataFrame and uploads it. A duplicate name raises ValueError. Broader integrations (Langfuse, MLflow, Helicone, Opik, Swarm, AG-UI) are catalogued in src/ragas/integrations/__init__.py and imported lazily so optional dependencies stay optional.

Storage Backends

`BaseBackend` Interface

src/ragas/backends/base.py declares the contract every storage backend must satisfy. Implementations receive and return List[Dict[str, Any]], raise FileNotFoundError on missing names, and return an empty list (never None) for empty datasets. The abstract methods are:

Method	Purpose
`load_dataset(name)`	Return records for a dataset, or raise `FileNotFoundError`
`save_dataset(name, data, ...)`	Persist dataset records atomically
`load_experiment(name)`	Mirror of `load_dataset` for experiment runs
`save_experiment(name, data, ...)`	Mirror of `save_dataset` for experiment runs
`list_datasets()` / `list_experiments()`	Enumerate stored names

The file-based convention is storage_root/datasets/ and storage_root/experiments/, so users can point multiple backends at the same root safely.
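
The contract is small enough to sketch with an in-memory implementation. Everything below is illustrative: BackendContract is a stand-in abstract class rather than the Ragas BaseBackend, InMemoryBackend is hypothetical, and the extra save parameters elided with "..." in the table are omitted.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Records = List[Dict[str, Any]]


class BackendContract(ABC):
    """Stand-in for the contract described above; not the Ragas base class."""

    @abstractmethod
    def load_dataset(self, name: str) -> Records: ...

    @abstractmethod
    def save_dataset(self, name: str, data: Records) -> None: ...

    @abstractmethod
    def load_experiment(self, name: str) -> Records: ...

    @abstractmethod
    def save_experiment(self, name: str, data: Records) -> None: ...

    @abstractmethod
    def list_datasets(self) -> List[str]: ...

    @abstractmethod
    def list_experiments(self) -> List[str]: ...


class InMemoryBackend(BackendContract):
    """Keeps records in two dictionaries; intended for tests, not persistence."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Records] = {}
        self._experiments: Dict[str, Records] = {}

    @staticmethod
    def _load(store: Dict[str, Records], name: str) -> Records:
        if name not in store:
            raise FileNotFoundError(name)  # missing names raise
        return [dict(row) for row in store[name]]  # empty data yields []

    @staticmethod
    def _save(store: Dict[str, Records], name: str, data: Records) -> None:
        store[name] = [dict(row) for row in data]  # copy to avoid aliasing

    def load_dataset(self, name: str) -> Records:
        return self._load(self._datasets, name)

    def save_dataset(self, name: str, data: Records) -> None:
        self._save(self._datasets, name, data)

    def load_experiment(self, name: str) -> Records:
        return self._load(self._experiments, name)

    def save_experiment(self, name: str, data: Records) -> None:
        self._save(self._experiments, name, data)

    def list_datasets(self) -> List[str]:
        return sorted(self._datasets)

    def list_experiments(self) -> List[str]:
        return sorted(self._experiments)
```

Datasets and experiments live in separate stores here, mirroring the separate datasets/ and experiments/ directories of the file-based convention.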

Plugin Discovery

Third-party backends are registered via Python entry points (see src/ragas/backends/README.md). The package declares ragas.backends as the entry-point group; on first access get_registry() iterates registered entry points and lazy-loads each name -> backend_class mapping. To debug discovery, inspect the registered names with registry.keys(). Bundled helpers like MemorableNames in src/ragas/backends/utils.py generate human-readable identifiers (adjective + scientist) for experiment and dataset names.
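
Entry-point discovery itself is a standard-library feature, so the mechanism can be sketched without Ragas. The functions discover_backends and load_backend below are hypothetical helpers, not the registry implementation; only the group name ragas.backends comes from the text above.

```python
import sys
from importlib import metadata
from typing import Any, Dict


def discover_backends(group: str = "ragas.backends") -> Dict[str, Any]:
    """Map entry-point names to unloaded EntryPoint objects for `group`."""
    if sys.version_info >= (3, 10):
        found = metadata.entry_points(group=group)
    else:
        # Before Python 3.10 entry_points() returns a dict keyed by group.
        found = metadata.entry_points().get(group, [])
    return {entry.name: entry for entry in found}


def load_backend(name: str, group: str = "ragas.backends") -> Any:
    """Import the backend class lazily, only when it is requested."""
    registry = discover_backends(group)
    if name not in registry:
        raise KeyError(f"unknown backend {name!r}; known: {sorted(registry)}")
    return registry[name].load()
```

Printing sorted(discover_backends()) is a quick way to confirm that a plugin's entry point was installed into the current environment.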

Google Drive Backend

The Google Drive backend, documented in src/ragas/backends/gdrive_backend.md, stores datasets and experiments as Google Sheets inside two folders (datasets/, experiments/). It supports both Service Account and OAuth 2.0 authentication and is installed via pip install "ragas[gdrive]".

Known Limitations (Community Notes)

Several open issues surfaced by users intersect with this surface area:

Issue #1775 reports headlines property missing on nodes when generating test data in non-English languages — this is the HeadlinesExtractor step failing because the LLM did not emit structured output for that locale.
Issue #1688 tracks AttributeError('StringIO' object has no attribute 'classifications') during testset generation, again tied to the LLM extractor prompts.
Issue #2231 collects feedback on the future of this module ahead of v0.4 — relevant if you maintain a fork.
Issue #540 covers token-usage accounting for testset generation, which the token_usage_parser argument on generate(...) exists to address (see src/ragas/cost.py).

LLM Adapters, Integrations, Prompt Optimization, and Troubleshooting

Related topics: Overview, Installation, and Architecture, Evaluation Metrics, Collections, and Cost Tracking

Overview

Ragas provides a modular stack for connecting to LLMs, observability platforms, and prompt frameworks. The system is composed of three cooperating layers:

Adapter layer — A pluggable abstraction over the Instructor and LiteLLM libraries, exposed through the llm_factory function and the StructuredOutputAdapter base class.
Integration layer — Optional modules that bridge Ragas to LangChain, LlamaIndex, MLflow, Langfuse, Amazon Bedrock, and tracing/observability tooling.
Prompt layer — JSON-serializable, versioned prompts with optional few-shot example stores used by metrics and testset synthesizers.

This page documents how those layers fit together and how to recover from the most frequently reported community issues.

LLM Adapters

The adapter layer normalizes structured-output generation across providers. The base abstraction is defined in src/ragas/llms/adapters/base.py:

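import typing as t
from abc import ABC, abstractmethod


class StructuredOutputAdapter(ABC):
    # Each concrete adapter turns a pre-initialized client into an LLM wrapper.
    @abstractmethod
    def create_llm(
        self,
        client: t.Any,
        model: str,
        provider: str,
        **kwargs,
    ) -> t.Any: ...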

Concrete adapters (e.g. InstructorLLM, LiteLLMLLM) implement create_llm() and are selected automatically by llm_factory() based on the requested adapter string, either "instructor" or "litellm". Source: src/ragas/llms/adapters/base.py:1-31 The factory surfaces a low-temperature InstructorModelArgs default (temperature 0.1, max_tokens 1024) and accepts a cache backend such as DiskCacheBackend() to deliver "60x" repeated-evaluation speed-ups. Source: src/ragas/llms/base.py:1-120

Provider-specific wrappers complement the generic factory. For example, the OCI Gen AI wrapper performs lazy initialization of the SDK and raises ImportError only when the SDK is unavailable and neither an explicit client nor an endpoint_id is supplied:

if (
    self.client is None
    and GenerativeAiClient is None
    and self.endpoint_id is None
):
    raise ImportError("OCI SDK not found. Please install it with: pip ...")

Source: src/ragas/llms/oci_genai_wrapper.py:1-80

Known Adapter Failure Modes

The community has reported two recurring adapter bugs:

mistralai client — llm_factory raises ValueError because the mistral provider branch is missing from the adapter registry. Workaround: use the litellm adapter instead.
OpenAI o3 with AspectCritic — Structured-output parsing fails because o3 does not honor the default JSON schema the way gpt-4o does. The published mitigation is to fall back to the instructor adapter with mode=Mode.MD_JSON for backends that ignore response_format.

Both failures are routed through the same InstructorBaseRagasLLM interface, so swapping adapters requires no caller-side changes. Source: src/ragas/llms/base.py:1-120
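
The adapter-selection idea, including the early ValueError for unsupported choices, can be sketched with stand-in classes. None of the names below (Adapter, EchoAdapter, ADAPTERS, build_llm) exist in Ragas; they only illustrate the dispatch pattern.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Adapter(ABC):
    """Stand-in for the adapter contract; not the Ragas class itself."""

    @abstractmethod
    def create_llm(self, client: Any, model: str, provider: str, **kwargs: Any) -> Any: ...


class EchoAdapter(Adapter):
    """Toy adapter that returns a description instead of a real LLM wrapper."""

    def __init__(self, label: str) -> None:
        self.label = label

    def create_llm(self, client: Any, model: str, provider: str, **kwargs: Any) -> Any:
        return {"adapter": self.label, "model": model, "provider": provider}


ADAPTERS: Dict[str, Adapter] = {
    "instructor": EchoAdapter("instructor"),
    "litellm": EchoAdapter("litellm"),
}


def build_llm(model: str, client: Any, provider: str, adapter: str = "instructor") -> Any:
    """Hypothetical factory: fail early and explicitly on unknown adapters."""
    if adapter not in ADAPTERS:
        raise ValueError(
            f"unsupported adapter {adapter!r}; choose from {sorted(ADAPTERS)}"
        )
    return ADAPTERS[adapter].create_llm(client, model, provider)
```

Because every adapter returns through the same create_llm signature, callers can switch the adapter string without touching the rest of their code.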

Integrations

Ragas groups integrations under src/ragas/integrations/. The package re-exports optional submodules so that missing third-party SDKs do not hard-fail on import. Tracing integrations live in src/ragas/integrations/tracing/, which supports Langfuse and MLflow with observe() and sync_trace() helpers. Source: src/ragas/integrations/__init__.py:1-22, src/ragas/integrations/tracing/__init__.py:1-80

The framework-level integrations span LangChain, LlamaIndex, Griptape, LangGraph, Helicone, LangSmith, Opik, Amazon Bedrock, R2R, Swarm, and AG-UI. Because each is imported lazily through __getattr__, users can install only the bindings they need.

The example below, drawn from the README, shows the canonical entry point that all integrations share:

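import asyncio

from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics import DiscreteMetric

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
llm = llm_factory("gpt-4o", client=client)
metric = DiscreteMetric(name="summary_accuracy", allowed_values=["accurate", "inaccurate"], prompt="...")

async def main():
    score = await metric.ascore(llm=llm, response="...")  # ascore is a coroutine

asyncio.run(main())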

Source: README.md:1-80

Prompt Optimization

Prompts are first-class, persistable objects. The PydanticPrompt[InputModel, OutputModel] base class renders a deterministic prompt string that combines an instruction, JSON-schema output spec, optional few-shot examples, and the serialized input. Source: src/ragas/prompt/metrics/base_prompt.py:1-90
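
How such a prompt string might be assembled can be sketched with the standard library. This is not the PydanticPrompt renderer: render_prompt, its parameters, and the exact wording of the rendered text are hypothetical, and plain dicts stand in for Pydantic models.

```python
import json
from typing import Any, Dict, List, Tuple

Example = Tuple[Dict[str, Any], Dict[str, Any]]  # (input, expected output)


def render_prompt(
    instruction: str,
    output_schema: Dict[str, Any],
    examples: List[Example],
    payload: Dict[str, Any],
) -> str:
    """Assemble the four parts in a fixed order so output is reproducible."""

    def dump(value: Dict[str, Any]) -> str:
        # sort_keys makes the rendering independent of dict insertion order
        return json.dumps(value, sort_keys=True, ensure_ascii=False)

    parts = [instruction, "Respond with JSON matching this schema:", dump(output_schema)]
    for index, (example_in, example_out) in enumerate(examples, start=1):
        parts.append(f"Example {index}")
        parts.append(f"Input: {dump(example_in)}")
        parts.append(f"Output: {dump(example_out)}")
    parts.append(f"Input: {dump(payload)}")
    parts.append("Output:")
    return "\n".join(parts)
```

Sorting keys during serialization is one simple way to keep the rendered text identical across runs, which helps when prompts are cached or compared.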

Prompt Persistence

Prompt and DynamicFewShotPrompt use JSON (optionally gzip-compressed to .json.gz). The reference doc compares the two formats:

Feature	`Prompt`	`DynamicFewShotPrompt`
Type ID	`"Prompt"`	`"DynamicFewShotPrompt"`
Embedding-backed example selection	❌	✅
Similarity threshold / max similar examples	❌	✅
Stores `response_model_info`	✅	✅

Source: src/ragas/prompt/prompt-formats.md:1-50

For testset generation, generate.py orchestrates a sequence of synthesizers driven by configurable query_distribution, transforms_llm, and transforms_embedding_model. The default transform pipeline (default.py) inspects document-length bins and conditionally enables HeadlinesExtractor when ≥25% of documents exceed 500 tokens, helping LLM-based extractors avoid the StringIO and 'headlines' property not found errors reported in community issues #1688 and #1775. Source: src/ragas/testset/transforms/default.py:1-80, src/ragas/testset/synthesizers/generate.py:1-100

LLM-based extractors (SummaryExtractor, KeyphrasesExtractor, TitleExtractor, HeadlinesExtractor, NERExtractor, TopicDescriptionExtractor) all extend PydanticPrompt and share the same example/few-shot machinery, making them a natural place to plug in DSPy-style optimizers introduced in v0.4.3. Source: src/ragas/testset/transforms/extractors/__init__.py:1-30, src/ragas/testset/transforms/extractors/llm_based.py:1-120

Troubleshooting

The most common community-reported problems map to specific subsystems:

Symptom	Likely cause	Where to look
`ValueError` from `llm_factory`	Unsupported provider / client combo	`src/ragas/llms/base.py`, `adapters/base.py`
`AttributeError('StringIO' object has no attribute 'classifications')`	Outdated extractor running on `StringIO` payloads	`src/ragas/testset/transforms/extractors/llm_based.py`
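`'headlines' property not found in this node`	Headline extraction left the node without a `headlines` property, e.g. the document was too short for the default transform to run it, or the LLM returned no structured output for non-English content (issue #1775)	`src/ragas/testset/transforms/default.py`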
Token usage always 0 for Bedrock	`response_metadata` keys differ across `ChatBedrock` / `ChatBedrockConverse`	LLM callback handler (see issue #2779)
Non-LLM context metrics threshold inconsistency (`>` vs `>=`)	Binary relevance boundary differs between `NonLLMContextRecall` and `NonLLMContextPrecisionWithReference`	Issue #2777

For non-English corpora, ensure Language is set on prompts and that transforms_llm is instructed to keep headings in the source language; this avoids the headline-extractor pipeline failing partway through generate(). Source: src/ragas/testset/synthesizers/single_hop/prompts.py:1-60

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 17 structured pitfall item(s), including 9 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

Severity: high
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2760

2. Installation risk: Installation risk requires verification

Severity: high
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2748

3. Configuration risk: Configuration risk requires verification

Severity: high
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2774

4. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2067

5. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2732

6. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2622

7. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2649

8. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2692

9. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2779

10. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/vibrantlabsai/ragas

11. Runtime risk: Runtime risk requires verification

Severity: medium
Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2762

12. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/vibrantlabsai/ragas

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using ragas with real data or production workflows.

Proposal: Contribute English/Uzbek Multilingual RAG Evaluation Dataset t - github / github_issue
get_token_usage_for_bedrock always returns 0 (reads wrong response_metad - github / github_issue
AspectCritic not working with openai o3 - github / github_issue
Feature request: Add AgentThreatBench memory poison task as a RAG securi - github / github_issue
llm_factory raises ValueError when using mistralai client - github / github_issue
[Security] Agentic Workflow Injection in Claude Docs Check (https://github.com/vibrantlabsai/ragas/issues/2692) - github / github_issue
NonLLMContextRecall and NonLLMContextPrecisionWithReference threshold re - github / github_issue
Website footer GitHub link redirects to incorrect repository/user - github / github_issue
Incorrect class name in deprecation warning for LLMContextPrecisionWitho - github / github_issue
Add EvaluationResult summary and threshold checks - github / github_issue
Make python-diskcache dependency optional - github / github_issue
v0.4.3 - github / github_release

Source: Project Pack community evidence and pitfall evidence
