# https://github.com/vibrantlabsai/ragas Project Manual

Generated at: 2026-06-26 00:02:10 UTC

## Table of Contents

- [Overview, Installation, and Architecture](#page-1)
- [Evaluation Metrics, Collections, and Cost Tracking](#page-2)
- [Testset Generation, Datasets, and Storage Backends](#page-3)
- [LLM Adapters, Integrations, Prompt Optimization, and Troubleshooting](#page-4)

<a id='page-1'></a>

## Overview, Installation, and Architecture

### Related Pages

Related topics: [Evaluation Metrics, Collections, and Cost Tracking](#page-2), [Testset Generation, Datasets, and Storage Backends](#page-3), [LLM Adapters, Integrations, Prompt Optimization, and Troubleshooting](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/vibrantlabsai/ragas/blob/main/README.md)
- [src/ragas/integrations/__init__.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/integrations/__init__.py)
- [src/ragas/backends/registry.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/backends/registry.py)
- [src/ragas/backends/utils.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/backends/utils.py)
- [src/ragas/testset/synthesizers/generate.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/synthesizers/generate.py)
- [src/ragas/testset/transforms/default.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/default.py)
- [src/ragas/testset/transforms/filters.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/filters.py)
- [src/ragas/testset/transforms/extractors/llm_based.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/extractors/llm_based.py)
- [src/ragas/prompt/pydantic_prompt.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/pydantic_prompt.py)
- [src/ragas/prompt/multi_modal_prompt.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/multi_modal_prompt.py)
- [src/ragas/llms/adapters/base.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/llms/adapters/base.py)
- [src/ragas/integrations/langsmith.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/integrations/langsmith.py)
- [src/ragas/prompt/prompt-formats.md](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/prompt-formats.md)
</details>

# Overview, Installation, and Architecture

Ragas is a toolkit for evaluating and optimizing Large Language Model (LLM) applications. It provides objective metrics, intelligent test data generation, and integration hooks for popular LLM frameworks, observability platforms, and agent runtimes. According to the project README, Ragas aims to replace time-consuming subjective assessments with data-driven evaluation workflows, and can also produce production-aligned test sets when no evaluation data is available. Source: [README.md:1-18]()

## Core Capabilities

The repository exposes four high-level feature pillars that frame the rest of the project:

| Pillar | Purpose |
|--------|---------|
| Objective Metrics | Evaluate LLM applications using LLM-based and traditional metrics |
| Test Data Generation | Automatically create comprehensive test datasets |
| Seamless Integrations | LangChain, LlamaIndex, observability tools, agent frameworks |
| Feedback Loops | Use production data to continually improve LLM applications |

Source: [README.md:11-15]()

## Installation

Ragas is distributed on PyPI and can be installed via `pip`. Two installation paths are documented in the README:

```bash
# PyPI install
pip install ragas

# Install directly from the source repository
pip install git+https://github.com/vibrantlabsai/ragas
```

Source: [README.md:20-30]()

The most common setup authenticates through the `OPENAI_API_KEY` environment variable; the README explicitly notes that this key must be set before running the examples. Source: [README.md:97-101]()
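
A fail-fast check for the key can make a missing variable obvious before any example runs. The sketch below is only an illustration: `require_api_key` is a hypothetical helper and is not part of Ragas.

```python
import os
from typing import Mapping, Optional


def require_api_key(name: str = "OPENAI_API_KEY",
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the key stored under `name`, or fail with a clear message.

    Hypothetical helper, not a Ragas API. `env` defaults to os.environ and
    is injectable so the check can be exercised without touching the shell.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the examples")
    return value
```

Calling `require_api_key()` at the top of a script surfaces the problem at startup rather than at the first LLM call.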

## High-Level Architecture

Ragas is organized as a small, layered core surrounded by a rich ecosystem of pluggable adapters and integrations. The architecture can be summarized as follows:

```mermaid
flowchart TB
    User[User / Application] --> CLI["ragas CLI<br/>(quickstart templates)"]
    User --> SDK["ragas SDK<br/>(Python API)"]
    SDK --> Metrics["Metrics<br/>(DiscreteMetric, AspectCritic)"]
    SDK --> Testset["Testset Generation<br/>(synthesizers + transforms)"]
    SDK --> Prompts["Prompt Engine<br/>(PydanticPrompt, MultiModalPrompt)"]
    Metrics --> LLM["LLM Adapters<br/>(llm_factory)"]
    Testset --> LLM
    Prompts --> LLM
    LLM --> Providers[(OpenAI, Bedrock,<br/>Mistral, Gemini, ...)]
    SDK --> Backends["Backend Registry<br/>(pluggable storage)"]
    SDK --> Integrations["Integrations<br/>(Langfuse, MLflow,<br/>LangSmith, Helicone,<br/>LangChain, LlamaIndex,<br/>LangGraph, Griptape,<br/>Swarm, AG-UI)"]
    Integrations --> Observability[(Observability Platforms)]
    Backends --> Storage[(Local / Remote Storage)]
```

This diagram reflects how user code enters the system through either the CLI or the Python SDK, and how the SDK funnels evaluation, generation, and prompting through a unified LLM adapter layer and an extensible backend / integration layer.

## Key Modules

### LLM Adapters

A central abstraction is `StructuredOutputAdapter`, which defines a single contract for adapters that produce structured outputs from any backend (e.g., Instructor, LiteLLM). Source: [src/ragas/llms/adapters/base.py:4-32]() This design lets `llm_factory` build a `RagasLLM` from a pre-initialized client and a model name, abstracting over the underlying provider. Community issue #2774 highlights that adapter coverage is still uneven: creating an adapter for the `mistralai.Mistral` client currently raises a `ValueError`, so users must ensure their provider is supported before calling `llm_factory`. Source: [src/ragas/llms/adapters/base.py:18-32]()
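
The adapter idea can be pictured with a minimal sketch. Every name below (`SketchAdapter`, `EchoAdapter`, `sketch_llm_factory`, the fake client classes) is hypothetical and stands in for the real classes, whose signatures are not reproduced here. The sketch only shows the shape of the pattern: one abstract contract, a lookup keyed on the client type, and a `ValueError` when no adapter matches.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Type


class SketchAdapter(ABC):
    """Hypothetical stand-in for a structured-output adapter contract."""

    @abstractmethod
    def create_llm(self, client: Any, model: str) -> Dict[str, Any]:
        """Wrap a pre-initialized client and a model name."""


class EchoAdapter(SketchAdapter):
    def create_llm(self, client: Any, model: str) -> Dict[str, Any]:
        # A real adapter would return an LLM wrapper; a dict keeps the sketch small.
        return {"client": client, "model": model, "adapter": "echo"}


class FakeOpenAIClient:
    """Placeholder for a supported provider client."""


class FakeUnsupportedClient:
    """Placeholder for a client that has no adapter yet."""


# Assumed lookup table: client class name -> adapter class.
ADAPTERS: Dict[str, Type[SketchAdapter]] = {"FakeOpenAIClient": EchoAdapter}


def sketch_llm_factory(model: str, client: Any) -> Dict[str, Any]:
    key = type(client).__name__
    if key not in ADAPTERS:
        raise ValueError(f"no adapter registered for client type {key!r}")
    return ADAPTERS[key]().create_llm(client, model)
```

In this sketch an unsupported client fails immediately inside the factory, which mirrors the failure mode that issue #2774 describes.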

### Prompt Engine

The prompt engine is built on top of `PydanticPrompt`, a Pydantic-model-driven prompt abstraction that handles input/output typing, example storage, and persistence. Source: [src/ragas/prompt/pydantic_prompt.py:1-20]() The `is_langchain_llm` helper detects whether the active LLM is a LangChain or a Ragas LLM, with direct LangChain usage explicitly marked as deprecated in favor of the Ragas LLM interface. Source: [src/ragas/prompt/pydantic_prompt.py:33-60]()

A companion `MultiModalPrompt` class extends this design to multi-modal inputs, and routes calls to either `agenerate_prompt` (LangChain) or `generate` (Ragas) depending on the LLM type. Source: [src/ragas/prompt/multi_modal_prompt.py:115-148]() Prompts are serialized to JSON (optionally gzip-compressed) following the format documented in `prompt-formats.md`, including support for `DynamicFewShotPrompt` with embedding-backed example selection. Source: [src/ragas/prompt/prompt-formats.md:1-58]()

### Testset Generation

The testset generation pipeline transforms raw documents into a knowledge graph, applies a configurable set of transforms, and then synthesizes queries and answers. The entry point `TestsetGenerator.generate_with_langchain_docs` builds a `KnowledgeGraph` from input documents, calls `apply_transforms`, and then delegates to `TestsetGenerator.generate`, which dispatches to the synthesizers with a `RunConfig`. Source: [src/ragas/testset/synthesizers/generate.py:18-78]()

The default transform pipeline (`default_transforms`) inspects document length distribution and conditionally attaches extractors such as `HeadlinesExtractor`, summaries, keyphrases, titles, and embeddings. Source: [src/ragas/testset/transforms/default.py:1-60]() `LLMBasedNodeFilter` subclasses — for example `CustomNodeFilter` — score each node for "question potential" so low-value chunks can be dropped before synthesis. Source: [src/ragas/testset/transforms/filters.py:1-66]()

The `HeadlinesExtractorPrompt` and `NERPrompt` in `llm_based.py` illustrate the project's standard pattern for extractors: a `PydanticPrompt` with a typed input, a typed output (`Headlines` or `NEROutput`), and a small set of in-context examples. Source: [src/ragas/testset/transforms/extractors/llm_based.py:1-110]() The single-hop synthesizer uses a parallel `QueryCondition` → `GeneratedQueryAnswer` prompt, optionally conditioned on an `llm_context` to guide question style. Source: [src/ragas/testset/synthesizers/single_hop/prompts.py:1-58]()

### Backends

Backends are managed through a singleton `BackendRegistry` that supports both built-in and plugin-discovered backends, with alias resolution and validation enforced on registration. Source: [src/ragas/backends/registry.py:1-66]() The shared `MemorableNames` utility generates human-friendly, Docker-style adjective/noun identifiers for experiments and datasets. Source: [src/ragas/backends/utils.py:1-46]()
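
A singleton registry with alias resolution and registration-time validation might look roughly like the sketch below. `SketchRegistry`, its methods, and `CsvBackend` are hypothetical names chosen for illustration; the actual `BackendRegistry` API may differ.

```python
from typing import Dict, Sequence, Type


class SketchRegistry:
    """Hypothetical singleton registry with aliases and validation."""

    _instance = None

    def __new__(cls):
        # Every call returns the same object, so registrations are shared.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._backends = {}
            cls._instance._aliases = {}
        return cls._instance

    def register(self, name: str, backend: Type, aliases: Sequence[str] = ()) -> None:
        # Validation happens at registration time, not at lookup time.
        if not isinstance(name, str) or not name:
            raise ValueError("backend name must be a non-empty string")
        if not isinstance(backend, type):
            raise TypeError("backend must be a class")
        self._backends[name] = backend
        for alias in aliases:
            self._aliases[alias] = name

    def resolve(self, name_or_alias: str) -> Type:
        name = self._aliases.get(name_or_alias, name_or_alias)
        if name not in self._backends:
            raise KeyError(f"unknown backend {name_or_alias!r}")
        return self._backends[name]


class CsvBackend:
    """Placeholder backend class used only in this sketch."""


SketchRegistry().register("local/csv", CsvBackend, aliases=["csv"])
```

Because the registry is a singleton, `SketchRegistry().resolve("csv")` returns `CsvBackend` from anywhere in the process.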

### Integrations

The integrations layer ships first-class adapters for tracing (Langfuse, MLflow), frameworks (LangChain, LlamaIndex, Griptape, LangGraph), observability (Helicone, LangSmith, Opik), platforms (Amazon Bedrock, R2R), agent systems (Swarm), and protocols (AG-UI). Source: [src/ragas/integrations/__init__.py:1-20]() Tracing integrations are imported lazily to keep optional dependencies optional, and the LangSmith helper `upload_dataset` converts a `Testset` into a pandas DataFrame before pushing it to a LangSmith dataset. Source: [src/ragas/integrations/langsmith.py:1-56]()

## Common Failure Modes and Known Issues

Several community-reported issues map directly to the modules above and are worth noting for new users:

- **Token usage tracking** — `get_token_usage_for_bedrock` always returns 0 because it reads the wrong `response_metadata` keys for `langchain-aws` `ChatBedrock` / `ChatBedrockConverse` (issue #2779). When using Bedrock, token totals from the `CostCallbackHandler` cannot be trusted until the keys are corrected.
- **Model compatibility** — `AspectCritic` raises runtime errors when the evaluator model is switched from `gpt-4o` to `o3` (issue #2067), reflecting brittleness in metric prompts against newer reasoning models.
- **Non-LLM context metrics** — `NonLLMContextRecall` and `NonLLMContextPrecisionWithReference` use inconsistent strict vs. non-strict comparisons (`>` vs `>=`) when thresholding string-similarity scores into binary relevance (issue #2777).
- **Deprecated import paths** — The deprecation warning for `LLMContextPrecisionWithoutReference` points users to a class that does not exist in `ragas.metrics.collections` (issue #2748).
- **Multilingual extraction** — Running `HeadlinesExtractor` on non-English content can fail with `'headlines' property not found in this node` (issue #1775), so the default transformer pipeline may need to be customized for non-English corpora.
- **Optional dependencies** — `python-diskcache` is imported inline but currently installed as a hard dependency; the community has requested it be made optional (issue #2622).
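
The strict versus non-strict comparison in the non-LLM context metrics matters only for scores that land exactly on the threshold. The sketch below uses a hypothetical `binarize` helper, not Ragas code, to show how the two operators disagree at the boundary.

```python
from typing import List


def binarize(scores: List[float], threshold: float, strict: bool) -> List[int]:
    """Turn similarity scores into 0/1 relevance labels (illustrative helper)."""
    if strict:
        return [1 if s > threshold else 0 for s in scores]
    return [1 if s >= threshold else 0 for s in scores]


scores = [0.25, 0.5, 0.75]
strict_labels = binarize(scores, threshold=0.5, strict=True)       # [0, 0, 1]
non_strict_labels = binarize(scores, threshold=0.5, strict=False)  # [0, 1, 1]
```

The middle score flips from irrelevant to relevant depending on the operator, so two metrics that share a threshold but not an operator can disagree on the same input.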

## See Also

- Metrics and DiscreteMetric usage (see README quickstart)
- LLM adapter providers and `llm_factory`
- Testset Generation module (synthesizers, transforms, filters)
- Integrations: Langfuse, MLflow, LangSmith, LangChain, LlamaIndex
- Prompt JSON format reference (`src/ragas/prompt/prompt-formats.md`)

---

<a id='page-2'></a>

## Evaluation Metrics, Collections, and Cost Tracking

### Related Pages

Related topics: [Overview, Installation, and Architecture](#page-1), [LLM Adapters, Integrations, Prompt Optimization, and Troubleshooting](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/vibrantlabsai/ragas/blob/main/README.md)
- [src/ragas/prompt/metrics/__init__.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/metrics/__init__.py)
- [src/ragas/prompt/metrics/noise_sensitivity.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/metrics/noise_sensitivity.py)
- [src/ragas/prompt/metrics/base_prompt.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/metrics/base_prompt.py)
- [src/ragas/prompt/metrics/answer_relevance.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/metrics/answer_relevance.py)
- [src/ragas/prompt/metrics/answer_correctness.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/metrics/answer_correctness.py)
- [src/ragas/prompt/metrics/common.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/metrics/common.py)
- [src/ragas/prompt/pydantic_prompt.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/pydantic_prompt.py)
- [src/ragas/prompt/multi_modal_prompt.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/multi_modal_prompt.py)
- [src/ragas/prompt/prompt-formats.md](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/prompt-formats.md)
- [src/ragas/testset/synthesizers/generate.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/synthesizers/generate.py)
- [src/ragas/integrations/__init__.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/integrations/__init__.py)
- [src/ragas/integrations/tracing/__init__.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/integrations/tracing/__init__.py)
- [src/ragas/backends/utils.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/backends/utils.py)
</details>

# Evaluation Metrics, Collections, and Cost Tracking

## Overview

Ragas provides an evaluation framework for Large Language Model (LLM) applications, with a strong emphasis on objective metrics, automated test data generation, and per-run cost tracking. The README introduces the framework as a "toolkit for evaluating and optimizing Large Language Model (LLM) applications" that supports both LLM-based and traditional metrics, integrations with popular LLM frameworks, and feedback-loop driven improvement.

The evaluation surface area in ragas is split into three coordinated concerns:

1. **Metrics** — pre-built and customizable scoring functions (e.g., faithfulness, answer correctness, noise sensitivity, aspect critique).
2. **Collections** — the structured module that exposes LLM-backed and non-LLM metrics through a unified import surface and that drives testset generation.
3. **Cost Tracking** — the `TokenUsage` / `TokenUsageParser` abstraction that allows per-run accounting of tokens and spend across providers (with provider-specific extractors such as `get_token_usage_for_bedrock`).

## Metrics Module

The metrics package re-exports its prompt assets through a single entry point. As shown in `src/ragas/prompt/metrics/__init__.py`, ragas ships the following built-in metric prompts:

- `correctness_classifier_prompt` — used by the answer correctness metric to label a response against a reference.
- `answer_relevancy_prompt` — used by the answer relevance metric to score how pertinent an answer is to the question.
- `nli_statement_prompt` and `statement_generator_prompt` — shared, low-level prompts exposed via `common.py` and used by composition-based metrics (for example, faithfulness and noise sensitivity decompose a response into atomic statements and judge them with an NLI step).

Source: [src/ragas/prompt/metrics/__init__.py:5-16]()

Beyond these, ragas exposes higher-level metrics used in user code. `DiscreteMetric` (shown in the README quickstart) lets users define a custom evaluator with a fixed `allowed_values` vocabulary, while `AspectCritic` and `NoiseSensitivity` address aspect-based evaluation and robustness against irrelevant context respectively. The `noise_sensitivity.py` prompt demonstrates the canonical pattern: a context + list of statements input, decomposed into per-statement verdicts of `0` (irrelevant) or `1` (relevant), each accompanied by a `reason`.

Source: [src/ragas/prompt/metrics/noise_sensitivity.py]()
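
The per-statement verdict structure can be modelled in a few lines. The sketch below is a plain-Python illustration with hypothetical names (`StatementVerdict`, `positive_share`). The aggregation shown is a simple share of `1` verdicts, chosen for illustration, and is not claimed to be the formula the metric uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StatementVerdict:
    statement: str
    verdict: int  # 0 or 1, as in the prompt's output schema
    reason: str

    def __post_init__(self) -> None:
        if self.verdict not in (0, 1):
            raise ValueError(f"verdict must be 0 or 1, got {self.verdict!r}")


def positive_share(verdicts: List[StatementVerdict]) -> float:
    """Fraction of statements judged 1 (illustrative aggregation only)."""
    if not verdicts:
        return 0.0
    return sum(v.verdict for v in verdicts) / len(verdicts)


verdicts = [
    StatementVerdict("Paris is in France.", 1, "stated in the context"),
    StatementVerdict("Paris has 50 million residents.", 0, "not in the context"),
]
```

Validating the verdict in `__post_init__` catches malformed model output as soon as it is parsed.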

### LLM Interaction Pattern

All metric prompts ultimately subclass `PydanticPrompt[InputModel, OutputModel]`, which in turn handles JSON-typed IO, response parsing, and LangChain/Ragas LLM dispatch. `pydantic_prompt.py` defines `is_langchain_llm()` to detect whether the supplied model is a legacy LangChain LLM or a Ragas LLM. `multi_modal_prompt.py` shows the dual code path: `agenerate_prompt` for LangChain LLMs, and `ragas_llm.generate(...)` for Ragas LLMs.

Source: [src/ragas/prompt/multi_modal_prompt.py]() and [src/ragas/prompt/pydantic_prompt.py]()

This dispatch is also why community issue #2774 matters: passing the `mistralai.Mistral` client to `llm_factory` raises `ValueError`, so a provider has to be wired through a supported adapter before a metric can call it.
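
The dual code path can be sketched with two stand-in classes. Everything here is hypothetical: the fake LLMs take and return plain strings, and `is_langchain_like` uses duck typing purely for illustration, whereas the real `is_langchain_llm` may detect the type differently.

```python
import asyncio


class FakeLangChainLLM:
    """Stand-in for a legacy LangChain-style model."""

    async def agenerate_prompt(self, prompt: str) -> str:
        return f"langchain:{prompt}"


class FakeRagasLLM:
    """Stand-in for a Ragas-native model."""

    async def generate(self, prompt: str) -> str:
        return f"ragas:{prompt}"


def is_langchain_like(llm: object) -> bool:
    # Assumption for this sketch only: detect by method name.
    return hasattr(llm, "agenerate_prompt")


async def dispatch(llm: object, prompt: str) -> str:
    """Route the call to whichever generation method the model exposes."""
    if is_langchain_like(llm):
        return await llm.agenerate_prompt(prompt)
    return await llm.generate(prompt)


langchain_out = asyncio.run(dispatch(FakeLangChainLLM(), "hi"))  # "langchain:hi"
ragas_out = asyncio.run(dispatch(FakeRagasLLM(), "hi"))          # "ragas:hi"
```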

## Collections and Testset Generation

Ragas organizes metrics and synthesizers under "collections" so that testset generation, evaluation, and prompt optimization share a common knowledge-graph and transform pipeline. The `generate.py` testset generator accepts a list of documents, applies a default set of transforms (extracting summaries, keyphrases, titles, headlines, and embeddings, and then building similarity edges), and runs a `query_distribution` of synthesizers to produce `TestsetSample` records.

Source: [src/ragas/testset/synthesizers/generate.py]()

The generator's signature includes `token_usage_parser: Optional[TokenUsageParser]`, which directly couples the collections surface to the cost tracking subsystem. This is the parameter users reach for when they want to know how many tokens testset generation consumed — a recurring community question (see issue #540, "Tokens Usage for evaluations and testset generations").

## Cost Tracking and Token Usage

Cost tracking in ragas is built on two primitives:

| Component | Role |
|---|---|
| `TokenUsageParser` | Accepts an `LLMResult` and returns a `TokenUsage` object; provider-agnostic entry point. |
| `get_token_usage_for_<provider>` (e.g., `get_token_usage_for_bedrock`) | Provider-specific extractor that reads `response_metadata` and returns the correct token counts. |

Issue #2779 documents a real failure mode: `get_token_usage_for_bedrock` "always returns 0" because it reads the wrong `response_metadata` keys for `langchain-aws ChatBedrock` / `ChatBedrockConverse`. The fix requires inspecting the actual metadata emitted by langchain-aws and matching the right key paths. The same pattern recurs for every provider — the `TokenUsageParser` is the dispatch point, but the per-provider key extraction is where bugs surface.
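
The "always returns 0" symptom is what a defensive extractor produces when it looks under the wrong keys. The sketch below uses hypothetical names (`SketchTokenUsage`, `read_path`, `parse_usage`) and an invented metadata shape. It does not reproduce the real `langchain-aws` key names, which should be checked against the metadata actually emitted.

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence


@dataclass
class SketchTokenUsage:
    input_tokens: int = 0
    output_tokens: int = 0


def read_path(metadata: Dict[str, Any], path: Sequence[str]) -> int:
    """Walk nested keys; fall back to 0 when any key is missing."""
    current: Any = metadata
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return 0
        current = current[key]
    return current if isinstance(current, int) else 0


def parse_usage(metadata: Dict[str, Any],
                input_path: Sequence[str],
                output_path: Sequence[str]) -> SketchTokenUsage:
    return SketchTokenUsage(read_path(metadata, input_path),
                            read_path(metadata, output_path))


# Invented metadata shape, for illustration only.
metadata = {"usage": {"prompt_tokens": 12, "completion_tokens": 5}}

right = parse_usage(metadata, ("usage", "prompt_tokens"), ("usage", "completion_tokens"))
wrong = parse_usage(metadata, ("token_usage", "input"), ("token_usage", "output"))
```

Here `wrong` silently reports zero tokens instead of raising, which is why this class of bug can go unnoticed until someone compares the totals with the provider's billing.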

The wider `cost` module also publishes a `CostCallbackHandler` (referenced in the same community thread) that attaches to a run to collect usage in real time.

## Integrations and Tracing

The integration layer in `src/ragas/integrations/__init__.py` documents the available surfaces: tracing (Langfuse, MLflow), frameworks (LangChain, LlamaIndex, Griptape, LangGraph), observability (Helicone, LangSmith, Opik), platforms (Amazon Bedrock, R2R), AI systems (Swarm), and protocols (AG-UI).

Source: [src/ragas/integrations/__init__.py]()

Tracing imports are lazy: `src/ragas/integrations/tracing/__init__.py` only resolves `observe`, `sync_trace`, `LangfuseTrace`, and `MLflowTrace` on attribute access, so projects that do not install langfuse or mlflow still import ragas cleanly. Community issue #2622 ("Make `python-diskcache` dependency optional") asks for the same treatment for `diskcache`: the package is imported only on demand, yet it is still installed as a hard dependency.
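
The lazy-resolution technique relies on module-level `__getattr__` (PEP 562). The sketch below builds a throwaway module object so it stays self-contained; `lazy_tracing` and `_LAZY_ATTRS` are hypothetical names, and standard-library modules stand in for optional packages such as langfuse or mlflow.

```python
import importlib
import types

# Hypothetical mapping: attribute name -> module that provides it.
_LAZY_ATTRS = {"sqrt": "math", "dumps": "json"}

lazy_tracing = types.ModuleType("lazy_tracing")


def _module_getattr(name: str):
    """Called only when normal attribute lookup on the module fails (PEP 562)."""
    if name not in _LAZY_ATTRS:
        raise AttributeError(f"module 'lazy_tracing' has no attribute {name!r}")
    value = getattr(importlib.import_module(_LAZY_ATTRS[name]), name)
    setattr(lazy_tracing, name, value)  # cache so later lookups skip this hook
    return value


# In a real package this function would simply be defined as `__getattr__`
# at the top level of the package's __init__.py.
lazy_tracing.__getattr__ = _module_getattr
```

Nothing is imported until the first attribute access, so a missing optional dependency only fails for users who actually touch it.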

## Prompt Persistence

`src/ragas/prompt/prompt-formats.md` documents the on-disk JSON (optionally gzip-compressed) format used to save and load both `Prompt` and `DynamicFewShotPrompt` objects. The format includes a `format_version` field, a `type` discriminator, an `instruction` template, an `examples` array, and a `response_model_info` block. `DynamicFewShotPrompt` extends this with an embedding model reference and `max_similar_examples` / `similarity_threshold` fields, enabling example retrieval at inference time.

Source: [src/ragas/prompt/prompt-formats.md]()
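
A save/load round trip for such a file needs only the standard library. The sketch below is an assumption-laden illustration: `save_prompt`, `load_prompt`, and `round_trip` are hypothetical helpers, the field values are placeholders, and using a `.gz` suffix to switch on compression is a choice made here rather than something the format document states.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Dict


def save_prompt(data: Dict[str, Any], path: str) -> None:
    """Write JSON, gzip-compressed when the path ends in .gz (sketch convention)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)


def load_prompt(path: str) -> Dict[str, Any]:
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


# Field names follow the description above; every value is a placeholder.
prompt = {
    "format_version": "placeholder",
    "type": "Prompt",
    "instruction": "Answer the question: {question}",
    "examples": [{"input": {"question": "2+2?"}, "output": {"answer": "4"}}],
    "response_model_info": None,
}


def round_trip(data: Dict[str, Any], suffix: str) -> Dict[str, Any]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prompt" + suffix)
        save_prompt(data, path)
        return load_prompt(path)
```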

## Common Failure Modes

- **Wrong provider metadata keys** — `get_token_usage_for_bedrock` returns zero because it reads the wrong `response_metadata` keys for `ChatBedrock` / `ChatBedrockConverse` (issue #2779).
- **LLM factory dispatch** — passing an unsupported client (e.g., raw `mistralai.Mistral`) to `llm_factory` raises `ValueError` (issue #2774).
- **AspectCritic model compatibility** — switching the evaluator model to `o3` causes runtime errors in the prompt/parser pipeline (issue #2067).
- **Threshold inconsistencies** — non-LLM context metrics use mixed `>` vs `>=` comparisons when binarizing similarity (issue #2777).
- **Stale deprecation names** — deprecation warnings may reference class names that no longer exist in `ragas.metrics.collections` (issue #2748).

## See Also

- Testset Generation and Knowledge Graph Transforms
- LLM Adapters and the `llm_factory` API
- Tracing Integrations: Langfuse and MLflow
- Prompt Persistence and `DynamicFewShotPrompt`

---

<a id='page-3'></a>

## Testset Generation, Datasets, and Storage Backends

### Related Pages

Related topics: [Overview, Installation, and Architecture](#page-1), [Evaluation Metrics, Collections, and Cost Tracking](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/ragas/testset/__init__.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/__init__.py)
- [src/ragas/testset/synthesizers/generate.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/synthesizers/generate.py)
- [src/ragas/testset/synthesizers/testset_schema.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/synthesizers/testset_schema.py)
- [src/ragas/testset/transforms/default.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/default.py)
- [src/ragas/testset/transforms/extractors/llm_based.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/extractors/llm_based.py)
- [src/ragas/testset/transforms/filters.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/filters.py)
- [src/ragas/testset/synthesizers/single_hop/prompts.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/synthesizers/single_hop/prompts.py)
- [src/ragas/backends/base.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/backends/base.py)
- [src/ragas/backends/README.md](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/backends/README.md)
- [src/ragas/backends/utils.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/backends/utils.py)
- [src/ragas/backends/gdrive_backend.md](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/backends/gdrive_backend.md)
- [src/ragas/integrations/__init__.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/integrations/__init__.py)
- [src/ragas/integrations/langsmith.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/integrations/langsmith.py)
- [src/ragas/llms/adapters/base.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/llms/adapters/base.py)
</details>

# Testset Generation, Datasets, and Storage Backends

## Overview and Role

Ragas treats LLM-application evaluation as a two-sided problem: you need metrics to score outputs, and you need data to score against. The `ragas.testset` module provides the second half by synthesizing realistic `Testset` objects from your source documents, while `ragas.backends` provides pluggable storage for persisting both datasets and experiment results across local, cloud, and third-party systems.

The public surface is intentionally small. `src/ragas/testset/__init__.py` exports only `TestsetGenerator`, `Testset`, and `TestsetSample`. Storage backends follow a parallel minimalism: `BaseBackend` in `src/ragas/backends/base.py` declares six abstract methods (load, save, and list for both datasets and experiments), and the `ragas.backends` entry-point group lets third parties register new backends without forking the library (see `src/ragas/backends/README.md`).

## Testset Generation Pipeline

### The `TestsetGenerator` Class

The entry point is `TestsetGenerator` in [src/ragas/testset/synthesizers/generate.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/synthesizers/generate.py). It exposes two generation paths:

- `generate_with_langchain_docs(...)` — accepts LangChain `Document` objects (or raw strings), constructs a `KnowledgeGraph` of `Node(type=NodeType.CHUNK)` entries from each `page_content` + `metadata`, runs `apply_transforms(...)`, then delegates to `generate(...)`.
- `generate(...)` — the lower-level entry that operates on an already-populated `KnowledgeGraph` and supports a configurable `QueryDistribution` (single-hop, multi-hop, etc.).

Both methods accept the same core options: `testset_size`, `query_distribution`, `run_config`, `callbacks`, `token_usage_parser`, `with_debugging_logs`, `raise_exceptions`, and `return_executor`. When `return_executor=True`, the call returns the `Executor` instance instead of running to completion, allowing the caller to call `executor.cancel()` mid-run and later collect partial results via `executor.results()`.

The pipeline is gated by an explicit precondition — both `self.llm` and `transforms_llm` must be set or a `ValueError` is raised.

### Transforms and the Knowledge Graph

Before any query is generated, the source chunks are enriched by a pipeline of `Transforms`. The default pipeline, defined in [src/ragas/testset/transforms/default.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/default.py), bins documents by token length and conditionally adds:

- `HeadlinesExtractor` (and the `Headlines` / `NER` Pydantic prompts in [src/ragas/testset/transforms/extractors/llm_based.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/extractors/llm_based.py)) when ≥25% of documents exceed 500 tokens,
- `SummaryExtractor` and `KeyphrasesExtractor` for shorter corpora,
- Embedding-based similarity edges between nodes.
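
The length-based gating described above can be sketched as a small decision function. `choose_pipeline` and its return labels are hypothetical. The 500-token cutoff and the 25% share come from the text, and the function takes token counts as given rather than assuming any particular tokenizer.

```python
from typing import List


def choose_pipeline(token_counts: List[int],
                    long_cutoff: int = 500,
                    min_share: float = 0.25) -> str:
    """Pick an extractor set from the document length distribution (sketch)."""
    if not token_counts:
        raise ValueError("at least one document is required")
    long_docs = sum(1 for n in token_counts if n > long_cutoff)
    if long_docs / len(token_counts) >= min_share:
        return "headlines"           # long corpus: headline extraction applies
    return "summary_keyphrases"      # shorter corpus: summaries and keyphrases


mostly_short = choose_pipeline([80, 120, 90, 60, 110])  # "summary_keyphrases"
some_long = choose_pipeline([600, 100, 100, 100])       # "headlines"
```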

Filters drop low-utility nodes. `CustomNodeFilter` in [src/ragas/testset/transforms/filters.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/filters.py) scores each chunk with `QuestionPotentialPrompt` on a 1–5 scale and drops nodes that score below `min_score=2`. When a chunk carries no summary of its own, the filter falls back to the parent document's summary when one exists; if no summary is available at all, it logs a warning and skips the chunk.
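
The drop rule and the summary fallback can be illustrated without an LLM by supplying the scores directly. All names below (`pick_summary`, `keep_node`, the dictionary keys) are hypothetical, and the scores are made-up inputs rather than model output.

```python
from typing import Any, Dict, Optional


def pick_summary(node: Dict[str, Any],
                 parent: Optional[Dict[str, Any]]) -> Optional[str]:
    """Prefer the node's own summary, then the parent's, else None."""
    return node.get("summary") or (parent or {}).get("summary")


def keep_node(score: int, min_score: int = 2) -> bool:
    """Keep nodes at or above min_score; scores are expected on a 1 to 5 scale."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    return score >= min_score


parent = {"summary": "Annual report overview."}
chunk = {"text": "Table of contents"}

summary_used = pick_summary(chunk, parent)  # falls back to the parent summary
kept = [s for s in (1, 2, 4) if keep_node(s)]  # [2, 4]
```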

```mermaid
flowchart LR
    A[LangChain Documents / Strings] --> B[KnowledgeGraph: Chunk Nodes]
    B --> C[apply_transforms: Headlines / Summary / Keyphrases / Embeddings]
    C --> D[CustomNodeFilter: QuestionPotential]
    D --> E[QueryDistribution Synthesizers]
    E --> F[Testset samples: query + reference + contexts]
    F --> G{return_executor?}
    G -- false --> H[Testset]
    G -- true --> I[Executor .results / .cancel]
```

### Query Synthesis Prompts

The `QueryCondition → GeneratedQueryAnswer` pairs defined in [src/ragas/testset/synthesizers/single_hop/prompts.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/synthesizers/single_hop/prompts.py) drive single-hop generation: persona, term, query style, query length, and an optional `llm_context` constrain the LLM so generated questions stay grounded in the source chunks.

## Datasets and Schema

A generated test set is a list of `TestsetSample` records aggregated in a `Testset` container (see `src/ragas/testset/synthesizers/testset_schema.py`). Each sample carries the synthesized `query`, a reference `answer`, the `contexts` used to produce it, and the originating node references for traceability.

The prompt layer underneath, `PydanticPrompt` in [src/ragas/prompt/pydantic_prompt.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/pydantic_prompt.py) with its multi-modal companion in [src/ragas/prompt/multi_modal_prompt.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/multi_modal_prompt.py), dispatches to either LangChain `agenerate_prompt` or Ragas-native `BaseRagasLLM.generate` and parses the output through a `RagasOutputParser`, raising `RagasOutputParserException` on schema failure. This indirection is what lets a single `TestsetGenerator` work across OpenAI, Bedrock, Vertex, and LlamaIndex backends (the adapter contract is defined in [src/ragas/llms/adapters/base.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/llms/adapters/base.py)).

### LangSmith Integration

[src/ragas/integrations/langsmith.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/integrations/langsmith.py) exposes `upload_dataset(testset, dataset_name, dataset_desc="")`. The function reads LangSmith to confirm the dataset name is free, then converts the `Testset` to a pandas `DataFrame` and uploads it. A duplicate name raises `ValueError`. Broader integrations (Langfuse, MLflow, Helicone, Opik, Swarm, AG-UI) are catalogued in [src/ragas/integrations/__init__.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/integrations/__init__.py) and imported lazily so optional dependencies stay optional.

## Storage Backends

### `BaseBackend` Interface

`src/ragas/backends/base.py` declares the contract every storage backend must satisfy. Implementations receive and return `List[Dict[str, Any]]`, raise `FileNotFoundError` on missing names, and return an empty list (never `None`) for empty datasets. The abstract methods are:

| Method | Purpose |
| --- | --- |
| `load_dataset(name)` | Return records for a dataset, or raise `FileNotFoundError` |
| `save_dataset(name, data, ...)` | Persist dataset records atomically |
| `load_experiment(name)` | Mirror of `load_dataset` for experiment runs |
| `save_experiment(name, data, ...)` | Mirror of `save_dataset` for experiment runs |
| `list_datasets()` / `list_experiments()` | Enumerate stored names |

The file-based convention is `storage_root/datasets/` and `storage_root/experiments/`, so users can point multiple backends at the same root safely.
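
The contract in the table can be exercised with an in-memory implementation. The sketch below is hypothetical (`SketchBackend`, `InMemoryBackend`) and does not import Ragas. It also omits the additional arguments that the real `save_dataset` and `save_experiment` accept, since the table elides them.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Records = List[Dict[str, Any]]


class SketchBackend(ABC):
    """Hypothetical mirror of the six methods listed in the table."""

    @abstractmethod
    def load_dataset(self, name: str) -> Records: ...
    @abstractmethod
    def save_dataset(self, name: str, data: Records) -> None: ...
    @abstractmethod
    def load_experiment(self, name: str) -> Records: ...
    @abstractmethod
    def save_experiment(self, name: str, data: Records) -> None: ...
    @abstractmethod
    def list_datasets(self) -> List[str]: ...
    @abstractmethod
    def list_experiments(self) -> List[str]: ...


class InMemoryBackend(SketchBackend):
    def __init__(self) -> None:
        self._datasets: Dict[str, Records] = {}
        self._experiments: Dict[str, Records] = {}

    @staticmethod
    def _load(store: Dict[str, Records], name: str) -> Records:
        if name not in store:
            raise FileNotFoundError(f"no entry named {name!r}")
        return [dict(row) for row in store[name]]  # a list, never None

    def load_dataset(self, name: str) -> Records:
        return self._load(self._datasets, name)

    def save_dataset(self, name: str, data: Records) -> None:
        self._datasets[name] = [dict(row) for row in data]

    def load_experiment(self, name: str) -> Records:
        return self._load(self._experiments, name)

    def save_experiment(self, name: str, data: Records) -> None:
        self._experiments[name] = [dict(row) for row in data]

    def list_datasets(self) -> List[str]:
        return sorted(self._datasets)

    def list_experiments(self) -> List[str]:
        return sorted(self._experiments)
```

Datasets and experiments live in separate stores here, which parallels the separate `datasets/` and `experiments/` directories of the file-based convention.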

### Plugin Discovery

Third-party backends are registered via Python entry points (see `src/ragas/backends/README.md`). The package declares `ragas.backends` as the entry-point group; on first access `get_registry()` iterates registered entry points and lazy-loads each `name -> backend_class` mapping. To debug discovery, inspect `registry.keys()`. Bundled helpers like `MemorableNames` in [src/ragas/backends/utils.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/backends/utils.py) generate human-readable identifiers (`adjective + scientist`) for experiment and dataset names.

### Google Drive Backend

The Google Drive backend, documented in [src/ragas/backends/gdrive_backend.md](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/backends/gdrive_backend.md), stores datasets and experiments as Google Sheets inside two folders (`datasets/`, `experiments/`). It supports both Service Account and OAuth 2.0 authentication and is installed via `pip install "ragas[gdrive]"`.

## Known Limitations (Community Notes)

Several open issues surfaced by users intersect with this surface area:

- Issue [#1775](https://github.com/vibrantlabsai/ragas/issues/1775) reports `headlines` property missing on nodes when generating test data in non-English languages — this is the `HeadlinesExtractor` step failing because the LLM did not emit structured output for that locale.
- Issue [#1688](https://github.com/vibrantlabsai/ragas/issues/1688) tracks `AttributeError('StringIO' object has no attribute 'classifications')` during testset generation, again tied to the LLM extractor prompts.
- Issue [#2231](https://github.com/vibrantlabsai/ragas/issues/2231) collects feedback on the future of this module ahead of v0.4 — relevant if you maintain a fork.
- Issue [#540](https://github.com/vibrantlabsai/ragas/issues/540) covers token-usage accounting for testset generation, which the `token_usage_parser` argument on `generate(...)` exists to address (see `src/ragas/cost.py`).

## See Also

- Metrics and Evaluation — LLM-based and non-LLM scoring metrics
- Integrations — LangChain, LlamaIndex, LangSmith, Langfuse
- Cost Tracking — `TokenUsageParser` and `CostCallbackHandler`
- Backends — Plugin authoring guide in `src/ragas/backends/README.md`

---

<a id='page-4'></a>

## LLM Adapters, Integrations, Prompt Optimization, and Troubleshooting

### Related Pages

Related topics: [Overview, Installation, and Architecture](#page-1), [Evaluation Metrics, Collections, and Cost Tracking](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/vibrantlabsai/ragas/blob/main/README.md)
- [src/ragas/llms/base.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/llms/base.py)
- [src/ragas/llms/oci_genai_wrapper.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/llms/oci_genai_wrapper.py)
- [src/ragas/llms/adapters/base.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/llms/adapters/base.py)
- [src/ragas/integrations/__init__.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/integrations/__init__.py)
- [src/ragas/integrations/tracing/__init__.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/integrations/tracing/__init__.py)
- [src/ragas/prompt/prompt-formats.md](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/prompt-formats.md)
- [src/ragas/prompt/metrics/base_prompt.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/metrics/base_prompt.py)
- [src/ragas/prompt/multi_modal_prompt.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/prompt/multi_modal_prompt.py)
- [src/ragas/testset/synthesizers/generate.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/synthesizers/generate.py)
- [src/ragas/testset/transforms/extractors/llm_based.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/extractors/llm_based.py)
- [src/ragas/testset/transforms/extractors/__init__.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/extractors/__init__.py)
- [src/ragas/testset/transforms/default.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/testset/transforms/default.py)
- [src/ragas/backends/utils.py](https://github.com/vibrantlabsai/ragas/blob/main/src/ragas/backends/utils.py)
</details>


## Overview

Ragas provides a modular stack for connecting to LLMs, observability platforms, and prompt frameworks. The system is composed of three cooperating layers:

1. **Adapter layer** — A pluggable abstraction over the `Instructor` and `LiteLLM` libraries, exposed through the `llm_factory` function and the `StructuredOutputAdapter` base class.
2. **Integration layer** — Optional modules that bridge Ragas to LangChain, LlamaIndex, MLflow, Langfuse, Amazon Bedrock, and tracing/observability tooling.
3. **Prompt layer** — JSON-serializable, versioned prompts with optional few-shot example stores used by metrics and testset synthesizers.

This page documents how those layers fit together and how to recover from the most frequently reported community issues.

## LLM Adapters

The adapter layer normalizes structured-output generation across providers. The base abstraction is defined in `src/ragas/llms/adapters/base.py`:

```python
class StructuredOutputAdapter(ABC):
    @abstractmethod
    def create_llm(
        self,
        client: t.Any,
        model: str,
        provider: str,
        **kwargs,
    ) -> t.Any: ...
```

Concrete adapters (e.g. `InstructorLLM`, `LiteLLMLLM`) implement `create_llm()` and are selected automatically by `llm_factory()` based on the requested `adapter` string (`"instructor"` or `"litellm"`). Source: [src/ragas/llms/adapters/base.py:1-31]()

The factory applies a default `InstructorModelArgs` configuration (temperature 0.1, max_tokens 1024) and accepts a `cache` backend such as `DiskCacheBackend()`, which is described as giving "60x" speed-ups on repeated evaluations. Source: [src/ragas/llms/base.py:1-120]()
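
The selection-by-string pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the Ragas implementation: `EchoAdapter`, `ADAPTERS`, and `make_llm` are hypothetical names, and the abstract base class is redeclared locally so the example runs on its own.

```python
import typing as t
from abc import ABC, abstractmethod


class StructuredOutputAdapter(ABC):
    @abstractmethod
    def create_llm(self, client: t.Any, model: str, provider: str, **kwargs) -> t.Any: ...


class EchoAdapter(StructuredOutputAdapter):
    """Hypothetical adapter that only records what it was asked to build."""

    def __init__(self, label: str) -> None:
        self.label = label

    def create_llm(self, client, model, provider, **kwargs):
        return {"adapter": self.label, "model": model, "provider": provider, **kwargs}


# Hypothetical registry keyed by the adapter strings mentioned in the text.
ADAPTERS: t.Dict[str, StructuredOutputAdapter] = {
    "instructor": EchoAdapter("instructor"),
    "litellm": EchoAdapter("litellm"),
}


def make_llm(model: str, client: t.Any, provider: str = "openai",
             adapter: str = "instructor", **kwargs) -> t.Any:
    """Look up the adapter by name and delegate construction to it."""
    try:
        chosen = ADAPTERS[adapter]
    except KeyError:
        raise ValueError(
            f"unknown adapter {adapter!r}; expected one of {sorted(ADAPTERS)}"
        ) from None
    return chosen.create_llm(client, model, provider, **kwargs)


print(make_llm("gpt-4o", client=None, adapter="litellm"))
```

Because callers only pass a string, switching adapters does not change the call site, which is the property the adapter layer relies on.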

Provider-specific wrappers complement the generic factory. For example, the OCI Gen AI wrapper performs lazy initialization of the SDK and raises `ImportError` only when all three conditions hold: no explicit `client` was passed, the SDK's `GenerativeAiClient` could not be imported, and no `endpoint_id` was supplied:

```python
if (
    self.client is None
    and GenerativeAiClient is None
    and self.endpoint_id is None
):
    raise ImportError("OCI SDK not found. Please install it with: pip ...")
```

Source: [src/ragas/llms/oci_genai_wrapper.py:1-80]()
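
The guard above follows a common optional-dependency pattern: attempt the import once at module load, fall back to `None`, and defer the error until the object is constructed. A minimal self-contained sketch, assuming a made-up module name `_hypothetical_vendor_sdk` and a made-up `LazyWrapper` class:

```python
try:
    # Hypothetical optional dependency; stands in for the real SDK import.
    from _hypothetical_vendor_sdk import GenerativeClient
except ImportError:
    GenerativeClient = None  # remember that the SDK is unavailable


class LazyWrapper:
    """Fails only when there is no usable way to reach the service."""

    def __init__(self, client=None, endpoint_id=None):
        self.client = client
        self.endpoint_id = endpoint_id
        if (
            self.client is None
            and GenerativeClient is None
            and self.endpoint_id is None
        ):
            raise ImportError(
                "optional SDK not installed and no client or endpoint_id given"
            )


# A caller-supplied client is accepted even though the SDK import failed.
wrapper = LazyWrapper(client=object())
```

Importing the module therefore never fails for users who do not need this provider.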

### Known Adapter Failure Modes

The community has reported two recurring adapter bugs:

- **mistralai client** — `llm_factory` raises `ValueError` because the mistral provider branch is missing from the adapter registry. Workaround: use the `litellm` adapter instead.
- **OpenAI `o3` with `AspectCritic`** — Structured-output parsing fails because `o3` does not honor the default JSON schema the way `gpt-4o` does. The published mitigation is to fall back to the `instructor` adapter with `mode=Mode.MD_JSON` for backends that ignore `response_format`.

Both failures are routed through the same `InstructorBaseRagasLLM` interface, so swapping adapters requires no caller-side changes. Source: [src/ragas/llms/base.py:1-120]()

## Integrations

Ragas groups integrations under `src/ragas/integrations/`. The package re-exports optional submodules so that missing third-party SDKs do not hard-fail on import. Tracing integrations live in `src/ragas/integrations/tracing/`, which supports Langfuse and MLflow with `observe()` and `sync_trace()` helpers. Source: [src/ragas/integrations/__init__.py:1-22](), [src/ragas/integrations/tracing/__init__.py:1-80]()

The framework-level integrations span LangChain, LlamaIndex, Griptape, LangGraph, Helicone, LangSmith, Opik, Amazon Bedrock, R2R, Swarm, and AG-UI. Because each is imported lazily through `__getattr__`, users can install only the bindings they need.
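
Lazy attribute loading through a module-level `__getattr__` (PEP 562) can be sketched as below. The module name `demo_integrations` and the attribute-to-module mapping are hypothetical. A throwaway module object is built in place so that the example fits in one file; in a real package the function would simply be defined in `__init__.py`.

```python
import importlib
import types

# Hypothetical mapping: public attribute -> module imported on first access.
_LAZY_MODULES = {"json_tools": "json", "math_tools": "math"}


def _lazy_getattr(name: str):
    """Called only when normal attribute lookup on the module fails."""
    if name in _LAZY_MODULES:
        return importlib.import_module(_LAZY_MODULES[name])
    raise AttributeError(f"module 'demo_integrations' has no attribute {name!r}")


# Build a throwaway module so the example runs in a single file.
demo_integrations = types.ModuleType("demo_integrations")
demo_integrations.__getattr__ = _lazy_getattr

# The underlying module is imported here, on first attribute access.
print(demo_integrations.math_tools.sqrt(16.0))
```

Unknown names still raise `AttributeError`, so `hasattr()` checks and typos behave as they would on an ordinary module.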

The example below, drawn from the README, shows the canonical entry point that all integrations share:

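```python
import asyncio

from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics import DiscreteMetric

client = AsyncOpenAI()  # expects OPENAI_API_KEY in the environment
llm = llm_factory("gpt-4o", client=client)
metric = DiscreteMetric(name="summary_accuracy", allowed_values=["accurate", "inaccurate"], prompt="...")

async def main() -> None:
    score = await metric.ascore(llm=llm, response="...")
    print(score.value)  # one of the allowed values
asyncio.run(main())
```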

Source: [README.md:1-80]()

## Prompt Optimization

Prompts are first-class, persistable objects. The `PydanticPrompt[InputModel, OutputModel]` base class renders a deterministic prompt string that combines an instruction, JSON-schema output spec, optional few-shot examples, and the serialized input. Source: [src/ragas/prompt/metrics/base_prompt.py:1-90]()
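
The rendering idea (instruction, output schema, optional examples, then the serialized input) can be sketched with the standard library. This is not the `PydanticPrompt` implementation: `SketchPrompt`, its fields, and the exact wording of the rendered text are hypothetical, and plain dictionaries stand in for Pydantic models.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class SketchPrompt:
    instruction: str
    output_schema: Dict[str, Any]
    examples: List[Tuple[Dict[str, Any], Dict[str, Any]]] = field(default_factory=list)

    def render(self, data: Dict[str, Any]) -> str:
        """Build the prompt text; sort_keys keeps the output order-independent."""
        parts = [
            self.instruction,
            "Return JSON matching this schema:",
            json.dumps(self.output_schema, sort_keys=True),
        ]
        for i, (example_in, example_out) in enumerate(self.examples, start=1):
            parts.append(f"Example {i} input: {json.dumps(example_in, sort_keys=True)}")
            parts.append(f"Example {i} output: {json.dumps(example_out, sort_keys=True)}")
        parts.append(f"Input: {json.dumps(data, sort_keys=True)}")
        parts.append("Output:")
        return "\n".join(parts)


prompt = SketchPrompt(
    instruction="Extract the title of the document.",
    output_schema={"type": "object", "properties": {"title": {"type": "string"}}},
    examples=[({"text": "Intro to RAG\n..."}, {"title": "Intro to RAG"})],
)
print(prompt.render({"text": "Evaluating retrievers\n..."}))
```

Serializing with sorted keys is one simple way to make the same logical input always produce the same prompt string, which matters for caching and for reproducible evaluations.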

### Prompt Persistence

`Prompt` and `DynamicFewShotPrompt` are persisted as JSON (optionally gzip-compressed to `.json.gz`). The reference doc compares the two formats:

| Feature | `Prompt` | `DynamicFewShotPrompt` |
|---------|----------|------------------------|
| Type ID | `"Prompt"` | `"DynamicFewShotPrompt"` |
| Embedding-backed example selection | ❌ | ✅ |
| Similarity threshold / max similar examples | ❌ | ✅ |
| Stores `response_model_info` | ✅ | ✅ |

Source: [src/ragas/prompt/prompt-formats.md:1-50]()
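
Persisting to JSON with optional gzip compression can be sketched as follows. The helper names `save_prompt` and `load_prompt` and the payload field names are hypothetical; only the `"Prompt"` type identifier and the `.json.gz` suffix come from the text above. The actual on-disk schema is defined in `prompt-formats.md`.

```python
import gzip
import json
import os
import tempfile


def save_prompt(payload: dict, path: str) -> None:
    """Write JSON, compressing when the path ends with .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)


def load_prompt(path: str) -> dict:
    """Read JSON back, transparently handling the compressed variant."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical payload; the field names are illustrative only.
payload = {"type": "Prompt", "instruction": "Grade the answer.", "examples": []}

with tempfile.TemporaryDirectory() as tmp:
    for filename in ("prompt.json", "prompt.json.gz"):
        path = os.path.join(tmp, filename)
        save_prompt(payload, path)
        print(filename, load_prompt(path) == payload)
```

Choosing the opener from the file suffix keeps a single code path for both formats.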

For testset generation, `generate.py` orchestrates a sequence of synthesizers driven by a configurable `query_distribution`, `transforms_llm`, and `transforms_embedding_model`. The default transform pipeline (`default.py`) bins documents by token length and enables `HeadlinesExtractor` only when at least 25% of documents are 501 tokens or longer, which is intended to help LLM-based extractors avoid the `StringIO` and `'headlines' property not found` errors reported in community issues #1688 and #1775, respectively. Source: [src/ragas/testset/transforms/default.py:1-80](), [src/ragas/testset/synthesizers/generate.py:1-100]()
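
The length-based decision can be sketched as below. Only the 25% share and the 501-token boundary come from the text above. The two lower bins, the helper names, and the use of precomputed token counts are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed bins; only the start of the last bin (501) is stated in the text.
BIN_RANGES: List[Tuple[float, float]] = [(0, 100), (101, 500), (501, float("inf"))]


def length_bin_shares(token_counts: List[int]) -> Dict[str, float]:
    """Return the fraction of documents that falls into each length bin."""
    if not token_counts:
        raise ValueError("at least one document is required")
    counts = {f"{lo}-{hi}": 0 for lo, hi in BIN_RANGES}
    for n in token_counts:
        for lo, hi in BIN_RANGES:
            if lo <= n <= hi:
                counts[f"{lo}-{hi}"] += 1
                break
    return {key: value / len(token_counts) for key, value in counts.items()}


def use_headline_pipeline(token_counts: List[int], threshold: float = 0.25) -> bool:
    """Enable headline extraction when enough documents are long."""
    return length_bin_shares(token_counts)["501-inf"] >= threshold


print(use_headline_pipeline([50, 50, 50, 600]))      # one of four docs is long
print(use_headline_pipeline([50, 50, 50, 50, 600]))  # one of five docs is long
```

With four documents, a single long one reaches the 25% share exactly. With five, it falls short and headline extraction stays off.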

LLM-based extractors (`SummaryExtractor`, `KeyphrasesExtractor`, `TitleExtractor`, `HeadlinesExtractor`, `NERExtractor`, `TopicDescriptionExtractor`) all extend `PydanticPrompt` and share the same example/few-shot machinery, making them a natural place to plug in DSPy-style optimizers introduced in v0.4.3. Source: [src/ragas/testset/transforms/extractors/__init__.py:1-30](), [src/ragas/testset/transforms/extractors/llm_based.py:1-120]()

## Troubleshooting

The most common community-reported problems map to specific subsystems:

| Symptom | Likely cause | Where to look |
|---------|--------------|---------------|
| `ValueError` from `llm_factory` | Unsupported provider / client combo | `src/ragas/llms/base.py`, `adapters/base.py` |
| `AttributeError('StringIO' object has no attribute 'classifications')` | Outdated extractor running on `StringIO` payloads | `src/ragas/testset/transforms/extractors/llm_based.py` |
| `'headlines' property not found in this node` | Document too short; default transform skipped headline extraction | `src/ragas/testset/transforms/default.py` |
| Token usage always 0 for Bedrock | `response_metadata` keys differ across `ChatBedrock` / `ChatBedrockConverse` | LLM callback handler (see issue #2779) |
| Non-LLM context metrics threshold inconsistency (`>` vs `>=`) | Binary relevance boundary differs between `NonLLMContextRecall` and `NonLLMContextPrecisionWithReference` | Issue #2777 |
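
The last row is a boundary-condition issue. The sketch below shows how `>` and `>=` diverge only for scores exactly at the threshold. The function names and the 0.5 threshold are hypothetical, and which metric uses which operator is not stated here, so consult the linked issue for the actual behavior.

```python
def relevant_strict(score: float, threshold: float = 0.5) -> int:
    """Binary relevance using a strict comparison."""
    return int(score > threshold)


def relevant_inclusive(score: float, threshold: float = 0.5) -> int:
    """Binary relevance using an inclusive comparison."""
    return int(score >= threshold)


for score in (0.49, 0.5, 0.51):
    print(score, relevant_strict(score), relevant_inclusive(score))
```

Only the middle score is classified differently, so the inconsistency matters mainly for similarity measures that produce exact boundary values.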

For non-English corpora, ensure `Language` is set on prompts and that `transforms_llm` is instructed to keep headings in the source language; this avoids the headline-extractor pipeline failing partway through `generate()`. Source: [src/ragas/testset/synthesizers/single_hop/prompts.py:1-60]()

## See Also

- LLM module: `src/ragas/llms/__init__.py`
- Prompt format spec: `src/ragas/prompt/prompt-formats.md`
- Integrations index: `src/ragas/integrations/__init__.py`
- Tracing integrations: `src/ragas/integrations/tracing/__init__.py`
- Testset synthesis: `src/ragas/testset/synthesizers/generate.py`
- Community issue tracker: [vibrantlabsai/ragas issues](https://github.com/vibrantlabsai/ragas/issues)

---

## Pitfall Log

Project: vibrantlabsai/ragas

Summary: Found 17 structured pitfall items, including 9 high/blocking items. Top priority: Installation risk (requires verification).

## 1. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2760

## 2. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2748

## 3. Configuration risk - Configuration risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2774

## 4. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2067

## 5. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2732

## 6. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2622

## 7. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2649

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2692

## 9. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2779

## 10. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/vibrantlabsai/ragas

## 11. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2762

## 12. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/vibrantlabsai/ragas

## 13. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/vibrantlabsai/ragas

## 14. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/vibrantlabsai/ragas

## 15. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2777

## 16. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/vibrantlabsai/ragas

## 17. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/vibrantlabsai/ragas