Doramagic Project Pack · Human Manual
ragas
Supercharge Your LLM Application Evaluations
Overview, Installation, and Architecture
Related topics: Evaluation Metrics, Collections, and Cost Tracking, Testset Generation, Datasets, and Storage Backends, LLM Adapters, Integrations, Prompt Optimization, and Troubleshooting
Ragas is a toolkit for evaluating and optimizing Large Language Model (LLM) applications. It provides objective metrics, intelligent test data generation, and integration hooks for popular LLM frameworks, observability platforms, and agent runtimes. According to the project README, Ragas aims to replace time-consuming subjective assessments with data-driven evaluation workflows, and can also produce production-aligned test sets when no evaluation data is available. Source: README.md:1-18
Core Capabilities
The repository exposes four high-level feature pillars that frame the rest of the project:
| Pillar | Purpose |
|---|---|
| Objective Metrics | Evaluate LLM applications using LLM-based and traditional metrics |
| Test Data Generation | Automatically create comprehensive test datasets |
| Seamless Integrations | LangChain, LlamaIndex, observability tools, agent frameworks |
| Feedback Loops | Use production data to continually improve LLM applications |
Source: README.md:11-15
Installation
Ragas is distributed on PyPI and can be installed via pip. Two installation paths are documented in the README:
# PyPI install
pip install ragas
# Install directly from the source repository
pip install git+https://github.com/vibrantlabsai/ragas
Source: README.md:20-30
The most common setup authenticates through the OPENAI_API_KEY environment variable; the README explicitly notes that this key must be set before running the examples. Source: README.md:97-101
High-Level Architecture
Ragas is organized as a small, layered core surrounded by a rich ecosystem of pluggable adapters and integrations. The architecture can be summarized as follows:
flowchart TB
User[User / Application] --> CLI["ragas CLI<br/>(quickstart templates)"]
User --> SDK["ragas SDK<br/>(Python API)"]
SDK --> Metrics["Metrics<br/>(DiscreteMetric, AspectCritic)"]
SDK --> Testset["Testset Generation<br/>(synthesizers + transforms)"]
SDK --> Prompts["Prompt Engine<br/>(PydanticPrompt, MultiModalPrompt)"]
Metrics --> LLM["LLM Adapters<br/>(llm_factory)"]
Testset --> LLM
Prompts --> LLM
LLM --> Providers[(OpenAI, Bedrock,<br/>Mistral, Gemini, ...)]
SDK --> Backends["Backend Registry<br/>(pluggable storage)"]
SDK --> Integrations["Integrations<br/>(Langfuse, MLflow,<br/>LangSmith, Helicone,<br/>LangChain, LlamaIndex,<br/>LangGraph, Griptape,<br/>Swarm, AG-UI)"]
Integrations --> Observability[(Observability Platforms)]
Backends --> Storage[(Local / Remote Storage)]
This diagram reflects how user code enters the system through either the CLI or the Python SDK, and how the SDK funnels evaluation, generation, and prompting through a unified LLM adapter layer and an extensible backend / integration layer.
Key Modules
LLM Adapters
A central abstraction is StructuredOutputAdapter, which defines a single contract for adapters that produce structured outputs from any backend (e.g., Instructor, LiteLLM). Source: src/ragas/llms/adapters/base.py:4-32 This design lets llm_factory build a RagasLLM from a pre-initialized client and a model name, abstracting over the underlying provider. Community issue #2774 highlights that adapter coverage is still uneven: creating an adapter for the mistralai.Mistral client currently raises a ValueError, so users must ensure their provider is supported before calling llm_factory. Source: src/ragas/llms/adapters/base.py:18-32
Prompt Engine
The prompt engine is built on top of PydanticPrompt, a Pydantic-model-driven prompt abstraction that handles input/output typing, example storage, and persistence. Source: src/ragas/prompt/pydantic_prompt.py:1-20 The is_langchain_llm helper detects whether the active LLM is a LangChain or a Ragas LLM, with direct LangChain usage explicitly marked as deprecated in favor of the Ragas LLM interface. Source: src/ragas/prompt/pydantic_prompt.py:33-60
A companion MultiModalPrompt class extends this design to multi-modal inputs, and routes calls to either agenerate_prompt (LangChain) or generate (Ragas) depending on the LLM type. Source: src/ragas/prompt/multi_modal_prompt.py:115-148 Prompts are serialized to JSON (optionally gzip-compressed) following the format documented in prompt-formats.md, including support for DynamicFewShotPrompt with embedding-backed example selection. Source: src/ragas/prompt/prompt-formats.md:1-58
Testset Generation
The testset generation pipeline transforms raw documents into a knowledge graph, applies a configurable set of transforms, and then synthesizes queries and answers. The entry point TestsetGenerator.generate builds a KnowledgeGraph from input documents, calls apply_transforms, and dispatches to the synthesizer with a RunConfig. Source: src/ragas/testset/synthesizers/generate.py:18-78
The default transform pipeline (default_transforms) inspects document length distribution and conditionally attaches extractors such as HeadlinesExtractor, summaries, keyphrases, titles, and embeddings. Source: src/ragas/testset/transforms/default.py:1-60 LLMBasedNodeFilter subclasses (for example, CustomNodeFilter) score each node for "question potential" so low-value chunks can be dropped before synthesis. Source: src/ragas/testset/transforms/filters.py:1-66
The HeadlinesExtractorPrompt and NERPrompt in llm_based.py illustrate the project's standard pattern for extractors: a PydanticPrompt with a typed input, a typed output (Headlines or NEROutput), and a small set of in-context examples. Source: src/ragas/testset/transforms/extractors/llm_based.py:1-110 The single-hop synthesizer uses a parallel QueryCondition → GeneratedQueryAnswer prompt, optionally conditioned on an llm_context to guide question style. Source: src/ragas/testset/synthesizers/single_hop/prompts.py:1-58
Backends
Backends are managed through a singleton BackendRegistry that supports both built-in and plugin-discovered backends, with alias resolution and validation enforced on registration. Source: src/ragas/backends/registry.py:1-66 The shared MemorableNames utility generates human-friendly, Docker-style adjective/noun identifiers for experiments and datasets. Source: src/ragas/backends/utils.py:1-46
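The naming scheme can be illustrated with a minimal sketch. The word lists and the `memorable_name` helper below are hypothetical stand-ins, not the contents of src/ragas/backends/utils.py; they only show the Docker-style adjective/noun idea together with one possible way of avoiding collisions.

```python
import random

# Hypothetical word lists; the real utility ships its own vocabulary.
ADJECTIVES = ["brave", "calm", "eager", "gentle"]
NOUNS = ["curie", "turing", "lovelace", "hopper"]


def memorable_name(used: set, rng: random.Random) -> str:
    """Return an unused 'adjective_noun' name, adding a numeric suffix on collision."""
    base = f"{rng.choice(ADJECTIVES)}_{rng.choice(NOUNS)}"
    name, suffix = base, 1
    while name in used:
        suffix += 1
        name = f"{base}_{suffix}"
    used.add(name)
    return name


used_names = set()
rng = random.Random(0)  # seeded only to make the sketch reproducible
# 20 names from 16 possible pairs, so at least some names need a suffix.
names = [memorable_name(used_names, rng) for _ in range(20)]
```

Tracking used names in a set keeps identifiers unique within one session; whether the real utility does the same is an assumption of this sketch.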
Integrations
The integrations layer ships first-class adapters for tracing (Langfuse, MLflow), frameworks (LangChain, LlamaIndex, Griptape, LangGraph), observability (Helicone, LangSmith, Opik), platforms (Amazon Bedrock, R2R), agent systems (Swarm), and protocols (AG-UI). Source: src/ragas/integrations/__init__.py:1-20 Tracing integrations are imported lazily to keep optional dependencies optional, and the LangSmith helper upload_dataset converts a Testset into a pandas DataFrame before pushing it to a LangSmith dataset. Source: src/ragas/integrations/langsmith.py:1-56
Common Failure Modes and Known Issues
Several community-reported issues map directly to the modules above and are worth noting for new users:
- Token usage tracking: get_token_usage_for_bedrock always returns 0 because it reads the wrong response_metadata keys for langchain-aws ChatBedrock / ChatBedrockConverse (issue #2779). When using Bedrock, token totals from the CostCallbackHandler cannot be trusted until the keys are corrected.
- Model compatibility: AspectCritic raises runtime errors when the evaluator model is switched from gpt-4o to o3 (issue #2067), reflecting brittleness in metric prompts against newer reasoning models.
- Non-LLM context metrics: NonLLMContextRecall and NonLLMContextPrecisionWithReference use inconsistent strict vs. non-strict comparisons (> vs >=) when thresholding string-similarity scores into binary relevance (issue #2777).
- Deprecated import paths: The deprecation warning for LLMContextPrecisionWithoutReference points users to a class that does not exist in ragas.metrics.collections (issue #2748).
- Multilingual extraction: Running HeadlinesExtractor on non-English content can fail with 'headlines' property not found in this node (issue #1775), so the default transformer pipeline may need to be customized for non-English corpora.
- Optional dependencies: python-diskcache is imported inline but currently installed as a hard dependency; the community has requested it be made optional (issue #2622).
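The strict vs. non-strict thresholding issue is easiest to see on a score that lands exactly on the threshold. The sketch below is illustrative only: `binarize` is a hypothetical helper, not a ragas function, and it makes no claim about which metric uses which operator.

```python
def binarize(scores, threshold, strict):
    """Map similarity scores to 0/1 relevance using > (strict) or >= (non-strict)."""
    if strict:
        return [1 if s > threshold else 0 for s in scores]
    return [1 if s >= threshold else 0 for s in scores]


scores = [0.25, 0.5, 0.75]  # 0.5 sits exactly on the threshold
strict = binarize(scores, 0.5, strict=True)       # [0, 0, 1]
non_strict = binarize(scores, 0.5, strict=False)  # [0, 1, 1]
```

The two results differ only for the boundary score, which is why the inconsistency tends to surface with exact-match or rounded similarity values.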
See Also
- Metrics and DiscreteMetric usage (see README quickstart)
- LLM adapter providers and llm_factory
- Testset Generation module (synthesizers, transforms, filters)
- Integrations: Langfuse, MLflow, LangSmith, LangChain, LlamaIndex
- Prompt JSON format reference (src/ragas/prompt/prompt-formats.md)
Source: https://github.com/vibrantlabsai/ragas / Human Manual
Evaluation Metrics, Collections, and Cost Tracking
Related topics: Overview, Installation, and Architecture, LLM Adapters, Integrations, Prompt Optimization, and Troubleshooting
Overview
Ragas provides an evaluation framework for Large Language Model (LLM) applications, with a strong emphasis on objective metrics, automated test data generation, and per-run cost tracking. The README introduces the framework as a "toolkit for evaluating and optimizing Large Language Model (LLM) applications" that supports both LLM-based and traditional metrics, integrations with popular LLM frameworks, and feedback-loop driven improvement.
The evaluation surface area in ragas is split into three coordinated concerns:
- Metrics: pre-built and customizable scoring functions (e.g., faithfulness, answer correctness, noise sensitivity, aspect critique).
- Collections: the structured module that exposes LLM-backed and non-LLM metrics through a unified import surface and that drives testset generation.
- Cost Tracking: the TokenUsage / TokenUsageParser abstraction that allows per-run accounting of tokens and spend across providers (with provider-specific extractors such as get_token_usage_for_bedrock).
Metrics Module
The metrics package re-exports its prompt assets through a single entry point. As shown in src/ragas/prompt/metrics/__init__.py, ragas ships the following built-in metric prompts:
- correctness_classifier_prompt: used by the answer correctness metric to label a response against a reference.
- answer_relevancy_prompt: used by the answer relevance metric to score how pertinent an answer is to the question.
- nli_statement_prompt and statement_generator_prompt: shared, low-level prompts exposed via common.py and used by composition-based metrics (for example, faithfulness and noise sensitivity decompose a response into atomic statements and judge them with an NLI step).
Source: src/ragas/prompt/metrics/__init__.py:5-16
Beyond these, ragas exposes higher-level metrics used in user code. DiscreteMetric (shown in the README quickstart) lets users define a custom evaluator with a fixed allowed_values vocabulary, while AspectCritic and NoiseSensitivity address aspect-based evaluation and robustness against irrelevant context respectively. The noise_sensitivity.py prompt demonstrates the canonical pattern: a context + list of statements input, decomposed into per-statement verdicts of 0 (irrelevant) or 1 (relevant), each accompanied by a reason.
Source: src/ragas/prompt/metrics/noise_sensitivity.py
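The per-statement verdict pattern described above can be sketched with plain dataclasses. The field names and the averaging step are assumptions made for illustration; the real prompt models are Pydantic classes, and the exact aggregation used by each metric may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StatementVerdict:
    statement: str
    reason: str
    verdict: int  # 1 = relevant, 0 = irrelevant


def fraction_relevant(verdicts: List[StatementVerdict]) -> float:
    """Average the binary verdicts; an empty list scores 0.0 by convention here."""
    if not verdicts:
        return 0.0
    return sum(v.verdict for v in verdicts) / len(verdicts)


verdicts = [
    StatementVerdict("Paris is in France.", "Stated in the context.", 1),
    StatementVerdict("Paris has a river.", "The context mentions the Seine.", 1),
    StatementVerdict("Paris is the largest EU city.", "Not in the context.", 0),
    StatementVerdict("France is in Europe.", "Stated in the context.", 1),
]
score = fraction_relevant(verdicts)
```

Keeping a reason next to each verdict makes individual judgments auditable even when only the aggregate score is reported.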
LLM Interaction Pattern
All metric prompts ultimately subclass PydanticPrompt[InputModel, OutputModel], which in turn handles JSON-typed IO, response parsing, and LangChain/Ragas LLM dispatch. pydantic_prompt.py defines is_langchain_llm() to detect whether the supplied model is a legacy LangChain LLM or a Ragas LLM. multi_modal_prompt.py shows the dual code path: agenerate_prompt for LangChain LLMs, and ragas_llm.generate(...) for Ragas LLMs.
Source: src/ragas/prompt/multi_modal_prompt.py and src/ragas/prompt/pydantic_prompt.py
This split is why community issue #2774 reports a ValueError for the mistralai.Mistral client: providers must be wired through llm_factory so the metric can dispatch correctly.
Collections and Testset Generation
Ragas organizes metrics and synthesizers under "collections" so that testset generation, evaluation, and prompt optimization share a common knowledge-graph and transform pipeline. The generate.py testset generator accepts a list of documents, applies a default set of transforms (extracting summaries, keyphrases, titles, headlines, and embeddings, and then building similarity edges), and runs a query_distribution of synthesizers to produce TestsetSample records.
Source: src/ragas/testset/synthesizers/generate.py
The generator's signature includes token_usage_parser: Optional[TokenUsageParser], which directly couples the collections surface to the cost tracking subsystem. This is the parameter users reach for when they want to know how many tokens testset generation consumed, a recurring community question (see issue #540, "Tokens Usage for evaluations and testset generations").
Cost Tracking and Token Usage
Cost tracking in ragas is built on two primitives:
| Component | Role |
|---|---|
| TokenUsageParser | Accepts an LLMResult and returns a TokenUsage object; provider-agnostic entry point. |
| get_token_usage_for_<provider> (e.g., get_token_usage_for_bedrock) | Provider-specific extractor that reads response_metadata and returns the correct token counts. |
Issue #2779 documents a real failure mode: get_token_usage_for_bedrock "always returns 0" because it reads the wrong response_metadata keys for langchain-aws ChatBedrock / ChatBedrockConverse. The fix requires inspecting the actual metadata emitted by langchain-aws and matching the right key paths. The same pattern recurs for every provider: the TokenUsageParser is the dispatch point, but the per-provider key extraction is where bugs surface.
The wider cost module also publishes a CostCallbackHandler (referenced in the same community thread) that attaches to a run to collect usage in real time.
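One defensive way to write a provider-specific extractor is to try several known key paths and fail loudly when none match, rather than silently reporting zero. The sketch below works on plain dictionaries; the `Usage` class, the `extract_usage` helper, and every key name in `CANDIDATE_PATHS` are hypothetical and must be checked against the metadata your client actually emits.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0


# Hypothetical key paths, listed as (input path, output path) pairs.
CANDIDATE_PATHS = (
    (("usage", "prompt_tokens"), ("usage", "completion_tokens")),
    (("usage", "inputTokens"), ("usage", "outputTokens")),
)


def _dig(data, path):
    """Follow a tuple of keys through nested dicts; return None if any key is missing."""
    for key in path:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


def extract_usage(response_metadata):
    """Return Usage from the first key-path pair that fully matches."""
    for in_path, out_path in CANDIDATE_PATHS:
        tokens_in = _dig(response_metadata, in_path)
        tokens_out = _dig(response_metadata, out_path)
        if tokens_in is not None and tokens_out is not None:
            return Usage(int(tokens_in), int(tokens_out))
    raise KeyError("no known token-usage keys found in response_metadata")
```

Raising on a miss is a deliberate design choice in this sketch: a zero total is indistinguishable from a genuinely free call, whereas an exception points straight at the key mismatch.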
Integrations and Tracing
The integration layer in src/ragas/integrations/__init__.py documents the available surfaces: tracing (Langfuse, MLflow), frameworks (LangChain, LlamaIndex, Griptape, LangGraph), observability (Helicone, Langsmith, Opik), platforms (Amazon Bedrock, R2R), AI systems (Swarm), and protocols (AG-UI).
Source: src/ragas/integrations/__init__.py
Tracing imports are lazy: src/ragas/integrations/tracing/__init__.py only resolves observe, sync_trace, LangfuseTrace, and MLflowTrace on attribute access, so projects that do not install langfuse or mlflow still import ragas cleanly. Community issue #2622 ("Make python-diskcache dependency optional") asks for the same treatment for diskcache, which is imported on demand but is still installed as a hard dependency.
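The lazy-resolution technique relies on module-level `__getattr__` (PEP 562). The sketch below builds a throwaway module object to stay self-contained; the `_LAZY` mapping and the standard-library targets are placeholders chosen so the example runs anywhere, not the names ragas resolves.

```python
import importlib
import types

# Attribute name -> (module to import, attribute to fetch). Placeholder targets.
_LAZY = {"dumps": ("json", "dumps"), "sqrt": ("math", "sqrt")}


def _lazy_getattr(name):
    """Import the backing module only when the attribute is first requested."""
    try:
        module_name, attr = _LAZY[name]
    except KeyError:
        raise AttributeError(f"module has no attribute {name!r}") from None
    return getattr(importlib.import_module(module_name), attr)


# In a real package this function would simply be named __getattr__ in __init__.py.
tracing = types.ModuleType("tracing_demo")
tracing.__getattr__ = _lazy_getattr

root = tracing.sqrt(9)  # math is imported here, not at module creation
```

Because unknown names still raise AttributeError, `hasattr` checks and tab completion keep behaving normally.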
Prompt Persistence
src/ragas/prompt/prompt-formats.md documents the on-disk JSON (optionally gzip-compressed) format used to save and load both Prompt and DynamicFewShotPrompt objects. The format includes a format_version field, a type discriminator, an instruction template, an examples array, and a response_model_info block. DynamicFewShotPrompt extends this with an embedding model reference and max_similar_examples / similarity_threshold fields, enabling example retrieval at inference time.
Source: src/ragas/prompt/prompt-formats.md
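A round trip through plain and gzip-compressed JSON can be sketched with the standard library alone. The `save_prompt` and `load_prompt` helpers and all field values below are hypothetical; only the field names come from the format description above.

```python
import gzip
import json
import os
import tempfile


def save_prompt(data, path):
    """Write JSON, gzip-compressing when the path ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)


def load_prompt(path):
    """Read JSON written by save_prompt, transparently handling .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


prompt_record = {
    "format_version": "1.0",  # placeholder value
    "type": "Prompt",
    "instruction": "Summarize: {text}",
    "examples": [],
    "response_model_info": None,
}

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "prompt.json")
    packed = os.path.join(tmp, "prompt.json.gz")
    save_prompt(prompt_record, plain)
    save_prompt(prompt_record, packed)
    roundtrip_plain = load_prompt(plain)
    roundtrip_packed = load_prompt(packed)
```

Choosing the opener from the file extension keeps callers unaware of compression, which matches the "optionally gzip-compressed" wording of the format.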
Common Failure Modes
- Wrong provider metadata keys: get_token_usage_for_bedrock returns zero because the keys differ between ChatBedrock and ChatBedrockConverse (issue #2779).
- LLM factory dispatch: passing an unsupported client (e.g., raw mistralai.Mistral) to llm_factory raises ValueError (issue #2774).
- AspectCritic model compatibility: switching the evaluator model to o3 causes runtime errors in the prompt/parser pipeline (issue #2067).
- Threshold inconsistencies: non-LLM context metrics use mixed > vs >= comparisons when binarizing similarity (issue #2777).
- Stale deprecation names: deprecation warnings may reference class names that no longer exist in ragas.metrics.collections (issue #2748).
See Also
- Testset Generation and Knowledge Graph Transforms
- LLM Adapters and the llm_factory API
- Tracing Integrations: Langfuse and MLflow
- Prompt Persistence and DynamicFewShotPrompt
Testset Generation, Datasets, and Storage Backends
Related topics: Overview, Installation, and Architecture, Evaluation Metrics, Collections, and Cost Tracking
Overview and Role
Ragas treats LLM-application evaluation as a two-sided problem: you need metrics to score outputs, and you need data to score against. The ragas.testset module provides the second half by synthesizing realistic Testset objects from your source documents, while ragas.backends provides pluggable storage for persisting both datasets and experiment results across local, cloud, and third-party systems.
The public surface is intentionally small. src/ragas/testset/__init__.py exports only TestsetGenerator, Testset, and TestsetSample. Storage backends follow a parallel minimalism: BaseBackend in src/ragas/backends/base.py declares six abstract methods (load, save, and list for both datasets and experiments), and the ragas.backends entry-point group lets third parties register new backends without forking the library (see src/ragas/backends/README.md).
Testset Generation Pipeline
The `TestsetGenerator` Class
The entry point is TestsetGenerator in src/ragas/testset/synthesizers/generate.py. It exposes two generation paths:
- generate_with_langchain_docs(...): accepts LangChain Document objects (or raw strings), constructs a KnowledgeGraph of Node(type=NodeType.CHUNK) entries from each page_content + metadata, runs apply_transforms(...), then delegates to generate(...).
- generate(...): the lower-level entry that operates on an already-populated KnowledgeGraph and supports a configurable QueryDistribution (single-hop, multi-hop, etc.).
Both methods accept the same core options: testset_size, query_distribution, run_config, callbacks, token_usage_parser, with_debugging_logs, raise_exceptions, and return_executor. When return_executor=True, the call returns the Executor instance instead of running to completion, allowing the caller to call executor.cancel() mid-run and later collect partial results via executor.results().
The pipeline is gated by an explicit precondition: both self.llm and transforms_llm must be set, or a ValueError is raised.
Transforms and the Knowledge Graph
Before any query is generated, the source chunks are enriched by a pipeline of Transforms. The default pipeline, defined in src/ragas/testset/transforms/default.py, bins documents by token length and conditionally adds:
- HeadlinesExtractor (and the Headlines / NER Pydantic prompts in src/ragas/testset/transforms/extractors/llm_based.py) when ≥25% of documents exceed 500 tokens,
- SummaryExtractor and KeyphrasesExtractor for shorter corpora,
- Embedding-based similarity edges between nodes.
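The length-based branching can be sketched as a small decision function. The function name, the returned labels, and the parameter names are hypothetical; only the 500-token cutoff and the 25% share come from the description above, and the real pipeline returns transform objects rather than strings.

```python
def choose_pipeline(token_counts, long_doc_tokens=500, min_fraction=0.25):
    """Pick a transform family from the share of documents longer than the cutoff."""
    if not token_counts:
        raise ValueError("no documents supplied")
    long_docs = sum(1 for n in token_counts if n > long_doc_tokens)
    if long_docs / len(token_counts) >= min_fraction:
        return "headlines"           # long-document branch
    return "summary_keyphrases"      # shorter-corpus branch


mostly_short = choose_pipeline([100, 200, 300, 400, 900])  # 1 of 5 docs is long
one_in_four = choose_pipeline([100, 200, 300, 900])        # exactly 25% are long
```

Because the share is compared with `>=`, a corpus sitting exactly at 25% takes the long-document branch in this sketch.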
Filters drop low-utility nodes. CustomNodeFilter in src/ragas/testset/transforms/filters.py scores each chunk via the QuestionPotentialPrompt (1 to 5) and drops nodes scoring below min_score=2, falling back to a parent-document summary when one exists. If a chunk has no summary at all, the filter logs a warning and skips it.
flowchart LR
A[LangChain Documents / Strings] --> B[KnowledgeGraph: Chunk Nodes]
B --> C[apply_transforms: Headlines / Summary / Keyphrases / Embeddings]
C --> D[CustomNodeFilter: QuestionPotential]
D --> E[QueryDistribution Synthesizers]
E --> F[Testset samples: query + reference + contexts]
F --> G{return_executor?}
G -- false --> H[Testset]
G -- true --> I[Executor .results / .cancel]
Query Synthesis Prompts
The QueryCondition → GeneratedQueryAnswer pairs defined in src/ragas/testset/synthesizers/single_hop/prompts.py drive single-hop generation: persona, term, query style, query length, and an optional llm_context constrain the LLM so generated questions stay grounded in the source chunks.
Datasets and Schema
A generated test set is a list of TestsetSample records aggregated in a Testset container (see src/ragas/testset/synthesizers/testset_schema.py). Each sample carries the synthesized query, a reference answer, the contexts used to produce it, and the originating node references for traceability.
The prompt layer underneath (PydanticPrompt in src/ragas/prompt/pydantic_prompt.py, with the same dual code path shown in src/ragas/prompt/multi_modal_prompt.py) dispatches to either LangChain agenerate_prompt or Ragas-native BaseRagasLLM.generate and parses the output through a RagasOutputParser, raising RagasOutputParserException on schema failure. This indirection is what lets a single TestsetGenerator work across OpenAI, Bedrock, Vertex, and LlamaIndex backends (the adapter contract is defined in src/ragas/llms/adapters/base.py).
LangSmith Integration
src/ragas/integrations/langsmith.py exposes upload_dataset(testset, dataset_name, dataset_desc=""). The function reads LangSmith to confirm the dataset name is free, then converts the Testset to a pandas DataFrame and uploads it. A duplicate name raises ValueError. Broader integrations (Langfuse, MLflow, Helicone, Opik, Swarm, AG-UI) are catalogued in src/ragas/integrations/__init__.py and imported lazily so optional dependencies stay optional.
Storage Backends
`BaseBackend` Interface
src/ragas/backends/base.py declares the contract every storage backend must satisfy. Implementations receive and return List[Dict[str, Any]], raise FileNotFoundError on missing names, and return an empty list (never None) for empty datasets. The abstract methods are:
| Method | Purpose |
|---|---|
| load_dataset(name) | Return records for a dataset, or raise FileNotFoundError |
| save_dataset(name, data, ...) | Persist dataset records atomically |
| load_experiment(name) | Mirror of load_dataset for experiment runs |
| save_experiment(name, data, ...) | Mirror of save_dataset for experiment runs |
| list_datasets() / list_experiments() | Enumerate stored names |
The file-based convention is storage_root/datasets/ and storage_root/experiments/, so users can point multiple backends at the same root safely.
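The contract can be exercised with a minimal in-memory implementation. This sketch does not subclass the real BaseBackend (so it runs without ragas installed), and the exact method signatures are assumptions; it only mirrors the behaviors listed above: list-of-dict records, FileNotFoundError on unknown names, and an empty list for empty datasets.

```python
from typing import Any, Dict, List

Records = List[Dict[str, Any]]


class InMemoryBackend:
    """Hypothetical backend that keeps datasets and experiments in dictionaries."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict[str, Records]] = {"datasets": {}, "experiments": {}}

    def _load(self, kind: str, name: str) -> Records:
        if name not in self._store[kind]:
            raise FileNotFoundError(f"{name!r} not found in {kind}")
        return [dict(row) for row in self._store[kind][name]]  # defensive copy

    def _save(self, kind: str, name: str, data: Records) -> None:
        self._store[kind][name] = [dict(row) for row in data]

    def load_dataset(self, name: str) -> Records:
        return self._load("datasets", name)

    def save_dataset(self, name: str, data: Records) -> None:
        self._save("datasets", name, data)

    def load_experiment(self, name: str) -> Records:
        return self._load("experiments", name)

    def save_experiment(self, name: str, data: Records) -> None:
        self._save("experiments", name, data)

    def list_datasets(self) -> List[str]:
        return sorted(self._store["datasets"])

    def list_experiments(self) -> List[str]:
        return sorted(self._store["experiments"])


backend = InMemoryBackend()
backend.save_dataset("qa", [{"question": "hi", "answer": "hello"}])
backend.save_experiment("run1", [])  # an empty run still loads as []
```

Copying rows on both save and load keeps callers from mutating stored records by accident, which is a choice of this sketch rather than a documented requirement.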
Plugin Discovery
Third-party backends are registered via Python entry points (see src/ragas/backends/README.md). The package declares ragas.backends as the entry-point group; on first access get_registry() iterates registered entry points and lazy-loads each name -> backend_class mapping. To debug discovery, inspect registry.keys(). Bundled helpers like MemorableNames in src/ragas/backends/utils.py generate human-readable identifiers (adjective + scientist) for experiment and dataset names.
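Entry-point discovery of this kind can be sketched with `importlib.metadata` from the standard library. The helper names below are hypothetical and this is not the registry's actual code; it only shows how a group can be enumerated first and each backend class imported on demand.

```python
from importlib.metadata import entry_points


def discover_backends(group="ragas.backends"):
    """Return {name: EntryPoint} for the group without importing any backend yet."""
    eps = entry_points()
    # Newer Pythons expose .select(); older ones return a dict keyed by group.
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    return {ep.name: ep for ep in selected}


def load_backend(name, group="ragas.backends"):
    """Import and return the backend class registered under `name`."""
    found = discover_backends(group)
    if name not in found:
        raise KeyError(f"no backend named {name!r}; available: {sorted(found)}")
    return found[name].load()  # the import happens only here


available = sorted(discover_backends())  # empty when no plugins are installed
```

Separating enumeration from loading means a broken optional plugin only fails when it is actually requested.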
Google Drive Backend
The Google Drive backend, documented in src/ragas/backends/gdrive_backend.md, stores datasets and experiments as Google Sheets inside two folders (datasets/, experiments/). It supports both Service Account and OAuth 2.0 authentication and is installed via pip install "ragas[gdrive]".
Known Limitations (Community Notes)
Several open issues surfaced by users intersect with this surface area:
- Issue #1775 reports the headlines property missing on nodes when generating test data in non-English languages; this is the HeadlinesExtractor step failing because the LLM did not emit structured output for that locale.
- Issue #1688 tracks AttributeError('StringIO' object has no attribute 'classifications') during testset generation, again tied to the LLM extractor prompts.
- Issue #2231 collects feedback on the future of this module ahead of v0.4, which is relevant if you maintain a fork.
- Issue #540 covers token-usage accounting for testset generation, which the token_usage_parser argument on generate(...) exists to address (see src/ragas/cost.py).
See Also
- Metrics and Evaluation: LLM-based and non-LLM scoring metrics
- Integrations: LangChain, LlamaIndex, LangSmith, Langfuse
- Cost Tracking: TokenUsageParser and CostCallbackHandler
- Backends: Plugin authoring guide in src/ragas/backends/README.md
LLM Adapters, Integrations, Prompt Optimization, and Troubleshooting
Related topics: Overview, Installation, and Architecture, Evaluation Metrics, Collections, and Cost Tracking
Overview
Ragas provides a modular stack for connecting to LLMs, observability platforms, and prompt frameworks. The system is composed of three cooperating layers:
- Adapter layer: A pluggable abstraction over the Instructor and LiteLLM libraries, exposed through the llm_factory function and the StructuredOutputAdapter base class.
- Integration layer: Optional modules that bridge Ragas to LangChain, LlamaIndex, MLflow, Langfuse, Amazon Bedrock, and tracing/observability tooling.
- Prompt layer: JSON-serializable, versioned prompts with optional few-shot example stores used by metrics and testset synthesizers.
This page documents how those layers fit together and how to recover from the most frequently reported community issues.
LLM Adapters
The adapter layer normalizes structured-output generation across providers. The base abstraction is defined in src/ragas/llms/adapters/base.py:
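import typing as t
from abc import ABC, abstractmethod


class StructuredOutputAdapter(ABC):
    # Contract every structured-output adapter must implement.
    @abstractmethod
    def create_llm(
        self,
        client: t.Any,
        model: str,
        provider: str,
        **kwargs,
    ) -> t.Any: ...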
Concrete adapters (e.g. InstructorLLM, LiteLLMLLM) implement create_llm() and are selected by llm_factory() according to the requested adapter string, either "instructor" or "litellm". Source: src/ragas/llms/adapters/base.py:1-31 The factory applies a deterministic InstructorModelArgs default (temperature 0.1, max_tokens 1024) and accepts a cache backend such as DiskCacheBackend(), which the source credits with "60x" speed-ups on repeated evaluations. Source: src/ragas/llms/base.py:1-120
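Selecting an adapter by name can be sketched as a dictionary dispatch that raises ValueError for unknown names. Everything below is hypothetical: `Adapter`, `EchoAdapter`, `ADAPTERS`, and `make_llm` are illustrative stand-ins and not the ragas implementation of llm_factory.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Type


class Adapter(ABC):
    @abstractmethod
    def create_llm(self, client: Any, model: str, provider: str, **kwargs: Any) -> Any: ...


class EchoAdapter(Adapter):
    """Stand-in adapter that records its inputs instead of wrapping a real client."""

    def create_llm(self, client, model, provider, **kwargs):
        return {"client": client, "model": model, "provider": provider, "options": kwargs}


# Hypothetical registry: adapter name -> adapter class.
ADAPTERS: Dict[str, Type[Adapter]] = {"echo": EchoAdapter}


def make_llm(model, client, provider="openai", adapter="echo", **kwargs):
    """Look up the adapter by name and delegate construction to it."""
    try:
        adapter_cls = ADAPTERS[adapter]
    except KeyError:
        known = sorted(ADAPTERS)
        raise ValueError(f"unknown adapter {adapter!r}; expected one of {known}") from None
    return adapter_cls().create_llm(client, model, provider, **kwargs)


llm = make_llm("gpt-4o", client="fake-client", temperature=0.0)
```

Listing the known names in the error message turns an unsupported-adapter failure into something a caller can act on immediately.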
Provider-specific wrappers complement the generic factory. For example, the OCI Gen AI wrapper initializes the SDK lazily and raises ImportError only when the SDK is unavailable and neither an explicit client nor an endpoint_id is supplied:
if (
    self.client is None
    and GenerativeAiClient is None
    and self.endpoint_id is None
):
    raise ImportError("OCI SDK not found. Please install it with: pip ...")
Source: src/ragas/llms/oci_genai_wrapper.py:1-80
Known Adapter Failure Modes
The community has reported two recurring adapter bugs:
- mistralai client: llm_factory raises ValueError because the mistral provider branch is missing from the adapter registry. Workaround: use the litellm adapter instead.
- OpenAI o3 with AspectCritic: Structured-output parsing fails because o3 does not honor the default JSON schema the way gpt-4o does. The published mitigation is to fall back to the instructor adapter with mode=Mode.MD_JSON for backends that ignore response_format.
Both failures are routed through the same InstructorBaseRagasLLM interface, so swapping adapters requires no caller-side changes. Source: src/ragas/llms/base.py:1-120
Integrations
Ragas groups integrations under src/ragas/integrations/. The package re-exports optional submodules so that missing third-party SDKs do not hard-fail on import. Tracing integrations live in src/ragas/integrations/tracing/, which supports Langfuse and MLflow with observe() and sync_trace() helpers. Source: src/ragas/integrations/__init__.py:1-22, src/ragas/integrations/tracing/__init__.py:1-80
The framework-level integrations span LangChain, LlamaIndex, Griptape, LangGraph, Helicone, LangSmith, Opik, Amazon Bedrock, R2R, Swarm, and AG-UI. Because each is imported lazily through __getattr__, users can install only the bindings they need.
The example below, drawn from the README, shows the canonical entry point that all integrations share:
import asyncio
from openai import AsyncOpenAI  # requires OPENAI_API_KEY in the environment
from ragas.metrics import DiscreteMetric
from ragas.llms import llm_factory

async def main():
    client = AsyncOpenAI()
    llm = llm_factory("gpt-4o", client=client)
    metric = DiscreteMetric(name="summary_accuracy", allowed_values=["accurate", "inaccurate"], prompt="...")
    return await metric.ascore(llm=llm, response="...")

score = asyncio.run(main())
Source: README.md:1-80
Prompt Optimization
Prompts are first-class, persistable objects. The PydanticPrompt[InputModel, OutputModel] base class renders a deterministic prompt string that combines an instruction, JSON-schema output spec, optional few-shot examples, and the serialized input. Source: src/ragas/prompt/metrics/base_prompt.py:1-90
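How the four ingredients might combine into one deterministic string can be sketched with the standard library. The `render_prompt` function, its section labels, and its argument shapes are assumptions for illustration; the real PydanticPrompt derives the schema from Pydantic models and uses its own layout.

```python
import json


def render_prompt(instruction, output_schema, examples, payload):
    """Join instruction, output schema, few-shot examples, and input into one string."""
    parts = [
        instruction,
        "Respond with JSON matching this schema:",
        json.dumps(output_schema, sort_keys=True),
    ]
    for example_input, example_output in examples:
        parts.append("Example input: " + json.dumps(example_input, sort_keys=True))
        parts.append("Example output: " + json.dumps(example_output, sort_keys=True))
    parts.append("Input: " + json.dumps(payload, sort_keys=True))
    parts.append("Output:")
    return "\n".join(parts)


schema = {"type": "object", "properties": {"verdict": {"type": "integer"}}}
examples = [({"statement": "Water is wet."}, {"verdict": 1})]
prompt_text = render_prompt("Judge the statement.", schema, examples, {"statement": "Fire is cold."})
```

Sorting keys during serialization is what makes the rendered string stable across runs, which matters when prompts are cached or compared.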
Prompt Persistence
Prompt and DynamicFewShotPrompt use JSON (optionally gzip-compressed to .json.gz). The reference doc compares the two formats:
| Feature | Prompt | DynamicFewShotPrompt |
|---|---|---|
| Type ID | "Prompt" | "DynamicFewShotPrompt" |
| Embedding-backed example selection | No | Yes |
| Similarity threshold / max similar examples | No | Yes |
| Stores response_model_info | Yes | Yes |
Source: src/ragas/prompt/prompt-formats.md:1-50
For testset generation, generate.py orchestrates a sequence of synthesizers driven by configurable query_distribution, transforms_llm, and transforms_embedding_model. The default transform pipeline (default.py) inspects document-length bins and conditionally enables HeadlinesExtractor when ≥25% of documents exceed 500 tokens, helping LLM-based extractors avoid the StringIO and 'headlines' property not found errors reported in community issues #1775 and #1688. Source: src/ragas/testset/transforms/default.py:1-80, src/ragas/testset/synthesizers/generate.py:1-100
LLM-based extractors (SummaryExtractor, KeyphrasesExtractor, TitleExtractor, HeadlinesExtractor, NERExtractor, TopicDescriptionExtractor) all extend PydanticPrompt and share the same example/few-shot machinery, making them a natural place to plug in DSPy-style optimizers introduced in v0.4.3. Source: src/ragas/testset/transforms/extractors/__init__.py:1-30, src/ragas/testset/transforms/extractors/llm_based.py:1-120
Troubleshooting
The most common community-reported problems map to specific subsystems:
| Symptom | Likely cause | Where to look |
|---|---|---|
| ValueError from llm_factory | Unsupported provider / client combo | src/ragas/llms/base.py, adapters/base.py |
| AttributeError('StringIO' object has no attribute 'classifications') | Outdated extractor running on StringIO payloads | src/ragas/testset/transforms/extractors/llm_based.py |
| 'headlines' property not found in this node | Document too short; default transform skipped headline extraction | src/ragas/testset/transforms/default.py |
| Token usage always 0 for Bedrock | response_metadata keys differ across ChatBedrock / ChatBedrockConverse | LLM callback handler (see issue #2779) |
| Non-LLM context metrics threshold inconsistency (> vs >=) | Binary relevance boundary differs between NonLLMContextRecall and NonLLMContextPrecisionWithReference | Issue #2777 |
For non-English corpora, ensure Language is set on prompts and that transforms_llm is instructed to keep headings in the source language; this avoids the headline-extractor pipeline failing partway through generate(). Source: src/ragas/testset/synthesizers/single_hop/prompts.py:1-60
See Also
- LLM module: src/ragas/llms/__init__.py
- Prompt format spec: src/ragas/prompt/prompt-formats.md
- Integrations index: src/ragas/integrations/__init__.py
- Tracing integrations: src/ragas/integrations/tracing/__init__.py
- Testset synthesis: src/ragas/testset/synthesizers/generate.py
- Community issue tracker: vibrantlabsai/ragas issues
Source: https://github.com/vibrantlabsai/ragas / Human Manual
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 17 structured pitfall item(s), including 9 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2760
2. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2748
3. Configuration risk: Configuration risk requires verification
- Severity: high
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2774
4. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2067
5. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2732
6. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2622
7. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2649
8. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2692
9. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2779
10. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/vibrantlabsai/ragas
11. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2762
12. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/vibrantlabsai/ragas
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using ragas with real data or production workflows.
- Proposal: Contribute English/Uzbek Multilingual RAG Evaluation Dataset t - github / github_issue
- get_token_usage_for_bedrock always returns 0 (reads wrong response_metad - github / github_issue
- AspectCritic not working with openai o3 - github / github_issue
- Feature request: Add AgentThreatBench memory poison task as a RAG securi - github / github_issue
- llm_factory raises ValueError when using mistralai client - github / github_issue
- [[Security] Agentic Workflow Injection in Claude Docs Check](https://github.com/vibrantlabsai/ragas/issues/2692) - github / github_issue
- NonLLMContextRecall and NonLLMContextPrecisionWithReference threshold re - github / github_issue
- Website footer GitHub link redirects to incorrect repository/user - github / github_issue
- Incorrect class name in deprecation warning for LLMContextPrecisionWitho - github / github_issue
- Add EvaluationResult summary and threshold checks - github / github_issue
- Make python-diskcache dependency optional - github / github_issue
- v0.4.3 - github / github_release
Source: Project Pack community evidence and pitfall evidence