graphrag Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system

GraphRAG Overview and Architecture

Related topics: Indexing Pipeline, Data Flow & Incremental Updates, Query Engine and Search Methods, Configuration, LLM Integration, Storage & Extensibility

Purpose and Scope

GraphRAG is a data pipeline and transformation suite designed to extract meaningful, structured information from unstructured text using large language models (LLMs). The project implements a knowledge-graph–based memory layer that augments LLM reasoning over private datasets, as described in the upstream Microsoft Research blog post and the GraphRAG arXiv paper. Source: README.md:1-9.

The repository is a methodology demonstration, not an officially supported Microsoft product, and indexing is intentionally treated as an expensive operation that should be started on small data first. Source: README.md:17-19. The codebase is published as a monorepo of several Python packages (each with its own README) plus a unified-search-app demo that consumes the resulting index.

Repository and Package Architecture

The monorepo separates concerns into narrowly-scoped libraries that can be composed at runtime. The top-level graphrag package exposes the user-facing API and CLI; the remaining packages provide pluggable, factory-based building blocks.

graph TB
    subgraph "graphrag (main package)"
        API[api: index, query, prompt_tune]
        CLI[cli: graphrag init/index/query]
        CFG[config: GraphRagConfig + load_config]
        IDX[index: run_pipeline, workflows]
    end

    subgraph "Supporting packages"
        CHUNK[graphrag-chunking]
        LLM[graphrag-llm: completion]
        STORE[graphrag-storage]
        CACHE[graphrag-cache]
        IN[graphrag-input]
        COMMON[graphrag-common: factory + config]
    end

    subgraph "Reference consumers"
        APP[unified-search-app]
    end

    API --> IDX
    API --> CFG
    CLI --> API
    IDX --> CHUNK
    IDX --> LLM
    IDX --> STORE
    IDX --> CACHE
    IDX --> IN
    API --> LLM
    APP --> STORE
    APP --> API
    COMMON -. provides .-> CHUNK
    COMMON -. provides .-> LLM
    COMMON -. provides .-> STORE
    COMMON -. provides .-> CACHE

Key architectural conventions:

Factory + DI pattern. A shared Factory class in graphrag-common registers and resolves implementations by string strategy, with transient and singleton scopes. Source: packages/graphrag-common/README.md:5-9.
Config-driven setup. load_config in graphrag-common auto-discovers YAML/JSON, performs environment variable substitution, and supports .env loading. Source: packages/graphrag-common/README.md:13-15.
Pluggable storage backends. graphrag-storage registers FileStorage, AzureBlobStorage, AzureCosmosStorage, and MemoryStorage, with dynamic preregistration so unused providers are not imported. Source: packages/graphrag-storage/README.md:43-55.
Pluggable cache backends. graphrag-cache ships JsonCache, MemoryCache, and NoopCache under the same factory model. Source: packages/graphrag-cache/README.md:23-32.
Pluggable input formats. graphrag-input supports CSV, JSON, JSON Lines, plain text, and a MarkItDown converter that can ingest PDFs, Office files, HTML, etc. Source: packages/graphrag-input/README.md:5-23.
Pluggable chunking. graphrag-chunking exposes SentenceChunker, TokenChunker, and a create_chunker factory keyed off a ChunkingConfig object. Source: packages/graphrag-chunking/README.md:1-9.
Pluggable LLM completion. graphrag-llm provides a create_completion(model_config) function and a ModelConfig that abstracts over providers. Source: packages/graphrag-llm/graphrag_llm/README.md:3-15.
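
The factory convention above can be illustrated with a minimal sketch. StrategyFactory, MemoryCacheSketch, and the strategy names "memory" and "scratch" are hypothetical stand-ins written for this example; they are not the graphrag-common Factory API, whose actual method names and signatures should be checked in the package README.

```python
from typing import Any, Callable, Dict, Tuple


class StrategyFactory:
    """Hypothetical string-keyed factory; NOT the graphrag-common API."""

    def __init__(self) -> None:
        self._builders: Dict[str, Tuple[Callable[..., Any], str]] = {}
        self._singletons: Dict[str, Any] = {}

    def register(self, strategy: str, builder: Callable[..., Any],
                 scope: str = "transient") -> None:
        if scope not in ("transient", "singleton"):
            raise ValueError(f"unknown scope: {scope}")
        self._builders[strategy] = (builder, scope)

    def create(self, strategy: str, **kwargs: Any) -> Any:
        if strategy not in self._builders:
            raise KeyError(f"strategy not registered: {strategy}")
        builder, scope = self._builders[strategy]
        if scope == "singleton":
            # Build once, then return the cached instance on later calls.
            if strategy not in self._singletons:
                self._singletons[strategy] = builder(**kwargs)
            return self._singletons[strategy]
        # Transient scope: a fresh instance for every call.
        return builder(**kwargs)


class MemoryCacheSketch:
    """Toy cache used only to demonstrate the two scopes."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}


factory = StrategyFactory()
factory.register("memory", MemoryCacheSketch, scope="singleton")
factory.register("scratch", MemoryCacheSketch, scope="transient")

shared_a = factory.create("memory")
shared_b = factory.create("memory")   # same object as shared_a
fresh_a = factory.create("scratch")
fresh_b = factory.create("scratch")   # distinct object from fresh_a
```

Resolving implementations by string is what lets a config file select a backend without the call site importing it directly.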

Core APIs and Pipelines

The public surface of the main package is concentrated in graphrag.api, with three entry points: indexing, query, and prompt tuning. Source: packages/graphrag/graphrag/api/__init__.py:11-33.

Indexing API

build_index(config, method, is_update_run, callbacks, input_documents, ...) runs a pipeline under a GraphRagConfig, choosing an IndexingMethod (e.g., Standard) and a PipelineFactory-resolved workflow. Source: packages/graphrag/graphrag/api/index.py:21-49. The function returns a list of PipelineRunResult records, allowing callers to inspect per-stage output. The is_update_run flag is the existing hook for the highly requested incremental-indexing workflow tracked in community discussion #741 ("Incremental indexing (adding new content)"), which has 35 comments and is currently in the design stage. Source: packages/graphrag/graphrag/api/index.py:30-33.

Query API

graphrag.api.query exposes six search entry points: global_search, global_search_streaming, local_search, local_search_streaming, drift_search, drift_search_streaming, plus basic_search variants re-exported from __init__. Source: packages/graphrag/graphrag/api/query.py:1-19 and packages/graphrag/graphrag/api/__init__.py:23-29. Internally these functions call get_global_search_engine, get_local_search_engine, get_drift_search_engine, and get_basic_search_engine from the query.factory module, then rehydrate the persisted index through read_indexer_* adapter helpers. Source: packages/graphrag/graphrag/api/query.py:33-44. The expected table names for those artifacts are codified in unified-search-app/app/data_config.py (output/communities, output/community_reports, output/entities, output/relationships, output/covariates, output/text_units). Source: unified-search-app/app/data_config.py:6-21.

Prompt Tuning API

generate_indexing_prompts (in graphrag.api.prompt_tune) drives auto-templating: it loads sample documents, detects language and domain, infers entity types, and synthesizes extraction, summarization, community-report, and reporter-role prompts. Source: packages/graphrag/graphrag/api/prompt_tune.py:11-39. This API is explicitly marked as under development and not yet stable. Source: packages/graphrag/graphrag/api/prompt_tune.py:9-11.

CLI Surface

The graphrag CLI is exported from graphrag.cli and is the recommended starting point. The recommended initialization command is graphrag init --root [path] --force, which should be rerun between minor version bumps to pick up the latest config format. Source: packages/graphrag/README.md:51-55.

Supporting Subsystems and Community-Driven Roadmap

Beyond the core APIs, several subsystems implement the storage, caching, and language-model abstractions that the indexer and query engines rely on:

The LLMCompletion returned by create_completion is the abstraction used throughout the codebase; it returns either an LLMCompletionResponse or an Iterator[LLMCompletionChunk] for streaming, and gather_completion_response collapses both into a single string. Source: packages/graphrag-llm/graphrag_llm/README.md:5-43.
The unified search app's data config defines reasonable defaults for downstream LLM use, including suggested follow-up questions and a 7-day Streamlit cache TTL, and notes that context-window settings should be tuned per model. Source: unified-search-app/app/data_config.py:23-30.
The most recent release (v3.1.0) introduced a native CosmosTableProvider with namespace partitioning, transactional batch writes, and a simplified AzureCosmosStorage, plus a litellm dependency update that broadens indirect model-provider support. Source: community release notes for v3.1.0.
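
The idea of collapsing a blocking response and a streamed response into one string, as described for gather_completion_response above, might look like the following sketch. ResponseSketch, ChunkSketch, and gather_text are hypothetical stand-ins with assumed field names (content, delta); they do not reproduce the graphrag-llm types.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass
class ResponseSketch:
    """Hypothetical stand-in for a complete (non-streaming) response."""
    content: str


@dataclass
class ChunkSketch:
    """Hypothetical stand-in for one streamed piece of a response."""
    delta: str


def gather_text(result: Union[ResponseSketch, Iterator[ChunkSketch]]) -> str:
    """Return the full text whether the result is blocking or streamed."""
    if isinstance(result, ResponseSketch):
        return result.content
    # Streaming case: concatenate the pieces in arrival order.
    return "".join(chunk.delta for chunk in result)


def fake_stream() -> Iterator[ChunkSketch]:
    """Simulated stream, standing in for a real provider call."""
    for piece in ("Graph", "RAG ", "answer"):
        yield ChunkSketch(delta=piece)


blocking_text = gather_text(ResponseSketch(content="GraphRAG answer"))
streamed_text = gather_text(fake_stream())
```

With a helper of this shape, callers that only need the final text do not have to branch on whether streaming was enabled.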

Open community threads shape the near-term roadmap and are useful to know when planning an adoption:

Incremental indexing (#741, 35 comments). Add new documents to an existing index without a full re-run; design is in progress.
Additional model providers (#657, 15 comments; #345, 29 comments for Ollama). Native support beyond OpenAI/Azure is not planned by the core team; the litellm upgrade in v3.1.0 and community workarounds for Ollama remain the primary paths.
Cheaper triplet extraction (#632, 2 comments). Interest in integrating Triplex for cost reduction relative to gpt-4o.
LazyGraphRAG (#1512, 44 comments). The most-upvoted open question, awaiting a release announcement.

Indexing Pipeline, Data Flow & Incremental Updates

Related topics: GraphRAG Overview and Architecture, Query Engine and Search Methods, Configuration, LLM Integration, Storage & Extensibility

Overview and Purpose

GraphRAG's indexing pipeline is the data transformation suite that converts unstructured text into a structured knowledge graph plus derived artifacts (entities, relationships, communities, community reports, embeddings, and covariates). The repository positions this suite as "a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using the power of LLMs" (README.md). The README also warns users that "GraphRAG indexing can be an expensive operation, please read all of the documentation to understand the process and costs involved, and start small" (README.md).

The pipeline is composed of modular Python packages:

packages/graphrag-input: loaders that ingest source documents from disk or blob storage, with MarkItDown conversion for formats such as PDF (packages/graphrag-input/README.md).
packages/graphrag-chunking: text splitters (sentence, token, factory-based) that produce text units (packages/graphrag-chunking/README.md).
packages/graphrag: the core library, including graphrag.api.prompt_tune, graphrag.api.query, the CLI (graphrag.cli.prompt_tune, graphrag.cli.index), and the pipeline runner (packages/graphrag/README.md).
packages/graphrag-common: shared infrastructure providing the Factory dependency-injection pattern and the load_config system that parses YAML/JSON with Pydantic, environment-variable substitution, and .env loading (packages/graphrag-common/README.md).
unified-search-app: a Streamlit reference application that consumes the produced parquet outputs to expose search and community exploration (unified-search-app/README.md).

The pipeline writes its final results as parquet tables under well-known paths consumed downstream by the query engine and the search app: output/communities, output/community_reports, output/entities, output/relationships, output/covariates, and output/text_units (unified-search-app/app/data_config.py).
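
A quick pre-flight check for those tables could be sketched as follows. The table names come from the list above; the ".parquet" file suffix and the find_tables helper are assumptions made for this example, and the temporary directory only simulates a partially built index.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Table names as listed above; the ".parquet" suffix is an assumption.
EXPECTED_TABLES: List[str] = [
    "output/communities",
    "output/community_reports",
    "output/entities",
    "output/relationships",
    "output/covariates",
    "output/text_units",
]


def find_tables(root: Path, suffix: str = ".parquet") -> Dict[str, bool]:
    """Report, for each expected table, whether a file exists under root."""
    return {name: (root / f"{name}{suffix}").is_file() for name in EXPECTED_TABLES}


# Simulate a project root where only two tables have been written so far.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "output").mkdir()
    (root / "output" / "entities.parquet").write_bytes(b"")
    (root / "output" / "text_units.parquet").write_bytes(b"")
    status = find_tables(root)

missing = sorted(name for name, present in status.items() if not present)
```

Running a check like this before starting a query session gives a clearer error than a failed load deep inside the search code.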

Data Flow Stages

The runtime flow from raw documents to a queryable index runs through five processing stages, each producing artifacts that the next stage consumes. An optional prompt-tuning pre-pass can run before graph extraction.

flowchart LR
    A[Input Loader<br/>graphrag-input] --> B[Chunking<br/>graphrag-chunking]
    B --> C[Graph Extraction<br/>LLM: entities/relationships]
    C --> D[Community Detection<br/>+ Report Generation]
    D --> E[Embeddings & Covariates]
    E --> F[(Parquet Outputs<br/>text_units, entities,<br/>relationships, communities,<br/>community_reports, covariates)]
    F --> G[Query / Search App<br/>graphrag.api.query]

Key behaviors observed in the source:

Input: loaders read raw files according to a configured input.type (for example, markitdown with a file pattern such as ".*\\.pdf$$") and an input_storage block describing where the input lives (e.g. local type: file, base_dir: input) (packages/graphrag-input/README.md). The unified-search-app's create_datasource switches between BlobDatasource and LocalDatasource based on whether blob_account_name is set, demonstrating the same pluggable strategy used inside the indexing CLI (unified-search-app/app/knowledge_loader/data_sources/loader.py).
Chunking: the ChunkingConfig selects a strategy via create_chunker, with SentenceChunker for boundary detection and TokenChunker for fixed-size windows with overlap (packages/graphrag-chunking/README.md). During prompt tuning, chunking overrides are applied to the loaded graph config: when the chunk_size argument differs from graph_config.chunking.size, the CLI assigns it to graph_config.chunking.size, and the same pattern is used for overlap (packages/graphrag/graphrag/cli/prompt_tune.py).
Prompt Tuning (optional pre-pass): generate_indexing_prompts chunks a sample of documents, derives a domain and persona from the LLM if not supplied, and returns the entity-extraction, entity-summarization, and community-summarization prompts that downstream index stages will use (packages/graphrag/graphrag/api/prompt_tune.py). The CLI mirrors this API, writing logs to prompt-tuning.log and honoring overrides for chunk_size, overlap, limit, selection_method, domain, language, max_tokens, discover_entity_types, and min_examples_required (packages/graphrag/graphrag/cli/prompt_tune.py).
Graph Extraction: text units are sent to the LLM with the tuned prompts to produce entities, relationships, claims/covariates, and descriptions. This stage is the dominant cost driver and is what makes GraphRAG "an expensive operation" (README.md).
Community Detection and Reporting: Leiden or Leiden-like algorithms produce a hierarchy of communities; an LLM-driven reporter generates per-community summaries that the global search engine consumes (packages/graphrag/graphrag/api/query.py).
Outputs: the canonical tables listed above are persisted and re-read by local_search and global_search via DataFrame parameters (entities, relationships, text_units, community_reports, covariates, communities) (packages/graphrag/graphrag/api/query.py). The unified-search-app's UI then renders these as citations, hyperlinking entity/relationship IDs back to source text units (unified-search-app/app/ui/search.py).
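
The fixed-size windows with overlap mentioned in the chunking stage can be sketched in a few lines. This is a minimal illustration under stated assumptions: whitespace-separated words stand in for real tokenizer output, and chunk_tokens is a hypothetical helper, not the TokenChunker implementation.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Whitespace "tokens" are an assumption; a real tokenizer yields token ids.
tokens = "a b c d e f g h i j".split()
windows = chunk_tokens(tokens, size=4, overlap=1)
# windows == [['a','b','c','d'], ['d','e','f','g'], ['g','h','i','j']]
```

The overlap keeps a sentence that straddles a window boundary visible in both neighboring text units, at the cost of sending some tokens to the LLM twice.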

Incremental Updates: Current State

Incremental indexing, the ability to add new documents to an existing index without rebuilding from scratch, is one of the most active community topics, tracked in issue #741 "Incremental indexing (adding new content)". As of v3.1.0, the maintainers state that the feature is "in the design stages" and provide a manual workaround. The current repository architecture, however, still assumes a full re-run for the parquet outputs consumed by graphrag.api.query (packages/graphrag/graphrag/api/query.py).

What users can do today:

Append new files to the input directory configured via input_storage (local or blob) (packages/graphrag-input/README.md).
Re-run the full pipeline; the loaders will pick up the new files based on the configured file pattern (e.g. ".*\\.pdf$$") (packages/graphrag-input/README.md).
Swap in a different parquet output backend. The v3.1.0 release notes call out a "Native CosmosTableProvider with namespace partitioning, transactional batch writes, and simplified AzureCosmosStorage", which makes it easier to persist index artifacts in Azure Cosmos and treat each pipeline run as a partitioned namespace — a foundation for future incremental runs.

What is not yet first-class:

build_index accepts an is_update_run flag, but no incremental flag or delta-detection step is visible in the chunking or graph extraction APIs of graphrag-chunking or graphrag.api.prompt_tune (packages/graphrag/graphrag/api/prompt_tune.py).
Community detection and report generation re-derive the full hierarchy on each run; merging communities across runs is not implemented in the graphrag.api.query interface (packages/graphrag/graphrag/api/query.py).
The unified-search-app reads outputs via create_datasource, which selects either BlobDatasource or LocalDatasource based on environment, but does not perform any merge or diff of prior and new parquet outputs (unified-search-app/app/knowledge_loader/data_sources/loader.py).

Until incremental indexing ships, the recommended operational pattern is to version output directories per run and treat each run as immutable.
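
One way to version output directories per run is sketched below. The "run-<UTC timestamp>" naming scheme and the helpers new_run_dir and latest_run_dir are assumptions made for this example; GraphRAG does not prescribe this layout.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def new_run_dir(base: Path, now: datetime) -> Path:
    """Create a fresh, timestamped output directory for one indexing run."""
    run_dir = base / now.strftime("run-%Y%m%dT%H%M%SZ")
    # exist_ok=False makes an accidental overwrite of a prior run fail loudly.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


def latest_run_dir(base: Path) -> Path:
    """Return the newest run; the timestamp format sorts lexicographically."""
    runs = sorted(p for p in base.iterdir()
                  if p.is_dir() and p.name.startswith("run-"))
    if not runs:
        raise FileNotFoundError(f"no run directories under {base}")
    return runs[-1]


# Demonstration in a throwaway directory with two fixed timestamps.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    first = new_run_dir(base, datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc))
    second = new_run_dir(base, datetime(2025, 1, 2, 9, 30, 0, tzinfo=timezone.utc))
    newest_name = latest_run_dir(base).name
```

Pointing the query layer at the directory returned by latest_run_dir keeps older runs available for rollback or comparison.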

Configuration, CLI, and Extensibility

All pipeline behavior is driven by settings.yaml, parsed through load_config, which "automatically discovers and parses YAML/JSON config files into Pydantic models with support for environment variable substitution and .env file loading" (packages/graphrag-common/README.md). Strategies (chunkers, model providers, storage backends) are registered through the Factory class with transient or singleton scope, allowing new implementations to be plugged in without changing call sites (packages/graphrag-common/README.md).
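
Environment-variable substitution in a config file can be sketched with Python's standard string.Template. That load_config behaves the same way is an assumption here, not something this page confirms; if it does, it would explain why the file-pattern examples write "$$" to obtain a literal "$". The substitute_env helper and the sample values are illustrative.

```python
from string import Template
from typing import Mapping


def substitute_env(raw_config: str, env: Mapping[str, str]) -> str:
    """Replace ${NAME} placeholders; raises KeyError if a variable is unset."""
    return Template(raw_config).substitute(env)


# A raw string keeps the backslashes exactly as they would appear on disk.
raw = r'''api_key: ${GRAPHRAG_API_KEY}
file_pattern: ".*\\.pdf$$"
'''

# A plain dict stands in for os.environ so the example is deterministic.
rendered = substitute_env(raw, {"GRAPHRAG_API_KEY": "sk-test"})
```

Because Template.substitute raises on a missing variable, a forgotten .env entry surfaces at load time rather than as an authentication failure later.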

The prompt-tuning CLI explicitly honors per-invocation overrides for chunking parameters before delegating to the API, illustrating how users can experiment without rewriting the config file (packages/graphrag/graphrag/cli/prompt_tune.py). Community demand for non-OpenAI/Azure providers (issue #657, with #345 focused on Ollama) flows through this same factory mechanism: new model providers are added by registering a strategy string in graphrag-common's Factory rather than by patching core code (packages/graphrag-common/README.md).

Downstream, the Streamlit unified-search-app renders citation tables for each context type (sources, reports, entities, relationships, covariates) by reading the parquet outputs the indexing pipeline produces, making the pipeline's contract with the query layer explicit and stable (unified-search-app/app/ui/search.py).

Query Engine and Search Methods

Related topics: GraphRAG Overview and Architecture, Indexing Pipeline, Data Flow & Incremental Updates, Configuration, LLM Integration, Storage & Extensibility

Overview

The Query Engine is the retrieval layer of Microsoft GraphRAG. After the indexer produces a knowledge graph (entities, relationships, communities, community reports, text units, and optional covariates), the query engine consumes those parquet outputs and returns natural-language answers grounded in the graph. The module exposes a public API and a CLI, and is also embedded inside the Streamlit-based unified-search-app for interactive exploration.

The module's docstring states it "provides access to the query engine of graphrag, allowing external applications to hook into graphrag and run queries over a knowledge graph" and warns that "this API is under development and may undergo changes in future releases. Backwards compatibility is not guaranteed at this time" (packages/graphrag/graphrag/api/query.py). Expect the set of search methods to stay recognizable, but plan for signatures and parameters to change between releases.

Search Methods

The query engine implements multiple search strategies, each suited to a different question type. They are assembled through a factory in graphrag.query.factory (referenced as get_basic_search_engine, get_drift_search_engine, get_global_search_engine, and get_local_search_engine in the API module) and selected via the CLI's --method argument.

Local Search — entity-centric retrieval. Uses entities, their text-unit neighborhoods, relationships, and covariates to answer questions about specific people, places, or concepts. The API signature requires entities, communities, community_reports, text_units, relationships, community_level, and response_type (packages/graphrag/graphrag/api/query.py).
Global Search — map-reduce over community reports. The engine distributes the query across many community summaries and consolidates partial answers into a single response. A dynamic_community_selection flag enables runtime selection of communities, capped by community_level (packages/graphrag/graphrag/api/query.py).
DRIFT Search — a dynamic variant that combines local and global reasoning by introducing exploratory sub-queries; useful for comparative or "why/how" questions. The CLI exposes it as a distinct --method value (packages/graphrag/graphrag/cli/query.py).
Basic Search — text-unit-only retrieval, lightweight, with no graph traversal (packages/graphrag/graphrag/cli/query.py).
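
The map-reduce shape of Global Search can be illustrated with a toy example. In the sketch below, keyword overlap stands in for the LLM calls that would score and summarize each community report; keyword_map, reduce_partials, and map_reduce_search are hypothetical helpers, not functions from graphrag.query.

```python
from typing import Callable, List, Tuple

Partial = Tuple[int, str]  # (relevance score, partial answer)


def keyword_map(query: str, report: str) -> Partial:
    """Stand-in for the LLM map step: score a report by shared words."""
    query_words = set(query.lower().split())
    score = sum(1 for word in report.lower().split() if word in query_words)
    return score, report


def reduce_partials(partials: List[Partial], top_k: int) -> str:
    """Stand-in for the LLM reduce step: keep the best-scoring partials."""
    relevant = [p for p in partials if p[0] > 0]
    ranked = sorted(relevant, key=lambda p: p[0], reverse=True)
    return " | ".join(text for _, text in ranked[:top_k])


def map_reduce_search(query: str, reports: List[str],
                      map_fn: Callable[[str, str], Partial] = keyword_map,
                      top_k: int = 2) -> str:
    # Map: every report is evaluated independently against the query.
    partials = [map_fn(query, report) for report in reports]
    # Reduce: partial answers are consolidated into one response.
    return reduce_partials(partials, top_k)


reports = [
    "supply chain risks in the northern region",
    "community sports events",
    "regional supply shortages and chain delays",
]
answer = map_reduce_search("supply chain risks", reports)
```

Because each map call is independent, the real engine can issue them concurrently, which is also why the cost of a global query grows with the number of community reports considered.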

Every method has both a blocking variant (global_search, local_search) and a streaming variant (global_search_streaming, local_search_streaming) that yield chunks via an AsyncGenerator (packages/graphrag/graphrag/api/query.py).
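
Consuming a streaming variant follows the usual async-generator pattern. The sketch below uses fake_search_streaming, a hypothetical stand-in that yields fixed chunks; the real *_streaming functions take a config and DataFrames, which are omitted here.

```python
import asyncio
from typing import AsyncGenerator, List


async def fake_search_streaming(query: str) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for a *_streaming call; yields response chunks."""
    for chunk in ("Answer ", "to: ", query):
        await asyncio.sleep(0)  # yield control, as a real network call would
        yield chunk


async def collect(stream: AsyncGenerator[str, None]) -> str:
    """Drain the stream; a UI could render each chunk instead of buffering."""
    parts: List[str] = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_response = asyncio.run(collect(fake_search_streaming("who is Alice?")))
```

The blocking variants are the better fit for batch jobs, while the streaming variants suit interactive front ends that want to show text as it arrives.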

flowchart LR
    A[User Query] --> B{Method}
    B -->|local| C[Local Search Engine]
    B -->|global| D[Global Search Engine]
    B -->|drift| E[DRIFT Search Engine]
    B -->|basic| F[Basic Search Engine]
    C --> G[Index Artifacts]
    D --> G
    E --> G
    F --> G
    G --> H[Response + Context]

API and CLI Usage

The API functions take a GraphRagConfig (loaded from settings.yaml) plus the relevant pandas DataFrames. _resolve_output_files in the CLI is responsible for loading the parquet outputs required by a given method (packages/graphrag/graphrag/cli/query.py). Records are normalized into typed objects via read_indexer_entities, read_indexer_relationships, read_indexer_text_units, read_indexer_reports, read_indexer_report_embeddings, read_indexer_communities, and read_indexer_covariates (packages/graphrag/graphrag/api/query.py). Entity and Relationship builders accept configurable column names, so custom indexers can be plugged in by remapping columns (packages/graphrag/graphrag/query/input/loaders/dfs.py).
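
The column-remapping idea can be sketched without pandas. EntitySketch and read_entities are hypothetical stand-ins for this example, with plain dictionaries in place of DataFrame rows; the real read_indexer_* helpers and Entity model have their own signatures and more fields.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class EntitySketch:
    """Hypothetical typed record; the real Entity model has more fields."""
    id: str
    title: str
    description: str


def read_entities(rows: List[Mapping[str, str]],
                  id_col: str = "id",
                  title_col: str = "title",
                  description_col: str = "description") -> List[EntitySketch]:
    """Build typed records, reading each field from a configurable column."""
    required = (id_col, title_col, description_col)
    entities: List[EntitySketch] = []
    for index, row in enumerate(rows):
        missing = [col for col in required if col not in row]
        if missing:
            # Fail at load time rather than deep inside a search call.
            raise KeyError(f"row {index} is missing columns: {missing}")
        entities.append(
            EntitySketch(row[id_col], row[title_col], row[description_col])
        )
    return entities


# Rows from a custom indexer that uses its own column names.
custom_rows = [{"entity_id": "e1", "name": "Alice", "summary": "A researcher"}]
entities = read_entities(custom_rows, id_col="entity_id",
                         title_col="name", description_col="summary")
```

Calling read_entities(custom_rows) with the default column names raises KeyError, which mirrors the load-time failure to expect when a custom index does not match the expected layout.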

On the CLI side, graphrag query --method <local|global|drift|basic> is the entry point. Each method has a dedicated runner (run_global_search, run_local_search, run_drift_search, run_basic_search) and the streaming path is triggered with --streaming. Response style is controlled by --response-type (e.g., multiple_paragraphs, single_paragraph, prioritized_list) (packages/graphrag/graphrag/cli/query.py).

Prompt Tuning and the Unified Search App

Before running queries, users typically tune their prompts via graphrag prompt-tune, which uses the same GraphRagConfig to load chunks and produce entity-extraction, entity-summarization, and community-summarization prompts (packages/graphrag/graphrag/api/prompt_tune.py). The CLI override pattern lets the tuning run inject chunk-size and overlap adjustments into the loaded config (packages/graphrag/graphrag/cli/prompt_tune.py).

The unified-search-app is a Streamlit reference consumer of the query engine. home_page.py wires search buttons to run_all_searches and run_generate_questions (unified-search-app/app/home_page.py). ui/search.py renders per-method responses, token usage, LLM call counts, and a Citations panel that lists the entities, relationships, reports, and source chunks the engine consumed (unified-search-app/app/ui/search.py). The expected parquet paths for the app live in app/data_config.py (e.g., output/communities, output/community_reports, output/entities, output/relationships, output/covariates, output/text_units) (unified-search-app/app/data_config.py).

Configuration and Known Constraints

Several practical constraints surface from the source and from community discussion:

Model providers. The API and CLI instantiate completion and embedding models through the GraphRagConfig, which natively targets OpenAI and Azure. Community requests for additional providers (Ollama, other SLMs, custom endpoints) are tracked but not yet supported in-tree (issue #657, issue #345).
Cheaper extraction. Triplex-style models have been proposed to lower extraction cost during indexing; this affects the indexer rather than the query engine, but the engine consumes the result (issue #632).
LazyGraphRAG. A deferred-evaluation variant has been requested; once shipped it would likely plug in alongside the existing factory methods (issue #1512).
Incremental indexing. Adding new documents to an existing index currently requires a re-run; the engine is unaffected, but the artifacts it loads would need to be regenerated or extended (issue #741).
Data layout. Because the engine loads from parquet, downstream tools must respect the column conventions expected by read_indexer_* helpers; mismatches will fail at load time (packages/graphrag/graphrag/query/input/loaders/dfs.py).
API stability. The module docstring explicitly warns that "backwards compatibility is not guaranteed at this time", so pin versions and avoid coupling external code to internal helper signatures (packages/graphrag/graphrag/api/query.py).

Configuration, LLM Integration, Storage & Extensibility

Related topics: GraphRAG Overview and Architecture, Indexing Pipeline, Data Flow & Incremental Updates, Query Engine and Search Methods

Overview

GraphRAG is a data pipeline and transformation suite designed to extract meaningful, structured data from unstructured text using LLMs. Source: packages/graphrag/README.md. Underneath the indexing and query APIs sit four foundational subsystems that determine how the project is configured, how it talks to language models, where it persists intermediate artifacts, and how third parties can plug in new behavior. These four subsystems — configuration, LLM integration, storage, and extensibility — are the primary surfaces users customize when adapting GraphRAG to their own data and infrastructure.

The configuration layer is built on Pydantic-style typed config models exposed under graphrag.config.models. Source: packages/graphrag/graphrag/config/models/__init__.py. It is loaded via load_config and accepts a settings.yaml file as the canonical user-facing configuration artifact.

Configuration System

The GraphRagConfig model is the central object that drives indexing, prompt tuning, and query workflows. Every public API accepts a GraphRagConfig instance and reads model, storage, chunking, and input settings from it. Source: packages/graphrag/graphrag/api/prompt_tune.py.

The prompt_tune API shows the typical usage pattern: the configuration is loaded, an LLM is instantiated via create_completion(default_llm_settings), and downstream operations are configured against the typed model. Source: packages/graphrag/graphrag/api/prompt_tune.py. The CLI mirror in graphrag.cli.prompt_tune calls load_config(root_dir=root) and allows runtime overrides such as chunk_size and chunking.overlap before invoking the prompt-tuning pipeline. Source: packages/graphrag/graphrag/cli/prompt_tune.py.

The unified search application uses a parallel data_config.py module that defines table names for downstream artifacts (output/communities, output/community_reports, output/entities, output/relationships, output/covariates, output/text_units). Source: unified-search-app/app/data_config.py. This reflects how a built index is consumed at query time and how output artifacts are addressed independently of storage backend.

LLM Integration

Native LLM support in GraphRAG is implemented through completion-model configuration objects and a create_completion factory. The prompt-tuning API explicitly retrieves the model via config.get_completion_model_config(PROMPT_TUNING_MODEL_ID) and instantiates the model with create_completion(...). Source: packages/graphrag/graphrag/api/prompt_tune.py.

A second completion model is retrieved for graph extraction when discover_entity_types is enabled: config.get_completion_model_config(config.extract_graph.completion_model_id). Source: packages/graphrag/graphrag/api/prompt_tune.py. This separation lets users run a cheaper model for prompt tuning while keeping a stronger model for entity/relationship extraction.

For query time, both global (global_search) and local (local_search) APIs in graphrag.api.query accept the same GraphRagConfig, ensuring a consistent model-selection surface across indexing and retrieval. Source: packages/graphrag/graphrag/api/query.py.

Storage Architecture

The graphrag-storage package provides a unified storage abstraction. By default the create_storage factory ships with four preregistered providers corresponding to a StorageType enum. Source: packages/graphrag-storage/README.md.

graph LR
    Config["GraphRagConfig"] --> Factory["create_storage / storage_factory"]
    Factory --> FS["FileStorage"]
    Factory --> ABS["AzureBlobStorage"]
    Factory --> ACS["AzureCosmosStorage"]
    Factory --> MS["MemoryStorage"]
    User["User-defined Storage subclass"] -.register.-> Factory

Registration is dynamic — FileStorage is only imported when requested — and users can bypass preregistration by importing storage_factory directly for a clean factory. Source: packages/graphrag-storage/README.md. The v3.1.0 release notes describe a native CosmosTableProvider with namespace partitioning, transactional batch writes, and a simplified AzureCosmosStorage, indicating that the storage layer is actively evolving toward richer table semantics. Source: packages/graphrag-storage/README.md.

Provider	Typical Use	Notes from source
`FileStorage`	Local development	Default; lazily imported
`AzureBlobStorage`	Cloud blob persistence	Pre-registered
`AzureCosmosStorage`	Cosmos DB-backed storage	Simplified in v3.1.0; adds table-provider semantics
`MemoryStorage`	Tests / ephemeral pipelines	Pre-registered
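
Lazy registration, where a provider's module is imported only when that provider is first requested, can be sketched with importlib. LazyRegistry is a hypothetical class written for this example, and standard-library classes stand in for storage providers; the actual graphrag-storage factory may be structured differently.

```python
import importlib
from typing import Any, Dict, Set, Tuple


class LazyRegistry:
    """Hypothetical registry that defers imports until create() is called."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, str]] = {}
        self.loaded: Set[str] = set()  # names this registry has resolved

    def register(self, name: str, module_path: str, attr: str) -> None:
        # Only the dotted path is stored; nothing is imported yet.
        self._entries[name] = (module_path, attr)

    def create(self, name: str, *args: Any, **kwargs: Any) -> Any:
        if name not in self._entries:
            raise KeyError(f"provider not registered: {name}")
        module_path, attr = self._entries[name]
        module = importlib.import_module(module_path)  # import on demand
        self.loaded.add(name)
        return getattr(module, attr)(*args, **kwargs)


registry = LazyRegistry()
# Standard-library classes stand in for storage providers.
registry.register("ordered", "collections", "OrderedDict")
registry.register("fraction", "fractions", "Fraction")

store = registry.create("ordered")  # only "ordered" has been resolved
```

The practical benefit is that optional dependencies, such as a cloud SDK, are needed only by users who actually select that backend.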

Extensibility

GraphRAG is designed for extension at every layer. Four concrete extension points are documented:

Storage: Subclass the base Storage class and register the new provider with the factory; it can then be instantiated via create_storage or storage_factory. Source: packages/graphrag-storage/README.md.
Chunking: A create_chunker factory accepts a ChunkingConfig and instantiates strategies such as SentenceChunker (NLTK-based) or TokenChunker (tokenizer-based with configurable size and overlap). Source: packages/graphrag-chunking/README.md.
Input: Input loaders are configured via a YAML block that combines input.type (for example, markitdown) with a file_pattern regex; PDF processing additionally requires the markitdown[pdf] extra. Source: packages/graphrag-input/README.md.
Knowledge loading (apps): The unified search app organizes custom loaders under app.knowledge_loader, with pluggable data_sources submodules. Source: unified-search-app/app/knowledge_loader/__init__.py and unified-search-app/app/knowledge_loader/data_sources/__init__.py.

Common Failure Modes and Community Notes

Several recurring community discussions intersect with the topics on this page. Users have repeatedly asked for non-OpenAI/Azure model providers such as Ollama (#657, #345) and for cheaper extractors like Triplex (#632); because native support is limited to OpenAI and Azure, these integrations typically rely on OpenAI-compatible endpoints wired through create_completion. Source: packages/graphrag/graphrag/api/prompt_tune.py.

Incremental indexing (#741) is itself an extensibility concern: users wishing to add content today must re-run the full pipeline because the storage and configuration layers do not yet expose a partial-update API. Source: packages/graphrag-storage/README.md.

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 6 structured pitfall items, none of them high or blocking. Top priority: capability evidence risk, which requires verification.

1. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/microsoft/graphrag

2. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/microsoft/graphrag

3. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/microsoft/graphrag

4. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/microsoft/graphrag

5. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/microsoft/graphrag

6. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/microsoft/graphrag

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (the number of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using graphrag with real data or production workflows.

K-core based hierarchical community construction as an alternative to Le - github / github_issue
bug: vector store db_uri whitespace check uses .strip without parenthese - github / github_issue
v3.1.0 - github / github_release
v3.0.9 - github / github_release
v3.0.8 - github / github_release
v3.0.7 - github / github_release
v3.0.6 - github / github_release
v3.0.5 - github / github_release
v3.0.4 - github / github_release
Release v3.0.2 - github / github_release
Release v3.0.1 - github / github_release
Release v3.0.0 - github / github_release

Source: Project Pack community evidence and pitfall evidence
