# https://github.com/HKUDS/MiniRAG Project Manual

Generated at: 2026-06-25 09:30:38 UTC

## Table of Contents

- [MiniRAG Overview & System Architecture](#page-1)
- [Indexing Pipeline & Knowledge Graph Construction](#page-2)
- [Query & Retrieval Workflow](#page-3)
- [LLM/Embedding Integrations, API Server & Deployment](#page-4)

<a id='page-1'></a>

## MiniRAG Overview & System Architecture

### Related Pages

Related topics: [Indexing Pipeline & Knowledge Graph Construction](#page-2), [Query & Retrieval Workflow](#page-3), [LLM/Embedding Integrations, API Server & Deployment](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md)
- [minirag/api/README.md](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/README.md)
- [minirag/api/minirag_server.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/minirag_server.py)
- [dataset/LiHua-World/README.md](https://github.com/HKUDS/MiniRAG/blob/main/dataset/LiHua-World/README.md)
- [main.py](https://github.com/HKUDS/MiniRAG/blob/main/main.py)
</details>

# MiniRAG Overview & System Architecture

## 1. Purpose and Scope

MiniRAG is an extremely simple Retrieval-Augmented Generation (RAG) framework designed to make Small Language Models (SLMs) viable for RAG tasks on resource-constrained, on-device scenarios. The project introduces two principal innovations:

1. **Semantic-aware heterogeneous graph indexing**, which unifies raw text chunks and named entities in a single graph structure, reducing dependence on deep semantic understanding during ingestion.
2. **Lightweight topology-enhanced retrieval**, which uses graph structure to discover relevant knowledge without invoking heavy chain-of-thought or multi-hop LLM reasoning.

The framework targets deployments where full-size LLM-based RAG stacks (e.g., GraphRAG, LightRAG with GPT-4-class models) are too costly or too slow. According to the abstract in [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md), MiniRAG achieves comparable accuracy to LLM-based methods while using only about 25% of the storage space, and it ships a dedicated benchmark dataset named **LiHua-World**.

The repository is positioned as a sibling of [LightRAG](https://github.com/HKUDS/LightRAG) and shares substantial lineage with it — the PyPI distribution is in fact `lightrag-hku`. MiniRAG also acknowledges [nano-graphrag](https://github.com/gusye1234/nano-graphrag) as a foundational inspiration ([README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md)).

## 2. System Architecture

MiniRAG follows a streamlined two-stage pipeline: **indexing** and **retrieval/QA**. Both stages are exposed as both a Python library (`minirag` package) and a FastAPI-based HTTP/Ollama-compatible server.

```mermaid
flowchart LR
    A[Raw Documents<br/>.txt / .md / .pdf / .docx / .pptx] --> B[Chunking<br/>operate.py]
    B --> C[Entity & Relation Extraction<br/>via LLM]
    C --> D[Heterogeneous Graph<br/>chunks + entities + edges]
    D --> E[(Vector Store<br/>+ KV / Graph Storage)]
    E --> F[User Query]
    F --> G[Topology-Enhanced Retrieval<br/>minirag.py]
    G --> H[LLM Generation<br/>llm.py]
    H --> I[Answer]
```

**Key design properties**

- *Heterogeneous graph*: nodes represent both text chunks and named entities, connected by typed edges. This lets retrieval fan out along entity links without requiring the LLM to reason over long contexts during ingestion.
- *Topology-enhanced retrieval*: query expansion leverages graph neighbours and chunk-to-entity paths, then ranks candidate chunks for the LLM prompt.
- *Storage abstraction*: the `storage` module (referenced in the module map of [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md)) supports more than ten heterogeneous backends, including Neo4j, PostgreSQL, and TiDB (announced in the 2025.02.14 news entry in [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md)).

The Python module layout shown in [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md) is:

| Path | Role |
|---|---|
| `minirag/__init__.py` | Package entry point |
| `minirag/base.py` | Base classes and shared types |
| `minirag/minirag.py` | Core `MiniRAG` class: `ainsert`, `aquery` |
| `minirag/operate.py` | Chunking, entity/relation extraction operations |
| `minirag/llm.py` | LLM binding and invocation |
| `minirag/prompt.py` | Prompt templates for extraction and answering |
| `minirag/storage.py` | Pluggable storage backends |
| `minirag/utils.py` | Helpers (chunking, token estimation, etc.) |
| `reproduce/Step_0_index.py` | End-to-end indexing reproduction script |
| `reproduce/Step_1_QA.py` | End-to-end QA reproduction script |
| `minirag/api/minirag_server.py` | FastAPI/Ollama-compatible HTTP server |
| `main.py` | Programmatic initialization example |

## 3. Installation and Quick Start

Two installation paths are documented in [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md):

```bash
# Source (recommended for development)
cd MiniRAG
pip install -e .

# PyPI (shared with LightRAG)
pip install lightrag-hku
```

A typical workflow is to drop a corpus into `./dataset/<name>/data/` and then run:

```bash
python ./reproduce/Step_0_index.py   # build the heterogeneous graph index
python ./reproduce/Step_1_QA.py      # run retrieval + answer generation
```

Programmatic usage goes through `main.py`, which constructs a `MiniRAG` instance and calls `ainsert(...)` followed by `aquery(...)`.

For serving, the optional `[api]` extra adds FastAPI servers, including an Ollama-emulating endpoint that lets existing Ollama clients route chat through RAG without code changes ([minirag/api/README.md](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/README.md), [minirag/api/minirag_server.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/minirag_server.py)).

## 4. Dataset, API Surface, and Known Pitfalls

**LiHua-World dataset.** [dataset/LiHua-World/README.md](https://github.com/HKUDS/MiniRAG/blob/main/dataset/LiHua-World/README.md) describes a one-year corpus of chat records for a virtual user. It supplies three question categories (single-hop, multi-hop, summary), each with gold answers and supporting documents. The archive `LiHuaWorld.zip` is shipped inside `./dataset/LiHua-World/data/`.

**API surface.** The HTTP server in [minirag/api/minirag_server.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/minirag_server.py) exposes:

| Endpoint | Purpose |
|---|---|
| `POST /query` | Run a RAG query, optionally streamed |
| `POST /documents/text` | Insert a raw text payload via `rag.ainsert` |
| `POST /documents/file` | Upload & immediately index a single file (txt/md/pdf/docx/pptx) |
| `POST /documents/batch` | Upload and index many files in one call |
| `POST /documents/scan` | Rescan an input directory for new files |
| `DELETE /documents` | Clear all indexed documents |
| `GET /health` | Health and configuration check |
| `POST /api/chat` | Ollama-compatible chat (mode inferred from query prefix) |

The `/api/chat` handler routes the user message through `parse_query_mode`, then calls `rag.aquery` with the inferred `mode` and the rest of the conversation as `conversation_history` ([minirag/api/minirag_server.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/minirag_server.py)).

**Known failure modes from the community**

- *Indexing slowdown with `gpt-4o-mini`* (issue #82): when indexing a few hundred text files, throughput degrades over time, eventually exceeding 20 minutes per file. This typically correlates with a runaway re-extraction loop in `ainsert`.
- *Re-processing of already-processed chunks* (issue #96): repeated `ainsert` calls iterate over chunks whose status is `processed`, causing redundant entity extraction and growing latency. The reproduction scripts in `./reproduce/` call `ainsert` multiple times, which can amplify this.
- *Phi-3 `DynamicCache` error* (issue #69): Microsoft's `modeling_phi3.py` exposes `get_max_cache_shape`, not `get_max_length`. Workaround: patch the cached model file in `~/.cache/huggingface/...` or switch models.
- *Python version mismatch* (issue #1): some transitive dependencies require a different Python version than `3.10.13`. Use a Python version compatible with `requirements.txt` (see [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md)).
- *Path-to-chunk ranking question* (issue #90): the `path2chunk` helper aggregates counts into a dictionary and then picks top chunks via `count_dict.most_common(max_chunks)`. If you change this code, preserve the aggregation step; do not read directly from `node_chunk_id`.

## See Also

- MiniRAG README: [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md)
- API server reference: [minirag/api/README.md](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/README.md)
- LiHua-World dataset: [dataset/LiHua-World/README.md](https://github.com/HKUDS/MiniRAG/blob/main/dataset/LiHua-World/README.md)
- Related project: [LightRAG](https://github.com/HKUDS/LightRAG)
- Paper: [arXiv:2501.06713](https://arxiv.org/abs/2501.06713)

---

<a id='page-2'></a>

## Indexing Pipeline & Knowledge Graph Construction

### Related Pages

Related topics: [MiniRAG Overview & System Architecture](#page-1), [Query & Retrieval Workflow](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md)
- [minirag/operate.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/operate.py)
- [minirag/prompt.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/prompt.py)
- [minirag/kg/__init__.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/kg/__init__.py)
- [minirag/kg/networkx_impl.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/kg/networkx_impl.py)
- [minirag/kg/neo4j_impl.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/kg/neo4j_impl.py)
- [minirag/kg/postgres_impl.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/kg/postgres_impl.py)
- [minirag/api/minirag_server.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/minirag_server.py)
- [minirag/api/README.md](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/README.md)
- [reproduce/Step_0_index.py](https://github.com/HKUDS/MiniRAG/blob/main/reproduce/Step_0_index.py)
</details>

# Indexing Pipeline & Knowledge Graph Construction

## Purpose and Scope

MiniRAG's indexing pipeline transforms raw text documents into a **heterogeneous knowledge graph** that combines text chunks and named entities in a unified structure. This pipeline is the foundation of the framework's ability to deliver strong RAG performance on small language models (SLMs), because it reduces the retrieval burden placed on the model's semantic understanding by pre-computing topological structure at index time.

The indexing stage is triggered whenever a caller invokes `MiniRAG.ainsert(...)` (or its server equivalent `/documents/text`, `/documents/file`, `/documents/batch`, `/documents/scan`). The output is persisted to a configurable working directory and to the chosen graph backend, where the query stage later performs lightweight topology-enhanced retrieval. Source: [README.md:9-13](), [minirag/api/minirag_server.py:130-168]().

The pipeline is designed around two guiding principles stated in the project abstract:

1. **Semantic-aware heterogeneous graph indexing** — text chunks and named entities are co-located in a single graph, so downstream retrieval can rely on graph topology rather than complex semantic reasoning. Source: [README.md:9-11]().
2. **Lightweight topology-enhanced retrieval** — because the graph is built during indexing, queries can navigate the structure efficiently even with SLMs. Source: [README.md:11-13]().

## Pipeline Stages

The end-to-end indexing flow consists of four cooperating stages: chunking, entity/relation extraction, graph upsert, and persistence. The diagram below summarizes how data moves between them.

```mermaid
flowchart LR
    A[Raw Document<br/>txt / md / pdf / docx / pptx] --> B[Chunking<br/>operate.py]
    B --> C[Entity & Relation Extraction<br/>prompt.py + LLM]
    C --> D[Heterogeneous Graph Upsert<br/>kg/*_impl.py]
    D --> E[(Working Dir KV Store<br/>+ Graph Backend)]
    E --> F[Topology-Enhanced Retrieval<br/>query stage]
```

### Stage 1 — Chunking

When `ainsert` receives content, the operating layer splits it into manageable chunks before any LLM call. Chunk status is tracked so that partially completed work can be resumed. Community issue [#96](https://github.com/HKUDS/MiniRAG/issues/96) reports a suspected bug in this logic where `inserting_chunks` is selected from all chunks whose status is `processed`, which causes previously processed chunks to be re-extracted when `ainsert` is invoked multiple times — making the pipeline progressively slower across runs. Source: [minirag/operate.py](), issue [#96](https://github.com/HKUDS/MiniRAG/issues/96).

### Stage 2 — Entity and Relation Extraction

Each chunk is sent through prompt templates that ask the LLM to extract named entities and their relationships. The extraction prompts live in [minirag/prompt.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/prompt.py) and are deliberately lightweight so they work with small models such as Phi-3.5-mini, GLM-Edge-1.5B-Chat, Qwen2.5-3B-Instruct, and MiniCPM3-4B. Source: [README.md:43-50](), [minirag/prompt.py]().

This stage is the most expensive part of the pipeline. Community issue [#82](https://github.com/HKUDS/MiniRAG/issues/82) reports that running `reproduce/Step_0_index.py` against 150 `.txt` files with `gpt-4o-mini` took a full day and degraded to >20 minutes per file as more documents were processed. The slowdown is amplified by the re-processing bug in issue [#96](https://github.com/HKUDS/MiniRAG/issues/96). Source: [reproduce/Step_0_index.py](), issue [#82](https://github.com/HKUDS/MiniRAG/issues/82).

### Stage 3 — Heterogeneous Graph Upsert

Extracted entities and relations are written to a heterogeneous graph backend. MiniRAG ships several implementations behind a common interface in [minirag/kg/__init__.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/kg/__init__.py):

| Backend | Implementation File | Typical Use |
|---|---|---|
| NetworkX (in-memory) | `minirag/kg/networkx_impl.py` | Local development, single-process runs |
| Neo4j | `minirag/kg/neo4j_impl.py` | Production deployments requiring a real graph DB |
| PostgreSQL / TiDB | `minirag/kg/postgres_impl.py` | SQL-based deployments and existing data stacks |

The graph stores both chunk nodes (with their original text) and entity nodes (with extracted mentions), connected by typed edges. This is what the paper calls a *semantic-aware heterogeneous graph*: retrieval can traverse either node type. Source: [README.md:9-11](), [minirag/kg/networkx_impl.py](), [minirag/kg/neo4j_impl.py](), [minirag/kg/postgres_impl.py]().

### Stage 4 — Persistence and Resumption

Indexed data is persisted to the working directory (`--working-dir`) so that subsequent runs can resume without re-vectorizing existing documents. The API server's startup hook scans `--input-dir` and only processes new files. Source: [minirag/api/minirag_server.py:53-79](), [minirag/api/README.md:50-58]().

> ⚠️ Community note: Until the `#96` re-processing bug is resolved, users who call `ainsert` more than once on the same dataset will repeatedly re-extract entities from already-processed chunks. Workaround: restart the process with a clean `--working-dir` between batch inserts, or call `ainsert` exactly once with the full corpus. Source: issue [#96](https://github.com/HKUDS/MiniRAG/issues/96).

## `path2chunk` and Chunk Path Assignment

After the graph is built, MiniRAG assigns each chunk a *path* of related chunks via the `path2chunk` function in `operate.py`. This function walks the heterogeneous graph and accumulates how often each chunk co-occurs with a given chunk's entities, producing a `count_dict`. The final selection currently uses `count_dict.most_common(max_chunks)` rather than the accumulated `node_chunk_id` list — community issue [#90](https://github.com/HKUDS/MiniRAG/issues/90) questions this choice because the two variables are not equivalent. Understanding this step is important because the chunk path is what the lightweight retrieval stage later traverses. Source: [minirag/operate.py](), issue [#90](https://github.com/HKUDS/MiniRAG/issues/90).

## Storage Footprint

A key claimed benefit of the indexing design is storage efficiency. By sharing nodes and edges across chunks rather than maintaining large per-chunk embedding stores, MiniRAG reportedly requires only about **25% of the storage space** of comparable LLM-based RAG systems while delivering comparable retrieval accuracy on the LiHua-World benchmark. Source: [README.md:13-15]().

## Common Failure Modes

| Symptom | Likely Cause | Mitigation |
|---|---|---|
| Indexing slows down across runs | `ainsert` re-processes `processed` chunks (issue #96) | Call `ainsert` once on the full corpus, or wipe `--working-dir` between runs |
| Per-file latency grows to >20 minutes | Cumulative re-extraction + remote LLM latency (issue #82) | Use a local model, batch chunks, or pre-filter inputs |
| `DynamicCache has no attribute get_max_length` | Microsoft Phi-3 modeling file incompatibility (issue #69) | Swap model, or patch cached `modeling_phi3.py`: rename `get_max_length` to `get_max_cache_shape` |
| Python dependency install failure | No pinned Python version; some deps need a different interpreter (issue #1) | Match the Python version expected by each pinned dependency in `requirements.txt` |
| Empty graph after indexing | Unsupported file extension or PDF/DOCX/PPTX parser not installed | Install `pypdf`, `docx`, `python-pptx`; check `supported_extensions` in `DocumentManager` |

Sources: issue [#96](https://github.com/HKUDS/MiniRAG/issues/96), issue [#82](https://github.com/HKUDS/MiniRAG/issues/82), issue [#69](https://github.com/HKUDS/MiniRAG/issues/69), issue [#1](https://github.com/HKUDS/MiniRAG/issues/1), [minirag/api/minirag_server.py:280-340]().

## See Also

- [`MiniRAG` Class & Public API](MiniRAG-Class-and-Public-API.md)
- [Query & Topology-Enhanced Retrieval](Query-and-Topology-Enhanced-Retrieval.md)
- [Supported Storage Backends](Supported-Storage-Backends.md)
- [Reproducing the LiHua-World Benchmark](Reproducing-the-LiHua-World-Benchmark.md)

---

<a id='page-3'></a>

## Query & Retrieval Workflow

### Related Pages

Related topics: [MiniRAG Overview & System Architecture](#page-1), [Indexing Pipeline & Knowledge Graph Construction](#page-2), [LLM/Embedding Integrations, API Server & Deployment](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [minirag/minirag.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/minirag.py)
- [minirag/operate.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/operate.py)
- [minirag/prompt.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/prompt.py)
- [minirag/base.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/base.py)
- [minirag/api/minirag_server.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/minirag_server.py)
- [reproduce/Step_0_index.py](https://github.com/HKUDS/MiniRAG/blob/main/reproduce/Step_0_index.py)
- [reproduce/Step_1_QA.py](https://github.com/HKUDS/MiniRAG/blob/main/reproduce/Step_1_QA.py)
- [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md)
</details>

# Query & Retrieval Workflow

## 1. Overview and Purpose

The **Query & Retrieval Workflow** is the runtime counterpart to MiniRAG's indexing pipeline. Once documents have been chunked, entity-extracted, and embedded into a heterogeneous graph (see the *Indexing Workflow* page), the retrieval workflow is responsible for taking a natural-language question and producing an answer that is grounded in that graph.

MiniRAG exposes this workflow through two complementary surfaces:

- A **Python API** — the `MiniRAG` class defined in [minirag/minirag.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/minirag.py) exposes `query(...)` and `aquery(...)` methods that are called directly by user code or by the bundled reproduction scripts.
- An **HTTP API** — a FastAPI server built around the same `MiniRAG` instance, see [minirag/api/minirag_server.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/minirag_server.py). The server exposes `/query` for RAG queries and also Ollama-compatible `/api/chat` and `/api/tags` endpoints that route through the same retrieval core.

The design goal, as stated in [README.md:1-30](https://github.com/HKUDS/MiniRAG/blob/main/README.md), is to let *small* language models (SLMs) perform RAG effectively by leaning on the graph structure rather than on the model's own semantic reasoning. Source: [README.md:13-23]().

## 2. Entry Points and Parameters

### 2.1 Python entry point

The user-facing entry point is the async `aquery()` method, which is also wrapped by the synchronous `query()` convenience method. Both accept a `QueryParam` object defined in [minirag/base.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/base.py). The most important fields are:

| Field | Type | Purpose |
|-------|------|---------|
| `mode` | `SearchMode` enum | Selects the retrieval strategy (`mini`, `naive`, `light`) |
| `stream` | `bool` | Toggles streaming vs. one-shot response |
| `only_need_context` | `bool` | Returns only the retrieved context, no answer generation |
| `top_k` | `int` | Number of similar entities/chunks to retrieve |
| `conversation_history` | `list` | Optional multi-turn chat history |
| `history_turns` | `int` | Number of past turns to include |

Source: [minirag/api/minirag_server.py:1-60]() (the server constructs a `QueryParam` from the HTTP body with exactly these fields) and [minirag/base.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/base.py) (`QueryParam` and `SearchMode` definitions).

### 2.2 HTTP entry point

The `/query` endpoint in `minirag_server.py` delegates to `rag.aquery()` and packages the result into a `QueryResponse`. Source: [minirag/api/minirag_server.py:Query handler](). The Ollama-compatible `/api/chat` endpoint additionally supports a *prefix-based* mode selector parsed by `parse_query_mode()`:

- `/light ...` → `SearchMode.light`
- `/naive ...` → `SearchMode.naive`
- `/mini ...` → `SearchMode.mini`

Source: [minirag/api/minirag_server.py:parse_query_mode definition]()(prefix map and stripping logic).

### 2.3 Command-line reproduction

For local benchmarking, [reproduce/Step_1_QA.py](https://github.com/HKUDS/MiniRAG/blob/main/reproduce/Step_1_QA.py) loads the previously built index (created by [reproduce/Step_0_index.py](https://github.com/HKUDS/MiniRAG/blob/main/reproduce/Step_0_index.py)) and iterates over the LiHua-World QA set, calling the `MiniRAG` query interface for each question. Source: [README.md:Quick Start section]().

## 3. Retrieval Pipeline

Once a query reaches `MiniRAG.aquery()`, the following pipeline executes:

```mermaid
flowchart TD
    A[User query] --> B[Parse QueryParam<br/>mode, top_k, history]
    B --> C{Mode}
    C -->|mini| D[Topology-enhanced<br/>graph retrieval]
    C -->|naive| E[Vector-only<br/>chunk retrieval]
    C -->|light| F[LightRAG<br/>hybrid retrieval]
    D --> G[Build context from<br/>entities + chunks]
    E --> G
    F --> G
    G --> H[Compose prompt<br/>using prompt.py]
    H --> I[LLM call<br/>streaming or one-shot]
    I --> J[Return answer / context]
```

1. **Mode dispatch.** The `mode` field selects one of three retrieval strategies implemented in [minirag/operate.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/operate.py). The flagship `mini` mode is the *topology-enhanced* retrieval that walks the heterogeneous graph (entities ↔ chunks) rather than relying purely on vector similarity. Source: [README.md:MiniRAG Framework section]().

2. **Graph/entity retrieval.** For `mini` mode the system looks up the query embedding, retrieves the top-`k` most similar entity nodes (default `top_k=50`, see the `--top-k` argument in `minirag_server.py`), then expands to chunks through the graph edges recorded during indexing. Source: [minirag/api/minirag_server.py:parse_args --top-k]() and [minirag/operate.py:mini-mode functions]().

3. **Context assembly.** Retrieved entities, relations, and chunk texts are concatenated into a context block. When `only_need_context=True`, the workflow short-circuits here and returns the assembled context without invoking the LLM. Source: [minirag/api/minirag_server.py:QueryRequest.only_need_context field]().

4. **Prompt composition.** The context and the user's question are combined using the prompt templates in [minirag/prompt.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/prompt.py). These templates are deliberately short so that SLMs can follow them reliably. Source: [README.md:Abstract — "lightweight topology-enhanced retrieval"]().

5. **Answer generation.** The composed prompt is sent to the configured LLM binding (`ollama`, `openai`, `azure_openai`, or `lollms`) selected at server start-up. The server can stream the response chunk-by-chunk (NDJSON for Ollama) or return a single `QueryResponse`. Source: [minirag/api/minirag_server.py:streaming vs non-streaming branches]().

## 4. Known Issues and Practical Guidance

Several community-reported issues are directly related to the query and retrieval workflow and should be considered when operating MiniRAG:

- **Slow indexing that appears to leak into retrieval.** Issue #82 reports that `reproduce/Step_0_index.py` runs very slowly and slows down further over time. Because retrieval depends on a fully built index, a slow or growing index will directly delay the first query and make every subsequent incremental insert slower. Recommendation: complete indexing in one pass, then call `aquery()` separately. Source: [README.md:reproduce workflow]() and the community thread for issue #82.

- **Repeated entity extraction on `ainsert()`.** Issue #96 shows that calling `ainsert()` multiple times can re-process chunks that are already in `processed` state, extracting entities again. This makes the knowledge base grow on every call and degrades retrieval quality over time. The workaround is to insert each document exactly once and then issue queries without re-inserting. Source: [reproduce/Step_0_index.py:ainsert usage pattern]().

- **`most_common` selection in `path2chunk`.** Issue #90 questions why `count_dict.most_common(max_chunks)` is used to populate `v['Path']` instead of `node_chunk_id.most_common(max_chunks)`. This is relevant during retrieval because the selected paths determine which chunks appear in the assembled context and therefore which evidence the LLM sees. Source: [minirag/operate.py:path2chunk function]().

- **Phi-3 / `DynamicCache` incompatibility.** Issue #69 reports that recent Microsoft Phi-3 model files removed `get_max_length` from `DynamicCache`, breaking LLM calls during answer generation. Two workarounds are documented: switch LLM binding, or patch the cached `modeling_phi3.py` to use `get_max_cache_shape`. Source: [minirag/llm.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/llm.py) (LLM binding configuration).

- **Python version drift.** Issue #1 notes that the project does not pin a Python version, and dependency conflicts appear with Python 3.10.13. The retrieval workflow itself is not affected, but a mismatched interpreter will prevent the server from starting. Source: [setup.py](https://github.com/HKUDS/MiniRAG/blob/main/setup.py) and [requirements.txt](https://github.com/HKUDS/MiniRAG/blob/main/requirements.txt).

## See Also

- *Indexing Workflow* — how the heterogeneous graph used by `mini` mode is built.
- *MiniRAG API Server* — full reference for `/query`, `/documents/*`, and Ollama-compatible endpoints.
- *LiHua-World Dataset* — the benchmark dataset used by the reproduction scripts.
- [LightRAG](https://github.com/HKUDS/LightRAG) — the upstream project that MiniRAG extends.

---

<a id='page-4'></a>

## LLM/Embedding Integrations, API Server & Deployment

### Related Pages

Related topics: [MiniRAG Overview & System Architecture](#page-1), [Query & Retrieval Workflow](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [minirag/api/minirag_server.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/minirag_server.py)
- [minirag/api/README.md](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/README.md)
- [minirag/__init__.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/__init__.py)
- [minirag/llm.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/llm.py)
- [minirag/api/config.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/config.py)
- [minirag/api/ollama_api.py](https://github.com/HKUDS/MiniRAG/blob/main/minirag/api/ollama_api.py)
- [README.md](https://github.com/HKUDS/MiniRAG/blob/main/README.md)
- [setup.py](https://github.com/HKUDS/MiniRAG/blob/main/setup.py)
</details>

# LLM/Embedding Integrations, API Server & Deployment

## Purpose and Scope

MiniRAG exposes a flexible integration layer that decouples the RAG pipeline from any specific model provider, allowing the same `MiniRAG` core to be paired with interchangeable LLM and embedding backends. On top of this layer, the project ships an optional FastAPI-based server (`minirag-server`) that turns MiniRAG into a drop-in RAG backend compatible with multiple client ecosystems, including an Ollama-compatible API.

The deployment story covers three concerns: (1) choosing and configuring an LLM/embedding binding via CLI or environment variables, (2) running the API server with the `minirag-server` entry point, and (3) exposing document insertion, querying, scanning, and deletion through HTTP. Community issue [#1](https://github.com/HKUDS/MiniRAG/issues/1) highlights the importance of pinning a compatible Python version (some transitive dependencies require versions other than 3.10.13), and the latest release **v0.0.2** added the PyPI distribution plus the official API server described below (`Source: [README.md]()`).

## LLM and Embedding Binding Architecture

MiniRAG supports four LLM/embedding bindings: `lollms`, `ollama`, `openai`, and `azure_openai`. The binding is configured independently for the LLM and the embedding model, so you can mix providers (for example, Ollama for embeddings and OpenAI for the LLM) (`Source: [minirag/api/README.md]()`). Bindings are selected through the CLI flags `--llm-binding` and `--embedding-binding`, with their corresponding `--llm-binding-host`, `--embedding-binding-host`, `--llm-binding-api-key`, and `--embedding-binding-api-key` arguments. Defaults are read from environment variables (`LLM_BINDING`, `LLM_BINDING_HOST`, `LLM_MODEL`, `EMBEDDING_BINDING`, etc.) and fall back to sensible values such as `mistral-nemo:latest` for the LLM and `bge-m3:latest` for embeddings (`Source: [minirag/api/minirag_server.py]()`).

The binding selects one of the completion functions defined in `minirag_server.py`:

- `openai_alike_model_complete` wraps `openai_complete_if_cache` and targets any OpenAI-compatible HTTP endpoint (`Source: [minirag/api/minirag_server.py]()`).
- `azure_openai_model_complete` wraps `azure_openai_complete_if_cache` and reads `AZURE_OPENAI_API_KEY` plus `AZURE_OPENAI_API_VERSION` (`Source: [minirag/api/minirag_server.py]()`).
- The Ollama/LoLLMs path passes `host`, `timeout`, `num_ctx`, and `api_key` via `llm_model_kwargs` into the standard LightRAG-style completion pipeline (`Source: [minirag/api/minirag_server.py]()`).

Embedding functions are wired through a single `EmbeddingFunc` instance whose callable is selected at runtime by an `if/else` ladder over `args.embedding_binding`, dispatching to `lollms_embed`, `ollama_embed`, or `azure_openai_embed`. `embedding_dim` and `max_token_size` are forwarded from CLI arguments (`Source: [minirag/api/minirag_server.py]()`).

The Hugging Face / Microsoft Phi-3 integration is not first-class in the API server; community issue [#69](https://github.com/HKUDS/MiniRAG/issues/69) reports a `'DynamicCache' object has no attribute 'get_max_length'` error caused by a recent change in Microsoft's `modeling_phi3.py`. The recommended workarounds are to switch to a different model or to patch the cached `modeling_phi3.py` (replacing `get_max_length` with `get_max_cache_shape`) under the Hugging Face cache directory.

## API Server Configuration

The server is built around a single `MiniRAG` instance constructed from CLI arguments and injected with shared KV, document-status, graph, and vector storage classes (`Source: [minirag/api/minirag_server.py]()`). The CLI surface includes chunking controls (`--chunk_size` default 1200, `--chunk_overlap_size` default 100), concurrency (`--max-async` default 4), context limits (`--max-tokens` default 32768, `--max-embed-tokens` default 8192), timeout, log level, an optional API key, and SSL settings (`Source: [minirag/api/minirag_server.py]()`). Retrieval-side tuning is exposed via `--top-k` (default 50) and `--cosine-threshold` (default 0.4), which together control how many entities/relations are returned in local/global modes (`Source: [minirag/api/README.md]()`).

Key CLI flags summarized:

| Flag | Default | Purpose |
|------|---------|---------|
| `--llm-binding` | `ollama` | LLM backend: `lollms`, `ollama`, `openai`, `azure_openai` |
| `--embedding-binding` | `ollama` | Embedding backend (independently selectable) |
| `--chunk-size` / `--chunk-overlap-size` | 1200 / 100 | Text chunking window |
| `--max-async` | 4 | Concurrent LLM/embedding calls |
| `--top-k` | 50 | Retrieval count per query |
| `--cosine-threshold` | 0.4 | Cosine similarity cutoff for retrieval |
| `--timeout` | `None` | Per-call timeout; `None` means infinite |

`Source: [minirag/api/README.md]()`

## API Endpoints and Deployment Modes

The FastAPI app exposes three groups of endpoints (`Source: [minirag/api/README.md]()`):

1. **Document Management** — `POST /documents/text` inserts raw text, `POST /documents/file` uploads a single file, `POST /documents/batch` uploads many files, `POST /documents/scan` triggers a directory scan of `--input-dir`, and `DELETE /documents` clears the store. File uploads accept `.txt`, `.md`, `.pdf`, `.docx`, and `.pptx`; PDF/DOCX/PPTX handlers are lazily installed via `pm.install(...)` on first use (`Source: [minirag/api/minirag_server.py]()`).
2. **Query** — `POST /query` accepts a `QueryRequest` (query text, mode, stream flag, `only_need_context`) and dispatches to `rag.aquery` with a `QueryParam` carrying `top_k`. Streaming is handled by accumulating chunks from the async generator before returning a `QueryResponse` (`Source: [minirag/api/minirag_server.py]()`).
3. **Ollama Emulation** — `GET /api/version`, `GET /api/tags`, and `POST /api/chat` make the server usable as a drop-in Ollama backend for tools that already speak the Ollama protocol. Chat requests are streamed back to the client (`Source: [minirag/api/README.md]()`).

A utility `GET /health` endpoint reports the server's configuration and liveness (`Source: [minirag/api/README.md]()`).

Installation is offered in two ways matching LightRAG: from PyPI with `pip install "lightrag-hku[api]"`, or from source with `pip install -e ".[api]"` (`Source: [minirag/api/README.md]()`). Once installed, `minirag-server` is the console entry point. You can also run the variants directly under Uvicorn: `uvicorn lollms_minirag_server:app --reload --port 9721`, `ollama_minirag_server:app`, `openai_minirag_server:app`, or the Azure OpenAI variant (`Source: [minirag/api/README.md]()`).

## Common Failure Modes and Tuning Tips

A few recurring deployment problems surface in the community. Issue [#82](https://github.com/HKUDS/MiniRAG/issues/82) reports indexing slowing to over 20 minutes per file when using `gpt-4o-mini`; the practical levers are `--max-async` (raise it for parallel calls), `--timeout` (set a value so a stalled call doesn't block indefinitely), `--chunk-size` (smaller chunks mean fewer LLM calls per file), and batching uploads through `POST /documents/batch` so concurrency is amortized across files (`Source: [minirag/api/minirag_server.py]()`).

Issue [#1](https://github.com/HKUDS/MiniRAG/issues/1) notes dependency-version drift; the `setup.py` should be consulted for the supported Python range before installation (`Source: [setup.py]()`). Issue [#96](https://github.com/HKUDS/MiniRAG/issues/96) describes repeated entity extraction when `ainsert` is called multiple times; the API server mitigates this by routing uploads through the single `rag.ainsert` entry point and tracking chunk state in `DOC_STATUS_STORAGE`, so prefer `POST /documents/scan` over ad-hoc re-ingestion when re-running on the same `--input-dir` (`Source: [minirag/api/minirag_server.py]()`).

When in doubt, use `minirag-server --help` to enumerate every supported flag and binding combination, and confirm that the chosen LLM and embedding models are pre-pulled in the target Ollama or LoLLMs instance before serving traffic (`Source: [minirag/api/README.md]()`).

## See Also

- MiniRAG Framework Overview
- Heterogeneous Graph Indexing
- LiHua-World Benchmark Dataset
- Retrieval Modes (local, global, hybrid)

---

<!-- evidence_pipeline_checked: true -->
<!-- evidence_injected: true -->

---

## Pitfall Log

Project: HKUDS/MiniRAG

Summary: Found 13 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Security or permission risk - Security or permission risk requires verification.

## 1. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/104

## 2. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/108

## 3. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/97

## 4. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/HKUDS/MiniRAG

## 5. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/95

## 6. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/102

## 7. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/HKUDS/MiniRAG

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/HKUDS/MiniRAG

## 9. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/HKUDS/MiniRAG

## 10. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/109

## 11. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/98

## 12. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown。
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/HKUDS/MiniRAG

## 13. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown。
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/HKUDS/MiniRAG

<!-- canonical_name: HKUDS/MiniRAG; human_manual_source: deepwiki_human_wiki -->