Doramagic Project Pack · Human Manual

MiniRAG

[ACL2026] "MiniRAG: Making RAG Simpler with Small and Open-Sourced Language Models"

MiniRAG Overview & System Architecture

Related topics: Indexing Pipeline & Knowledge Graph Construction, Query & Retrieval Workflow, LLM/Embedding Integrations, API Server & Deployment

Section Related Pages

Continue reading this section for the full explanation and source context.

Related topics: Indexing Pipeline & Knowledge Graph Construction, Query & Retrieval Workflow, LLM/Embedding Integrations, API Server & Deployment

MiniRAG Overview & System Architecture

1. Purpose and Scope

MiniRAG is an extremely simple Retrieval-Augmented Generation (RAG) framework designed to make Small Language Models (SLMs) viable for RAG tasks on resource-constrained, on-device scenarios. The project introduces two principal innovations:

  1. Semantic-aware heterogeneous graph indexing, which unifies raw text chunks and named entities in a single graph structure, reducing dependence on deep semantic understanding during ingestion.
  2. Lightweight topology-enhanced retrieval, which uses graph structure to discover relevant knowledge without invoking heavy chain-of-thought or multi-hop LLM reasoning.

The framework targets deployments where full-size LLM-based RAG stacks (e.g., GraphRAG, LightRAG with GPT-4-class models) are too costly or too slow. According to the abstract in README.md, MiniRAG achieves comparable accuracy to LLM-based methods while using only about 25% of the storage space, and it ships a dedicated benchmark dataset named LiHua-World.

The repository is positioned as a sibling of LightRAG and shares substantial lineage with it — the PyPI distribution is in fact lightrag-hku. MiniRAG also acknowledges nano-graphrag as a foundational inspiration (README.md).

2. System Architecture

MiniRAG follows a streamlined two-stage pipeline: indexing and retrieval/QA. Both stages are exposed as both a Python library (minirag package) and a FastAPI-based HTTP/Ollama-compatible server.

flowchart LR
    A[Raw Documents<br/>.txt / .md / .pdf / .docx / .pptx] --> B[Chunking<br/>operate.py]
    B --> C[Entity & Relation Extraction<br/>via LLM]
    C --> D[Heterogeneous Graph<br/>chunks + entities + edges]
    D --> E[(Vector Store<br/>+ KV / Graph Storage)]
    E --> F[User Query]
    F --> G[Topology-Enhanced Retrieval<br/>minirag.py]
    G --> H[LLM Generation<br/>llm.py]
    H --> I[Answer]

Key design properties

  • *Heterogeneous graph*: nodes represent both text chunks and named entities, connected by typed edges. This lets retrieval fan out along entity links without requiring the LLM to reason over long contexts during ingestion.
  • *Topology-enhanced retrieval*: query expansion leverages graph neighbours and chunk-to-entity paths, then ranks candidate chunks for the LLM prompt.
  • *Storage abstraction*: the storage module (referenced in the module map of README.md) supports more than ten heterogeneous backends, including Neo4j, PostgreSQL, and TiDB (announced in the 2025.02.14 news entry in README.md).

The Python module layout shown in README.md is:

PathRole
minirag/__init__.pyPackage entry point
minirag/base.pyBase classes and shared types
minirag/minirag.pyCore MiniRAG class: ainsert, aquery
minirag/operate.pyChunking, entity/relation extraction operations
minirag/llm.pyLLM binding and invocation
minirag/prompt.pyPrompt templates for extraction and answering
minirag/storage.pyPluggable storage backends
minirag/utils.pyHelpers (chunking, token estimation, etc.)
reproduce/Step_0_index.pyEnd-to-end indexing reproduction script
reproduce/Step_1_QA.pyEnd-to-end QA reproduction script
minirag/api/minirag_server.pyFastAPI/Ollama-compatible HTTP server
main.pyProgrammatic initialization example

3. Installation and Quick Start

Two installation paths are documented in README.md:

# Source (recommended for development)
cd MiniRAG
pip install -e .

# PyPI (shared with LightRAG)
pip install lightrag-hku

A typical workflow is to drop a corpus into ./dataset/<name>/data/ and then run:

python ./reproduce/Step_0_index.py   # build the heterogeneous graph index
python ./reproduce/Step_1_QA.py      # run retrieval + answer generation

Programmatic usage goes through main.py, which constructs a MiniRAG instance and calls ainsert(...) followed by aquery(...).

For serving, the optional [api] extra adds FastAPI servers, including an Ollama-emulating endpoint that lets existing Ollama clients route chat through RAG without code changes (minirag/api/README.md, minirag/api/minirag_server.py).

4. Dataset, API Surface, and Known Pitfalls

LiHua-World dataset. dataset/LiHua-World/README.md describes a one-year corpus of chat records for a virtual user. It supplies three question categories (single-hop, multi-hop, summary), each with gold answers and supporting documents. The archive LiHuaWorld.zip is shipped inside ./dataset/LiHua-World/data/.

API surface. The HTTP server in minirag/api/minirag_server.py exposes:

EndpointPurpose
POST /queryRun a RAG query, optionally streamed
POST /documents/textInsert a raw text payload via rag.ainsert
POST /documents/fileUpload & immediately index a single file (txt/md/pdf/docx/pptx)
POST /documents/batchUpload and index many files in one call
POST /documents/scanRescan an input directory for new files
DELETE /documentsClear all indexed documents
GET /healthHealth and configuration check
POST /api/chatOllama-compatible chat (mode inferred from query prefix)

The /api/chat handler routes the user message through parse_query_mode, then calls rag.aquery with the inferred mode and the rest of the conversation as conversation_history (minirag/api/minirag_server.py).

Known failure modes from the community

  • *Indexing slowdown with gpt-4o-mini* (issue #82): when indexing a few hundred text files, throughput degrades over time, eventually exceeding 20 minutes per file. This typically correlates with a runaway re-extraction loop in ainsert.
  • *Re-processing of already-processed chunks* (issue #96): repeated ainsert calls iterate over chunks whose status is processed, causing redundant entity extraction and growing latency. The reproduction scripts in ./reproduce/ call ainsert multiple times, which can amplify this.
  • *Phi-3 DynamicCache error* (issue #69): Microsoft's modeling_phi3.py exposes get_max_cache_shape, not get_max_length. Workaround: patch the cached model file in ~/.cache/huggingface/... or switch models.
  • *Python version mismatch* (issue #1): some transitive dependencies require a different Python version than 3.10.13. Use a Python version compatible with requirements.txt (see README.md).
  • *Path-to-chunk ranking question* (issue #90): the path2chunk helper aggregates counts into a dictionary and then picks top chunks via count_dict.most_common(max_chunks). If you change this code, preserve the aggregation step; do not read directly from node_chunk_id.

See Also

Source: https://github.com/HKUDS/MiniRAG / Human Manual

Indexing Pipeline & Knowledge Graph Construction

Related topics: MiniRAG Overview & System Architecture, Query & Retrieval Workflow

Section Related Pages

Continue reading this section for the full explanation and source context.

Section Stage 1 — Chunking

Continue reading this section for the full explanation and source context.

Section Stage 2 — Entity and Relation Extraction

Continue reading this section for the full explanation and source context.

Section Stage 3 — Heterogeneous Graph Upsert

Continue reading this section for the full explanation and source context.

Related topics: MiniRAG Overview & System Architecture, Query & Retrieval Workflow

Indexing Pipeline & Knowledge Graph Construction

Purpose and Scope

MiniRAG's indexing pipeline transforms raw text documents into a heterogeneous knowledge graph that combines text chunks and named entities in a unified structure. This pipeline is the foundation of the framework's ability to deliver strong RAG performance on small language models (SLMs), because it reduces the retrieval burden placed on the model's semantic understanding by pre-computing topological structure at index time.

The indexing stage is triggered whenever a caller invokes MiniRAG.ainsert(...) (or its server equivalent /documents/text, /documents/file, /documents/batch, /documents/scan). The output is persisted to a configurable working directory and to the chosen graph backend, where the query stage later performs lightweight topology-enhanced retrieval. Source: README.md:9-13, minirag/api/minirag_server.py:130-168.

The pipeline is designed around two guiding principles stated in the project abstract:

  1. Semantic-aware heterogeneous graph indexing — text chunks and named entities are co-located in a single graph, so downstream retrieval can rely on graph topology rather than complex semantic reasoning. Source: README.md:9-11.
  2. Lightweight topology-enhanced retrieval — because the graph is built during indexing, queries can navigate the structure efficiently even with SLMs. Source: README.md:11-13.

Pipeline Stages

The end-to-end indexing flow consists of four cooperating stages: chunking, entity/relation extraction, graph upsert, and persistence. The diagram below summarizes how data moves between them.

flowchart LR
    A[Raw Document<br/>txt / md / pdf / docx / pptx] --> B[Chunking<br/>operate.py]
    B --> C[Entity & Relation Extraction<br/>prompt.py + LLM]
    C --> D[Heterogeneous Graph Upsert<br/>kg/*_impl.py]
    D --> E[(Working Dir KV Store<br/>+ Graph Backend)]
    E --> F[Topology-Enhanced Retrieval<br/>query stage]

Stage 1 — Chunking

When ainsert receives content, the operating layer splits it into manageable chunks before any LLM call. Chunk status is tracked so that partially completed work can be resumed. Community issue #96 reports a suspected bug in this logic where inserting_chunks is selected from all chunks whose status is processed, which causes previously processed chunks to be re-extracted when ainsert is invoked multiple times — making the pipeline progressively slower across runs. Source: minirag/operate.py, issue #96.

Stage 2 — Entity and Relation Extraction

Each chunk is sent through prompt templates that ask the LLM to extract named entities and their relationships. The extraction prompts live in minirag/prompt.py and are deliberately lightweight so they work with small models such as Phi-3.5-mini, GLM-Edge-1.5B-Chat, Qwen2.5-3B-Instruct, and MiniCPM3-4B. Source: README.md:43-50, minirag/prompt.py.

This stage is the most expensive part of the pipeline. Community issue #82 reports that running reproduce/Step_0_index.py against 150 .txt files with gpt-4o-mini took a full day and degraded to >20 minutes per file as more documents were processed. The slowdown is amplified by the re-processing bug in issue #96. Source: reproduce/Step_0_index.py, issue #82.

Stage 3 — Heterogeneous Graph Upsert

Extracted entities and relations are written to a heterogeneous graph backend. MiniRAG ships several implementations behind a common interface in minirag/kg/__init__.py:

BackendImplementation FileTypical Use
NetworkX (in-memory)minirag/kg/networkx_impl.pyLocal development, single-process runs
Neo4jminirag/kg/neo4j_impl.pyProduction deployments requiring a real graph DB
PostgreSQL / TiDBminirag/kg/postgres_impl.pySQL-based deployments and existing data stacks

The graph stores both chunk nodes (with their original text) and entity nodes (with extracted mentions), connected by typed edges. This is what the paper calls a *semantic-aware heterogeneous graph*: retrieval can traverse either node type. Source: README.md:9-11, minirag/kg/networkx_impl.py, minirag/kg/neo4j_impl.py, minirag/kg/postgres_impl.py.

Stage 4 — Persistence and Resumption

Indexed data is persisted to the working directory (--working-dir) so that subsequent runs can resume without re-vectorizing existing documents. The API server's startup hook scans --input-dir and only processes new files. Source: minirag/api/minirag_server.py:53-79, minirag/api/README.md:50-58.

⚠️ Community note: Until the #96 re-processing bug is resolved, users who call ainsert more than once on the same dataset will repeatedly re-extract entities from already-processed chunks. Workaround: restart the process with a clean --working-dir between batch inserts, or call ainsert exactly once with the full corpus. Source: issue #96.

`path2chunk` and Chunk Path Assignment

After the graph is built, MiniRAG assigns each chunk a *path* of related chunks via the path2chunk function in operate.py. This function walks the heterogeneous graph and accumulates how often each chunk co-occurs with a given chunk's entities, producing a count_dict. The final selection currently uses count_dict.most_common(max_chunks) rather than the accumulated node_chunk_id list — community issue #90 questions this choice because the two variables are not equivalent. Understanding this step is important because the chunk path is what the lightweight retrieval stage later traverses. Source: minirag/operate.py, issue #90.

Storage Footprint

A key claimed benefit of the indexing design is storage efficiency. By sharing nodes and edges across chunks rather than maintaining large per-chunk embedding stores, MiniRAG reportedly requires only about 25% of the storage space of comparable LLM-based RAG systems while delivering comparable retrieval accuracy on the LiHua-World benchmark. Source: README.md:13-15.

Common Failure Modes

SymptomLikely CauseMitigation
Indexing slows down across runsainsert re-processes processed chunks (issue #96)Call ainsert once on the full corpus, or wipe --working-dir between runs
Per-file latency grows to >20 minutesCumulative re-extraction + remote LLM latency (issue #82)Use a local model, batch chunks, or pre-filter inputs
DynamicCache has no attribute get_max_lengthMicrosoft Phi-3 modeling file incompatibility (issue #69)Swap model, or patch cached modeling_phi3.py: rename get_max_length to get_max_cache_shape
Python dependency install failureNo pinned Python version; some deps need a different interpreter (issue #1)Match the Python version expected by each pinned dependency in requirements.txt
Empty graph after indexingUnsupported file extension or PDF/DOCX/PPTX parser not installedInstall pypdf, docx, python-pptx; check supported_extensions in DocumentManager

Sources: issue #96, issue #82, issue #69, issue #1, minirag/api/minirag_server.py:280-340.

See Also

  • MiniRAG Class & Public API
  • Query & Topology-Enhanced Retrieval
  • Supported Storage Backends
  • Reproducing the LiHua-World Benchmark

Sources: issue #96, issue #82, issue #69, issue #1, minirag/api/minirag_server.py:280-340.

Query & Retrieval Workflow

Related topics: MiniRAG Overview & System Architecture, Indexing Pipeline & Knowledge Graph Construction, LLM/Embedding Integrations, API Server & Deployment

Section Related Pages

Continue reading this section for the full explanation and source context.

Section 2.1 Python entry point

Continue reading this section for the full explanation and source context.

Section 2.2 HTTP entry point

Continue reading this section for the full explanation and source context.

Section 2.3 Command-line reproduction

Continue reading this section for the full explanation and source context.

Related topics: MiniRAG Overview & System Architecture, Indexing Pipeline & Knowledge Graph Construction, LLM/Embedding Integrations, API Server & Deployment

Query & Retrieval Workflow

1. Overview and Purpose

The Query & Retrieval Workflow is the runtime counterpart to MiniRAG's indexing pipeline. Once documents have been chunked, entity-extracted, and embedded into a heterogeneous graph (see the *Indexing Workflow* page), the retrieval workflow is responsible for taking a natural-language question and producing an answer that is grounded in that graph.

MiniRAG exposes this workflow through two complementary surfaces:

  • A Python API — the MiniRAG class defined in minirag/minirag.py exposes query(...) and aquery(...) methods that are called directly by user code or by the bundled reproduction scripts.
  • An HTTP API — a FastAPI server built around the same MiniRAG instance, see minirag/api/minirag_server.py. The server exposes /query for RAG queries and also Ollama-compatible /api/chat and /api/tags endpoints that route through the same retrieval core.

The design goal, as stated in README.md:1-30, is to let *small* language models (SLMs) perform RAG effectively by leaning on the graph structure rather than on the model's own semantic reasoning. Source: README.md:13-23.

2. Entry Points and Parameters

2.1 Python entry point

The user-facing entry point is the async aquery() method, which is also wrapped by the synchronous query() convenience method. Both accept a QueryParam object defined in minirag/base.py. The most important fields are:

FieldTypePurpose
modeSearchMode enumSelects the retrieval strategy (mini, naive, light)
streamboolToggles streaming vs. one-shot response
only_need_contextboolReturns only the retrieved context, no answer generation
top_kintNumber of similar entities/chunks to retrieve
conversation_historylistOptional multi-turn chat history
history_turnsintNumber of past turns to include

Source: minirag/api/minirag_server.py:1-60 (the server constructs a QueryParam from the HTTP body with exactly these fields) and minirag/base.py (QueryParam and SearchMode definitions).

2.2 HTTP entry point

The /query endpoint in minirag_server.py delegates to rag.aquery() and packages the result into a QueryResponse. Source: minirag/api/minirag_server.py:Query handler. The Ollama-compatible /api/chat endpoint additionally supports a *prefix-based* mode selector parsed by parse_query_mode():

  • /light ...SearchMode.light
  • /naive ...SearchMode.naive
  • /mini ...SearchMode.mini

Source: minirag/api/minirag_server.py:parse_query_mode definition(prefix map and stripping logic).

2.3 Command-line reproduction

For local benchmarking, reproduce/Step_1_QA.py loads the previously built index (created by reproduce/Step_0_index.py) and iterates over the LiHua-World QA set, calling the MiniRAG query interface for each question. Source: README.md:Quick Start section.

3. Retrieval Pipeline

Once a query reaches MiniRAG.aquery(), the following pipeline executes:

flowchart TD
    A[User query] --> B[Parse QueryParam<br/>mode, top_k, history]
    B --> C{Mode}
    C -->|mini| D[Topology-enhanced<br/>graph retrieval]
    C -->|naive| E[Vector-only<br/>chunk retrieval]
    C -->|light| F[LightRAG<br/>hybrid retrieval]
    D --> G[Build context from<br/>entities + chunks]
    E --> G
    F --> G
    G --> H[Compose prompt<br/>using prompt.py]
    H --> I[LLM call<br/>streaming or one-shot]
    I --> J[Return answer / context]
  1. Mode dispatch. The mode field selects one of three retrieval strategies implemented in minirag/operate.py. The flagship mini mode is the *topology-enhanced* retrieval that walks the heterogeneous graph (entities ↔ chunks) rather than relying purely on vector similarity. Source: README.md:MiniRAG Framework section.
  1. Graph/entity retrieval. For mini mode the system looks up the query embedding, retrieves the top-k most similar entity nodes (default top_k=50, see the --top-k argument in minirag_server.py), then expands to chunks through the graph edges recorded during indexing. Source: minirag/api/minirag_server.py:parse_args --top-k and minirag/operate.py:mini-mode functions.
  1. Context assembly. Retrieved entities, relations, and chunk texts are concatenated into a context block. When only_need_context=True, the workflow short-circuits here and returns the assembled context without invoking the LLM. Source: minirag/api/minirag_server.py:QueryRequest.only_need_context field.
  1. Prompt composition. The context and the user's question are combined using the prompt templates in minirag/prompt.py. These templates are deliberately short so that SLMs can follow them reliably. Source: README.md:Abstract — "lightweight topology-enhanced retrieval".
  1. Answer generation. The composed prompt is sent to the configured LLM binding (ollama, openai, azure_openai, or lollms) selected at server start-up. The server can stream the response chunk-by-chunk (NDJSON for Ollama) or return a single QueryResponse. Source: minirag/api/minirag_server.py:streaming vs non-streaming branches.

4. Known Issues and Practical Guidance

Several community-reported issues are directly related to the query and retrieval workflow and should be considered when operating MiniRAG:

  • Slow indexing that appears to leak into retrieval. Issue #82 reports that reproduce/Step_0_index.py runs very slowly and slows down further over time. Because retrieval depends on a fully built index, a slow or growing index will directly delay the first query and make every subsequent incremental insert slower. Recommendation: complete indexing in one pass, then call aquery() separately. Source: README.md:reproduce workflow and the community thread for issue #82.
  • Repeated entity extraction on ainsert(). Issue #96 shows that calling ainsert() multiple times can re-process chunks that are already in processed state, extracting entities again. This makes the knowledge base grow on every call and degrades retrieval quality over time. The workaround is to insert each document exactly once and then issue queries without re-inserting. Source: reproduce/Step_0_index.py:ainsert usage pattern.
  • most_common selection in path2chunk. Issue #90 questions why count_dict.most_common(max_chunks) is used to populate v['Path'] instead of node_chunk_id.most_common(max_chunks). This is relevant during retrieval because the selected paths determine which chunks appear in the assembled context and therefore which evidence the LLM sees. Source: minirag/operate.py:path2chunk function.
  • Phi-3 / DynamicCache incompatibility. Issue #69 reports that recent Microsoft Phi-3 model files removed get_max_length from DynamicCache, breaking LLM calls during answer generation. Two workarounds are documented: switch LLM binding, or patch the cached modeling_phi3.py to use get_max_cache_shape. Source: minirag/llm.py (LLM binding configuration).
  • Python version drift. Issue #1 notes that the project does not pin a Python version, and dependency conflicts appear with Python 3.10.13. The retrieval workflow itself is not affected, but a mismatched interpreter will prevent the server from starting. Source: setup.py and requirements.txt.

See Also

  • *Indexing Workflow* — how the heterogeneous graph used by mini mode is built.
  • *MiniRAG API Server* — full reference for /query, /documents/*, and Ollama-compatible endpoints.
  • *LiHua-World Dataset* — the benchmark dataset used by the reproduction scripts.
  • LightRAG — the upstream project that MiniRAG extends.

Source: https://github.com/HKUDS/MiniRAG / Human Manual

LLM/Embedding Integrations, API Server & Deployment

Related topics: MiniRAG Overview & System Architecture, Query & Retrieval Workflow

Section Related Pages

Continue reading this section for the full explanation and source context.

Related topics: MiniRAG Overview & System Architecture, Query & Retrieval Workflow

LLM/Embedding Integrations, API Server & Deployment

Purpose and Scope

MiniRAG exposes a flexible integration layer that decouples the RAG pipeline from any specific model provider, allowing the same MiniRAG core to be paired with interchangeable LLM and embedding backends. On top of this layer, the project ships an optional FastAPI-based server (minirag-server) that turns MiniRAG into a drop-in RAG backend compatible with multiple client ecosystems, including an Ollama-compatible API.

The deployment story covers three concerns: (1) choosing and configuring an LLM/embedding binding via CLI or environment variables, (2) running the API server with the minirag-server entry point, and (3) exposing document insertion, querying, scanning, and deletion through HTTP. Community issue #1 highlights the importance of pinning a compatible Python version (some transitive dependencies require versions other than 3.10.13), and the latest release v0.0.2 added the PyPI distribution plus the official API server described below (Source: README.md).

LLM and Embedding Binding Architecture

MiniRAG supports four LLM/embedding bindings: lollms, ollama, openai, and azure_openai. The binding is configured independently for the LLM and the embedding model, so you can mix providers (for example, Ollama for embeddings and OpenAI for the LLM) (Source: minirag/api/README.md). Bindings are selected through the CLI flags --llm-binding and --embedding-binding, with their corresponding --llm-binding-host, --embedding-binding-host, --llm-binding-api-key, and --embedding-binding-api-key arguments. Defaults are read from environment variables (LLM_BINDING, LLM_BINDING_HOST, LLM_MODEL, EMBEDDING_BINDING, etc.) and fall back to sensible values such as mistral-nemo:latest for the LLM and bge-m3:latest for embeddings (Source: minirag/api/minirag_server.py).

The binding selects one of the completion functions defined in minirag_server.py:

  • openai_alike_model_complete wraps openai_complete_if_cache and targets any OpenAI-compatible HTTP endpoint (Source: minirag/api/minirag_server.py).
  • azure_openai_model_complete wraps azure_openai_complete_if_cache and reads AZURE_OPENAI_API_KEY plus AZURE_OPENAI_API_VERSION (Source: minirag/api/minirag_server.py).
  • The Ollama/LoLLMs path passes host, timeout, num_ctx, and api_key via llm_model_kwargs into the standard LightRAG-style completion pipeline (Source: minirag/api/minirag_server.py).

Embedding functions are wired through a single EmbeddingFunc instance whose callable is selected at runtime by an if/else ladder over args.embedding_binding, dispatching to lollms_embed, ollama_embed, or azure_openai_embed. embedding_dim and max_token_size are forwarded from CLI arguments (Source: minirag/api/minirag_server.py).

The Hugging Face / Microsoft Phi-3 integration is not first-class in the API server; community issue #69 reports a 'DynamicCache' object has no attribute 'get_max_length' error caused by a recent change in Microsoft's modeling_phi3.py. The recommended workarounds are to switch to a different model or to patch the cached modeling_phi3.py (replacing get_max_length with get_max_cache_shape) under the Hugging Face cache directory.

API Server Configuration

The server is built around a single MiniRAG instance constructed from CLI arguments and injected with shared KV, document-status, graph, and vector storage classes (Source: minirag/api/minirag_server.py). The CLI surface includes chunking controls (--chunk_size default 1200, --chunk_overlap_size default 100), concurrency (--max-async default 4), context limits (--max-tokens default 32768, --max-embed-tokens default 8192), timeout, log level, an optional API key, and SSL settings (Source: minirag/api/minirag_server.py). Retrieval-side tuning is exposed via --top-k (default 50) and --cosine-threshold (default 0.4), which together control how many entities/relations are returned in local/global modes (Source: minirag/api/README.md).

Key CLI flags summarized:

FlagDefaultPurpose
--llm-bindingollamaLLM backend: lollms, ollama, openai, azure_openai
--embedding-bindingollamaEmbedding backend (independently selectable)
--chunk-size / --chunk-overlap-size1200 / 100Text chunking window
--max-async4Concurrent LLM/embedding calls
--top-k50Retrieval count per query
--cosine-threshold0.4Cosine similarity cutoff for retrieval
--timeoutNonePer-call timeout; None means infinite

Source: minirag/api/README.md

API Endpoints and Deployment Modes

The FastAPI app exposes three groups of endpoints (Source: minirag/api/README.md):

  1. Document ManagementPOST /documents/text inserts raw text, POST /documents/file uploads a single file, POST /documents/batch uploads many files, POST /documents/scan triggers a directory scan of --input-dir, and DELETE /documents clears the store. File uploads accept .txt, .md, .pdf, .docx, and .pptx; PDF/DOCX/PPTX handlers are lazily installed via pm.install(...) on first use (Source: minirag/api/minirag_server.py).
  2. QueryPOST /query accepts a QueryRequest (query text, mode, stream flag, only_need_context) and dispatches to rag.aquery with a QueryParam carrying top_k. Streaming is handled by accumulating chunks from the async generator before returning a QueryResponse (Source: minirag/api/minirag_server.py).
  3. Ollama EmulationGET /api/version, GET /api/tags, and POST /api/chat make the server usable as a drop-in Ollama backend for tools that already speak the Ollama protocol. Chat requests are streamed back to the client (Source: minirag/api/README.md).

A utility GET /health endpoint reports the server's configuration and liveness (Source: minirag/api/README.md).

Installation is offered in two ways matching LightRAG: from PyPI with pip install "lightrag-hku[api]", or from source with pip install -e ".[api]" (Source: minirag/api/README.md). Once installed, minirag-server is the console entry point. You can also run the variants directly under Uvicorn: uvicorn lollms_minirag_server:app --reload --port 9721, ollama_minirag_server:app, openai_minirag_server:app, or the Azure OpenAI variant (Source: minirag/api/README.md).

Common Failure Modes and Tuning Tips

A few recurring deployment problems surface in the community. Issue #82 reports indexing slowing to over 20 minutes per file when using gpt-4o-mini; the practical levers are --max-async (raise it for parallel calls), --timeout (set a value so a stalled call doesn't block indefinitely), --chunk-size (smaller chunks mean fewer LLM calls per file), and batching uploads through POST /documents/batch so concurrency is amortized across files (Source: minirag/api/minirag_server.py).

Issue #1 notes dependency-version drift; the setup.py should be consulted for the supported Python range before installation (Source: setup.py). Issue #96 describes repeated entity extraction when ainsert is called multiple times; the API server mitigates this by routing uploads through the single rag.ainsert entry point and tracking chunk state in DOC_STATUS_STORAGE, so prefer POST /documents/scan over ad-hoc re-ingestion when re-running on the same --input-dir (Source: minirag/api/minirag_server.py).

When in doubt, use minirag-server --help to enumerate every supported flag and binding combination, and confirm that the chosen LLM and embedding models are pre-pulled in the target Ollama or LoLLMs instance before serving traffic (Source: minirag/api/README.md).

See Also

  • MiniRAG Framework Overview
  • Heterogeneous Graph Indexing
  • LiHua-World Benchmark Dataset
  • Retrieval Modes (local, global, hybrid)

Source: https://github.com/HKUDS/MiniRAG / Human Manual

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

high Security or permission risk requires verification

May increase setup, validation, or first-run risk for the user.

high Security or permission risk requires verification

May increase setup, validation, or first-run risk for the user.

medium Installation risk requires verification

May increase setup, validation, or first-run risk for the user.

medium Capability evidence risk requires verification

May increase setup, validation, or first-run risk for the user.

Doramagic Pitfall Log

Found 13 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Security or permission risk - Security or permission risk requires verification.

1. Security or permission risk: Security or permission risk requires verification

  • Severity: high
  • Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/104

2. Security or permission risk: Security or permission risk requires verification

  • Severity: high
  • Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/108

3. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/97

4. Capability evidence risk: Capability evidence risk requires verification

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: capability.assumptions | https://github.com/HKUDS/MiniRAG

5. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/95

6. Maintenance risk: Maintenance risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/102

7. Maintenance risk: Maintenance risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/HKUDS/MiniRAG

8. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: downstream_validation.risk_items | https://github.com/HKUDS/MiniRAG

9. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: risks.scoring_risks | https://github.com/HKUDS/MiniRAG

10. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/109

11. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/HKUDS/MiniRAG/issues/98

12. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: issue_or_pr_quality=unknown。
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/HKUDS/MiniRAG

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources 12

Count of project-level external discussion links exposed on this manual page.

Use Review before install

Open the linked issues or discussions before treating the pack as ready for your environment.

Community Discussion Evidence

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using MiniRAG with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence