# [dsRAG](https://github.com/D-Star-AI/dsRAG) Project Manual

Generated at: 2026-06-28 14:17:27 UTC

## Table of Contents

- [Overview & System Architecture](#page-1)
- [Pluggable Retrieval Components](#page-2)
- [Core Retrieval Innovations](#page-3)
- [dsParse: Multimodal File Parsing & VLM Integration](#page-4)

<a id='page-1'></a>

## Overview & System Architecture

### Related Pages

Related topics: [Pluggable Retrieval Components](#page-2), [Core Retrieval Innovations](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)
- [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md)
- [dsrag/dsparse/models/types.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py)
</details>


## Purpose and Scope

dsRAG is a retrieval engine for unstructured data, optimized for challenging queries over dense text such as financial reports, legal documents, and academic papers. According to the project README, on the FinanceBench benchmark dsRAG reaches 96.6% accuracy versus 32% for a vanilla RAG baseline. Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md).

The system is organized around a single high-level abstraction — the `KnowledgeBase` object — that takes in raw documents, performs chunking and embedding, persists state to disk, and at query time returns the most relevant segments of text. The KnowledgeBase is configured by composing six pluggable components: VectorDB, ChunkDB, Embedding, Reranker, LLM, and FileSystem. Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md).

A dedicated sub-module called **dsParse** handles multimodal file parsing, semantic sectioning, and chunking. It can be used standalone via `pip install dsparse` or transparently inside a KnowledgeBase by enabling `use_vlm=True` in the file-parsing configuration. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

## High-Level Architecture

The system separates **ingestion-time** concerns (parsing, sectioning, chunking, embedding, persistence) from **query-time** concerns (vector search, reranking, relevant segment extraction). The data flow below reflects the canonical pipeline described in the README.

```mermaid
flowchart LR
    A[Raw File or Text] --> B[dsParse<br/>VLM Parsing + Semantic Sectioning]
    B --> C[AutoContext<br/>Contextual Chunk Headers]
    C --> D[Embedding Model]
    D --> E[(VectorDB)]
    D --> F[(ChunkDB)]
    G[User Query] --> H[Vector Search]
    H --> E
    H --> I[Reranker]
    I --> J[Relevant Segment<br/>Extraction RSE]
    F --> J
    J --> K[LLM-generated Answer]
```

Three key methods drive the accuracy gains documented in the README benchmarks:

| Method | When it runs | What it does |
|---|---|---|
| **Semantic Sectioning** | Ingestion | LLM identifies semantically cohesive sections and titles them |
| **AutoContext** | Ingestion | Prepends document + section context to each chunk header |
| **Relevant Segment Extraction (RSE)** | Query time | Combines adjacent relevant chunks into longer segments |

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md).

## Core Components

The README enumerates the six configurable components of a KnowledgeBase. Each can be replaced by a custom subclass of its base class. Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md).

- **VectorDB** — stores embedding vectors and metadata. Available options include `BasicVectorDB`, `WeaviateVectorDB`, `ChromaDB`, `QdrantVectorDB`, `MilvusDB`, and `PineconeDB`. ChromaDB additionally supports metadata filtering at query time.
- **ChunkDB** — stores chunk text keyed on `(doc_id, chunk_index)`. Options: `BasicChunkDB` and `SQLiteDB`. RSE reads from here to reconstruct full segments.
- **Embedding** — converts chunks (with AutoContext headers) to vectors. Options include `OpenAIEmbedding`, `CohereEmbedding`, `VoyageAIEmbedding`, and `OllamaEmbedding`. Community issue #6 requests broader local-model support such as sentence-transformers. Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md).
- **Reranker** — re-scores retrieved chunks before RSE. Options: `CohereReranker`, `VoyageReranker`, `NoReranker`.
- **LLM** — used for AutoContext generation and (optionally) answer synthesis. Community issue #5 requests Llama 3-8B support for AutoContext.
- **FileSystem** — controls where intermediate files (e.g., VLM page images, `elements.json`) are persisted.

The configuration surface is typed via `TypedDict` definitions in `dsrag/dsparse/models/types.py`. For example, `FileParsingConfig` accepts `use_vlm`, `vlm_config`, `always_save_page_images`, plus optional serialized `vlm` and `vlm_fallback` clients. Source: [dsrag/dsparse/models/types.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py).

## dsParse Sub-Module

dsParse is the document ingestion engine. Its main entry point is `parse_and_chunk(kb_id, doc_id, file_path, file_parsing_config, ...)`, which returns a tuple `(sections, chunks)`. It supports two parsing modes:

1. **Traditional text extraction** — fast and inexpensive, suitable for PDFs with clean extractable text.
2. **VLM (vision language model) parsing** — uses a multimodal model such as `gemini-2.0-flash` to OCR pages, categorize elements into types (NarrativeText, Figure, Image, Table, etc.), and describe visual elements. Requires the external `poppler` dependency for PDF → image conversion. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

VLMs now expose a class-based client abstraction (`GeminiVLM`, etc.) analogous to LLM/Embedding/Reranker. Instances can be attached to a KnowledgeBase via the `vlm_client` constructor parameter, or supplied per-document inside `file_parsing_config["vlm"]`. A fallback client can be configured through `vlm_fallback`. Legacy dict-based `vlm_config` remains supported for backward compatibility. The system prefers the class-based client when both are present. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

## Community Considerations

Several open issues are directly relevant to the architecture:

- **Dependency isolation (#127)** — `dsrag/llm.py` reportedly does an eager `import google.generativeai` instead of using `dsrag.utils.imports.LazyLoader`. This forces the dependency on every user even when Gemini is not used. Source: community context.
- **Platform compatibility (#61)** — uvloop is unsupported on Windows, causing a `RuntimeError` for users on that OS. The runtime stack may need a platform-conditional fallback.
- **Local embedding models (#6)** — users want `sentence-transformers` (and similar) as first-class Embedding components, which would make the system viable fully offline.
- **Local LLMs for AutoContext (#5)** — Llama 3-8B is requested as an AutoContext LLM option. Currently the LLM component accepts provider/model pairs but does not enumerate local runners.
- **LangChain interop (#4)** — proposed by subclassing LangChain's `BaseRetriever` and delegating `_get_relevant_documents` to `kb.query`.

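The local-model requests (#5 and #6) map directly onto existing component slots (LLM and Embedding), so they can be addressed by adding new subclasses rather than restructuring the core architecture. The LangChain request (#4) is an adapter that wraps `kb.query`, while #127 and #61 are packaging and runtime concerns that sit outside the component model.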
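The dependency-isolation concern in #127 comes down to deferring an import until it is actually needed. The sketch below shows that general technique with the standard library only. `LazyModule` is a hypothetical name used for illustration and is not the `dsrag.utils.imports.LazyLoader` implementation.

```python
import importlib
import types
from typing import Optional


class LazyModule:
    """Hypothetical helper: import a module only when an attribute is first used."""

    def __init__(self, module_name: str) -> None:
        self._module_name = module_name
        self._module: Optional[types.ModuleType] = None

    def _load(self) -> types.ModuleType:
        # Import on first use and cache the module for later calls.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return self._module

    @property
    def is_loaded(self) -> bool:
        return self._module is not None

    def __getattr__(self, attr: str):
        # Only reached for names not found on the wrapper itself,
        # which means they are members of the wrapped module.
        return getattr(self._load(), attr)


# The stdlib json module stands in for an optional heavy dependency.
lazy_json = LazyModule("json")
print(lazy_json.is_loaded)        # nothing imported yet
print(lazy_json.dumps({"a": 1}))  # first attribute access triggers the import
```

With this pattern, a missing optional package raises `ImportError` only on the code path that uses it, rather than when the parent module is imported.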

## Configuration Entry Points

The README documents the main configuration dictionaries. The first four are passed to `add_document`; `rse_params` is passed at query time:

- `file_parsing_config`: controls VLM usage, element exclusion, concurrency, and DPI.
- `semantic_sectioning_config`: selects the LLM provider/model and toggles sectioning.
- `chunking_config`: sets `chunk_size` and `min_length_for_chunking`.
- `auto_context_config`: toggles document and section summary generation.
- `rse_params`: `max_length`, `overall_max_length`, `minimum_value`, `irrelevant_chunk_penalty`, `overall_max_length_extension`, `decay_rate`, `top_k_for_document_selection`, `chunk_length_adjustment`.

Persistence is automatic: the full KnowledgeBase configuration is written to a JSON file upon creation and update, making the object reconstructible across processes. Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md).
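To illustrate what "reconstructible across processes" means in practice, here is a minimal sketch of a JSON round trip. The helper names, the file name, and all configuration values are assumptions made for the example; the actual on-disk layout used by dsRAG is not described in this page.

```python
import json
import tempfile
from pathlib import Path


def save_kb_config(path: Path, config: dict) -> None:
    """Write the configuration as JSON so another process can reload it."""
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")


def load_kb_config(path: Path) -> dict:
    """Read a previously saved configuration back into a dict."""
    return json.loads(path.read_text(encoding="utf-8"))


# Illustrative values only; they are not dsRAG defaults.
config = {
    "kb_id": "my_kb",
    "chunking_config": {"chunk_size": 800, "min_length_for_chunking": 2000},
    "rse_params": {"max_length": 15, "overall_max_length": 30},
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "my_kb.json"
    save_kb_config(config_path, config)
    restored = load_kb_config(config_path)

print(restored == config)
```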

## See Also

- [Semantic Sectioning, AutoContext, and RSE](semantic-sectioning-autocontext-rse.md)
- [dsParse File Parsing](dsparse-file-parsing.md)
- [KnowledgeBase Components and Customization](kb-components.md)
- [Configuration Reference](configuration-reference.md)

---

<a id='page-2'></a>

## Pluggable Retrieval Components

### Related Pages

Related topics: [Overview & System Architecture](#page-1), [Core Retrieval Innovations](#page-3), [dsParse: Multimodal File Parsing & VLM Integration](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)
- [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md)
- [dsrag/dsparse/models/types.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py)
</details>


dsRAG is built around a **pluggable component architecture** in which every part of the retrieval pipeline can be swapped without rewriting the surrounding logic. A `KnowledgeBase` is composed of six configurable components — VectorDB, ChunkDB, Embedding, Reranker, LLM, and FileSystem — plus optional helpers such as a VLM client. Each component has a default implementation, several built-in alternatives, and a documented extension point for fully custom subclasses.

## Purpose and Scope

The pluggable design serves two goals. First, it lets users match infrastructure to deployment constraints (for example, switching from an in-memory `BasicVectorDB` to a managed `PineconeDB` without touching the embedding or chunking code). Second, it isolates dependency-heavy integrations behind a thin interface, so a project that only needs OpenAI embeddings is not forced to install every optional SDK. This pattern is reinforced throughout the codebase, including the dsParse sub-module, where VLM clients, file systems, and sectioning models follow the same class-based abstraction.

The top-level README states:

> There are six key components that define the configuration of a KnowledgeBase, each of which are customizable: 1. VectorDB, 2. ChunkDB, 3. Embedding, 4. Reranker, 5. LLM, 6. FileSystem. There are defaults for each of these components, as well as alternative options included in the repo. You can also define fully custom components by subclassing the base classes and passing in an instance of that subclass to the KnowledgeBase constructor.

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)

## The Six Core Components

### VectorDB

Stores dense embedding vectors alongside a small amount of metadata. Built-in options include `BasicVectorDB`, `WeaviateVectorDB`, `ChromaDB`, `QdrantVectorDB`, `MilvusDB`, and `PineconeDB`. ChromaDB additionally supports metadata query filters using operators such as `equals`, `not_equals`, `in`, `not_in`, `greater_than`, `less_than`, `greater_than_equals`, and `less_than_equals`.

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)
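The operator names listed above can be read as plain comparisons against a chunk's metadata. The following sketch evaluates such a filter in pure Python. The `{"field", "operator", "value"}` filter shape and the `matches` helper are assumptions for illustration; a real vector store would push the filter down into its own query engine.

```python
import operator
from typing import Any, Callable, Dict

# Map each operator name from the list above to a Python comparison.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "in": lambda value, options: value in options,
    "not_in": lambda value, options: value not in options,
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "greater_than_equals": operator.ge,
    "less_than_equals": operator.le,
}


def matches(metadata: dict, metadata_filter: dict) -> bool:
    """Return True when the metadata satisfies the filter (hypothetical helper)."""
    field = metadata_filter["field"]
    if field not in metadata:
        return False
    compare = OPERATORS[metadata_filter["operator"]]
    return compare(metadata[field], metadata_filter["value"])


chunk_metadata = {"doc_id": "report_2023", "year": 2023}
print(matches(chunk_metadata, {"field": "year", "operator": "greater_than", "value": 2020}))
```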

### ChunkDB

Stores the original text for every chunk in a nested dictionary keyed on `doc_id` and `chunk_index`. Relevant Segment Extraction (RSE) reads from this store to rebuild longer passages at query time. The supported options are `BasicChunkDB` and `SQLiteDB`.

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)
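As a rough picture of the nested-dictionary layout and of how a segment is rebuilt from it, consider the sketch below. `ChunkStore`, `InMemoryChunkStore`, and `get_segment_text` are hypothetical names; the real base class and its method signatures may differ.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ChunkStore(ABC):
    """Hypothetical stand-in for a chunk-store base class."""

    @abstractmethod
    def add_document(self, doc_id: str, chunks: Dict[int, str]) -> None: ...

    @abstractmethod
    def get_chunk_text(self, doc_id: str, chunk_index: int) -> Optional[str]: ...


class InMemoryChunkStore(ChunkStore):
    """Nested dict keyed on doc_id, then chunk_index."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[int, str]] = {}

    def add_document(self, doc_id: str, chunks: Dict[int, str]) -> None:
        self._data[doc_id] = dict(chunks)

    def get_chunk_text(self, doc_id: str, chunk_index: int) -> Optional[str]:
        return self._data.get(doc_id, {}).get(chunk_index)


def get_segment_text(store: ChunkStore, doc_id: str, chunk_start: int, chunk_end: int) -> str:
    """Join chunks chunk_start..chunk_end-1 into one passage, skipping gaps."""
    parts = [store.get_chunk_text(doc_id, i) for i in range(chunk_start, chunk_end)]
    return "\n".join(part for part in parts if part is not None)


store = InMemoryChunkStore()
store.add_document("doc_1", {0: "Intro.", 1: "Revenue grew.", 2: "Margins held."})
print(get_segment_text(store, "doc_1", 1, 3))
```

Swapping the in-memory class for a database-backed one changes only the subclass, which is the point of the pluggable design.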

### Embedding

Defines the embedding model. Built-in clients include `OpenAIEmbedding`, `CohereEmbedding`, `VoyageAIEmbedding`, and `OllamaEmbedding`. The `OllamaEmbedding` option is the in-tree path for running embeddings against a local model server, which is relevant to community interest in sentence-transformers and other local models.

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)

### Reranker

Re-ranks the top results returned by the vector store before RSE runs. Cohere and Voyage AI clients are shipped; the `NoReranker` class is provided for users who want to disable the rerank step entirely.

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)

### LLM

Drives AutoContext contextual chunk headers, semantic sectioning, and (optionally) the response generation step. Models from OpenAI, Anthropic, and Gemini are supported.

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)

### FileSystem

Abstracts where intermediate artefacts such as page images and `elements.json` are persisted. A `LocalFileSystem` ships in the dsParse sub-module and is consumed by `parse_and_chunk` through a `file_system` argument.

Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md)

## VLM Clients and the Parsing Pipeline

The dsParse sub-module extends the same pluggable pattern to multimodal parsing. A class-based `VLM` client (for example, `GeminiVLM`) can be passed at the `KnowledgeBase` level via `vlm_client=...`, or per document through a serialized dict in `file_parsing_config["vlm"]`. A fallback client is supported through `vlm_fallback`. The legacy dict-based path (`vlm_config` with `provider`/`model`) is still supported; when both are provided, the class-based client takes precedence.

The relevant typed configuration is defined in `dsrag/dsparse/models/types.py`:

| TypedDict | Notable keys |
|---|---|
| `VLMConfig` | `provider`, `model`, `fallback_provider`, `fallback_model`, `exclude_elements`, `element_types`, `dpi`, `vlm_max_concurrent_requests` |
| `FileParsingConfig` | `use_vlm`, `vlm_config`, `always_save_page_images`, `vlm`, `vlm_fallback` |
| `SemanticSectioningConfig` | `use_semantic_sectioning`, `llm_provider`, `model`, `language` |
| `ChunkingConfig` | `chunk_size`, `min_length_for_chunking` |

Source: [dsrag/dsparse/models/types.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py)
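For readers unfamiliar with `TypedDict`, the sketch below shows how optional-key configuration dictionaries of this kind are typically declared and merged with defaults. It is a trimmed-down illustration: the `Sketch` class names, the value types, and the default values are assumptions, and only the key names come from the table above.

```python
from typing import List, TypedDict


class VLMConfigSketch(TypedDict, total=False):
    # total=False makes every key optional.
    provider: str
    model: str
    exclude_elements: List[str]
    dpi: int
    vlm_max_concurrent_requests: int


class FileParsingConfigSketch(TypedDict, total=False):
    use_vlm: bool
    vlm_config: VLMConfigSketch
    always_save_page_images: bool


def with_defaults(config: FileParsingConfigSketch) -> FileParsingConfigSketch:
    """Fill in omitted keys; the default values here are illustrative."""
    defaults: FileParsingConfigSketch = {
        "use_vlm": False,
        "always_save_page_images": False,
    }
    return {**defaults, **config}


parsing_config = with_defaults(
    {"use_vlm": True, "vlm_config": {"provider": "gemini", "model": "gemini-2.0-flash"}}
)
print(parsing_config)
```

At runtime a `TypedDict` is an ordinary `dict`, so these declarations help static type checkers without changing how the configuration is passed around.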

A typical class-based override looks like:

```python
from dsrag.knowledge_base import KnowledgeBase
from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM

kb = KnowledgeBase(
    kb_id="my_kb",
    vlm_client=GeminiVLM(model="gemini-2.0-flash"),
)
```

Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md)

## Extending the System

Custom components are introduced by subclassing the base class and passing an instance to the `KnowledgeBase` constructor. This is the same mechanism the bundled options use, and it is the recommended way to integrate with LangChain retrievers or local models that are not yet first-class. Community discussions (for example, the request to add a LangChain `BaseRetriever` subclass whose `_get_relevant_documents` method calls `kb.query`) follow exactly this pattern, as do proposals to surface additional local LLMs and embedding backends. The README's section on `auto_context_config`, `semantic_sectioning_config`, `chunking_config`, and `rse_params` documents the configuration surface that custom components must respect.

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)

## Component Interaction at Query Time

```mermaid
flowchart LR
    Q[Query] --> KB[KnowledgeBase]
    KB --> VDB[VectorDB]
    VDB --> RR[Reranker]
    RR --> RSE[Relevant Segment Extraction]
    RSE --> CDB[ChunkDB]
    CDB --> Segs[Segments]
    KB -.uses.-> LLM[LLM / AutoContext]
    KB -.uses.-> FS[FileSystem]
```

The diagram illustrates the runtime contract: `VectorDB` returns candidate chunks, the `Reranker` re-orders them, and `RSE` reads the full text back from `ChunkDB` to assemble longer segments. The `LLM` participates in the offline pipeline (AutoContext headers, semantic sectioning) and optionally at response-generation time. The `FileSystem` is touched only during ingestion or when VLM parsing is enabled.

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)

## Operational Notes and Failure Modes

- **Environment variables** — The README and dsParse README call out that `GEMINI_API_KEY` is required for `GeminiVLM` and that a clear error is raised when it is missing. Custom components should follow the same pattern.
- **Optional dependencies** — Vector databases are now distributed as optional install groups, mirroring how the pluggable architecture is intended to keep core installs lean.
- **Backward compatibility** — The dsParse `vlm`/`vlm_fallback` class-based path supersedes the legacy `provider`/`model` dict path only when both are supplied, so existing configurations continue to work.
- **Cross-platform** — The component abstraction keeps heavy native dependencies (for example, the `poppler` binary needed for VLM PDF parsing) out of the core install path.

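Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md), [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md)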

## See Also

- KnowledgeBase Object
- AutoContext and Contextual Chunk Headers
- Relevant Segment Extraction (RSE)
- Semantic Sectioning and Chunking
- dsParse Multimodal File Parsing

---

<a id='page-3'></a>

## Core Retrieval Innovations

### Related Pages

Related topics: [Overview & System Architecture](#page-1), [Pluggable Retrieval Components](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [dsrag/auto_context.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/auto_context.py)
- [dsrag/auto_query.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/auto_query.py)
- [dsrag/rse.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/rse.py)
- [dsrag/dsparse/sectioning_and_chunking/semantic_sectioning.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/sectioning_and_chunking/semantic_sectioning.py)
- [dsrag/dsparse/sectioning_and_chunking/chunking.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/sectioning_and_chunking/chunking.py)
- [dsrag/metadata.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/metadata.py)
- [dsrag/dsparse/models/types.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py)
</details>


dsRAG is a retrieval engine for unstructured data, especially dense text like financial reports, legal documents, and academic papers. Three retrieval innovations sit at the heart of its pipeline and drive its reported jump from 32% to 96.6% accuracy on the FinanceBench benchmark (Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)). This page documents those three mechanisms: **Semantic Sectioning**, **AutoContext (contextual chunk headers)**, and **Relevant Segment Extraction (RSE)**, plus how they are configured through the `KnowledgeBase` API.

## Architecture Overview

The three innovations are applied at distinct stages of the retrieval pipeline. Semantic Sectioning reshapes the document before chunking, AutoContext enriches the chunk text before embedding, and RSE reconstructs longer passages after the vector + reranker search.

```mermaid
flowchart LR
    A[Raw Document] --> B[Semantic Sectioning]
    B --> C[Section-aware Chunking]
    C --> D[AutoContext Headers]
    D --> E[VectorDB + Reranker Search]
    E --> F[Relevant Segment Extraction]
    F --> G[Final Segments to LLM]
```

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md), [dsrag/rse.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/rse.py)

## Semantic Sectioning

Semantic sectioning is an offline, ingest-time step that uses an LLM to break a document into "semantically cohesive" sections. The implementation annotates the document with line numbers and prompts the LLM to identify the starting line for each section. Sections are expected to span a few paragraphs to a few pages, and the LLM also produces descriptive titles for each (Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)).
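The line-number approach can be sketched without any LLM call: annotate the lines, then turn the starting lines the model would return into contiguous sections. The function names, the `[n]` annotation format, and the hard-coded `(start_line, title)` pairs are assumptions standing in for the real prompt and model output.

```python
from typing import List, Tuple


def annotate_with_line_numbers(lines: List[str]) -> str:
    """Prefix each line with its index so a model can cite section starts."""
    return "\n".join(f"[{i}] {line}" for i, line in enumerate(lines))


def build_sections(lines: List[str], starts: List[Tuple[int, str]]) -> List[dict]:
    """Turn (start_line, title) pairs into contiguous sections covering the lines."""
    ordered = sorted(starts)
    sections = []
    for pos, (start, title) in enumerate(ordered):
        # A section ends right before the next one starts, or at the last line.
        end = ordered[pos + 1][0] - 1 if pos + 1 < len(ordered) else len(lines) - 1
        sections.append(
            {
                "title": title,
                "start": start,
                "end": end,
                "content": "\n".join(lines[start : end + 1]),
            }
        )
    return sections


doc_lines = ["Revenue grew.", "Margins held.", "Risks include FX.", "Rates may rise."]
print(annotate_with_line_numbers(doc_lines))

# Stand-in for the model's answer: the starting line and title of each section.
model_output = [(0, "Results"), (2, "Risk Factors")]
for section in build_sections(doc_lines, model_output):
    print(section["title"], section["start"], section["end"])
```

Asking the model only for starting lines keeps its output short and guarantees that the sections partition the document without overlap.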

The behavior is encoded in [`dsrag/dsparse/sectioning_and_chunking/semantic_sectioning.py`](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/sectioning_and_chunking/semantic_sectioning.py). The `SemanticSectioningConfig` typed dict in [`dsrag/dsparse/models/types.py`](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py) exposes `use_semantic_sectioning`, `llm_provider`, `model`, and `language` as the relevant parameters. Document text is processed in roughly 5,000-token mega-chunks in parallel, so even multi-hundred-page documents finish in 5–10 seconds. The default model is `gpt-4o-mini`, and `gemini-2.0-flash` or `claude-3-5-haiku-latest` are also listed as compatible options (Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)).

The downstream benefit is twofold: section titles feed into the AutoContext headers, and section boundaries constrain the chunker in [`dsrag/dsparse/sectioning_and_chunking/chunking.py`](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/sectioning_and_chunking/chunking.py) so that chunks do not straddle unrelated topics.

## AutoContext (Contextual Chunk Headers)

AutoContext solves a well-known embedding problem: an isolated chunk often lacks the context needed to embed it accurately. The `auto_context` module in [`dsrag/auto_context.py`](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/auto_context.py) generates a textual header per chunk containing document-level context (title, summary) and section-level context (section title and, optionally, a section summary), then prepends it before embedding. The chunk used for retrieval is the header plus the original text, while the stored chunk body remains the original (Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)).
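A minimal sketch of the header idea follows. The header wording and the helper names are hypothetical; the actual header template in `dsrag/auto_context.py` may be phrased differently. What the sketch preserves is the split described above: the header is used for embedding, while the stored chunk text is left untouched.

```python
from typing import Optional


def build_chunk_header(
    document_title: str,
    document_summary: Optional[str] = None,
    section_title: Optional[str] = None,
    section_summary: Optional[str] = None,
) -> str:
    """Assemble document-level and section-level context into one header."""
    lines = [f"Document: {document_title}"]
    if document_summary:
        lines.append(f"Document summary: {document_summary}")
    if section_title:
        lines.append(f"Section: {section_title}")
    if section_summary:
        lines.append(f"Section summary: {section_summary}")
    return "\n".join(lines)


def text_for_embedding(header: str, chunk_text: str) -> str:
    """Prepend the header for embedding only; the stored chunk stays unchanged."""
    return f"{header}\n\n{chunk_text}"


chunk_text = "FX exposure rose in the second half."
header = build_chunk_header("ACME 10-K 2023", section_title="Risk Factors")
print(text_for_embedding(header, chunk_text))
```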

Key configuration flags exposed in `KnowledgeBase.add_document` include:

| Flag | Effect |
|---|---|
| `use_generated_title` | Use an LLM-generated document title instead of the user-supplied one |
| `get_document_summary` | Include an LLM-generated document summary in each header |
| `get_section_summaries` | Include LLM-generated section summaries in each header |
| `document_title_guidance`, `section_summarization_guidance` | Free-form prompt guidance strings |

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)

Because AutoContext is an LLM call per document (and sometimes per section), there is open community interest in using locally hosted models such as Llama 3-8B for this step (Source: issue [#5](https://github.com/D-Star-AI/dsRAG/issues/5)). The LLM client is currently configured through the KnowledgeBase constructor, and the same concern — that the provider abstraction should make local model backends pluggable — also appears in requests for sentence-transformers embeddings (Source: issue [#6](https://github.com/D-Star-AI/dsRAG/issues/6)).

## Relevant Segment Extraction (RSE)

RSE is a query-time post-processing step. The vector search plus reranker returns a ranked list of individual chunks, but many real questions are answered by a contiguous block of text longer than one chunk. RSE clusters the top-ranked chunks, scores contiguous runs, and emits the highest-scoring runs as "segments" (Source: [dsrag/rse.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/rse.py), [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)).

The algorithm relies on `ChunkDB` to resolve the full text of each chunk by `(doc_id, chunk_index)`, then applies a length-aware scoring heuristic so longer contiguous runs are not unfairly penalized. The main tuning knobs live under `rse_params` in the `KnowledgeBase` configuration:

| Parameter | Purpose |
|---|---|
| `max_length` | Maximum length of a single segment, in chunks |
| `overall_max_length` | Maximum total length across all returned segments |
| `minimum_value` | Relevance floor below which a segment is dropped |
| `irrelevant_chunk_penalty` | Penalty (0–1) applied when a low-scoring chunk sits inside a run |
| `overall_max_length_extension` | Per-query extension to `overall_max_length` |
| `decay_rate` | Rate at which a chunk's value decays with its rank in the search results |
| `top_k_for_document_selection` | How many distinct documents to consider |
| `chunk_length_adjustment` | Whether to scale chunk scores by chunk length before segment scoring |

Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)
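The role of `max_length`, `minimum_value`, and `irrelevant_chunk_penalty` is easier to see in a toy version of the segment search. The sketch below is a deliberately simplified illustration for a single document and a single segment. It is not the algorithm in `dsrag/rse.py`, and it ignores rank decay, length adjustment, and multi-segment selection. The function names and the value formula (relevance minus penalty) are assumptions.

```python
from typing import Dict, List, Optional, Tuple


def chunk_values(
    num_chunks: int, relevance: Dict[int, float], irrelevant_chunk_penalty: float
) -> List[float]:
    """Chunks missing from the search results count as zero relevance."""
    return [relevance.get(i, 0.0) - irrelevant_chunk_penalty for i in range(num_chunks)]


def best_segment(
    values: List[float], max_length: int, minimum_value: float
) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span, end exclusive, with the highest summed value."""
    best_span, best_score = None, minimum_value
    for start in range(len(values)):
        running = 0.0
        last_end = min(start + max_length, len(values))
        for end in range(start + 1, last_end + 1):
            running += values[end - 1]
            if running > best_score:
                best_span, best_score = (start, end), running
    return best_span


# Chunks 2 and 4 are strong hits; chunk 3 between them is weak.
values = chunk_values(6, {2: 0.9, 3: 0.1, 4: 0.8}, irrelevant_chunk_penalty=0.2)
print(best_segment(values, max_length=10, minimum_value=0.5))
```

In this toy setup the weak chunk between two strong hits is absorbed into the segment because its small negative value costs less than splitting the run, which is the behavior the penalty parameter is meant to tune.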

Because RSE re-reads chunk text from `ChunkDB`, it is sensitive to how chunks are stored. The metadata layer in [`dsrag/metadata.py`](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/metadata.py) is what makes a segment addressable and presentable to the downstream generator.

## How the Three Innovations Combine

The combined effect of section-aware chunking, contextual headers, and segment-level retrieval is the basis of dsRAG's reported benchmark results. The README's KITE table compares vanilla top-k retrieval, RSE alone, contextual headers alone, and the combined default configuration; the combined configuration is uniformly the strongest across the AI Papers, BVP Cloud 10-Ks, Sourcegraph handbook, and Supreme Court opinions datasets (Source: [README.md](https://github.com/D-Star-AI/dsRAG/blob/main/README.md)).

For users who want to consume this pipeline from another framework, the pattern proposed in issue #4 is to subclass the host framework's retriever base class and call `kb.query` from `_get_relevant_documents`, so the same three innovations are reused unchanged (Source: issue [#4](https://github.com/D-Star-AI/dsRAG/issues/4)). Query-side enrichment is a separate concern handled in [`dsrag/auto_query.py`](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/auto_query.py); the search queries it produces are what get passed to `kb.query`.
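The adapter pattern from issue #4 can be sketched without importing any host framework. In the example below, `KnowledgeBaseRetriever` is a plain class that a real integration would replace with a subclass of the framework's retriever base class. The `query` signature (a list of search queries) and the `content` and `doc_id` result keys are assumptions, and `FakeKnowledgeBase` exists only so the example runs.

```python
from typing import List, Protocol


class SupportsQuery(Protocol):
    """Anything with a kb.query-style method (assumed signature)."""

    def query(self, search_queries: List[str]) -> List[dict]: ...


class KnowledgeBaseRetriever:
    """Framework-agnostic adapter that delegates retrieval to kb.query."""

    def __init__(self, kb: SupportsQuery) -> None:
        self.kb = kb

    def _get_relevant_documents(self, query: str) -> List[dict]:
        segments = self.kb.query([query])
        # Reshape each segment into a generic document record.
        return [
            {"page_content": seg["content"], "metadata": {"doc_id": seg["doc_id"]}}
            for seg in segments
        ]


class FakeKnowledgeBase:
    """Test double standing in for a real KnowledgeBase."""

    def query(self, search_queries: List[str]) -> List[dict]:
        return [{"doc_id": "doc_1", "content": f"segment for: {q}"} for q in search_queries]


retriever = KnowledgeBaseRetriever(FakeKnowledgeBase())
print(retriever._get_relevant_documents("revenue growth"))
```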

## See Also

- KnowledgeBase component reference (VectorDB, ChunkDB, Embedding, Reranker, LLM, FileSystem)
- dsParse multimodal file parsing
- VLM client configuration and `GeminiVLM` fallback patterns
- Known issue: `google.generativeai` direct import in `dsrag/llm.py` (issue [#127](https://github.com/D-Star-AI/dsRAG/issues/127))

---

<a id='page-4'></a>

## dsParse: Multimodal File Parsing & VLM Integration

### Related Pages

Related topics: [Overview & System Architecture](#page-1), [Pluggable Retrieval Components](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md)
- [dsrag/dsparse/main.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/main.py)
- [dsrag/dsparse/models/types.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py)
- [dsrag/dsparse/file_parsing/vlm_clients.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/file_parsing/vlm_clients.py)
- [dsrag/dsparse/file_parsing/vlm.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/file_parsing/vlm.py)
- [dsrag/dsparse/file_parsing/vlm_file_parsing.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/file_parsing/vlm_file_parsing.py)
- [dsrag/dsparse/file_parsing/non_vlm_file_parsing.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/file_parsing/non_vlm_file_parsing.py)
</details>


## Overview

dsParse is a sub-module of `dsrag` that performs multimodal file parsing, semantic sectioning, and chunking. It accepts a file path plus configuration and returns clean, structured chunks ready for embedding and retrieval. The module can be used on its own via the standalone `dsparse` pip package, or transparently through a `KnowledgeBase` by setting `use_vlm=True` in `file_parsing_config`. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

The core motivation behind dsParse is to handle documents where vanilla text extraction fails or loses fidelity: scanned PDFs, complex layouts, dense tables, and figures. By delegating page understanding to a Vision Language Model (VLM), dsParse produces descriptions of visual content and structurally accurate text, which gives downstream retrieval usable input for pages that plain extraction would garble or drop. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

## Architecture and Data Flow

When VLM parsing is enabled, dsParse converts each PDF page into an image (via poppler), sends the image and a structured prompt to a VLM, and receives a categorized list of page elements. Elements are typed, optionally described (for visuals), and then concatenated into lines that flow into the semantic sectioner and chunker. The default VLM is `gemini-2.0-flash`, chosen for fast, cost-effective, near-SOTA performance. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

```mermaid
flowchart TD
    A[PDF / file_path] --> B[Poppler: convert to page images]
    B --> C[VLM Client: GeminiVLM]
    C --> D[Element list<br/>NarrativeText, Table, Figure, etc.]
    D --> E[Annotate with line numbers]
    E --> F[Semantic Sectioner LLM]
    F --> G[Sections w/ titles]
    G --> H[Chunker]
    H --> I[(sections, chunks)]
    I --> J[AutoContext chunk headers]
    J --> K[(Embedding + VectorDB)]
```

A non-VLM parsing path (`non_vlm_file_parsing.py`) remains available for cases where a VLM is undesirable (cost, latency, or platform restrictions). The selection between VLM and non-VLM parsing is controlled by the `use_vlm` flag inside `FileParsingConfig`. Source: [dsrag/dsparse/models/types.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py).

### Element Types

Page content is categorized into eight categories by default: `NarrativeText`, `Figure`, `Image`, `Table`, `Header`, `Footnote`, `Footer`, and `Equation`. Users may define custom categories by supplying an `element_types` list, or exclude existing ones. By default, `Header` and `Footer` are excluded because they rarely carry semantic value and disrupt cross-page flow. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

The `Element` and `Line` types defined in `models/types.py` carry fields such as `type`, `content`, `page_number`, and `is_visual` — these flow through the pipeline and inform whether a chunk should include a visual description. Source: [dsrag/dsparse/models/types.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py).
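As an illustration of how typed elements and the default exclusions interact, the sketch below filters out excluded element types and flattens the rest into lines. `ElementSketch` carries only a subset of the fields named above, and `elements_to_lines` is a hypothetical helper rather than a dsParse function.

```python
from typing import List, Sequence, TypedDict


class ElementSketch(TypedDict):
    type: str
    content: str
    page_number: int


# Header and Footer are the default exclusions described above.
DEFAULT_EXCLUDED = ("Header", "Footer")


def elements_to_lines(
    elements: List[ElementSketch], exclude_elements: Sequence[str] = DEFAULT_EXCLUDED
) -> List[str]:
    """Drop excluded element types, then split the remaining content into lines."""
    lines: List[str] = []
    for element in elements:
        if element["type"] in exclude_elements:
            continue
        lines.extend(element["content"].split("\n"))
    return lines


page_elements: List[ElementSketch] = [
    {"type": "Header", "content": "ACME Corp | Annual Report", "page_number": 1},
    {"type": "NarrativeText", "content": "Revenue grew 12%.\nCosts fell.", "page_number": 1},
    {"type": "Table", "content": "| Q1 | Q2 |", "page_number": 1},
    {"type": "Footer", "content": "Page 1", "page_number": 1},
]
print(elements_to_lines(page_elements))
```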

## Configuration

Configuration is exposed through three TypedDict groups: `FileParsingConfig`, `SemanticSectioningConfig`, and `ChunkingConfig`. All fields are optional and fall back to module-level defaults when omitted. Source: [dsrag/dsparse/models/types.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py).

| Config Group | Key Field | Purpose |
|---|---|---|
| `FileParsingConfig` | `use_vlm` | Toggle VLM parsing (default False) |
| `FileParsingConfig` | `vlm` / `vlm_fallback` | Serialized class-based VLM clients |
| `FileParsingConfig` | `vlm_config` | Legacy dict: provider, model, dpi, concurrency |
| `FileParsingConfig` | `always_save_page_images` | Persist rasterized pages for reuse |
| `SemanticSectioningConfig` | `use_semantic_sectioning` | Toggle LLM-based sectioning |
| `SemanticSectioningConfig` | `llm_provider` / `model` | Sectioning LLM (default `gpt-4o-mini`) |
| `ChunkingConfig` | `chunk_size` | Max characters per chunk |
| `ChunkingConfig` | `min_length_for_chunking` | Skip chunking below this length |

Source: [dsrag/dsparse/models/types.py](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/models/types.py).
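To show how `chunk_size` and `min_length_for_chunking` interact, here is a simplified greedy chunker that packs whole lines into chunks. It is a sketch under stated assumptions, not the logic in `chunking.py`: the function name is hypothetical, lengths are measured in characters, and a single line longer than `chunk_size` is kept whole rather than split.

```python
from typing import List


def chunk_section(lines: List[str], chunk_size: int, min_length_for_chunking: int) -> List[str]:
    """Greedily pack whole lines into chunks of roughly chunk_size characters."""
    text = "\n".join(lines)
    if len(text) < min_length_for_chunking:
        return [text]  # short sections are kept as a single chunk

    chunks: List[str] = []
    current = ""
    for line in lines:
        candidate = line if not current else f"{current}\n{line}"
        if current and len(candidate) > chunk_size:
            # Adding this line would overflow, so close the current chunk.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


section_lines = ["aaaa", "bbbb", "cccc"]
print(chunk_section(section_lines, chunk_size=9, min_length_for_chunking=5))
print(chunk_section(section_lines, chunk_size=9, min_length_for_chunking=100))
```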

### VLM Clients

dsParse provides class-based VLM clients (e.g., `GeminiVLM`) that mirror the abstraction used for LLMs, embeddings, and rerankers. They support `.to_dict()` serialization so they can be persisted in configuration and rehydrated at runtime. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

A `vlm_fallback` client may be supplied alongside the primary client. If the primary client is still failing after its initial retries, the system alternates between the two clients on subsequent attempts. This mirrors the fallback patterns used elsewhere in dsRAG. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

Legacy dict-based configuration (e.g., `vlm_config={"provider": "gemini", "model": "gemini-2.0-flash"}`) remains fully supported. When both a serialized client and a legacy dict are supplied, the serialized client takes precedence. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).
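
The precedence rule can be sketched as a small resolver. `resolve_vlm_settings` is a hypothetical helper written for illustration, and the shape of the serialized client shown here is an assumption.

```python
def resolve_vlm_settings(file_parsing_config: dict) -> tuple[str, dict]:
    """Return which VLM settings win: serialized client first, then legacy dict."""
    serialized = file_parsing_config.get("vlm")
    if serialized is not None:
        return "serialized_client", serialized
    legacy = file_parsing_config.get("vlm_config")
    if legacy is not None:
        return "legacy_dict", legacy
    return "none", {}


both = {
    "use_vlm": True,
    # Assumed shape of a serialized client, for illustration only.
    "vlm": {"model": "gemini-2.0-flash"},
    "vlm_config": {"provider": "gemini", "model": "gemini-2.0-flash"},
}
source, settings = resolve_vlm_settings(both)
print(source)

legacy_only = {"vlm_config": {"provider": "gemini", "model": "gemini-2.0-flash"}}
legacy_source, _ = resolve_vlm_settings(legacy_only)
print(legacy_source)
```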

## Usage Patterns

**Standalone parsing.** The `parse_and_chunk` function is the primary entry point and accepts `file_path`, `file_parsing_config`, and optionally a `file_system` parameter (e.g., `LocalFileSystem(base_path="~/dsParse")`) for persisting intermediate artifacts. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

**KnowledgeBase integration.** A VLM client can be attached at the KB level via `KnowledgeBase(..., vlm_client=GeminiVLM(model="..."))`, then overridden per document by passing a serialized client under `file_parsing_config["vlm"]`. This is the recommended pattern when different documents need different models or cost profiles. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).

**Reusing pre-extracted images.** When page images already exist in the configured `FileSystem` directory, pass `vlm_config={"images_already_exist": True}` to skip rasterization and avoid redundant API calls. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).
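
As a configuration fragment, the image-reuse setting might look like the following. The keys are the ones named in this section; combining them in one dict this way is an assumption, so check it against the dsParse README before use.

```python
file_parsing_config = {
    "use_vlm": True,
    "vlm_config": {
        "provider": "gemini",
        "model": "gemini-2.0-flash",
        # Page images are already stored in the configured FileSystem,
        # so rasterization can be skipped.
        "images_already_exist": True,
    },
}
```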

## Cost, Latency, and Common Pitfalls

**Cost.** VLM parsing with `gemini-2.0-flash` is approximately $0.10 per 1000 pages (assuming 4 × 258-token image tiles per page at standard DPI plus a ~500-token prompt). Semantic sectioning with `gpt-4o-mini` is roughly $0.15 per 1000 pages. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).
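
The token assumptions in the cost estimate can be checked with a few lines of arithmetic. This sketch covers input tokens only and leaves the per-token price as a caller-supplied parameter, because the exact rates and output-token counts behind the quoted figures are not stated here. The function names are hypothetical.

```python
IMAGE_TILES_PER_PAGE = 4   # from the estimate above
TOKENS_PER_TILE = 258      # from the estimate above
PROMPT_TOKENS = 500        # approximate prompt size from the estimate above


def input_tokens_per_page() -> int:
    """Image tokens plus prompt tokens for one page."""
    return IMAGE_TILES_PER_PAGE * TOKENS_PER_TILE + PROMPT_TOKENS


def input_cost_usd(pages: int, usd_per_million_input_tokens: float) -> float:
    """Input-side cost only; output tokens are deliberately not modeled."""
    total_tokens = pages * input_tokens_per_page()
    return total_tokens * usd_per_million_input_tokens / 1_000_000


tokens = input_tokens_per_page()  # 4 * 258 + 500 = 1532
print(tokens)

# The price below is a placeholder, not a quoted provider rate.
print(input_cost_usd(pages=1000, usd_per_million_input_tokens=1.0))
```

At 1,532 input tokens per page, 1000 pages come to roughly 1.5 million input tokens, so the result scales linearly with whatever rate the provider currently charges.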

**Latency.** A single page takes ~15–20 seconds for VLM parsing. Documents are page-parallelized within the rate-limit budget (`vlm_max_concurrent_requests`). Sectioning operates on ~5000-token mega-chunks (≈10 pages) processed in parallel, typically completing a few-hundred-page document in 5–10 seconds. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).
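
Page-level parallelism under a concurrency cap is commonly built with a semaphore. The sketch below shows that general technique with `asyncio`. It is not dsParse's implementation: `parse_pages` is a hypothetical function and the sleep stands in for a VLM request.

```python
import asyncio


async def parse_pages(page_numbers, max_concurrent_requests):
    """Parse pages concurrently without exceeding the concurrency cap."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    in_flight = 0
    peak = 0

    async def parse_one(page):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the VLM request
            in_flight -= 1
            return f"page {page} parsed"

    # gather returns results in the same order as the inputs.
    results = await asyncio.gather(*(parse_one(p) for p in page_numbers))
    return results, peak


results, peak = asyncio.run(parse_pages(range(1, 11), max_concurrent_requests=3))
print(len(results), peak)
```

With a cap of `c` and a per-page latency of `t`, wall-clock time for `n` pages is roughly `ceil(n / c) * t`, which is why raising the concurrency limit (within provider rate limits) dominates total parsing time.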

**Platform compatibility.** dsParse depends on poppler for PDF rasterization (`brew install poppler` on macOS). Community issue #61 reports that some async runtimes (uvloop) fail on Windows, which is relevant when integrating dsParse into high-throughput async pipelines. Source: [community issue #61](https://github.com/D-Star-AI/dsRAG/issues/61).

**Optional dependencies.** Community issue #127 highlights that some optional dependencies (e.g., `google.generativeai`) are imported eagerly elsewhere in the codebase rather than via `LazyLoader`. This is worth noting when assembling minimal environments for dsParse on its own. Source: [community issue #127](https://github.com/D-Star-AI/dsRAG/issues/127).
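
For context on eager versus lazy imports, the sketch below shows a generic lazy-import proxy built on `importlib`. It is not dsRAG's `LazyLoader`. `LazyModule` is a hypothetical class, and the standard-library `json` module stands in for an optional dependency.

```python
import importlib


class LazyModule:
    """Generic lazy-import proxy: the module loads on first attribute access."""

    def __init__(self, module_name: str):
        self._module_name = module_name
        self._module = None

    def __getattr__(self, attribute: str):
        # Called only for names not found on the proxy itself,
        # which in practice means the wrapped module's attributes.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, attribute)


json_lazy = LazyModule("json")
loaded_before = json_lazy._module is not None  # nothing imported yet

text = json_lazy.dumps({"a": 1})  # first use triggers the import
loaded_after = json_lazy._module is not None
print(loaded_before, loaded_after, text)
```

With this pattern a missing optional package fails only when a feature that needs it is used, rather than at import time of the whole library.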

**Environment variables.** `GEMINI_API_KEY` is required for `GeminiVLM`; a clear error is raised at instantiation if missing. Other providers are expected to follow the same convention. Source: [dsrag/dsparse/README.md](https://github.com/D-Star-AI/dsRAG/blob/main/dsrag/dsparse/README.md).
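
A fail-fast key check of the kind described above can be sketched as follows. `SketchGeminiClient` and its `api_key_env` parameter are hypothetical, and the error type and message are assumptions rather than what `GeminiVLM` actually raises.

```python
import os


class SketchGeminiClient:
    """Hypothetical client that validates its API key at construction time."""

    def __init__(self, model: str, api_key_env: str = "GEMINI_API_KEY"):
        api_key = os.environ.get(api_key_env)
        if not api_key:
            # Fail at instantiation instead of on the first request.
            raise ValueError(f"{api_key_env} environment variable is not set")
        self.model = model
        self._api_key = api_key


try:
    sketch_client = SketchGeminiClient(model="gemini-2.0-flash")
    print("client ready:", sketch_client.model)
except ValueError as error:
    print("configuration error:", error)
```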

## See Also

- [KnowledgeBase & Components](dsrag_knowledge_base.md) — how dsParse integrates with the broader retrieval engine
- [AutoContext](autocontext.md) — uses section titles produced by semantic sectioning
- [Relevant Segment Extraction](relevant_segment_extraction.md) — consumes the chunk output

---


## Pitfall Log

Project: D-Star-AI/dsRAG

Summary: Found 13 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/113

## 2. Configuration risk - Configuration risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/117

## 3. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/73

## 4. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/127

## 5. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/116

## 6. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/D-Star-AI/dsRAG

## 7. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/124

## 8. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/D-Star-AI/dsRAG

## 9. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: The scan reported `no_demo`, which suggests no runnable demo was available for validation.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/D-Star-AI/dsRAG

## 10. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: The scan reported `no_demo`, which suggests no runnable demo was available for validation.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/D-Star-AI/dsRAG

## 11. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/118

## 12. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `issue_or_pr_quality=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/D-Star-AI/dsRAG

## 13. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `release_recency=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/D-Star-AI/dsRAG

<!-- canonical_name: D-Star-AI/dsRAG; human_manual_source: deepwiki_human_wiki -->
