Doramagic Project Pack · Human Manual

dsRAG

High-performance retrieval engine for unstructured data

Overview & System Architecture

Related topics: Pluggable Retrieval Components, Core Retrieval Innovations

Purpose and Scope

dsRAG is a retrieval engine for unstructured data, optimized for challenging queries over dense text such as financial reports, legal documents, and academic papers. According to the project README, on the FinanceBench benchmark dsRAG reaches 96.6% accuracy versus 32% for a vanilla RAG baseline. Source: README.md.

The system is organized around a single high-level abstraction — the KnowledgeBase object — that takes in raw documents, performs chunking and embedding, persists state to disk, and at query time returns the most relevant segments of text. The KnowledgeBase is configured by composing six pluggable components: VectorDB, ChunkDB, Embedding, Reranker, LLM, and FileSystem. Source: README.md.

A dedicated sub-module called dsParse handles multimodal file parsing, semantic sectioning, and chunking. It can be used standalone via pip install dsparse or transparently inside a KnowledgeBase by enabling use_vlm=True in the file-parsing configuration. Source: dsrag/dsparse/README.md.

High-Level Architecture

The system separates ingestion-time concerns (parsing, sectioning, chunking, embedding, persistence) from query-time concerns (vector search, reranking, relevant segment extraction). The data flow below reflects the canonical pipeline described in the README.

flowchart LR
    A[Raw File or Text] --> B[dsParse<br/>VLM Parsing + Semantic Sectioning]
    B --> C[AutoContext<br/>Contextual Chunk Headers]
    C --> D[Embedding Model]
    D --> E[(VectorDB)]
    D --> F[(ChunkDB)]
    G[User Query] --> H[Vector Search]
    H --> E
    H --> I[Reranker]
    I --> J[Relevant Segment<br/>Extraction RSE]
    F --> J
    J --> K[LLM-generated Answer]

Three key methods drive the accuracy gains documented in the README benchmarks:

| Method | When it runs | What it does |
| --- | --- | --- |
| Semantic Sectioning | Ingestion | LLM identifies semantically cohesive sections and titles them |
| AutoContext | Ingestion | Prepends document + section context to each chunk header |
| Relevant Segment Extraction (RSE) | Query time | Combines adjacent relevant chunks into longer segments |

Source: README.md.

Core Components

The README enumerates the six configurable components of a KnowledgeBase. Each can be replaced by a custom subclass of its base class. Source: README.md.

  • VectorDB — stores embedding vectors and metadata. Available options include BasicVectorDB, WeaviateVectorDB, ChromaDB, QdrantVectorDB, MilvusDB, and PineconeDB. ChromaDB additionally supports metadata filtering at query time.
  • ChunkDB — stores chunk text keyed on (doc_id, chunk_index). Options: BasicChunkDB and SQLiteDB. RSE reads from here to reconstruct full segments.
  • Embedding — converts chunks (with AutoContext headers) to vectors. Options include OpenAIEmbedding, CohereEmbedding, VoyageAIEmbedding, and OllamaEmbedding. Community issue #6 requests broader local-model support such as sentence-transformers. Source: README.md.
  • Reranker — re-scores retrieved chunks before RSE. Options: CohereReranker, VoyageReranker, NoReranker.
  • LLM — used for AutoContext generation and (optionally) answer synthesis. Community issue #5 requests Llama 3-8B support for AutoContext.
  • FileSystem — controls where intermediate files (e.g., VLM page images, elements.json) are persisted.

The configuration surface is typed via TypedDict definitions in dsrag/dsparse/models/types.py. For example, FileParsingConfig accepts use_vlm, vlm_config, always_save_page_images, plus optional serialized vlm and vlm_fallback clients. Source: dsrag/dsparse/models/types.py.
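
As a rough illustration of that typed surface, the sketch below mirrors the key names mentioned in this manual in standalone TypedDict classes. It is not a copy of dsrag/dsparse/models/types.py: the value types are assumptions, and the Sketch suffix marks both classes as hypothetical.

```python
from typing import Any, Dict, TypedDict


class VLMConfigSketch(TypedDict, total=False):
    # Legacy dict-based settings; the value types here are assumed.
    provider: str
    model: str
    dpi: int


class FileParsingConfigSketch(TypedDict, total=False):
    use_vlm: bool
    vlm_config: VLMConfigSketch
    always_save_page_images: bool
    vlm: Dict[str, Any]           # serialized class-based client
    vlm_fallback: Dict[str, Any]  # serialized fallback client


# total=False makes every key optional, so a partial config type-checks.
config: FileParsingConfigSketch = {
    "use_vlm": True,
    "vlm_config": {"provider": "gemini", "model": "gemini-2.0-flash"},
    "always_save_page_images": False,
}
```

Declaring the dictionaries with total=False matches the idea that omitted fields fall back to defaults, while still letting a type checker flag misspelled keys.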

dsParse Sub-Module

dsParse is the document ingestion engine. Its main entry point is parse_and_chunk(kb_id, doc_id, file_path, file_parsing_config, ...), which returns a tuple (sections, chunks). It supports two parsing modes:

  1. Traditional text extraction — fast and inexpensive, suitable for PDFs with clean extractable text.
  2. VLM (vision language model) parsing — uses a multimodal model such as gemini-2.0-flash to OCR pages, categorize elements into types (NarrativeText, Figure, Image, Table, etc.), and describe visual elements. Requires the external poppler dependency for PDF → image conversion. Source: dsrag/dsparse/README.md.

VLMs now expose a class-based client abstraction (GeminiVLM, etc.) analogous to LLM/Embedding/Reranker. Instances can be attached to a KnowledgeBase via the vlm_client constructor parameter, or supplied per-document inside file_parsing_config["vlm"]. A fallback client can be configured through vlm_fallback. Legacy dict-based vlm_config remains supported for backward compatibility. The system prefers the class-based client when both are present. Source: dsrag/dsparse/README.md.

Community Considerations

Several open issues are directly relevant to the architecture:

  • Dependency isolation (#127): dsrag/llm.py reportedly performs an eager import of google.generativeai instead of using dsrag.utils.imports.LazyLoader. This forces the dependency on every user even when Gemini is not used. Source: community context.
  • Platform compatibility (#61): uvloop is unsupported on Windows, causing a RuntimeError for users on that OS. The runtime stack may need a platform-conditional fallback.
  • Local embedding models (#6): users want sentence-transformers (and similar) as first-class Embedding components, which would make the system viable fully offline.
  • Local LLMs for AutoContext (#5): Llama 3-8B is requested as an AutoContext LLM option. Currently the LLM component accepts provider/model pairs but does not enumerate local runners.
  • LangChain interop (#4): proposed by subclassing LangChain's BaseRetriever and delegating _get_relevant_documents to kb.query.

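The local-model requests (#5, #6) map onto the existing component slots (Embedding, LLM), suggesting they can be addressed by adding new subclasses rather than restructuring the core architecture. The dependency and platform issues (#127, #61) concern packaging and the runtime stack rather than the component model, and the LangChain request (#4) is an adapter built on top of kb.query.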

Configuration Entry Points

The README documents the main configuration dictionaries passed to add_document:

  • file_parsing_config: controls VLM usage, element exclusion, concurrency, and DPI.
  • semantic_sectioning_config: selects the LLM provider/model and toggles sectioning.
  • chunking_config: chunk_size and min_length_for_chunking.
  • auto_context_config: toggles for document/section summary generation.
  • rse_params: max_length, overall_max_length, minimum_value, irrelevant_chunk_penalty, decay_rate, top_k_for_document_selection, chunk_length_adjustment.
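
A hedged sketch of how these dictionaries might be assembled before a call to add_document. Only the key names come from this manual; every value, including the "openai" provider string, is an illustrative placeholder rather than a documented default, and the exact arguments accepted by add_document should be checked against the README.

```python
# Key names follow the configuration groups listed above; values are placeholders.
file_parsing_config = {"use_vlm": False}

semantic_sectioning_config = {
    "use_semantic_sectioning": True,
    "llm_provider": "openai",  # assumed provider identifier
    "model": "gpt-4o-mini",
}

chunking_config = {"chunk_size": 800, "min_length_for_chunking": 2000}

auto_context_config = {
    "get_document_summary": True,
    "get_section_summaries": False,
}

rse_params = {
    "max_length": 15,
    "overall_max_length": 30,
    "minimum_value": 0.5,
    "irrelevant_chunk_penalty": 0.2,
}

# Collected in one place so they can be unpacked into a single call,
# e.g. kb.add_document(doc_id=..., file_path=..., **ingestion_kwargs).
ingestion_kwargs = {
    "file_parsing_config": file_parsing_config,
    "semantic_sectioning_config": semantic_sectioning_config,
    "chunking_config": chunking_config,
    "auto_context_config": auto_context_config,
}
```

Keeping rse_params separate reflects that it tunes query-time behavior, while the other four dictionaries apply at ingestion.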

Persistence is automatic: the full KnowledgeBase configuration is written to a JSON file upon creation and update, making the object reconstructible across processes. Source: README.md.

See Also

  • Semantic Sectioning, AutoContext, and RSE
  • dsParse File Parsing
  • KnowledgeBase Components and Customization
  • Configuration Reference

Source: https://github.com/D-Star-AI/dsRAG / Human Manual

Pluggable Retrieval Components

Related topics: Overview & System Architecture, Core Retrieval Innovations, dsParse: Multimodal File Parsing & VLM Integration

dsRAG is built around a pluggable component architecture in which every part of the retrieval pipeline can be swapped without rewriting the surrounding logic. A KnowledgeBase is composed of six configurable components — VectorDB, ChunkDB, Embedding, Reranker, LLM, and FileSystem — plus optional helpers such as a VLM client. Each component has a default implementation, several built-in alternatives, and a documented extension point for fully custom subclasses.

Purpose and Scope

The pluggable design serves two goals. First, it lets users match infrastructure to deployment constraints (for example, switching from an in-memory BasicVectorDB to a managed PineconeDB without touching the embedding or chunking code). Second, it isolates dependency-heavy integrations behind a thin interface, so a project that only needs OpenAI embeddings is not forced to install every optional SDK. This pattern is reinforced throughout the codebase, including the dsParse sub-module, where VLM clients, file systems, and sectioning models follow the same class-based abstraction.

The top-level README states:

There are six key components that define the configuration of a KnowledgeBase, each of which are customizable: 1. VectorDB, 2. ChunkDB, 3. Embedding, 4. Reranker, 5. LLM, 6. FileSystem. There are defaults for each of these components, as well as alternative options included in the repo. You can also define fully custom components by subclassing the base classes and passing in an instance of that subclass to the KnowledgeBase constructor.

Source: README.md

The Six Core Components

VectorDB

Stores dense embedding vectors alongside a small amount of metadata. Built-in options include BasicVectorDB, WeaviateVectorDB, ChromaDB, QdrantVectorDB, MilvusDB, and PineconeDB. ChromaDB additionally supports metadata query filters using operators such as equals, not_equals, in, not_in, greater_than, less_than, greater_than_equals, and less_than_equals.

Source: README.md
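
To make the operator semantics concrete, here is a minimal local sketch that evaluates one filter against one chunk's metadata. The filter shape (a dict with field, operator, and value keys) is an assumption for illustration, and this is not how ChromaDB itself evaluates filters.

```python
import operator
from typing import Any, Callable, Dict

# One comparison function per operator name listed above.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "in": lambda field_value, allowed: field_value in allowed,
    "not_in": lambda field_value, blocked: field_value not in blocked,
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "greater_than_equals": operator.ge,
    "less_than_equals": operator.le,
}


def matches(metadata: Dict[str, Any], metadata_filter: Dict[str, Any]) -> bool:
    """Return True if the metadata satisfies the (hypothetical) filter dict."""
    field_value = metadata.get(metadata_filter["field"])
    if field_value is None:
        # A chunk without the field cannot satisfy any comparison.
        return False
    compare = OPERATORS[metadata_filter["operator"]]
    return compare(field_value, metadata_filter["value"])
```

For example, matches({"year": 2023}, {"field": "year", "operator": "greater_than", "value": 2020}) evaluates the greater_than branch against the stored year.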

ChunkDB

Stores the original text for every chunk in a nested dictionary keyed on doc_id and chunk_index. Relevant Segment Extraction (RSE) reads from this store to rebuild longer passages at query time. The supported options are BasicChunkDB and SQLiteDB.

Source: README.md
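
The nested-dictionary layout can be pictured with a small in-memory stand-in. The class and method names below are hypothetical and do not reproduce the BasicChunkDB interface; the sketch only shows why keying on doc_id and then chunk_index makes it cheap to rebuild a contiguous run of chunks.

```python
from typing import Dict


class InMemoryChunkStore:
    """Hypothetical stand-in for a ChunkDB: text keyed on doc_id, then chunk_index."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[int, str]] = {}

    def add_document(self, doc_id: str, chunks: Dict[int, str]) -> None:
        # Copy the mapping so later caller-side mutation does not leak in.
        self._data[doc_id] = dict(chunks)

    def get_chunk_text(self, doc_id: str, chunk_index: int) -> str:
        return self._data[doc_id][chunk_index]

    def get_segment_text(self, doc_id: str, chunk_start: int, chunk_end: int) -> str:
        # chunk_end is exclusive, so (2, 5) covers chunks 2, 3, and 4.
        return "\n".join(
            self.get_chunk_text(doc_id, index) for index in range(chunk_start, chunk_end)
        )


store = InMemoryChunkStore()
store.add_document("report", {0: "Revenue rose.", 1: "Costs fell.", 2: "Outlook is stable."})
segment = store.get_segment_text("report", 0, 2)
```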

Embedding

Defines the embedding model. Built-in clients include OpenAIEmbedding, CohereEmbedding, VoyageAIEmbedding, and OllamaEmbedding. The OllamaEmbedding option is the in-tree path for running embeddings against a local model server, which is relevant to community interest in sentence-transformers and other local models.

Source: README.md

Reranker

Re-ranks the top results returned by the vector store before RSE runs. Cohere and Voyage AI clients are shipped; the NoReranker class is provided for users who want to disable the rerank step entirely.

Source: README.md

LLM

Drives AutoContext contextual chunk headers, semantic sectioning, and (optionally) the response generation step. Models from OpenAI, Anthropic, and Gemini are supported.

Source: README.md

FileSystem

Abstracts where intermediate artefacts such as page images and elements.json are persisted. A LocalFileSystem ships in the dsParse sub-module and is consumed by parse_and_chunk through a file_system argument.

Source: dsrag/dsparse/README.md

VLM Clients and the Parsing Pipeline

The dsParse sub-module extends the same pluggable pattern to multimodal parsing. A class-based VLM client (for example, GeminiVLM) can be passed at the KnowledgeBase level via vlm_client=..., or per document through a serialized dict in file_parsing_config["vlm"]. A fallback client is supported through vlm_fallback. The legacy dict-based path (vlm_config with provider/model) is still supported; when both are provided, the class-based client takes precedence.

The relevant typed configuration is defined in dsrag/dsparse/models/types.py:

| TypedDict | Notable keys |
| --- | --- |
| VLMConfig | provider, model, fallback_provider, fallback_model, exclude_elements, element_types, dpi, vlm_max_concurrent_requests |
| FileParsingConfig | use_vlm, vlm_config, always_save_page_images, vlm, vlm_fallback |
| SemanticSectioningConfig | use_semantic_sectioning, llm_provider, model, language |
| ChunkingConfig | chunk_size, min_length_for_chunking |

Source: dsrag/dsparse/models/types.py

A typical class-based override looks like:

from dsrag.knowledge_base import KnowledgeBase
from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM

kb = KnowledgeBase(
    kb_id="my_kb",
    vlm_client=GeminiVLM(model="gemini-2.0-flash"),
)

Source: dsrag/dsparse/README.md

Extending the System

Custom components are introduced by subclassing the base class and passing an instance to the KnowledgeBase constructor. This is the same mechanism the bundled options use, and it is the recommended way to integrate with LangChain retrievers or local models that are not yet first-class. Community discussions (for example, the request to add a LangChain BaseRetriever subclass whose _get_relevant_documents method calls kb.query) follow exactly this pattern, as do proposals to surface additional local LLMs and embedding backends. The README's section on auto_context_config, semantic_sectioning_config, chunking_config, and rse_params documents the configuration surface that custom components must respect.

Source: README.md

Component Interaction at Query Time

flowchart LR
    Q[Query] --> KB[KnowledgeBase]
    KB --> VDB[VectorDB]
    VDB --> RR[Reranker]
    RR --> RSE[Relevant Segment Extraction]
    RSE --> CDB[ChunkDB]
    CDB --> Segs[Segments]
    KB -.uses.-> LLM[LLM / AutoContext]
    KB -.uses.-> FS[FileSystem]

The diagram illustrates the runtime contract: VectorDB returns candidate chunks, the Reranker re-orders them, and RSE reads the full text back from ChunkDB to assemble longer segments. The LLM participates in the offline pipeline (AutoContext headers, semantic sectioning) and optionally at response-generation time. The FileSystem is touched only during ingestion or when VLM parsing is enabled.

Source: README.md

Operational Notes and Failure Modes

  • Environment variables — The README and dsParse README call out that GEMINI_API_KEY is required for GeminiVLM and that a clear error is raised when it is missing. Custom components should follow the same pattern.
  • Optional dependencies — Vector databases are now distributed as optional install groups, mirroring how the pluggable architecture is intended to keep core installs lean.
  • Backward compatibility — The dsParse vlm/vlm_fallback class-based path supersedes the legacy provider/model dict path only when both are supplied, so existing configurations continue to work.
  • Cross-platform — The component abstraction keeps heavy native dependencies (for example, the poppler binary needed for VLM PDF parsing) out of the core install path.

Source: README.md, dsrag/dsparse/README.md

See Also

  • KnowledgeBase Object
  • AutoContext and Contextual Chunk Headers
  • Relevant Segment Extraction (RSE)
  • Semantic Sectioning and Chunking
  • dsParse Multimodal File Parsing

Core Retrieval Innovations

Related topics: Overview & System Architecture, Pluggable Retrieval Components

dsRAG is a retrieval engine for unstructured data, especially dense text like financial reports, legal documents, and academic papers. Three retrieval innovations sit at the heart of its pipeline and drive its reported jump from ~32% to ~96.6% accuracy on the FinanceBench benchmark (Source: README.md). This page documents those three mechanisms: Semantic Sectioning, AutoContext (contextual chunk headers), and Relevant Segment Extraction (RSE), plus how they are configured through the KnowledgeBase API.

Architecture Overview

The three innovations are applied at distinct stages of the retrieval pipeline. Semantic Sectioning reshapes the document before chunking, AutoContext enriches the chunk text before embedding, and RSE reconstructs longer passages after the vector + reranker search.

flowchart LR
    A[Raw Document] --> B[Semantic Sectioning]
    B --> C[Section-aware Chunking]
    C --> D[AutoContext Headers]
    D --> E[VectorDB + Reranker Search]
    E --> F[Relevant Segment Extraction]
    F --> G[Final Segments to LLM]

Source: README.md, dsrag/rse.py

Semantic Sectioning

Semantic sectioning is an offline, ingest-time step that uses an LLM to break a document into "semantically cohesive" sections. The implementation annotates the document with line numbers and prompts the LLM to identify the starting line for each section. Sections are expected to span a few paragraphs to a few pages, and the LLM also produces descriptive titles for each (Source: README.md).
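
The line-number mechanism can be sketched without an LLM. Below, annotate_lines produces the numbered text a prompt would contain, and sections_from_starts turns the model's answer (start line plus title) into bounded sections. Both function names and the "[n]" annotation format are assumptions for illustration, not the code in semantic_sectioning.py.

```python
from typing import Dict, List, Tuple, Union


def annotate_lines(lines: List[str]) -> str:
    """Prefix each line with its index so the LLM can refer to lines by number."""
    return "\n".join(f"[{index}] {line}" for index, line in enumerate(lines))


def sections_from_starts(
    num_lines: int, starts: List[Tuple[int, str]]
) -> List[Dict[str, Union[int, str]]]:
    """Convert (start_line, title) pairs into sections with inclusive end lines."""
    ordered = sorted(starts)
    sections = []
    for position, (start, title) in enumerate(ordered):
        if position + 1 < len(ordered):
            # A section ends on the line before the next section starts.
            end = ordered[position + 1][0] - 1
        else:
            end = num_lines - 1
        sections.append({"title": title, "start": start, "end": end})
    return sections


lines = ["Intro text", "More intro", "Still intro", "Results text", "More results"]
prompt_body = annotate_lines(lines)
sections = sections_from_starts(len(lines), [(0, "Introduction"), (3, "Results")])
```

Asking only for starting lines keeps the model's output short: the end of each section is implied by the start of the next one.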

The behavior is encoded in dsrag/dsparse/sectioning_and_chunking/semantic_sectioning.py. The SemanticSectioningConfig typed dict in dsrag/dsparse/models/types.py exposes use_semantic_sectioning, llm_provider, model, and language as the relevant parameters. Document text is processed in roughly 5,000-token mega-chunks in parallel, so even multi-hundred-page documents finish in 5–10 seconds. The default model is gpt-4o-mini, and gemini-2.0-flash or claude-3-5-haiku-latest are also listed as compatible options (Source: README.md).

The downstream benefit is twofold: section titles feed into the AutoContext headers, and section boundaries constrain the chunker in dsrag/dsparse/sectioning_and_chunking/chunking.py so that chunks do not straddle unrelated topics.
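
A minimal sketch of section-constrained chunking, assuming chunk_size is measured in characters and that text shorter than min_length_for_chunking is kept whole. The packing strategy (greedy, line by line) is a simplification chosen for illustration and is not taken from chunking.py.

```python
from typing import List, Tuple


def chunk_section(text: str, chunk_size: int, min_length_for_chunking: int) -> List[str]:
    """Greedily pack lines into chunks of at most chunk_size characters."""
    if len(text) < min_length_for_chunking:
        return [text]
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n"):
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = paragraph  # an oversized paragraph becomes its own chunk
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_document(
    sections: List[str], chunk_size: int, min_length_for_chunking: int
) -> List[Tuple[int, str]]:
    """Chunk each section independently so no chunk crosses a section boundary."""
    return [
        (section_index, chunk)
        for section_index, section in enumerate(sections)
        for chunk in chunk_section(section, chunk_size, min_length_for_chunking)
    ]


chunks = chunk_document(["aaaa\nbbbb\ncccc", "dd"], chunk_size=9, min_length_for_chunking=5)
```

Because the chunker is invoked once per section, the last chunk of one section is never merged with the first lines of the next, even when both are short.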

AutoContext (Contextual Chunk Headers)

AutoContext solves a well-known embedding problem: an isolated chunk often lacks the context needed to embed it accurately. The auto_context module in dsrag/auto_context.py generates a textual header per chunk containing document-level context (title, summary) and section-level context (section title and, optionally, a section summary), then prepends it before embedding. The chunk used for retrieval is the header plus the original text, while the stored chunk body remains the original (Source: README.md).
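
The header-plus-body idea can be sketched as two small functions. The header wording and both function names are hypothetical; the actual template used in dsrag/auto_context.py may differ.

```python
def build_chunk_header(
    document_title: str,
    document_summary: str = "",
    section_title: str = "",
    section_summary: str = "",
) -> str:
    """Assemble a contextual header from whichever pieces of context are available."""
    document_part = f"Document context: this excerpt is from a document titled '{document_title}'."
    if document_summary:
        document_part += f" {document_summary}"
    parts = [document_part]
    if section_title:
        section_part = f"Section context: this excerpt is from the section titled '{section_title}'."
        if section_summary:
            section_part += f" {section_summary}"
        parts.append(section_part)
    return "\n".join(parts)


def text_for_embedding(header: str, chunk_text: str) -> str:
    """Only the embedded text carries the header; the stored chunk body stays unchanged."""
    return f"{header}\n\n{chunk_text}"


chunk_text = "Net revenue increased 10% year over year."
header = build_chunk_header("Example Corp Annual Report", section_title="Results of Operations")
embedded_text = text_for_embedding(header, chunk_text)
```

Keeping the header out of the stored body means the text later handed to the generator is the original passage, while the vector reflects the surrounding context.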

Key configuration flags exposed in KnowledgeBase.add_document include:

| Flag | Effect |
| --- | --- |
| use_generated_title | Use an LLM-generated document title instead of the user-supplied one |
| get_document_summary | Include an LLM-generated document summary in each header |
| get_section_summaries | Include LLM-generated section summaries in each header |
| document_title_guidance, section_summarization_guidance | Free-form prompt guidance strings |

Source: README.md

Because AutoContext is an LLM call per document (and sometimes per section), there is open community interest in using locally hosted models such as Llama 3-8B for this step (Source: issue #5). The LLM client is currently configured through the KnowledgeBase constructor, and the same concern — that the provider abstraction should make local model backends pluggable — also appears in requests for sentence-transformers embeddings (Source: issue #6).

Relevant Segment Extraction (RSE)

RSE is a query-time post-processing step. The vector search plus reranker returns a ranked list of individual chunks, but many real questions are answered by a contiguous block of text longer than one chunk. RSE clusters the top-ranked chunks, scores contiguous runs, and emits the highest-scoring runs as "segments" (Source: dsrag/rse.py, README.md).

The algorithm relies on ChunkDB to resolve the full text of each chunk by (doc_id, chunk_index), then applies a length-aware scoring heuristic so longer contiguous runs are not unfairly penalized. The main tuning knobs live under rse_params in the KnowledgeBase configuration:

| Parameter | Purpose |
| --- | --- |
| max_length | Maximum length of a single segment, in chunks |
| overall_max_length | Maximum total length across all returned segments |
| minimum_value | Relevance floor below which a segment is dropped |
| irrelevant_chunk_penalty | Penalty (0–1) applied when a low-scoring chunk sits inside a run |
| overall_max_length_extension | Per-query extension to overall_max_length |
| decay_rate | Exponential decay applied across segment boundaries |
| top_k_for_document_selection | How many distinct documents to consider |
| chunk_length_adjustment | Whether to scale chunk scores by chunk length before segment scoring |

Source: README.md
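
The interaction of the main knobs can be illustrated with a simplified, single-document sketch: each chunk's value is its relevance minus irrelevant_chunk_penalty, and contiguous runs with the highest summed value are picked greedily. This is an illustration under those assumptions, not the implementation in dsrag/rse.py; decay_rate, top_k_for_document_selection, and chunk_length_adjustment are left out.

```python
from typing import List, Tuple


def best_segments(
    relevance: List[float],
    irrelevant_chunk_penalty: float,
    max_length: int,
    overall_max_length: int,
    minimum_value: float,
) -> List[Tuple[int, int]]:
    """Greedily pick non-overlapping (start, end) chunk ranges; end is exclusive."""
    # Subtracting the penalty makes weakly relevant chunks negative.
    values = [score - irrelevant_chunk_penalty for score in relevance]
    segments: List[Tuple[int, int]] = []
    total_length = 0
    while total_length < overall_max_length:
        remaining = overall_max_length - total_length
        best, best_value = None, float("-inf")
        for start in range(len(values)):
            longest = min(max_length, remaining, len(values) - start)
            for end in range(start + 1, start + longest + 1):
                # A run gains nothing from starting or ending on a negative chunk.
                if values[start] < 0 or values[end - 1] < 0:
                    continue
                if any(start < e and s < end for s, e in segments):
                    continue  # overlaps a segment that was already selected
                value = sum(values[start:end])
                if value > best_value:
                    best, best_value = (start, end), value
        if best is None or best_value < minimum_value:
            break
        segments.append(best)
        total_length += best[1] - best[0]
    return segments


segments = best_segments(
    relevance=[0.1, 0.9, 0.1, 0.8, 0.0, 0.0, 0.7],
    irrelevant_chunk_penalty=0.2,
    max_length=4,
    overall_max_length=5,
    minimum_value=0.4,
)
```

In this example chunks 1 through 3 are returned as one segment even though chunk 2 is weak on its own: the penalty it contributes is smaller than the gain from joining its two strong neighbours.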

Because RSE re-reads chunk text from ChunkDB, it is sensitive to how chunks are stored. The metadata layer in dsrag/metadata.py is what makes a segment addressable and presentable to the downstream generator.

How the Three Innovations Combine

The combined effect of section-aware chunking, contextual headers, and segment-level retrieval is the basis of dsRAG's reported benchmark results. The README's KITE table compares vanilla top-k retrieval, RSE alone, contextual headers alone, and the combined default configuration; the combined configuration is uniformly the strongest across the AI Papers, BVP Cloud 10-Ks, Sourcegraph handbook, and Supreme Court opinions datasets (Source: README.md).

For users who want to consume this pipeline from another framework, the recommended pattern is to subclass the host framework's retriever base class and call kb.query from _get_relevant_documents, so the same three innovations are reused unchanged (Source: issue #4). This is also what dsrag/auto_query.py is designed to support for query-side enrichment.
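
A sketch of that adapter pattern follows. To stay self-contained it does not import LangChain: Document and FakeKnowledgeBase are hypothetical stand-ins, and the shape of the kb.query call and of its results (doc_id, content, score) is an assumption to verify against the dsRAG README. In a real integration the adapter class would inherit from the host framework's retriever base class.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for the host framework's document type."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class FakeKnowledgeBase:
    """Hypothetical stand-in; the real kb.query signature and result keys may differ."""

    def query(self, search_queries: List[str]) -> List[Dict[str, Any]]:
        return [
            {"doc_id": "doc-1", "content": f"segment for: {search_queries[0]}", "score": 0.9}
        ]


class KnowledgeBaseRetriever:
    """Adapter that delegates retrieval to kb.query and reshapes the results."""

    def __init__(self, kb: Any) -> None:
        self.kb = kb

    def _get_relevant_documents(self, query: str) -> List[Document]:
        segments = self.kb.query([query])
        return [
            Document(
                page_content=segment["content"],
                metadata={"doc_id": segment["doc_id"], "score": segment["score"]},
            )
            for segment in segments
        ]


retriever = KnowledgeBaseRetriever(FakeKnowledgeBase())
documents = retriever._get_relevant_documents("What drove revenue growth?")
```

The adapter holds no retrieval logic of its own, so sectioning, AutoContext, and RSE behave exactly as they do when kb.query is called directly.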

See Also

  • KnowledgeBase component reference (VectorDB, ChunkDB, Embedding, Reranker, LLM, FileSystem)
  • dsParse multimodal file parsing
  • VLM client configuration and GeminiVLM fallback patterns
  • Known issue: google.generativeai direct import in dsrag/llm.py (issue #127)

dsParse: Multimodal File Parsing & VLM Integration

Related topics: Overview & System Architecture, Pluggable Retrieval Components

Overview

dsParse is a sub-module of dsrag that performs multimodal file parsing, semantic sectioning, and chunking. It accepts a file path plus configuration and returns clean, structured chunks ready for embedding and retrieval. The module can be used on its own via the dsparse pip package, or transparently through a KnowledgeBase by setting use_vlm=True in file_parsing_config. Source: dsrag/dsparse/README.md.

The core motivation behind dsParse is to handle documents where vanilla text extraction fails or loses fidelity: scanned PDFs, complex layouts, dense tables, and figures. By delegating page understanding to a Vision Language Model (VLM), dsParse produces descriptions of visual content and structurally accurate text, which the dsParse README credits with improving downstream retrieval. Source: dsrag/dsparse/README.md.

Architecture and Data Flow

When VLM parsing is enabled, dsParse converts each PDF page into an image (via poppler), sends the image and a structured prompt to a VLM, and receives a categorized list of page elements. Elements are typed, optionally described (for visuals), and then concatenated into lines that flow into the semantic sectioner and chunker. The default VLM is gemini-2.0-flash, chosen for fast, cost-effective, near-SOTA performance. Source: dsrag/dsparse/README.md.

flowchart TD
    A[PDF / file_path] --> B[Poppler: convert to page images]
    B --> C[VLM Client: GeminiVLM]
    C --> D[Element list<br/>NarrativeText, Table, Figure, etc.]
    D --> E[Annotate with line numbers]
    E --> F[Semantic Sectioner LLM]
    F --> G[Sections w/ titles]
    G --> H[Chunker]
    H --> I[(sections, chunks)]
    I --> J[AutoContext chunk headers]
    J --> K[(Embedding + VectorDB)]

A non-VLM parsing path (non_vlm_file_parsing.py) remains available for cases where a VLM is undesirable (cost, latency, or platform restrictions). The selection between VLM and non-VLM parsing is controlled by the use_vlm flag inside FileParsingConfig. Source: dsrag/dsparse/models/types.py.

Element Types

Page content is sorted into eight element types by default: NarrativeText, Figure, Image, Table, Header, Footnote, Footer, and Equation. Users may define custom types by supplying an element_types list, or exclude existing ones. By default, Header and Footer are excluded because they rarely carry semantic value and disrupt cross-page flow. Source: dsrag/dsparse/README.md.
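
The default exclusion can be pictured as a simple filter over the element list returned for a page. The element dictionaries and the keep_elements helper are illustrative assumptions; only the type names and the Header/Footer default come from the description above.

```python
from typing import Dict, Iterable, List

DEFAULT_EXCLUDED = ("Header", "Footer")


def keep_elements(
    elements: List[Dict[str, str]],
    exclude_elements: Iterable[str] = DEFAULT_EXCLUDED,
) -> List[Dict[str, str]]:
    """Drop elements whose type is excluded, preserving page order."""
    excluded = set(exclude_elements)
    return [element for element in elements if element["type"] not in excluded]


page_elements = [
    {"type": "Header", "content": "ACME Corp | Annual Report"},
    {"type": "NarrativeText", "content": "Revenue grew in all regions."},
    {"type": "Table", "content": "Region | Revenue"},
    {"type": "Footer", "content": "Page 12"},
]
kept = keep_elements(page_elements)
```

Removing running headers and footers before concatenation keeps a sentence that continues across a page break from being interrupted by page furniture.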

The Element and Line types defined in models/types.py carry fields such as type, content, page_number, and is_visual — these flow through the pipeline and inform whether a chunk should include a visual description. Source: dsrag/dsparse/models/types.py.

Configuration

Configuration is exposed through three TypedDict groups: FileParsingConfig, SemanticSectioningConfig, and ChunkingConfig. All fields are optional and fall back to module-level defaults when omitted. Source: dsrag/dsparse/models/types.py.

| Config Group | Key Field | Purpose |
| --- | --- | --- |
| FileParsingConfig | use_vlm | Toggle VLM parsing (default False) |
| FileParsingConfig | vlm / vlm_fallback | Serialized class-based VLM clients |
| FileParsingConfig | vlm_config | Legacy dict: provider, model, dpi, concurrency |
| FileParsingConfig | always_save_page_images | Persist rasterized pages for reuse |
| SemanticSectioningConfig | use_semantic_sectioning | Toggle LLM-based sectioning |
| SemanticSectioningConfig | llm_provider / model | Sectioning LLM (default gpt-4o-mini) |
| ChunkingConfig | chunk_size | Max characters per chunk |
| ChunkingConfig | min_length_for_chunking | Skip chunking below this length |

Source: dsrag/dsparse/models/types.py.

VLM Clients

dsParse provides class-based VLM clients (e.g., GeminiVLM) that mirror the abstraction used for LLMs, embeddings, and rerankers. They support .to_dict() serialization so they can be persisted in configuration and rehydrated at runtime. Source: dsrag/dsparse/README.md.

A vlm_fallback client may be supplied alongside the primary client; the system alternates between them after the initial retries when needed. This mirrors the fallback patterns used elsewhere in dsRAG. Source: dsrag/dsparse/README.md.

Legacy dict-based configuration (e.g., vlm_config={"provider": "gemini", "model": "gemini-2.0-flash"}) remains fully supported. When both a serialized client and a legacy dict are supplied, the serialized client takes precedence. Source: dsrag/dsparse/README.md.
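
The precedence rule can be written down as a small resolver. The function name, the returned labels, and the contents of the serialized client dict are hypothetical, and placing a per-document client ahead of a KnowledgeBase-level client is an assumption; the only behavior taken from the text above is that a class-based client wins over the legacy vlm_config dict.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_vlm(
    file_parsing_config: Dict[str, Any],
    kb_vlm_client: Optional[Any] = None,
) -> Tuple[str, Any]:
    """Return (source_label, client_or_config) for the VLM that should be used."""
    per_document = file_parsing_config.get("vlm")
    if per_document is not None:
        return ("class_based_per_document", per_document)
    if kb_vlm_client is not None:
        return ("class_based_kb_level", kb_vlm_client)
    legacy = file_parsing_config.get("vlm_config")
    if legacy:
        return ("legacy_dict", legacy)
    return ("none", None)


legacy_only = {"vlm_config": {"provider": "gemini", "model": "gemini-2.0-flash"}}
both = {"vlm": {"model": "gemini-2.0-flash"}, **legacy_only}

source_when_both, _ = resolve_vlm(both)
source_when_legacy, _ = resolve_vlm(legacy_only)
```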

Usage Patterns

Standalone parsing. The parse_and_chunk function is the primary entry point and accepts file_path, file_parsing_config, and optionally a file_system parameter (e.g., LocalFileSystem(base_path="~/dsParse")) for persisting intermediate artifacts. Source: dsrag/dsparse/README.md.

KnowledgeBase integration. A VLM client can be attached at the KB level via KnowledgeBase(..., vlm_client=GeminiVLM(model="...")), then overridden per document by passing a serialized client under file_parsing_config["vlm"]. This is the recommended pattern when different documents need different models or cost profiles. Source: dsrag/dsparse/README.md.

Reusing pre-extracted images. When page images already exist in the configured FileSystem directory, pass vlm_config={"images_already_exist": True} to skip rasterization and avoid redundant API calls. Source: dsrag/dsparse/README.md.

Cost, Latency, and Common Pitfalls

Cost. VLM parsing with gemini-2.0-flash is approximately $0.10 per 1000 pages (assuming 4 × 258-token image tiles per page at standard DPI plus a ~500-token prompt). Semantic sectioning with gpt-4o-mini is roughly $0.15 per 1000 pages. Source: dsrag/dsparse/README.md.

Latency. A single page takes ~15–20 seconds for VLM parsing. Documents are page-parallelized within the rate-limit budget (vlm_max_concurrent_requests). Sectioning operates on ~5000-token mega-chunks (≈10 pages) processed in parallel, typically completing a few-hundred-page document in 5–10 seconds. Source: dsrag/dsparse/README.md.
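
Bounded page-level parallelism can be sketched with an asyncio semaphore. This is one way to respect a concurrency budget, not necessarily how dsParse schedules its requests; parse_page is a placeholder for the VLM call and the sleep stands in for network latency.

```python
import asyncio
from typing import List


async def parse_page(page_number: int) -> str:
    """Placeholder for a VLM request; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return f"elements for page {page_number}"


async def parse_pages(page_numbers: List[int], vlm_max_concurrent_requests: int) -> List[str]:
    # The semaphore caps how many requests are in flight at any moment.
    semaphore = asyncio.Semaphore(vlm_max_concurrent_requests)

    async def bounded(page_number: int) -> str:
        async with semaphore:
            return await parse_page(page_number)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(bounded(page) for page in page_numbers))


results = asyncio.run(parse_pages([1, 2, 3, 4, 5], vlm_max_concurrent_requests=2))
```

With a per-page latency in the tens of seconds, total wall-clock time is governed mainly by the page count divided by the concurrency limit, which is why the limit is worth tuning against the provider's rate limits.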

Platform compatibility. dsParse depends on poppler for PDF rasterization (brew install poppler on macOS). Community issue #61 reports that some async runtimes (uvloop) fail on Windows, which is relevant when integrating dsParse into high-throughput async pipelines. Source: community issue #61.
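
One possible platform-conditional fallback is sketched below: use uvloop where it is installed and supported, and fall back to the standard asyncio loop otherwise. The helper name is hypothetical and this is not a patch taken from the dsRAG repository.

```python
import asyncio
import sys
from typing import Callable


def pick_loop_factory() -> Callable[[], asyncio.AbstractEventLoop]:
    """Return uvloop's loop factory when usable, else the standard asyncio one."""
    if sys.platform == "win32":
        # uvloop does not support Windows, so do not even try to import it.
        return asyncio.new_event_loop
    try:
        import uvloop
    except ImportError:
        return asyncio.new_event_loop
    return uvloop.new_event_loop


async def ping() -> str:
    return "ok"


loop = pick_loop_factory()()
try:
    result = loop.run_until_complete(ping())
finally:
    loop.close()
```

Choosing the factory at call time keeps uvloop an optional speed-up rather than a hard requirement of the pipeline.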

Optional dependencies. Community issue #127 highlights that some optional dependencies (e.g., google.generativeai) are imported eagerly elsewhere in the codebase rather than via LazyLoader. This is worth noting when assembling minimal environments for dsParse on its own. Source: community issue #127.
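
The lazy-import idea can be sketched in a few lines. LazyModule below is a hypothetical helper written for illustration; it is not the dsrag.utils.imports.LazyLoader implementation and its constructor arguments are assumptions.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Defer importing a module until one of its attributes is first used."""

    def __init__(self, module_name: str, install_hint: str = "") -> None:
        self._module_name = module_name
        self._install_hint = install_hint
        self._module: Optional[ModuleType] = None

    def __getattr__(self, attribute: str) -> Any:
        # Only reached for names not set in __init__, i.e. the module's own attributes.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as exc:
                raise ImportError(
                    f"{self._module_name} is required for this feature. {self._install_hint}"
                ) from exc
        return getattr(self._module, attribute)


json_module = LazyModule("json")
encoded = json_module.dumps({"a": 1})

# Creating the proxy for a missing package does not fail; only using it does.
missing = LazyModule("package_that_does_not_exist_xyz", "Install the optional extra first.")
```

With this pattern, importing the file that defines a provider client succeeds even when the provider's SDK is absent, and the error surfaces only if that provider is actually used.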

Environment variables. GEMINI_API_KEY is required for GeminiVLM; a clear error is raised at instantiation if missing. Other providers are expected to follow the same convention. Source: dsrag/dsparse/README.md.
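
A custom client could follow the same convention with a small check at construction time. The exception class, function name, and message below are hypothetical; only the GEMINI_API_KEY variable name comes from the text above.

```python
import os
from typing import Mapping, Optional


class MissingAPIKeyError(RuntimeError):
    """Raised when a required API key is absent from the environment."""


def require_api_key(
    env_var: str = "GEMINI_API_KEY",
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the key, or fail early with a message that names the variable."""
    source = os.environ if environ is None else environ
    key = source.get(env_var, "")
    if not key:
        raise MissingAPIKeyError(f"{env_var} is not set; export it before creating the client.")
    return key


# Passing a mapping explicitly makes the check easy to exercise in tests.
key = require_api_key(environ={"GEMINI_API_KEY": "example-key"})
```

Failing at instantiation rather than on the first request keeps a missing key from surfacing midway through a long ingestion run.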

See Also

  • KnowledgeBase & Components — how dsParse integrates with the broader retrieval engine
  • AutoContext — uses section titles produced by semantic sectioning
  • Relevant Segment Extraction — consumes the chunk output

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 13 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

  • Severity: high
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/113

2. Configuration risk: Configuration risk requires verification

  • Severity: high
  • Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/117

3. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/73

4. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/127

5. Configuration risk: Configuration risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/116

6. Capability evidence risk: Capability evidence risk requires verification

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: capability.assumptions | https://github.com/D-Star-AI/dsRAG

7. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/124

8. Maintenance risk: Maintenance risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/D-Star-AI/dsRAG

9. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: No demo is recorded for this project (flag: no_demo).
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: downstream_validation.risk_items | https://github.com/D-Star-AI/dsRAG

10. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: No demo is recorded for this project (flag: no_demo).
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: risks.scoring_risks | https://github.com/D-Star-AI/dsRAG

11. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/118

12. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: Issue and pull-request quality is unknown (issue_or_pr_quality=unknown).
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/D-Star-AI/dsRAG

Source: Doramagic discovery, validation, and Project Pack records
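
Every item above recommends reproducing the install path in an isolated environment. A minimal sketch of that check follows, using Python's standard `venv` module. The package name `dsrag` is an assumption based on the project name, and the helper names (`venv_python`, `install_command`, `reproduce_install`) are hypothetical; confirm the actual install command against the project README before relying on it.

```python
import subprocess
import sys
import venv
from pathlib import Path

# Assumed PyPI package name; verify against the official README.
PACKAGE = "dsrag"


def venv_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_command(env_dir: Path, package: str = PACKAGE) -> list[str]:
    """Build the pip command that installs `package` into the environment."""
    return [str(venv_python(env_dir)), "-m", "pip", "install", package]


def reproduce_install(env_dir: Path) -> int:
    """Create a fresh environment, run the install, and return pip's exit code."""
    # clear=True wipes any previous contents so each run starts clean.
    venv.EnvBuilder(with_pip=True, clear=True).create(env_dir)
    result = subprocess.run(install_command(env_dir), check=False)
    return result.returncode
```

Calling `reproduce_install(Path("dsrag-check-env"))` needs network access and returns a non-zero code if the install fails, which gives a concrete signal to compare against the linked issues. The quickstart itself would then be run with the interpreter returned by `venv_python`.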

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 9 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using dsRAG with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence