Doramagic Project Pack · Human Manual
dsRAG
High-performance retrieval engine for unstructured data
Overview & System Architecture
Related topics: Pluggable Retrieval Components, Core Retrieval Innovations
Purpose and Scope
dsRAG is a retrieval engine for unstructured data, optimized for challenging queries over dense text such as financial reports, legal documents, and academic papers. According to the project README, on the FinanceBench benchmark dsRAG reaches 96.6% accuracy versus 32% for a vanilla RAG baseline. Source: README.md.
The system is organized around a single high-level abstraction — the KnowledgeBase object — that takes in raw documents, performs chunking and embedding, persists state to disk, and at query time returns the most relevant segments of text. The KnowledgeBase is configured by composing six pluggable components: VectorDB, ChunkDB, Embedding, Reranker, LLM, and FileSystem. Source: README.md.
A dedicated sub-module called dsParse handles multimodal file parsing, semantic sectioning, and chunking. It can be used standalone via pip install dsparse or transparently inside a KnowledgeBase by enabling use_vlm=True in the file-parsing configuration. Source: dsrag/dsparse/README.md.
High-Level Architecture
The system separates ingestion-time concerns (parsing, sectioning, chunking, embedding, persistence) from query-time concerns (vector search, reranking, relevant segment extraction). The data flow below reflects the canonical pipeline described in the README.
flowchart LR
A[Raw File or Text] --> B[dsParse<br/>VLM Parsing + Semantic Sectioning]
B --> C[AutoContext<br/>Contextual Chunk Headers]
C --> D[Embedding Model]
D --> E[(VectorDB)]
D --> F[(ChunkDB)]
G[User Query] --> H[Vector Search]
H --> E
H --> I[Reranker]
I --> J[Relevant Segment<br/>Extraction RSE]
F --> J
J --> K[LLM-generated Answer]
Three key methods drive the accuracy gains documented in the README benchmarks:
| Method | When it runs | What it does |
|---|---|---|
| Semantic Sectioning | Ingestion | LLM identifies semantically cohesive sections and titles them |
| AutoContext | Ingestion | Prepends document + section context to each chunk header |
| Relevant Segment Extraction (RSE) | Query time | Combines adjacent relevant chunks into longer segments |
Source: README.md.
Core Components
The README enumerates the six configurable components of a KnowledgeBase. Each can be replaced by a custom subclass of its base class. Source: README.md.
- VectorDB: stores embedding vectors and metadata. Available options include BasicVectorDB, WeaviateVectorDB, ChromaDB, QdrantVectorDB, MilvusDB, and PineconeDB. ChromaDB additionally supports metadata filtering at query time.
- ChunkDB: stores chunk text keyed on (doc_id, chunk_index). Options: BasicChunkDB and SQLiteDB. RSE reads from here to reconstruct full segments.
- Embedding: converts chunks (with AutoContext headers) to vectors. Options include OpenAIEmbedding, CohereEmbedding, VoyageAIEmbedding, and OllamaEmbedding. Community issue #6 requests broader local-model support such as sentence-transformers. Source: README.md.
- Reranker: re-scores retrieved chunks before RSE. Options: CohereReranker, VoyageReranker, NoReranker.
- LLM: used for AutoContext generation and (optionally) answer synthesis. Community issue #5 requests Llama 3-8B support for AutoContext.
- FileSystem: controls where intermediate files (e.g., VLM page images, elements.json) are persisted.
The configuration surface is typed via TypedDict definitions in dsrag/dsparse/models/types.py. For example, FileParsingConfig accepts use_vlm, vlm_config, always_save_page_images, plus optional serialized vlm and vlm_fallback clients. Source: dsrag/dsparse/models/types.py.
dsParse Sub-Module
dsParse is the document ingestion engine. Its main entry point is parse_and_chunk(kb_id, doc_id, file_path, file_parsing_config, ...), which returns a tuple (sections, chunks). It supports two parsing modes:
- Traditional text extraction: fast and inexpensive, suitable for PDFs with clean extractable text.
- VLM (vision language model) parsing: uses a multimodal model such as gemini-2.0-flash to OCR pages, categorize elements into types (NarrativeText, Figure, Image, Table, etc.), and describe visual elements. Requires the external poppler dependency for PDF → image conversion. Source: dsrag/dsparse/README.md.
VLMs now expose a class-based client abstraction (GeminiVLM, etc.) analogous to LLM/Embedding/Reranker. Instances can be attached to a KnowledgeBase via the vlm_client constructor parameter, or supplied per-document inside file_parsing_config["vlm"]. A fallback client can be configured through vlm_fallback. Legacy dict-based vlm_config remains supported for backward compatibility. The system prefers the class-based client when both are present. Source: dsrag/dsparse/README.md.
Community Considerations
Several open issues are directly relevant to the architecture:
- Dependency isolation (#127): dsrag/llm.py reportedly does an eager import google.generativeai instead of using dsrag.utils.imports.LazyLoader. This forces the dependency on every user even when Gemini is not used. Source: community context.
- Platform compatibility (#61): uvloop is unsupported on Windows, causing a RuntimeError for users on that OS. The runtime stack may need a platform-conditional fallback.
- Local embedding models (#6): users want sentence-transformers (and similar) as first-class Embedding components, which would make the system viable fully offline.
- Local LLMs for AutoContext (#5): Llama 3-8B is requested as an AutoContext LLM option. Currently the LLM component accepts provider/model pairs but does not enumerate local runners.
- LangChain interop (#4): proposed by subclassing LangChain's BaseRetriever and delegating _get_relevant_documents to kb.query.
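The local-model requests (#5, #6) map onto existing component slots (LLM, Embedding), and the LangChain request (#4) wraps kb.query from the outside, suggesting they can be addressed by adding new subclasses or adapters rather than restructuring the core architecture. The dependency and platform issues (#127, #61) concern packaging and runtime setup rather than the component model.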
Configuration Entry Points
The README documents the main configuration dictionaries passed to add_document:
- file_parsing_config: controls VLM usage, element exclusion, concurrency, and DPI.
- semantic_sectioning_config: selects the LLM provider/model and toggles sectioning.
- chunking_config: chunk_size and min_length_for_chunking.
- auto_context_config: toggles for document/section summary generation.
- rse_params: max_length, overall_max_length, minimum_value, irrelevant_chunk_penalty, decay_rate, top_k_for_document_selection, chunk_length_adjustment.
Persistence is automatic: the full KnowledgeBase configuration is written to a JSON file upon creation and update, making the object reconstructible across processes. Source: README.md.
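A minimal sketch of JSON-backed persistence, assuming one file per knowledge base named after its kb_id. The file layout, the function names save_kb_config and load_kb_config, and the example keys are all hypothetical; the README only states that the configuration is written to JSON.

```python
import json
from pathlib import Path


def save_kb_config(storage_dir: Path, kb_id: str, config: dict) -> Path:
    """Write the configuration to <storage_dir>/<kb_id>.json (assumed layout)."""
    path = storage_dir / f"{kb_id}.json"
    path.write_text(json.dumps(config, indent=2))
    return path


def load_kb_config(storage_dir: Path, kb_id: str) -> dict:
    """Read the configuration back, e.g. from a different process."""
    return json.loads((storage_dir / f"{kb_id}.json").read_text())
```

Because the stored form is plain JSON, every component has to be serializable to a dict, which is why the class-based clients expose a to_dict() method.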
See Also
- Semantic Sectioning, AutoContext, and RSE
- dsParse File Parsing
- KnowledgeBase Components and Customization
- Configuration Reference
Source: https://github.com/D-Star-AI/dsRAG / Human Manual
Pluggable Retrieval Components
Related topics: Overview & System Architecture, Core Retrieval Innovations, dsParse: Multimodal File Parsing & VLM Integration
dsRAG is built around a pluggable component architecture in which every part of the retrieval pipeline can be swapped without rewriting the surrounding logic. A KnowledgeBase is composed of six configurable components — VectorDB, ChunkDB, Embedding, Reranker, LLM, and FileSystem — plus optional helpers such as a VLM client. Each component has a default implementation, several built-in alternatives, and a documented extension point for fully custom subclasses.
Purpose and Scope
The pluggable design serves two goals. First, it lets users match infrastructure to deployment constraints (for example, switching from an in-memory BasicVectorDB to a managed PineconeDB without touching the embedding or chunking code). Second, it isolates dependency-heavy integrations behind a thin interface, so a project that only needs OpenAI embeddings is not forced to install every optional SDK. This pattern is reinforced throughout the codebase, including the dsParse sub-module, where VLM clients, file systems, and sectioning models follow the same class-based abstraction.
The top-level README states:
There are six key components that define the configuration of a KnowledgeBase, each of which are customizable: 1. VectorDB, 2. ChunkDB, 3. Embedding, 4. Reranker, 5. LLM, 6. FileSystem. There are defaults for each of these components, as well as alternative options included in the repo. You can also define fully custom components by subclassing the base classes and passing in an instance of that subclass to the KnowledgeBase constructor.
Source: README.md
The Six Core Components
VectorDB
Stores dense embedding vectors alongside a small amount of metadata. Built-in options include BasicVectorDB, WeaviateVectorDB, ChromaDB, QdrantVectorDB, MilvusDB, and PineconeDB. ChromaDB additionally supports metadata query filters using operators such as equals, not_equals, in, not_in, greater_than, less_than, greater_than_equals, and less_than_equals.
Source: README.md
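The operator list above can be illustrated with a small evaluator. This is a sketch only: it assumes a filter is a dict with field, operator, and value keys, and the function name matches is hypothetical. It does not show how ChromaDB itself applies the filter.

```python
import operator
from typing import Any, Callable, Dict

# Operator names are taken from the list above; the mapping is illustrative.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "in": lambda value, options: value in options,
    "not_in": lambda value, options: value not in options,
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "greater_than_equals": operator.ge,
    "less_than_equals": operator.le,
}


def matches(metadata: Dict[str, Any], metadata_filter: Dict[str, Any]) -> bool:
    """Return True if a chunk's metadata satisfies the (assumed) filter shape."""
    field = metadata_filter["field"]
    if field not in metadata:
        return False
    compare = OPERATORS[metadata_filter["operator"]]
    return compare(metadata[field], metadata_filter["value"])
```

For example, matches({"year": 2023}, {"field": "year", "operator": "greater_than", "value": 2020}) evaluates to True.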
ChunkDB
Stores the original text for every chunk in a nested dictionary keyed on doc_id and chunk_index. Relevant Segment Extraction (RSE) reads from this store to rebuild longer passages at query time. The supported options are BasicChunkDB and SQLiteDB.
Source: README.md
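The nested-dictionary layout can be sketched like this. The class InMemoryChunkStore and its method names are hypothetical and are not the BasicChunkDB API; the point is only to show why keying on doc_id and then chunk_index makes contiguous ranges cheap to reassemble.

```python
from typing import Dict, List


class InMemoryChunkStore:
    """Hypothetical chunk store: {doc_id: {chunk_index: chunk_text}}."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[int, str]] = {}

    def add_document(self, doc_id: str, chunks: List[str]) -> None:
        # Chunk indices follow the order in which the chunks were produced.
        self._data[doc_id] = dict(enumerate(chunks))

    def get_chunk_text(self, doc_id: str, chunk_index: int) -> str:
        return self._data[doc_id][chunk_index]

    def get_segment_text(self, doc_id: str, start: int, end: int) -> str:
        """Join chunks start..end-1 (end is exclusive) into one segment."""
        return "\n".join(self._data[doc_id][i] for i in range(start, end))
```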
Embedding
Defines the embedding model. Built-in clients include OpenAIEmbedding, CohereEmbedding, VoyageAIEmbedding, and OllamaEmbedding. The OllamaEmbedding option is the in-tree path for running embeddings against a local model server, which is relevant to community interest in sentence-transformers and other local models.
Source: README.md
Reranker
Re-ranks the top results returned by the vector store before RSE runs. Cohere and Voyage AI clients are shipped; the NoReranker class is provided for users who want to disable the rerank step entirely.
Source: README.md
LLM
Drives AutoContext contextual chunk headers, semantic sectioning, and (optionally) the response generation step. Models from OpenAI, Anthropic, and Gemini are supported.
Source: README.md
FileSystem
Abstracts where intermediate artefacts such as page images and elements.json are persisted. A LocalFileSystem ships in the dsParse sub-module and is consumed by parse_and_chunk through a file_system argument.
Source: dsrag/dsparse/README.md
VLM Clients and the Parsing Pipeline
The dsParse sub-module extends the same pluggable pattern to multimodal parsing. A class-based VLM client (for example, GeminiVLM) can be passed at the KnowledgeBase level via vlm_client=..., or per document through a serialized dict in file_parsing_config["vlm"]. A fallback client is supported through vlm_fallback. The legacy dict-based path (vlm_config with provider/model) is still supported; when both are provided, the class-based client takes precedence.
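The precedence rule can be sketched as a small resolver. The function name select_vlm_spec and the returned dict shape are hypothetical; only the key names vlm and vlm_config, and the rule that the class-based entry wins, come from the text above.

```python
from typing import Any, Dict


def select_vlm_spec(file_parsing_config: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the VLM description to use, preferring the class-based entry."""
    vlm = file_parsing_config.get("vlm")
    if vlm is not None:
        # A serialized class-based client takes precedence.
        return {"source": "class_based", "spec": vlm}
    legacy = file_parsing_config.get("vlm_config") or {}
    if "provider" in legacy and "model" in legacy:
        # Legacy provider/model pair, kept for backward compatibility.
        spec = {"provider": legacy["provider"], "model": legacy["model"]}
        return {"source": "legacy", "spec": spec}
    raise ValueError("no VLM configured: set 'vlm' or 'vlm_config'")
```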
The relevant typed configuration is defined in dsrag/dsparse/models/types.py:
| TypedDict | Notable keys |
|---|---|
| VLMConfig | provider, model, fallback_provider, fallback_model, exclude_elements, element_types, dpi, vlm_max_concurrent_requests |
| FileParsingConfig | use_vlm, vlm_config, always_save_page_images, vlm, vlm_fallback |
| SemanticSectioningConfig | use_semantic_sectioning, llm_provider, model, language |
| ChunkingConfig | chunk_size, min_length_for_chunking |
Source: dsrag/dsparse/models/types.py
A typical class-based override looks like:
    from dsrag.knowledge_base import KnowledgeBase
    from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM

    kb = KnowledgeBase(
        kb_id="my_kb",
        vlm_client=GeminiVLM(model="gemini-2.0-flash"),
    )
Source: dsrag/dsparse/README.md
Extending the System
Custom components are introduced by subclassing the base class and passing an instance to the KnowledgeBase constructor. This is the same mechanism the bundled options use, and it is the recommended way to integrate with LangChain retrievers or local models that are not yet first-class. Community discussions (for example, the request to add a LangChain BaseRetriever subclass whose _get_relevant_documents method calls kb.query) follow exactly this pattern, as do proposals to surface additional local LLMs and embedding backends. The README's section on auto_context_config, semantic_sectioning_config, chunking_config, and rse_params documents the configuration surface that custom components must respect.
Source: README.md
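The subclassing mechanism can be sketched without depending on dsRAG itself. In the example below, EmbeddingBase is a stand-in for the library's real base class, and the method name get_embeddings is an assumption; HashingEmbedding is a toy local model that needs no network access. A real custom component would subclass the actual base class and implement whatever methods it declares.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import List


class EmbeddingBase(ABC):
    """Stand-in for a library base class (hypothetical interface)."""

    @abstractmethod
    def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        ...


class HashingEmbedding(EmbeddingBase):
    """Toy offline embedding: counts tokens into hashed buckets."""

    def __init__(self, dimension: int = 8) -> None:
        self.dimension = dimension

    def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            vector = [0.0] * self.dimension
            for token in text.lower().split():
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                vector[digest[0] % self.dimension] += 1.0
            vectors.append(vector)
        return vectors
```

An instance of such a subclass would then be passed to the KnowledgeBase constructor in place of a bundled option.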
Component Interaction at Query Time
flowchart LR
Q[Query] --> KB[KnowledgeBase]
KB --> VDB[VectorDB]
VDB --> RR[Reranker]
RR --> RSE[Relevant Segment Extraction]
RSE --> CDB[ChunkDB]
CDB --> Segs[Segments]
KB -.uses.-> LLM[LLM / AutoContext]
KB -.uses.-> FS[FileSystem]
The diagram illustrates the runtime contract: VectorDB returns candidate chunks, the Reranker re-orders them, and RSE reads the full text back from ChunkDB to assemble longer segments. The LLM participates in the offline pipeline (AutoContext headers, semantic sectioning) and optionally at response-generation time. The FileSystem is touched only during ingestion or when VLM parsing is enabled.
Source: README.md
Operational Notes and Failure Modes
- Environment variables: The README and dsParse README call out that GEMINI_API_KEY is required for GeminiVLM and that a clear error is raised when it is missing. Custom components should follow the same pattern.
- Optional dependencies: Vector databases are now distributed as optional install groups, mirroring how the pluggable architecture is intended to keep core installs lean.
- Backward compatibility: The dsParse vlm/vlm_fallback class-based path supersedes the legacy provider/model dict path only when both are supplied, so existing configurations continue to work.
- Cross-platform: The component abstraction keeps heavy native dependencies (for example, the poppler binary needed for VLM PDF parsing) out of the core install path.
Source: README.md, dsrag/dsparse/README.md
See Also
- KnowledgeBase Object
- AutoContext and Contextual Chunk Headers
- Relevant Segment Extraction (RSE)
- Semantic Sectioning and Chunking
- dsParse Multimodal File Parsing
Source: https://github.com/D-Star-AI/dsRAG / Human Manual
Core Retrieval Innovations
Related topics: Overview & System Architecture, Pluggable Retrieval Components
dsRAG is a retrieval engine for unstructured data, especially dense text like financial reports, legal documents, and academic papers. Three retrieval innovations sit at the heart of its pipeline and drive its reported jump from ~32% to ~96.6% accuracy on the FinanceBench benchmark (Source: README.md). This page documents those three mechanisms: Semantic Sectioning, AutoContext (contextual chunk headers), and Relevant Segment Extraction (RSE), plus how they are configured through the KnowledgeBase API.
Architecture Overview
The three innovations are applied at distinct stages of the retrieval pipeline. Semantic Sectioning reshapes the document before chunking, AutoContext enriches the chunk text before embedding, and RSE reconstructs longer passages after the vector + reranker search.
flowchart LR
A[Raw Document] --> B[Semantic Sectioning]
B --> C[Section-aware Chunking]
C --> D[AutoContext Headers]
D --> E[VectorDB + Reranker Search]
E --> F[Relevant Segment Extraction]
F --> G[Final Segments to LLM]
Source: README.md, dsrag/rse.py
Semantic Sectioning
Semantic sectioning is an offline, ingest-time step that uses an LLM to break a document into "semantically cohesive" sections. The implementation annotates the document with line numbers and prompts the LLM to identify the starting line for each section. Sections are expected to span a few paragraphs to a few pages, and the LLM also produces descriptive titles for each (Source: README.md).
The behavior is encoded in dsrag/dsparse/sectioning_and_chunking/semantic_sectioning.py. The SemanticSectioningConfig typed dict in dsrag/dsparse/models/types.py exposes use_semantic_sectioning, llm_provider, model, and language as the relevant parameters. Document text is processed in roughly 5,000-token mega-chunks in parallel, so even multi-hundred-page documents finish in 5–10 seconds. The default model is gpt-4o-mini, and gemini-2.0-flash or claude-3-5-haiku-latest are also listed as compatible options (Source: README.md).
The downstream benefit is twofold: section titles feed into the AutoContext headers, and section boundaries constrain the chunker in dsrag/dsparse/sectioning_and_chunking/chunking.py so that chunks do not straddle unrelated topics.
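The line-numbering approach can be sketched in two steps: annotate the text before prompting, then turn the returned start lines into sections. The LLM call itself is omitted, and the function names and the {"title", "start"} response shape are assumptions for illustration, not the format used in semantic_sectioning.py.

```python
from typing import Dict, List


def annotate_lines(lines: List[str]) -> str:
    """Prefix each line with its index so an LLM can refer to line numbers."""
    return "\n".join(f"[{i}] {line}" for i, line in enumerate(lines))


def build_sections(lines: List[str], section_starts: List[Dict]) -> List[Dict]:
    """Turn start lines (as an LLM might return them) into full sections."""
    ordered = sorted(section_starts, key=lambda s: s["start"])
    sections = []
    for i, item in enumerate(ordered):
        # A section ends right before the next one starts (inclusive end).
        if i + 1 < len(ordered):
            end = ordered[i + 1]["start"] - 1
        else:
            end = len(lines) - 1
        sections.append({
            "title": item["title"],
            "start": item["start"],
            "end": end,
            "content": "\n".join(lines[item["start"]:end + 1]),
        })
    return sections
```

Asking only for start lines keeps the LLM output short and guarantees that the resulting sections cover the document without gaps or overlaps.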
AutoContext (Contextual Chunk Headers)
AutoContext solves a well-known embedding problem: an isolated chunk often lacks the context needed to embed it accurately. The auto_context module in dsrag/auto_context.py generates a textual header per chunk containing document-level context (title, summary) and section-level context (section title and, optionally, a section summary), then prepends it before embedding. The chunk used for retrieval is the header plus the original text, while the stored chunk body remains the original (Source: README.md).
Key configuration flags exposed in KnowledgeBase.add_document include:
| Flag | Effect |
|---|---|
| use_generated_title | Use an LLM-generated document title instead of the user-supplied one |
| get_document_summary | Include an LLM-generated document summary in each header |
| get_section_summaries | Include LLM-generated section summaries in each header |
| document_title_guidance, section_summarization_guidance | Free-form prompt guidance strings |
Source: README.md
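A minimal sketch of header construction, assuming the title and summaries have already been generated. The function names and the exact header wording are hypothetical; the project's real header template may differ. What the sketch preserves is the split described above: the header is prepended only for embedding, and the stored chunk text is left untouched.

```python
from typing import Optional


def build_chunk_header(
    document_title: str,
    document_summary: Optional[str] = None,
    section_title: Optional[str] = None,
    section_summary: Optional[str] = None,
) -> str:
    """Assemble document-level and section-level context into one header."""
    doc_part = f"Document: {document_title}"
    if document_summary:
        doc_part += f"\n{document_summary}"
    parts = [doc_part]
    if section_title:
        section_part = f"Section: {section_title}"
        if section_summary:
            section_part += f"\n{section_summary}"
        parts.append(section_part)
    return "\n\n".join(parts)


def text_for_embedding(header: str, chunk_text: str) -> str:
    """Text sent to the embedding model; chunk_text itself is stored as-is."""
    return f"{header}\n\n{chunk_text}"
```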
Because AutoContext is an LLM call per document (and sometimes per section), there is open community interest in using locally hosted models such as Llama 3-8B for this step (Source: issue #5). The LLM client is currently configured through the KnowledgeBase constructor, and the same concern — that the provider abstraction should make local model backends pluggable — also appears in requests for sentence-transformers embeddings (Source: issue #6).
Relevant Segment Extraction (RSE)
RSE is a query-time post-processing step. The vector search plus reranker returns a ranked list of individual chunks, but many real questions are answered by a contiguous block of text longer than one chunk. RSE clusters the top-ranked chunks, scores contiguous runs, and emits the highest-scoring runs as "segments" (Source: dsrag/rse.py, README.md).
The algorithm relies on ChunkDB to resolve the full text of each chunk by (doc_id, chunk_index), then applies a length-aware scoring heuristic so longer contiguous runs are not unfairly penalized. The main tuning knobs live under rse_params in the KnowledgeBase configuration:
| Parameter | Purpose |
|---|---|
| max_length | Maximum length of a single segment, in chunks |
| overall_max_length | Maximum total length across all returned segments |
| minimum_value | Relevance floor below which a segment is dropped |
| irrelevant_chunk_penalty | Penalty (0–1) applied when a low-scoring chunk sits inside a run |
| overall_max_length_extension | Per-query extension to overall_max_length |
| decay_rate | Exponential decay applied across segment boundaries |
| top_k_for_document_selection | How many distinct documents to consider |
| chunk_length_adjustment | Whether to scale chunk scores by chunk length before segment scoring |
Source: README.md
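The core idea behind these parameters can be sketched as a constrained search for high-value contiguous runs. This is a simplified illustration for a single document, not the algorithm in dsrag/rse.py: it uses only max_length, overall_max_length, minimum_value, and irrelevant_chunk_penalty, omits decay_rate and chunk_length_adjustment, and assumes relevance scores in [0, 1] with 0 for chunks that were not retrieved.

```python
from typing import List, Tuple


def chunk_values(relevance: List[float], irrelevant_chunk_penalty: float) -> List[float]:
    """Subtract the penalty so that irrelevant chunks get a negative value."""
    return [score - irrelevant_chunk_penalty for score in relevance]


def best_segments(
    values: List[float],
    max_length: int,
    overall_max_length: int,
    minimum_value: float,
) -> List[Tuple[int, int]]:
    """Greedily pick non-overlapping (start, end) runs, end exclusive."""
    segments: List[Tuple[int, int]] = []
    total = 0
    while total < overall_max_length:
        best, best_value = None, minimum_value
        for start in range(len(values)):
            last_end = min(start + max_length, len(values))
            for end in range(start + 1, last_end + 1):
                if end - start > overall_max_length - total:
                    break  # would exceed the remaining length budget
                if any(start < e and s < end for s, e in segments):
                    continue  # overlaps a segment that was already chosen
                value = sum(values[start:end])
                if value > best_value:
                    best, best_value = (start, end), value
        if best is None:
            break
        segments.append(best)
        total += best[1] - best[0]
    return segments
```

Because a mildly irrelevant chunk only costs its small negative value, a run such as relevant, weak, relevant can still outscore either relevant chunk alone, which is how adjacent chunks get merged into one segment.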
Because RSE re-reads chunk text from ChunkDB, it is sensitive to how chunks are stored. The metadata layer in dsrag/metadata.py is what makes a segment addressable and presentable to the downstream generator.
How the Three Innovations Combine
The combined effect of section-aware chunking, contextual headers, and segment-level retrieval is the basis of dsRAG's reported benchmark results. The README's KITE table compares vanilla top-k retrieval, RSE alone, contextual headers alone, and the combined default configuration; the combined configuration is uniformly the strongest across the AI Papers, BVP Cloud 10-Ks, Sourcegraph handbook, and Supreme Court opinions datasets (Source: README.md).
For users who want to consume this pipeline from another framework, the recommended pattern is to subclass the host framework's retriever base class and call kb.query from _get_relevant_documents, so the same three innovations are reused unchanged (Source: issue #4). This is also what dsrag/auto_query.py is designed to support for query-side enrichment.
See Also
- KnowledgeBase component reference (VectorDB, ChunkDB, Embedding, Reranker, LLM, FileSystem)
- dsParse multimodal file parsing
- VLM client configuration and GeminiVLM fallback patterns
- Known issue: google.generativeai direct import in dsrag/llm.py (issue #127)
Source: https://github.com/D-Star-AI/dsRAG / Human Manual
dsParse: Multimodal File Parsing & VLM Integration
Related topics: Overview & System Architecture, Pluggable Retrieval Components
Overview
dsParse is a sub-module of dsrag that performs multimodal file parsing, semantic sectioning, and chunking. It accepts a file path plus configuration and returns clean, structured chunks ready for embedding and retrieval. The module can be used on its own via the dsparse pip package, or transparently through a KnowledgeBase by setting use_vlm=True in file_parsing_config. Source: dsrag/dsparse/README.md.
The core motivation behind dsParse is to handle documents where vanilla text extraction fails or loses fidelity — scanned PDFs, complex layouts, dense tables, and figures. By delegating page understanding to a Vision Language Model (VLM), dsParse produces rich descriptions of visual content and structurally accurate text, dramatically improving downstream retrieval. Source: dsrag/dsparse/README.md.
Architecture and Data Flow
When VLM parsing is enabled, dsParse converts each PDF page into an image (via poppler), sends the image and a structured prompt to a VLM, and receives a categorized list of page elements. Elements are typed, optionally described (for visuals), and then concatenated into lines that flow into the semantic sectioner and chunker. The default VLM is gemini-2.0-flash, chosen for fast, cost-effective, near-SOTA performance. Source: dsrag/dsparse/README.md.
flowchart TD
A[PDF / file_path] --> B[Poppler: convert to page images]
B --> C[VLM Client: GeminiVLM]
C --> D[Element list<br/>NarrativeText, Table, Figure, etc.]
D --> E[Annotate with line numbers]
E --> F[Semantic Sectioner LLM]
F --> G[Sections w/ titles]
G --> H[Chunker]
H --> I[(sections, chunks)]
I --> J[AutoContext chunk headers]
J --> K[(Embedding + VectorDB)]
A non-VLM parsing path (non_vlm_file_parsing.py) remains available for cases where a VLM is undesirable (cost, latency, or platform restrictions). The selection between VLM and non-VLM parsing is controlled by the use_vlm flag inside FileParsingConfig. Source: dsrag/dsparse/models/types.py.
Element Types
Page content is categorized into eight categories by default: NarrativeText, Figure, Image, Table, Header, Footnote, Footer, and Equation. Users may define custom categories by supplying an element_types list, or exclude existing ones. By default, Header and Footer are excluded because they rarely carry semantic value and disrupt cross-page flow. Source: dsrag/dsparse/README.md.
The Element and Line types defined in models/types.py carry fields such as type, content, page_number, and is_visual — these flow through the pipeline and inform whether a chunk should include a visual description. Source: dsrag/dsparse/models/types.py.
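The exclusion step can be sketched as a filter that runs while elements are flattened into lines. The dict shapes below reuse the field names mentioned above (type, content, page_number, is_visual), but the function name, the element_type key, and the choice of which types count as visual are assumptions for illustration.

```python
from typing import Dict, List, Optional

DEFAULT_EXCLUDED = ["Header", "Footer"]  # excluded by default per the README
VISUAL_TYPES = ("Figure", "Image")       # assumption: which types are visual


def elements_to_lines(
    elements: List[Dict],
    exclude_elements: Optional[List[str]] = None,
) -> List[Dict]:
    """Drop excluded element types and split the rest into per-line records."""
    if exclude_elements is None:
        exclude_elements = DEFAULT_EXCLUDED
    excluded = set(exclude_elements)
    lines = []
    for element in elements:
        if element["type"] in excluded:
            continue
        for text in element["content"].split("\n"):
            lines.append({
                "content": text,
                "element_type": element["type"],
                "page_number": element["page_number"],
                "is_visual": element["type"] in VISUAL_TYPES,
            })
    return lines
```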
Configuration
Configuration is exposed through three TypedDict groups: FileParsingConfig, SemanticSectioningConfig, and ChunkingConfig. All fields are optional and fall back to module-level defaults when omitted. Source: dsrag/dsparse/models/types.py.
| Config Group | Key Field | Purpose |
|---|---|---|
| FileParsingConfig | use_vlm | Toggle VLM parsing (default False) |
| FileParsingConfig | vlm / vlm_fallback | Serialized class-based VLM clients |
| FileParsingConfig | vlm_config | Legacy dict: provider, model, dpi, concurrency |
| FileParsingConfig | always_save_page_images | Persist rasterized pages for reuse |
| SemanticSectioningConfig | use_semantic_sectioning | Toggle LLM-based sectioning |
| SemanticSectioningConfig | llm_provider / model | Sectioning LLM (default gpt-4o-mini) |
| ChunkingConfig | chunk_size | Max characters per chunk |
| ChunkingConfig | min_length_for_chunking | Skip chunking below this length |
Source: dsrag/dsparse/models/types.py.
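The "all fields optional, fall back to defaults" behavior can be sketched with a TypedDict declared with total=False. The class below mirrors only the two ChunkingConfig keys from the table; the default numbers are placeholders chosen for the example, not the project's actual defaults, and resolve_chunking_config is a hypothetical helper.

```python
from typing import Optional, TypedDict


class ChunkingConfig(TypedDict, total=False):
    # total=False makes every key optional for callers.
    chunk_size: int
    min_length_for_chunking: int


# Placeholder values for illustration only.
EXAMPLE_DEFAULTS: ChunkingConfig = {
    "chunk_size": 800,
    "min_length_for_chunking": 2000,
}


def resolve_chunking_config(config: Optional[ChunkingConfig] = None) -> ChunkingConfig:
    """Overlay user-supplied keys on top of the defaults."""
    merged: ChunkingConfig = {**EXAMPLE_DEFAULTS, **(config or {})}
    return merged
```

A caller can therefore pass {"chunk_size": 400} alone and still get a complete configuration back.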
VLM Clients
dsParse provides class-based VLM clients (e.g., GeminiVLM) that mirror the abstraction used for LLMs, embeddings, and rerankers. They support .to_dict() serialization so they can be persisted in configuration and rehydrated at runtime. Source: dsrag/dsparse/README.md.
A vlm_fallback client may be supplied alongside the primary client; the system alternates between them after the initial retries when needed. This mirrors the fallback patterns used elsewhere in dsRAG. Source: dsrag/dsparse/README.md.
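One way to read "alternates after the initial retries" is sketched below. The retry counts, the alternation order, and the function name call_with_fallback are assumptions; the dsParse README does not specify them here. Each client is modeled as a zero-argument callable that either returns text or raises.

```python
from typing import Callable, Optional


def call_with_fallback(
    primary: Callable[[], str],
    fallback: Optional[Callable[[], str]] = None,
    initial_retries: int = 2,
    max_attempts: int = 6,
) -> str:
    """Try the primary first, then alternate fallback and primary."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        if fallback is None or attempt < initial_retries:
            client = primary
        elif (attempt - initial_retries) % 2 == 0:
            client = fallback
        else:
            client = primary
        try:
            return client()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError("all VLM attempts failed") from last_error
```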
Legacy dict-based configuration (e.g., vlm_config={"provider": "gemini", "model": "gemini-2.0-flash"}) remains fully supported. When both a serialized client and a legacy dict are supplied, the serialized client takes precedence. Source: dsrag/dsparse/README.md.
Usage Patterns
Standalone parsing. The parse_and_chunk function is the primary entry point and accepts file_path, file_parsing_config, and optionally a file_system parameter (e.g., LocalFileSystem(base_path="~/dsParse")) for persisting intermediate artifacts. Source: dsrag/dsparse/README.md.
KnowledgeBase integration. A VLM client can be attached at the KB level via KnowledgeBase(..., vlm_client=GeminiVLM(model="...")), then overridden per document by passing a serialized client under file_parsing_config["vlm"]. This is the recommended pattern when different documents need different models or cost profiles. Source: dsrag/dsparse/README.md.
Reusing pre-extracted images. When page images already exist in the configured FileSystem directory, pass vlm_config={"images_already_exist": True} to skip rasterization and avoid redundant API calls. Source: dsrag/dsparse/README.md.
Cost, Latency, and Common Pitfalls
Cost. VLM parsing with gemini-2.0-flash is approximately $0.10 per 1000 pages (assuming 4 × 258-token image tiles per page at standard DPI plus a ~500-token prompt). Semantic sectioning with gpt-4o-mini is roughly $0.15 per 1000 pages. Source: dsrag/dsparse/README.md.
Latency. A single page takes ~15–20 seconds for VLM parsing. Documents are page-parallelized within the rate-limit budget (vlm_max_concurrent_requests). Sectioning operates on ~5000-token mega-chunks (≈10 pages) processed in parallel, typically completing a few-hundred-page document in 5–10 seconds. Source: dsrag/dsparse/README.md.
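Page-level parallelism under a request cap can be sketched with a bounded thread pool. This is an illustration of the idea only; dsParse's own concurrency mechanism may differ. The function parse_pages is hypothetical, and parse_page stands in for a single VLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

Page = TypeVar("Page")
Result = TypeVar("Result")


def parse_pages(
    pages: List[Page],
    parse_page: Callable[[Page], Result],
    vlm_max_concurrent_requests: int = 5,
) -> List[Result]:
    """Run parse_page over all pages with at most N requests in flight."""
    with ThreadPoolExecutor(max_workers=vlm_max_concurrent_requests) as pool:
        # map() returns results in input order, so page order is preserved.
        return list(pool.map(parse_page, pages))
```

With per-page latency in the range quoted above, total wall-clock time scales roughly with the number of pages divided by the concurrency cap.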
Platform compatibility. dsParse depends on poppler for PDF rasterization (brew install poppler on macOS). Community issue #61 reports that some async runtimes (uvloop) fail on Windows, which is relevant when integrating dsParse into high-throughput async pipelines. Source: community issue #61.
Optional dependencies. Community issue #127 highlights that some optional dependencies (e.g., google.generativeai) are imported eagerly elsewhere in the codebase rather than via LazyLoader. This is worth noting when assembling minimal environments for dsParse on its own. Source: community issue #127.
Environment variables. GEMINI_API_KEY is required for GeminiVLM; a clear error is raised at instantiation if missing. Other providers are expected to follow the same convention. Source: dsrag/dsparse/README.md.
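A custom client that follows the same fail-fast convention might look like the sketch below. ExampleVLM is a hypothetical class, not GeminiVLM; only the GEMINI_API_KEY variable name comes from the text above.

```python
import os


class ExampleVLM:
    """Hypothetical client that validates its API key at construction time."""

    def __init__(self, model: str, api_key_env: str = "GEMINI_API_KEY") -> None:
        api_key = os.environ.get(api_key_env)
        if not api_key:
            # Fail at instantiation, not in the middle of a long parsing job.
            raise ValueError(
                f"{api_key_env} is not set; export it before creating the client"
            )
        self.model = model
        self._api_key = api_key
```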
See Also
- KnowledgeBase & Components — how dsParse integrates with the broader retrieval engine
- AutoContext — uses section titles produced by semantic sectioning
- Relevant Segment Extraction — consumes the chunk output
Source: https://github.com/D-Star-AI/dsRAG / Human Manual
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 13 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/113
2. Configuration risk: Configuration risk requires verification
- Severity: high
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/117
3. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/73
4. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/127
5. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/116
6. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/D-Star-AI/dsRAG
7. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/124
8. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/D-Star-AI/dsRAG
9. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: No demo was found for this project (flag: no_demo).
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/D-Star-AI/dsRAG
10. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: No demo was found for this project (flag: no_demo).
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | https://github.com/D-Star-AI/dsRAG
11. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/118
12. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/D-Star-AI/dsRAG
Source: Doramagic discovery, validation, and Project Pack records
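The recommended check repeated in the items above (reproduce the install path in an isolated environment) can be sketched with the Python standard library alone. This is a minimal sketch, not an official dsRAG procedure: the helper names `venv_python`, `build_check_commands`, and `run_isolated_check` are hypothetical, and it assumes the PyPI package and the import name are both `dsrag`. It only verifies that the package installs and imports; the quickstart itself still has to be run by hand, typically with API keys configured.

```python
from __future__ import annotations

import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def venv_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def build_check_commands(env_dir: Path, package: str = "dsrag") -> list[list[str]]:
    """Build the install and import commands for the isolated check.

    Assumption: the distribution name and the import name are identical.
    """
    py = str(venv_python(env_dir))
    return [
        [py, "-m", "pip", "install", package],  # install into the fresh env
        [py, "-c", f"import {package}"],        # confirm the import resolves
    ]


def run_isolated_check(package: str = "dsrag") -> bool:
    """Create a throwaway venv, install the package, and try importing it.

    Needs network access. Returns True only if every command exits with 0.
    """
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "env"
        venv.EnvBuilder(with_pip=True).create(env_dir)
        for cmd in build_check_commands(env_dir, package):
            if subprocess.run(cmd).returncode != 0:
                return False
    return True
```

Calling `run_isolated_check()` returns `False` on the first failing step, which is enough to tell an installation problem apart from a later configuration or runtime problem. The temporary directory is deleted afterwards, so nothing leaks into the system interpreter.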
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using dsRAG with real data or production workflows.
- raise JSONDecodeError("Extra data", s, end) json.decoder.JSONDecodeError - github / github_issue
- llm.py directly imports google.generativeai instead of using LazyLoader - github / github_issue
- A bug at custom_term_mapping? - github / github_issue
- Is ChunkDB really needed? - github / github_issue
- WeaviateVectorDB fails to connect with Weaviate v4 client - missing grpc - github / github_issue
- sqlite3.OperationalError: no such column: model_response_status - github / github_issue
- About Performance of Semantic Chunk - github / github_issue
- Import "dsrag.document_parsing" from the README example couldn't be reso - github / github_issue
- Capability evidence risk requires verification - GitHub / issue
Source: Project Pack community evidence and pitfall evidence