paper-qa Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

paper-qa

High accuracy RAG for answering questions from scientific documents with citations

Overview and Quickstart

Related topics: Installation and CLI Usage, System Architecture and Data Flow

Introduction

PaperQA is a high-accuracy Retrieval-Augmented Generation (RAG) system designed for answering questions grounded in scientific literature. It combines LLM-driven literature search, full-text PDF parsing, evidence extraction, and answer synthesis into a single, configurable pipeline. The project is structured as a Python monorepo with a core paperqa package and several optional sub-packages that provide alternative PDF backends.

The repository exposes both a Python API and a pqa command-line interface. End users can build a local paper index, run ask against it, or use the agent flow that iteratively searches the web, downloads papers, gathers evidence, and produces a citation-bearing answer.

Core Architecture

The core agent workflow is implemented under src/paperqa/agents/. The flow is driven by an LLM tool-selector that calls into a typed Environment containing the question, the Docs collection, and the active Settings profile.

flowchart LR
    A[User Query] --> B[PQASession]
    B --> C[Agent / ToolSelector]
    C --> D[Search Tool]
    D --> E[Tantivy Index]
    C --> F[Gather Evidence]
    F --> G[LLM Summarize + Score]
    G --> H[Answer Synthesis]
    H --> I[AnswerResponse]

agent_query and ask orchestrate the agent loop defined in src/paperqa/agents/env.py. The environment constructs tools from the settings profile (make_tools) and an initial PQASession carrying the question and the Settings md5.
The Docs object holds parsed text and embeddings, populated either by reading files from paper_directory or by the agent downloading new PDFs.
build_index in src/paperqa/agents/search.py creates a Tantivy full-text index over the directory of papers, optionally synced with the filesystem. It validates files via index_settings.files_filter, which is the supported way to constrain which articles are indexed (community discussion #1330).
The summarization/answer step is split between a summary LLM and an answer LLM, both configured via Settings. JSON-graded summaries are requested when prompts.use_json is true, as seen in src/paperqa/configs/wikicrow.json and src/paperqa/configs/contracrow.json.

Installation and Plugin Packages

The project uses a uv-managed workspace. The core package installs the agent, the indexer, and the CLI. Four optional sub-packages provide alternative PDF parsers:

Sub-package	Backend	Notable extras
`paper-qa-pypdf`	PyPDF	`[media]` adds pypdfium2 screenshots; `[enhanced]` adds pdfplumber table parsing
`paper-qa-pymupdf`	PyMuPDF (fitz)	Block-level text extraction and full-page screenshots
`paper-qa-docling`	Docling	Layout-aware extraction for tables, figures, and formulas
`paper-qa-nemotron`	Nemotron-Parse VLM	NVIDIA-hosted VLM parser returning bbox-grounded text

Each sub-package exposes a parse_pdf_to_pages function with the same signature so it can be plugged into Settings as a PDFParserFn. The PyPDF README documents the media and enhanced extras directly, and the PyMuPDF, Docling, and Nemotron READMEs describe their respective backends.

Note on PDF readers: The PyPDF reader will raise OSError: cannot write mode CMYK as PNG when a PDF contains CMYK images (community bug #1310). For these cases, switch to the Docling or PyMuPDF backend, or pre-process the PDF.

Quickstart

1. Configure LLM credentials

PaperQA uses lmi (LiteLLM) as its model abstraction, so any provider LiteLLM supports (OpenAI, Azure OpenAI, Anthropic, Gemini, Ollama, llamafile, etc.) can be configured. Set the provider's API key as an environment variable — for example OPENAI_API_KEY, AZURE_OPENAI_API_KEY, or GEMINI_API_KEY — before running PaperQA. Community discussions #393, #378, #428, and #1044 confirm that the CLI looks for these standard variables; if none are set it falls back to defaults and may require an OpenAI key.
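
As a quick pre-flight check, the sketch below reports which of the provider variables named above are set. The helper `find_provider_keys` is hypothetical and not part of PaperQA; it only inspects the environment mapping it is given, and the list of variable names is limited to the three mentioned in this section.

```python
import os
from collections.abc import Mapping

# Variable names taken from the paragraph above; extend for other providers.
PROVIDER_KEY_VARS = ("OPENAI_API_KEY", "AZURE_OPENAI_API_KEY", "GEMINI_API_KEY")


def find_provider_keys(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the provider key variables that hold a non-empty value."""
    return [name for name in PROVIDER_KEY_VARS if environ.get(name, "").strip()]


if __name__ == "__main__":
    found = find_provider_keys()
    print("Configured provider keys:", found or "none")
```

Running it before the first pqa call makes a missing key visible early instead of surfacing as a provider error in the middle of a run.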

Memory: Importing paperqa.clients alone still pulls in lmi and adds roughly 100 MB of resident memory (community report #1317), so importing the submodule instead of the top-level package does not avoid the cost. Budget for it even if you only need the metadata helpers.

2. Build an index over a directory of papers

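from paperqa import Settings
from paperqa.agents import build_index

settings = Settings()
# Directory that holds the papers to index.
settings.agent.index.paper_directory = "./my_papers"
# Keep only PDF files; any callable that takes a path and returns a bool works.
settings.agent.index.files_filter = lambda p: p.suffix.lower() == ".pdf"

# Top-level await works in a notebook or async REPL; in a script, call this from a coroutine.
index = await build_index(settings=settings)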

build_index walks paper_directory, filters files through files_filter, and writes a Tantivy index plus a manifest. Re-running it syncs the index with the directory when sync_index_w_directory is enabled, and it warns once the file count exceeds WARN_IF_INDEXING_MORE_THAN.
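
The walk-and-filter step can be pictured with the standard library alone. This is a minimal sketch, not PaperQA's implementation: `select_files` is a hypothetical helper, and it only mirrors the described behaviour of walking paper_directory (recursively or not) and keeping the paths that the files_filter predicate accepts.

```python
from collections.abc import Callable
from pathlib import Path


def select_files(
    paper_directory: Path,
    files_filter: Callable[[Path], bool],
    recurse_subdirectories: bool = True,
) -> list[Path]:
    """Walk the directory and keep the regular files accepted by files_filter."""
    pattern = "**/*" if recurse_subdirectories else "*"
    return sorted(
        p for p in paper_directory.glob(pattern) if p.is_file() and files_filter(p)
    )
```

For example, `select_files(Path("./my_papers"), lambda p: p.suffix.lower() == ".pdf")` previews which files a PDF-only filter would admit before any indexing cost is paid.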

3. Ask a question

from paperqa import ask

answer = await ask(
    "What method was used for protein folding in this corpus?",
    settings=settings,
)
print(answer.answer)
print(answer.citations)

ask performs the full agent loop: searches the local index, optionally augments with web search and paper downloads, gathers evidence (chunk summaries), and synthesizes a cited answer via AnswerResponse, defined in src/paperqa/agents/models.py. The search-query prompt used to fan out into multiple queries is built by litellm_get_search_query in src/paperqa/agents/helpers.py.

4. Use a non-default reader

from paperqa_pymupdf import parse_pdf_to_pages

# Reuses the settings object from step 2; the parser is configured under Settings.parsing.
settings.parsing.parse_pdf = parse_pdf_to_pages

The same pattern works for paperqa_docling.parse_pdf_to_pages and paperqa_nemotron.parse_pdf_to_pages.

Common Failure Modes and Configuration Tips

Pickle in search indexes. Persisted indexes default to SearchDocumentStorage.PICKLE_COMPRESSED and are unsafe to load from untrusted sources (security report #1325). Only load indexes you built yourself, or switch to a non-pickle storage mode.
Selecting a subset of papers. Use Settings.agent.index.files_filter to control which files inside paper_directory are indexed, as asked in community question #1330. You can also set recurse_subdirectories = False to avoid picking up nested folders.
Local models. When using Ollama, llamafile, or other local servers, set both the llm and summary_llm model names in Settings to the LiteLLM route (for example ollama/llama3.1) and configure any required api_base. Community discussions #378 and #428 walk through common pitfalls with litellm.
Zotero and OpenReview sources. The paperqa.contrib package exposes ZoteroDB (src/paperqa/contrib/zotero.py) and OpenReviewPaperHelper (src/paperqa/contrib/openreview_paper_helper.py) for downloading PDFs from those sources into paper_directory before indexing.

Installation and CLI Usage

PaperQA ships as a monorepo composed of a core package and a set of optional PDF reader packages, all driven through a single command-line interface called pqa. This page describes how to install the runtime, how the CLI dispatches its subcommands, and where configuration and known operational issues are documented.

Installation Layout

The repository is organised so the core agent and search logic live in src/paperqa, while PDF parsing backends are extracted into four independently installable packages under packages/:

Package	Backend	Notes
`paper-qa-pypdf`	PyPDF	Default backend; extras `[media]` (pypdfium2) and `[enhanced]` (pdfplumber) enable figure/table extraction (packages/paper-qa-pypdf/README.md:1)
`paper-qa-docling`	Docling / docling-parse	Provides layout-aware parsing via `parse_pdf_to_pages` (packages/paper-qa-docling/src/paperqa_docling/__init__.py:1)
`paper-qa-pymupdf`	PyMuPDF	AGPLv3-licensed backend (packages/paper-qa-pymupdf/README.md:1)
`paper-qa-nemotron`	NVIDIA nemotron-parse VLM	Calls the nemotron-parse NIM/HF endpoint for vision-language parsing (packages/paper-qa-nemotron/README.md:1)

The core package itself pulls in litellm, which is imported unconditionally at module load time (Community issue #1317). Users who only want metadata-only flows via paperqa.clients.DocMetadataClient still incur the LMI/LiteLLM import cost (~100 MB), which is a known design limitation rather than an installation defect.

A representative setup — reproduced from community discussions such as #428 — uses mamba to create an isolated environment, then installs the core package in editable mode from a clone of the repository.

CLI Entry Point and Subcommands

The CLI dispatcher lives in src/paperqa/agents/__init__.py. When the module is executed as __main__, it sets a _INITIATED_FROM_CLI flag and routes on the first positional argument. The supported subcommands are view, ask, search, and index; any other input prints a brief help banner listing the commands (src/paperqa/agents/__init__.py:1).
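
The routing described above can be sketched with argparse. This is an illustration only: `dispatch` and the `pqa-sketch` program name are hypothetical, the returned strings are placeholders, and the real dispatcher accepts many more options than a single positional command.

```python
import argparse

# Subcommand names as listed in the paragraph above.
SUBCOMMANDS = ("view", "ask", "search", "index")


def dispatch(argv: list[str]) -> str:
    """Route on the first positional argument; anything else yields a help banner."""
    parser = argparse.ArgumentParser(prog="pqa-sketch")
    parser.add_argument("command", nargs="?", default="help")
    parser.add_argument("rest", nargs="*")
    args = parser.parse_args(argv)
    if args.command not in SUBCOMMANDS:
        return "Available commands: " + ", ".join(SUBCOMMANDS)
    return f"would run {args.command!r} with arguments {args.rest}"
```

Calling `dispatch(["index"])` selects the index branch, while `dispatch(["bogus"])` or an empty argument list returns the banner.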

The index subcommand delegates directly to build_index(args.index, args.directory, settings). build_index constructs a SearchIndex named via index_settings.name or _settings.get_index_name(), then walks index_settings.paper_directory (recursively if recurse_subdirectories is set) and applies index_settings.files_filter to each candidate path (src/paperqa/agents/search.py:1). If build=False and the index has no files, the function raises RuntimeError(f"Index {search_index.index_name} was empty, please rebuild it."), which is the common failure surface for users running pqa ask before pqa index.

The flow when answering questions is roughly:

flowchart LR
  A[pqa index] --> B[SearchIndex on disk]
  B --> C[pqa ask / pqa search]
  C --> D[agents.agent_query]
  D --> E[AnswerResponse]

Logging for the CLI is configured via configure_cli_logging(verbosity), which installs a Rich handler, calls Settings.parsing.configure_pdf_parser() to wire the chosen PDF backend, and applies LOG_VERBOSITY_MAP to suppress loquacious third-party loggers (src/paperqa/agents/__init__.py:1). Increasing the verbosity beyond zero prints the PaperQA version banner.

Configuration and Settings

Runtime configuration flows through the Pydantic Settings model. JSON presets under src/paperqa/configs/ (for example contracrow.json and wikicrow.json) override prompt templates, summary JSON schemas, and agent type. contracrow.json configures a ToolSelector agent with a gpt-4o-2024-08-06 LLM and a 500 s timeout, and enables the structured summary_json and summary_json_system prompts for contradiction detection (src/paperqa/configs/contracrow.json:1). The wikicrow.json preset switches the prompts to MLA-style citation extraction through structured_citation_prompt (src/paperqa/configs/wikicrow.json:1).

Helper modules wire external services to environment variables. ZoteroDB.__init__ reads ZOTERO_USER_ID and ZOTERO_API_KEY automatically and falls back to ~/.paperqa/zotero for PDF storage (src/paperqa/contrib/zotero.py:1). The OpenReview helper writes into the index's paper_directory and re-uses the Settings object for paths, model choice, and concurrency (src/paperqa/contrib/openreview_paper_helper.py:1).

Known Operational Issues

Several recurring failure modes are surfaced by community reports and should be planned for during deployment:

Pickle deserialization in persisted indexes: The default SearchDocumentStorage.PICKLE_COMPRESSED mode unpickles index files when they are loaded, so a poisoned index can execute arbitrary code on load (Community issue #1325). Treat shared indexes as untrusted input.
CMYK image crash: Indexing PDFs whose images use the CMYK color space raises OSError: cannot write mode CMYK as PNG from paperqa_pypdf/reader.py. The crash aborts pqa index instead of skipping the file (Community issue #1310).
Selective indexing: ask, agent_query, and build_index consume the entire paper_directory. Scoping is done through files_filter (and recurse_subdirectories); these APIs expose no separate per-file allow-list (Community issue #1330).
Non-OpenAI LLM setup: Users running local servers (Ollama, llamafile, vLLM) report friction because the CLI looks for an OpenAI key by default; switching to Gemini or open-source endpoints requires explicit Settings overrides (Community issues #378, #387, #1044).
LiteLLM import weight: Even metadata-only clients transitively import LiteLLM (Community issue #1317).
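
To make the pickle risk in the list above concrete, the following sketch shows that loading a pickle can call an arbitrary callable. It is generic Python behaviour, not PaperQA code: the `Poisoned` class and the `save_blob`/`load_blob` helpers are hypothetical, and zlib is only assumed here as a stand-in for whatever compression the real storage mode uses.

```python
import pickle
import zlib


class Poisoned:
    """Stand-in for a malicious object hidden in a shared index file."""

    def __reduce__(self):
        # On load, pickle calls print(...) instead of rebuilding a Poisoned
        # instance. A real attacker would substitute a harmful callable.
        return (print, ("payload ran during load",))


def save_blob(obj: object) -> bytes:
    """Pickle and compress an object, as a persisted index might."""
    return zlib.compress(pickle.dumps(obj))


def load_blob(blob: bytes) -> object:
    """Unsafe on untrusted input: unpickling may invoke arbitrary callables."""
    return pickle.loads(zlib.decompress(blob))
```

Calling `load_blob(save_blob(Poisoned()))` prints the payload message during loading, which is why an index from another party should be rebuilt locally rather than loaded.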

The AnswerResponse and SimpleProfiler models in src/paperqa/agents/models.py provide an SMS-style get_summary() helper and an @asynccontextmanager-based timer that emits [Profiling] lines consumed by downstream Google Cloud monitoring — useful when adding --verbose instrumentation to the CLI (src/paperqa/agents/models.py:1).
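
The timer pattern mentioned above can be sketched as follows. `TimerSketch` is a hypothetical class, not the SimpleProfiler source, and the exact layout of the log line is an assumption; only the `[Profiling]` prefix and the use of `@asynccontextmanager` come from the description.

```python
import asyncio
import logging
import time
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

logger = logging.getLogger("profiler_sketch")


class TimerSketch:
    """Records elapsed seconds per timer name and logs a [Profiling] line."""

    def __init__(self) -> None:
        self.durations: dict[str, list[float]] = {}

    @asynccontextmanager
    async def timer(self, name: str) -> AsyncIterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Runs even if the timed block raises, so failures are still measured.
            elapsed = time.perf_counter() - start
            self.durations.setdefault(name, []).append(elapsed)
            logger.info("[Profiling] | timer:%s | elapsed:%.3fs", name, elapsed)


async def demo() -> TimerSketch:
    profiler = TimerSketch()
    async with profiler.timer("gather_evidence"):
        await asyncio.sleep(0.01)  # stands in for real work
    return profiler
```

Wrapping each agent step in `async with profiler.timer(...)` keeps the timing code out of the step itself.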

System Architecture and Data Flow

Related topics: Agentic Workflow and Tools

PaperQA is a multi-stage literature QA system that ingests PDFs, builds a searchable index, and runs an LLM-driven agent that gathers evidence and produces cited answers. The codebase is organized into a small core, several optional PDF-reader packages, configuration presets, and agent modules.

High-Level Architecture

The system has four cooperating layers: Configuration, Document Ingestion, Search Indexing, and Agent Execution. Each layer is independently replaceable so the same Docs/Agent abstractions can drive very different deployments — from a local OpenAI-backed CLI to a self-hosted LLM served via LiteLLM (a recurring pain point tracked in issues such as #378, #387, #393, #428, and #1044).

flowchart LR
    A[Config Presets<br/>contracrow / wikicrow / debug] --> D[Settings]
    B[PDF Readers<br/>pypdf · docling · pymupdf · nemotron] --> E[ParsedText + ParsedMedia]
    F[External Sources<br/>Zotero · OpenReview] --> E
    E --> G[Tantivy SearchIndex<br/>via agents/search.py]
    G --> H[Agent Environment<br/>agents/env.py]
    H --> I[ToolSelector Agent<br/>search · gather · cite]
    I --> J[Answer + Citations]

Configuration Layer

Behavior is driven by JSON presets that populate a Settings object, including model identifiers, prompt templates, agent type, and evidence-retrieval knobs. For example, contracrow.json sets agent_llm to gpt-4o-2024-08-06, agent_type to ToolSelector, search_count to 12, and supplies an XML-formatted contradiction prompt that asks the model to label claims on an eleven-point scale (from "explicit contradiction" to "explicit agreement"). wikicrow.json follows the same shape but uses a Wikipedia-style qa prompt and an MLA-formatted citation post-processor (structured_citation_prompt). debug.json is a deliberately lean preset that downscales evidence_k, answer_length, and max_concurrent_requests for fast iteration.
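
One way to picture how a preset populates settings is a recursive overlay of JSON values onto defaults. This is a sketch under stated assumptions: the `overlay` helper, the default values, and the key nesting are hypothetical; only the agent_llm and search_count values come from the contracrow.json description above. The real Settings model is a Pydantic class with many more fields.

```python
import json
from typing import Any

# Hypothetical defaults; not the library's actual default values.
DEFAULTS: dict[str, Any] = {
    "agent": {"agent_llm": "default-model", "agent_type": "ToolSelector", "search_count": 8},
    "answer": {"evidence_k": 10},
}

# Values from the contracrow.json description above; the nesting is assumed.
PRESET_JSON = '{"agent": {"agent_llm": "gpt-4o-2024-08-06", "search_count": 12}}'


def overlay(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge override into a copy of base, leaving base untouched."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = value
    return merged


settings = overlay(DEFAULTS, json.loads(PRESET_JSON))
```

Keys the preset does not mention, such as evidence_k in this sketch, keep their default values.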

Document Ingestion Layer

Ingestion is pluggable. Each optional package under packages/ exposes a parse_pdf_to_pages function returning ParsedText plus optional ParsedMedia:

Package	Backend	Notes
`paper-qa-pypdf`	pypdf + optional pdfium/pdfplumber	Defines `MediaMode` (`NONE`, `FULL_PAGE`, `INDIVIDUAL`, `INDIVIDUAL_CLUSTERING`) and clusters bounding boxes via `cluster_bboxes`.
`paper-qa-docling`	Docling `StandardPdfPipeline` + `DoclingParseDocumentBackend`	Generates picture and table images, exposes `dpi` and `custom_pipeline_options`.
`paper-qa-pymupdf`	PyMuPDF	AGPLv3-licensed; added in the v2026.01.05 release.
`paper-qa-nemotron`	Nvidia nemotron-parse VLM (NIM)	Vision-language parsing; URLs and support matrix documented in the README.

External paper sources feed the same pipeline. ZoteroDB extends pyzotero.zotero.Zotero, reads ZOTERO_USER_ID / ZOTERO_API_KEY from the environment, and downloads PDFs to ~/.paperqa/zotero. OpenReviewPaperHelper uses an LLM (via litellm_get_search_query-style schema prompting) to pick up to 20 relevant submissions, then downloads PDFs into settings.agent.index.paper_directory.

Common failure modes the community has surfaced — for example, CMYK image crashes during pil_image.save(buf, format="PNG") in paperqa_pypdf/reader.py (issue #1310) — all sit inside this ingestion layer.

Search Indexing Layer

agents/search.py walks the configured paper_directory (recursively when recurse_subdirectories is enabled), applies index_settings.files_filter to each candidate, and feeds the survivors into a SearchIndex whose required fields are extended with title and year. The function honors a persisted manifest (finalize_manifest_file) so unchanged files are skipped on rebuild, and logs a warning if more than WARN_IF_INDEXING_MORE_THAN files are about to be indexed. Index scoping is the topic of issue #1330, where users want to query a curated subset rather than every PDF in a folder — the answer in code is to drive files_filter rather than mutate the directory.

Agent Execution Layer

agents/env.py constructs a PaperQAEnvironment that wraps a Docs object, an LLM model, a summary LLM, an embedding model, and (optionally) a session_id; the per-query state lives in a PQASession. It calls settings_to_tools to expose the configured tools, resets docs on each new query, and registers clinical_trial_status as a status_fn when ClinicalTrialsSearch is enabled. The from_task classmethod is a convenience entry point that builds a default Settings() plus an empty Docs() for one-shot use.

Helpers in agents/helpers.py include litellm_get_search_query, which validates that a custom template contains {count}, {question}, and {date} placeholders (replacing {date} with the current year) before falling back to a default keyword-generation prompt; and get_year, which returns the current year as a string for prompt interpolation. Note that importing paperqa eagerly pulls in lmi / LiteLLM, which costs roughly 100 MB even for metadata-only callers — tracked in issue #1317.
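
The placeholder check described above can be sketched in a few lines. `build_search_prompt` and the fallback wording are hypothetical; the sketch only reproduces the stated rule that a custom template must contain {count}, {question}, and {date}, with {date} filled by the current year, and that a default prompt is used otherwise.

```python
from datetime import datetime
from typing import Optional

REQUIRED_FIELDS = ("{count}", "{question}", "{date}")

# Hypothetical fallback wording; the library's default prompt differs.
DEFAULT_TEMPLATE = (
    "Generate {count} keyword searches for the question below. "
    "The current year is {date}.\n\nQuestion: {question}"
)


def build_search_prompt(
    question: str, count: int, template: Optional[str] = None
) -> str:
    """Fill the template, falling back to the default if a placeholder is missing."""
    if template is None or any(f not in template for f in REQUIRED_FIELDS):
        template = DEFAULT_TEMPLATE
    return template.format(count=count, question=question, date=datetime.now().year)
```

A template that omits any of the three placeholders is silently replaced in this sketch; a stricter variant could raise an error instead.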

End-to-End Data Flow

A config preset (or Settings() defaults) is loaded.
PDFs are sourced from a local directory, Zotero, or OpenReview and parsed by one of the reader packages.
Parsed text + media is chunked, embedded, and written to a Tantivy SearchIndex (manifest-tracked).
A user query enters the PaperQAEnvironment, which tracks state in a PQASession and exposes search, gather-evidence, paper-collection, and citation tools to a ToolSelector agent.
The agent iterates: searches the index, gathers evidence, follows citations inside evidence, and finally emits a cited answer.

For LLM selection, any provider supported by LiteLLM can be plugged into agent_llm / summary_llm via the config; community guidance in the issues above shows that Azure, Gemini, Ollama, and llamafile all work as long as the model string is correctly formatted and the required API key is exported.

Agentic Workflow and Tools

Related topics: System Architecture and Data Flow

Overview

The agentic workflow in PaperQA is built on top of the Aviary framework and is the central mechanism that drives multi-step literature question answering. Instead of running a single retrieve, summarize, and answer pass, the agent iteratively selects tools (searching for papers, gathering evidence, citing references, and generating answers) until it can confidently terminate. Source: src/paperqa/agents/main.py:1-90

The agent loop is exposed via run_aviary_agent() in src/paperqa/agents/main.py, which orchestrates a PaperQAEnvironment, a ToolSelector agent, and a set of Tool objects derived from the user's Settings. A timeout-bound rollout() is used to bound cost and wall-clock time. Source: src/paperqa/agents/main.py:30-90
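
Bounding a rollout by wall-clock time is commonly done with asyncio.wait_for. The sketch below shows that general technique and does not reproduce run_aviary_agent(): `run_with_timeout`, `slow_rollout`, and the "success" / "truncated" status strings are hypothetical names chosen for illustration.

```python
import asyncio


async def run_with_timeout(coro, timeout_s: float) -> tuple[str, object]:
    """Await coro; return ("success", result) or ("truncated", None) on timeout."""
    try:
        return "success", await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising.
        return "truncated", None


async def slow_rollout(delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # stands in for a sequence of tool calls
    return "answer"
```

With a 500 s budget, as in the shipped presets, a stalled provider call ends the run with a truncated status instead of hanging indefinitely.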

Agent Loop and Tool Set

Environment and state

PaperQAEnvironment (defined in src/paperqa/agents/env.py) wires the LLM, summary LLM, embedding model, and a Docs collection into an Aviary environment. Its reset() method clears the docs, constructs a fresh PQASession, and returns the initial message list plus the available tools. Source: src/paperqa/agents/env.py:60-150

A PQASession (src/paperqa/agents/models.py:1-80) holds the user's question, the config MD5, the accumulated evidence, the cited contexts, and the running cost/timing — the canonical state that the agent mutates step-by-step. AnswerResponse is the final structured output returned to callers.

The default tool set

The environment builds its tools via settings_to_tools(). By default this includes: search_papers, gather_evidence, gen_answer, gen_answer_stream, empty_evidence, and a termination tool that lets the LLM signal completion. Tools can be filtered through Settings.agent.tool_names (and tool_call_limit) so power users can disable, e.g., the streaming answer or add custom clinical-trial tools (see the ClinicalTrialsSearch.TOOL_FN_NAME branch in make_initial_state()). Source: src/paperqa/agents/env.py:80-130

The agent system prompt and behavior are driven by the Settings.agent block. The shipped presets contracrow.json and wikicrow.json (src/paperqa/configs/) encode a "search → gather → cite → gather more → answer → complete" loop with search_count: 12 and a 500 s timeout, demonstrating the recommended default cadence. Source: src/paperqa/configs/contracrow.json, src/paperqa/configs/wikicrow.json

Mermaid: agent loop

flowchart LR
    Q[User question] --> R[reset: build tools, init PQASession]
    R --> S[ToolSelector selects a tool]
    S -->|search_papers| SR[Tantivy / paper directory]
    S -->|gather_evidence| GE[Embeddings + LLM summary]
    S -->|gen_answer| AN[Compose answer w/ citations]
    S -->|complete| T[Terminate]
    SR --> S
    GE --> S
    AN --> S
    T --> OUT[AnswerResponse]
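
The loop in the diagram can be sketched as a small state machine. Everything here is illustrative: `SessionSketch`, `select_tool`, and `run_loop` are hypothetical, and the deterministic policy stands in for the LLM-driven ToolSelector, which chooses tools from the conversation rather than from fixed rules.

```python
from dataclasses import dataclass, field


@dataclass
class SessionSketch:
    """Toy version of the per-question state the agent mutates."""

    question: str
    papers: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    answer: str = ""


def select_tool(state: SessionSketch) -> str:
    """Fixed policy standing in for the LLM tool selector."""
    if not state.papers:
        return "search_papers"
    if not state.evidence:
        return "gather_evidence"
    if not state.answer:
        return "gen_answer"
    return "complete"


def run_loop(question: str, max_steps: int = 10) -> tuple[SessionSketch, list[str]]:
    """Call tools until the policy picks 'complete' or the step limit is hit."""
    state, trace = SessionSketch(question), []
    for _ in range(max_steps):
        tool = select_tool(state)
        trace.append(tool)
        if tool == "search_papers":
            state.papers.append("paper-1")
        elif tool == "gather_evidence":
            state.evidence.append(f"summary of {state.papers[0]}")
        elif tool == "gen_answer":
            state.answer = f"Answer citing {len(state.evidence)} evidence item(s)"
        else:  # "complete" terminates the loop
            break
    return state, trace
```

The `max_steps` argument plays the role of a tool-call limit: the loop stops even if the policy never selects the termination tool.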

Search Index Construction

The search_papers tool relies on a Tantivy index built by read_documents() in src/paperqa/agents/search.py:1-90. The function reads from Settings.agent.index.paper_directory, applies index_settings.files_filter, and warns when more than WARN_IF_INDEXING_MORE_THAN files are queued. A manifest file is written via maybe_get_manifest() so that downstream tools can detect which documents were actually considered. Source: src/paperqa/agents/search.py:30-80

Helpers in src/paperqa/agents/helpers.py, notably litellm_get_search_query(), prompt the LLM to emit {count} keyword queries (with year ranges) that the search tool then executes against the index. This indirection is what allows the agent to reformulate queries across iterations.

PDF Reading Backends (Plugins)

PaperQA ships several opt-in PDF reading packages that plug into the agent's document-adding path:

Package	Backend	Notes
`paper-qa-pypdf`	PyPDF	Default; supports clustered image extraction. Source: packages/paper-qa-pypdf/src/paperqa_pypdf/reader.py
`paper-qa-docling`	Docling	High-fidelity layout/table extraction. Source: packages/paper-qa-docling/src/paperqa_docling/reader.py
`paper-qa-pymupdf`	PyMuPDF	AGPL-licensed alternative. See packages/paper-qa-pymupdf/README.md
`paper-qa-nemotron`	Nemotron-Parse VLM	NVIDIA-hosted vision-language model. See packages/paper-qa-nemotron/README.md

Each backend exposes a parse_pdf_to_pages() function with a consistent signature, so the agent can swap parsers without code changes. Only the parser selected in Settings differs (the CLI wires it through Settings.parsing.configure_pdf_parser()).

Configuration, Cost, and Common Failure Modes

Configuring an alternative LLM

Community issues #393, #378, #428, and #1044 all ask how to point PaperQA at non-OpenAI providers (Azure OpenAI, Ollama/llamafile, Gemini). Because the agent goes through LiteLLM, any LiteLLM-supported provider can be selected by overriding Settings.agent.agent_llm and, where needed, the API key environment variables. For local servers (llamafile, ollama), users typically also override the embedding model. Source: src/paperqa/agents/main.py, src/paperqa/agents/helpers.py

Note: Issue #1317 reports that importing paperqa always loads LiteLLM (~100 MB), even when only DocMetadataClient is used. This is a known dependency-bloat issue that the LiteLLM-driven agentic workflow exacerbates.

Selecting which papers to index

Issue #1330 asks how to constrain the search corpus. The answer is in Settings.agent.index.files_filter and the paper_directory / recurse_subdirectories fields, which are honored by read_documents(). Source: src/paperqa/agents/search.py:40-80

Security and crash issues

Pickle deserialization (CVE-class): Issue #1325 documents that the default SearchDocumentStorage.PICKLE_COMPRESSED index format is unsafe to load from untrusted sources. Consider clearing or rebuilding the index when transferring it across trust boundaries.
CMYK images: Issue #1310 reports OSError: cannot write mode CMYK as PNG from paperqa_pypdf/reader.py. Workarounds include pre-converting images with Pillow or switching to the Docling backend, which tolerates CMYK.

External integrations

For users who maintain literature in Zotero or OpenReview, the contrib helpers in src/paperqa/contrib/zotero.py and src/paperqa/contrib/openreview_paper_helper.py download PDFs into Settings.agent.index.paper_directory so that a subsequent agent run can index and query them. The BGPT MCP/HTTP integration proposed in #1338 would follow the same pattern: register a new Tool whose Tool.fn calls BGPT and whose schema is exposed via settings_to_tools().

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 10 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Future-House/paper-qa/issues/1330

2. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Future-House/paper-qa/issues/1310

3. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/Future-House/paper-qa

4. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/Future-House/paper-qa

5. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/Future-House/paper-qa

6. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/Future-House/paper-qa

7. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Future-House/paper-qa/issues/399

8. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Future-House/paper-qa/issues/1325

9. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/Future-House/paper-qa

10. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/Future-House/paper-qa

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 project-level external discussion links are exposed on this manual page.

Review before install: open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using paper-qa with real data or production workflows.

Dependency Dashboard - github / github_issue
Integration idea: BGPT evidence retrieval tool - github / github_issue
Is it possible to choose which articles to index and query in the paper - github / github_issue
Insecure pickle deserialization in PaperQA2 persisted search indexes can - github / github_issue
LiteLLM unnecessary memory usage when only trying to use paper metadata - github / github_issue
CMYK images in PDFs crash indexing with OSError - github / github_issue
v2026.03.18 - github / github_release
v2026.03.12 - github / github_release
v2026.03.03 - github / github_release
v2026.02.27 - github / github_release
v2026.02.16 - github / github_release
v2026.01.05 - github / github_release

Source: Project Pack community evidence and pitfall evidence
