# https://github.com/Future-House/paper-qa Project Manual

Generated at: 2026-06-23 04:19:10 UTC

## Table of Contents

- [Overview and Quickstart](#page-1)
- [Installation and CLI Usage](#page-2)
- [System Architecture and Data Flow](#page-3)
- [Agentic Workflow and Tools](#page-4)

<a id='page-1'></a>

## Overview and Quickstart

### Related Pages

Related topics: [Installation and CLI Usage](#page-2), [System Architecture and Data Flow](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/Future-House/paper-qa/blob/main/README.md)
- [src/paperqa/__init__.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/__init__.py)
- [pyproject.toml](https://github.com/Future-House/paper-qa/blob/main/pyproject.toml)
- [src/paperqa/agents/search.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/search.py)
- [src/paperqa/agents/env.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/env.py)
- [src/paperqa/agents/helpers.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/helpers.py)
- [src/paperqa/agents/models.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/models.py)
- [src/paperqa/contrib/zotero.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/zotero.py)
- [src/paperqa/contrib/openreview_paper_helper.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/openreview_paper_helper.py)
- [packages/paper-qa-pypdf/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pypdf/README.md)
- [packages/paper-qa-pymupdf/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pymupdf/README.md)
- [packages/paper-qa-docling/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-docling/README.md)
- [packages/paper-qa-nemotron/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-nemotron/README.md)
- [src/paperqa/configs/wikicrow.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/wikicrow.json)
- [src/paperqa/configs/contracrow.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/contracrow.json)
</details>

# Overview and Quickstart

## Introduction

PaperQA is a high-accuracy Retrieval-Augmented Generation (RAG) system designed for answering questions grounded in scientific literature. It combines LLM-driven literature search, full-text PDF parsing, evidence extraction, and answer synthesis into a single, configurable pipeline. The project is structured as a Python monorepo with a core `paperqa` package and several optional sub-packages that provide alternative PDF backends.

The repository exposes both a Python API and a `pqa` command-line interface. End users can build a local paper index, run `ask` against it, or use the agent flow that iteratively searches the web, downloads papers, gathers evidence, and produces a citation-bearing answer.

## Core Architecture

The core agent workflow is implemented under `src/paperqa/agents/`. The flow is driven by an LLM tool-selector that calls into a typed `Environment` containing the question, the `Docs` collection, and the active `Settings` profile.

```mermaid
flowchart LR
    A[User Query] --> B[PQASession]
    B --> C[Agent / ToolSelector]
    C --> D[Search Tool]
    D --> E[Tantivy Index]
    C --> F[Gather Evidence]
    F --> G[LLM Summarize + Score]
    G --> H[Answer Synthesis]
    H --> I[AnswerResponse]
```

- `agent_query` and `ask` orchestrate the agent loop defined in [src/paperqa/agents/env.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/env.py). The environment constructs tools from the settings profile (`make_tools`) and an initial `PQASession` carrying the question and the `Settings` md5.
- The `Docs` object holds parsed text and embeddings, populated either by reading files from `paper_directory` or by the agent downloading new PDFs.
- `build_index` in [src/paperqa/agents/search.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/search.py) creates a Tantivy full-text index over the directory of papers, optionally synced with the filesystem. It validates files via `index_settings.files_filter`, which is the supported way to constrain which articles are indexed (community discussion #1330).
- The summarization/answer step is split between a summary LLM and an answer LLM, both configured via `Settings`. JSON-graded summaries are requested when `prompts.use_json` is true, as seen in [src/paperqa/configs/wikicrow.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/wikicrow.json) and [src/paperqa/configs/contracrow.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/contracrow.json).

## Installation and Plugin Packages

The project uses a uv-managed workspace. The core package installs the agent, the indexer, and the CLI. Four optional sub-packages provide alternative PDF parsers:

| Sub-package | Backend | Notable extras |
|---|---|---|
| `paper-qa-pypdf` | PyPDF | `[media]` adds pypdfium2 screenshots; `[enhanced]` adds pdfplumber table parsing |
| `paper-qa-pymupdf` | PyMuPDF (fitz) | Block-level text extraction and full-page screenshots |
| `paper-qa-docling` | Docling | Layout-aware extraction for tables, figures, and formulas |
| `paper-qa-nemotron` | Nemotron-Parse VLM | NVIDIA-hosted VLM parser returning bbox-grounded text |

Each sub-package exposes a `parse_pdf_to_pages` function with the same signature so it can be plugged into `Settings` as a `PDFParserFn`. The PyPDF README documents the `media` and `enhanced` extras directly, and the PyMuPDF, Docling, and Nemotron READMEs describe their respective backends.

> **Note on PDF readers:** The PyPDF reader will raise `OSError: cannot write mode CMYK as PNG` when a PDF contains CMYK images (community bug #1310). For these cases, switch to the Docling or PyMuPDF backend, or pre-process the PDF.
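The fallback suggested in the note can be sketched as a small wrapper that tries one parser and moves on to the next when it raises `OSError`. The two parser functions below are hypothetical stand-ins, not the real `parse_pdf_to_pages` implementations, and the simplified signature (path in, page-number-to-text mapping out) is an assumption made for illustration.

```python
from collections.abc import Callable, Sequence

# Simplified, assumed parser shape: path in, {page number: text} out.
PageParser = Callable[[str], dict]


def parse_with_fallback(path: str, parsers: Sequence[PageParser]) -> dict:
    """Try each parser in order and return the first successful result."""
    errors = []
    for parser in parsers:
        try:
            return parser(path)
        except OSError as exc:  # e.g. "cannot write mode CMYK as PNG"
            errors.append(exc)
    raise RuntimeError(f"All parsers failed for {path}: {errors}")


def fake_pypdf_parser(path: str) -> dict:
    # Hypothetical stand-in that always hits the CMYK failure.
    raise OSError("cannot write mode CMYK as PNG")


def fake_pymupdf_parser(path: str) -> dict:
    # Hypothetical stand-in that succeeds.
    return {"1": f"text extracted from {path}"}


pages = parse_with_fallback("paper.pdf", [fake_pypdf_parser, fake_pymupdf_parser])
```

Catching only `OSError` keeps unrelated bugs visible; widen the exception type only if a backend is known to raise something else for recoverable failures.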

## Quickstart

### 1. Configure LLM credentials

PaperQA uses `lmi`, Future-House's LLM client built on LiteLLM, as its model abstraction, so any provider LiteLLM supports (OpenAI, Azure OpenAI, Anthropic, Gemini, Ollama, llamafile, etc.) can be configured. Before running PaperQA, set the provider's API key as an environment variable, for example `OPENAI_API_KEY`, `AZURE_OPENAI_API_KEY`, or `GEMINI_API_KEY`. Community discussions #393, #378, #428, and #1044 confirm that the CLI looks for these standard variables; if no model is configured, PaperQA falls back to its defaults and may require an OpenAI key.

> **Memory:** Importing `paperqa.clients` alone still pulls in `lmi` and adds roughly 100 MB of resident memory (community report #1317). If you only need metadata helpers, plan for this overhead, or defer the import until the helpers are actually used.
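One way to keep a heavy import out of process startup is to defer it until first use. The sketch below is a generic lazy-import proxy, not part of PaperQA; `LazyModule` is a hypothetical helper, and the stdlib module `colorsys` stands in for a heavy dependency. It moves the cost to a later point but does not reduce it.

```python
import importlib


class LazyModule:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module = None

    def __getattr__(self, attr: str):
        # Only reached for names not set in __init__, i.e. module members.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


colors = LazyModule("colorsys")  # stand-in for a heavy package
loaded_before = colors._module is not None
hsv = colors.rgb_to_hsv(1.0, 0.0, 0.0)  # first access triggers the import
loaded_after = colors._module is not None
```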

### 2. Build an index over a directory of papers

```python
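import asyncio

from paperqa import Settings
from paperqa.agents import build_index

settings = Settings()
settings.agent.index.paper_directory = "./my_papers"
# Index only PDF files; everything else in the directory is skipped.
settings.agent.index.files_filter = lambda p: p.suffix.lower() == ".pdf"

async def main():
    # Top-level `await` only works in a notebook; scripts need an event loop.
    return await build_index(settings=settings)

index = asyncio.run(main())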
```

`build_index` walks `paper_directory`, filters files through `files_filter`, and writes a Tantivy index plus a manifest. Re-running it syncs the index with the directory when `sync_index_w_directory` is enabled, and it warns once the file count exceeds `WARN_IF_INDEXING_MORE_THAN`.
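The walk, filter, and warn steps described above can be sketched with the standard library alone. This is an illustration rather than PaperQA's implementation: `collect_files` and `WARN_THRESHOLD` are hypothetical names, and the threshold value is arbitrary.

```python
import logging
import tempfile
from collections.abc import Callable
from pathlib import Path

logger = logging.getLogger(__name__)
WARN_THRESHOLD = 2  # hypothetical stand-in for WARN_IF_INDEXING_MORE_THAN


def collect_files(
    directory: Path,
    files_filter: Callable[[Path], bool],
    recurse: bool = True,
) -> list:
    """Return the files under `directory` accepted by `files_filter`, sorted."""
    candidates = directory.rglob("*") if recurse else directory.glob("*")
    selected = sorted(p for p in candidates if p.is_file() and files_filter(p))
    if len(selected) > WARN_THRESHOLD:
        logger.warning("Indexing %d files, which may be slow", len(selected))
    return selected


def is_pdf(path: Path) -> bool:
    return path.suffix.lower() == ".pdf"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.pdf").touch()
    (root / "notes.txt").touch()
    (root / "sub").mkdir()
    (root / "sub" / "b.PDF").touch()
    names = [p.name for p in collect_files(root, is_pdf)]
    top_only = [p.name for p in collect_files(root, is_pdf, recurse=False)]
```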

### 3. Ask a question

```python
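from paperqa import ask

# The upstream README calls `ask` synchronously from scripts; inside a
# running event loop (for example a notebook), `await` the result instead.
answer = ask(
    "What method was used for protein folding in this corpus?",
    settings=settings,
)
# Assumption: the answer text and its references live on `answer.session`,
# as in the upstream README; verify against your installed version.
print(answer.session.formatted_answer)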
```

`ask` performs the full agent loop: searches the local index, optionally augments with web search and paper downloads, gathers evidence (chunk summaries), and synthesizes a cited answer via `AnswerResponse`, defined in [src/paperqa/agents/models.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/models.py). The search-query prompt used to fan out into multiple queries is built by `litellm_get_search_query` in [src/paperqa/agents/helpers.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/helpers.py).

### 4. Use a non-default reader

```python
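from paperqa_pymupdf import parse_pdf_to_pages

# Assumption: the parser hook is the `parse_pdf` field on `Settings.parsing`;
# check `ParsingSettings` in your installed version before relying on it.
settings.parsing.parse_pdf = parse_pdf_to_pages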
```

The same pattern works for `paperqa_docling.parse_pdf_to_pages` and `paperqa_nemotron.parse_pdf_to_pages`.

## Common Failure Modes and Configuration Tips

- **Pickle in search indexes.** Persisted indexes default to `SearchDocumentStorage.PICKLE_COMPRESSED` and are unsafe to load from untrusted sources (security report #1325). Only load indexes you built yourself, or switch to a non-pickle storage mode.
- **Selecting a subset of papers.** Use `Settings.agent.index.files_filter` to control which files inside `paper_directory` are indexed, as asked in community question #1330. You can also set `recurse_subdirectories = False` to avoid picking up nested folders.
- **Local models.** When using Ollama, llamafile, or other local servers, set both the `llm` and `summary_llm` model names in `Settings` to the LiteLLM route (for example `ollama/llama3.1`) and configure any required `api_base`. Community discussions #378 and #428 walk through common pitfalls with `litellm`.
- **Zotero and OpenReview sources.** The `paperqa.contrib` package exposes `ZoteroDB` ([src/paperqa/contrib/zotero.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/zotero.py)) and `OpenReviewPaperHelper` ([src/paperqa/contrib/openreview_paper_helper.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/openreview_paper_helper.py)) for downloading PDFs from those sources into `paper_directory` before indexing.
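To illustrate why a non-pickle storage mode is safer, as the first bullet recommends, the sketch below round-trips a document through compressed JSON. It is not PaperQA's storage code; `dump_document` and `load_document` are hypothetical names. Unlike unpickling, parsing JSON only ever produces plain data and never runs code supplied by the file.

```python
import json
import zlib


def dump_document(doc: dict) -> bytes:
    """Serialize a plain dict to zlib-compressed JSON bytes."""
    return zlib.compress(json.dumps(doc, sort_keys=True).encode("utf-8"))


def load_document(blob: bytes) -> dict:
    """Inverse of dump_document; yields only dicts, lists, and scalars."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


record = {"title": "Example paper", "year": 2024, "body": "full text here"}
restored = load_document(dump_document(record))
```

The trade-off is that JSON cannot carry arbitrary Python objects, so anything richer than plain data needs an explicit conversion step.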

## See Also

- Settings and Configuration
- Index Management and Search
- Agent and Tooling Reference
- PDF Reader Plugins
- Contributing Guide

---

<a id='page-2'></a>

## Installation and CLI Usage

### Related Pages

Related topics: [Overview and Quickstart](#page-1)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/paperqa/agents/__init__.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/__init__.py)
- [src/paperqa/agents/search.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/search.py)
- [src/paperqa/agents/env.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/env.py)
- [src/paperqa/agents/helpers.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/helpers.py)
- [src/paperqa/agents/models.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/models.py)
- [src/paperqa/contrib/zotero.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/zotero.py)
- [src/paperqa/contrib/openreview_paper_helper.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/openreview_paper_helper.py)
- [src/paperqa/configs/contracrow.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/contracrow.json)
- [src/paperqa/configs/wikicrow.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/wikicrow.json)
- [packages/paper-qa-pypdf/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pypdf/README.md)
- [packages/paper-qa-docling/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-docling/README.md)
- [packages/paper-qa-pymupdf/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pymupdf/README.md)
- [packages/paper-qa-nemotron/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-nemotron/README.md)
</details>

# Installation and CLI Usage

PaperQA ships as a monorepo composed of a core package and a set of optional PDF reader packages, all driven through a single command-line interface called `pqa`. This page describes how to install the runtime, how the CLI dispatches its subcommands, and where configuration and known operational issues are documented.

## Installation Layout

The repository is organised so the core agent and search logic live in `src/paperqa`, while PDF parsing backends are extracted into four independently installable packages under `packages/`:

| Package | Backend | Notes |
| --- | --- | --- |
| `paper-qa-pypdf` | PyPDF | Default backend; extras `[media]` (pypdfium2) and `[enhanced]` (pdfplumber) enable figure/table extraction ([packages/paper-qa-pypdf/README.md:1](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pypdf/README.md)) |
| `paper-qa-docling` | Docling / docling-parse | Provides layout-aware parsing via `parse_pdf_to_pages` ([packages/paper-qa-docling/src/paperqa_docling/__init__.py:1](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-docling/src/paperqa_docling/__init__.py)) |
| `paper-qa-pymupdf` | PyMuPDF | AGPLv3-licensed backend ([packages/paper-qa-pymupdf/README.md:1](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pymupdf/README.md)) |
| `paper-qa-nemotron` | NVIDIA nemotron-parse VLM | Calls the nemotron-parse NIM/HF endpoint for vision-language parsing ([packages/paper-qa-nemotron/README.md:1](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-nemotron/README.md)) |

The core package itself pulls in `litellm`, which is imported unconditionally at module load time ([Community issue #1317](https://github.com/Future-House/paper-qa/issues/1317)). Users who only want metadata-only flows via `paperqa.clients.DocMetadataClient` still incur the LMI/LiteLLM import cost (~100 MB), which is a known design limitation rather than an installation defect.

A representative setup — reproduced from community discussions such as #428 — uses `mamba` to create an isolated environment, then installs the core package in editable mode from a clone of the repository.

## CLI Entry Point and Subcommands

The CLI dispatcher lives in `src/paperqa/agents/__init__.py`. When the module is executed as `__main__`, it sets a `_INITIATED_FROM_CLI` flag and routes on the first positional argument. The supported subcommands are `view`, `ask`, `search`, and `index`; any other input prints a brief help banner listing the commands ([src/paperqa/agents/__init__.py:1](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/__init__.py)).
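The routing described above can be approximated with `argparse` subparsers. This is a sketch of the dispatch pattern only: the program name, argument names, and handler bodies are hypothetical and do not reproduce the real `pqa` flags.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pqa-sketch")
    sub = parser.add_subparsers(dest="command")
    sub.add_parser("view")
    sub.add_parser("ask").add_argument("query")
    sub.add_parser("search").add_argument("query")
    sub.add_parser("index").add_argument("directory")
    return parser


def dispatch(argv: list) -> str:
    """Route on the first positional argument; fall back to help text."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.command is None:
        return parser.format_help()
    handlers = {
        "view": lambda a: "showing settings",
        "ask": lambda a: f"asking: {a.query}",
        "search": lambda a: f"searching: {a.query}",
        "index": lambda a: f"indexing: {a.directory}",
    }
    return handlers[args.command](args)


result = dispatch(["index", "./my_papers"])
```

Note that `argparse` exits with an error on an unrecognized subcommand, whereas the text above describes a help banner for any unrecognized input; only the no-argument case returns help in this sketch.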

The `index` subcommand delegates directly to `build_index(args.index, args.directory, settings)`. `build_index` constructs a `SearchIndex` named via `index_settings.name or _settings.get_index_name()`, then walks `index_settings.paper_directory` (recursively if `recurse_subdirectories` is set) and applies `index_settings.files_filter` to each candidate path ([src/paperqa/agents/search.py:1](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/search.py)). If `build=False` and the index has no files, the function raises `RuntimeError(f"Index {search_index.index_name} was empty, please rebuild it.")`, which is the common failure surface for users running `pqa ask` before `pqa index`.

The flow when answering questions is roughly:

```mermaid
flowchart LR
  A[pqa index] --> B[SearchIndex on disk]
  B --> C[pqa ask / pqa search]
  C --> D[agents.agent_query]
  D --> E[AnswerResponse]
```

Logging for the CLI is configured via `configure_cli_logging(verbosity)`, which installs a Rich handler, calls `Settings.parsing.configure_pdf_parser()` to wire the chosen PDF backend, and applies `LOG_VERBOSITY_MAP` to suppress loquacious third-party loggers ([src/paperqa/agents/__init__.py:1](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/__init__.py)). Increasing the verbosity beyond zero prints the PaperQA version banner.
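A verbosity map of this kind can be modeled as a nested dictionary from verbosity tier to per-logger levels. The sketch below uses only the standard `logging` module; `VERBOSITY_MAP`, its logger names, and its levels are hypothetical and are not the contents of `LOG_VERBOSITY_MAP`.

```python
import logging

# Hypothetical map: verbosity tier -> {logger name: level to apply}.
VERBOSITY_MAP = {
    0: {"httpx": logging.WARNING, "litellm": logging.WARNING, "app": logging.INFO},
    1: {"httpx": logging.WARNING, "litellm": logging.INFO, "app": logging.DEBUG},
}


def apply_verbosity(verbosity: int) -> None:
    """Apply the levels of the highest configured tier <= verbosity."""
    tier = max((t for t in VERBOSITY_MAP if t <= verbosity), default=0)
    for name, level in VERBOSITY_MAP[tier].items():
        logging.getLogger(name).setLevel(level)


apply_verbosity(0)
quiet_level = logging.getLogger("litellm").level
apply_verbosity(5)  # higher than any tier: uses the top tier
verbose_level = logging.getLogger("app").level
```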

## Configuration and Settings

Runtime configuration flows through the Pydantic `Settings` model. JSON presets under `src/paperqa/configs/` (for example `contracrow.json` and `wikicrow.json`) override prompt templates, summary JSON schemas, and agent type — `contracrow.json` configures a `ToolSelector` agent with a `gpt-4o-2024-08-06` LLM and a 500 s timeout, and enables structured `summary_json`/`summary_json_system` prompts for contradiction detection ([src/paperqa/configs/contracrow.json:1](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/contracrow.json)). The `wikicrow.json` preset switches the prompts into MLA-style citation extraction with a structured `structured_citation_prompt` ([src/paperqa/configs/wikicrow.json:1](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/wikicrow.json)).
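Conceptually, a preset is a partial dictionary layered over defaults. The sketch below shows that layering as a recursive merge. It is a mental model only: the real `Settings` class is a Pydantic model with its own validation, `deep_merge` is a hypothetical helper, and the default values shown are invented for illustration.

```python
import json
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` applied recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults and preset; field names echo this page, but the
# nesting and default values are illustrative only.
defaults = {"agent": {"agent_type": "ToolSelector", "timeout": 100.0, "search_count": 8}}
preset = json.loads('{"agent": {"timeout": 500.0, "search_count": 12}}')
effective = deep_merge(defaults, preset)
```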

Helper modules wire external services to environment variables. `ZoteroDB.__init__` reads `ZOTERO_USER_ID` and `ZOTERO_API_KEY` automatically and falls back to `~/.paperqa/zotero` for PDF storage ([src/paperqa/contrib/zotero.py:1](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/zotero.py)). The OpenReview helper writes into the index's `paper_directory` and re-uses the `Settings` object for paths, model choice, and concurrency ([src/paperqa/contrib/openreview_paper_helper.py:1](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/openreview_paper_helper.py)).

## Known Operational Issues

Several recurring failure modes are surfaced by community reports and should be planned for during deployment:

- **Pickle deserialization in persisted indexes** — The default `SearchDocumentStorage.PICKLE_COMPRESSED` mode unserialises index files when loaded. A poisoned index can therefore execute arbitrary code on load ([Community issue #1325](https://github.com/Future-House/paper-qa/issues/1325)). Treat shared indexes as untrusted input.
- **CMYK image crash** — Indexing PDFs whose images use the CMYK color space raises `OSError: cannot write mode CMYK as PNG` from `paperqa_pypdf/reader.py`. The crash aborts `pqa index` instead of skipping the file ([Community issue #1310](https://github.com/Future-House/paper-qa/issues/1310)).
- **Selective indexing** — `ask`, `agent_query`, and `build_index` consume the entire `paper_directory` (subject to `files_filter`); there is no per-file allow-list exposed by these APIs ([Community issue #1330](https://github.com/Future-House/paper-qa/issues/1330)).
- **Non-OpenAI LLM setup** — Users running local servers (Ollama, llamafile, vLLM) report friction because the CLI looks for an OpenAI key by default; switching to Gemini or open-source endpoints requires explicit `Settings` overrides ([Community issues #378, #387, #1044](https://github.com/Future-House/paper-qa/issues/378)).
- **LiteLLM import weight** — Even metadata-only clients transitively import LiteLLM ([Community issue #1317](https://github.com/Future-House/paper-qa/issues/1317)).
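For the selective-indexing case, an explicit allow-list can be expressed as a `files_filter` predicate. The helper below is a hypothetical sketch, not a PaperQA API; it assumes only that the filter is a callable taking a path and returning a boolean, and the file names are made up.

```python
from collections.abc import Callable, Iterable
from pathlib import Path


def make_allowlist_filter(names: Iterable[str]) -> Callable[[Path], bool]:
    """Build a predicate that accepts only the listed file names."""
    allowed = {name.lower() for name in names}

    def files_filter(path: Path) -> bool:
        # Compare on the file name only, ignoring case and parent folders.
        return path.name.lower() in allowed

    return files_filter


keep = make_allowlist_filter(["smith2021.pdf", "lee2023.pdf"])
accepted = keep(Path("papers/Smith2021.pdf"))
rejected = keep(Path("papers/other.pdf"))
```

The resulting callable could then be assigned to `settings.agent.index.files_filter` before building the index.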

In `src/paperqa/agents/models.py`, `AnswerResponse` provides an SMS-style `get_summary()` helper, and `SimpleProfiler` provides an `@asynccontextmanager`-based timer that emits `[Profiling]` lines consumed by downstream Google Cloud monitoring. Both are useful when adding `--verbose` instrumentation to the CLI ([src/paperqa/agents/models.py:1](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/models.py)).
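An `@asynccontextmanager`-based timer of the kind described here might look like the following. This is a minimal sketch, not the `SimpleProfiler` source: the `timings` dictionary and the exact log line format are assumptions.

```python
import asyncio
import logging
import time
from contextlib import asynccontextmanager

logger = logging.getLogger(__name__)
timings = {}  # hypothetical store: timer name -> elapsed seconds


@asynccontextmanager
async def timer(name: str):
    """Time the enclosed block and log a [Profiling] line when it exits."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[name] = elapsed
        logger.info("[Profiling] | NAME: %s | TIME: %.3fs", name, elapsed)


async def demo() -> None:
    async with timer("sleep"):
        await asyncio.sleep(0.01)


asyncio.run(demo())
```

Recording the time in a `finally` block means the measurement is still logged when the timed block raises.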

## See Also

- [PaperQA README](https://github.com/Future-House/paper-qa/blob/main/README.md) for end-to-end quick-start examples.
- Community issue [#387 "End-to-end working example?"](https://github.com/Future-House/paper-qa/issues/387) for additional CLI recipes contributed by users.

---

<a id='page-3'></a>

## System Architecture and Data Flow

### Related Pages

Related topics: [Agentic Workflow and Tools](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/paperqa/configs/contracrow.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/contracrow.json)
- [src/paperqa/configs/wikicrow.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/wikicrow.json)
- [src/paperqa/configs/debug.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/debug.json)
- [packages/paper-qa-pypdf/src/paperqa_pypdf/reader.py](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pypdf/src/paperqa_pypdf/reader.py)
- [packages/paper-qa-docling/src/paperqa_docling/reader.py](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-docling/src/paperqa_docling/reader.py)
- [packages/paper-qa-docling/src/paperqa_docling/__init__.py](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-docling/src/paperqa_docling/__init__.py)
- [packages/paper-qa-pymupdf/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pymupdf/README.md)
- [packages/paper-qa-nemotron/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-nemotron/README.md)
- [src/paperqa/agents/search.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/search.py)
- [src/paperqa/agents/env.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/env.py)
- [src/paperqa/agents/helpers.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/helpers.py)
- [src/paperqa/contrib/zotero.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/zotero.py)
- [src/paperqa/contrib/openreview_paper_helper.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/openreview_paper_helper.py)
</details>

# System Architecture and Data Flow

PaperQA is a multi-stage literature QA system that ingests PDFs, builds a searchable index, and runs an LLM-driven agent that gathers evidence and produces cited answers. The codebase is organized into a small core, several optional PDF-reader packages, configuration presets, and agent modules.

## High-Level Architecture

The system has four cooperating layers: **Configuration**, **Document Ingestion**, **Search Indexing**, and **Agent Execution**. Each layer is independently replaceable so the same `Docs`/`Agent` abstractions can drive very different deployments — from a local OpenAI-backed CLI to a self-hosted LLM served via LiteLLM (a recurring pain point tracked in issues such as [#378](https://github.com/Future-House/paper-qa/issues/378), [#387](https://github.com/Future-House/paper-qa/issues/387), [#393](https://github.com/Future-House/paper-qa/issues/393), [#428](https://github.com/Future-House/paper-qa/issues/428), and [#1044](https://github.com/Future-House/paper-qa/issues/1044)).

```mermaid
flowchart LR
    A[Config Presets<br/>contracrow / wikicrow / debug] --> D[Settings]
    B[PDF Readers<br/>pypdf · docling · pymupdf · nemotron] --> E[ParsedText + ParsedMedia]
    F[External Sources<br/>Zotero · OpenReview] --> E
    E --> G[Tantivy SearchIndex<br/>via agents/search.py]
    G --> H[Agent Environment<br/>agents/env.py]
    H --> I[ToolSelector Agent<br/>search · gather · cite]
    I --> J[Answer + Citations]
```

## Configuration Layer

Behavior is driven by JSON presets that populate a `Settings` object, including model identifiers, prompt templates, agent type, and evidence-retrieval knobs. For example, [`contracrow.json`](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/contracrow.json) sets `agent_llm` to `gpt-4o-2024-08-06`, `agent_type` to `ToolSelector`, `search_count` to 12, and supplies an XML-formatted contradiction prompt that asks the model to label claims on an eleven-point scale (from "explicit contradiction" to "explicit agreement"). [`wikicrow.json`](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/wikicrow.json) follows the same shape but uses a Wikipedia-style `qa` prompt and an MLA-formatted citation post-processor (`structured_citation_prompt`). [`debug.json`](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/debug.json) is a deliberately lean preset that downscales `evidence_k`, `answer_length`, and `max_concurrent_requests` for fast iteration.

## Document Ingestion Layer

Ingestion is pluggable. Each optional package under `packages/` exposes a `parse_pdf_to_pages` function returning `ParsedText` plus optional `ParsedMedia`:

| Package | Backend | Notes |
|---|---|---|
| [`paper-qa-pypdf`](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pypdf/src/paperqa_pypdf/reader.py) | pypdf + optional pdfium/pdfplumber | Defines `MediaMode` (`NONE`, `FULL_PAGE`, `INDIVIDUAL`, `INDIVIDUAL_CLUSTERING`) and clusters bounding boxes via `cluster_bboxes`. |
| [`paper-qa-docling`](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-docling/src/paperqa_docling/reader.py) | Docling `StandardPdfPipeline` + `DoclingParseDocumentBackend` | Generates picture and table images, exposes `dpi` and `custom_pipeline_options`. |
| [`paper-qa-pymupdf`](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pymupdf/README.md) | PyMuPDF | AGPLv3-licensed; added in the v2026.01.05 release. |
| [`paper-qa-nemotron`](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-nemotron/README.md) | Nvidia nemotron-parse VLM (NIM) | Vision-language parsing; URLs and support matrix documented in the README. |

External paper sources feed the same pipeline. [`ZoteroDB`](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/zotero.py) extends `pyzotero.zotero.Zotero`, reads `ZOTERO_USER_ID` / `ZOTERO_API_KEY` from the environment, and downloads PDFs to `~/.paperqa/zotero`. [`OpenReviewPaperHelper`](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/openreview_paper_helper.py) uses an LLM (via `litellm_get_search_query`-style schema prompting) to pick up to 20 relevant submissions, then downloads PDFs into `settings.agent.index.paper_directory`.

Common failure modes the community has surfaced — for example, CMYK image crashes during `pil_image.save(buf, format="PNG")` in `paperqa_pypdf/reader.py` (issue [#1310](https://github.com/Future-House/paper-qa/issues/1310)) — all sit inside this ingestion layer.

## Search Indexing Layer

[`agents/search.py`](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/search.py) walks the configured `paper_directory` (recursively when `recurse_subdirectories` is enabled), applies `index_settings.files_filter` to each candidate, and feeds the survivors into a `SearchIndex` whose required fields are extended with `title` and `year`. The function honors a persisted manifest (`finalize_manifest_file`) so unchanged files are skipped on rebuild, and logs a warning if more than `WARN_IF_INDEXING_MORE_THAN` files are about to be indexed. Index scoping is the topic of issue [#1330](https://github.com/Future-House/paper-qa/issues/1330), where users want to query a curated subset rather than every PDF in a folder — the answer in code is to drive `files_filter` rather than mutate the directory.
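The manifest-based skip can be illustrated with a content hash per file. The on-disk manifest format is not described on this page, so the sketch below assumes a plain mapping from path to SHA-256 digest; `file_digest` and `files_to_index` are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Content hash used to detect changed files."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_index(paths: list, manifest: dict) -> list:
    """Return paths whose digest differs from the manifest, updating it."""
    pending = []
    for path in paths:
        digest = file_digest(path)
        if manifest.get(str(path)) != digest:
            pending.append(path)
            manifest[str(path)] = digest
    return pending


with tempfile.TemporaryDirectory() as tmp:
    paper = Path(tmp) / "paper.pdf"
    paper.write_bytes(b"version 1")
    manifest = {}
    first = files_to_index([paper], manifest)   # new file: indexed
    second = files_to_index([paper], manifest)  # unchanged: skipped
    paper.write_bytes(b"version 2")
    third = files_to_index([paper], manifest)   # changed: indexed again
```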

## Agent Execution Layer

[`agents/env.py`](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/env.py) defines `PaperQAEnvironment`, which wraps a `Docs` object, an LLM model, a summary LLM, an embedding model, and (optionally) a `session_id`, and tracks per-query state in a `PQASession`. It calls `settings_to_tools` to expose the configured tools, resets docs on each new query, and registers `clinical_trial_status` as a `status_fn` when `ClinicalTrialsSearch` is enabled. The `from_task` classmethod is a convenience entry point that builds a default `Settings()` plus an empty `Docs()` for one-shot use.

Helpers in [`agents/helpers.py`](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/helpers.py) include `litellm_get_search_query`, which validates that a custom template contains `{count}`, `{question}`, and `{date}` placeholders (replacing `{date}` with the current year) before falling back to a default keyword-generation prompt; and `get_year`, which returns the current year as a string for prompt interpolation. Note that importing paperqa eagerly pulls in `lmi` / LiteLLM, which costs roughly 100 MB even for metadata-only callers — tracked in issue [#1317](https://github.com/Future-House/paper-qa/issues/1317).
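The placeholder check described above can be sketched with `string.Formatter`, which parses a format string into its named fields. This is not the body of `litellm_get_search_query`: the function names and the default prompt text below are hypothetical.

```python
from datetime import date
from string import Formatter
from typing import Optional

REQUIRED_FIELDS = {"count", "question", "date"}
# Hypothetical default prompt, for illustration only.
DEFAULT_TEMPLATE = (
    "Generate {count} keyword searches for the question: {question} "
    "(current year: {date})"
)


def template_fields(template: str) -> set:
    """Collect the named placeholders that appear in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def build_search_prompt(
    question: str, count: int, template: Optional[str] = None
) -> str:
    """Use `template` if it has every required placeholder, else the default."""
    if template is None or not REQUIRED_FIELDS <= template_fields(template):
        template = DEFAULT_TEMPLATE
    return template.format(count=count, question=question, date=date.today().year)


custom = build_search_prompt("q?", 2, "{count}|{question}|{date}")
fallback = build_search_prompt("q?", 3, "only {question}")
```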

## End-to-End Data Flow

1. A config preset (or `Settings()` defaults) is loaded.
2. PDFs are sourced from a local directory, Zotero, or OpenReview and parsed by one of the reader packages.
3. Parsed text + media is chunked, embedded, and written to a Tantivy `SearchIndex` (manifest-tracked).
4. A user query enters the agent environment, which records it in a `PQASession` and exposes search, gather-evidence, paper-collection, and citation tools to a `ToolSelector` agent.
5. The agent iterates: searches the index, gathers evidence, follows citations inside evidence, and finally emits a cited answer.

For LLM selection, any provider supported by LiteLLM can be plugged into `agent_llm` / `summary_llm` via the config; community guidance in the issues above shows that Azure, Gemini, Ollama, and llamafile all work as long as the model string is correctly formatted and the required API key is exported.

## See Also

- Settings and configuration presets
- Tantivy search index persistence (security notes in issue [#1325](https://github.com/Future-House/paper-qa/issues/1325))
- Agent tool selection and the `ToolSelector` loop
- Optional PDF reader packages: pypdf, docling, pymupdf, nemotron

---

<a id='page-4'></a>

## Agentic Workflow and Tools

### Related Pages

Related topics: [System Architecture and Data Flow](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/paperqa/agents/main.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/main.py)
- [src/paperqa/agents/env.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/env.py)
- [src/paperqa/agents/models.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/models.py)
- [src/paperqa/agents/search.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/search.py)
- [src/paperqa/agents/helpers.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/helpers.py)
- [src/paperqa/agents/__init__.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/__init__.py)
- [src/paperqa/configs/contracrow.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/contracrow.json)
- [src/paperqa/configs/wikicrow.json](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/configs/wikicrow.json)
- [packages/paper-qa-docling/src/paperqa_docling/reader.py](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-docling/src/paperqa_docling/reader.py)
- [packages/paper-qa-pypdf/src/paperqa_pypdf/reader.py](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-pypdf/src/paperqa_pypdf/reader.py)
- [packages/paper-qa-nemotron/README.md](https://github.com/Future-House/paper-qa/blob/main/packages/paper-qa-nemotron/README.md)
- [src/paperqa/contrib/zotero.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/contrib/zotero.py)
</details>

# Agentic Workflow and Tools

## Overview

The agentic workflow in PaperQA is built on top of the [Aviary](https://github.com/Future-House/aviary) framework and is the central mechanism that drives multi-step literature question answering. Instead of running a single retrieve, summarize, and answer pass (as when calling `Docs.aquery()` directly), the agent iteratively selects **tools** (searching for papers, gathering evidence, citing references, and generating answers) until it can confidently terminate. Source: [src/paperqa/agents/main.py:1-90](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/main.py)

The agent loop is exposed via `run_aviary_agent()` in [src/paperqa/agents/main.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/main.py), which orchestrates a `PaperQAEnvironment`, a `ToolSelector` agent, and a set of `Tool` objects derived from the user's `Settings`. The `rollout()` call runs under a timeout, which bounds cost and wall-clock time. Source: [src/paperqa/agents/main.py:30-90](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/main.py)
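A timeout-bound rollout can be sketched with `asyncio.wait_for`. The loop body, the function names, and the `"success"` and `"truncated"` status strings below are hypothetical; they illustrate the pattern rather than the code in `main.py`.

```python
import asyncio


async def rollout(steps: int, step_seconds: float) -> str:
    """Hypothetical agent loop: each sleep stands in for one tool call."""
    for _ in range(steps):
        await asyncio.sleep(step_seconds)
    return "success"


async def run_with_timeout(steps: int, step_seconds: float, timeout: float) -> str:
    """Return the rollout status, or 'truncated' if the time budget runs out."""
    try:
        return await asyncio.wait_for(rollout(steps, step_seconds), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising.
        return "truncated"


fast = asyncio.run(run_with_timeout(steps=2, step_seconds=0.01, timeout=1.0))
slow = asyncio.run(run_with_timeout(steps=50, step_seconds=0.05, timeout=0.1))
```

Returning a status instead of re-raising lets the caller still report whatever partial state the session accumulated before the deadline.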

## Agent Loop and Tool Set

### Environment and state

`PaperQAEnvironment` (defined in [src/paperqa/agents/env.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/env.py)) wires the LLM, summary LLM, embedding model, and a `Docs` collection into an Aviary environment. Its `reset()` method clears the docs, constructs a fresh `PQASession`, and returns the initial message list plus the available tools. Source: [src/paperqa/agents/env.py:60-150]()

A `PQASession` ([src/paperqa/agents/models.py:1-80]()) holds the user's question, the config MD5, the accumulated evidence, the cited contexts, and the running cost and timing. It is the canonical state that the agent mutates step by step. `AnswerResponse` is the final structured output returned to callers.

### The default tool set

The environment builds its tools via `settings_to_tools()`. By default this includes `search_papers`, `gather_evidence`, `gen_answer`, `gen_answer_stream`, `empty_evidence`, and a termination tool that lets the LLM signal completion. Tools can be filtered through `Settings.agent.tool_names` (and limited with `tool_call_limit`), so power users can, for example, disable the streaming answer tool or add custom clinical-trial tools (see the `ClinicalTrialsSearch.TOOL_FN_NAME` branch in `make_initial_state()`). Source: [src/paperqa/agents/env.py:80-130]()

The agent system prompt and behavior are driven by the `Settings.agent` block. The shipped presets `contracrow.json` and `wikicrow.json` ([src/paperqa/configs/](https://github.com/Future-House/paper-qa/tree/main/src/paperqa/configs)) encode a "search → gather → cite → gather more → answer → complete" loop with `search_count: 12` and a 500 s timeout, demonstrating the recommended default cadence. Source: [src/paperqa/configs/contracrow.json](), [src/paperqa/configs/wikicrow.json]()

### Mermaid: agent loop

```mermaid
flowchart LR
    Q[User question] --> R[reset: build tools, init PQASession]
    R --> S[ToolSelector selects a tool]
    S -->|search_papers| SR[Tantivy / paper directory]
    S -->|gather_evidence| GE[Embeddings + LLM summary]
    S -->|gen_answer| AN[Compose answer w/ citations]
    S -->|complete| T[Terminate]
    SR --> S
    GE --> S
    AN --> S
    T --> OUT[AnswerResponse]
```

## Search Index Construction

The `search_papers` tool relies on a Tantivy index built by `read_documents()` in [src/paperqa/agents/search.py:1-90](). The function reads from `Settings.agent.index.paper_directory`, applies `index_settings.files_filter`, and warns when more than `WARN_IF_INDEXING_MORE_THAN` files are queued. A manifest file is written via `maybe_get_manifest()` so that downstream tools can detect which documents were actually considered. Source: [src/paperqa/agents/search.py:30-80]()

Helpers in [src/paperqa/agents/helpers.py](https://github.com/Future-House/paper-qa/blob/main/src/paperqa/agents/helpers.py), notably `litellm_get_search_query()`, prompt the LLM to emit `{count}` keyword queries (with year ranges) that the search tool then executes against the index. This indirection is what allows the agent to reformulate queries across iterations.

## PDF Reading Backends (Plugins)

PaperQA ships several opt-in PDF reading packages that plug into the agent's document-adding path:

| Package | Backend | Notes |
|---|---|---|
| `paper-qa-pypdf` | PyPDF | Default; supports clustered image extraction. Source: [packages/paper-qa-pypdf/src/paperqa_pypdf/reader.py]() |
| `paper-qa-docling` | Docling | High-fidelity layout/table extraction. Source: [packages/paper-qa-docling/src/paperqa_docling/reader.py]() |
| `paper-qa-pymupdf` | PyMuPDF | AGPL-licensed alternative. See [packages/paper-qa-pymupdf/README.md]() |
| `paper-qa-nemotron` | Nemotron-Parse VLM | NVIDIA-hosted vision-language model. See [packages/paper-qa-nemotron/README.md]() |

Each backend exposes a `parse_pdf_to_pages()` function with a consistent signature, so the agent can swap parsers without code changes. Only the `Settings.agent.index.parser` selection differs.
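The value of a shared signature can be shown with a small sketch. Everything here is hypothetical: the `PDFParser` alias uses a simplified "path in, page-number-to-text mapping out" shape that is assumed for illustration, and the two fake parsers stand in for real backends whose actual return types and parameters may differ.

```python
from collections.abc import Callable

# Assumed, simplified signature: file path in, {page number: page text} out.
PDFParser = Callable[[str], dict[str, str]]


def fake_pypdf_parser(path: str) -> dict[str, str]:
    """Placeholder for a PyPDF-style backend."""
    return {"1": f"[pypdf] page 1 of {path}", "2": f"[pypdf] page 2 of {path}"}


def fake_docling_parser(path: str) -> dict[str, str]:
    """Placeholder for a Docling-style backend."""
    return {"1": f"[docling] page 1 of {path}"}


# Hypothetical lookup table keyed by a configuration value.
PARSERS: dict[str, PDFParser] = {
    "pypdf": fake_pypdf_parser,
    "docling": fake_docling_parser,
}


def read_pdf(path: str, parser_name: str = "pypdf") -> str:
    """Parse with the configured backend and join pages in numeric order."""
    pages = PARSERS[parser_name](path)
    return "\n".join(pages[key] for key in sorted(pages, key=int))


print(read_pdf("paper.pdf", parser_name="docling"))
```

Because callers depend only on the shared shape, switching backends in this sketch is a one-word configuration change rather than a code change.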

## Configuration, Cost, and Common Failure Modes

### Configuring an alternative LLM

Community issues #393, #378, #428, and #1044 all ask how to point PaperQA at non-OpenAI providers (Azure OpenAI, Ollama/llamafile, Gemini). Because the agent goes through LiteLLM, any LiteLLM-supported provider can be selected by overriding `Settings.agent.agent_llm` and, where needed, the API key environment variables. For local servers (`llamafile`, `ollama`), users typically also override the embedding model. Source: [src/paperqa/agents/main.py](), [src/paperqa/agents/helpers.py]()
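As a hedged illustration, a LiteLLM-style `model_list` configuration for a local server can be built as a plain dictionary. The helper `make_local_llm_config` is hypothetical, and the model name, URL, and placeholder API key are example values. The commented usage follows the pattern shown in the project README, but field names should be verified against the installed version.

```python
def make_local_llm_config(
    model_name: str,
    api_base: str,
    api_key: str = "sk-no-key-required",  # placeholder for servers that ignore keys
) -> dict:
    """Build a LiteLLM Router-style config with a single model entry."""
    return {
        "model_list": [
            {
                "model_name": model_name,
                "litellm_params": {
                    "model": model_name,
                    "api_base": api_base,
                    "api_key": api_key,
                },
            }
        ]
    }


# Example values only: an Ollama-served model on its usual local port.
local_llm_config = make_local_llm_config("ollama/llama3.2", "http://localhost:11434")
print(local_llm_config["model_list"][0]["model_name"])

# Assumed usage (requires paper-qa to be installed; check names before relying on it):
#
#   from paperqa import Settings, ask
#   settings = Settings(
#       llm="ollama/llama3.2",
#       llm_config=local_llm_config,
#       summary_llm="ollama/llama3.2",
#       summary_llm_config=local_llm_config,
#       embedding="ollama/mxbai-embed-large",
#   )
#   answer = ask("What is PaperQA?", settings=settings)
```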

> **Note:** Issue #1317 reports that importing `paperqa` always loads LiteLLM (~100 MB), even when only `DocMetadataClient` is used. This is a known dependency-bloat issue that the LiteLLM-driven agentic workflow exacerbates.

### Selecting which papers to index

Issue #1330 asks how to constrain the search corpus. The answer is in `Settings.agent.index.files_filter` and the `paper_directory` / `recurse_subdirectories` fields, which are honored by `read_documents()`. Source: [src/paperqa/agents/search.py:40-80]()

### Security and crash issues

- **Pickle deserialization (CVE-class):** Issue #1325 documents that the default `SearchDocumentStorage.PICKLE_COMPRESSED` index format is unsafe to load from untrusted sources. Consider clearing or rebuilding the index when transferring it across trust boundaries.
- **CMYK images:** Issue #1310 reports `OSError: cannot write mode CMYK as PNG` from `paperqa_pypdf/reader.py`. Workarounds include pre-converting images with Pillow or switching to the Docling backend, which tolerates CMYK.
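To make the pickle risk concrete, the sketch below applies the "restricting globals" technique from the Python `pickle` documentation. It is not part of PaperQA: `RestrictedUnpickler`, `restricted_loads`, and the allow-list are illustrative, and a loader this strict would likely reject a real index if that index stores custom classes.

```python
import builtins
import io
import pickle

# Only these builtins may be resolved by name during unpickling.
SAFE_BUILTINS = {"dict", "list", "set", "frozenset", "tuple", "str", "int", "float", "bool"}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import anything outside a small allow-list."""

    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads, but blocks arbitrary callables embedded in the payload."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Exploit:
    """Demonstrates how a pickle can ask the loader to call an arbitrary function."""

    def __reduce__(self):
        return (eval, ("1 + 1",))  # harmless here, but could be any callable


print(restricted_loads(pickle.dumps({"title": "paper", "pages": [1, 2]})))
try:
    restricted_loads(pickle.dumps(Exploit()))
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

Plain containers load normally because they never trigger a global lookup, while the payload that references `eval` is rejected before anything is called.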

### External integrations

For users who maintain literature in Zotero or OpenReview, the contrib helpers in [src/paperqa/contrib/zotero.py]() and [src/paperqa/contrib/openreview_paper_helper.py]() download PDFs into `Settings.agent.index.paper_directory` so that a subsequent agent run can index and query them. The BGPT MCP/HTTP integration proposed in #1338 would follow the same pattern: register a new `Tool` whose `Tool.fn` calls BGPT and whose schema is exposed via `settings_to_tools()`.

## See Also

- [Search and Indexing](search-and-indexing.md)
- [Settings and Configuration](settings-and-configuration.md)
- [Document Parsers and PDF Backends](document-parsers.md)
- [CLI and `pqa` Commands](cli-reference.md)

---


## Pitfall Log

Project: Future-House/paper-qa

Summary: Found 10 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/Future-House/paper-qa/issues/1330

## 2. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/Future-House/paper-qa/issues/1310

## 3. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/Future-House/paper-qa

## 4. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/Future-House/paper-qa

## 5. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/Future-House/paper-qa

## 6. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/Future-House/paper-qa

## 7. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/Future-House/paper-qa/issues/399

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/Future-House/paper-qa/issues/1325

## 9. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/Future-House/paper-qa

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/Future-House/paper-qa

<!-- canonical_name: Future-House/paper-qa; human_manual_source: deepwiki_human_wiki -->