# https://github.com/jina-ai/reader Project Manual

Generated at: 2026-06-20 19:58:24 UTC

## Table of Contents

- [System Overview & Architecture](#page-1)
- [URL Fetching Engines & Content Extraction](#page-2)
- [Security, SSRF Protection & Abuse Mitigation](#page-3)
- [Search, Proxies, Caching & Self-Hosting Deployment](#page-4)

<a id='page-1'></a>

## System Overview & Architecture

### Related Pages

Related topics: [URL Fetching Engines & Content Extraction](#page-2), [Search, Proxies, Caching & Self-Hosting Deployment](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/jina-ai/reader/blob/main/README.md)
- [package.json](https://github.com/jina-ai/reader/blob/main/package.json)
- [src/utils/markdown.ts](https://github.com/jina-ai/reader/blob/main/src/utils/markdown.ts)
- [src/services/serp/serper.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/serper.ts)
- [src/services/serp/compat.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/compat.ts)
- [src/services/common-llm/open-router.ts](https://github.com/jina-ai/reader/blob/main/src/services/common-llm/open-router.ts)
- [src/services/common-llm/misc.ts](https://github.com/jina-ai/reader/blob/main/src/services/common-llm/misc.ts)
- [src/utils/tailwind-classes.ts](https://github.com/jina-ai/reader/blob/main/src/utils/tailwind-classes.ts)
</details>

# System Overview & Architecture

## 1. Purpose and Scope

Reader is Jina AI's open-source service that turns arbitrary web content into clean, LLM-friendly input. It exposes two public surfaces: a **Read** endpoint (`https://r.jina.ai/`) that converts any URL — HTML page, PDF, Office document, or image — into Markdown, HTML, text, or a screenshot; and a **Search** endpoint (`https://s.jina.ai/`) that issues a web search and returns structured results with snippets, links, and metadata. Source: [README.md](https://github.com/jina-ai/reader/blob/main/README.md).

The codebase is a TypeScript application targeting Node.js `>=22.15` and is packaged as a single `civkit`-based service. Source: [package.json](https://github.com/jina-ai/reader/blob/main/package.json). The `package.json` `exports` field advertises three entry points — `./crawl`, `./search`, and `./serp` — indicating the project is logically decomposed along those three verbs even though it ships as one binary.

The project explicitly positions itself for two downstream use cases: feeding RAG pipelines and powering AI agents. The README's header option matrix (`x-preset`, `x-engine`, `x-respond-with`, etc.) is designed around those scenarios, with named presets such as `reader`, `index`, `research`, `agent`, and `spider`. Source: [README.md](https://github.com/jina-ai/reader/blob/main/README.md).

## 2. High-Level Architecture

The runtime is a Koa-style HTTP/2 service built on the `civkit` framework, with an inverted-dependency container (`tsyringe`) wiring together fetcher engines, content processors, LLM adapters, and SERP providers. The `package.json` dependency list (`koa`, `civkit`, `tsyringe`, `undici`, `@mozilla/readability`, `@nomagick/node-libcurl-impersonate`, `generic-pool`, `pino-pretty`) maps directly onto that layered design. Source: [package.json](https://github.com/jina-ai/reader/blob/main/package.json).

```mermaid
flowchart LR
    A[Client / LLM Agent] -->|URL or query| B(API Layer<br/>r.jina.ai / s.jina.ai)
    B --> C{Engine Selector<br/>x-engine: browser / curl / auto}
    C -->|browser| D[Headless Chrome Pool<br/>generic-pool]
    C -->|curl| E[node-libcurl-impersonate]
    D --> F[Content Pipeline]
    E --> F
    F --> G[Readability + Markdown tidy]
    G --> H[Output Formatter<br/>markdown / html / text / screenshot / frontmatter]
    F --> I[SSRF Gate<br/>assertNormalizedUrl]
    B --> J[SERP Adapters<br/>serper, compat]
    J --> K[WebSearchEntry]
    F --> L[LLM Adapters<br/>OpenRouter, misc]
    H --> A
    K --> A
```

The diagram captures the four responsibilities the codebase is organized around: **inbound API parsing** (the many `X-*` headers and JSON body options), **fetching** (browser pool vs. libcurl impersonation), **content processing** (readability, markdown normalization, image captioning via VLMs), and **output shaping** (presets, chunking, frontmatter). Source: [README.md](https://github.com/jina-ai/reader/blob/main/README.md).

## 3. Service Layers and Modules

### 3.1 Fetching and Rendering

Two fetch backends coexist. The `browser` engine runs headless Chrome through a pooled allocator (`generic-pool` is a direct dependency), enabling JavaScript execution and screenshot capture — required for SPAs and dynamic content, a gap reported in the community (#1242 "Improve content extraction logic to handle dynamic and hidden elements"). The `curl` engine uses `@nomagick/node-libcurl-impersonate` for lightweight, non-JS fetches that still mimic real browser TLS fingerprints. The default `auto` engine decides per URL. Sources: [README.md](https://github.com/jina-ai/reader/blob/main/README.md), [package.json](https://github.com/jina-ai/reader/blob/main/package.json).

### 3.2 Content Processing

After fetching, HTML is reduced to LLM input through a pipeline that includes `@mozilla/readability` and a markdown normalization pass. The `tidyMarkdown` helper consolidates links and images that have been split across lines, normalizes whitespace, and re-wraps nested image-in-link fragments into a single tidy markdown token, which is critical because raw Chromium-to-markdown output is often broken when long links or inline images wrap. Source: [src/utils/markdown.ts](https://github.com/jina-ai/reader/blob/main/src/utils/markdown.ts).

A separate utility, `src/utils/tailwind-classes.ts`, enumerates the Tailwind class universe so the processor can strip or keep layout/utility classes during DOM-to-Markdown conversion — visible in the file's exhaustive lists of color, spacing, outline, and break utilities. Source: [src/utils/tailwind-classes.ts](https://github.com/jina-ai/reader/blob/main/src/utils/tailwind-classes.ts).

### 3.3 LLM and Search Adapters

LLM access is provider-agnostic. `src/services/common-llm/open-router.ts` defines a uniform DTO with `temperature`, `top_p`, `top_k`, `frequency_penalty`, `presence_penalty`, `seed`, plus tool-calling (`tools`, `tool_choice`) that is passed through to OpenAI-compatible providers. Source: [src/services/common-llm/open-router.ts](https://github.com/jina-ai/reader/blob/main/src/services/common-llm/open-router.ts). The companion `src/services/common-llm/misc.ts` provides `chatMLEncode`, a ChatML message assembler that supports system prompts, multi-message history, and arbitrary message metadata, enabling the same client to drive both OpenAI-style and ChatML-style backends. Source: [src/services/common-llm/misc.ts](https://github.com/jina-ai/reader/blob/main/src/services/common-llm/misc.ts).

Search is similarly normalized. `src/services/serp/compat.ts` defines a stable `WebSearchEntry` shape — `link`, `title`, `source`, `date`, `snippet`, `imageUrl`, `siteLinks`, and a `variant` discriminator — that every SERP provider must yield, regardless of upstream differences. Source: [src/services/serp/compat.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/compat.ts). The Google-specific adapter in `src/services/serp/serper.ts` implements explicit operators (`intitle`, `loc`, `site`) and assembles them into a single search string via the `addTo` method. Source: [src/services/serp/serper.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/serper.ts).

## 4. Operational Topology and Community-Reported Failure Modes

The Docker image (`ghcr.io/jina-ai/reader:oss`) exposes two ports: `8080` over HTTP/2 cleartext (production-grade, used by Cloud Run) and `8081` over HTTP/1.1 as a `curl`-friendly fallback. The image bundles headless Chrome, LibreOffice, and CJK fonts so PDFs, Office documents, and CJK pages are processed in-container. Source: [README.md](https://github.com/jina-ai/reader/blob/main/README.md).

Several community-reported issues map directly onto architectural seams in this overview and are worth flagging for self-hosters:

| Concern | Architectural Location | Community Reference |
|---|---|---|
| SSRF via DNS rebinding / per-hop redirect validation | Single-shot `assertNormalizedUrl` gate before fetch | #1252, #1253 |
| Page navigation timeout (default 30 s) | Headless Chrome pool under `generic-pool` | #1118 |
| Output size larger than raw HTML | Markdown normalization + frontmatter expansion in `tidyMarkdown` | #1250 |
| Extraction failures on simple pages | Readability + DOM-to-MD pipeline | #105, #1 |
| Regional blocking of `jina.ai` / `r.jina.ai` | Public hostname topology; mirror via `r.jinaai.cn` | #1237 |

Together these describe a layered system where **API parsing → fetching → processing → output** are independently configurable through headers and presets, and where each layer carries a known set of edge cases that the README's `x-*` option matrix is designed to address. Source: [README.md](https://github.com/jina-ai/reader/blob/main/README.md).

## See Also

- [Cookbooks](https://github.com/jina-ai/reader/blob/main/cookbooks.md) — pipeline-specific header recipes (RAG, indexing, research, agent, spider)
- [Docker deployment guide](https://github.com/jina-ai/reader#self-host-with-docker) — `ghcr.io/jina-ai/reader:oss` and port mapping
- [Rate limits and pricing](https://jina.ai/reader#pricing) — quota guidance for `r.jina.ai` / `s.jina.ai`

---

<a id='page-2'></a>

## URL Fetching Engines & Content Extraction

### Related Pages

Related topics: [System Overview & Architecture](#page-1), [Security, SSRF Protection & Abuse Mitigation](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/jina-ai/reader/blob/main/README.md)
- [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts)
- [src/api/searcher.ts](https://github.com/jina-ai/reader/blob/main/src/api/searcher.ts)
- [src/utils/markdown.ts](https://github.com/jina-ai/reader/blob/main/src/utils/markdown.ts)
- [src/utils/tailwind-classes.ts](https://github.com/jina-ai/reader/blob/main/src/utils/tailwind-classes.ts)
- [src/services/serp/serper.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/serper.ts)
- [src/services/serp/compat.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/compat.ts)
- [package.json](https://github.com/jina-ai/reader/blob/main/package.json)
</details>

# URL Fetching Engines & Content Extraction

## Overview

The `jina-ai/reader` service converts any HTTP-accessible resource into an LLM-friendly representation. The pipeline is exposed through two public endpoints: `r.jina.ai` for reading a single URL, and `s.jina.ai` for web search. Internally, the request is handled by the `CrawlerAPI.crawl` RPC ([src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts)), which validates the target URL, chooses a fetching engine, fetches the resource, normalizes the result, and serializes it according to the caller's requested output format.

Two core stages are involved:

1. **Fetching** — retrieving the remote document with one of the supported engines.
2. **Content extraction & formatting** — turning the raw response (HTML, PDF, image, etc.) into the chosen representation (Markdown, HTML, text, screenshot, or frontmatter).

## Fetching Engines

The `x-engine` request header selects the engine that the service uses to download a URL ([README.md](https://github.com/jina-ai/reader/blob/main/README.md)):

| Engine    | Description                                                                                          | Trade-offs                                                                  |
|-----------|------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| `auto`    | Default. Combines a lightweight curl path with a headless-browser fallback.                          | Best balance of cost and quality; preferred for general-purpose crawling.  |
| `curl`    | Plain HTTP fetch without JavaScript execution.                                                       | Cheapest, but cannot render SPAs or JS-driven content.                      |
| `browser` | Forces headless Chrome (Puppeteer, see [package.json](https://github.com/jina-ai/reader/blob/main/package.json) dependency on `puppeteer ^24.42.0`). | Highest fidelity; needed for sites that only serve real content to real browsers. |

The crawler normalizes the user-supplied URL through `MiscService.assertNormalizedUrl`, which performs DNS resolution and rejects non-public IP ranges to mitigate SSRF ([src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts)). However, as noted in community issues [#1252](https://github.com/jina-ai/reader/issues/1252) and [#1253](https://github.com/jina-ai/reader/issues/1253), this check is applied once before fetching; subsequent HTTP redirects are not re-validated per hop, which is a known limitation in self-hosted deployments.

A per-host circuit breaker (`puppeteerControl.circuitBreakerHosts`) blocks recursive or self-referential crawling and emits `abuse` events that can store a domain-wide blockade via `storageLayer.storeDomainBlockade` ([src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts)). When abuse is detected, the offending hostname is added to a temporary block list and excluded from future fetches.

```mermaid
flowchart LR
  A[Client request<br/>r.jina.ai/URL] --> B[assertNormalizedUrl<br/>SSRF gate]
  B --> C{Engine?}
  C -- auto --> D[curl first]
  C -- browser --> E[Puppeteer]
  C -- curl --> F[undici/curl]
  D --> G{Content OK?}
  G -- no --> E
  G -- yes --> H[Snapshot]
  E --> H
  F --> H
  H --> I[Format by x-respond-with]
  I --> J[Markdown / HTML / text / screenshot / frontmatter]
```

## Content Extraction Pipeline

After a successful fetch, the response body is dispatched based on MIME type. The crawler applies `jsdomControl.analyzeHTMLTextLite` to derive the document title and to narrow the snapshot ([src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts)). A side-load path exists for binary or non-HTML types — PDFs, Office documents, and images are converted externally (LibreOffice for Office, PDF.js for PDFs, a VLM for image captioning) and then re-introduced as HTML or markdown fragments.

The HTML → Markdown transformation is performed by `markify` (referenced in the README as `readability` for filtered markdown). The `tidyMarkdown` post-processor in [src/utils/markdown.ts](https://github.com/jina-ai/reader/blob/main/src/utils/markdown.ts) normalizes broken links that span multiple lines, collapses whitespace inside link labels, and re-attaches image syntax to the surrounding anchor — a frequent source of malformed Markdown from upstream converters. Tailwind utility class names are tracked in [src/utils/tailwind-classes.ts](https://github.com/jina-ai/reader/blob/main/src/utils/tailwind-classes.ts) so the converter can recognize and strip them from extracted content where appropriate.

Several response-modifier headers control the output shape ([README.md](https://github.com/jina-ai/reader/blob/main/README.md)):

- `x-respond-with` — `markdown`, `html`, `text`, `screenshot`, `pageshot`, `frontmatter`, or `markdown+frontmatter`.
- `x-retain-media` — control how `<video>`, `<audio>`, and embedded iframes appear (`link`, `none`, `text`, `image`, `html`).
- `x-with-links-summary` / `x-with-images-summary` — append a deduplicated footer of links or images.
- `x-markdown-chunking` — opt-in semantic chunking by heading level (`h1`–`h5`) or by block structure (`s1`–`s5`).
- `x-detach-invisibles` — strip elements with eventual `display:none` before snapshotting (forces browser engine, disables caching).
- `x-respond-with: frontmatter` — return Markdown with a YAML frontmatter block (`title`, `description`, `url`).
- `x-with-generated-alt` — use a VLM to caption images and emit `![Image [idx]: [VLM_caption]](img_URL)`.

Caching is keyed by an MD5 digest of the normalized URL ([src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts), `getUrlDigest`). The `x-cache-tolerance` header defines how stale a cached entry is allowed to be, while `x-no-cache: true` bypasses the cache entirely — useful when a stale or already-blocked response is being served.

## Response Formatting & Search Coupling

The `crawl` handler dispatches the formatted page to the appropriate serializer based on `crawlerOptions.respondWith` ([src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts)). The `pageshot` branch issues a 302 redirect to the screenshot URL or returns the raw PNG; `frontmatter` and `markdown+frontmatter` emit a YAML-headed Markdown document; the default path returns `text/plain`.

For search, the `SearcherAPI` reuses the same `CrawlerOptions` pipeline. The result entries from a SERP provider are first mapped by `mapSearchEntryToPartialFormattedPage` ([src/api/searcher.ts](https://github.com/jina-ai/reader/blob/main/src/api/searcher.ts)) and then each URL is fetched through the same engine and extractor machinery. Search query augmentation supports Google explicit operators (`intitle:`, `site:`, `loc:`) via `GoogleSearchExplicitOperatorsDto.addTo` ([src/services/serp/serper.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/serper.ts)), and the normalized entry shape is defined by the `WebSearchEntry` interface ([src/services/serp/compat.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/compat.ts)).

## Common Failure Modes & Mitigations

User-reported issues map to specific stages of the pipeline:

- **30 s navigation timeouts** ([#1118](https://github.com/jina-ai/reader/issues/1118)) — Puppeteer's default `goto` timeout fires on slow sites. Retry with `x-engine: auto` (which prefers curl) or use an API key to access the internal proxy.
- **Empty extraction on simple pages** ([#105](https://github.com/jina-ai/reader/issues/105)) — usually caused by aggressive readability filtering; switch to `x-respond-with: markdown+frontmatter` or `html` to bypass filtering.
- **Dynamic / hidden content** ([#1242](https://github.com/jina-ai-reader/issues/1242)) — content loaded by JavaScript is missed by curl; use `x-engine: browser` or `x-detach-invisibles: true`.
- **Output larger than source HTML** ([#1250](https://github.com/jina-ai/reader/issues/1250)) — frontmatter, link summaries, and image captions add overhead; disable with `x-with-links-summary: false` and use plain `markdown` instead of `frontmatter`.
- **Images rendered as raw URLs** ([#1251](https://github.com/jina-ai/reader/issues/1251)) — by default, images become links; set `x-retain-images: alt` or `image` to embed them as `![…](…)` syntax.
- **Geo-blocking** ([#1237](https://github.com/jina-ai/reader/issues/1237)) — `jina.ai` is DNS-poisoned in some regions; transition aliases `r.jinaai.cn` and `s.jinaai.cn` are provided.
- **SSRF in self-hosted deployments** ([#1252](https://github.com/jina-ai/reader/issues/1252), [#1253](https://github.com/jina-ai/reader/issues/1253)) — the `assertNormalizedUrl` gate is single-shot and does not re-validate redirect targets; self-hosters should front the service with an outbound proxy that enforces per-hop checks.

## See Also

- [Crawler API & Options (src/api/crawler.ts, src/dto/crawler-options.ts)](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts)
- [Search API (src/api/searcher.ts)](https://github.com/jina-ai/reader/blob/main/src/api/searcher.ts)
- [Markdown post-processing (src/utils/markdown.ts)](https://github.com/jina-ai/reader/blob/main/src/utils/markdown.ts)
- [Cookbook recipes (cookbooks.md)](https://github.com/jina-ai/reader/blob/main/cookbooks.md)
- [Self-hosting guide (README.md § Self-host with Docker)](https://github.com/jina-ai/reader/blob/main/README.md)

---

<a id='page-3'></a>

## Security, SSRF Protection & Abuse Mitigation

### Related Pages

Related topics: [System Overview & Architecture](#page-1), [Search, Proxies, Caching & Self-Hosting Deployment](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/services/serp/puppeteer.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/puppeteer.ts)
- [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts)
- [src/services/serp/serper.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/serper.ts)
- [src/services/serp/compat.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/compat.ts)
- [src/services/common-llm/open-router.ts](https://github.com/jina-ai/reader/blob/main/src/services/common-llm/open-router.ts)
- [src/services/common-iminterrogate/instruct-blip.ts](https://github.com/jina-ai/reader/blob/main/src/services/common-iminterrogate/instruct-blip.ts)
- [src/utils/markdown.ts](https://github.com/jina-ai/reader/blob/main/src/utils/markdown.ts)
- [package.json](https://github.com/jina-ai/reader/blob/main/package.json)
- [README.md](https://github.com/jina-ai/reader/blob/main/README.md)
</details>

# Security, SSRF Protection & Abuse Mitigation

Reader is a server-side URL fetcher: a client submits a URL, the service retrieves it, and the response is converted into LLM-friendly Markdown. Because the server is willing to fetch *any* URL the user supplies, the entire service is fundamentally an SSRF surface. The open-source branch in this repository mitigates that risk with a layered set of guards: a one-shot URL/IP validator, a hardened browser engine, abuse detection events, timeouts, and the engine-selection knobs exposed in the request DTOs.

This page documents the security mechanisms visible in the codebase and the failure modes that have surfaced in the community.

## High-Level Threat Model

Reader accepts an arbitrary URL and returns a structured representation of the remote response. An attacker who can submit a URL can therefore:

- Probe internal services bound to `127.0.0.1`, `169.254.169.254` (cloud metadata), `10.0.0.0/8`, `192.168.0.0/16`, `fc00::/7`, and other non-public ranges.
- Use the public service as an open proxy to anonymize traffic.
- Cause the headless browser to render attacker-controlled JavaScript and exfiltrate data.
- Drive up cost by pointing the service at large or slow resources.

The defenses below are organised so that the cheapest checks run first (URL normalization and IP classification), and the most expensive checks (full browser execution) only run for requests that have already cleared the gate.

## URL Normalization and the SSRF Gate

Per the published security advisories (issues #1252 and #1253), the entry-point guard lives in `MiscService.assertNormalizedUrl`, which performs a one-shot check that resolves the hostname and rejects non-public IP ranges. This is a single-shot gate: it is applied once to the URL the caller submitted, not on every redirect hop.

The practical implications for self-hosted operators are significant:

- An attacker who can control a DNS record (or who can race a TTL flip) can have a hostname resolve to a public IP at validation time and to an internal IP at fetch time — a TOCTOU bypass documented in issue #1253.
- An attacker who can convince the upstream server to issue a `30x` redirect to an internal address is *not* re-validated per hop, as documented in issue #1252. The fetch follows HTTP redirects with no per-hop re-validation, so a public origin can hand the browser a `Location:` header pointing at `http://169.254.169.254/...`.

Both classes of bypass are known and apply to the open-source branch at the affected commits. Operators of self-hosted deployments are expected to add their own egress firewall (e.g. an `iptables`/`nftables` rule that drops RFC1918 destinations from the container) until upstream adds per-hop re-validation. Source: [README.md](https://github.com/jina-ai/reader/blob/main/README.md) (Docker self-host section) and community issue links above.

## Browser Engine Hardening

When the request forces the browser engine (`X-Engine: browser` or `auto` with dynamic content), the fetcher delegates to the Puppeteer-based service. Several defensive behaviours are visible in the source:

- **Referrer injection.** The browser request is given a `referer` option on `page.goto`, which prevents some classes of origin-bound CSRF and helps upstream logs distinguish Reader traffic. Source: [src/services/serp/puppeteer.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/puppeteer.ts).
- **Multi-event wait.** Navigation waits on `load`, `domcontentloaded`, and `networkidle0` together so the rendered DOM is stable before extraction. Source: [src/services/serp/puppeteer.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/puppeteer.ts).
- **Hard timeout.** A `timeoutMs` (default `30_000`) caps the navigation. On `TimeoutError` the request is converted to a `AssertionFailureError` and surfaced as HTTP 422 with a readable message — this is the same error shape users see in issue #1118 (`Failed to goto ... : TimeoutError: Navigation timeout of 30000 ms exceeded`). Source: [src/services/serp/puppeteer.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/puppeteer.ts).
- **Listener cleanup.** A `crippleListener` is registered and removed on the close path so that an aborted navigation cannot leak the `_REPORT_FUNCTION_NAME` handler into the next request. Source: [src/services/serp/puppeteer.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/puppeteer.ts).

```mermaid
flowchart LR
    A[Client submits URL] --> B[assertNormalizedUrl]
    B -->|non-public IP| X[Reject 422]
    B -->|public IP| C{Engine?}
    C -->|curl| D[libcurl-impersonate fetch]
    C -->|browser| E[Headless Chrome + Puppeteer]
    D --> F[Readability / Markdown]
    E --> F
    E -->|abuse event| Y[SecurityCompromiseError]
    E -->|timeout| Z[AssertionFailureError 422]
    F --> G[Response]
```

## Abuse Detection

The browser path emits an `abuse` page event that the service translates into a `SecurityCompromiseError`. The handler rejects the request promise and surfaces the reason to the caller. Source: [src/services/serp/puppeteer.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/puppeteer.ts).

Downstream, the crawler API decides how to envelope the response. Hard errors (assertion failures, security compromises) are wrapped with `envelope: null` and a non-2xx status, while successful markdown, HTML, screenshot, and frontmatter responses carry `Content-Type` headers explicitly set in [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts). The combination means a misbehaving origin cannot force the API to emit a different content type than the one declared.

## Output Sanitisation

Once a page is rendered, the Markdown pipeline runs through `tidyMarkdown`, which normalises broken cross-line links, collapses stray whitespace, and removes script-like content from link text. Source: [src/utils/markdown.ts](https://github.com/jina-ai/reader/blob/main/src/utils/markdown.ts). This step is the last line of defence against prompt-injection payloads that hide in alt text or link titles, and it is also the cause of the user-reported behaviour in issue #1251 where images appear as raw `[idx](url)` placeholders rather than inline media.

## Common Failure Modes

| Symptom | Likely cause | Reference |
|---|---|---|
| `Failed to goto ... : TimeoutError: Navigation timeout of 30000 ms exceeded` | Slow upstream or blocking script; raise `timeoutMs` or switch to `curl` engine | issue #1118 |
| Reader returns a plain text page with no images | Image extraction disabled or `tidyMarkdown` stripped the media tags | issue #1251 |
| Service blocked at the network layer | DNS poisoning in the operator's region; switch to regional mirror | issue #1237 |
| SSRF succeeds in self-hosted deployment | Missing egress firewall; add per-hop validation in front of the container | issues #1252, #1253 |
| `SecurityCompromiseError` returned | Browser engine emitted an `abuse` event from the target page | [puppeteer.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/puppeteer.ts) |

## Operator Recommendations

For self-hosted deployments, the published minimum hardening is: (1) put the container behind an egress proxy that drops RFC1918, link-local, and cloud-metadata destinations; (2) front the API with an authenticating reverse proxy so anonymous abuse cannot use the public quota; and (3) pin the image to a known commit and monitor the security advisory feed. The single-shot IP gate in `assertNormalizedUrl` is necessary but, as the community evidence shows, not sufficient.

## See Also

- [README.md](https://github.com/jina-ai/reader/blob/main/README.md) — usage, options, and Docker self-host instructions
- [cookbooks.md](https://github.com/jina-ai/reader/blob/main/cookbooks.md) — preset and header combinations
- [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts) — request envelope and content-type decisions
- [src/services/serp/puppeteer.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/puppeteer.ts) — browser engine and abuse handler
- [src/utils/markdown.ts](https://github.com/jina-ai/reader/blob/main/src/utils/markdown.ts) — output sanitisation

---

<a id='page-4'></a>

## Search, Proxies, Caching & Self-Hosting Deployment

### Related Pages

Related topics: [System Overview & Architecture](#page-1), [Security, SSRF Protection & Abuse Mitigation](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/jina-ai/reader/blob/main/README.md)
- [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts)
- [src/services/serp/compat.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/compat.ts)
- [src/services/serp/serper.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/serper.ts)
- [src/services/serp/puppeteer.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/puppeteer.ts)
- [package.json](https://github.com/jina-ai/reader/blob/main/package.json)
</details>

# Search, Proxies, Caching & Self-Hosting Deployment

This page covers the four operational pillars of the open-source `jina-ai/reader` codebase behind `r.jina.ai` and `s.jina.ai`: the SERP-driven **search** layer, the multi-tier **proxy** machinery that makes hostile sites fetchable, the **caching** model that keeps repeated reads cheap, and the **self-hosting** path for users who want to run Reader on their own infrastructure.

## 1. Search (`s.jina.ai`)

Search is implemented as a thin orchestration layer that reuses the same fetch + extraction stack used by `r.jina.ai`. When a user hits `s.jina.ai/<query>`, Reader calls a SERP provider, takes the top results, and then runs each result through the crawler — returning the full extracted content rather than just a title/snippet.

### Search providers and result shape

The SERP subsystem is pluggable: `serper` (Serper.dev Google SERP), `google` (direct Google), `bing`, and `puppeteer` (browser-driven scraping) all return a normalized `WebSearchEntry` shape defined in [src/services/serp/compat.ts:1-14](https://github.com/jina-ai/reader/blob/main/src/services/serp/compat.ts). The shape covers web/image/news variants and includes inline `siteLinks`:

```typescript
export interface WebSearchEntry {
    link: string;
    title: string;
    source?: string;
    date?: string;
    snippet?: string;
    imageUrl?: string;
    siteLinks?: { link: string; title: string; snippet?: string; }[];
    variant?: 'web' | 'images' | 'news';
}
```

### Search operators

Advanced query syntax (e.g. `site:`, `intitle:`, `loc:`) is built up programmatically. Source: [src/services/serp/serper.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/serper.ts) — the `addTo()` method composes a Google-style query string from the typed `GoogleSearchExplicitOperatorsDto` and joins chunks with `AND` / `OR` as appropriate.

### In-site search

For restricting results to specific domains, the docs in [README.md](https://github.com/jina-ai/reader/blob/main/README.md) recommend the `site` query parameter (repeatable):

```bash
curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'
```

## 2. Proxies

Reader distinguishes two proxy modes — a user-supplied explicit proxy and an internal pool of "allocated" proxies that the service can fall back on when a direct fetch returns a thin page or a non-200 status.

### Explicit proxy

The `x-proxy-url` request header (documented in [README.md](https://github.com/jina-ai/reader/blob/main/README.md)) routes the entire traffic through the caller's own proxy. Anonymous traffic is rate-limited aggressively; an API key unlocks the internal proxy pool.

### Allocated proxy fallback

The crawler code in [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts) decides per-request whether to retry with an allocated proxy. The relevant guard (paraphrased) is: if no explicit proxy is set, no `none` allocation has been requested, the page is short (under ~42 tokens by the lightweight analyzer), or the first side-load returned a non-2xx, the crawler calls `sideLoadWithAllocatedProxy()`. Source: [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts) — search for the `if ((!crawlOpts?.allocProxy || crawlOpts.allocProxy !== 'none') && ...)` block.

### Abuse-driven domain blockades

When a request looks abusive (e.g. bot-challenge farms, repeated CAPTCHAs), the puppeteer control emits an `abuse` event. The crawler subscribes, logs it, and writes a `DomainBlockade` to the storage layer with a TTL. Source: [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts) — `puppeteerControl.on('abuse', ...)` handler storing `{ domain, triggerReason, triggerUrl, expireAt }`.

### Curl-side impersonation

For non-browser fetches, the curl engine is told to impersonate whatever User-Agent the puppeteer layer negotiated, so anti-bot heuristics see a consistent client. Source: [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts) — `init()` calls `this.curlControl.impersonateChrome(this.puppeteerControl.effectiveUA)`.

## 3. Caching

Caching is opt-in by TTL and opt-out by header, and is also implicitly disabled whenever the response would be unsafe to reuse (cookies, dynamic content, large screenshots).

| Header / knob | Effect | Source |
| --- | --- | --- |
| `x-cache-tolerance` | Integer seconds; how stale a cached response may be. | [README.md](https://github.com/jina-ai/reader/blob/main/README.md) |
| `x-no-cache: true` | Force a fresh fetch — use when a stale or already-blocked response is cached. | [README.md](https://github.com/jina-ai-ai/reader/blob/main/README.md) |
| `x-set-cookie` | Forward cookies; **caches are bypassed** for these requests. | [README.md](https://github.com/jina-ai/reader/blob/main/README.md) |
| `x-detach-invisibles` | Implies browser engine and **disables caching**. | [README.md](https://github.com/jina-ai/reader/blob/main/README.md) |
| `setToCache(...)` | Internal eligibility check via `options.eligibleForPageIndex`. | [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts) |

The open-source branch is stateless out of the box; persistent caching requires the bundled `docker compose` with a MinIO/S3-compatible bucket — see the *Local development* note in [README.md](https://github.com/jina-ai/reader/blob/main/README.md).

## 4. Self-Hosting Deployment

A prebuilt image is published to the GitHub Container Registry with headless Chrome, LibreOffice, and CJK fonts bundled. Source: [README.md](https://github.com/jina-ai/reader/blob/main/README.md).

```bash
docker pull ghcr.io/jina-ai/reader:oss
```

### Port layout

The image exposes two ports, both handled by the same Koa app:

- **`8080`** — h2c (HTTP/2 cleartext). Production-grade and multiplexed; matches what Cloud Run talks to. Plain `curl` needs `--http2-prior-knowledge` to use it.
- **`8081`** — HTTP/1.1 fallback. Use this from browsers, `curl`, and most clients.

For a quick local try-out, only the HTTP/1.1 port needs to be mapped. Source: [README.md](https://github.com/jina-ai/reader/blob/main/README.md) — *Self-host with Docker* section.

### External assets

Geo IP, IP-to-ASN, CJK font (`SourceHanSansSC-Regular.otf`), and the user-agent list used by the curl engine are downloaded by a single script:

```bash
npm run assets:download
```

The script is idempotent and skippable via `SKIP_DOWNLOAD_EXTERNAL=1`; CI also fetches these URLs inline. Source: [README.md](https://github.com/jina-ai/reader/blob/main/README.md).

### Request flow

```mermaid
flowchart LR
    Client["curl / browser"] -->|HTTP| Reader[Reader on :8080 h2c or :8081 HTTP1.1]
    Reader --> SERP[SERP layer]
    SERP -->|top-N results| Crawler[Crawler]
    Crawler -->|direct or x-proxy-url| Target[Target website]
    Crawler -->|fallback: allocated proxy| Target
    Crawler --> Cache[(MinIO/S3 bucket)]
    Crawler --> Client
```

## Common failure modes and mitigations

Several issues reported in the community map directly onto the four pillars above:

- **DNS / geo blocks** ([issue #1237](https://github.com/jina-ai/reader/issues/1237)) — the public `jina.ai` domain was DNS-poisoned for some regions; Jina recommends the `r.jinaai.cn` / `s.jinaai.cn` transition domains or an API key with a dedicated egress address.
- **Stale or blocked response** — bypass cache with `-H 'x-no-cache: true'`, then escalate to `x-engine: browser`, then to `x-proxy-url`. Source: [README.md](https://github.com/jina-ai/reader/blob/main/README.md) — *Having trouble on some websites?* section.
- **Timeouts on slow sites** ([issue #1118](https://github.com/jina-ai/reader/issues/1118)) — Reader surfaces a `TimeoutError: Navigation timeout of 30000 ms exceeded`. Retry with `x-engine: browser` and a longer tolerance, or pre-warm the cache once and reuse.
- **SSRF hardening in self-hosted deployments** ([issue #1252](https://github.com/jina-ai/reader/issues/1252), [#1253](https://github.com/jina-ai/reader/issues/1253)) — self-hosters must not expose Reader to untrusted input without re-applying the URL gate per redirect hop. See the `MiscService.assertNormalizedUrl` flow described in those reports.

## See Also

- [README.md](https://github.com/jina-ai/reader/blob/main/README.md) — full header reference and cookbooks index.
- [src/dto/crawler-options.ts](https://github.com/jina-ai/reader/blob/main/src/dto/crawler-options.ts) — DTO-level defaults and validation for every `x-*` header.
- [src/api/crawler.ts](https://github.com/jina-ai/reader/blob/main/src/api/crawler.ts) — abuse event handling, cache eligibility, allocated-proxy fallback.
- [src/services/serp/compat.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/compat.ts) — `WebSearchEntry` shape.
- [src/services/serp/serper.ts](https://github.com/jina-ai/reader/blob/main/src/services/serp/serper.ts) — search-operator DSL.

---

<!-- evidence_pipeline_checked: true -->
<!-- evidence_injected: true -->

---

## Pitfall Log

Project: jina-ai/reader

Summary: Found 28 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Security or permission risk - Security or permission risk requires verification.

## 1. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Developers should check this security_permissions risk before relying on the project: Server-Side Request Forgery via domain resolution bypass in self-hosted deployments
- User impact: Developers may expose sensitive permissions or credentials: Server-Side Request Forgery via domain resolution bypass in self-hosted deployments
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1253

## 2. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Developers should check this security_permissions risk before relying on the project: Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF gate not re-applied per redirect hop)
- User impact: Developers may expose sensitive permissions or credentials: Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF gate not re-applied per redirect hop)
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1252

## 3. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: runtime_trace
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Repro command: `docker run --rm -p 3000:8081 ghcr.io/jina-ai/reader:oss # then: curl http://localhost:3000/https://example.com`
- Evidence: identity.distribution | https://github.com/jina-ai/reader

## 4. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this installation risk before relying on the project: npm run build failed because shared files are not found
- User impact: Developers may fail before the first successful local run: npm run build failed because shared files are not found
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/3

## 5. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/3

## 6. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/2

## 7. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: Improve content extraction logic to handle dynamic and hidden elements
- User impact: Developers may misconfigure credentials, environment, or host setup: Improve content extraction logic to handle dynamic and hidden elements
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1242

## 8. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: Respect robots.txt and identify your system
- User impact: Developers may misconfigure credentials, environment, or host setup: Respect robots.txt and identify your system
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/4

## 9. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: support docker deployment
- User impact: Developers may misconfigure credentials, environment, or host setup: support docker deployment
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/2

## 10. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/jina-ai/reader

## 11. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: Failed to go to
- User impact: Developers may hit a documented source-backed failure mode: Failed to go to
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1118

## 12. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: Reader doesn't extract any content from this page even though its quite simple?
- User impact: Developers may hit a documented source-backed failure mode: Reader doesn't extract any content from this page even though its quite simple?
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/105

## 13. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/1118

## 14. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/jina-ai/reader

## 15. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/jina-ai/reader

## 16. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/jina-ai/reader

## 17. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/1250

## 18. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/1242

## 19. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/1253

## 20. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/1252

## 21. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/2

## 22. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Bug/Optimization: Reader Output size is larger than Raw HTML size
- User impact: Developers may hit a documented source-backed failure mode: Bug/Optimization: Reader Output size is larger than Raw HTML size
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1250

## 23. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Extraction didn't work
- User impact: Developers may hit a documented source-backed failure mode: Extraction didn't work
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1

## 24. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Project evidence flags a capability evidence risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1237

## 25. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Pile in reader format
- User impact: Developers may hit a documented source-backed failure mode: Pile in reader format
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/5

## 26. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: В странном виде сайты открываются.
- User impact: Developers may hit a documented source-backed failure mode: В странном виде сайты открываются.
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1251

## 27. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown。
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/jina-ai/reader

## 28. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown。
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/jina-ai/reader

<!-- canonical_name: jina-ai/reader; human_manual_source: deepwiki_human_wiki -->
