Doramagic Project Pack · Human Manual

reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/

System Overview & Architecture

Related topics: URL Fetching Engines & Content Extraction, Search, Proxies, Caching & Self-Hosting Deployment

Section Related Pages

Continue reading this section for the full explanation and source context.

Section 3.1 Fetching and Rendering

Continue reading this section for the full explanation and source context.

Section 3.2 Content Processing

Continue reading this section for the full explanation and source context.

Section 3.3 LLM and Search Adapters

Continue reading this section for the full explanation and source context.

Related topics: URL Fetching Engines & Content Extraction, Search, Proxies, Caching & Self-Hosting Deployment

System Overview & Architecture

1. Purpose and Scope

Reader is Jina AI's open-source service that turns arbitrary web content into clean, LLM-friendly input. It exposes two public surfaces: a Read endpoint (https://r.jina.ai/) that converts any URL — HTML page, PDF, Office document, or image — into Markdown, HTML, text, or a screenshot; and a Search endpoint (https://s.jina.ai/) that issues a web search and returns structured results with snippets, links, and metadata. Source: README.md.

The codebase is a TypeScript application targeting Node.js >=22.15 and is packaged as a single civkit-based service. Source: package.json. The package.json exports field advertises three entry points — ./crawl, ./search, and ./serp — indicating the project is logically decomposed along those three verbs even though it ships as one binary.

The project explicitly positions itself for two downstream use cases: feeding RAG pipelines and powering AI agents. The README's header option matrix (x-preset, x-engine, x-respond-with, etc.) is designed around those scenarios, with named presets such as reader, index, research, agent, and spider. Source: README.md.

2. High-Level Architecture

The runtime is a Koa-style HTTP/2 service built on the civkit framework, with an inverted-dependency container (tsyringe) wiring together fetcher engines, content processors, LLM adapters, and SERP providers. The package.json dependency list (koa, civkit, tsyringe, undici, @mozilla/readability, @nomagick/node-libcurl-impersonate, generic-pool, pino-pretty) maps directly onto that layered design. Source: package.json.

flowchart LR
    A[Client / LLM Agent] -->|URL or query| B(API Layer<br/>r.jina.ai / s.jina.ai)
    B --> C{Engine Selector<br/>x-engine: browser / curl / auto}
    C -->|browser| D[Headless Chrome Pool<br/>generic-pool]
    C -->|curl| E[node-libcurl-impersonate]
    D --> F[Content Pipeline]
    E --> F
    F --> G[Readability + Markdown tidy]
    G --> H[Output Formatter<br/>markdown / html / text / screenshot / frontmatter]
    F --> I[SSRF Gate<br/>assertNormalizedUrl]
    B --> J[SERP Adapters<br/>serper, compat]
    J --> K[WebSearchEntry]
    F --> L[LLM Adapters<br/>OpenRouter, misc]
    H --> A
    K --> A

The diagram captures the four responsibilities the codebase is organized around: inbound API parsing (the many X-* headers and JSON body options), fetching (browser pool vs. libcurl impersonation), content processing (readability, markdown normalization, image captioning via VLMs), and output shaping (presets, chunking, frontmatter). Source: README.md.

3. Service Layers and Modules

3.1 Fetching and Rendering

Two fetch backends coexist. The browser engine runs headless Chrome through a pooled allocator (generic-pool is a direct dependency), enabling JavaScript execution and screenshot capture — required for SPAs and dynamic content, a gap reported in the community (#1242 "Improve content extraction logic to handle dynamic and hidden elements"). The curl engine uses @nomagick/node-libcurl-impersonate for lightweight, non-JS fetches that still mimic real browser TLS fingerprints. The default auto engine decides per URL. Sources: README.md, package.json.

3.2 Content Processing

After fetching, HTML is reduced to LLM input through a pipeline that includes @mozilla/readability and a markdown normalization pass. The tidyMarkdown helper consolidates links and images that have been split across lines, normalizes whitespace, and re-wraps nested image-in-link fragments into a single tidy markdown token, which is critical because raw Chromium-to-markdown output is often broken when long links or inline images wrap. Source: src/utils/markdown.ts.

A separate utility, src/utils/tailwind-classes.ts, enumerates the Tailwind class universe so the processor can strip or keep layout/utility classes during DOM-to-Markdown conversion — visible in the file's exhaustive lists of color, spacing, outline, and break utilities. Source: src/utils/tailwind-classes.ts.

3.3 LLM and Search Adapters

LLM access is provider-agnostic. src/services/common-llm/open-router.ts defines a uniform DTO with temperature, top_p, top_k, frequency_penalty, presence_penalty, seed, plus tool-calling (tools, tool_choice) that is passed through to OpenAI-compatible providers. Source: src/services/common-llm/open-router.ts. The companion src/services/common-llm/misc.ts provides chatMLEncode, a ChatML message assembler that supports system prompts, multi-message history, and arbitrary message metadata, enabling the same client to drive both OpenAI-style and ChatML-style backends. Source: src/services/common-llm/misc.ts.

Search is similarly normalized. src/services/serp/compat.ts defines a stable WebSearchEntry shape — link, title, source, date, snippet, imageUrl, siteLinks, and a variant discriminator — that every SERP provider must yield, regardless of upstream differences. Source: src/services/serp/compat.ts. The Google-specific adapter in src/services/serp/serper.ts implements explicit operators (intitle, loc, site) and assembles them into a single search string via the addTo method. Source: src/services/serp/serper.ts.

4. Operational Topology and Community-Reported Failure Modes

The Docker image (ghcr.io/jina-ai/reader:oss) exposes two ports: 8080 over HTTP/2 cleartext (production-grade, used by Cloud Run) and 8081 over HTTP/1.1 as a curl-friendly fallback. The image bundles headless Chrome, LibreOffice, and CJK fonts so PDFs, Office documents, and CJK pages are processed in-container. Source: README.md.

Several community-reported issues map directly onto architectural seams in this overview and are worth flagging for self-hosters:

ConcernArchitectural LocationCommunity Reference
SSRF via DNS rebinding / per-hop redirect validationSingle-shot assertNormalizedUrl gate before fetch#1252, #1253
Page navigation timeout (default 30 s)Headless Chrome pool under generic-pool#1118
Output size larger than raw HTMLMarkdown normalization + frontmatter expansion in tidyMarkdown#1250
Extraction failures on simple pagesReadability + DOM-to-MD pipeline#105, #1
Regional blocking of jina.ai / r.jina.aiPublic hostname topology; mirror via r.jinaai.cn#1237

Together these describe a layered system where API parsing → fetching → processing → output are independently configurable through headers and presets, and where each layer carries a known set of edge cases that the README's x-* option matrix is designed to address. Source: README.md.

See Also

Source: https://github.com/jina-ai/reader / Human Manual

URL Fetching Engines & Content Extraction

Related topics: System Overview & Architecture, Security, SSRF Protection & Abuse Mitigation

Section Related Pages

Continue reading this section for the full explanation and source context.

Related topics: System Overview & Architecture, Security, SSRF Protection & Abuse Mitigation

URL Fetching Engines & Content Extraction

Overview

The jina-ai/reader service converts any HTTP-accessible resource into an LLM-friendly representation. The pipeline is exposed through two public endpoints: r.jina.ai for reading a single URL, and s.jina.ai for web search. Internally, the request is handled by the CrawlerAPI.crawl RPC (src/api/crawler.ts), which validates the target URL, chooses a fetching engine, fetches the resource, normalizes the result, and serializes it according to the caller's requested output format.

Two core stages are involved:

  1. Fetching — retrieving the remote document with one of the supported engines.
  2. Content extraction & formatting — turning the raw response (HTML, PDF, image, etc.) into the chosen representation (Markdown, HTML, text, screenshot, or frontmatter).

Fetching Engines

The x-engine request header selects the engine that the service uses to download a URL (README.md):

EngineDescriptionTrade-offs
autoDefault. Combines a lightweight curl path with a headless-browser fallback.Best balance of cost and quality; preferred for general-purpose crawling.
curlPlain HTTP fetch without JavaScript execution.Cheapest, but cannot render SPAs or JS-driven content.
browserForces headless Chrome (Puppeteer, see package.json dependency on puppeteer ^24.42.0).Highest fidelity; needed for sites that only serve real content to real browsers.

The crawler normalizes the user-supplied URL through MiscService.assertNormalizedUrl, which performs DNS resolution and rejects non-public IP ranges to mitigate SSRF (src/api/crawler.ts). However, as noted in community issues #1252 and #1253, this check is applied once before fetching; subsequent HTTP redirects are not re-validated per hop, which is a known limitation in self-hosted deployments.

A per-host circuit breaker (puppeteerControl.circuitBreakerHosts) blocks recursive or self-referential crawling and emits abuse events that can store a domain-wide blockade via storageLayer.storeDomainBlockade (src/api/crawler.ts). When abuse is detected, the offending hostname is added to a temporary block list and excluded from future fetches.

flowchart LR
  A[Client request<br/>r.jina.ai/URL] --> B[assertNormalizedUrl<br/>SSRF gate]
  B --> C{Engine?}
  C -- auto --> D[curl first]
  C -- browser --> E[Puppeteer]
  C -- curl --> F[undici/curl]
  D --> G{Content OK?}
  G -- no --> E
  G -- yes --> H[Snapshot]
  E --> H
  F --> H
  H --> I[Format by x-respond-with]
  I --> J[Markdown / HTML / text / screenshot / frontmatter]

Content Extraction Pipeline

After a successful fetch, the response body is dispatched based on MIME type. The crawler applies jsdomControl.analyzeHTMLTextLite to derive the document title and to narrow the snapshot (src/api/crawler.ts). A side-load path exists for binary or non-HTML types — PDFs, Office documents, and images are converted externally (LibreOffice for Office, PDF.js for PDFs, a VLM for image captioning) and then re-introduced as HTML or markdown fragments.

The HTML → Markdown transformation is performed by markify (referenced in the README as readability for filtered markdown). The tidyMarkdown post-processor in src/utils/markdown.ts normalizes broken links that span multiple lines, collapses whitespace inside link labels, and re-attaches image syntax to the surrounding anchor — a frequent source of malformed Markdown from upstream converters. Tailwind utility class names are tracked in src/utils/tailwind-classes.ts so the converter can recognize and strip them from extracted content where appropriate.

Several response-modifier headers control the output shape (README.md):

  • x-respond-withmarkdown, html, text, screenshot, pageshot, frontmatter, or markdown+frontmatter.
  • x-retain-media — control how <video>, <audio>, and embedded iframes appear (link, none, text, image, html).
  • x-with-links-summary / x-with-images-summary — append a deduplicated footer of links or images.
  • x-markdown-chunking — opt-in semantic chunking by heading level (h1h5) or by block structure (s1s5).
  • x-detach-invisibles — strip elements with eventual display:none before snapshotting (forces browser engine, disables caching).
  • x-respond-with: frontmatter — return Markdown with a YAML frontmatter block (title, description, url).
  • x-with-generated-alt — use a VLM to caption images and emit ![Image [idx]: [VLM_caption]](img_URL).

Caching is keyed by an MD5 digest of the normalized URL (src/api/crawler.ts, getUrlDigest). The x-cache-tolerance header defines how stale a cached entry is allowed to be, while x-no-cache: true bypasses the cache entirely — useful when a stale or already-blocked response is being served.

Response Formatting & Search Coupling

The crawl handler dispatches the formatted page to the appropriate serializer based on crawlerOptions.respondWith (src/api/crawler.ts). The pageshot branch issues a 302 redirect to the screenshot URL or returns the raw PNG; frontmatter and markdown+frontmatter emit a YAML-headed Markdown document; the default path returns text/plain.

For search, the SearcherAPI reuses the same CrawlerOptions pipeline. The result entries from a SERP provider are first mapped by mapSearchEntryToPartialFormattedPage (src/api/searcher.ts) and then each URL is fetched through the same engine and extractor machinery. Search query augmentation supports Google explicit operators (intitle:, site:, loc:) via GoogleSearchExplicitOperatorsDto.addTo (src/services/serp/serper.ts), and the normalized entry shape is defined by the WebSearchEntry interface (src/services/serp/compat.ts).

Common Failure Modes & Mitigations

User-reported issues map to specific stages of the pipeline:

  • 30 s navigation timeouts (#1118) — Puppeteer's default goto timeout fires on slow sites. Retry with x-engine: auto (which prefers curl) or use an API key to access the internal proxy.
  • Empty extraction on simple pages (#105) — usually caused by aggressive readability filtering; switch to x-respond-with: markdown+frontmatter or html to bypass filtering.
  • Dynamic / hidden content (#1242) — content loaded by JavaScript is missed by curl; use x-engine: browser or x-detach-invisibles: true.
  • Output larger than source HTML (#1250) — frontmatter, link summaries, and image captions add overhead; disable with x-with-links-summary: false and use plain markdown instead of frontmatter.
  • Images rendered as raw URLs (#1251) — by default, images become links; set x-retain-images: alt or image to embed them as !… syntax.
  • Geo-blocking (#1237) — jina.ai is DNS-poisoned in some regions; transition aliases r.jinaai.cn and s.jinaai.cn are provided.
  • SSRF in self-hosted deployments (#1252, #1253) — the assertNormalizedUrl gate is single-shot and does not re-validate redirect targets; self-hosters should front the service with an outbound proxy that enforces per-hop checks.

See Also

Source: https://github.com/jina-ai/reader / Human Manual

Security, SSRF Protection & Abuse Mitigation

Related topics: System Overview & Architecture, Search, Proxies, Caching & Self-Hosting Deployment

Section Related Pages

Continue reading this section for the full explanation and source context.

Related topics: System Overview & Architecture, Search, Proxies, Caching & Self-Hosting Deployment

Security, SSRF Protection & Abuse Mitigation

Reader is a server-side URL fetcher: a client submits a URL, the service retrieves it, and the response is converted into LLM-friendly Markdown. Because the server is willing to fetch *any* URL the user supplies, the entire service is fundamentally an SSRF surface. The open-source branch in this repository mitigates that risk with a layered set of guards: a one-shot URL/IP validator, a hardened browser engine, abuse detection events, timeouts, and the engine-selection knobs exposed in the request DTOs.

This page documents the security mechanisms visible in the codebase and the failure modes that have surfaced in the community.

High-Level Threat Model

Reader accepts an arbitrary URL and returns a structured representation of the remote response. An attacker who can submit a URL can therefore:

  • Probe internal services bound to 127.0.0.1, 169.254.169.254 (cloud metadata), 10.0.0.0/8, 192.168.0.0/16, fc00::/7, and other non-public ranges.
  • Use the public service as an open proxy to anonymize traffic.
  • Cause the headless browser to render attacker-controlled JavaScript and exfiltrate data.
  • Drive up cost by pointing the service at large or slow resources.

The defenses below are organised so that the cheapest checks run first (URL normalization and IP classification), and the most expensive checks (full browser execution) only run for requests that have already cleared the gate.

URL Normalization and the SSRF Gate

Per the published security advisories (issues #1252 and #1253), the entry-point guard lives in MiscService.assertNormalizedUrl, which performs a one-shot check that resolves the hostname and rejects non-public IP ranges. This is a single-shot gate: it is applied once to the URL the caller submitted, not on every redirect hop.

The practical implications for self-hosted operators are significant:

  • An attacker who can control a DNS record (or who can race a TTL flip) can have a hostname resolve to a public IP at validation time and to an internal IP at fetch time — a TOCTOU bypass documented in issue #1253.
  • An attacker who can convince the upstream server to issue a 30x redirect to an internal address is *not* re-validated per hop, as documented in issue #1252. The fetch follows HTTP redirects with no per-hop re-validation, so a public origin can hand the browser a Location: header pointing at http://169.254.169.254/....

Both classes of bypass are known and apply to the open-source branch at the affected commits. Operators of self-hosted deployments are expected to add their own egress firewall (e.g. an iptables/nftables rule that drops RFC1918 destinations from the container) until upstream adds per-hop re-validation. Source: README.md (Docker self-host section) and community issue links above.

Browser Engine Hardening

When the request forces the browser engine (X-Engine: browser or auto with dynamic content), the fetcher delegates to the Puppeteer-based service. Several defensive behaviours are visible in the source:

  • Referrer injection. The browser request is given a referer option on page.goto, which prevents some classes of origin-bound CSRF and helps upstream logs distinguish Reader traffic. Source: src/services/serp/puppeteer.ts.
  • Multi-event wait. Navigation waits on load, domcontentloaded, and networkidle0 together so the rendered DOM is stable before extraction. Source: src/services/serp/puppeteer.ts.
  • Hard timeout. A timeoutMs (default 30_000) caps the navigation. On TimeoutError the request is converted to a AssertionFailureError and surfaced as HTTP 422 with a readable message — this is the same error shape users see in issue #1118 (Failed to goto ... : TimeoutError: Navigation timeout of 30000 ms exceeded). Source: src/services/serp/puppeteer.ts.
  • Listener cleanup. A crippleListener is registered and removed on the close path so that an aborted navigation cannot leak the _REPORT_FUNCTION_NAME handler into the next request. Source: src/services/serp/puppeteer.ts.
flowchart LR
    A[Client submits URL] --> B[assertNormalizedUrl]
    B -->|non-public IP| X[Reject 422]
    B -->|public IP| C{Engine?}
    C -->|curl| D[libcurl-impersonate fetch]
    C -->|browser| E[Headless Chrome + Puppeteer]
    D --> F[Readability / Markdown]
    E --> F
    E -->|abuse event| Y[SecurityCompromiseError]
    E -->|timeout| Z[AssertionFailureError 422]
    F --> G[Response]

Abuse Detection

The browser path emits an abuse page event that the service translates into a SecurityCompromiseError. The handler rejects the request promise and surfaces the reason to the caller. Source: src/services/serp/puppeteer.ts.

Downstream, the crawler API decides how to envelope the response. Hard errors (assertion failures, security compromises) are wrapped with envelope: null and a non-2xx status, while successful markdown, HTML, screenshot, and frontmatter responses carry Content-Type headers explicitly set in src/api/crawler.ts. The combination means a misbehaving origin cannot force the API to emit a different content type than the one declared.

Output Sanitisation

Once a page is rendered, the Markdown pipeline runs through tidyMarkdown, which normalises broken cross-line links, collapses stray whitespace, and removes script-like content from link text. Source: src/utils/markdown.ts. This step is the last line of defence against prompt-injection payloads that hide in alt text or link titles, and it is also the cause of the user-reported behaviour in issue #1251 where images appear as raw idx placeholders rather than inline media.

Common Failure Modes

SymptomLikely causeReference
Failed to goto ... : TimeoutError: Navigation timeout of 30000 ms exceededSlow upstream or blocking script; raise timeoutMs or switch to curl engineissue #1118
Reader returns a plain text page with no imagesImage extraction disabled or tidyMarkdown stripped the media tagsissue #1251
Service blocked at the network layerDNS poisoning in the operator's region; switch to regional mirrorissue #1237
SSRF succeeds in self-hosted deploymentMissing egress firewall; add per-hop validation in front of the containerissues #1252, #1253
SecurityCompromiseError returnedBrowser engine emitted an abuse event from the target pagepuppeteer.ts

Operator Recommendations

For self-hosted deployments, the published minimum hardening is: (1) put the container behind an egress proxy that drops RFC1918, link-local, and cloud-metadata destinations; (2) front the API with an authenticating reverse proxy so anonymous abuse cannot use the public quota; and (3) pin the image to a known commit and monitor the security advisory feed. The single-shot IP gate in assertNormalizedUrl is necessary but, as the community evidence shows, not sufficient.

See Also

Source: https://github.com/jina-ai/reader / Human Manual

Search, Proxies, Caching & Self-Hosting Deployment

Related topics: System Overview & Architecture, Security, SSRF Protection & Abuse Mitigation

Section Related Pages

Continue reading this section for the full explanation and source context.

Section Search providers and result shape

Continue reading this section for the full explanation and source context.

Section Search operators

Continue reading this section for the full explanation and source context.

Section In-site search

Continue reading this section for the full explanation and source context.

Related topics: System Overview & Architecture, Security, SSRF Protection & Abuse Mitigation

Search, Proxies, Caching & Self-Hosting Deployment

This page covers the four operational pillars of the open-source jina-ai/reader codebase behind r.jina.ai and s.jina.ai: the SERP-driven search layer, the multi-tier proxy machinery that makes hostile sites fetchable, the caching model that keeps repeated reads cheap, and the self-hosting path for users who want to run Reader on their own infrastructure.

1. Search (`s.jina.ai`)

Search is implemented as a thin orchestration layer that reuses the same fetch + extraction stack used by r.jina.ai. When a user hits s.jina.ai/<query>, Reader calls a SERP provider, takes the top results, and then runs each result through the crawler — returning the full extracted content rather than just a title/snippet.

Search providers and result shape

The SERP subsystem is pluggable: serper (Serper.dev Google SERP), google (direct Google), bing, and puppeteer (browser-driven scraping) all return a normalized WebSearchEntry shape defined in src/services/serp/compat.ts:1-14. The shape covers web/image/news variants and includes inline siteLinks:

export interface WebSearchEntry {
    link: string;
    title: string;
    source?: string;
    date?: string;
    snippet?: string;
    imageUrl?: string;
    siteLinks?: { link: string; title: string; snippet?: string; }[];
    variant?: 'web' | 'images' | 'news';
}

Search operators

Advanced query syntax (e.g. site:, intitle:, loc:) is built up programmatically. Source: src/services/serp/serper.ts — the addTo() method composes a Google-style query string from the typed GoogleSearchExplicitOperatorsDto and joins chunks with AND / OR as appropriate.

For restricting results to specific domains, the docs in README.md recommend the site query parameter (repeatable):

curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'

2. Proxies

Reader distinguishes two proxy modes — a user-supplied explicit proxy and an internal pool of "allocated" proxies that the service can fall back on when a direct fetch returns a thin page or a non-200 status.

Explicit proxy

The x-proxy-url request header (documented in README.md) routes the entire traffic through the caller's own proxy. Anonymous traffic is rate-limited aggressively; an API key unlocks the internal proxy pool.

Allocated proxy fallback

The crawler code in src/api/crawler.ts decides per-request whether to retry with an allocated proxy. The relevant guard (paraphrased) is: if no explicit proxy is set, no none allocation has been requested, the page is short (under ~42 tokens by the lightweight analyzer), or the first side-load returned a non-2xx, the crawler calls sideLoadWithAllocatedProxy(). Source: src/api/crawler.ts — search for the if ((!crawlOpts?.allocProxy || crawlOpts.allocProxy !== 'none') && ...) block.

Abuse-driven domain blockades

When a request looks abusive (e.g. bot-challenge farms, repeated CAPTCHAs), the puppeteer control emits an abuse event. The crawler subscribes, logs it, and writes a DomainBlockade to the storage layer with a TTL. Source: src/api/crawler.tspuppeteerControl.on('abuse', ...) handler storing { domain, triggerReason, triggerUrl, expireAt }.

Curl-side impersonation

For non-browser fetches, the curl engine is told to impersonate whatever User-Agent the puppeteer layer negotiated, so anti-bot heuristics see a consistent client. Source: src/api/crawler.tsinit() calls this.curlControl.impersonateChrome(this.puppeteerControl.effectiveUA).

3. Caching

Caching is opt-in by TTL and opt-out by header, and is also implicitly disabled whenever the response would be unsafe to reuse (cookies, dynamic content, large screenshots).

Header / knobEffectSource
x-cache-toleranceInteger seconds; how stale a cached response may be.README.md
x-no-cache: trueForce a fresh fetch — use when a stale or already-blocked response is cached.README.md
x-set-cookieForward cookies; caches are bypassed for these requests.README.md
x-detach-invisiblesImplies browser engine and disables caching.README.md
setToCache(...)Internal eligibility check via options.eligibleForPageIndex.src/api/crawler.ts

The open-source branch is stateless out of the box; persistent caching requires the bundled docker compose with a MinIO/S3-compatible bucket — see the *Local development* note in README.md.

4. Self-Hosting Deployment

A prebuilt image is published to the GitHub Container Registry with headless Chrome, LibreOffice, and CJK fonts bundled. Source: README.md.

docker pull ghcr.io/jina-ai/reader:oss

Port layout

The image exposes two ports, both handled by the same Koa app:

  • 8080 — h2c (HTTP/2 cleartext). Production-grade and multiplexed; matches what Cloud Run talks to. Plain curl needs --http2-prior-knowledge to use it.
  • 8081 — HTTP/1.1 fallback. Use this from browsers, curl, and most clients.

For a quick local try-out, only the HTTP/1.1 port needs to be mapped. Source: README.md — *Self-host with Docker* section.

External assets

Geo IP, IP-to-ASN, CJK font (SourceHanSansSC-Regular.otf), and the user-agent list used by the curl engine are downloaded by a single script:

npm run assets:download

The script is idempotent and skippable via SKIP_DOWNLOAD_EXTERNAL=1; CI also fetches these URLs inline. Source: README.md.

Request flow

flowchart LR
    Client["curl / browser"] -->|HTTP| Reader[Reader on :8080 h2c or :8081 HTTP1.1]
    Reader --> SERP[SERP layer]
    SERP -->|top-N results| Crawler[Crawler]
    Crawler -->|direct or x-proxy-url| Target[Target website]
    Crawler -->|fallback: allocated proxy| Target
    Crawler --> Cache[(MinIO/S3 bucket)]
    Crawler --> Client

Common failure modes and mitigations

Several issues reported in the community map directly onto the four pillars above:

  • DNS / geo blocks (issue #1237) — the public jina.ai domain was DNS-poisoned for some regions; Jina recommends the r.jinaai.cn / s.jinaai.cn transition domains or an API key with a dedicated egress address.
  • Stale or blocked response — bypass cache with -H 'x-no-cache: true', then escalate to x-engine: browser, then to x-proxy-url. Source: README.md — *Having trouble on some websites?* section.
  • Timeouts on slow sites (issue #1118) — Reader surfaces a TimeoutError: Navigation timeout of 30000 ms exceeded. Retry with x-engine: browser and a longer tolerance, or pre-warm the cache once and reuse.
  • SSRF hardening in self-hosted deployments (issue #1252, #1253) — self-hosters must not expose Reader to untrusted input without re-applying the URL gate per redirect hop. See the MiscService.assertNormalizedUrl flow described in those reports.

See Also

Source: https://github.com/jina-ai/reader / Human Manual

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

high Security or permission risk requires verification

Developers may expose sensitive permissions or credentials: Server-Side Request Forgery via domain resolution bypass in self-hosted deployments

high Security or permission risk requires verification

Developers may expose sensitive permissions or credentials: Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF gate not re-applied per redirect hop)

medium Installation risk requires verification

May increase setup, validation, or first-run risk for the user.

medium Installation risk requires verification

Developers may fail before the first successful local run: npm run build failed because shared files are not found

Doramagic Pitfall Log

Found 28 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Security or permission risk - Security or permission risk requires verification.

1. Security or permission risk: Security or permission risk requires verification

  • Severity: high
  • Finding: Developers should check this security_permissions risk before relying on the project: Server-Side Request Forgery via domain resolution bypass in self-hosted deployments
  • User impact: Developers may expose sensitive permissions or credentials: Server-Side Request Forgery via domain resolution bypass in self-hosted deployments
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Server-Side Request Forgery via domain resolution bypass in self-hosted deployments. Context: Observed when using docker
  • Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1253

2. Security or permission risk: Security or permission risk requires verification

  • Severity: high
  • Finding: Developers should check this security_permissions risk before relying on the project: Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF gate not re-applied per redirect hop)
  • User impact: Developers may expose sensitive permissions or credentials: Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF gate not re-applied per redirect hop)
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF gate not re-applied per redirect hop). Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1252

3. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: identity.distribution | https://github.com/jina-ai/reader

4. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Developers should check this installation risk before relying on the project: npm run build failed because shared files are not found
  • User impact: Developers may fail before the first successful local run: npm run build failed because shared files are not found
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: npm run build failed because shared files are not found. Context: Observed when using node
  • Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/3

5. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/3

6. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/2

7. Configuration risk: Configuration risk requires verification

  • Severity: medium
  • Finding: Developers should check this configuration risk before relying on the project: Improve content extraction logic to handle dynamic and hidden elements
  • User impact: Developers may misconfigure credentials, environment, or host setup: Improve content extraction logic to handle dynamic and hidden elements
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Improve content extraction logic to handle dynamic and hidden elements. Context: Observed when using playwright
  • Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1242

8. Configuration risk: Configuration risk requires verification

  • Severity: medium
  • Finding: Developers should check this configuration risk before relying on the project: Respect robots.txt and identify your system
  • User impact: Developers may misconfigure credentials, environment, or host setup: Respect robots.txt and identify your system
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Respect robots.txt and identify your system. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/4

9. Configuration risk: Configuration risk requires verification

  • Severity: medium
  • Finding: Developers should check this configuration risk before relying on the project: support docker deployment
  • User impact: Developers may misconfigure credentials, environment, or host setup: support docker deployment
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: support docker deployment. Context: Observed when using docker
  • Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/2

10. Capability evidence risk: Capability evidence risk requires verification

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: capability.assumptions | https://github.com/jina-ai/reader

11. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Developers should check this runtime risk before relying on the project: Failed to go to
  • User impact: Developers may hit a documented source-backed failure mode: Failed to go to
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Failed to go to. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1118

12. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Developers should check this runtime risk before relying on the project: Reader doesn't extract any content from this page even though its quite simple?
  • User impact: Developers may hit a documented source-backed failure mode: Reader doesn't extract any content from this page even though its quite simple?
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Reader doesn't extract any content from this page even though its quite simple?. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/105

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources 12

Count of project-level external discussion links exposed on this manual page.

Use Review before install

Open the linked issues or discussions before treating the pack as ready for your environment.

Community Discussion Evidence

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using reader with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence