Doramagic Project Pack · Human Manual
reader
Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
System Overview & Architecture
Related topics: URL Fetching Engines & Content Extraction, Search, Proxies, Caching & Self-Hosting Deployment
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Related Pages
Related topics: URL Fetching Engines & Content Extraction, Search, Proxies, Caching & Self-Hosting Deployment
System Overview & Architecture
1. Purpose and Scope
Reader is Jina AI's open-source service that turns arbitrary web content into clean, LLM-friendly input. It exposes two public surfaces: a Read endpoint (https://r.jina.ai/) that converts any URL — HTML page, PDF, Office document, or image — into Markdown, HTML, text, or a screenshot; and a Search endpoint (https://s.jina.ai/) that issues a web search and returns structured results with snippets, links, and metadata. Source: README.md.
The codebase is a TypeScript application targeting Node.js >=22.15 and is packaged as a single civkit-based service. Source: package.json. The package.json exports field advertises three entry points — ./crawl, ./search, and ./serp — indicating the project is logically decomposed along those three verbs even though it ships as one binary.
The project explicitly positions itself for two downstream use cases: feeding RAG pipelines and powering AI agents. The README's header option matrix (x-preset, x-engine, x-respond-with, etc.) is designed around those scenarios, with named presets such as reader, index, research, agent, and spider. Source: README.md.
2. High-Level Architecture
The runtime is a Koa-style HTTP/2 service built on the civkit framework, with an inverted-dependency container (tsyringe) wiring together fetcher engines, content processors, LLM adapters, and SERP providers. The package.json dependency list (koa, civkit, tsyringe, undici, @mozilla/readability, @nomagick/node-libcurl-impersonate, generic-pool, pino-pretty) maps directly onto that layered design. Source: package.json.
flowchart LR
A[Client / LLM Agent] -->|URL or query| B(API Layer<br/>r.jina.ai / s.jina.ai)
B --> C{Engine Selector<br/>x-engine: browser / curl / auto}
C -->|browser| D[Headless Chrome Pool<br/>generic-pool]
C -->|curl| E[node-libcurl-impersonate]
D --> F[Content Pipeline]
E --> F
F --> G[Readability + Markdown tidy]
G --> H[Output Formatter<br/>markdown / html / text / screenshot / frontmatter]
F --> I[SSRF Gate<br/>assertNormalizedUrl]
B --> J[SERP Adapters<br/>serper, compat]
J --> K[WebSearchEntry]
F --> L[LLM Adapters<br/>OpenRouter, misc]
H --> A
K --> AThe diagram captures the four responsibilities the codebase is organized around: inbound API parsing (the many X-* headers and JSON body options), fetching (browser pool vs. libcurl impersonation), content processing (readability, markdown normalization, image captioning via VLMs), and output shaping (presets, chunking, frontmatter). Source: README.md.
3. Service Layers and Modules
3.1 Fetching and Rendering
Two fetch backends coexist. The browser engine runs headless Chrome through a pooled allocator (generic-pool is a direct dependency), enabling JavaScript execution and screenshot capture — required for SPAs and dynamic content, a gap reported in the community (#1242 "Improve content extraction logic to handle dynamic and hidden elements"). The curl engine uses @nomagick/node-libcurl-impersonate for lightweight, non-JS fetches that still mimic real browser TLS fingerprints. The default auto engine decides per URL. Sources: README.md, package.json.
3.2 Content Processing
After fetching, HTML is reduced to LLM input through a pipeline that includes @mozilla/readability and a markdown normalization pass. The tidyMarkdown helper consolidates links and images that have been split across lines, normalizes whitespace, and re-wraps nested image-in-link fragments into a single tidy markdown token, which is critical because raw Chromium-to-markdown output is often broken when long links or inline images wrap. Source: src/utils/markdown.ts.
A separate utility, src/utils/tailwind-classes.ts, enumerates the Tailwind class universe so the processor can strip or keep layout/utility classes during DOM-to-Markdown conversion — visible in the file's exhaustive lists of color, spacing, outline, and break utilities. Source: src/utils/tailwind-classes.ts.
3.3 LLM and Search Adapters
LLM access is provider-agnostic. src/services/common-llm/open-router.ts defines a uniform DTO with temperature, top_p, top_k, frequency_penalty, presence_penalty, seed, plus tool-calling (tools, tool_choice) that is passed through to OpenAI-compatible providers. Source: src/services/common-llm/open-router.ts. The companion src/services/common-llm/misc.ts provides chatMLEncode, a ChatML message assembler that supports system prompts, multi-message history, and arbitrary message metadata, enabling the same client to drive both OpenAI-style and ChatML-style backends. Source: src/services/common-llm/misc.ts.
Search is similarly normalized. src/services/serp/compat.ts defines a stable WebSearchEntry shape — link, title, source, date, snippet, imageUrl, siteLinks, and a variant discriminator — that every SERP provider must yield, regardless of upstream differences. Source: src/services/serp/compat.ts. The Google-specific adapter in src/services/serp/serper.ts implements explicit operators (intitle, loc, site) and assembles them into a single search string via the addTo method. Source: src/services/serp/serper.ts.
4. Operational Topology and Community-Reported Failure Modes
The Docker image (ghcr.io/jina-ai/reader:oss) exposes two ports: 8080 over HTTP/2 cleartext (production-grade, used by Cloud Run) and 8081 over HTTP/1.1 as a curl-friendly fallback. The image bundles headless Chrome, LibreOffice, and CJK fonts so PDFs, Office documents, and CJK pages are processed in-container. Source: README.md.
Several community-reported issues map directly onto architectural seams in this overview and are worth flagging for self-hosters:
| Concern | Architectural Location | Community Reference |
|---|---|---|
| SSRF via DNS rebinding / per-hop redirect validation | Single-shot assertNormalizedUrl gate before fetch | #1252, #1253 |
| Page navigation timeout (default 30 s) | Headless Chrome pool under generic-pool | #1118 |
| Output size larger than raw HTML | Markdown normalization + frontmatter expansion in tidyMarkdown | #1250 |
| Extraction failures on simple pages | Readability + DOM-to-MD pipeline | #105, #1 |
Regional blocking of jina.ai / r.jina.ai | Public hostname topology; mirror via r.jinaai.cn | #1237 |
Together these describe a layered system where API parsing → fetching → processing → output are independently configurable through headers and presets, and where each layer carries a known set of edge cases that the README's x-* option matrix is designed to address. Source: README.md.
See Also
- Cookbooks — pipeline-specific header recipes (RAG, indexing, research, agent, spider)
- Docker deployment guide —
ghcr.io/jina-ai/reader:ossand port mapping - Rate limits and pricing — quota guidance for
r.jina.ai/s.jina.ai
Source: https://github.com/jina-ai/reader / Human Manual
URL Fetching Engines & Content Extraction
Related topics: System Overview & Architecture, Security, SSRF Protection & Abuse Mitigation
Continue reading this section for the full explanation and source context.
Related Pages
Related topics: System Overview & Architecture, Security, SSRF Protection & Abuse Mitigation
URL Fetching Engines & Content Extraction
Overview
The jina-ai/reader service converts any HTTP-accessible resource into an LLM-friendly representation. The pipeline is exposed through two public endpoints: r.jina.ai for reading a single URL, and s.jina.ai for web search. Internally, the request is handled by the CrawlerAPI.crawl RPC (src/api/crawler.ts), which validates the target URL, chooses a fetching engine, fetches the resource, normalizes the result, and serializes it according to the caller's requested output format.
Two core stages are involved:
- Fetching — retrieving the remote document with one of the supported engines.
- Content extraction & formatting — turning the raw response (HTML, PDF, image, etc.) into the chosen representation (Markdown, HTML, text, screenshot, or frontmatter).
Fetching Engines
The x-engine request header selects the engine that the service uses to download a URL (README.md):
| Engine | Description | Trade-offs |
|---|---|---|
auto | Default. Combines a lightweight curl path with a headless-browser fallback. | Best balance of cost and quality; preferred for general-purpose crawling. |
curl | Plain HTTP fetch without JavaScript execution. | Cheapest, but cannot render SPAs or JS-driven content. |
browser | Forces headless Chrome (Puppeteer, see package.json dependency on puppeteer ^24.42.0). | Highest fidelity; needed for sites that only serve real content to real browsers. |
The crawler normalizes the user-supplied URL through MiscService.assertNormalizedUrl, which performs DNS resolution and rejects non-public IP ranges to mitigate SSRF (src/api/crawler.ts). However, as noted in community issues #1252 and #1253, this check is applied once before fetching; subsequent HTTP redirects are not re-validated per hop, which is a known limitation in self-hosted deployments.
A per-host circuit breaker (puppeteerControl.circuitBreakerHosts) blocks recursive or self-referential crawling and emits abuse events that can store a domain-wide blockade via storageLayer.storeDomainBlockade (src/api/crawler.ts). When abuse is detected, the offending hostname is added to a temporary block list and excluded from future fetches.
flowchart LR
A[Client request<br/>r.jina.ai/URL] --> B[assertNormalizedUrl<br/>SSRF gate]
B --> C{Engine?}
C -- auto --> D[curl first]
C -- browser --> E[Puppeteer]
C -- curl --> F[undici/curl]
D --> G{Content OK?}
G -- no --> E
G -- yes --> H[Snapshot]
E --> H
F --> H
H --> I[Format by x-respond-with]
I --> J[Markdown / HTML / text / screenshot / frontmatter]Content Extraction Pipeline
After a successful fetch, the response body is dispatched based on MIME type. The crawler applies jsdomControl.analyzeHTMLTextLite to derive the document title and to narrow the snapshot (src/api/crawler.ts). A side-load path exists for binary or non-HTML types — PDFs, Office documents, and images are converted externally (LibreOffice for Office, PDF.js for PDFs, a VLM for image captioning) and then re-introduced as HTML or markdown fragments.
The HTML → Markdown transformation is performed by markify (referenced in the README as readability for filtered markdown). The tidyMarkdown post-processor in src/utils/markdown.ts normalizes broken links that span multiple lines, collapses whitespace inside link labels, and re-attaches image syntax to the surrounding anchor — a frequent source of malformed Markdown from upstream converters. Tailwind utility class names are tracked in src/utils/tailwind-classes.ts so the converter can recognize and strip them from extracted content where appropriate.
Several response-modifier headers control the output shape (README.md):
x-respond-with—markdown,html,text,screenshot,pageshot,frontmatter, ormarkdown+frontmatter.x-retain-media— control how<video>,<audio>, and embedded iframes appear (link,none,text,image,html).x-with-links-summary/x-with-images-summary— append a deduplicated footer of links or images.x-markdown-chunking— opt-in semantic chunking by heading level (h1–h5) or by block structure (s1–s5).x-detach-invisibles— strip elements with eventualdisplay:nonebefore snapshotting (forces browser engine, disables caching).x-respond-with: frontmatter— return Markdown with a YAML frontmatter block (title,description,url).x-with-generated-alt— use a VLM to caption images and emit![Image [idx]: [VLM_caption]](img_URL).
Caching is keyed by an MD5 digest of the normalized URL (src/api/crawler.ts, getUrlDigest). The x-cache-tolerance header defines how stale a cached entry is allowed to be, while x-no-cache: true bypasses the cache entirely — useful when a stale or already-blocked response is being served.
Response Formatting & Search Coupling
The crawl handler dispatches the formatted page to the appropriate serializer based on crawlerOptions.respondWith (src/api/crawler.ts). The pageshot branch issues a 302 redirect to the screenshot URL or returns the raw PNG; frontmatter and markdown+frontmatter emit a YAML-headed Markdown document; the default path returns text/plain.
For search, the SearcherAPI reuses the same CrawlerOptions pipeline. The result entries from a SERP provider are first mapped by mapSearchEntryToPartialFormattedPage (src/api/searcher.ts) and then each URL is fetched through the same engine and extractor machinery. Search query augmentation supports Google explicit operators (intitle:, site:, loc:) via GoogleSearchExplicitOperatorsDto.addTo (src/services/serp/serper.ts), and the normalized entry shape is defined by the WebSearchEntry interface (src/services/serp/compat.ts).
Common Failure Modes & Mitigations
User-reported issues map to specific stages of the pipeline:
- 30 s navigation timeouts (#1118) — Puppeteer's default
gototimeout fires on slow sites. Retry withx-engine: auto(which prefers curl) or use an API key to access the internal proxy. - Empty extraction on simple pages (#105) — usually caused by aggressive readability filtering; switch to
x-respond-with: markdown+frontmatterorhtmlto bypass filtering. - Dynamic / hidden content (#1242) — content loaded by JavaScript is missed by curl; use
x-engine: browserorx-detach-invisibles: true. - Output larger than source HTML (#1250) — frontmatter, link summaries, and image captions add overhead; disable with
x-with-links-summary: falseand use plainmarkdowninstead offrontmatter. - Images rendered as raw URLs (#1251) — by default, images become links; set
x-retain-images: altorimageto embed them as!…syntax. - Geo-blocking (#1237) —
jina.aiis DNS-poisoned in some regions; transition aliasesr.jinaai.cnands.jinaai.cnare provided. - SSRF in self-hosted deployments (#1252, #1253) — the
assertNormalizedUrlgate is single-shot and does not re-validate redirect targets; self-hosters should front the service with an outbound proxy that enforces per-hop checks.
See Also
Source: https://github.com/jina-ai/reader / Human Manual
Security, SSRF Protection & Abuse Mitigation
Related topics: System Overview & Architecture, Search, Proxies, Caching & Self-Hosting Deployment
Continue reading this section for the full explanation and source context.
Related Pages
Related topics: System Overview & Architecture, Search, Proxies, Caching & Self-Hosting Deployment
Security, SSRF Protection & Abuse Mitigation
Reader is a server-side URL fetcher: a client submits a URL, the service retrieves it, and the response is converted into LLM-friendly Markdown. Because the server is willing to fetch *any* URL the user supplies, the entire service is fundamentally an SSRF surface. The open-source branch in this repository mitigates that risk with a layered set of guards: a one-shot URL/IP validator, a hardened browser engine, abuse detection events, timeouts, and the engine-selection knobs exposed in the request DTOs.
This page documents the security mechanisms visible in the codebase and the failure modes that have surfaced in the community.
High-Level Threat Model
Reader accepts an arbitrary URL and returns a structured representation of the remote response. An attacker who can submit a URL can therefore:
- Probe internal services bound to
127.0.0.1,169.254.169.254(cloud metadata),10.0.0.0/8,192.168.0.0/16,fc00::/7, and other non-public ranges. - Use the public service as an open proxy to anonymize traffic.
- Cause the headless browser to render attacker-controlled JavaScript and exfiltrate data.
- Drive up cost by pointing the service at large or slow resources.
The defenses below are organised so that the cheapest checks run first (URL normalization and IP classification), and the most expensive checks (full browser execution) only run for requests that have already cleared the gate.
URL Normalization and the SSRF Gate
Per the published security advisories (issues #1252 and #1253), the entry-point guard lives in MiscService.assertNormalizedUrl, which performs a one-shot check that resolves the hostname and rejects non-public IP ranges. This is a single-shot gate: it is applied once to the URL the caller submitted, not on every redirect hop.
The practical implications for self-hosted operators are significant:
- An attacker who can control a DNS record (or who can race a TTL flip) can have a hostname resolve to a public IP at validation time and to an internal IP at fetch time — a TOCTOU bypass documented in issue #1253.
- An attacker who can convince the upstream server to issue a
30xredirect to an internal address is *not* re-validated per hop, as documented in issue #1252. The fetch follows HTTP redirects with no per-hop re-validation, so a public origin can hand the browser aLocation:header pointing athttp://169.254.169.254/....
Both classes of bypass are known and apply to the open-source branch at the affected commits. Operators of self-hosted deployments are expected to add their own egress firewall (e.g. an iptables/nftables rule that drops RFC1918 destinations from the container) until upstream adds per-hop re-validation. Source: README.md (Docker self-host section) and community issue links above.
Browser Engine Hardening
When the request forces the browser engine (X-Engine: browser or auto with dynamic content), the fetcher delegates to the Puppeteer-based service. Several defensive behaviours are visible in the source:
- Referrer injection. The browser request is given a
refereroption onpage.goto, which prevents some classes of origin-bound CSRF and helps upstream logs distinguish Reader traffic. Source: src/services/serp/puppeteer.ts. - Multi-event wait. Navigation waits on
load,domcontentloaded, andnetworkidle0together so the rendered DOM is stable before extraction. Source: src/services/serp/puppeteer.ts. - Hard timeout. A
timeoutMs(default30_000) caps the navigation. OnTimeoutErrorthe request is converted to aAssertionFailureErrorand surfaced as HTTP 422 with a readable message — this is the same error shape users see in issue #1118 (Failed to goto ... : TimeoutError: Navigation timeout of 30000 ms exceeded). Source: src/services/serp/puppeteer.ts. - Listener cleanup. A
crippleListeneris registered and removed on the close path so that an aborted navigation cannot leak the_REPORT_FUNCTION_NAMEhandler into the next request. Source: src/services/serp/puppeteer.ts.
flowchart LR
A[Client submits URL] --> B[assertNormalizedUrl]
B -->|non-public IP| X[Reject 422]
B -->|public IP| C{Engine?}
C -->|curl| D[libcurl-impersonate fetch]
C -->|browser| E[Headless Chrome + Puppeteer]
D --> F[Readability / Markdown]
E --> F
E -->|abuse event| Y[SecurityCompromiseError]
E -->|timeout| Z[AssertionFailureError 422]
F --> G[Response]Abuse Detection
The browser path emits an abuse page event that the service translates into a SecurityCompromiseError. The handler rejects the request promise and surfaces the reason to the caller. Source: src/services/serp/puppeteer.ts.
Downstream, the crawler API decides how to envelope the response. Hard errors (assertion failures, security compromises) are wrapped with envelope: null and a non-2xx status, while successful markdown, HTML, screenshot, and frontmatter responses carry Content-Type headers explicitly set in src/api/crawler.ts. The combination means a misbehaving origin cannot force the API to emit a different content type than the one declared.
Output Sanitisation
Once a page is rendered, the Markdown pipeline runs through tidyMarkdown, which normalises broken cross-line links, collapses stray whitespace, and removes script-like content from link text. Source: src/utils/markdown.ts. This step is the last line of defence against prompt-injection payloads that hide in alt text or link titles, and it is also the cause of the user-reported behaviour in issue #1251 where images appear as raw idx placeholders rather than inline media.
Common Failure Modes
| Symptom | Likely cause | Reference |
|---|---|---|
Failed to goto ... : TimeoutError: Navigation timeout of 30000 ms exceeded | Slow upstream or blocking script; raise timeoutMs or switch to curl engine | issue #1118 |
| Reader returns a plain text page with no images | Image extraction disabled or tidyMarkdown stripped the media tags | issue #1251 |
| Service blocked at the network layer | DNS poisoning in the operator's region; switch to regional mirror | issue #1237 |
| SSRF succeeds in self-hosted deployment | Missing egress firewall; add per-hop validation in front of the container | issues #1252, #1253 |
SecurityCompromiseError returned | Browser engine emitted an abuse event from the target page | puppeteer.ts |
Operator Recommendations
For self-hosted deployments, the published minimum hardening is: (1) put the container behind an egress proxy that drops RFC1918, link-local, and cloud-metadata destinations; (2) front the API with an authenticating reverse proxy so anonymous abuse cannot use the public quota; and (3) pin the image to a known commit and monitor the security advisory feed. The single-shot IP gate in assertNormalizedUrl is necessary but, as the community evidence shows, not sufficient.
See Also
- README.md — usage, options, and Docker self-host instructions
- cookbooks.md — preset and header combinations
- src/api/crawler.ts — request envelope and content-type decisions
- src/services/serp/puppeteer.ts — browser engine and abuse handler
- src/utils/markdown.ts — output sanitisation
Source: https://github.com/jina-ai/reader / Human Manual
Search, Proxies, Caching & Self-Hosting Deployment
Related topics: System Overview & Architecture, Security, SSRF Protection & Abuse Mitigation
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Continue reading this section for the full explanation and source context.
Related Pages
Related topics: System Overview & Architecture, Security, SSRF Protection & Abuse Mitigation
Search, Proxies, Caching & Self-Hosting Deployment
This page covers the four operational pillars of the open-source jina-ai/reader codebase behind r.jina.ai and s.jina.ai: the SERP-driven search layer, the multi-tier proxy machinery that makes hostile sites fetchable, the caching model that keeps repeated reads cheap, and the self-hosting path for users who want to run Reader on their own infrastructure.
1. Search (`s.jina.ai`)
Search is implemented as a thin orchestration layer that reuses the same fetch + extraction stack used by r.jina.ai. When a user hits s.jina.ai/<query>, Reader calls a SERP provider, takes the top results, and then runs each result through the crawler — returning the full extracted content rather than just a title/snippet.
Search providers and result shape
The SERP subsystem is pluggable: serper (Serper.dev Google SERP), google (direct Google), bing, and puppeteer (browser-driven scraping) all return a normalized WebSearchEntry shape defined in src/services/serp/compat.ts:1-14. The shape covers web/image/news variants and includes inline siteLinks:
export interface WebSearchEntry {
link: string;
title: string;
source?: string;
date?: string;
snippet?: string;
imageUrl?: string;
siteLinks?: { link: string; title: string; snippet?: string; }[];
variant?: 'web' | 'images' | 'news';
}
Search operators
Advanced query syntax (e.g. site:, intitle:, loc:) is built up programmatically. Source: src/services/serp/serper.ts — the addTo() method composes a Google-style query string from the typed GoogleSearchExplicitOperatorsDto and joins chunks with AND / OR as appropriate.
In-site search
For restricting results to specific domains, the docs in README.md recommend the site query parameter (repeatable):
curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'
2. Proxies
Reader distinguishes two proxy modes — a user-supplied explicit proxy and an internal pool of "allocated" proxies that the service can fall back on when a direct fetch returns a thin page or a non-200 status.
Explicit proxy
The x-proxy-url request header (documented in README.md) routes the entire traffic through the caller's own proxy. Anonymous traffic is rate-limited aggressively; an API key unlocks the internal proxy pool.
Allocated proxy fallback
The crawler code in src/api/crawler.ts decides per-request whether to retry with an allocated proxy. The relevant guard (paraphrased) is: if no explicit proxy is set, no none allocation has been requested, the page is short (under ~42 tokens by the lightweight analyzer), or the first side-load returned a non-2xx, the crawler calls sideLoadWithAllocatedProxy(). Source: src/api/crawler.ts — search for the if ((!crawlOpts?.allocProxy || crawlOpts.allocProxy !== 'none') && ...) block.
Abuse-driven domain blockades
When a request looks abusive (e.g. bot-challenge farms, repeated CAPTCHAs), the puppeteer control emits an abuse event. The crawler subscribes, logs it, and writes a DomainBlockade to the storage layer with a TTL. Source: src/api/crawler.ts — puppeteerControl.on('abuse', ...) handler storing { domain, triggerReason, triggerUrl, expireAt }.
Curl-side impersonation
For non-browser fetches, the curl engine is told to impersonate whatever User-Agent the puppeteer layer negotiated, so anti-bot heuristics see a consistent client. Source: src/api/crawler.ts — init() calls this.curlControl.impersonateChrome(this.puppeteerControl.effectiveUA).
3. Caching
Caching is opt-in by TTL and opt-out by header, and is also implicitly disabled whenever the response would be unsafe to reuse (cookies, dynamic content, large screenshots).
| Header / knob | Effect | Source |
|---|---|---|
x-cache-tolerance | Integer seconds; how stale a cached response may be. | README.md |
x-no-cache: true | Force a fresh fetch — use when a stale or already-blocked response is cached. | README.md |
x-set-cookie | Forward cookies; caches are bypassed for these requests. | README.md |
x-detach-invisibles | Implies browser engine and disables caching. | README.md |
setToCache(...) | Internal eligibility check via options.eligibleForPageIndex. | src/api/crawler.ts |
The open-source branch is stateless out of the box; persistent caching requires the bundled docker compose with a MinIO/S3-compatible bucket — see the *Local development* note in README.md.
4. Self-Hosting Deployment
A prebuilt image is published to the GitHub Container Registry with headless Chrome, LibreOffice, and CJK fonts bundled. Source: README.md.
docker pull ghcr.io/jina-ai/reader:oss
Port layout
The image exposes two ports, both handled by the same Koa app:
8080— h2c (HTTP/2 cleartext). Production-grade and multiplexed; matches what Cloud Run talks to. Plaincurlneeds--http2-prior-knowledgeto use it.8081— HTTP/1.1 fallback. Use this from browsers,curl, and most clients.
For a quick local try-out, only the HTTP/1.1 port needs to be mapped. Source: README.md — *Self-host with Docker* section.
External assets
Geo IP, IP-to-ASN, CJK font (SourceHanSansSC-Regular.otf), and the user-agent list used by the curl engine are downloaded by a single script:
npm run assets:download
The script is idempotent and skippable via SKIP_DOWNLOAD_EXTERNAL=1; CI also fetches these URLs inline. Source: README.md.
Request flow
flowchart LR
Client["curl / browser"] -->|HTTP| Reader[Reader on :8080 h2c or :8081 HTTP1.1]
Reader --> SERP[SERP layer]
SERP -->|top-N results| Crawler[Crawler]
Crawler -->|direct or x-proxy-url| Target[Target website]
Crawler -->|fallback: allocated proxy| Target
Crawler --> Cache[(MinIO/S3 bucket)]
Crawler --> ClientCommon failure modes and mitigations
Several issues reported in the community map directly onto the four pillars above:
- DNS / geo blocks (issue #1237) — the public
jina.aidomain was DNS-poisoned for some regions; Jina recommends ther.jinaai.cn/s.jinaai.cntransition domains or an API key with a dedicated egress address. - Stale or blocked response — bypass cache with
-H 'x-no-cache: true', then escalate tox-engine: browser, then tox-proxy-url. Source: README.md — *Having trouble on some websites?* section. - Timeouts on slow sites (issue #1118) — Reader surfaces a
TimeoutError: Navigation timeout of 30000 ms exceeded. Retry withx-engine: browserand a longer tolerance, or pre-warm the cache once and reuse. - SSRF hardening in self-hosted deployments (issue #1252, #1253) — self-hosters must not expose Reader to untrusted input without re-applying the URL gate per redirect hop. See the
MiscService.assertNormalizedUrlflow described in those reports.
See Also
- README.md — full header reference and cookbooks index.
- src/dto/crawler-options.ts — DTO-level defaults and validation for every
x-*header. - src/api/crawler.ts — abuse event handling, cache eligibility, allocated-proxy fallback.
- src/services/serp/compat.ts —
WebSearchEntryshape. - src/services/serp/serper.ts — search-operator DSL.
Source: https://github.com/jina-ai/reader / Human Manual
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Developers may expose sensitive permissions or credentials: Server-Side Request Forgery via domain resolution bypass in self-hosted deployments
Developers may expose sensitive permissions or credentials: Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF gate not re-applied per redirect hop)
May increase setup, validation, or first-run risk for the user.
Developers may fail before the first successful local run: npm run build failed because shared files are not found
Doramagic Pitfall Log
Found 28 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Security or permission risk - Security or permission risk requires verification.
1. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Developers should check this security_permissions risk before relying on the project: Server-Side Request Forgery via domain resolution bypass in self-hosted deployments
- User impact: Developers may expose sensitive permissions or credentials: Server-Side Request Forgery via domain resolution bypass in self-hosted deployments
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Server-Side Request Forgery via domain resolution bypass in self-hosted deployments. Context: Observed when using docker
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1253
2. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Developers should check this security_permissions risk before relying on the project: Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF gate not re-applied per redirect hop)
- User impact: Developers may expose sensitive permissions or credentials: Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF gate not re-applied per redirect hop)
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF gate not re-applied per redirect hop). Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1252
3. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: identity.distribution | https://github.com/jina-ai/reader
4. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: npm run build failed because shared files are not found
- User impact: Developers may fail before the first successful local run: npm run build failed because shared files are not found
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: npm run build failed because shared files are not found. Context: Observed when using node
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/3
5. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/3
6. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/jina-ai/reader/issues/2
7. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: Improve content extraction logic to handle dynamic and hidden elements
- User impact: Developers may misconfigure credentials, environment, or host setup: Improve content extraction logic to handle dynamic and hidden elements
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Improve content extraction logic to handle dynamic and hidden elements. Context: Observed when using playwright
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1242
8. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: Respect robots.txt and identify your system
- User impact: Developers may misconfigure credentials, environment, or host setup: Respect robots.txt and identify your system
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Respect robots.txt and identify your system. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/4
9. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: support docker deployment
- User impact: Developers may misconfigure credentials, environment, or host setup: support docker deployment
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: support docker deployment. Context: Observed when using docker
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/2
10. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/jina-ai/reader
11. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: Failed to go to
- User impact: Developers may hit a documented source-backed failure mode: Failed to go to
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Failed to go to. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/1118
12. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: Reader doesn't extract any content from this page even though its quite simple?
- User impact: Developers may hit a documented source-backed failure mode: Reader doesn't extract any content from this page even though its quite simple?
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Reader doesn't extract any content from this page even though its quite simple?. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | https://github.com/jina-ai/reader/issues/105
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Count of project-level external discussion links exposed on this manual page.
Open the linked issues or discussions before treating the pack as ready for your environment.
Community Discussion Evidence
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using reader with real data or production workflows.
- В странном виде сайты открываются. - github / github_issue
- Server-Side Request Forgery via domain resolution bypass in self-hosted - github / github_issue
- Unauthenticated SSRF via unvalidated HTTP redirects (single-shot SSRF ga - github / github_issue
- Bug/Optimization: Reader Output size is larger than Raw HTML size - github / github_issue
- Community source 5 - github / github_issue
- Failed to go to - github / github_issue
- Reader doesn't extract any content from this page even though its quite - github / github_issue
- Improve content extraction logic to handle dynamic and hidden elements - github / github_issue
- Extraction didn't work - github / github_issue
- Installation risk requires verification - GitHub / issue
- Installation risk requires verification - GitHub / issue
- Installation risk requires verification - GitHub / issue
Source: Project Pack community evidence and pitfall evidence