Doramagic Project Pack · Human Manual
webclaw
Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.
Project Overview and Workspace Architecture
Related topics: Extraction Engine, CLI Usage, and Known CLI Bugs, MCP Server, REST API, and LLM Provider Integration
Overview
Webclaw is a web-to-context conversion toolkit that turns websites into clean Markdown, JSON, and LLM-ready text. It exposes the same underlying extraction pipeline through four distribution surfaces: a native CLI (webclaw), a Model Context Protocol server (webclaw-mcp), a self-hosted REST API (webclaw-server), and a hosted API with official SDKs in TypeScript, Python, and Go. Source: README.md.
The project is structured as a Rust Cargo workspace with multiple focused crates that share core extraction, fetch, and LLM primitives. This separation lets the same pipeline serve one-off CLI runs, agent integration, and high-throughput crawling without duplicating logic. Source: crates/webclaw-cli/src/main.rs.
Workspace Layout
The repository is a Cargo workspace under crates/. Each crate owns one concern and is consumed by the binaries that need it.
graph TD
A[webclaw-cli] --> B[webclaw-core]
A --> C[webclaw-fetch]
A --> D[webclaw-llm]
E[webclaw-mcp] --> B
E --> C
F[webclaw-server] --> B
F --> C
F --> D
D --> G[Provider Chain<br/>Ollama → OpenAI → Gemini → Anthropic]
B --> H[Extractors<br/>Vertical Catalogs]
C --> I[wreq Chrome/Firefox<br/>TLS Fingerprints]

- webclaw-cli: clap-based CLI binary. Owns subcommands such as `vertical`, `search`, `crawl`, `diff`, and `bench`, plus per-format printers (`print_crawl_output`, `print_batch_output`, `print_map_output`). Source: crates/webclaw-cli/src/main.rs.
- webclaw-cli::bench: micro-benchmark harness that measures token reduction from raw HTML to the LLM pipeline output using an approximate `chars/4` (Latin) and `chars/2` (CJK) tokenizer. Source: crates/webclaw-cli/src/bench.rs.
- webclaw-mcp: Model Context Protocol server. Exposes tools (`scrape`, `crawl`, `map`, `batch`, `extract`, `summarize`, `vertical_scrape`) to MCP clients such as Claude Code, Cursor, and Codex CLI. Source: crates/webclaw-mcp/src/server.rs and crates/webclaw-mcp/src/tools.rs.
- webclaw-server: Axum-based REST API. Routes include `POST /v1/extract` for LLM-powered structured extraction, `POST /v1/diff` for snapshot comparison, and `POST /v1/crawl` for multi-page crawling. Source: crates/webclaw-server/src/routes/extract.rs, crates/webclaw-server/src/routes/diff.rs, and crates/webclaw-server/src/routes/crawl.rs.
- webclaw-llm: Local-first LLM integration. The `ProviderChain` tries Ollama first, then falls back to OpenAI, Gemini, and Anthropic. Provides schema extraction, prompt extraction, and summarization. Source: crates/webclaw-llm/src/lib.rs and crates/webclaw-llm/src/summarize.rs.
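The `chars/4` and `chars/2` approximation used by the bench harness can be illustrated with a small sketch. This is not the code in crates/webclaw-cli/src/bench.rs; the function names, the CJK code-point ranges, and the round-up behavior are all assumptions made for illustration.

```rust
/// Returns true for characters in a few common CJK blocks.
/// The exact ranges are an assumption of this sketch.
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x30FF       // Hiragana and Katakana
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xAC00..=0xD7AF // Hangul syllables
    )
}

/// Approximate token count: chars/4 for non-CJK text, chars/2 for CJK text.
fn approx_tokens(text: &str) -> usize {
    let cjk = text.chars().filter(|c| is_cjk(*c)).count();
    let other = text.chars().count() - cjk;
    // Round up (an assumption) so short non-empty inputs never count as zero.
    (other + 3) / 4 + (cjk + 1) / 2
}

/// Percentage reduction from the raw HTML estimate to the LLM-format estimate.
fn reduction_percent(raw_html: &str, llm_text: &str) -> f64 {
    let raw = approx_tokens(raw_html) as f64;
    if raw == 0.0 {
        return 0.0;
    }
    (1.0 - approx_tokens(llm_text) as f64 / raw) * 100.0
}
```

Under these assumptions, a page whose raw HTML estimates to 100 tokens and whose LLM-format output estimates to 33 tokens would report a reduction of about 67%.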
Crate Responsibilities and Shared Primitives
Two cross-cutting crates sit beneath the binaries:
- webclaw-core: Provides the `extract()` pipeline, the `to_llm_text()` token-optimized formatter, and a registry of vertical extractors (e.g., `reddit`, `github_repo`, `trustpilot_reviews`, `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`). Vertical extractors return typed JSON specific to the target site rather than generic Markdown. Source: crates/webclaw-mcp/src/server.rs.
- webclaw-fetch: Owns the HTTP client and `BrowserProfile` (Chrome vs Firefox). The MCP `vertical_scrape` tool deliberately uses the cached Firefox client because Reddit's `.json` endpoint rejects the `wreq`-Chrome TLS fingerprint with a 403 even from residential IPs; Firefox's fingerprint still passes and is fine for every other vertical. Source: crates/webclaw-mcp/src/server.rs.
The webclaw-llm crate follows a hybrid architecture: it always tries Ollama (local) first and only escalates to hosted providers if the local model cannot answer, keeping self-hosters on local inference by default. Source: crates/webclaw-llm/src/lib.rs.
Distribution Surfaces and CLI Workflows
The CLI is the primary entry point and demonstrates the breadth of capabilities exposed through the shared pipeline:
- Basic extraction: `webclaw https://example.com -f markdown|json|text|llm` writes clean content in the requested format. Bare domains are auto-prefixed with `https://`. Source: examples/README.md.
- Content filtering: `--only-main-content` skips nav/sidebar/footer, while `--include` and `--exclude` accept CSS selectors. Note: community issue #3 reports that `--only-main-content` is not currently applied when passed alongside `--urls-file` in batch mode.
- Batch and crawl: `webclaw --urls-file urls.txt` extracts a list; `webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50` walks a site. The MCP equivalent takes `BatchParams` and `CrawlParams` with explicit `concurrency` and optional `use_sitemap`. Source: crates/webclaw-mcp/src/tools.rs.
- RAG preparation: The `--format llm` flag produces token-optimized text that is ~67% smaller than the markdown equivalent, intended for chunking, embedding, and prompt context. Source: examples/html-to-markdown-rag/README.md.
- Diff and brand: `--diff-with pricing-old.json` compares snapshots; `--brand` extracts colors, fonts, logos, and metadata. Source: README.md.
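The bare-domain behavior mentioned under basic extraction can be pictured with a minimal sketch. The function name `normalize_url` and the decision to leave any explicit `http://` or `https://` scheme untouched are assumptions; the CLI's actual handling may differ.

```rust
/// Prefix bare domains with https://, leaving explicit schemes untouched.
/// Hypothetical helper, not the CLI's own implementation.
fn normalize_url(input: &str) -> String {
    let trimmed = input.trim();
    if trimmed.starts_with("http://") || trimmed.starts_with("https://") {
        trimmed.to_string()
    } else {
        format!("https://{}", trimmed)
    }
}
```

With this sketch, `normalize_url("example.com")` yields `https://example.com`, while an input that already carries a scheme is returned unchanged.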
The MCP surface is wired through npx create-webclaw, which generates client configuration blocks for Claude Code, Claude Desktop, Cursor, and Codex CLI. Source: README.md. The REST surface mirrors the CLI: POST /v1/scrape, POST /v1/crawl, POST /v1/extract, POST /v1/diff, etc., all returning JSON envelopes suitable for programmatic use. Source: crates/webclaw-server/src/routes/extract.rs and crates/webclaw-server/src/routes/crawl.rs.
Operational Notes
Prebuilt Linux release binaries (webclaw, webclaw-mcp, webclaw-server) require glibc 2.38+, which prevents them from starting on Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9. Community issue #73 documents the failure mode. Self-hosters on those distributions should build from source or run via Docker until a glibc-compatible release is published. The hosted API at api.webclaw.io and the official SDKs (@webclaw/sdk for npm, webclaw for PyPI, github.com/0xMassi/webclaw-go) remain unaffected because they call the hosted service rather than running the prebuilt binaries on the local host. Source: README.md.
See Also
- CLI Usage and Examples
- MCP Server Integration
- REST API Reference
- LLM Provider Chain Configuration
- Proxy-Backed Crawling Guide
Source: https://github.com/0xMassi/webclaw / Human Manual
Extraction Engine, CLI Usage, and Known CLI Bugs
Related topics: Project Overview and Workspace Architecture, Deployment, Configuration, and Known Failure Modes
Overview
webclaw is a Rust workspace that turns web pages into clean markdown, JSON, and LLM-ready text. The webclaw binary is the primary entry point for one-off extraction, batch URL lists, crawling, diffing, search, vertical scraping, and benchmarking. The extraction pipeline itself lives in shared crates and is reused by the CLI, the HTTP server (webclaw-server), and the MCP server (webclaw-mcp).
The CLI is implemented with clap subcommands inside crates/webclaw-cli/src/main.rs, and it dispatches to specialized handlers. The core content pipeline (HTML → markdown, metadata, links, structured data) is exposed through webclaw-core and consumed uniformly, so the CLI, the /v1/extract route, and MCP scrape return the same ExtractionResult shape. LLM-powered steps (schema extraction, prompt extraction, summarization) are added by crates/webclaw-llm/src/lib.rs and use a provider chain that prefers local Ollama, then OpenAI, Gemini, and Anthropic.
CLI Command Surface
The webclaw binary exposes a small set of subcommands. The most relevant for extraction are summarized below.
| Subcommand | Purpose | Key options |
|---|---|---|
| `webclaw <url>` (default) | Single-URL extraction | `--format`, `--only-main-content`, `--include`, `--exclude`, `--proxy` |
| `webclaw --urls-file` | Batch extraction from a file | `--format`, (per-row optional filename) |
| `webclaw crawl` | Recursive crawl from a seed | `--max-depth`, `--max-pages` |
| `webclaw diff` | Compare two snapshots of a URL | previous content as JSON or markdown |
| `webclaw summarize` | LLM summary of a page | provider config from env |
| `webclaw brand` | Colors, fonts, logos, metadata | runs over an existing fetch |
| `webclaw search` | Serper.dev search, optional `--scrape` | `--serper-key`, `--num`, `--country` |
| `webclaw vertical <name> <url>` | Site-specific extractor (Reddit, GitHub, PyPI, etc.) | `--raw` for compact JSON |
| `webclaw bench <url>` | Token-savings micro-benchmark | `--facts <path>`, `--json` |
The vertical subcommand, defined in crates/webclaw-cli/src/main.rs, runs site-specific extractors that return typed JSON (title, price, author, rating) rather than generic markdown. Catalog names include reddit, github_repo, trustpilot_reviews, youtube_video, shopify_product, pypi, npm, and arxiv. The MCP counterpart vertical_scrape intentionally uses the Firefox browser fingerprint because Reddit's .json endpoint rejects the default Chrome fingerprint with 403s, even from residential IPs (Source: crates/webclaw-mcp/src/server.rs).
Output Formats and Batch Mode
Output is selected with --format (markdown is the default). The formatter in crates/webclaw-cli/src/main.rs maps each OutputFormat variant to a renderer:
- `Markdown`: frontmatter when `--metadata` is set, body markdown, then a "Structured Data" JSON block if present.
- `Json`: pretty-printed `ExtractionResult`.
- `Text`: `content.plain_text` only.
- `Llm`: `to_llm_text(...)` produces a token-optimized string (≈67% smaller than the source HTML, per the README banner in packages/create-webclaw/README.md).
- `Html`: `content.raw_html` when available, falling back to markdown.
For batch input, the collect_urls helper accepts both positional URLs and a --urls-file. Lines with a comma become (url, custom_filename) pairs; bare lines auto-generate filenames from the URL (Source: crates/webclaw-cli/src/main.rs). The per-URL BatchExtractResult is rendered by print_batch_output, which iterates and prints each result joined by ---. JSON output emits a single array of {url, result | error} objects so partial failures are visible.
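The `--urls-file` line format described above (a bare URL, or `url,custom_filename`) can be sketched as follows. This is an illustration rather than the `collect_urls` helper itself: the names `BatchEntry`, `parse_urls_line`, and `parse_urls_file` are hypothetical, and skipping blank lines and splitting at the first comma are assumptions.

```rust
/// One batch entry: the URL plus an optional caller-chosen output filename.
#[derive(Debug, PartialEq)]
struct BatchEntry {
    url: String,
    filename: Option<String>,
}

/// Parse one line of a URLs file; returns None for blank lines (assumption).
fn parse_urls_line(line: &str) -> Option<BatchEntry> {
    let line = line.trim();
    if line.is_empty() {
        return None;
    }
    // Split at the first comma only. URLs containing literal commas are
    // not handled by this sketch.
    match line.split_once(',') {
        Some((url, name)) if !name.trim().is_empty() => Some(BatchEntry {
            url: url.trim().to_string(),
            filename: Some(name.trim().to_string()),
        }),
        Some((url, _)) => Some(BatchEntry {
            url: url.trim().to_string(),
            filename: None,
        }),
        None => Some(BatchEntry {
            url: line.to_string(),
            filename: None,
        }),
    }
}

/// Parse a whole file; entries with `filename: None` would get an
/// auto-generated name downstream.
fn parse_urls_file(contents: &str) -> Vec<BatchEntry> {
    contents.lines().filter_map(parse_urls_line).collect()
}
```

Keeping the filename optional mirrors the description: only rows with a comma carry a custom name, and every other row defers naming to the caller.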
The HTTP server exposes the same pipeline at POST /v1/scrape, POST /v1/extract, POST /v1/crawl, and POST /v1/diff. The extract route requires either a JSON schema or a natural-language prompt and builds its provider chain from environment variables (Source: crates/webclaw-server/src/routes/extract.rs). LLM extraction uses json_mode: true for schema flows and temperature: 0.0 to keep results deterministic (Source: crates/webclaw-llm/src/extract.rs). Summarization uses temperature: 0.3 and strips any thinking tags defensively before returning plain text (Source: crates/webclaw-llm/src/summarize.rs).
Known CLI Bugs and Workarounds
A small number of issues are visible from the public issue tracker and the codebase. They affect the CLI directly, so they are documented here for quick lookup.
- `--only-main-content` is ignored in batch mode. Issue #3 reports that `webclaw --only-main-content --urls-file URLs.txt` does not strip navigation, sidebar, and footer noise on individual rows. The flag is read on the single-URL path but is not propagated to the per-URL batch invocation. Workaround: run the batch with a small wrapper loop, e.g. `while IFS= read -r u; do webclaw --only-main-content "$u"; done < URLs.txt`, or call the hosted `POST /v1/scrape` endpoint with `"only_main_content": true`.
- Linux release binaries require glibc 2.38+. Issue #73 reports that the prebuilt `webclaw`, `webclaw-mcp`, and `webclaw-server` binaries fail to start on Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9 because their glibc is older than 2.38. Workaround: build from source with `cargo install --git https://github.com/0xMassi/webclaw webclaw-cli`, or move to a host whose distribution ships glibc 2.38 or newer.
- Update warning when refreshing `webclaw-mcp`. Issue #30 reports a warning shown when updating the MCP binary. Workaround: rerun `npx create-webclaw` to reinstall the binary and re-emit tool config; the `create-webclaw` package detects installed AI tools and rewrites MCP configuration (Source: packages/create-webclaw/README.md).
Release v0.6.14 (visible in the changelog shared in community context) added docs for search, map, and perf, fixed the deploy step to write WEBCLAW_API_KEY into the generated .env (#68), and repaired the create-webclaw binary download path.
See Also
- Firecrawl-Compatible API example: covers the `/v2` scrape, crawl, map, and search routes.
- MCP Web Scraping example: connecting the CLI binary into Claude Code, Claude Desktop, Cursor, and Codex.
- Cloudflare Diagnostics example: checklist for blocked or empty results on protected sites.
MCP Server, REST API, and LLM Provider Integration
Related topics: Project Overview and Workspace Architecture, Deployment, Configuration, and Known Failure Modes
Webclaw exposes its extraction pipeline through three coordinated surfaces: a local MCP server (webclaw-mcp) for AI agents, a self-hosted REST API (webclaw-server) for programmatic access, and a shared LLM provider chain (webclaw-llm) that powers schema extraction, prompt extraction, and summarization across both. Together they let the same core capabilities — scrape, crawl, extract, summarize — be driven from an IDE assistant, a curl script, or a hosted integration.
High-Level Architecture
flowchart LR
Agent["AI Agent<br/>(Claude Code, Cursor, Codex)"] -->|stdio MCP| MCP[webclaw-mcp]
Curl["cURL / SDK<br/>(TypeScript, Python, Go)"] -->|HTTPS JSON| API[webclaw-server]
MCP --> Core[webclaw-core<br/>fetch + extract]
API --> Core
Core --> LLM[webclaw-llm<br/>ProviderChain]
LLM --> Ollama[(Ollama<br/>local)]
LLM --> OpenAI[(OpenAI)]
LLM --> Gemini[(Gemini)]
LLM --> Anthropic[(Anthropic)]

The MCP binary and the REST binary are thin wrappers: both delegate to the same webclaw-core pipeline, and both can fall back to the same ProviderChain when an LLM is needed. The hosted platform at api.webclaw.io is a closed-source extension of webclaw-server that adds anti-bot bypass, JS rendering, and async crawl jobs (Source: crates/webclaw-server/src/main.rs:1-20).
MCP Server (`webclaw-mcp`)
Transport and lifecycle
The MCP server is a stateless stdio service. Its entry point initializes logging to stderr so that stdout remains a clean MCP transport channel, then serves until the client disconnects (Source: crates/webclaw-mcp/src/main.rs:1-22):
let service = WebclawMcp::new().await.serve(stdio()).await?;
service.waiting().await?;
WEBCLAW_API_KEY is optional at the MCP layer — local extraction works without it, but the key enables cloud fallback for protected sites, JS rendering, hosted search, and hosted research.
Tool surface and parameter coercion
Each MCP tool is declared with #[tool] on the WebclawMcp service. Parameters are typed structs that derive JsonSchema (for automatic schema generation) and Deserialize (for parsing tool-call arguments). Because some MCP clients serialize numbers and booleans as JSON strings, tools.rs ships coercion helpers — deser_opt_u32_or_str and deser_opt_bool_or_str — that accept either form transparently (Source: crates/webclaw-mcp/src/tools.rs:1-60).
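The idea behind the coercion helpers can be shown without serde. The sketch below is a simplified stand-in, not the `deser_opt_u32_or_str` or `deser_opt_bool_or_str` functions from tools.rs: the `RawArg` enum, the function names, and the set of accepted boolean spellings are assumptions.

```rust
/// A loosely typed argument as it might arrive from an MCP client.
/// Hypothetical type for this sketch; the real helpers work on serde values.
#[derive(Debug, Clone, PartialEq)]
enum RawArg {
    Number(u64),
    Bool(bool),
    Text(String),
}

/// Accept either a JSON number or a numeric string such as "50".
fn coerce_u32(arg: &RawArg) -> Result<u32, String> {
    match arg {
        RawArg::Number(n) => {
            if *n <= u64::from(u32::MAX) {
                Ok(*n as u32)
            } else {
                Err(format!("{} does not fit in u32", n))
            }
        }
        RawArg::Text(s) => s
            .trim()
            .parse::<u32>()
            .map_err(|e| format!("invalid number {:?}: {}", s, e)),
        RawArg::Bool(_) => Err("expected a number, got a boolean".to_string()),
    }
}

/// Accept either a JSON boolean or the strings "true" / "false".
fn coerce_bool(arg: &RawArg) -> Result<bool, String> {
    match arg {
        RawArg::Bool(b) => Ok(*b),
        RawArg::Text(s) => match s.trim() {
            "true" => Ok(true),
            "false" => Ok(false),
            other => Err(format!("invalid boolean {:?}", other)),
        },
        RawArg::Number(_) => Err("expected a boolean, got a number".to_string()),
    }
}
```

Accepting both representations at the parameter boundary keeps the tool handlers themselves strictly typed, which is the design the paragraph above describes.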
The exposed tool family mirrors the CLI subcommands: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, and vertical_scrape. The vertical tool, for example, runs a typed extractor by name (e.g. reddit, github_repo, trustpilot_reviews) and returns site-specific JSON rather than generic markdown (Source: crates/webclaw-mcp/src/server.rs:1-30).
TLS-profile selection
A subtle but important detail lives in vertical_scrape: the tool intentionally uses a cached Firefox TLS profile rather than the Chrome profile used by the generic scrape tool. Reddit's .json endpoint rejects the wreq-Chrome fingerprint with a 403 even from residential IPs, while the Firefox fingerprint still passes — Firefox is safe for every other vertical, so it is a strictly safer default for site-specific extractors (Source: crates/webclaw-mcp/src/server.rs:15-35).
Installer
The create-webclaw npm package (v0.1.5) detects supported MCP clients — Claude, Cursor, Windsurf, OpenCode, Codex, Antigravity — and writes the appropriate config (Source: packages/create-webclaw/package.json:1-25).
REST API (`webclaw-server`)
Scope and posture
webclaw-server is deliberately small: a single binary, stateless, with no database and no job queue. The long_about string on the CLI makes the contract explicit — hosted features (anti-bot bypass, JS rendering, async crawl jobs, multi-tenant auth, billing) are *not* implemented here and never will be; they live in the closed-source platform (Source: crates/webclaw-server/src/main.rs:1-15).
The server is built on axum, uses tower-http for CORS and tracing, and wraps the same extraction crates the CLI and MCP server use. JSON shapes mirror the hosted API at api.webclaw.io so the same SDKs work against either deployment.
Structured extraction route
POST /v1/extract accepts a URL plus either a JSON schema or a natural-language prompt (at least one is required) and an optional model override. It validates the input, builds a per-request provider chain from environment variables, and dispatches to either extract_json or extract_with_prompt in webclaw-llm (Source: crates/webclaw-server/src/routes/extract.rs:1-55):
if !has_schema && !has_prompt {
return Err(ApiError::bad_request("either `schema` or `prompt` is required"));
}
The route delegates provider selection entirely to the shared chain, which means self-hosters get the same Ollama → OpenAI → Gemini → Anthropic fallback as the CLI without any per-route configuration.
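The validation step can be sketched in isolation, without axum or the project's `ApiError` type. The struct, enum, and function names here are hypothetical, the `url is required` message is assumed for this route, and the rule that a schema takes precedence when both fields are present is also an assumption.

```rust
/// Simplified request body for the extract route (hypothetical shape).
#[derive(Debug, Default)]
struct ExtractRequest {
    url: String,
    schema: Option<String>, // JSON schema kept as raw text in this sketch
    prompt: Option<String>,
    model: Option<String>, // optional model override, unused here
}

/// Which extraction path the request would take.
#[derive(Debug, PartialEq)]
enum ExtractMode {
    Schema,
    Prompt,
}

/// Reject empty URLs and requests that carry neither a schema nor a prompt.
fn validate_extract_request(req: &ExtractRequest) -> Result<ExtractMode, String> {
    if req.url.trim().is_empty() {
        return Err("url is required".to_string());
    }
    let has_schema = req.schema.as_deref().map_or(false, |s| !s.trim().is_empty());
    let has_prompt = req.prompt.as_deref().map_or(false, |p| !p.trim().is_empty());
    if !has_schema && !has_prompt {
        return Err("either `schema` or `prompt` is required".to_string());
    }
    // Assumption: a schema wins when both are supplied.
    Ok(if has_schema {
        ExtractMode::Schema
    } else {
        ExtractMode::Prompt
    })
}
```

Returning the chosen mode from the validator makes the later dispatch to a schema-based or prompt-based extraction function a simple `match`.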
LLM Provider Integration (`webclaw-llm`)
Provider chain
webclaw-llm is described in its crate doc as a "local-first hybrid architecture" where the provider chain tries Ollama first, then OpenAI, then Gemini, then Anthropic (Source: crates/webclaw-llm/src/lib.rs:1-12). The crate exposes a single abstraction — LlmProvider with an async fn complete — so adding a new provider means implementing one trait, not rewriting call sites.
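The fallback behavior can be illustrated with a deliberately simplified chain. The crate's trait is asynchronous; the sketch below is synchronous, uses `String` errors, and the names `Provider`, `Chain`, `Stub`, and `stub` are hypothetical stand-ins rather than the crate's own types.

```rust
/// Simplified, synchronous stand-in for an LLM provider.
trait Provider {
    fn name(&self) -> &str;
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Ordered list of providers; earlier entries are preferred.
struct Chain {
    providers: Vec<Box<dyn Provider>>,
}

impl Chain {
    /// Return (provider name, completion) from the first provider that succeeds.
    fn complete(&self, prompt: &str) -> Result<(String, String), String> {
        let mut errors = Vec::new();
        for provider in &self.providers {
            match provider.complete(prompt) {
                Ok(text) => return Ok((provider.name().to_string(), text)),
                Err(err) => errors.push(format!("{}: {}", provider.name(), err)),
            }
        }
        Err(format!("all providers failed: {}", errors.join("; ")))
    }
}

/// Test double that either replies with fixed text or reports itself unavailable.
struct Stub {
    name: &'static str,
    reply: Option<&'static str>,
}

impl Provider for Stub {
    fn name(&self) -> &str {
        self.name
    }
    fn complete(&self, _prompt: &str) -> Result<String, String> {
        self.reply
            .map(|r| r.to_string())
            .ok_or_else(|| "unavailable".to_string())
    }
}

/// Convenience constructor returning a boxed trait object.
fn stub(name: &'static str, reply: Option<&'static str>) -> Box<dyn Provider> {
    Box::new(Stub { name, reply })
}
```

Building the chain as `vec![stub("ollama", None), stub("openai", Some("ok"))]` and calling `complete` returns the OpenAI stub's reply, because the local stub reports itself unavailable and the chain moves on to the next entry.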
Schema-based extraction
extract_json builds a system prompt that embeds the user's JSON schema verbatim, sends the page content as the user turn, sets temperature: 0.0 and json_mode: true, and parses the response back into a serde_json::Value. The instruction to the model is explicit: "Return ONLY valid JSON matching the schema. No explanations, no markdown, no commentary." (Source: crates/webclaw-llm/src/extract.rs:1-35).
Prompt-based extraction
extract_with_prompt is the flexible sibling: it accepts a natural-language description of what to extract instead of a schema. The same strip_thinking_tags utility is applied to the response, so models that emit <think>… blocks have those removed before parsing (Source: crates/webclaw-llm/src/lib.rs:1-12).
Summarization
summarize is intentionally minimal — one function, one prompt. It defaults to three sentences, uses temperature: 0.3, and applies strip_thinking_tags as defense-in-depth even though providers already strip them, because summary text is passed directly to end users (Source: crates/webclaw-llm/src/summarize.rs:1-45).
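A thinking-tag filter of the kind described here can be sketched with plain string operations. This is an independent illustration, not the crate's `strip_thinking_tags`: the function name is hypothetical, and dropping everything after an unclosed opening tag is an assumption.

```rust
/// Remove `<think>...</think>` blocks from model output.
/// Assumption: an opening tag with no closing tag discards the rest of the text.
fn strip_think_blocks(input: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find(OPEN) {
        // Keep everything before the opening tag.
        out.push_str(&rest[..start]);
        let after_open = &rest[start + OPEN.len()..];
        match after_open.find(CLOSE) {
            // Skip past the closing tag and continue scanning.
            Some(end) => rest = &after_open[end + CLOSE.len()..],
            // Unclosed block: nothing after it is kept.
            None => rest = "",
        }
    }
    out.push_str(rest);
    out.trim().to_string()
}
```

Running the filter on every response, even when a provider is expected to have cleaned its own output, costs one linear scan and guards against reasoning text reaching end users.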
Failure Modes and Known Issues
- Linux glibc requirement. Prebuilt binaries (`webclaw`, `webclaw-mcp`, `webclaw-server`) require glibc 2.38+, so they fail to start on Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9. Self-hosters on those distros should build from source or use an older release.
- MCP parameter coercion. Some MCP clients serialize numbers and booleans as JSON strings. The coercion helpers in `tools.rs` accept both representations, so callers should not normally see `"invalid type: string"` errors on parameters that use those helpers.
- Update warnings. The `webclaw-mcp` updater may emit a warning when newer binaries are released; reinstalling via `npx create-webclaw` resolves the configuration mismatch.
- Cloud-only features. Anti-bot bypass, JS rendering, async crawl jobs, multi-tenant auth, and billing are not part of `webclaw-server`. Self-hosters who need them must configure a `WEBCLAW_API_KEY` for `api.webclaw.io` so requests can escalate (Source: crates/webclaw-server/src/main.rs:5-15).
See Also
- CLI subcommands and `webclaw bench` micro-benchmark
- Firecrawl-Compatible `/v2` API routes
- Proxy-backed crawling with ColdProxy
- Cloudflare diagnostics for blocked or empty protected-site results
Deployment, Configuration, and Known Failure Modes
Related topics: Project Overview and Workspace Architecture, Extraction Engine, CLI Usage, and Known CLI Bugs
1. Scope and Purpose
webclaw is a multi-surface extraction tool that ships as a CLI binary, an MCP server, a REST API, and SDKs for TypeScript, Python, and Go. Each surface shares the same webclaw-core extraction pipeline and the same webclaw-fetch client, which means deployment, configuration, and failure handling overlap heavily across the surfaces. This page documents how the components are deployed, how they are configured at runtime, and the failure modes that have been reported by users in the project's issue tracker.
The repository's high-level positioning is summarized in the README as: "Turn websites into clean markdown, JSON, and LLM-ready context" via "CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines." Source: README.md
2. Deployment Surfaces
2.1 CLI
The CLI is the simplest deployment form. A single binary is invoked against one or many URLs:
webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
Vertical extractors and web search are exposed as subcommands (webclaw vertical reddit <url>, webclaw search "rust async runtime" --scrape) defined in the clap subcommand enum. Source: crates/webclaw-cli/src/main.rs
A per-URL micro-benchmark is provided via webclaw bench <url>, which measures token reduction between raw HTML and the LLM-format output. Source: crates/webclaw-cli/src/bench.rs
2.2 MCP Server
The MCP server is scaffolded for AI agents through npx create-webclaw, which writes the client configuration block. The server registers tools for scrape, crawl, batch, map, extract, summarize, search, research, and vertical_scrape. Source: crates/webclaw-mcp/src/tools.rs
vertical_scrape deliberately uses the cached Firefox TLS fingerprint because Reddit's .json endpoint rejects the Chrome fingerprint with a 403, even from residential IPs. Source: crates/webclaw-mcp/src/server.rs
2.3 Hosted REST API and Self-Hosted Server
Two REST surfaces exist:
- A hosted `https://api.webclaw.io/v1/*` service used by SDKs and the hosted MCP routes.
- A self-hosted `webclaw-server` binary that mirrors the hosted API. Its route handlers validate that `url` is non-empty and, for `extract`, that at least one of `schema` or `prompt` is supplied. Source: crates/webclaw-server/src/routes/extract.rs
The crawl route returns a JSON document with status, total, completed, errors, elapsed_secs, and a pages array that carries per-page markdown, metadata, and error fields. Source: crates/webclaw-server/src/routes/crawl.rs
The diff route requires a non-empty url and accepts a previous extraction in either Extraction or Minimal { markdown, metadata } form; missing markdown/metadata are filled with empty defaults. Source: crates/webclaw-server/src/routes/diff.rs
3. Configuration
3.1 Provider Chain
The LLM subsystem is local-first and provider-chained. The chain tries Ollama (local), then OpenAI, then Gemini, then Anthropic, exposing this as a single ProviderChain API. Source: crates/webclaw-llm/src/lib.rs
POST /v1/extract builds the chain per request from environment variables, and a model field can override the model name on the chosen provider. Source: crates/webclaw-server/src/routes/extract.rs
3.2 Environment Variables
The codebase documents two important environment variables in user-facing examples:
- `WEBCLAW_API_KEY`: Bearer token for the hosted API. The `fix(deploy): write WEBCLAW_API_KEY in generated .env` change in v0.6.14 ensures the `create-webclaw` scaffolder writes this key into the generated `.env` file. Source: README.md
- `SERPER_API_KEY`: Required for `webclaw search`. The CLI subcommand also accepts `--serper-key` and falls back to this env var. Source: crates/webclaw-cli/src/main.rs
3.3 CLI Flags
Common flags include --format (markdown, json, text, llm, html), --only-main-content, --include "<css selectors>", --exclude "<css selectors>", --crawl --depth N --max-pages M, and --diff-with <file>. The examples index documents each of these with copy-pasteable invocations. Source: examples/README.md
flowchart LR
A[Caller] --> B{Which surface?}
B -- "shell" --> C[webclaw CLI binary]
B -- "agent" --> D[webclaw-mcp / create-webclaw]
B -- "HTTP" --> E[webclaw-server / api.webclaw.io]
C --> F[webclaw-core extract pipeline]
D --> F
E --> F
F --> G[webclaw-fetch client]
F --> H[webclaw-llm provider chain]
H -- Ollama --> H1[local]
H -- OpenAI --> H2[remote]
H -- Gemini --> H3[remote]
H -- Anthropic --> H4[remote]
4. Known Failure Modes
4.1 Prebuilt Linux Binaries and glibc
Issue #73 reports that the prebuilt webclaw, webclaw-mcp, and webclaw-server release binaries require glibc 2.38+. They fail to start on common server distributions such as Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9 because those ship with older glibc. Self-hosters on these distributions must either build from source against the system glibc, run the Docker image, or use the SDKs to call the hosted API instead of running the binary in place.
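Before downloading a prebuilt binary, the host's glibc version can be compared against the 2.38 minimum. The sketch below assumes the version string has already been obtained, for example from `getconf GNU_LIBC_VERSION` (which prints a line such as `glibc 2.36`); the function names are hypothetical and this check is not part of webclaw.

```rust
/// Parse strings such as "2.36" or "glibc 2.36" into (major, minor).
fn parse_glibc_version(raw: &str) -> Option<(u32, u32)> {
    let version = raw.trim().trim_start_matches("glibc").trim();
    let mut parts = version.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    Some((major, minor))
}

/// True when the reported version is at least `min`; unparsable input counts as false.
fn glibc_meets_minimum(raw: &str, min: (u32, u32)) -> bool {
    // Tuples compare lexicographically, so (2, 39) >= (2, 38) and (3, 0) >= (2, 38).
    parse_glibc_version(raw).map_or(false, |found| found >= min)
}
```

For example, Debian 12 ships glibc 2.36, so `glibc_meets_minimum("glibc 2.36", (2, 38))` is false and the source build or Docker route applies.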
4.2 `--only-main-content` Ignored in Batch Mode
Issue #3 reports that --only-main-content is not applied when combined with --urls-file. Both ordering variants — webclaw --only-main-content --urls-file URLs.txt and webclaw --urls-file URLs.txt --only-main-content — produce full-page output rather than main-content-only output. The hosted REST API behaves differently: passing "only_main_content": true inside the JSON body of POST /v1/scrape works as expected, so batch users who need main-content filtering today should script the API directly. Source: examples/html-to-markdown-rag/README.md
4.3 MCP Update Warning
Issue #30 reports a warning when updating webclaw-mcp. The v0.6.14 release included fix(create-webclaw): repair binary... which addresses the scaffolder. Operators should rerun npx create-webclaw after upgrading rather than copying the old mcpServers block forward.
4.4 Empty Result Diagnostics
The fetch layer classifies empty results into categories such as ConsentWall. The unit tests in the CLI crate verify that titles like "Before you continue" and redirect URLs such as https://guce.advertising.com/collectIdentifiers?sessionId=... are flagged, while real articles containing the phrase "Cookie consent patterns explained" are not falsely flagged. The examples/cloudflare-diagnostics/ workflow provides a reproducible checklist for blocked or empty protected-site results. Source: crates/webclaw-cli/src/main.rs, examples/README.md
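The three test cases mentioned above suggest a classifier that matches known wall titles and redirect hosts while ignoring articles that merely discuss consent. The sketch below is one hypothetical way to get that behavior; the marker lists, the `EmptyReason` enum, and the whole-title matching rule are assumptions, and the project's real heuristics may differ.

```rust
/// Reason an extraction came back empty (only the case discussed here).
#[derive(Debug, PartialEq)]
enum EmptyReason {
    ConsentWall,
    Unclassified,
}

// Hypothetical marker lists, lower-cased for comparison.
const WALL_TITLES: [&str; 1] = ["before you continue"];
const WALL_URL_PARTS: [&str; 1] = ["guce.advertising.com/collectidentifiers"];

/// Flag a page as a consent wall from its title or final (post-redirect) URL.
fn classify_empty(title: &str, final_url: &str) -> EmptyReason {
    let title = title.trim().to_lowercase();
    let url = final_url.to_lowercase();
    // Match whole titles rather than substrings, so an article *about*
    // cookie consent is not flagged.
    let title_hit = WALL_TITLES.iter().any(|t| title == *t);
    let url_hit = WALL_URL_PARTS.iter().any(|p| url.contains(*p));
    if title_hit || url_hit {
        EmptyReason::ConsentWall
    } else {
        EmptyReason::Unclassified
    }
}
```

Matching on the final URL as well as the title covers redirects that land on a consent host before any page title is available.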
4.5 Diff Route Input Shape
The diff route accepts a previous payload in two shapes: a full Extraction object or a Minimal { markdown, metadata } object. Servers that store only the markdown of a page must still supply a metadata field (which may be null and will be replaced by an empty default). Requests with an empty url are rejected with a 400 "url is required" error. Source: crates/webclaw-server/src/routes/diff.rs
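The two accepted shapes and the empty-default fill-in can be modeled with an enum and a normalizing function. This is a sketch under assumptions: the type names, the single `title` field on `Metadata`, and the absence of serde attributes are simplifications, not the definitions in crates/webclaw-server/src/routes/diff.rs.

```rust
/// Placeholder metadata; the real type carries more fields (assumption).
#[derive(Debug, Clone, Default, PartialEq)]
struct Metadata {
    title: Option<String>,
}

/// A full prior extraction, reduced to the two fields discussed above.
#[derive(Debug, Clone, Default, PartialEq)]
struct FullExtraction {
    markdown: String,
    metadata: Metadata,
}

/// The `previous` payload: either a full extraction or a minimal object.
enum Previous {
    Extraction(FullExtraction),
    Minimal {
        markdown: Option<String>,
        metadata: Option<Metadata>,
    },
}

/// Normalize either shape into a full extraction, filling gaps with empty defaults.
fn into_extraction(prev: Previous) -> FullExtraction {
    match prev {
        Previous::Extraction(full) => full,
        Previous::Minimal { markdown, metadata } => FullExtraction {
            markdown: markdown.unwrap_or_default(),
            metadata: metadata.unwrap_or_default(),
        },
    }
}
```

Normalizing up front means the diff logic only ever sees one shape, regardless of how much of the earlier snapshot the caller stored.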
See Also
- README.md — Project overview, SDKs, and install paths.
- examples/README.md — Workflow guides and CLI snippets.
- examples/html-to-markdown-rag/README.md — RAG-oriented extraction including the hosted API usage that works around issue #3.
- examples/cloudflare-diagnostics/ — Checklist for blocked/empty protected-site results.
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 11 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/73
2. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/15
3. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.host_targets | https://github.com/0xMassi/webclaw
4. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/62
5. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/0xMassi/webclaw
6. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw
7. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/0xMassi/webclaw
8. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | https://github.com/0xMassi/webclaw
9. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/71
10. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw
11. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using webclaw with real data or production workflows.
- Linux release binaries require glibc 2.38+ — fail on Debian 12 / Ubuntu - github / github_issue
- create-webclaw fails on Windows: asset name mismatch + uses missing unzi - github / github_issue
- `webclaw-server` REST API binary is missing from repo and Docker image - github / github_issue
- MCP boolean params rejected when sent as strings (follow-up to #58 / #59 - github / github_issue
- v0.6.14 - github / github_release
- v0.6.13 - github / github_release
- v0.6.12 - github / github_release
- v0.6.11 - github / github_release
- v0.6.10 - github / github_release
- v0.6.9 - github / github_release
- v0.6.8 - github / github_release
- v0.6.7 - github / github_release
Source: Project Pack community evidence and pitfall evidence