webclaw Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

webclaw

Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.

Project Overview and Workspace Architecture

Related topics: Extraction Engine, CLI Usage, and Known CLI Bugs, MCP Server, REST API, and LLM Provider Integration

Overview

Webclaw is a web-to-context conversion toolkit that turns websites into clean Markdown, JSON, and LLM-ready text. It exposes the same underlying extraction pipeline through four distribution surfaces: a native CLI (webclaw), a Model Context Protocol server (webclaw-mcp), a self-hosted REST API (webclaw-server), and a hosted API with official SDKs in TypeScript, Python, and Go. Source: README.md.

The project is structured as a Rust Cargo workspace with multiple focused crates that share core extraction, fetch, and LLM primitives. This separation lets the same pipeline serve one-off CLI runs, agent integration, and high-throughput crawling without duplicating logic. Source: crates/webclaw-cli/src/main.rs.

Workspace Layout

The repository is a Cargo workspace under crates/. Each crate owns one concern and is consumed by the binaries that need it.

graph TD
    A[webclaw-cli] --> B[webclaw-core]
    A --> C[webclaw-fetch]
    A --> D[webclaw-llm]
    E[webclaw-mcp] --> B
    E --> C
    F[webclaw-server] --> B
    F --> C
    F --> D
    D --> G[Provider Chain<br/>Ollama → OpenAI → Gemini → Anthropic]
    B --> H[Extractors<br/>Vertical Catalogs]
    C --> I[wreq Chrome/Firefox<br/>TLS Fingerprints]

webclaw-cli — clap-based CLI binary. Owns subcommands such as vertical, search, crawl, diff, and bench, plus per-format printers (print_crawl_output, print_batch_output, print_map_output). Source: crates/webclaw-cli/src/main.rs.
webclaw-cli::bench — micro-benchmark harness that measures token reduction from raw HTML to the LLM pipeline output using an approximate chars/4 (Latin) and chars/2 (CJK) tokenizer. Source: crates/webclaw-cli/src/bench.rs.
webclaw-mcp — Model Context Protocol server. Exposes tools (scrape, crawl, map, batch, extract, summarize, vertical_scrape) to MCP clients such as Claude Code, Cursor, and Codex CLI. Source: crates/webclaw-mcp/src/server.rs and crates/webclaw-mcp/src/tools.rs.
webclaw-server — Axum-based REST API. Routes include POST /v1/extract for LLM-powered structured extraction, POST /v1/diff for snapshot comparison, and POST /v1/crawl for multi-page crawling. Source: crates/webclaw-server/src/routes/extract.rs, crates/webclaw-server/src/routes/diff.rs, and crates/webclaw-server/src/routes/crawl.rs.
webclaw-llm — Local-first LLM integration. The ProviderChain tries Ollama first, then falls back to OpenAI, Gemini, and Anthropic. Provides schema extraction, prompt extraction, and summarization. Source: crates/webclaw-llm/src/lib.rs and crates/webclaw-llm/src/summarize.rs.
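
The chars/4 (Latin) and chars/2 (CJK) approximation described for webclaw-cli::bench can be illustrated with a small sketch. This is not the code in bench.rs: the function names, the Unicode ranges treated as CJK, and the round-up behaviour are all assumptions made for the example.

```rust
/// Returns true for code points in a few common CJK blocks.
/// The exact ranges are an assumption of this sketch.
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x30FF       // Hiragana and Katakana
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xAC00..=0xD7AF // Hangul syllables
    )
}

/// Approximate token count: CJK chars / 2 plus all other chars / 4,
/// each rounded up.
fn estimate_tokens(text: &str) -> usize {
    let cjk = text.chars().filter(|c| is_cjk(*c)).count();
    let other = text.chars().count() - cjk;
    (cjk + 1) / 2 + (other + 3) / 4
}

/// Percentage of estimated tokens saved going from `raw_html` to `llm_text`.
fn reduction_percent(raw_html: &str, llm_text: &str) -> f64 {
    let before = estimate_tokens(raw_html) as f64;
    if before == 0.0 {
        return 0.0;
    }
    let after = estimate_tokens(llm_text) as f64;
    (1.0 - after / before) * 100.0
}
```

Under these assumptions, a 400-character Latin input that shrinks to 132 characters is estimated at 100 and 33 tokens respectively, which is a reduction of about 67%.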

Crate Responsibilities and Shared Primitives

Two cross-cutting crates sit beneath the binaries:

webclaw-core — Provides the extract() pipeline, to_llm_text() token-optimized formatter, and a registry of vertical extractors (e.g., reddit, github_repo, trustpilot_reviews, youtube_video, shopify_product, pypi, npm, arxiv). Vertical extractors return typed JSON specific to the target site rather than generic Markdown. Source: crates/webclaw-mcp/src/server.rs.
webclaw-fetch — Owns the HTTP client and BrowserProfile (Chrome vs Firefox). The MCP vertical_scrape tool deliberately uses the cached Firefox client because Reddit's .json endpoint rejects the wreq-Chrome TLS fingerprint with a 403 even from residential IPs; Firefox's fingerprint still passes and is fine for every other vertical. Source: crates/webclaw-mcp/src/server.rs.

The webclaw-llm crate follows a hybrid architecture: it always tries Ollama (local) first and only escalates to hosted providers if the local model cannot answer, keeping self-hosters on local inference by default. Source: crates/webclaw-llm/src/lib.rs.

Distribution Surfaces and CLI Workflows

The CLI is the primary entry point and demonstrates the breadth of capabilities exposed through the shared pipeline:

Basic extraction — webclaw https://example.com -f markdown|json|text|llm writes clean content in the requested format. Bare domains are auto-prefixed with https://. Source: examples/README.md.
Content filtering — --only-main-content skips nav/sidebar/footer, while --include and --exclude accept CSS selectors. Note: community issue #3 reports that --only-main-content is not currently applied when passed alongside --urls-file in batch mode.
Batch and crawl — webclaw --urls-file urls.txt extracts a list; webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50 walks a site. The MCP equivalent takes BatchParams and CrawlParams with explicit concurrency and optional use_sitemap. Source: crates/webclaw-mcp/src/tools.rs.
RAG preparation — The --format llm flag produces token-optimized text that is roughly 67% smaller than the raw HTML, intended for chunking, embedding, and prompt context. Source: examples/html-to-markdown-rag/README.md.
Diff and brand — --diff-with pricing-old.json compares snapshots; --brand extracts colors, fonts, logos, and metadata. Source: README.md.

The MCP surface is wired through npx create-webclaw, which generates client configuration blocks for Claude Code, Claude Desktop, Cursor, and Codex CLI. Source: README.md. The REST surface mirrors the CLI: POST /v1/scrape, POST /v1/crawl, POST /v1/extract, POST /v1/diff, etc., all returning JSON envelopes suitable for programmatic use. Source: crates/webclaw-server/src/routes/extract.rs and crates/webclaw-server/src/routes/crawl.rs.

Operational Notes

Prebuilt Linux release binaries (webclaw, webclaw-mcp, webclaw-server) require glibc 2.38+, which prevents them from starting on Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9. Community issue #73 documents the failure mode. Self-hosters on those distributions should build from source or run via Docker until a glibc-compatible release is published. The hosted API at api.webclaw.io and the official SDKs (@webclaw/sdk for npm, webclaw for PyPI, github.com/0xMassi/webclaw-go) remain unaffected because they target a fixed runtime. Source: README.md.

Extraction Engine, CLI Usage, and Known CLI Bugs

Related topics: Project Overview and Workspace Architecture, Deployment, Configuration, and Known Failure Modes

Overview

webclaw is a Rust workspace that turns web pages into clean markdown, JSON, and LLM-ready text. The webclaw binary is the primary entry point for one-off extraction, batch URL lists, crawling, diffing, search, vertical scraping, and benchmarking. The extraction pipeline itself lives in shared crates and is reused by the CLI, the HTTP server (webclaw-server), and the MCP server (webclaw-mcp).

The CLI is implemented with clap subcommands inside crates/webclaw-cli/src/main.rs, and it dispatches to specialized handlers. The core content pipeline (HTML → markdown, metadata, links, structured data) is exposed through webclaw-core and consumed uniformly, so the CLI, the /v1/extract route, and MCP scrape return the same ExtractionResult shape. LLM-powered steps (schema extraction, prompt extraction, summarization) are added by crates/webclaw-llm/src/lib.rs and use a provider chain that prefers local Ollama, then OpenAI, Gemini, and Anthropic.

CLI Command Surface

The webclaw binary exposes a small set of subcommands and mode flags. The most relevant for extraction are summarized below.

Invocation	Purpose	Key options
`webclaw <url>` (default)	Single-URL extraction	`--format`, `--only-main-content`, `--include`, `--exclude`, `--proxy`
`webclaw --urls-file <file>`	Batch extraction from a file	`--format`; each row may carry an optional output filename
`webclaw <url> --crawl`	Recursive crawl from a seed	`--depth`, `--max-pages`
`webclaw <url> --diff-with <file>`	Compare a URL against a previous snapshot	previous content as JSON or markdown
`webclaw summarize`	LLM summary of a page	provider config from env
`webclaw <url> --brand`	Colors, fonts, logos, metadata	runs over an existing fetch
`webclaw search`	Serper.dev search, optional `--scrape`	`--serper-key`, `--num`, `--country`
`webclaw vertical <name> <url>`	Site-specific extractor (Reddit, GitHub, PyPI, etc.)	`--raw` for compact JSON
`webclaw bench <url>`	Token-savings micro-benchmark	`--facts <path>`, `--json`

The vertical subcommand, defined in crates/webclaw-cli/src/main.rs, runs site-specific extractors that return typed JSON (title, price, author, rating) rather than generic markdown. Catalog names include reddit, github_repo, trustpilot_reviews, youtube_video, shopify_product, pypi, npm, and arxiv. The MCP counterpart vertical_scrape intentionally uses the Firefox browser fingerprint because Reddit's .json endpoint rejects the default Chrome fingerprint with 403s, even from residential IPs (Source: crates/webclaw-mcp/src/server.rs).

Output Formats and Batch Mode

Output is selected with --format (markdown is the default). The formatter in crates/webclaw-cli/src/main.rs maps each OutputFormat variant to a renderer:

Markdown — frontmatter when --metadata is set, body markdown, then a "Structured Data" JSON block if present.
Json — pretty-printed ExtractionResult.
Text — content.plain_text only.
Llm — to_llm_text(...) produces a token-optimized string (≈67% smaller than the source HTML, per the README banner in packages/create-webclaw/README.md).
Html — content.raw_html when available, falling back to markdown.

For batch input, the collect_urls helper accepts both positional URLs and a --urls-file. Lines with a comma become (url, custom_filename) pairs; bare lines auto-generate filenames from the URL (Source: crates/webclaw-cli/src/main.rs). The per-URL BatchExtractResult is rendered by print_batch_output, which iterates and prints each result joined by ---. JSON output emits a single array of {url, result | error} objects so partial failures are visible.
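
The row format accepted by --urls-file can be sketched as follows. This is an illustration rather than the collect_urls implementation: the BatchEntry type, the parse_url_list and filename_from_url names, and the rule used to derive a filename from a URL are all hypothetical.

```rust
/// One parsed row of a URL list: the URL plus the output filename to use.
#[derive(Debug, PartialEq)]
struct BatchEntry {
    url: String,
    filename: String,
}

/// Hypothetical filename rule: drop the scheme and trailing slash, then
/// replace every character that is not ASCII alphanumeric with '_'.
fn filename_from_url(url: &str) -> String {
    let without_scheme = url.split("://").nth(1).unwrap_or(url);
    let slug: String = without_scheme
        .trim_end_matches('/')
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect();
    format!("{}.md", slug)
}

/// Parses the file contents: "url,filename" rows keep the custom name,
/// bare rows get a generated one, and blank lines are skipped.
fn parse_url_list(contents: &str) -> Vec<BatchEntry> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| match line.split_once(',') {
            Some((url, name)) => BatchEntry {
                url: url.trim().to_string(),
                filename: name.trim().to_string(),
            },
            None => BatchEntry {
                url: line.to_string(),
                filename: filename_from_url(line),
            },
        })
        .collect()
}
```

Splitting on the first comma means a URL that itself contains a comma would be misread as a (url, filename) pair in this sketch.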

The HTTP server exposes the same pipeline at POST /v1/scrape, POST /v1/extract, POST /v1/crawl, and POST /v1/diff. The extract route requires either a JSON schema or a natural-language prompt and builds its provider chain from environment variables (Source: crates/webclaw-server/src/routes/extract.rs). LLM extraction uses json_mode: true for schema flows and temperature: 0.0 to keep results deterministic (Source: crates/webclaw-llm/src/extract.rs). Summarization uses temperature: 0.3 and strips any thinking tags defensively before returning plain text (Source: crates/webclaw-llm/src/summarize.rs).

Known CLI Bugs and Workarounds

A small number of issues are visible from the public issue tracker and the codebase. They affect the CLI directly, so they are documented here for quick lookup.

--only-main-content is ignored in batch mode. Issue #3 reports that webclaw --only-main-content --urls-file URLs.txt does not strip navigation, sidebar, and footer noise on individual rows. The flag is read on the single-URL path but is not propagated to the per-URL batch invocation. Workaround: run the batch with a small wrapper loop, e.g. while read u; do webclaw --only-main-content "$u"; done < URLs.txt, or call the hosted POST /v1/scrape endpoint with "only_main_content": true.
Linux release binaries require glibc 2.38+. Issue #73 reports that the prebuilt webclaw, webclaw-mcp, and webclaw-server binaries fail to start on Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9 because their glibc is older than 2.38. Workaround: build from source with cargo install --git https://github.com/0xMassi/webclaw webclaw-cli, or upgrade the host glibc.
Update warning when refreshing webclaw-mcp. Issue #30 reports a warning shown when updating the MCP binary. Workaround: rerun npx create-webclaw to reinstall the binary and re-emit tool config; the create-webclaw package detects installed AI tools and rewrites MCP configuration (Source: packages/create-webclaw/README.md).

Release v0.6.14 (visible in the changelog shared in community context) added docs for search, map, and perf, fixed the deploy step to write WEBCLAW_API_KEY into the generated .env (#68), and repaired the create-webclaw binary download path.

MCP Server, REST API, and LLM Provider Integration

Related topics: Project Overview and Workspace Architecture, Deployment, Configuration, and Known Failure Modes

Webclaw exposes its extraction pipeline through three coordinated surfaces: a local MCP server (webclaw-mcp) for AI agents, a self-hosted REST API (webclaw-server) for programmatic access, and a shared LLM provider chain (webclaw-llm) that powers schema extraction, prompt extraction, and summarization across both. Together they let the same core capabilities — scrape, crawl, extract, summarize — be driven from an IDE assistant, a curl script, or a hosted integration.

High-Level Architecture

flowchart LR
    Agent["AI Agent<br/>(Claude Code, Cursor, Codex)"] -->|stdio MCP| MCP[webclaw-mcp]
    Curl["cURL / SDK<br/>(TypeScript, Python, Go)"] -->|HTTPS JSON| API[webclaw-server]
    MCP --> Core[webclaw-core<br/>fetch + extract]
    API --> Core
    Core --> LLM[webclaw-llm<br/>ProviderChain]
    LLM --> Ollama[(Ollama<br/>local)]
    LLM --> OpenAI[(OpenAI)]
    LLM --> Gemini[(Gemini)]
    LLM --> Anthropic[(Anthropic)]

The MCP binary and the REST binary are thin wrappers: both delegate to the same webclaw-core pipeline, and both can fall back to the same ProviderChain when an LLM is needed. The hosted platform at api.webclaw.io is a closed-source extension of webclaw-server that adds anti-bot bypass, JS rendering, and async crawl jobs (Source: crates/webclaw-server/src/main.rs:1-20).

MCP Server (`webclaw-mcp`)

Transport and lifecycle

The MCP server is a stateless stdio service. Its entry point initializes logging to stderr so that stdout remains a clean MCP transport channel, then serves until the client disconnects (Source: crates/webclaw-mcp/src/main.rs:1-22):

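// Serve the tool router over stdin/stdout; logging goes to stderr.
let service = WebclawMcp::new().await.serve(stdio()).await?;
// Block until the MCP client disconnects.
service.waiting().await?;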

WEBCLAW_API_KEY is optional at the MCP layer — local extraction works without it, but the key enables cloud fallback for protected sites, JS rendering, hosted search, and hosted research.

Tool surface and parameter coercion

Each MCP tool is declared with #[tool] on the WebclawMcp service. Parameters are typed structs that derive JsonSchema (for automatic schema generation) and Deserialize (for parsing tool-call arguments). Because some MCP clients serialize numbers and booleans as JSON strings, tools.rs ships coercion helpers — deser_opt_u32_or_str and deser_opt_bool_or_str — that accept either form transparently (Source: crates/webclaw-mcp/src/tools.rs:1-60).
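
The idea behind the coercion helpers can be shown without serde. The sketch below is not the code in tools.rs: the Arg enum stands in for an already-decoded JSON value, and the coerce_opt_u32 and coerce_opt_bool names are hypothetical.

```rust
/// Minimal stand-in for a decoded JSON argument value.
#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Null,
    Bool(bool),
    Number(u64),
    Text(String),
}

/// Accepts 50, "50", or null for an optional u32 parameter.
fn coerce_opt_u32(arg: &Arg) -> Result<Option<u32>, String> {
    match arg {
        Arg::Null => Ok(None),
        Arg::Number(n) if *n <= u32::MAX as u64 => Ok(Some(*n as u32)),
        Arg::Number(n) => Err(format!("{} does not fit in u32", n)),
        Arg::Text(s) => s
            .trim()
            .parse::<u32>()
            .map(Some)
            .map_err(|e| format!("invalid number {:?}: {}", s, e)),
        Arg::Bool(_) => Err("expected a number or numeric string".to_string()),
    }
}

/// Accepts true, "true", "false", or null for an optional bool parameter.
fn coerce_opt_bool(arg: &Arg) -> Result<Option<bool>, String> {
    match arg {
        Arg::Null => Ok(None),
        Arg::Bool(b) => Ok(Some(*b)),
        Arg::Text(s) => match s.trim().to_ascii_lowercase().as_str() {
            "true" => Ok(Some(true)),
            "false" => Ok(Some(false)),
            other => Err(format!("invalid boolean {:?}", other)),
        },
        Arg::Number(_) => Err("expected a boolean or boolean string".to_string()),
    }
}
```

Both the native and the stringified form map to the same typed value, so the tool handler only ever sees Option<u32> or Option<bool>.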

The exposed tool family mirrors the CLI subcommands: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, and vertical_scrape. The vertical tool, for example, runs a typed extractor by name (e.g. reddit, github_repo, trustpilot_reviews) and returns site-specific JSON rather than generic markdown (Source: crates/webclaw-mcp/src/server.rs:1-30).

TLS-profile selection

A subtle but important detail lives in vertical_scrape: the tool intentionally uses a cached Firefox TLS profile rather than the Chrome profile used by the generic scrape tool. Reddit's .json endpoint rejects the wreq-Chrome fingerprint with a 403 even from residential IPs, while the Firefox fingerprint still passes. Because the Firefox fingerprint also works for every other vertical, it is the safer default for site-specific extractors (Source: crates/webclaw-mcp/src/server.rs:15-35).

Installer

The create-webclaw npm package (v0.1.5) detects supported MCP clients — Claude, Cursor, Windsurf, OpenCode, Codex, Antigravity — and writes the appropriate config (Source: packages/create-webclaw/package.json:1-25).

REST API (`webclaw-server`)

Scope and posture

webclaw-server is deliberately small: a single binary, stateless, with no database and no job queue. The long_about string on the CLI makes the contract explicit — hosted features (anti-bot bypass, JS rendering, async crawl jobs, multi-tenant auth, billing) are *not* implemented here and never will be; they live in the closed-source platform (Source: crates/webclaw-server/src/main.rs:1-15).

The server is built on axum, uses tower-http for CORS and tracing, and wraps the same extraction crates the CLI and MCP server use. JSON shapes mirror the hosted API at api.webclaw.io so the same SDKs work against either deployment.

Structured extraction route

POST /v1/extract accepts a URL plus either a JSON schema or a natural-language prompt (at least one is required) and an optional model override. It validates the input, builds a per-request provider chain from environment variables, and dispatches to either extract_json or extract_with_prompt in webclaw-llm (Source: crates/webclaw-server/src/routes/extract.rs:1-55):

if !has_schema && !has_prompt {
    return Err(ApiError::bad_request("either `schema` or `prompt` is required"));
}

The route delegates provider selection entirely to the shared chain, which means self-hosters get the same Ollama → OpenAI → Gemini → Anthropic fallback as the CLI without any per-route configuration.
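
The validation step can be sketched as a standalone function. The types and names below (ExtractRequest, ExtractMode, choose_mode) are hypothetical, and two behaviours are assumptions rather than facts about the route: whitespace-only fields are treated as missing, and the schema wins when both a schema and a prompt are supplied.

```rust
/// Hypothetical request body; field names follow the route description.
struct ExtractRequest {
    url: String,
    schema: Option<String>,
    prompt: Option<String>,
}

/// Which extraction path a valid request should take.
#[derive(Debug, PartialEq)]
enum ExtractMode {
    Schema(String),
    Prompt(String),
}

/// Treats empty or whitespace-only fields as absent (an assumption).
fn non_blank(field: &Option<String>) -> Option<&str> {
    field.as_deref().map(str::trim).filter(|s| !s.is_empty())
}

fn choose_mode(req: &ExtractRequest) -> Result<ExtractMode, String> {
    if req.url.trim().is_empty() {
        return Err("url is required".to_string());
    }
    match (non_blank(&req.schema), non_blank(&req.prompt)) {
        // Assumption: when both are present, the schema takes precedence.
        (Some(schema), _) => Ok(ExtractMode::Schema(schema.to_string())),
        (None, Some(prompt)) => Ok(ExtractMode::Prompt(prompt.to_string())),
        (None, None) => Err("either `schema` or `prompt` is required".to_string()),
    }
}

/// Small constructor to keep call sites short.
fn request(url: &str, schema: Option<&str>, prompt: Option<&str>) -> ExtractRequest {
    ExtractRequest {
        url: url.to_string(),
        schema: schema.map(str::to_string),
        prompt: prompt.map(str::to_string),
    }
}
```

In a real handler the two variants would map to extract_json and extract_with_prompt respectively.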

LLM Provider Integration (`webclaw-llm`)

Provider chain

webclaw-llm is described in its crate doc as a "local-first hybrid architecture" where the provider chain tries Ollama first, then OpenAI, then Gemini, then Anthropic (Source: crates/webclaw-llm/src/lib.rs:1-12). The crate exposes a single abstraction — LlmProvider with an async fn complete — so adding a new provider means implementing one trait, not rewriting call sites.
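
A minimal sketch of the fallback behaviour follows. It is deliberately synchronous and uses stub providers, so it is not the crate's API: the real LlmProvider::complete is async, and the Stub type, the name method, the error strings, and chain_with_availability are hypothetical.

```rust
/// Simplified, synchronous stand-in for the crate's provider trait.
trait LlmProvider {
    fn name(&self) -> &str;
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Tries each provider in order and returns the first success.
struct ProviderChain {
    providers: Vec<Box<dyn LlmProvider>>,
}

impl ProviderChain {
    /// Returns (provider name, completion) or a summary of every failure.
    fn complete(&self, prompt: &str) -> Result<(String, String), String> {
        let mut errors = Vec::new();
        for provider in &self.providers {
            match provider.complete(prompt) {
                Ok(text) => return Ok((provider.name().to_string(), text)),
                Err(e) => errors.push(format!("{}: {}", provider.name(), e)),
            }
        }
        Err(format!("all providers failed: {}", errors.join("; ")))
    }
}

/// Stub provider that either echoes the prompt or reports itself unavailable.
struct Stub {
    name: &'static str,
    available: bool,
}

impl LlmProvider for Stub {
    fn name(&self) -> &str {
        self.name
    }
    fn complete(&self, prompt: &str) -> Result<String, String> {
        if self.available {
            Ok(format!("[{}] {}", self.name, prompt))
        } else {
            Err("unavailable".to_string())
        }
    }
}

/// Builds the documented order: Ollama, OpenAI, Gemini, Anthropic.
fn chain_with_availability(up: [bool; 4]) -> ProviderChain {
    let names = ["ollama", "openai", "gemini", "anthropic"];
    let providers = names
        .iter()
        .zip(up.iter())
        .map(|(name, available)| {
            Box::new(Stub { name: *name, available: *available }) as Box<dyn LlmProvider>
        })
        .collect();
    ProviderChain { providers }
}
```

With Ollama marked unavailable, the chain answers from the next provider in the list; with every provider down, the error lists each failure in order.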

Schema-based extraction

extract_json builds a system prompt that embeds the user's JSON schema verbatim, sends the page content as the user turn, sets temperature: 0.0 and json_mode: true, and parses the response back into a serde_json::Value. The instruction to the model is explicit: "Return ONLY valid JSON matching the schema. No explanations, no markdown, no commentary." (Source: crates/webclaw-llm/src/extract.rs:1-35).

Prompt-based extraction

extract_with_prompt is the flexible sibling: it accepts a natural-language description of what to extract instead of a schema. The strip_thinking_tags utility is applied to the response, so models that emit <think>… blocks have those removed before parsing (Source: crates/webclaw-llm/src/lib.rs:1-12).
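
One way such a tag-stripping step could work is sketched below. It reuses the strip_thinking_tags name for readability but is not the crate's implementation: the closing tag </think>, the handling of unterminated blocks, and the final trim are assumptions.

```rust
/// Removes every `<think>...</think>` span from `text`.
/// An unterminated `<think>` drops everything after it (an assumption).
fn strip_thinking_tags(text: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(start) = rest.find(OPEN) {
        // Keep everything before the opening tag.
        out.push_str(&rest[..start]);
        let after_open = &rest[start + OPEN.len()..];
        // Skip to just past the closing tag, or discard the remainder.
        rest = match after_open.find(CLOSE) {
            Some(end) => &after_open[end + CLOSE.len()..],
            None => "",
        };
    }
    out.push_str(rest);
    out.trim().to_string()
}
```

Stripping before parsing matters for the JSON paths in particular, because a leading reasoning block would otherwise make the response invalid JSON.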

Summarization

summarize is intentionally minimal — one function, one prompt. It defaults to three sentences, uses temperature: 0.3, and applies strip_thinking_tags as defense-in-depth even though providers already strip them, because summary text is passed directly to end users (Source: crates/webclaw-llm/src/summarize.rs:1-45).

Failure Modes and Known Issues

Linux glibc requirement. Prebuilt binaries (webclaw, webclaw-mcp, webclaw-server) require glibc 2.38+, so they fail to start on Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9. Self-hosters on those distros should build from source or run via Docker.
MCP parameter coercion. Some MCP clients serialize numbers and booleans as JSON strings. The coercion helpers in tools.rs cover both representations, so callers should never see "invalid type: string" errors.
Update warnings. The webclaw-mcp updater may emit a warning when newer binaries are released; reinstalling via npx create-webclaw resolves the configuration mismatch.
Cloud-only features. Anti-bot bypass, JS rendering, async crawl jobs, multi-tenant auth, and billing are not part of webclaw-server. Self-hosters who need them must set WEBCLAW_API_KEY so that requests can escalate to the hosted platform at api.webclaw.io (Source: crates/webclaw-server/src/main.rs:5-15).

Deployment, Configuration, and Known Failure Modes

Related topics: Project Overview and Workspace Architecture, Extraction Engine, CLI Usage, and Known CLI Bugs

1. Scope and Purpose

webclaw is a multi-surface extraction tool that ships as a CLI binary, an MCP server, a REST API, and SDKs for TypeScript, Python, and Go. Each surface shares the same webclaw-core extraction pipeline and the same webclaw-fetch client, which means deployment, configuration, and failure handling overlap heavily across the surfaces. This page documents how the components are deployed, how they are configured at runtime, and the failure modes that have been reported by users in the project's issue tracker.

The repository's high-level positioning is summarized in the README as: "Turn websites into clean markdown, JSON, and LLM-ready context" via "CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines." Source: README.md

2. Deployment Surfaces

2.1 CLI

The CLI is the simplest deployment form. A single binary is invoked against one or many URLs:

webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50

Vertical extractors and web search are exposed as subcommands (webclaw vertical reddit <url>, webclaw search "rust async runtime" --scrape) defined in the clap subcommand enum. Source: crates/webclaw-cli/src/main.rs

A per-URL micro-benchmark is provided via webclaw bench <url>, which measures token reduction between raw HTML and the LLM-format output. Source: crates/webclaw-cli/src/bench.rs

2.2 MCP Server

The MCP server is scaffolded for AI agents through npx create-webclaw, which writes the client configuration block. The server registers tools for scrape, crawl, batch, map, extract, summarize, search, research, and vertical_scrape. Source: crates/webclaw-mcp/src/tools.rs

vertical_scrape deliberately uses the cached Firefox TLS fingerprint because Reddit's .json endpoint rejects the Chrome fingerprint with a 403, even from residential IPs. Source: crates/webclaw-mcp/src/server.rs

2.3 Hosted REST API and Self-Hosted Server

Two REST surfaces exist:

A hosted https://api.webclaw.io/v1/* service used by SDKs and the hosted MCP routes.
A self-hosted webclaw-server binary that mirrors the hosted API. Its route handlers validate url is non-empty and, for extract, that at least one of schema or prompt is supplied. Source: crates/webclaw-server/src/routes/extract.rs

The crawl route returns a JSON document with status, total, completed, errors, elapsed_secs, and a pages array that carries per-page markdown, metadata, and error fields. Source: crates/webclaw-server/src/routes/crawl.rs

The diff route requires a non-empty url and accepts a previous extraction in either Extraction or Minimal { markdown, metadata } form; missing markdown/metadata are filled with empty defaults. Source: crates/webclaw-server/src/routes/diff.rs

3. Configuration

3.1 Provider Chain

The LLM subsystem is local-first and provider-chained. The chain tries Ollama (local), then OpenAI, then Gemini, then Anthropic, exposing this as a single ProviderChain API. Source: crates/webclaw-llm/src/lib.rs

POST /v1/extract builds the chain per request from environment variables, and a model field can override the model name on the chosen provider. Source: crates/webclaw-server/src/routes/extract.rs

3.2 Environment Variables

The codebase documents two important environment variables in user-facing examples:

WEBCLAW_API_KEY — Bearer token for the hosted API. The v0.6.14 change "fix(deploy): write WEBCLAW_API_KEY in generated .env" makes the deploy step write this key into the generated .env file. Source: README.md
SERPER_API_KEY — Required for webclaw search. The CLI subcommand also accepts --serper-key and falls back to this env var. Source: crates/webclaw-cli/src/main.rs

3.3 CLI Flags

Common flags include --format (markdown, json, text, llm, html), --only-main-content, --include "<css selectors>", --exclude "<css selectors>", --crawl --depth N --max-pages M, and --diff-with <file>. The examples index documents each of these with copy-pasteable invocations. Source: examples/README.md

flowchart LR
  A[Caller] --> B{Which surface?}
  B -- "shell" --> C[webclaw CLI binary]
  B -- "agent" --> D[webclaw-mcp / create-webclaw]
  B -- "HTTP" --> E[webclaw-server / api.webclaw.io]
  C --> F[webclaw-core extract pipeline]
  D --> F
  E --> F
  F --> G[webclaw-fetch client]
  F --> H[webclaw-llm provider chain]
  H -- Ollama --> H1[local]
  H -- OpenAI --> H2[remote]
  H -- Gemini --> H3[remote]
  H -- Anthropic --> H4[remote]

4. Known Failure Modes

4.1 Prebuilt Linux Binaries and glibc

Issue #73 reports that the prebuilt webclaw, webclaw-mcp, and webclaw-server release binaries require glibc 2.38+. They fail to start on common server distributions such as Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9 because those ship with older glibc. Self-hosters on these distributions must either build from source against the system glibc, run the Docker image, or use the SDKs to call the hosted API instead of running the binary in place.

4.2 `--only-main-content` Ignored in Batch Mode

Issue #3 reports that --only-main-content is not applied when combined with --urls-file. Both ordering variants — webclaw --only-main-content --urls-file URLs.txt and webclaw --urls-file URLs.txt --only-main-content — produce full-page output rather than main-content-only output. The hosted REST API behaves differently: passing "only_main_content": true inside the JSON body of POST /v1/scrape works as expected, so batch users who need main-content filtering today should script the API directly. Source: examples/html-to-markdown-rag/README.md

4.3 MCP Update Warning

Issue #30 reports a warning when updating webclaw-mcp. The v0.6.14 release included fix(create-webclaw): repair binary... which addresses the scaffolder. Operators should rerun npx create-webclaw after upgrading rather than copying the old mcpServers block forward.

4.4 Empty Result Diagnostics

The fetch layer classifies empty results into categories such as ConsentWall. The unit tests in the CLI crate verify that titles like "Before you continue" and redirect URLs such as https://guce.advertising.com/collectIdentifiers?sessionId=... are flagged, while real articles containing the phrase "Cookie consent patterns explained" are not falsely flagged. The examples/cloudflare-diagnostics/ workflow provides a reproducible checklist for blocked or empty protected-site results. Source: crates/webclaw-cli/src/main.rs, examples/README.md
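
The behaviour those tests describe can be approximated with a short classifier. This is a sketch, not the fetch-layer code: the EmptyReason enum, the classify_empty_result name, and the two matching rules (a title prefix and a redirect-URL marker) are assumptions built only from the examples quoted above.

```rust
/// Why an extraction came back empty. Only one category is modelled here.
#[derive(Debug, PartialEq)]
enum EmptyReason {
    ConsentWall,
    Unclassified,
}

/// Title prefixes that indicate an interstitial consent page (assumed list).
const CONSENT_TITLES: [&str; 1] = ["before you continue"];

/// URL fragments that indicate a consent redirect (assumed list).
const CONSENT_REDIRECT_MARKERS: [&str; 1] = ["guce.advertising.com/collectidentifiers"];

fn classify_empty_result(title: &str, final_url: &str) -> EmptyReason {
    let title = title.trim().to_lowercase();
    let url = final_url.to_lowercase();
    // Match on a title prefix so articles that merely mention consent pass.
    let title_hit = CONSENT_TITLES.iter().any(|t| title.starts_with(*t));
    let redirect_hit = CONSENT_REDIRECT_MARKERS.iter().any(|m| url.contains(*m));
    if title_hit || redirect_hit {
        EmptyReason::ConsentWall
    } else {
        EmptyReason::Unclassified
    }
}
```

Matching a title prefix instead of searching for the word "consent" anywhere is what keeps an article titled "Cookie consent patterns explained" from being flagged in this sketch.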

4.5 Diff Route Input Shape

The diff route accepts a previous payload in two shapes: a full Extraction object or a Minimal { markdown, metadata } object. Clients that store only the markdown of a page can use the Minimal shape; a missing or null metadata field is replaced by an empty default. Requests with an empty url are rejected with a 400 "url is required" error. Source: crates/webclaw-server/src/routes/diff.rs

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 11 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/73

2. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/15

3. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.host_targets | https://github.com/0xMassi/webclaw

4. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/62

5. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/0xMassi/webclaw

6. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw

7. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/0xMassi/webclaw

8. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/0xMassi/webclaw

9. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/71

10. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw

11. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using webclaw with real data or production workflows.

Linux release binaries require glibc 2.38+ — fail on Debian 12 / Ubuntu - github / github_issue
create-webclaw fails on Windows: asset name mismatch + uses missing unzi - github / github_issue
webclaw-server REST API binary is missing from repo and Docker image - github / github_issue
MCP boolean params rejected when sent as strings (follow-up to #58 / #59 - github / github_issue
v0.6.14 - github / github_release
v0.6.13 - github / github_release
v0.6.12 - github / github_release
v0.6.11 - github / github_release
v0.6.10 - github / github_release
v0.6.9 - github / github_release
v0.6.8 - github / github_release
v0.6.7 - github / github_release

Source: Project Pack community evidence and pitfall evidence
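
The glibc 2.38+ requirement named in the first issue title above can be checked on a target host before downloading a prebuilt Linux binary. The following is a minimal sketch, not part of webclaw: the helper names `parse_version` and `glibc_is_sufficient` are hypothetical, and the only external call is Python's standard `platform.libc_ver()`. It assumes the version floor stated in the issue title is accurate for the release you intend to install.

```python
import platform

# Minimum glibc version, taken from the issue title in the list above.
REQUIRED_GLIBC = (2, 38)


def parse_version(text):
    """Turn a dotted version string such as '2.36' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def glibc_is_sufficient(version_text, required=REQUIRED_GLIBC):
    """Return True when version_text is at least the required glibc version."""
    return parse_version(version_text) >= required


if __name__ == "__main__":
    # libc_ver() returns empty strings when it cannot identify the C library.
    libc_name, libc_version = platform.libc_ver()
    if libc_name != "glibc" or not libc_version:
        print("Could not detect glibc (non-glibc or non-Linux system).")
    elif glibc_is_sufficient(libc_version):
        print(f"glibc {libc_version} meets the 2.38+ requirement.")
    else:
        print(f"glibc {libc_version} is older than 2.38; prebuilt binaries may not start.")
```

If the check reports an older glibc, building from source or running inside a container with a newer base image are the usual alternatives. Confirm against the linked issue which of these the project recommends.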
