# https://github.com/0xMassi/webclaw Project Manual

Generated at: 2026-06-28 18:11:44 UTC

## Table of Contents

- [Project Overview and Workspace Architecture](#page-overview)
- [Extraction Engine, CLI Usage, and Known CLI Bugs](#page-extraction)
- [MCP Server, REST API, and LLM Provider Integration](#page-mcp-server)
- [Deployment, Configuration, and Known Failure Modes](#page-deployment)

<a id='page-overview'></a>

## Project Overview and Workspace Architecture

### Related Pages

Related topics: [Extraction Engine, CLI Usage, and Known CLI Bugs](#page-extraction), [MCP Server, REST API, and LLM Provider Integration](#page-mcp-server)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/0xMassi/webclaw/blob/main/README.md)
- [examples/README.md](https://github.com/0xMassi/webclaw/blob/main/examples/README.md)
- [examples/html-to-markdown-rag/README.md](https://github.com/0xMassi/webclaw/blob/main/examples/html-to-markdown-rag/README.md)
- [crates/webclaw-cli/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/main.rs)
- [crates/webclaw-cli/src/bench.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/bench.rs)
- [crates/webclaw-mcp/src/server.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/server.rs)
- [crates/webclaw-mcp/src/tools.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/tools.rs)
- [crates/webclaw-server/src/routes/extract.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/extract.rs)
- [crates/webclaw-server/src/routes/diff.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/diff.rs)
- [crates/webclaw-server/src/routes/crawl.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/crawl.rs)
- [crates/webclaw-llm/src/lib.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/lib.rs)
- [crates/webclaw-llm/src/summarize.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/summarize.rs)
</details>


## Overview

Webclaw is a web-to-context conversion toolkit that turns websites into clean Markdown, JSON, and LLM-ready text. It exposes the same underlying extraction pipeline through four distribution surfaces: a native CLI (`webclaw`), a Model Context Protocol server (`webclaw-mcp`), a self-hosted REST API (`webclaw-server`), and a hosted API with official SDKs in TypeScript, Python, and Go. Source: [README.md](https://github.com/0xMassi/webclaw/blob/main/README.md).

The project is structured as a Rust Cargo workspace with multiple focused crates that share core extraction, fetch, and LLM primitives. This separation lets the same pipeline serve one-off CLI runs, agent integration, and high-throughput crawling without duplicating logic. Source: [crates/webclaw-cli/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/main.rs).

## Workspace Layout

The repository is a Cargo workspace under `crates/`. Each crate owns one concern and is consumed by the binaries that need it.

```mermaid
graph TD
    A[webclaw-cli] --> B[webclaw-core]
    A --> C[webclaw-fetch]
    A --> D[webclaw-llm]
    E[webclaw-mcp] --> B
    E --> C
    F[webclaw-server] --> B
    F --> C
    F --> D
    D --> G[Provider Chain<br/>Ollama → OpenAI → Gemini → Anthropic]
    B --> H[Extractors<br/>Vertical Catalogs]
    C --> I[wreq Chrome/Firefox<br/>TLS Fingerprints]
```

- **webclaw-cli** — clap-based CLI binary. Owns subcommands such as `vertical`, `search`, `crawl`, `diff`, and `bench`, plus per-format printers (`print_crawl_output`, `print_batch_output`, `print_map_output`). Source: [crates/webclaw-cli/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/main.rs).
- **webclaw-cli::bench** — micro-benchmark harness that measures token reduction from raw HTML to the LLM pipeline output using an approximate `chars/4` (Latin) and `chars/2` (CJK) tokenizer. Source: [crates/webclaw-cli/src/bench.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/bench.rs).
- **webclaw-mcp** — Model Context Protocol server. Exposes tools (`scrape`, `crawl`, `map`, `batch`, `extract`, `summarize`, `vertical_scrape`) to MCP clients such as Claude Code, Cursor, and Codex CLI. Source: [crates/webclaw-mcp/src/server.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/server.rs) and [crates/webclaw-mcp/src/tools.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/tools.rs).
- **webclaw-server** — Axum-based REST API. Routes include `POST /v1/extract` for LLM-powered structured extraction, `POST /v1/diff` for snapshot comparison, and `POST /v1/crawl` for multi-page crawling. Source: [crates/webclaw-server/src/routes/extract.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/extract.rs), [crates/webclaw-server/src/routes/diff.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/diff.rs), and [crates/webclaw-server/src/routes/crawl.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/crawl.rs).
- **webclaw-llm** — Local-first LLM integration. The `ProviderChain` tries Ollama first, then falls back to OpenAI, Gemini, and Anthropic. Provides schema extraction, prompt extraction, and summarization. Source: [crates/webclaw-llm/src/lib.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/lib.rs) and [crates/webclaw-llm/src/summarize.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/summarize.rs).
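
The `chars/4` and `chars/2` heuristic described for the bench harness can be illustrated with a short sketch. This is not the code in `bench.rs`: the function names, the Unicode ranges used to detect CJK text, and the decision to round up are all assumptions made for illustration.

```rust
/// Rough check for CJK text: kana, Han ideographs, and Hangul syllables.
/// The exact ranges are an assumption of this sketch.
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x30FF | 0x4E00..=0x9FFF | 0xAC00..=0xD7AF
    )
}

/// Approximate token count: about 4 characters per token for Latin text
/// and about 2 characters per token for CJK text, rounded up.
fn estimate_tokens(text: &str) -> usize {
    let cjk = text.chars().filter(|c| is_cjk(*c)).count();
    let other = text.chars().count() - cjk;
    (other + 3) / 4 + (cjk + 1) / 2
}

/// Percentage of estimated tokens saved when going from `raw` to `reduced`.
fn reduction_percent(raw: &str, reduced: &str) -> f64 {
    let before = estimate_tokens(raw);
    if before == 0 {
        return 0.0;
    }
    let after = estimate_tokens(reduced);
    (1.0 - after as f64 / before as f64) * 100.0
}
```

Under this heuristic, 400 Latin characters reduced to 100 are counted as 100 and 25 tokens, a 75% reduction.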

## Crate Responsibilities and Shared Primitives

Two cross-cutting crates sit beneath the binaries:

- **webclaw-core** — Provides the `extract()` pipeline, `to_llm_text()` token-optimized formatter, and a registry of vertical extractors (e.g., `reddit`, `github_repo`, `trustpilot_reviews`, `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`). Vertical extractors return typed JSON specific to the target site rather than generic Markdown. Source: [crates/webclaw-mcp/src/server.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/server.rs).
- **webclaw-fetch** — Owns the HTTP client and `BrowserProfile` (Chrome vs Firefox). The MCP `vertical_scrape` tool deliberately uses the cached Firefox client because Reddit's `.json` endpoint rejects the `wreq-Chrome` TLS fingerprint with a 403 even from residential IPs; Firefox's fingerprint still passes and is fine for every other vertical. Source: [crates/webclaw-mcp/src/server.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/server.rs).

The `webclaw-llm` crate follows a hybrid architecture: it always tries Ollama (local) first and only escalates to hosted providers if the local model cannot answer, keeping self-hosters on local inference by default. Source: [crates/webclaw-llm/src/lib.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/lib.rs).

## Distribution Surfaces and CLI Workflows

The CLI is the primary entry point and demonstrates the breadth of capabilities exposed through the shared pipeline:

- **Basic extraction** — `webclaw https://example.com -f markdown|json|text|llm` writes clean content in the requested format. Bare domains are auto-prefixed with `https://`. Source: [examples/README.md](https://github.com/0xMassi/webclaw/blob/main/examples/README.md).
- **Content filtering** — `--only-main-content` skips nav/sidebar/footer, while `--include` and `--exclude` accept CSS selectors. Note: community issue #3 reports that `--only-main-content` is not currently applied when passed alongside `--urls-file` in batch mode.
- **Batch and crawl** — `webclaw --urls-file urls.txt` extracts a list; `webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50` walks a site. The MCP equivalent takes `BatchParams` and `CrawlParams` with explicit `concurrency` and optional `use_sitemap`. Source: [crates/webclaw-mcp/src/tools.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/tools.rs).
- **RAG preparation** — The `--format llm` flag produces token-optimized text, roughly 67% smaller than the raw HTML it was extracted from, intended for chunking, embedding, and prompt context. Source: [examples/html-to-markdown-rag/README.md](https://github.com/0xMassi/webclaw/blob/main/examples/html-to-markdown-rag/README.md).
- **Diff and brand** — `--diff-with pricing-old.json` compares snapshots; `--brand` extracts colors, fonts, logos, and metadata. Source: [README.md](https://github.com/0xMassi/webclaw/blob/main/README.md).
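
The auto-prefixing of bare domains noted under basic extraction could look roughly like the following. `normalize_url` is a hypothetical helper, not a function taken from the CLI source, and the real implementation may handle more cases (other schemes, validation).

```rust
/// Prefix bare domains with `https://`; leave explicit http(s) URLs untouched.
fn normalize_url(input: &str) -> String {
    let trimmed = input.trim();
    if trimmed.starts_with("http://") || trimmed.starts_with("https://") {
        trimmed.to_string()
    } else {
        format!("https://{trimmed}")
    }
}
```

With this sketch, `normalize_url("example.com")` yields `https://example.com`, while a URL that already starts with `http://` is returned unchanged.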

The MCP surface is wired through `npx create-webclaw`, which generates client configuration blocks for Claude Code, Claude Desktop, Cursor, and Codex CLI. Source: [README.md](https://github.com/0xMassi/webclaw/blob/main/README.md). The REST surface mirrors the CLI: `POST /v1/scrape`, `POST /v1/crawl`, `POST /v1/extract`, `POST /v1/diff`, etc., all returning JSON envelopes suitable for programmatic use. Source: [crates/webclaw-server/src/routes/extract.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/extract.rs) and [crates/webclaw-server/src/routes/crawl.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/crawl.rs).

## Operational Notes

Prebuilt Linux release binaries (`webclaw`, `webclaw-mcp`, `webclaw-server`) require glibc 2.38+, which prevents them from starting on Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9. Community issue #73 documents the failure mode. Self-hosters on those distributions should build from source or run via Docker until a glibc-compatible release is published. The hosted API at `api.webclaw.io` and the official SDKs (`@webclaw/sdk` for npm, `webclaw` for PyPI, `github.com/0xMassi/webclaw-go`) are unaffected because they call the hosted service rather than running the prebuilt binaries locally. Source: [README.md](https://github.com/0xMassi/webclaw/blob/main/README.md).

## See Also

- CLI Usage and Examples
- MCP Server Integration
- REST API Reference
- LLM Provider Chain Configuration
- Proxy-Backed Crawling Guide

---

<a id='page-extraction'></a>

## Extraction Engine, CLI Usage, and Known CLI Bugs

### Related Pages

Related topics: [Project Overview and Workspace Architecture](#page-overview), [Deployment, Configuration, and Known Failure Modes](#page-deployment)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [crates/webclaw-cli/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/main.rs)
- [crates/webclaw-cli/src/bench.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/bench.rs)
- [crates/webclaw-server/src/routes/extract.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/extract.rs)
- [crates/webclaw-server/src/routes/crawl.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/crawl.rs)
- [crates/webclaw-server/src/routes/diff.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/diff.rs)
- [crates/webclaw-mcp/src/server.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/server.rs)
- [crates/webclaw-llm/src/lib.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/lib.rs)
- [crates/webclaw-llm/src/extract.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/extract.rs)
- [crates/webclaw-llm/src/summarize.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/summarize.rs)
- [examples/html-to-markdown-rag/README.md](https://github.com/0xMassi/webclaw/blob/main/examples/html-to-markdown-rag/README.md)
- [examples/README.md](https://github.com/0xMassi/webclaw/blob/main/examples/README.md)
- [README.md](https://github.com/0xMassi/webclaw/blob/main/README.md)
- [packages/create-webclaw/README.md](https://github.com/0xMassi/webclaw/blob/main/packages/create-webclaw/README.md)
</details>


## Overview

webclaw is a Rust workspace that turns web pages into clean markdown, JSON, and LLM-ready text. The `webclaw` binary is the primary entry point for one-off extraction, batch URL lists, crawling, diffing, search, vertical scraping, and benchmarking. The extraction pipeline itself lives in shared crates and is reused by the CLI, the HTTP server (`webclaw-server`), and the MCP server (`webclaw-mcp`).

The CLI is implemented with `clap` subcommands inside [crates/webclaw-cli/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/main.rs), and it dispatches to specialized handlers. The core content pipeline (HTML → markdown, metadata, links, structured data) is exposed through `webclaw-core` and consumed uniformly, so the CLI, the `/v1/extract` route, and MCP `scrape` return the same `ExtractionResult` shape. LLM-powered steps (schema extraction, prompt extraction, summarization) are added by [crates/webclaw-llm/src/lib.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/lib.rs) and use a provider chain that prefers local Ollama, then OpenAI, Gemini, and Anthropic.

## CLI Command Surface

The `webclaw` binary exposes a small set of subcommands and flags. The most relevant for extraction are summarized below.

| Invocation | Purpose | Key options |
|---|---|---|
| `webclaw <url>` (default) | Single-URL extraction | `--format`, `--only-main-content`, `--include`, `--exclude`, `--proxy` |
| `webclaw --urls-file <file>` | Batch extraction from a file | `--format`; each line may carry an optional custom filename |
| `webclaw <url> --crawl` | Recursive crawl from a seed | `--depth`, `--max-pages` |
| `webclaw <url> --diff-with <snapshot>` | Compare the page against an earlier snapshot | previous content as JSON or markdown |
| `webclaw summarize` | LLM summary of a page | provider config from env |
| `webclaw <url> --brand` | Colors, fonts, logos, metadata | runs over an existing fetch |
| `webclaw search` | Serper.dev search, optional `--scrape` | `--serper-key`, `--num`, `--country` |
| `webclaw vertical <name> <url>` | Site-specific extractor (Reddit, GitHub, PyPI, etc.) | `--raw` for compact JSON |
| `webclaw bench <url>` | Token-savings micro-benchmark | `--facts <path>`, `--json` |

The vertical subcommand, defined in [crates/webclaw-cli/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/main.rs), runs site-specific extractors that return typed JSON (title, price, author, rating) rather than generic markdown. Catalog names include `reddit`, `github_repo`, `trustpilot_reviews`, `youtube_video`, `shopify_product`, `pypi`, `npm`, and `arxiv`. The MCP counterpart `vertical_scrape` intentionally uses the Firefox browser fingerprint because Reddit's `.json` endpoint rejects the default Chrome fingerprint with 403s, even from residential IPs (Source: [crates/webclaw-mcp/src/server.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/server.rs)).

## Output Formats and Batch Mode

Output is selected with `--format` (`markdown` is the default). The formatter in [crates/webclaw-cli/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/main.rs) maps each `OutputFormat` variant to a renderer:

- `Markdown` — frontmatter when `--metadata` is set, body markdown, then a "Structured Data" JSON block if present.
- `Json` — pretty-printed `ExtractionResult`.
- `Text` — `content.plain_text` only.
- `Llm` — `to_llm_text(...)` produces a token-optimized string (≈67% smaller than the source HTML, per the README banner in [packages/create-webclaw/README.md](https://github.com/0xMassi/webclaw/blob/main/packages/create-webclaw/README.md)).
- `Html` — `content.raw_html` when available, falling back to markdown.

For batch input, the `collect_urls` helper accepts both positional URLs and a `--urls-file`. Lines with a comma become `(url, custom_filename)` pairs; bare lines auto-generate filenames from the URL (Source: [crates/webclaw-cli/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/main.rs)). The per-URL `BatchExtractResult` is rendered by `print_batch_output`, which iterates and prints each result joined by `---`. JSON output emits a single array of `{url, result | error}` objects so partial failures are visible.
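
The line format accepted by `--urls-file` can be sketched as a small parser. This is an illustration rather than the `collect_urls` implementation: the `BatchEntry` type and `parse_urls_file` name are hypothetical, skipping blank lines is an assumption, and splitting on the first comma means a URL that itself contains a comma would be misread.

```rust
/// One entry from a URLs file: the URL and an optional custom filename.
#[derive(Debug, PartialEq)]
struct BatchEntry {
    url: String,
    filename: Option<String>,
}

/// Parse file contents line by line.
/// `url,filename` becomes a pair; a bare line leaves the filename unset
/// so that the caller can derive one from the URL.
fn parse_urls_file(contents: &str) -> Vec<BatchEntry> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| match line.split_once(',') {
            Some((url, name)) => BatchEntry {
                url: url.trim().to_string(),
                filename: Some(name.trim().to_string()),
            },
            None => BatchEntry {
                url: line.to_string(),
                filename: None,
            },
        })
        .collect()
}
```

Keeping the filename as an `Option` separates parsing from filename generation, so the fallback naming rule can change without touching the parser.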

The HTTP server exposes the same pipeline at `POST /v1/scrape`, `POST /v1/extract`, `POST /v1/crawl`, and `POST /v1/diff`. The `extract` route requires either a JSON schema or a natural-language prompt and builds its provider chain from environment variables (Source: [crates/webclaw-server/src/routes/extract.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/extract.rs)). LLM extraction uses `json_mode: true` for schema flows and `temperature: 0.0` to keep results deterministic (Source: [crates/webclaw-llm/src/extract.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/extract.rs)). Summarization uses `temperature: 0.3` and strips any thinking tags defensively before returning plain text (Source: [crates/webclaw-llm/src/summarize.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/summarize.rs)).

## Known CLI Bugs and Workarounds

A small number of issues are visible from the public issue tracker and the codebase. They affect the CLI directly, so they are documented here for quick lookup.

- `--only-main-content` is ignored in batch mode. Issue #3 reports that `webclaw --only-main-content --urls-file URLs.txt` does not strip navigation, sidebar, and footer noise on individual rows. The flag is read on the single-URL path but is not propagated to the per-URL batch invocation. **Workaround:** run the batch with a small wrapper loop, e.g. `while IFS= read -r u; do webclaw --only-main-content "$u"; done < URLs.txt`, or call the hosted `POST /v1/scrape` endpoint with `"only_main_content": true`.
- Linux release binaries require glibc 2.38+. Issue #73 reports that the prebuilt `webclaw`, `webclaw-mcp`, and `webclaw-server` binaries fail to start on Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9 because their glibc is older than 2.38. **Workaround:** build from source with `cargo install --git https://github.com/0xMassi/webclaw webclaw-cli`, or run via Docker.
- Update warning when refreshing `webclaw-mcp`. Issue #30 reports a warning shown when updating the MCP binary. **Workaround:** rerun `npx create-webclaw` to reinstall the binary and re-emit tool config; the `create-webclaw` package detects installed AI tools and rewrites MCP configuration (Source: [packages/create-webclaw/README.md](https://github.com/0xMassi/webclaw/blob/main/packages/create-webclaw/README.md)).

According to its changelog, release v0.6.14 added docs for `search`, `map`, and `perf`, fixed the deploy step so that it writes `WEBCLAW_API_KEY` into the generated `.env` (#68), and repaired the `create-webclaw` binary download path.

## See Also

- [Firecrawl-Compatible API example](https://github.com/0xMassi/webclaw/blob/main/examples/firecrawl-compatible-api/) — covers the `/v2` scrape, crawl, map, and search routes.
- [MCP Web Scraping example](https://github.com/0xMassi/webclaw/blob/main/examples/mcp-web-scraping/) — connecting the CLI binary into Claude Code, Claude Desktop, Cursor, and Codex.
- [Cloudflare Diagnostics example](https://github.com/0xMassi/webclaw/blob/main/examples/cloudflare-diagnostics/) — checklist for blocked or empty results on protected sites.

---

<a id='page-mcp-server'></a>

## MCP Server, REST API, and LLM Provider Integration

### Related Pages

Related topics: [Project Overview and Workspace Architecture](#page-overview), [Deployment, Configuration, and Known Failure Modes](#page-deployment)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [crates/webclaw-mcp/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/main.rs)
- [crates/webclaw-mcp/src/server.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/server.rs)
- [crates/webclaw-mcp/src/tools.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/tools.rs)
- [crates/webclaw-server/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/main.rs)
- [crates/webclaw-server/src/routes/extract.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/extract.rs)
- [crates/webclaw-llm/src/lib.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/lib.rs)
- [crates/webclaw-llm/src/extract.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/extract.rs)
- [crates/webclaw-llm/src/summarize.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/summarize.rs)
- [packages/create-webclaw/package.json](https://github.com/0xMassi/webclaw/blob/main/packages/create-webclaw/package.json)
</details>


Webclaw exposes its extraction pipeline through three coordinated surfaces: a local **MCP server** (`webclaw-mcp`) for AI agents, a self-hosted **REST API** (`webclaw-server`) for programmatic access, and a shared **LLM provider chain** (`webclaw-llm`) that powers schema extraction, prompt extraction, and summarization across both. Together they let the same core capabilities — scrape, crawl, extract, summarize — be driven from an IDE assistant, a curl script, or a hosted integration.

## High-Level Architecture

```mermaid
flowchart LR
    Agent["AI Agent<br/>(Claude Code, Cursor, Codex)"] -->|stdio MCP| MCP[webclaw-mcp]
    Curl["cURL / SDK<br/>(TypeScript, Python, Go)"] -->|HTTPS JSON| API[webclaw-server]
    MCP --> Core[webclaw-core<br/>fetch + extract]
    API --> Core
    Core --> LLM[webclaw-llm<br/>ProviderChain]
    LLM --> Ollama[(Ollama<br/>local)]
    LLM --> OpenAI[(OpenAI)]
    LLM --> Gemini[(Gemini)]
    LLM --> Anthropic[(Anthropic)]
```

The MCP binary and the REST binary are thin wrappers: both delegate to the same `webclaw-core` pipeline, and both can fall back to the same `ProviderChain` when an LLM is needed. The hosted platform at `api.webclaw.io` is a closed-source extension of `webclaw-server` that adds anti-bot bypass, JS rendering, and async crawl jobs (Source: [crates/webclaw-server/src/main.rs:1-20](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/main.rs#L1-L20)).

## MCP Server (`webclaw-mcp`)

### Transport and lifecycle

The MCP server is a stateless stdio service. Its entry point initializes logging to stderr so that stdout remains a clean MCP transport channel, then serves until the client disconnects (Source: [crates/webclaw-mcp/src/main.rs:1-22](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/main.rs#L1-L22)):

```rust
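// Serve MCP over stdio; logs go to stderr, so stdout carries only protocol messages.
let service = WebclawMcp::new().await.serve(stdio()).await?;
// Block until the client disconnects.
service.waiting().await?;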
```

`WEBCLAW_API_KEY` is optional at the MCP layer — local extraction works without it, but the key enables cloud fallback for protected sites, JS rendering, hosted search, and hosted research.

### Tool surface and parameter coercion

Each MCP tool is declared with `#[tool]` on the `WebclawMcp` service. Parameters are typed structs that derive `JsonSchema` (for automatic schema generation) and `Deserialize` (for parsing tool-call arguments). Because some MCP clients serialize numbers and booleans as JSON strings, `tools.rs` ships two coercion helpers, `deser_opt_u32_or_str` and `deser_opt_bool_or_str`, that accept either form transparently (Source: [crates/webclaw-mcp/src/tools.rs:1-60](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/tools.rs#L1-L60)).
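
The coercion idea can be sketched without serde. The real helpers are serde deserializers; the version below only models the decision logic, using a hypothetical `Arg` enum as a stand-in for an incoming JSON value. The function names and error messages are assumptions.

```rust
/// Minimal stand-in for a JSON value as an MCP client might send it.
#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Missing,
    Number(u64),
    Bool(bool),
    Text(String),
}

/// Accept `5`, `"5"`, or a missing field; reject anything else.
fn coerce_opt_u32(arg: &Arg) -> Result<Option<u32>, String> {
    match arg {
        Arg::Missing => Ok(None),
        Arg::Number(n) if *n <= u32::MAX as u64 => Ok(Some(*n as u32)),
        Arg::Number(n) => Err(format!("{n} does not fit in u32")),
        Arg::Text(s) => s
            .trim()
            .parse::<u32>()
            .map(Some)
            .map_err(|e| format!("invalid number {s:?}: {e}")),
        Arg::Bool(_) => Err("expected a number".to_string()),
    }
}

/// Accept `true`, `"true"`, `"false"`, or a missing field.
fn coerce_opt_bool(arg: &Arg) -> Result<Option<bool>, String> {
    match arg {
        Arg::Missing => Ok(None),
        Arg::Bool(b) => Ok(Some(*b)),
        Arg::Text(s) => match s.trim().to_ascii_lowercase().as_str() {
            "true" => Ok(Some(true)),
            "false" => Ok(Some(false)),
            other => Err(format!("invalid boolean {other:?}")),
        },
        Arg::Number(_) => Err("expected a boolean".to_string()),
    }
}
```

Both the native and the stringified form map to the same `Option` value, so tool handlers downstream do not need to know which client produced the call.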

The exposed tool family mirrors the CLI subcommands: `scrape`, `crawl`, `map`, `batch`, `extract`, `summarize`, `diff`, `brand`, `research`, `search`, and `vertical_scrape`. The vertical tool, for example, runs a typed extractor by name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`) and returns site-specific JSON rather than generic markdown (Source: [crates/webclaw-mcp/src/server.rs:1-30](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/server.rs#L1-L30)).

### TLS-profile selection

A subtle but important detail lives in `vertical_scrape`: the tool intentionally uses a cached Firefox TLS profile rather than the Chrome profile used by the generic `scrape` tool. Reddit's `.json` endpoint rejects the wreq-Chrome fingerprint with a 403 even from residential IPs, while the Firefox fingerprint still passes. Because Firefox also works for every other vertical, it is the safer default for site-specific extractors (Source: [crates/webclaw-mcp/src/server.rs:15-35](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/server.rs#L15-L35)).

### Installer

The `create-webclaw` npm package (v0.1.5) detects supported MCP clients (Claude, Cursor, Windsurf, OpenCode, Codex, Antigravity) and writes the appropriate config (Source: [packages/create-webclaw/package.json:1-25](https://github.com/0xMassi/webclaw/blob/main/packages/create-webclaw/package.json#L1-L25)).

## REST API (`webclaw-server`)

### Scope and posture

`webclaw-server` is deliberately small: a single binary, stateless, with no database and no job queue. The `long_about` string on the CLI makes the contract explicit: hosted features (anti-bot bypass, JS rendering, async crawl jobs, multi-tenant auth, billing) are *not* implemented here and never will be; they live in the closed-source platform (Source: [crates/webclaw-server/src/main.rs:1-15](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/main.rs#L1-L15)).

The server is built on `axum`, uses `tower-http` for CORS and tracing, and wraps the same extraction crates the CLI and MCP server use. JSON shapes mirror the hosted API at `api.webclaw.io` so the same SDKs work against either deployment.

### Structured extraction route

`POST /v1/extract` accepts a URL plus either a JSON schema or a natural-language prompt (at least one is required) and an optional model override. It validates the input, builds a per-request provider chain from environment variables, and dispatches to either `extract_json` or `extract_with_prompt` in `webclaw-llm` (Source: [crates/webclaw-server/src/routes/extract.rs:1-55](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/extract.rs#L1-L55)):

```rust
if !has_schema && !has_prompt {
    return Err(ApiError::bad_request("either `schema` or `prompt` is required"));
}
```

The route delegates provider selection entirely to the shared chain, which means self-hosters get the same Ollama → OpenAI → Gemini → Anthropic fallback as the CLI without any per-route configuration.

## LLM Provider Integration (`webclaw-llm`)

### Provider chain

`webclaw-llm` is described in its crate doc as a "local-first hybrid architecture" where the provider chain tries Ollama first, then OpenAI, then Gemini, then Anthropic (Source: [crates/webclaw-llm/src/lib.rs:1-12](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/lib.rs#L1-L12)). The crate exposes a single abstraction, `LlmProvider` with an `async fn complete`, so adding a new provider means implementing one trait, not rewriting call sites.
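
The ordered-fallback behaviour can be sketched with a simplified, synchronous trait. The crate's real trait is async and its signatures are not reproduced here; the method shapes, the error type, and the `FakeProvider` test double below are assumptions made so the example stays self-contained.

```rust
/// Simplified, synchronous stand-in for the crate's async provider trait.
trait LlmProvider {
    fn name(&self) -> &str;
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Tries each provider in order and returns the first successful answer,
/// together with the name of the provider that produced it.
struct ProviderChain {
    providers: Vec<Box<dyn LlmProvider>>,
}

impl ProviderChain {
    fn complete(&self, prompt: &str) -> Result<(String, String), Vec<String>> {
        let mut failures = Vec::new();
        for provider in &self.providers {
            match provider.complete(prompt) {
                Ok(text) => return Ok((provider.name().to_string(), text)),
                Err(err) => failures.push(format!("{}: {}", provider.name(), err)),
            }
        }
        // Every provider failed: report all errors in chain order.
        Err(failures)
    }
}

/// Hypothetical test double: answers only when `available` is true.
struct FakeProvider {
    name: &'static str,
    available: bool,
}

impl LlmProvider for FakeProvider {
    fn name(&self) -> &str {
        self.name
    }

    fn complete(&self, prompt: &str) -> Result<String, String> {
        if self.available {
            Ok(format!("[{}] {}", self.name, prompt))
        } else {
            Err("unreachable".to_string())
        }
    }
}
```

With an unavailable `ollama` entry followed by an available `openai` entry, the chain returns the `openai` answer. Collecting the per-provider errors instead of dropping them makes a fully failed chain easier to diagnose.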

### Schema-based extraction

`extract_json` builds a system prompt that embeds the user's JSON schema verbatim, sends the page content as the user turn, sets `temperature: 0.0` and `json_mode: true`, and parses the response back into a `serde_json::Value`. The instruction to the model is explicit: "Return ONLY valid JSON matching the schema. No explanations, no markdown, no commentary." (Source: [crates/webclaw-llm/src/extract.rs:1-35](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/extract.rs#L1-L35)).
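
A request assembled along these lines might look like the sketch below. Only the quoted instruction, `temperature: 0.0`, and `json_mode: true` come from the description above; the `CompletionRequest` struct, the `build_schema_request` name, and the first sentence of the system prompt are hypothetical.

```rust
/// Hypothetical request shape used only for this illustration.
#[derive(Debug, PartialEq)]
struct CompletionRequest {
    system: String,
    user: String,
    temperature: f32,
    json_mode: bool,
}

/// Embed the schema verbatim in the system prompt and send the page as the user turn.
fn build_schema_request(schema_json: &str, page_content: &str) -> CompletionRequest {
    let system = format!(
        "Extract data from the page content according to this JSON schema:\n{schema_json}\n\
         Return ONLY valid JSON matching the schema. No explanations, no markdown, no commentary."
    );
    CompletionRequest {
        system,
        user: page_content.to_string(),
        temperature: 0.0, // favour repeatable output for structured extraction
        json_mode: true,  // ask the provider for JSON-only output
    }
}
```

Keeping the schema in the system turn and the page in the user turn means a long page cannot displace the formatting instruction.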

### Prompt-based extraction

`extract_with_prompt` is the flexible sibling: it accepts a natural-language description of what to extract instead of a schema. The `strip_thinking_tags` utility is applied to the response, so models that emit `<think>…` blocks have those removed before parsing (Source: [crates/webclaw-llm/src/lib.rs:1-12](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/lib.rs#L1-L12)).

### Summarization

`summarize` is intentionally minimal: one function, one prompt. It defaults to three sentences, uses `temperature: 0.3`, and applies `strip_thinking_tags` as defense-in-depth even though providers already strip them, because summary text is passed directly to end users (Source: [crates/webclaw-llm/src/summarize.rs:1-45](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/summarize.rs#L1-L45)).
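
What stripping thinking blocks involves can be shown with a small standalone function. This is a sketch, not the crate's `strip_thinking_tags`: the name `strip_think_blocks`, the assumption that blocks close with `</think>`, and the choice to discard everything after an unclosed opening tag are all illustrative.

```rust
/// Remove every `<think>...</think>` span from `input`.
/// An unclosed `<think>` drops the remainder of the text (an assumption of this sketch).
fn strip_think_blocks(input: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";

    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find(OPEN) {
        // Keep the text before the opening tag.
        out.push_str(&rest[..start]);
        match rest[start..].find(CLOSE) {
            // Skip past the closing tag and continue scanning.
            Some(end) => rest = &rest[start + end + CLOSE.len()..],
            None => {
                rest = "";
                break;
            }
        }
    }
    out.push_str(rest);
    out.trim().to_string()
}
```

Running the function on text without any tags returns the text unchanged (apart from trimming), which is what makes it safe to apply as an extra pass.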

## Failure Modes and Known Issues

- **Linux glibc requirement.** Prebuilt binaries (`webclaw`, `webclaw-mcp`, `webclaw-server`) require glibc 2.38+, so they fail to start on Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9. Self-hosters on those distros should build from source or use an older release.
- **MCP parameter coercion.** Some MCP clients serialize numbers and booleans as JSON strings. The coercion helpers in `tools.rs` cover both representations, so callers should never see `"invalid type: string"` errors.
- **Update warnings.** The `webclaw-mcp` updater may emit a warning when newer binaries are released; reinstalling via `npx create-webclaw` resolves the configuration mismatch.
- **Cloud-only features.** Anti-bot bypass, JS rendering, async crawl jobs, multi-tenant auth, and billing are not part of `webclaw-server`. Self-hosters who need them must set `WEBCLAW_API_KEY` so that requests can escalate to the hosted platform at `api.webclaw.io` (Source: [crates/webclaw-server/src/main.rs:5-15](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/main.rs#L5-L15)).

## See Also

- CLI subcommands and `webclaw bench` micro-benchmark
- Firecrawl-Compatible `/v2` API routes
- Proxy-backed crawling with ColdProxy
- Cloudflare diagnostics for blocked or empty protected-site results

---

<a id='page-deployment'></a>

## Deployment, Configuration, and Known Failure Modes

### Related Pages

Related topics: [Project Overview and Workspace Architecture](#page-overview), [Extraction Engine, CLI Usage, and Known CLI Bugs](#page-extraction)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/0xMassi/webclaw/blob/main/README.md)
- [examples/README.md](https://github.com/0xMassi/webclaw/blob/main/examples/README.md)
- [examples/html-to-markdown-rag/README.md](https://github.com/0xMassi/webclaw/blob/main/examples/html-to-markdown-rag/README.md)
- [crates/webclaw-cli/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/main.rs)
- [crates/webclaw-mcp/src/server.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/server.rs)
- [crates/webclaw-mcp/src/tools.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-mcp/src/tools.rs)
- [crates/webclaw-llm/src/lib.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-llm/src/lib.rs)
- [crates/webclaw-server/src/routes/extract.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/extract.rs)
- [crates/webclaw-server/src/routes/diff.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/diff.rs)
- [crates/webclaw-server/src/routes/crawl.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-server/src/routes/crawl.rs)
- [crates/webclaw-cli/src/bench.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/bench.rs)
</details>


## 1. Scope and Purpose

webclaw is a multi-surface extraction tool that ships as a CLI binary, an MCP server, a REST API, and SDKs for TypeScript, Python, and Go. Each surface shares the same `webclaw-core` extraction pipeline and the same `webclaw-fetch` client, which means deployment, configuration, and failure handling overlap heavily across the surfaces. This page documents how the components are deployed, how they are configured at runtime, and the failure modes that have been reported by users in the project's issue tracker.

The repository's high-level positioning is summarized in the README as: "Turn websites into clean markdown, JSON, and LLM-ready context" via "CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines." Source: [README.md](https://github.com/0xMassi/webclaw/blob/main/README.md)

## 2. Deployment Surfaces

### 2.1 CLI

The CLI is the simplest deployment form. A single binary is invoked against one or many URLs:

```bash
webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
```
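
How `--depth` and `--max-pages` bound a crawl like the one above can be sketched as a breadth-first walk over an in-memory link graph. The traversal order, the convention that depth 0 is the seed itself, and the `crawl_order` name are assumptions; the actual crawler fetches pages over the network and may order or deduplicate URLs differently.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first walk bounded by link depth and total page count.
/// `links` maps a page to the pages it links to.
fn crawl_order(
    seed: &str,
    links: &HashMap<&str, Vec<&str>>,
    max_depth: usize,
    max_pages: usize,
) -> Vec<String> {
    let mut visited: HashSet<&str> = HashSet::new();
    let mut queue: VecDeque<(&str, usize)> = VecDeque::new();
    let mut order = Vec::new();

    visited.insert(seed);
    queue.push_back((seed, 0));

    while let Some((page, depth)) = queue.pop_front() {
        if order.len() >= max_pages {
            break; // page budget exhausted
        }
        order.push(page.to_string());
        if depth == max_depth {
            continue; // do not follow links past the depth limit
        }
        for &next in links.get(page).into_iter().flatten() {
            // `insert` returns false for pages that were already queued.
            if visited.insert(next) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    order
}
```

Marking pages as visited when they are queued, rather than when they are processed, keeps a page that is linked from several places from being enqueued more than once.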

Vertical extractors and web search are exposed as subcommands (`webclaw vertical reddit <url>`, `webclaw search "rust async runtime" --scrape`) defined in the clap subcommand enum. Source: [crates/webclaw-cli/src/main.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/main.rs)

A per-URL micro-benchmark is provided via `webclaw bench <url>`, which measures token reduction between raw HTML and the LLM-format output. Source: [crates/webclaw-cli/src/bench.rs](https://github.com/0xMassi/webclaw/blob/main/crates/webclaw-cli/src/bench.rs)

### 2.2 MCP Server

The MCP server is scaffolded for AI agents through `npx create-webclaw`, which writes the client configuration block. The server registers tools for `scrape`, `crawl`, `batch_scrape`, `map`, `extract`, `summarize`, `search`, `research`, and `vertical_scrape`. Source: [crates/webclaw-mcp/src/tools.rs]()

`vertical_scrape` deliberately uses the cached Firefox TLS fingerprint because Reddit's `.json` endpoint rejects the Chrome fingerprint with a 403, even from residential IPs. Source: [crates/webclaw-mcp/src/server.rs]()

### 2.3 Hosted REST API and Self-Hosted Server

Two REST surfaces exist:

- A hosted `https://api.webclaw.io/v1/*` service used by SDKs and the hosted MCP routes.
- A self-hosted `webclaw-server` binary that mirrors the hosted API. Its route handlers validate `url` is non-empty and, for `extract`, that at least one of `schema` or `prompt` is supplied. Source: [crates/webclaw-server/src/routes/extract.rs]()
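
The validation rules for `extract` can be sketched as a small function. This is an illustration of the described checks, not the server's actual handler: the function name, the treatment of whitespace-only URLs as empty, and the second error message are assumptions. Only the "url is required" wording appears elsewhere on this page.

```python
from typing import Optional


def validate_extract_request(body: dict) -> Optional[str]:
    """Return an error message for an invalid body, or None when it passes."""
    url = body.get("url")
    # Assumption: a whitespace-only URL counts as empty.
    if not isinstance(url, str) or not url.strip():
        return "url is required"
    # At least one of `schema` or `prompt` must be supplied.
    if not body.get("schema") and not body.get("prompt"):
        # Hypothetical message; the real wording is not documented here.
        return "either schema or prompt is required"
    return None
```

Running the same check on the client side before sending a request avoids a round trip for a body the server would reject anyway.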

The crawl route returns a JSON document with `status`, `total`, `completed`, `errors`, `elapsed_secs`, and a `pages` array that carries per-page `markdown`, `metadata`, and `error` fields. Source: [crates/webclaw-server/src/routes/crawl.rs]()
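
A caller will usually want to separate successful pages from failed ones. The sketch below assumes only the field names listed above; the function name and the keys of the returned summary are illustrative, and it assumes a page's `error` is empty or `null` when the page succeeded.

```python
def summarize_crawl(response: dict) -> dict:
    """Condense a crawl response into counts a caller can log or alert on."""
    pages = response.get("pages", [])
    # Assumption: a truthy `error` marks a failed page.
    failed = [p for p in pages if p.get("error")]
    ok = [p for p in pages if not p.get("error")]
    return {
        "status": response.get("status"),
        "completed": response.get("completed", 0),
        "total": response.get("total", 0),
        "ok_pages": len(ok),
        "failed_pages": len(failed),
        # Total markdown size across successful pages.
        "markdown_chars": sum(len(p.get("markdown") or "") for p in ok),
    }
```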

The diff route requires a non-empty `url` and accepts a `previous` extraction in either `Extraction` or `Minimal { markdown, metadata }` form; missing markdown/metadata are filled with empty defaults. Source: [crates/webclaw-server/src/routes/diff.rs]()

## 3. Configuration

### 3.1 Provider Chain

The LLM subsystem is local-first and provider-chained. The chain tries Ollama (local), then OpenAI, then Gemini, then Anthropic, exposing this as a single `ProviderChain` API. Source: [crates/webclaw-llm/src/lib.rs]()
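
The fallback order can be illustrated with a generic chain. This is a sketch of the pattern, not the `ProviderChain` API itself: it assumes the chain moves to the next provider whenever a call raises, which the source does not spell out, and every name in it is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a (name, callable) pair; the callable takes a prompt
# and returns the completion text, or raises on failure.
Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(providers: Sequence[Provider], prompt: str) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, completion)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # assumption: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Passing the providers in the order Ollama, OpenAI, Gemini, Anthropic reproduces the local-first behavior described above: remote providers are contacted only when the local one fails.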

`POST /v1/extract` builds the chain per request from environment variables, and a `model` field can override the model name on the chosen provider. Source: [crates/webclaw-server/src/routes/extract.rs]()

### 3.2 Environment Variables

The codebase documents two important environment variables in user-facing examples:

- `WEBCLAW_API_KEY` — Bearer token for the hosted API. The `fix(deploy): write WEBCLAW_API_KEY in generated .env` change in v0.6.14 ensures the `create-webclaw` scaffolder writes this key into the generated `.env` file. Source: [README.md]()
- `SERPER_API_KEY` — Required for `webclaw search`. The CLI subcommand also accepts `--serper-key` and falls back to this env var. Source: [crates/webclaw-cli/src/main.rs]()
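
The flag-then-environment lookup for the Serper key follows a common pattern, sketched below. The function name and error message are hypothetical; only `--serper-key` and `SERPER_API_KEY` come from the source.

```python
import os
from typing import Mapping, Optional


def resolve_serper_key(flag_value: Optional[str],
                       env: Mapping[str, str] = os.environ) -> str:
    """Prefer the explicit flag value, then fall back to SERPER_API_KEY."""
    key = flag_value or env.get("SERPER_API_KEY", "")
    if not key:
        # Hypothetical message; the CLI's real error text may differ.
        raise ValueError("no Serper key: pass --serper-key or set SERPER_API_KEY")
    return key
```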

### 3.3 CLI Flags

Common flags include `--format` (markdown, json, text, llm, html), `--only-main-content`, `--include "<css selectors>"`, `--exclude "<css selectors>"`, `--crawl --depth N --max-pages M`, and `--diff-with <file>`. The examples index documents each of these with copy-pasteable invocations. Source: [examples/README.md]()

```mermaid
flowchart LR
  A[Caller] --> B{Which surface?}
  B -- "shell" --> C[webclaw CLI binary]
  B -- "agent" --> D[webclaw-mcp / create-webclaw]
  B -- "HTTP" --> E[webclaw-server / api.webclaw.io]
  C --> F[webclaw-core extract pipeline]
  D --> F
  E --> F
  F --> G[webclaw-fetch client]
  F --> H[webclaw-llm provider chain]
  H -- Ollama --> H1[local]
  H -- OpenAI --> H2[remote]
  H -- Gemini --> H3[remote]
  H -- Anthropic --> H4[remote]
```

## 4. Known Failure Modes

### 4.1 Prebuilt Linux Binaries and glibc

Issue #73 reports that the prebuilt `webclaw`, `webclaw-mcp`, and `webclaw-server` release binaries require glibc 2.38+. They fail to start on common server distributions such as Debian 12, Ubuntu 22.04 LTS, Amazon Linux 2023, and RHEL/Rocky 9 because those ship with older glibc. Self-hosters on these distributions must either build from source against the system glibc, run the Docker image, or use the SDKs to call the hosted API instead of running the binary in place.
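
Before downloading a release binary, the host's glibc version can be compared against the reported 2.38 minimum. The sketch below uses Python's standard `platform.libc_ver()`; the helper names are illustrative, and the 2.38 threshold is taken from the issue report rather than verified against the binaries.

```python
import platform
from typing import Tuple

REQUIRED_GLIBC = (2, 38)  # minimum reported in issue #73


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.36' into (2, 36); non-numeric parts are ignored."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_new_enough(lib: str, version: str,
                        required: Tuple[int, ...] = REQUIRED_GLIBC) -> bool:
    """True only when the C library is glibc at or above the required version."""
    return lib == "glibc" and parse_version(version) >= required


def host_glibc_ok() -> bool:
    """Check the interpreter's own C library (returns False on musl or non-Linux)."""
    lib, version = platform.libc_ver()
    return glibc_is_new_enough(lib, version)
```

If `host_glibc_ok()` returns `False`, the remaining options are the ones listed above: build from source, run the Docker image, or call the hosted API through an SDK.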

### 4.2 `--only-main-content` Ignored in Batch Mode

Issue #3 reports that `--only-main-content` is not applied when combined with `--urls-file`. Both ordering variants — `webclaw --only-main-content --urls-file URLs.txt` and `webclaw --urls-file URLs.txt --only-main-content` — produce full-page output rather than main-content-only output. The hosted REST API behaves differently: passing `"only_main_content": true` inside the JSON body of `POST /v1/scrape` works as expected, so batch users who need main-content filtering today should script the API directly. Source: [examples/html-to-markdown-rag/README.md]()
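
The API workaround can be scripted by turning each line of the URLs file into a `POST /v1/scrape` request. The sketch below only builds the requests and does not send them. It assumes the request body uses a `url` field alongside `only_main_content`, that one URL per line is the file format, and that the key is sent as a Bearer token; the function name is hypothetical and the response shape is not covered here.

```python
import json
import urllib.request
from typing import Iterable, List

API_BASE = "https://api.webclaw.io/v1"


def build_scrape_requests(lines: Iterable[str], api_key: str) -> List[urllib.request.Request]:
    """Build one POST /v1/scrape request per non-empty line."""
    requests = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        body = json.dumps({"url": url, "only_main_content": True}).encode("utf-8")
        requests.append(urllib.request.Request(
            f"{API_BASE}/scrape",
            data=body,
            method="POST",
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
        ))
    return requests
```

Each request can then be sent with `urllib.request.urlopen(req)`, reading the key from `WEBCLAW_API_KEY` and opening the URLs file with a plain `open(...)`.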

### 4.3 MCP Update Warning

Issue #30 reports a warning when updating `webclaw-mcp`. The v0.6.14 release included `fix(create-webclaw): repair binary...` which addresses the scaffolder. Operators should rerun `npx create-webclaw` after upgrading rather than copying the old `mcpServers` block forward.

### 4.4 Empty Result Diagnostics

The fetch layer classifies empty results into categories such as `ConsentWall`. The unit tests in the CLI crate verify that titles like "Before you continue" and redirect URLs such as `https://guce.advertising.com/collectIdentifiers?sessionId=...` are flagged, while real articles containing the phrase "Cookie consent patterns explained" are not falsely flagged. The `examples/cloudflare-diagnostics/` workflow provides a reproducible checklist for blocked or empty protected-site results. Source: [crates/webclaw-cli/src/main.rs](), [examples/README.md]()
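
A heuristic consistent with those test cases is sketched below. It is not the project's classifier: the function name, the two lookup sets, and the prefix-match rule are assumptions chosen so that an interstitial title or a consent-redirect host is flagged, while an article that merely mentions cookie consent is not.

```python
from urllib.parse import urlparse

# Hypothetical lookup sets, seeded from the examples in the text above.
INTERSTITIAL_TITLE_PREFIXES = ("before you continue",)
CONSENT_HOSTS = {"guce.advertising.com"}


def looks_like_consent_wall(title: str, final_url: str) -> bool:
    """Flag a page whose title or final URL matches a known consent interstitial."""
    normalized = title.strip().lower()
    # Match on the start of the title, so articles that merely
    # mention "consent" somewhere in the title are not flagged.
    if normalized.startswith(INTERSTITIAL_TITLE_PREFIXES):
        return True
    host = urlparse(final_url).hostname or ""
    return host in CONSENT_HOSTS
```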

### 4.5 Diff Route Input Shape

The diff route accepts a `previous` payload in two shapes: a full `Extraction` object or a `Minimal { markdown, metadata }` object. Callers that store only the markdown of a page must still supply a `metadata` field (which may be `null` and will be replaced by an empty default). Requests with an empty `url` are rejected with a 400 "url is required" error. Source: [crates/webclaw-server/src/routes/diff.rs]()
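
Callers that persist only part of a previous extraction can normalize it before sending. The sketch below fills the two fields with empty defaults and leaves any other keys untouched. The function name is hypothetical, and it assumes the empty default for `metadata` is an empty object.

```python
def normalize_previous(previous: dict) -> dict:
    """Return a copy of `previous` with markdown/metadata defaulted when missing or null."""
    normalized = dict(previous)
    normalized["markdown"] = previous.get("markdown") or ""
    # Assumption: an empty object is an acceptable empty default for metadata.
    normalized["metadata"] = previous.get("metadata") or {}
    return normalized
```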

## See Also

- [README.md](https://github.com/0xMassi/webclaw/blob/main/README.md) — Project overview, SDKs, and install paths.
- [examples/README.md](https://github.com/0xMassi/webclaw/blob/main/examples/README.md) — Workflow guides and CLI snippets.
- [examples/html-to-markdown-rag/README.md](https://github.com/0xMassi/webclaw/blob/main/examples/html-to-markdown-rag/README.md) — RAG-oriented extraction including the hosted API usage that works around issue #3.
- [examples/cloudflare-diagnostics/](https://github.com/0xMassi/webclaw/tree/main/examples/cloudflare-diagnostics) — Checklist for blocked/empty protected-site results.

---


## Pitfall Log

Project: 0xMassi/webclaw

Summary: Found 11 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/73

## 2. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/15

## 3. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.host_targets | https://github.com/0xMassi/webclaw

## 4. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/62

## 5. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/0xMassi/webclaw

## 6. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw

## 7. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/0xMassi/webclaw

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/0xMassi/webclaw

## 9. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/71

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw

## 11. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw
