# webclaw - Doramagic AI Context Pack

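> Positioning: a pre-installation preview and decision asset. It helps the host AI get off to a good start, but it does not mean the target project has been installed, executed, or verified.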

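## Sufficiency Principle

- **Sufficiency, not compression**: The AI Context Pack should be thorough enough for the host AI to understand the project's value, capability boundaries, entry points, risks, and evidence sources before starting work. It may be organized in layers, but the shortest possible summary is not the goal.
- **Compression strategy**: Compress only noise and repetition; never compress context that affects judgment or the quality of the start.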

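## How the Host AI Should Use This

You are reading the AI Context Pack that Doramagic compiled for webclaw. Treat it as pre-work context: help the user understand who the project suits, what it can do, how to start, what must be verified after installation, and where the risks are. Do not claim that you have installed, run, or executed the target project.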

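## Claim Consumption Rules

- **Source of facts**: Repo Evidence + Claim/Evidence Graph; the Human Wiki only supplies salience, terminology, and narrative structure.
- **Minimum status for a fact**: `supported`
- `supported`: may be used as a project fact, but the answer must cite the claim_id and the evidence path.
- `weak`: usable only as a low-confidence lead; the user must be asked to verify it further.
- `inferred`: usable only for risk notes or open questions; it must not be presented as a project fact.
- `unverified`: must not be used as a fact; state clearly that the evidence is insufficient.
- `contradicted`: the conflicting sources must be shown; do not pick one version on the user's behalf.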
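
The status rules above can be sketched as a small rendering function. This is a minimal illustration, not part of the Context Pack: the `Claim` dataclass, the `HANDLING` labels, and `render_claim` are hypothetical names introduced here, and the output wording is only one possible phrasing.

```python
from dataclasses import dataclass

# Hypothetical handling labels, one per claim status listed above.
HANDLING = {
    "supported": "fact",
    "weak": "lead",
    "inferred": "risk",
    "unverified": "insufficient",
    "contradicted": "conflict",
}


@dataclass(frozen=True)
class Claim:
    claim_id: str
    status: str
    text: str
    evidence: tuple = ()


def render_claim(claim: Claim) -> str:
    """Turn one claim into a sentence that respects its status."""
    refs = ", ".join(claim.evidence) if claim.evidence else "no evidence path"
    # An unknown status is treated like `unverified` (assumption of this sketch).
    handling = HANDLING.get(claim.status, "insufficient")
    if handling == "fact":
        # Facts must carry both the claim_id and the evidence path.
        return f"{claim.text} [{claim.claim_id}; {refs}]"
    if handling == "lead":
        return f"Low-confidence lead, please verify: {claim.text} [{claim.claim_id}]"
    if handling == "risk":
        return f"Risk or open question, not a project fact: {claim.text}"
    if handling == "conflict":
        return f"Conflicting sources, no version chosen: {claim.text} [{refs}]"
    return f"Insufficient evidence: {claim.text}"
```

For example, a `supported` claim backed by `README.md` is rendered with its claim_id and path attached, while an `inferred` claim is always prefixed as a risk or open question.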

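## Who It Suits Best

- **Developers already using a host AI such as Claude, Codex, Cursor, or Gemini**: the README or plugin configuration mentions multiple host AIs. Evidence: `README.md` Claim: `clm_0003` supported 0.86
- **Users who want to bring a specialized workflow into their host AI**: the repository contains Skill documentation. Evidence: `skill/SKILL.md` Claim: `clm_0004` supported 0.86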

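## What It Can Do

- **AI Skill / Agent instruction asset library** (previewable before installation): the project contains Skill or Agent instruction files that a host AI can read, which can bring a specialized workflow into hosts such as Claude, Codex, and Cursor. Evidence: `skill/SKILL.md`, `SKILL.md` Claim: `clm_0001` supported 0.86
- **Command-line startup or installation flow** (requires post-install verification): the project documentation contains executable commands; real use requires running them locally or in the host environment. Evidence: `README.md` Claim: `clm_0002` supported 0.86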

## How to Start

- `npx create-webclaw` Evidence: `README.md` Claim: `clm_0005` supported 0.86
- `npm install @webclaw/sdk` Evidence: `README.md` Claim: `clm_0006` supported 0.86
- `pip install webclaw` Evidence: `README.md` Claim: `clm_0007` supported 0.86
- `curl -X POST https://api.webclaw.io/v1/scrape \` Evidence: `README.md` Claim: `clm_0008` supported 0.86

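## Decision Card Before Continuing

- **Current recommendation**: administrator/security approval required
- **Why**: continuing may involve keys, accounts, external services, or sensitive context, so get administrator or security approval first.

### 30-Second Decision

- **What to do now**: administrator/security approval required
- **Minimum safe next step**: run Prompt Preview first; if credentials or an enterprise environment are involved, get approval before a trial install
- **Do not trust yet**: research conclusions, citations, and experimental results cannot be trusted before installation.
- **Continuing will touch**: research judgment, command execution, host AI configuration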

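### What You Can Trust Now

- **Audience lead: developers already using a host AI such as Claude, Codex, Cursor, or Gemini** (supported): backed by a supported claim or project evidence, but this still does not equal real installed behavior. Evidence: `README.md` Claim: `clm_0003` supported 0.86
- **Audience lead: users who want to bring a specialized workflow into their host AI** (supported): backed by a supported claim or project evidence, but this still does not equal real installed behavior. Evidence: `skill/SKILL.md` Claim: `clm_0004` supported 0.86
- **Capability exists: AI Skill / Agent instruction asset library** (supported): you can trust that the project contains leads for this capability; whether it fits your specific task still requires a trial or post-install verification. Evidence: `skill/SKILL.md`, `SKILL.md` Claim: `clm_0001` supported 0.86
- **Capability exists: command-line startup or installation flow** (supported): you can trust that the project contains leads for this capability; whether it fits your specific task still requires a trial or post-install verification. Evidence: `README.md` Claim: `clm_0002` supported 0.86
- **Quick Start / installation command leads exist** (supported): you can trust that the project documentation contains a startup or installation entry point; do not run it directly in your main environment on that basis. Evidence: `README.md` Claim: `clm_0005` supported 0.86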

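### What You Cannot Trust Yet

- **Research conclusions, citations, and experimental results cannot be trusted before installation.** (unverified): a research Skill can organize questions and paths, but it cannot replace real source retrieval, paper verification, and experiment reproduction.
- **Fit for your specific research field cannot be taken on trust.** (unverified): a Skill covering many research topics does not mean it is adequate for your field, source requirements, and credibility standards.
- **Real output quality cannot be trusted before installation.** (unverified): Prompt Preview can only show how the guidance works; it cannot prove result quality in a real project.
- **Host AI version compatibility cannot be trusted before installation.** (unverified): loading rules and version differences across hosts such as Claude, Cursor, Codex, and Gemini must be verified in a real environment.
- **Do not assume it leaves existing host AI behavior untouched.** (inferred): Skill, plugin, and AGENTS/CLAUDE/GEMINI instructions may change the host AI's default behavior. Evidence: `CLAUDE.md`, `SKILL.md`, `skill/SKILL.md`
- **Safe rollback cannot be assumed by default.** (unverified): unless the project explicitly provides uninstall and restore instructions, verify in an isolated environment first.
- **Will a real installation be compatible with the user's current host AI version?** (unverified): compatibility can only be verified in the actual host environment.
- **Does the project's output quality meet the user's specific task?** (unverified): a pre-installation preview can only show the flow and boundaries; it cannot replace a real evaluation.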

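### What Continuing Will Touch

- **Research judgment**: problem decomposition, source paths, experiment paths, conclusion structure, and credibility assessment. Reason: a research-oriented Skill can make output look more professional, but it cannot replace real evidence verification.
- **Command execution**: package managers, network downloads, local plugin directories, project configuration, or the user's home directory. Reason: running the very first command can already change the environment; decide first whether it is worth running. Evidence: `README.md`
- **Host AI configuration**: plugin, Skill, or rule-loading configuration for hosts such as Claude, Codex, Cursor, Gemini, and OpenCode. Reason: host configuration changes how the AI works afterwards and may conflict with the user's existing rules. Evidence: `CLAUDE.md`, `SKILL.md`, `skill/SKILL.md`
- **Local environment or project files**: installation output, plugin caches, project configuration, or local dependency directories. Reason: the write scope and rollback method cannot be proven before installation, so isolated verification is needed. Evidence: `README.md`
- **Environment variables / API keys**: the project's entry documentation explicitly mentions API key, token, secret, or account credential configuration. Reason: if a real installation needs credentials, use test credentials first and pass a permissions/compliance review. Evidence: `CHANGELOG.md`, `CLAUDE.md`, `README.md`, `SKILL.md`, and others
- **Host AI context**: the AI Context Pack, Prompt Preview, Skill routing, risk rules, and project facts. Reason: imported context affects the host AI's later judgment, so unverified items must not be presented as facts.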

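### Minimum Safe Next Steps

- **Run Prompt Preview first**: verify that it correctly frames the research question and the evidence boundary before trusting any research output. (Applies to: any project, especially when output quality is unknown.)
- **Trial-install only in an isolated directory or with a test account**: keep installation commands from polluting your main host AI, real projects, or home directory. (Applies when: there are leads for command execution, plugin configuration, or local writes.)
- **Back up the host AI configuration first**: Skill, plugin, and rule files may change the default behavior of Claude, Cursor, or Codex. (Applies when: a plugin manifest, Skill, or host rule entry point exists.)
- **Do not use real production credentials**: once an environment variable or API key enters the host or toolchain, it can create account and compliance risk. (Applies when: environment leads such as API, TOKEN, KEY, or SECRET appear.)
- **After installation, verify only one minimal task**: verify loading, compatibility, output quality, and rollback before deciding whether to use it more deeply. (Applies when: moving from a trial into a real workflow.)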

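### Exit Paths

- **Preserve the pre-installation state**: record the original host configuration and project state so that you can later judge whether recovery is possible.
- **Be ready to remove the host plugin / Skill / rule entry point**: if behavior is abnormal after a trial install, the host AI can be restored to its pre-trial state.
- **Keep a verification checklist for sources and conclusions**: if a citation or experiment path later proves unreliable, you can return to the evidence-boundary stage and re-verify.
- **Record the installation commands and the paths written**: without explicit uninstall instructions, you at least need to know which directories or configuration files require manual cleanup.
- **Be ready to revoke the test API key or token**: if a test credential leaks or is misused, you can cut losses quickly.
- **Without a rollback path, do not enter the main environment**: the absence of rollback is a blocker before continuing and should not be bridged by trust or luck.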
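
One way to act on "record the paths written" is to snapshot an isolated trial directory before and after the install and compare the two. The sketch below is an illustration under assumptions: `snapshot` and `diff_snapshots` are hypothetical helpers, the directory you pass in is your own sandbox, and nothing here reflects how webclaw's installers actually behave.

```python
import os


def snapshot(root):
    """Map each file under root to (size, mtime_ns), keyed by relative path."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat: do not follow (possibly broken) symlinks
            state[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return state


def diff_snapshots(before, after):
    """Report which paths were added, removed, or changed between snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(p for p in shared if before[p] != after[p]),
    }
```

A typical use would be to call `snapshot()` on the sandbox directory, run the trial install by hand, call `snapshot()` again, and keep the `diff_snapshots()` result as the manual cleanup list. It only sees the directory it is given, so writes elsewhere (for example a home-directory config) need their own snapshot.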

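## What Can Only Be Previewed

- Explaining who the project suits and what it can do
- Demonstrating a typical conversation flow based on the project documentation
- Helping the user decide whether it is worth installing or researching further

## What Must Be Verified After Installation

- Actually installing the Skill, plugin, or CLI
- Executing scripts, modifying local files, or accessing external services
- Verifying real output quality, performance, and compatibility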

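## Boundary and Risk Card

- **Mistaking a pre-installation preview for a real run**: the user may overestimate how much configuration, permission, and compatibility verification has already been done. Mitigation: clearly separate prompt_preview_can_do from runtime_required. Claim: `clm_0009` inferred 0.45
- **Command execution modifies the local environment**: installation commands may write to the user's home directory, the host plugin directory, or project configuration. Mitigation: run them first in an isolated environment or with a test account. Evidence: `README.md` Claim: `clm_0010` supported 0.86
- **To be confirmed**: will a real installation be compatible with the user's current host AI version? Reason: compatibility can only be verified in the actual host environment.
- **To be confirmed**: does the project's output quality meet the user's specific task? Reason: a pre-installation preview can only show the flow and boundaries; it cannot replace a real evaluation.
- **To be confirmed**: do the installation commands need network access, elevated permissions, or global writes? Reason: this affects installation risk in both enterprise and personal environments.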

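## Pre-Work Context

### Load Order

- First read how_to_use.host_ai_instruction to establish the boundaries of a pre-installation decision asset.
- Read claim_graph_summary to confirm that facts come from the Claim/Evidence Graph rather than the Human Wiki narrative.
- Then read intended_users, capabilities, and quick_start_candidates to judge whether the user is a match.
- When a specific task must be carried out, check role_skill_index first, then evidence_index.
- For questions about real installation, file modification, network access, performance, or compatibility, switch to risk_card and boundaries.runtime_required.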
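
The load order above can be sketched as a small planner. The section names come from the list; everything else is an assumption made for illustration: that the pack is available as a nested dictionary, that dotted names map to nested keys, and the hypothetical helpers `lookup` and `load_plan`.

```python
# Sections read on every load, in the order given above.
BASE_ORDER = [
    "how_to_use.host_ai_instruction",
    "claim_graph_summary",
    "intended_users",
    "capabilities",
    "quick_start_candidates",
]
# Consulted only when a concrete task must be carried out.
TASK_ORDER = ["role_skill_index", "evidence_index"]
# Consulted only for install, file, network, performance, or compatibility questions.
RUNTIME_ORDER = ["risk_card", "boundaries.runtime_required"]


def lookup(pack, dotted):
    """Resolve a dotted name in a nested dict; return None when it is missing."""
    node = pack
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def load_plan(pack, needs_task=False, touches_runtime=False):
    """Return (section, content) pairs in the order they should be read."""
    order = list(BASE_ORDER)
    if needs_task:
        order += TASK_ORDER
    if touches_runtime:
        order += RUNTIME_ORDER
    return [(name, lookup(pack, name)) for name in order]
```

A section whose content comes back as `None` is a gap to report as missing evidence, not something to fill in from memory.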

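### Task Routing

- **AI Skill / Agent instruction asset library**: first use role_skill_index / evidence_index to help the user pick a usable role, Skill, or workflow. Boundary: a pre-installation Prompt experience is possible. Evidence: `skill/SKILL.md`, `SKILL.md` Claim: `clm_0001` supported 0.86
- **Command-line startup or installation flow**: first explain that this capability requires post-install verification, then give a pre-installation checklist. Boundary: must be verified by actually installing or running it. Evidence: `README.md` Claim: `clm_0002` supported 0.86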

### Context Size

- Total files: 158
- Important files covered: 40/158
- Evidence index entries: 80
- Role / Skill entries: 1

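### Handling Insufficient Evidence

- **missing_evidence**: state that the evidence is insufficient and ask the user for the target file, README section, or a post-install verification record; do not fill in facts.
- **out_of_scope_request**: state that the task is beyond the evidence scope of the current AI Context Pack, and suggest that the user consult the Human Manual or verify after a real installation.
- **runtime_request**: give a pre-installation checklist and the source of each command, but do not execute commands for the user or claim to have executed them.
- **source_conflict**: show the conflicting sources side by side, mark them as pending verification, and do not pick one version.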
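
The four fallback cases above can be sketched as a single dispatcher. The case names are taken from the list; the function name `fallback_response` and the exact reply wording are hypothetical and only illustrate the intended behavior.

```python
def fallback_response(kind, sources=()):
    """Return a reply template for one of the four fallback cases above."""
    if kind == "missing_evidence":
        return ("Evidence is insufficient. Please share the target file, "
                "README section, or a post-install verification record.")
    if kind == "out_of_scope_request":
        return ("This is beyond the evidence scope of the current AI Context Pack. "
                "Please consult the Human Manual or verify after a real installation.")
    if kind == "runtime_request":
        return ("Here is a pre-installation checklist and the source of each command. "
                "No command has been executed.")
    if kind == "source_conflict":
        # Show every conflicting source; never pick a winner.
        listed = "; ".join(sources) if sources else "sources not listed"
        return f"Conflicting sources, pending verification: {listed}"
    raise ValueError(f"unknown fallback kind: {kind!r}")
```

Raising on an unknown kind is a deliberate choice in this sketch: a silent default could let an unclassified request be answered as if evidence existed.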

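## Prompt Recipes

### Fit Assessment

- Goal: decide whether this project fits the user's current task.
- Expected output: a fit conclusion, key reasons, evidence citations, what can be previewed before installation, what must be verified after installation, and a suggested next step.

```text
Based on the AI Context Pack for webclaw, first ask me 3 necessary questions, then judge whether it fits my task. The answer must cover: who it suits, what it can do, what it cannot do, whether it is worth installing, and where the evidence comes from. Every project fact must cite evidence_refs, source_paths, or a claim_id.
```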

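### Pre-Installation Experience

- Goal: let the user get a feel for the core workflow before installing, without presenting the preview as a real capability or a marketing promise.
- Expected output: an experience script with boundary labels, a post-install verification checklist, and a cautious recommendation; no promises of real execution and no strong marketing language.

```text
Treat webclaw as a pre-installation experience asset, not as an installed tool or a real runtime environment.

Output exactly four parts:
1. First ask me 3 necessary questions.
2. Give an "experience script": use the three labels [Previewable before install], [Requires post-install verification], and [Insufficient evidence] to show how it might guide a workflow.
3. Give a post-install verification checklist: list which capabilities can only be confirmed after a real installation, real host loading, and a real project run.
4. Give a cautious recommendation: you may only say "worth further research / a trial install", "add more information before deciding", or "not recommended to continue"; do not endorse the project.

Hard boundaries:
- Do not claim to have installed, run, executed tests, modified files, or produced real results.
- Do not use commitment phrases such as "automatically adapts", "guaranteed to pass", "perfect fit", or "strongly recommend installing".
- When describing how it works after installation, use a conditional such as "If installation succeeds and the host loads the Skill correctly, it may ...".
- The experience script may only be written as "sample lines / a hypothetical flow": use "may ask / may suggest / may show", and do not write "has been written, has been generated, has passed, is running, is generating".
- Prompt Preview does not provide installation commands; if the user is ready for a trial install, only advise reading the Quick Start and Risk Card first and verifying in an isolated environment.
- Every project fact must come from a supported claim, evidence_refs, or source_paths; inferred/unverified items may only appear as risks or open questions.
```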
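
The hard boundaries in the recipe above lend themselves to a simple automated check on a drafted experience script. The sketch below is only an illustration: the phrase lists are English renderings chosen here (an assumption, since a real check would need the phrases in whatever language the host AI writes in), and `lint_script` is a hypothetical helper. Plain substring matching is crude and will miss paraphrases.

```python
# Commitment phrases the recipe forbids (English renderings, assumed here).
COMMITMENT_PHRASES = [
    "automatically adapts",
    "guaranteed to pass",
    "perfect fit",
    "strongly recommend installing",
]
# Wording that implies something already happened or is happening.
COMPLETED_ACTION_PHRASES = [
    "has been written",
    "has been generated",
    "has passed",
    "is running",
    "is generating",
]


def lint_script(text):
    """Return the forbidden phrases found in a draft, in list order."""
    lowered = text.lower()
    return [
        phrase
        for phrase in COMMITMENT_PHRASES + COMPLETED_ACTION_PHRASES
        if phrase in lowered
    ]
```

An empty result does not prove the draft is compliant; it only means none of the listed phrases appear verbatim.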

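### Role / Skill Selection

- Goal: pick the best-matching asset from the project's roles or Skills.
- Expected output: a list of candidate roles or Skills, each with its applicable scenario, evidence path, risk boundary, and whether post-install verification is needed.

```text
Read role_skill_index and recommend the 3-5 roles or Skills most relevant to my target task. For each recommendation, state the applicable scenario, likely output, risk boundary, and evidence_refs.
```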

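### Risk Pre-Check

- Goal: identify environment, permission, rule-conflict, and quality risks before installing or adopting the project.
- Expected output: a checklist covering environment, permissions, dependencies, licensing, host conflicts, quality risks, and unknowns.

```text
Based on risk_card, boundaries, and quick_start_candidates, give me a pre-installation risk checklist. Do not execute commands for me; only explain what I should check, why, and what the impact of a failure would be.
```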

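### Host AI Kickoff Instruction

- Goal: turn the project context into an instruction for the host AI at the start of a conversation.
- Expected output: a kickoff instruction with clear boundaries and clear evidence citations, ready to paste into the host AI.

```text
Based on the AI Context Pack for webclaw, generate a kickoff instruction that I can paste into my host AI. The instruction must respect not_runtime=true and must not claim that the project has been installed, run, or has produced real results.
```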

## Role / Skill Index

- 1 role / Skill / project documentation entry indexed in total.

- **webclaw** (skill): Web extraction engine with antibot bypass. Scrape, crawl, extract, summarize, search, map, diff, monitor, research, and analyze any URL, including Cloudflare-protected sites. Use when you need reliable web content, the built-in web fetch fails, or you need structured data extraction from web pages. Activation hint: when the user's task closely matches the workflow described by "webclaw", use it for a pre-installation experience first, then decide whether to install. Evidence: `skill/SKILL.md`

## Evidence Index

- 80 evidence entries indexed in total.

- **Webclaw** (documentation): Rust workspace: CLI + MCP server for web content extraction into LLM-optimized formats. Evidence: `CLAUDE.md`
- **Example Domain** (documentation): Turn websites into clean markdown, JSON, and LLM-ready context. CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines. Evidence: `README.md`
- **Benchmarks** (documentation): Reproducible benchmarks comparing webclaw against open-source and commercial web extraction tools. Every number here ships with the script that produced it. Run ./run.sh to regenerate. Evidence: `benchmarks/README.md`
- **Cloudflare Diagnostics** (documentation): Use this checklist when a page works in the browser but fails from a scraper, returns a challenge page, or produces empty extracted content. Evidence: `examples/cloudflare-diagnostics/README.md`
- **Firecrawl-Compatible API** (documentation): webclaw exposes Firecrawl-compatible v2 routes for teams migrating existing scrape, crawl, map, or search calls. Evidence: `examples/firecrawl-compatible-api/README.md`
- **HTML to Markdown for RAG** (documentation): Turn web pages into clean markdown or compact LLM text before chunking, embedding, or passing the page to an agent. Evidence: `examples/html-to-markdown-rag/README.md`
- **MCP Web Scraping** (documentation): Use webclaw as a local MCP server so Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, or another MCP client can fetch clean web context. Evidence: `examples/mcp-web-scraping/README.md`
- **Quick Start** (documentation): One command to give your AI agent reliable web access. No headless browser. No Puppeteer. No 403s. Evidence: `packages/create-webclaw/README.md`
- **Contributing to Webclaw** (documentation): Thanks for your interest in contributing. This document covers the essentials. Evidence: `CONTRIBUTING.md`
- **webclaw** (skill_instruction): High-quality web extraction with automatic antibot bypass. Beats Firecrawl on extraction quality and handles Cloudflare, DataDome, and JS-rendered pages automatically. Evidence: `SKILL.md`
- **Package**（package_manifest）：{ "name": "create-webclaw", "version": "0.1.5", "mcpName": "io.github.0xMassi/webclaw", "description": "Set up webclaw MCP server for AI agents Claude, Cursor, Windsurf, OpenCode, Codex, Antigravity ", "bin": { "create-webclaw": "./index.mjs" }, "type": "module", "keywords": "webclaw", "mcp", "mcp-server", "ai", "ai-agent", "scraping", "web-scraping", "scraper", "crawler", "extract", "markdown", "llm", "claude", "cursor", "windsurf", "opencode", "codex", "antigravity", "tls-fingerprint", "cloudflare-bypass" , "author": "webclaw", "license": "AGPL-3.0", "repository": { "type": "git", "url": "https://github.com/0xMassi/webclaw" }, "homepage": "https://webclaw.io", "engines": { "node": " =18"… 证据：`packages/create-webclaw/package.json`
- **webclaw** (skill_instruction): High-quality web extraction with automatic antibot bypass. Beats Firecrawl on extraction quality and handles Cloudflare, DataDome, and JS-rendered pages automatically. Evidence: `skill/SKILL.md`
- **License** (source_file): GNU AFFERO GENERAL PUBLIC LICENSE Version 3, 19 November 2007 Evidence: `LICENSE`
- **Changelog** (documentation): All notable changes to webclaw are documented here. Format follows Keep a Changelog (https://keepachangelog.com/). Evidence: `CHANGELOG.md`
- **Contributor Covenant Code of Conduct** (documentation): Contributor Covenant Code of Conduct Evidence: `CODE_OF_CONDUCT.md`
- **Methodology**（documentation）：1. Token efficiency — tokens of the extractor's output vs tokens of the raw fetched HTML. Lower tokens = cheaper to feed into an LLM. But lower tokens only matters if the content is preserved , so tokens are always reported alongside fidelity. 2. Fidelity — how many hand-curated "visible facts" the extractor preserved. Per site we list 5 strings that any reader would say are meaningfully on the page customer names, headline stats, product names, release information . Matched case-insensitively with word boundaries where the fact is a single alphanumeric token API does not match apiece . 3. Latency — wall-clock time from URL submission to markdown output. Includes fetch + extraction. Network… 证据：`benchmarks/methodology.md`
- **.Mcp**（structured_config）：{ "mcpServers": { "webclaw": { "command": "~/.webclaw/webclaw-mcp" } } } 证据：`.mcp.json`
- **Facts**（structured_config）：{ " comment": "Hand-curated 'visible facts' per site. Inspected from live pages on 2026-04-17. PRs welcome to add sites or adjust facts — keep facts specific customer names, headline stats, product names , not generic words.", "facts": { "https://openai.com": "ChatGPT", "Sora", "API", "Enterprise", "research" , "https://vercel.com": "Next.js", "Hobby", "Pro", "Enterprise", "deploy" , "https://anthropic.com": "Opus", "Claude", "Glasswing", "Perseverance", "NASA" , "https://www.notion.com": "agents", "Forbes", "Figma", "Ramp", "Cursor" , "https://stripe.com": "Hertz", "URBN", "Instacart", "99.999", "1.9" , "https://tavily.com": "search", "extract", "crawl", "research", "developers" , "https:/… 证据：`benchmarks/facts.json`
- **Glama**（structured_config）：{ "$schema": "https://glama.ai/mcp/schemas/server.json", "maintainers": "0xMassi" } 证据：`glama.json`
- **2026 04 17**（structured_config）：{ "timestamp": "2026-04-17 14:28:42", "webclaw version": "0.3.18", "trafilatura version": "2.0.0", "tokenizer": "cl100k base", "runs per site": 3, "site count": 18, "total facts": 90, "aggregates": { "webclaw": { "reduction mean": 92.5, "reduction median": 97.8, "facts preserved": 76, "total facts": 90, "fidelity pct": 84.4, "latency mean": 0.41 }, "trafilatura": { "reduction mean": 97.8, "reduction median": 99.7, "facts preserved": 45, "total facts": 90, "fidelity pct": 50.0, "latency mean": 0.2 }, "firecrawl": { "reduction mean": 92.4, "reduction median": 96.2, "facts preserved": 70, "total facts": 90, "fidelity pct": 77.8, "latency mean": 0.99 } }, "per site": { "url": "https://openai.co… 证据：`benchmarks/results/2026-04-17.json`
- **Server**（structured_config）：{ "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json", "name": "io.github.0xMassi/webclaw", "title": "webclaw", "description": "Web extraction MCP server. Scrape, crawl, extract, summarize any URL to clean markdown.", "version": "0.1.4", "packages": { "registryType": "npm", "identifier": "create-webclaw", "version": "0.1.4", "transport": { "type": "stdio" } } } 证据：`packages/create-webclaw/server.json`
- **No special build flags needed.**（source_file）：No special build flags needed. wreq handles TLS via BoringSSL internally. 证据：`.cargo/config.toml`
- **Build artifacts**（source_file）：IDE / OS .DS Store .vscode/ .idea/ .swp .swo 证据：`.dockerignore`
- **Scratch / local artifacts previously covered by overbroad .json ,**（source_file）：target/ .DS Store .env .env. proxies.txt .claude/skills/ Scratch / local artifacts previously covered by overbroad .json , which would have also swallowed package.json, components.json, .smithery/ .json if they were ever modified . .local.json local-test-results.json CLI research command dumps JSON output keyed on the query; they're not code and shouldn't live in git. Track deliberately-saved research output under a different name. research- .json 证据：`.gitignore`
- **Cargo**（source_file）：workspace resolver = "2" members = "crates/ " 证据：`Cargo.toml`
- **webclaw — Multi-stage Docker build**（source_file）：webclaw — Multi-stage Docker build Produces 3 binaries: webclaw — CLI single-shot extraction, crawl, MCP-less use webclaw-mcp — MCP server stdio, for AI agents webclaw-server — minimal REST API for self-hosting OSS, stateless NOTE: this is NOT the hosted API at api.webclaw.io — the cloud service adds anti-bot bypass, JS rendering, multi-tenant auth and async jobs that are intentionally not open-source. See docs/self-hosting. 证据：`Dockerfile`
- **Slim runtime image — uses pre-built binaries from the release.**（source_file）：Slim runtime image — uses pre-built binaries from the release. The full Dockerfile multi-stage Rust build is for local development. CI uses this to avoid 60+ min QEMU cross-compilation. 证据：`Dockerfile.ci`
- **Build webclaw if not present**（source_file）：set -euo pipefail cd "$ dirname "$0" " Build webclaw if not present if ! -x "../target/release/webclaw" ; then echo "→ building webclaw..." cd .. && cargo build --release fi missing="" python3 -c "import tiktoken" 2 /dev/null missing+=" tiktoken" python3 -c "import trafilatura" 2 /dev/null missing+=" trafilatura" if -n "${FIRECRAWL API KEY:-}" ; then python3 -c "import firecrawl" 2 /dev/null missing+=" firecrawl-py" fi if -n "$missing" ; then echo "→ installing python deps:$missing" python3 -m pip install --quiet $missing fi python3 scripts/bench.py 证据：`benchmarks/run.sh`
- **One URL per line. Comments and blank lines ignored.**（source_file）：One URL per line. Comments and blank lines ignored. Sites chosen to span: SPA marketing, enterprise SaaS, documentation, long-form content, news, and aggregator pages. 证据：`benchmarks/sites.txt`
- **--- Nginx reverse proxy + SSL ---**（source_file）：set -euo pipefail HETZNER API="https://api.hetzner.cloud/v1" SERVER NAME="webclaw" REPO URL="https://github.com/0xMassi/webclaw.git" RED='\033 0;31m' GREEN='\033 0;32m' YELLOW='\033 1;33m' BLUE='\033 0;34m' CYAN='\033 0;36m' BOLD='\033 1m' DIM='\033 2m' RESET='\033 0m' info { printf "${BLUE} ${RESET} %s\n" "$ "; } success { printf "${GREEN} + ${RESET} %s\n" "$ "; } warn { printf "${YELLOW} ! ${RESET} %s\n" "$ "; } error { printf "${RED} x ${RESET} %s\n" "$ " &2; } fatal { error "$ "; exit 1; } mask secret { local s="$1" if -z "$s" ; then printf ' not set ' elif ${ printf ' ' else printf ' %s' "${s: -4}" fi } prompt { local var name="$1" prompt text="$2" default="${3:-}" if -n "$default" ; t… 证据：`deploy/hetzner.sh`
- **Docker Compose**（source_file）：services: webclaw: build: . ports: - "${WEBCLAW PORT:-3000}:3000" env file: - .env environment: - OLLAMA HOST=http://ollama:11434 depends on: - ollama restart: unless-stopped healthcheck: test: "CMD", "webclaw", "--help" interval: 30s timeout: 5s retries: 3 ollama: image: ollama/ollama:latest volumes: - ollama data:/root/.ollama restart: unless-stopped volumes: ollama data: 证据：`docker-compose.yml`
- **Docker Entrypoint**（source_file）：set -e if "$ " -gt 0 && { "${1 -}" != "$1" \ "${1 http://}" != "$1" \ "${1 https://}" != "$1" ; }; then set -- webclaw "$@" fi exec "$@" 证据：`docker-entrypoint.sh`
- **============================================**（source_file）：============================================ Webclaw Configuration Copy to .env and fill in your values ============================================ 证据：`env.example`
- **Webclaw Proxy List**（source_file）：Webclaw Proxy List Copy this file to proxies.txt and add your proxies. webclaw auto-loads proxies.txt when it exists — no config needed. Format: host:port:user:pass one per line Lines starting with are ignored. Example: 123.45.67.89:8080:username:password proxy2.example.com:3128:user:pass123 证据：`proxies.example.txt`
- **Rustfmt**（source_file）：style edition = "2024" 证据：`rustfmt.toml`
- **Setup**（source_file）：set -euo pipefail RED='\033 0;31m' GREEN='\033 0;32m' YELLOW='\033 1;33m' BLUE='\033 0;34m' CYAN='\033 0;36m' BOLD='\033 1m' DIM='\033 2m' RESET='\033 0m' info { printf "${BLUE} ${RESET} %s\n" "$ "; } success { printf "${GREEN} + ${RESET} %s\n" "$ "; } warn { printf "${YELLOW} ! ${RESET} %s\n" "$ "; } error { printf "${RED} x ${RESET} %s\n" "$ " &2; } prompt { local var name="$1" prompt text="$2" default="${3:-}" if -n "$default" ; then printf "${CYAN} %s${DIM} %s ${RESET}: " "$prompt text" "$default" else printf "${CYAN} %s${RESET}: " "$prompt text" fi read -r input printf -v "$var name" '%s' "${input:-$default}" } prompt secret { local var name="$1" prompt text="$2" default="${3:-}" if -n… 证据：`setup.sh`
- **Smithery**（source_file）：startCommand: type: stdio configSchema: type: object properties: apiKey: type: string description: webclaw API key from webclaw.io. Optional — the server works locally without one. Set this for automatic fallback to the webclaw cloud API when a site has bot protection or requires JS rendering. secret: true commandFunction: config = { command: 'webclaw-mcp', args: , env: config.apiKey ? { WEBCLAW API KEY: config.apiKey } : {} } exampleConfig: apiKey: wc your api key here 证据：`smithery.yaml`
- **Targets 1000**（source_file）：Nike PDP https://www.nike.com/t/air-force-1-07-mens-shoes-jBrhbr/CW2288-111 nike,air force,cart Nike Women https://www.nike.com/w/womens-running-shoes-37v7jz5e1x6 nike,women,running StockX PDP https://stockx.com/nike-dunk-low-retro-white-black-2021 stockx,dunk,bid Amazon US https://www.amazon.com/dp/B0CX23V2ZK amazon,price,cart Amazon IT https://www.amazon.it/-/en/Plasters-Superior-Quality-Snoring-Congestion/dp/B0CPSXML6Z amazon,price,cart Amazon DE https://www.amazon.de/-/en/dp/B09V3KXJPB amazon,price,kaufen Target PDP https://www.target.com/p/stanley-quencher-h2-0-flowstate-tumbler-40oz/-/A-87790798 target,stanley,price Target Electronics https://www.target.com/c/tvs-home-theater-electron… 证据：`targets_1000.txt`
- **aggregates**（source_file）：HERE = Path file .resolve .parent ROOT = HERE.parent REPO ROOT = ROOT.parent WEBCLAW = os.environ.get "WEBCLAW", str REPO ROOT / "target" / "release" / "webclaw" RUNS = int os.environ.get "RUNS", "3" WC TIMEOUT = int os.environ.get "WEBCLAW TIMEOUT", "30" ⋮---- ENC = tiktoken.get encoding "cl100k base" FC KEY = os.environ.get "FIRECRAWL API KEY" FC = None ⋮---- FC = Firecrawl api key=FC KEY ⋮---- def load sites - list str ⋮---- path = ROOT / "sites.txt" out = ⋮---- s = line.split " ", 1 0 .strip ⋮---- def load facts - dict str, list str def run webclaw llm url: str - tuple str, float ⋮---- t0 = time.time r = subprocess.run ⋮---- def run webclaw raw url: str - str def run trafilatura url: st… 证据：`benchmarks/scripts/bench.py`
- **Cargo**（source_file）：package name = "webclaw-cli" description = "CLI for extracting web content into LLM-optimized formats" version.workspace = true edition.workspace = true license.workspace = true 证据：`crates/webclaw-cli/Cargo.toml`
- **Bench**（source_file）：use std::time::Instant; ⋮---- pub struct BenchArgs { ⋮---- struct BenchResult { ⋮---- pub async fn run args: &BenchArgs - Result { ⋮---- let client = FetchClient::new config .map err e format! "build client: {e}" ?; ⋮---- .fetch &args.url ⋮---- .map err e format! "fetch: {e}" ?; ⋮---- extract &fetched.html, Some &fetched.url .map err e format! "extract: {e}" ?; let llm text = to llm text &extraction, Some &fetched.url ; let elapsed = start.elapsed ; let raw tokens = approx tokens &fetched.html ; let llm tokens = approx tokens &llm text ; let raw bytes = fetched.html.len ; let llm bytes = llm text.len ; ⋮---- let facts = match args.facts.as deref { Some path = check facts path, &args.url, &l… 证据：`crates/webclaw-cli/src/bench.rs`
- **Main**（source_file）：mod bench; ⋮---- use std::process; use std::sync::Arc; ⋮---- use tracing subscriber::EnvFilter; ⋮---- use webclaw llm::LlmProvider; use webclaw pdf::PdfMode; ⋮---- enum EmptyReason { ⋮---- fn detect empty result: &ExtractionResult - EmptyReason { if is consent wall result { ⋮---- if result.metadata.word count 50 !result.content.markdown.is empty { ⋮---- let lower = title.to lowercase ; if ANTIBOT TITLES.iter .any t lower.starts with t { ⋮---- if result.metadata.word count == 0 && result.content.links.is empty { ⋮---- fn is consent wall result: &ExtractionResult - bool { ⋮---- let lower = url.to ascii lowercase ; ⋮---- .iter .any fragment lower.contains fragment ⋮---- .any prefix lower.start… 证据：`crates/webclaw-cli/src/main.rs`
- **Reddit regression fixtures are real old.reddit.com pages read at test time;**（source_file）：package name = "webclaw-core" description = "Pure HTML content extraction engine for LLMs" version.workspace = true edition.workspace = true license.workspace = true Reddit regression fixtures are real old.reddit.com pages read at test time; they're large and only needed to run the test suite from the repo, so keep them out of the published crate. exclude = "testdata/reddit/ .html" 证据：`crates/webclaw-core/Cargo.toml`
- **Brand**（source_file）：use std::collections::HashMap; use once cell::sync::Lazy; use regex::Regex; ⋮---- use serde::Serialize; use url::Url; ⋮---- pub struct BrandColor { ⋮---- pub enum ColorUsage { ⋮---- pub struct LogoVariant { ⋮---- pub struct BrandIdentity { ⋮---- static CSS DECL: Lazy = Lazy::new Regex::new r" ?i \w- + \s :\s ^;}{ + " .unwrap ; /// Matches hex colors: RGB or RRGGBB static HEX COLOR: Lazy = Lazy::new Regex::new r" 0-9a-fA-F {3} \b 0-9a-fA-F {6} \b" .unwrap ; ⋮---- Regex::new r" ?i rgb\ \s \d{1,3} \s ,\s \d{1,3} \s ,\s \d{1,3} \s \ " .unwrap ⋮---- /// Matches rgba r, g, b, a static RGBA COLOR: Lazy = Lazy::new { Regex::new r" ?i rgba\ \s \d{1,3} \s ,\s \d{1,3} \s ,\s \d{1,3} \s ,\s \d. +\s \ "… 证据：`crates/webclaw-core/src/brand.rs`
- **Data Island**（source_file）：use once cell::sync::Lazy; ⋮---- use tracing::debug; ⋮---- Lazy::new Selector::parse "script type='application/json' " .unwrap ; ⋮---- struct TextChunk { ⋮---- pub fn try extract doc: &Html, dom word count: usize, existing markdown: &str - Option { ⋮---- let existing lower = existing markdown.to lowercase ; for script in doc.select &SCRIPT JSON SELECTOR { if all chunks.len = MAX CHUNKS { ⋮---- let json text = script.text .collect:: ; if json text.len , depth: usize { ⋮---- if let Some node type = map.get "nodeType" .and then v v.as str && let Some text = extract contentful node map, node type ⋮---- chunks.push text ; ⋮---- if is cms entry map && let Some chunk = extract cms entry map ⋮----… 证据：`crates/webclaw-core/src/data_island.rs`
- **Diff**（source_file）：use std::collections::HashSet; use serde::Serialize; use similar::TextDiff; ⋮---- pub enum ChangeStatus { ⋮---- pub struct MetadataChange { ⋮---- pub struct ContentDiff { ⋮---- pub fn diff old: &ExtractionResult, new result: &ExtractionResult - ContentDiff { let text diff = compute text diff &old.content.markdown, &new result.content.markdown ; let metadata changes = compute metadata changes &old.metadata, &new result.metadata ; ⋮---- compute link changes &old.content.links, &new result.content.links ; ⋮---- let status = if text diff.is none && metadata changes.is empty { ⋮---- fn compute text diff old: &str, new: &str - Option { ⋮---- .unified diff .context radius 3 .header "old", "new" .t… 证据：`crates/webclaw-core/src/diff.rs`
- **Domain**（source_file）：pub enum DomainType { ⋮---- pub fn detect url: Option , html: &str - DomainType { ⋮---- && let Some dt = detect from url url ⋮---- detect from dom html ⋮---- fn detect from url url: &str - Option { let lower = url.to lowercase ; if lower.contains "github.com" lower.contains "gitlab.com" { return Some DomainType::GitHub ; ⋮---- if doc patterns.iter .any p lower.contains p { return Some DomainType::Documentation ; ⋮---- if forum patterns.iter .any p lower.contains p { return Some DomainType::Forum ; ⋮---- if social patterns.iter .any p lower.contains p { return Some DomainType::Social ; ⋮---- if ecommerce patterns.iter .any p lower.contains p { return Some DomainType::ECommerce ; ⋮---- fn det… 证据：`crates/webclaw-core/src/domain.rs`
- **Endpoints**（source_file）：use once cell::sync::Lazy; use regex::Regex; ⋮---- use std::collections::BTreeSet; use url::Url; ⋮---- pub enum EndpointKind { ⋮---- pub struct DiscoveredEndpoint { ⋮---- pub struct EndpointReport { ⋮---- .expect "RE REL PATH" ⋮---- .expect "RE ABS URL" ⋮---- Regex::new r "wss?:// A-Za-z0-9.\- {1,253} ?:/ A-Za-z0-9 \-./% {0,256} ?" .expect "RE WS" ⋮---- static SCRIPT SEL: Lazy = Lazy::new Selector::parse "script" .expect "script sel" ; /// Common multi-label public suffixes so ticketmaster.co.uk resolves to const SUFFIX2: & &str = & ⋮---- fn registrable domain host: &str - String { let host = host.trim end matches '.' .to ascii lowercase ; let labels: Vec = host.split '.' .collect ; if labe… 证据：`crates/webclaw-core/src/endpoints.rs`
- **Error**（source_file）：use thiserror::Error; ⋮---- pub enum ExtractError { 证据：`crates/webclaw-core/src/error.rs`
- **Extractor**（source_file）：use std::collections::HashSet; use ego tree::NodeId; use once cell::sync::Lazy; ⋮---- use url::Url; use crate::markdown; use crate::noise; ⋮---- Lazy::new Selector::parse "article, main, role='main' , div, section, td" .unwrap ; static BODY SELECTOR: Lazy = Lazy::new Selector::parse "body" .unwrap ; static H1 SELECTOR: Lazy = Lazy::new Selector::parse "h1" .unwrap ; static H2 SELECTOR: Lazy = Lazy::new Selector::parse "h2" .unwrap ; static P SELECTOR: Lazy = Lazy::new Selector::parse "p" .unwrap ; static A SELECTOR: Lazy = Lazy::new Selector::parse "a" .unwrap ; ⋮---- Lazy::new Selector::parse " role='region' aria-label " .unwrap ; static FOOTER SELECTOR: Lazy = Lazy::new Selector::parse "f… 证据：`crates/webclaw-core/src/extractor.rs`
- **Js Eval**（source_file）：use once cell::sync::Lazy; use regex::Regex; ⋮---- use tracing::debug; static SCRIPT SELECTOR: Lazy = Lazy::new Selector::parse "script" .unwrap ; static HTML TAG RE: Lazy = Lazy::new Regex::new r" + " .unwrap ; ⋮---- /// Markers that, if absent from the HTML, prove the QuickJS scan cannot find /// any data blob. The scan only ever surfaces globalThis. object/array ⋮---- /// any data blob. The scan only ever surfaces globalThis. object/array /// properties, and the seeded next f only emits when non-empty. Every ⋮---- /// properties, and the seeded next f only emits when non-empty. Every /// realistic way an inline script populates such a global goes through one of ⋮---- /// realistic way an… 证据：`crates/webclaw-core/src/js_eval.rs`
- **Lib**（source_file）：pub mod brand; pub crate mod data island; pub mod diff; pub mod domain; pub mod endpoints; pub mod error; pub mod extractor; ⋮---- pub mod js eval; pub mod llm; pub mod markdown; pub mod metadata; ⋮---- pub crate mod noise; pub mod reddit; pub mod structured data; pub mod types; pub mod youtube; pub use brand::BrandIdentity; ⋮---- pub use domain::DomainType; pub use error::ExtractError; pub use llm::to llm text; ⋮---- use scraper::Html; use url::Url; pub fn extract html: &str, url: Option - Result { extract with options html, url, &ExtractionOptions::default ⋮---- pub fn extract with options ⋮---- let html = html.to string ; let url = url.map u u.to string ; let options = options.clone ; ⋮-… 证据：`crates/webclaw-core/src/lib.rs`
- **Body** (source_file): use once cell::sync::Lazy; use regex::Regex; use super::cleanup; use super::images; use super::links; pub crate struct ProcessedBody { ⋮---- pub crate fn process body markdown: &str - ProcessedBody { ⋮---- let text = dedup repeated phrases &text ; let text = dedup heading paragraph &text ; let text = dedup text against headings &text ; let text = dedup duplicate headings &text ; let text = strip empty headings &text ; ⋮---- let text = dedup content blocks &text ; let text = dedup lines &text ; let text = dedup comma lists &text ; let text = strip trailing empty headings &text ; let text = strip empty code blocks &text ; ⋮---- let text = merge stat lines &text ; ⋮---- static HEADING RE: Lazy… Evidence: `crates/webclaw-core/src/llm/body.rs`
- **Cleanup** (source_file): use once cell::sync::Lazy; use regex::Regex; use crate::noise; pub crate fn decode html entities input: &str - String { if !input.contains '&' { return input.to string ; ⋮---- Lazy::new Regex::new r"& xX 0-9a-fA-F + 0-9 + a-zA-Z + ;" .unwrap ; ⋮---- .replace all input, caps: &regex::Captures { let entity = caps.get 1 .unwrap .as str ; ⋮---- "nbsp" = " ".to string , "amp" = "&".to string , "lt" = " " ".to string , "quot" = "\"".to string , "apos" = "'".to string , "mdash" = "\u{2014}".to string , "ndash" = "\u{2013}".to string , "laquo" = "\u{00AB}".to string , "raquo" = "\u{00BB}".to string , "copy" = "\u{00A9}".to string , "reg" = "\u{00AE}".to string , "trade" = "\u{2122}".to string , "he… Evidence: `crates/webclaw-core/src/llm/cleanup.rs`
- **Images** (source_file): use once cell::sync::Lazy; use regex::Regex; use super::cleanup::is asset label; ⋮---- Lazy::new Regex::new r"\ !\ ^\ \ \ ^ +\ \ \ ^ + \ " .unwrap ; /// Matches empty markdown links url left after image stripping. pub crate static EMPTY LINK RE: Lazy = Lazy::new Regex::new r"\ \s \ \ ^ +\ " .unwrap ; /// Convert linked images to plain links, preserving the alt text and link target. /// Adds a newline after each to prevent text mashing when multiple are adjacent. ⋮---- /// Adds a newline after each to prevent text mashing when multiple are adjacent. pub crate fn convert linked images input: &str - String { ⋮---- pub crate fn convert linked images input: &str - String { ⋮---- .replace all inp… Evidence: `crates/webclaw-core/src/llm/images.rs`
- **Links** (source_file): use std::collections::HashSet; use once cell::sync::Lazy; use regex::Regex; static LINK RE: Lazy = Lazy::new Regex::new r"\ ^\ \ \ ^ + \ " .unwrap ; /// Extract all links from markdown, replacing inline text url with just text . /// Returns the cleaned text and a deduplicated list of label, href pairs. ⋮---- /// Returns the cleaned text and a deduplicated list of label, href pairs. pub crate fn extract and strip links input: &str - String, Vec { ⋮---- pub crate fn extract and strip links input: &str - String, Vec { ⋮---- let replaced = LINK RE.replace all input, caps: &regex::Captures { let text = caps.get 1 .map or "", m m.as str .trim .to string ; let href = caps.get 2 .map or "", m m.as… Evidence: `crates/webclaw-core/src/llm/links.rs`
- **Metadata** (source_file): use crate::types::ExtractionResult; pub crate fn build metadata header ⋮---- let effective url = url.or meta.url.as deref ; ⋮---- out.push str &format! " URL: {u}\n" ; ⋮---- && !t.is empty ⋮---- out.push str &format! " Title: {t}\n" ; ⋮---- && !d.is empty ⋮---- out.push str &format! " Description: {d}\n" ; ⋮---- && !a.is empty ⋮---- out.push str &format! " Author: {a}\n" ; ⋮---- out.push str &format! " Published: {d}\n" ; ⋮---- && !l.is empty ⋮---- out.push str &format! " Language: {l}\n" ; ⋮---- out.push str &format! " Word count: {}\n", meta.word count ; Evidence: `crates/webclaw-core/src/llm/metadata.rs`
- **Mod** (source_file): mod body; mod cleanup; mod images; mod links; mod metadata; use crate::types::ExtractionResult; pub fn to llm text result: &ExtractionResult, url: Option - String { ⋮---- if !processed.text.is empty { if !out.is empty { out.push '\n' ; ⋮---- out.push str &processed.text ; ⋮---- if !processed.links.is empty { out.push str "\n\n Links\n" ; ⋮---- if !label.is empty { out.push str &format! "- {label}: {href}\n" ; ⋮---- .iter .filter v is useful structured data v .cloned .collect ; ⋮---- scrub body fields value, 0 ; ⋮---- if !useful.is empty { let serialized = serde json::to string pretty &useful .unwrap or default ; ⋮---- if serialized.len bool { let Some obj = v.as object else { if v.is array… Evidence: `crates/webclaw-core/src/llm/mod.rs`
- **Markdown** (source_file): use std::collections::HashSet; use ego tree::NodeId; use once cell::sync::Lazy; use scraper::node::Node; ⋮---- use url::Url; use crate::noise; ⋮---- static CODE SELECTOR: Lazy = Lazy::new Selector::parse "code" .unwrap ; static IMG ALT SELECTOR: Lazy = Lazy::new Selector::parse "img alt " .unwrap ; static A HREF SELECTOR: Lazy = Lazy::new Selector::parse "a href " .unwrap ; ⋮---- pub struct ConvertedAssets { ⋮---- pub fn convert ⋮---- let md = node to md element, base url, &mut assets, 0, exclude, 0 ; let plain = strip markdown &md ; let md = collapse whitespace &md ; let plain = collapse whitespace &plain ; ⋮---- /// Recursive descent through the DOM, emitting markdown for each node. fn no… Evidence: `crates/webclaw-core/src/markdown.rs`
- **Metadata** (source_file): use crate::types::Metadata; macro rules! selector { ⋮---- pub fn extract doc: &Html, url: Option - Metadata { let title = og meta doc, "og:title" .or else meta name doc, "twitter:title" .or else title tag doc ; let description = og meta doc, "og:description" .or else meta name doc, "twitter:description" .or else meta name doc, "description" ; let author = meta name doc, "author" .or else og meta doc, "article:author" ; let published date = og meta doc, "article:published time" .or else meta name doc, "date" .or else meta name doc, "publication date" ; ⋮---- .select selector! "html" .next .and then el el.value .attr "lang" .map s s.to string ; let site name = og meta doc, "og:site name" ; le… Evidence: `crates/webclaw-core/src/metadata.rs`
- The remaining 20 evidence items are in `AI_CONTEXT_PACK.json` or `EVIDENCE_INDEX.json`.
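
The `crates/webclaw-core/src/metadata.rs` excerpt above resolves the title from `og:title`, then `twitter:title`, then the `<title>` tag, and the `llm/metadata.rs` excerpt emits a header only for fields that are present and non-empty. The following is a minimal sketch of that fallback-and-emit pattern using only the standard library. The `Page` type and the names `pick_title` and `build_header` are hypothetical, and the plain `Label: value` line format is an assumption, since the excerpt's format strings are truncated.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed page: meta tag values keyed by
/// property/name, plus the text of the <title> element.
struct Page {
    meta: HashMap<String, String>,
    title_tag: Option<String>,
}

impl Page {
    /// Look up a meta value, treating blank strings as missing
    /// (an assumption of this sketch).
    fn meta(&self, key: &str) -> Option<String> {
        self.meta
            .get(key)
            .map(|v| v.trim().to_string())
            .filter(|v| !v.is_empty())
    }
}

/// Title fallback in the order the excerpt shows:
/// og:title, then twitter:title, then the <title> tag.
fn pick_title(page: &Page) -> Option<String> {
    page.meta("og:title")
        .or_else(|| page.meta("twitter:title"))
        .or_else(|| page.title_tag.clone())
}

/// Build a header from whichever fields are present and non-empty.
/// The word count is always emitted, as in the excerpt.
fn build_header(url: Option<&str>, title: Option<&str>, word_count: usize) -> String {
    let mut out = String::new();
    if let Some(u) = url {
        out.push_str(&format!("URL: {u}\n"));
    }
    if let Some(t) = title.filter(|t| !t.is_empty()) {
        out.push_str(&format!("Title: {t}\n"));
    }
    out.push_str(&format!("Word count: {word_count}\n"));
    out
}
```

Chaining `Option::or_else` keeps each fallback lazy, so later lookups run only when the earlier ones yield nothing.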

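## Rules the Host AI Must Follow

- **Treat this asset as pre-work context, not as a runtime environment.** The AI Context Pack contains only an evidence-backed understanding of the project; it does not contain the target project's executable state. Evidence: `CLAUDE.md`, `README.md`, `benchmarks/README.md`
- **When answering the user, separate what can be previewed from what can only be verified after installation.** The value of a pre-install experience lies in reducing mistaken installs and misjudgments, not in posing as a real run. Evidence: `CLAUDE.md`, `README.md`, `benchmarks/README.md`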

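## Questions the User Should Answer Before Starting

- Which host AI or local environment do you plan to use it in?
- Do you only want to try the workflow first, or are you ready for a real installation?
- What matters most to you: installation cost, output quality, or conflicts with your existing rules?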

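## Acceptance Criteria

- Every capability claim can be traced back to a file path in evidence_refs.
- AI_CONTEXT_PACK.md does not present a preview as a real run.
- The user can understand within 3 minutes who it suits, what it can do, how to start, and where the risk boundaries are.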

---

## Doramagic Context Augmentation

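The content below reinforces the main Repomix/AI Context Pack body. The Human Manual provides only a reading skeleton; the pitfall log is converted into working constraints that the host AI must follow.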

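## Human Manual Skeleton

Usage rule: this is only a reading route through the project and a set of salience signals, not an authority on facts. Specific facts must still go back to repo evidence / Claim Graph.

Hard rules for the host AI:
- Do not treat page titles, section order, summaries, or importance as evidence of project facts.
- When explaining the Human Manual skeleton, state explicitly that it is only a reading route / salience signal.
- Judgments about capability, installation, compatibility, runtime status, and risk must cite repo evidence, a source path, or the Claim Graph.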

- **Project Overview and Installation Guide**: importance `high`
  - source_paths: README.md, packages/create-webclaw/README.md, packages/create-webclaw/package.json, packages/create-webclaw/index.mjs, examples/README.md
- **Rust Workspace and Crate Architecture**: importance `high`
  - source_paths: Cargo.toml, crates/webclaw-core/src/lib.rs, crates/webclaw-fetch/src/lib.rs, crates/webclaw-fetch/src/tls.rs, crates/webclaw-llm/src/lib.rs
- **CLI, MCP Tools, and Known Failure Modes**: importance `high`
  - source_paths: crates/webclaw-cli/src/main.rs, crates/webclaw-cli/src/bench.rs, crates/webclaw-mcp/src/tools.rs, crates/webclaw-fetch/src/crawler.rs, crates/webclaw-fetch/src/proxy.rs
- **Self-Hosted Server and Deployment**: importance `medium`
  - source_paths: crates/webclaw-server/src/main.rs, crates/webclaw-server/src/auth.rs, crates/webclaw-server/src/error.rs, crates/webclaw-server/src/state.rs, crates/webclaw-server/src/routes/scrape.rs

## Repo Inspection Evidence

- repo_clone_verified: true
- repo_inspection_verified: true
- repo_commit: `a3d3744104d8c1f58f0f2abc3d24572dbe05a674`
- inspected_files: `Dockerfile`, `README.md`, `docker-compose.yml`, `examples/README.md`, `examples/cloudflare-diagnostics/README.md`, `examples/firecrawl-compatible-api/README.md`, `examples/html-to-markdown-rag/README.md`, `examples/mcp-web-scraping/README.md`, `examples/proxy-backed-crawling/README.md`, `packages/create-webclaw/README.md`, `packages/create-webclaw/package.json`, `packages/create-webclaw/server.json`

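Hard rules for the host AI:
- Without repo_clone_verified=true, do not claim to have read the source code.
- Without repo_inspection_verified=true, do not present judgments based on README/docs/package files as facts.
- Without quick_start_verified=true, do not claim that the Quick Start has been run successfully.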
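
The three rules above amount to a gate between verification flags and the statements a host AI may make. A minimal sketch of that gate follows; the `InspectionFlags` and `Claim` types and the `may_assert` function are hypothetical names introduced for illustration. Because this section records no `quick_start_verified` value, the sketch models that flag as optional and treats a missing value as not verified.

```rust
/// Verification flags. Only the first two are recorded in this section;
/// `quick_start_verified` is not listed, so it is modeled as optional.
#[derive(Debug, Clone, Copy)]
struct InspectionFlags {
    repo_clone_verified: bool,
    repo_inspection_verified: bool,
    quick_start_verified: Option<bool>,
}

/// The kinds of statement the hard rules restrict (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Claim {
    ReadSourceCode,
    RepoFilesAsFact,
    QuickStartWorks,
}

/// Return true only when the flag that backs the claim is explicitly true.
fn may_assert(flags: InspectionFlags, claim: Claim) -> bool {
    match claim {
        Claim::ReadSourceCode => flags.repo_clone_verified,
        Claim::RepoFilesAsFact => flags.repo_inspection_verified,
        // A missing flag is treated the same as `false`.
        Claim::QuickStartWorks => flags.quick_start_verified == Some(true),
    }
}
```

With the flags recorded in this section, the first two claims would pass and the Quick Start claim would not.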

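## Doramagic Pitfall Constraints

These rules come from project-specific pitfalls found while Doramagic discovered, verified, or compiled the project. The host AI must treat them as working constraints, not as ordinary explanatory text.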

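### Constraint 1: Source evidence: Linux release binaries require glibc 2.38+ — fail on Debian 12 / Ubuntu 22.04 / Amazon Linux 2023

- Trigger: GitHub community evidence shows the project has an installation-related issue pending verification: Linux release binaries require glibc 2.38+ — fail on Debian 12 / Ubuntu 22.04 / Amazon Linux 2023
- Host AI rule: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Why it matters: It may raise the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/73 | The source discussion mentions node-related conditions; re-check before installing or trialing.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.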
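
Before trying a release binary, a user could compare the local glibc version against the minimum named in the issue title. The sketch below shows only the comparison step. The function names are hypothetical, the `2.38` threshold is taken from the issue title and has not been verified here, and the input string is assumed to come from a command such as `getconf GNU_LIBC_VERSION` or `ldd --version` run by the user.

```rust
/// Parse a glibc version such as "2.36", "glibc 2.36", or "2.35-0ubuntu3"
/// into a comparable (major, minor) tuple.
fn parse_glibc_version(s: &str) -> Option<(u32, u32)> {
    // Use the last whitespace-separated token so a "glibc " prefix is ignored.
    let token = s.split_whitespace().last()?;
    let mut parts = token.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    // Keep only the leading digits of the minor part to drop distro suffixes.
    let minor_digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor: u32 = minor_digits.parse().ok()?;
    Some((major, minor))
}

/// Minimum version quoted in the issue title; not independently verified.
const REQUIRED: (u32, u32) = (2, 38);

/// Some(true/false) when the version parses, None when it does not.
fn meets_requirement(installed: &str) -> Option<bool> {
    parse_glibc_version(installed).map(|v| v >= REQUIRED)
}
```

Returning `None` for unparseable input keeps "unknown" distinct from "too old", which matches the constraint's requirement not to present the issue as resolved without evidence.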

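### Constraint 2: Source evidence: `webclaw-server` REST API binary is missing from repo and Docker image

- Trigger: GitHub community evidence shows the project has an installation-related issue pending verification: `webclaw-server` REST API binary is missing from repo and Docker image
- Host AI rule: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Why it matters: It may raise the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/15 | The source discussion mentions docker-related conditions; re-check before installing or trialing.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.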

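### Constraint 3: May modify host AI configuration

- Trigger: The project targets hosts such as Claude/Cursor/Codex/Gemini/OpenCode, or the install command touches the user's configuration directory.
- Host AI rule: List the configuration files and directories that will be written, along with uninstall/rollback steps.
- Why it matters: Installation may change the behavior of local AI tools; the user needs to know where files are written and how to roll back.
- Evidence: capability.host_targets | https://github.com/0xMassi/webclaw | host_targets=mcp_host, claude_code, claude, cursor, codex
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.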

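### Constraint 4: Source evidence: MCP boolean params rejected when sent as strings (follow-up to #58 / #59)

- Trigger: GitHub community evidence shows the project has a configuration-related issue pending verification: MCP boolean params rejected when sent as strings (follow-up to #58 / #59)
- Host AI rule: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Why it matters: It may raise the cost of trial and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/62 | The source discussion mentions windows-related conditions; re-check before installing or trialing.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.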
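
One caller-side way to avoid this class of failure is to normalize boolean-like values to real booleans before building a tool call. The sketch below illustrates that idea only; it is not webclaw's parameter handling, and `ParamValue` and `coerce_bool` are hypothetical names.

```rust
/// A loosely typed parameter value as a client might produce it.
#[derive(Debug, Clone, PartialEq)]
enum ParamValue {
    Bool(bool),
    Text(String),
}

/// Coerce a value to a real boolean, accepting "true"/"false" strings
/// case-insensitively and rejecting everything else.
fn coerce_bool(value: &ParamValue) -> Option<bool> {
    match value {
        ParamValue::Bool(b) => Some(*b),
        ParamValue::Text(s) => match s.trim().to_ascii_lowercase().as_str() {
            "true" => Some(true),
            "false" => Some(false),
            _ => None,
        },
    }
}
```

Rejecting ambiguous strings such as `"yes"` or `"1"` is a deliberate choice in this sketch: it surfaces bad input on the caller side instead of guessing.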

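### Constraint 5: Capability judgment depends on assumptions

- Trigger: README/documentation is current enough for a first validation pass.
- Host AI rule: Turn the assumptions into a downstream verification checklist.
- Why it matters: If an assumption does not hold, the user will not get the promised capability.
- Evidence: capability.assumptions | https://github.com/0xMassi/webclaw | README/documentation is current enough for a first validation pass.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.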

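### Constraint 6: Maintenance activity unknown

- Trigger: last_activity_observed was not recorded.
- Host AI rule: Add signals from GitHub: recent commits, releases, and issue/PR responsiveness.
- Why it matters: New, abandoned, and active projects get lumped together, which lowers confidence in the recommendation.
- Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw | last_activity_observed missing
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.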

- Trigger: no_demo
- Evidence: downstream_validation.risk_items | https://github.com/0xMassi/webclaw | no_demo; severity=medium
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.

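### Constraint 8: Scoring risk present

- Trigger: no_demo
- Why it matters: The risk affects whether the project is suitable for ordinary users to install.
- Evidence: risks.scoring_risks | https://github.com/0xMassi/webclaw | no_demo; severity=medium
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.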

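### Constraint 9: Source evidence: create-webclaw fails on Windows: asset name mismatch + uses missing unzip command

- Trigger: GitHub community evidence shows the project has a security/permission-related issue pending verification: create-webclaw fails on Windows: asset name mismatch + uses missing unzip command
- Host AI rule: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Why it matters: It may affect authorization, key configuration, or security boundaries.
- Evidence: community_evidence:github | https://github.com/0xMassi/webclaw/issues/71 | The source discussion mentions node-related conditions; re-check before installing or trialing.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.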

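### Constraint 10: Issue/PR response quality unknown

- Trigger: issue_or_pr_quality=unknown.
- Host AI rule: Sample recent issues/PRs to judge whether they go unhandled for long periods.
- Why it matters: Users cannot tell whether anyone maintains the project when they hit a problem.
- Evidence: evidence.maintainer_signals | https://github.com/0xMassi/webclaw | issue_or_pr_quality=unknown
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.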