# https://github.com/vasylenko/markfetch Project Documentation

Generated: 2026-05-15 08:07:16 UTC

## Table of Contents

- [Introduction](#introduction)
- [Quick Start Guide](#quickstart)
- [Processing Pipeline](#processing-pipeline)
- [HTTP/2 Fingerprinting](#http-fingerprinting)
- [CLI Usage](#cli-usage)
- [MCP Server Integration](#mcp-server)
- [Environment Variables](#environment-variables)
- [Write Sandbox Security](#write-sandbox)
- [Error Handling](#error-handling)
- [Development Guide](#development)

<a id='introduction'></a>

## Introduction

### Related Pages

Related topics: [Quick Start Guide](#quickstart), [Processing Pipeline](#processing-pipeline)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
- [CHANGELOG.md](https://github.com/vasylenko/markfetch/blob/main/CHANGELOG.md)
- [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
- [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)
- [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)
- [src/sandbox.ts](https://github.com/vasylenko/markfetch/blob/main/src/sandbox.ts)
</details>

# Introduction

## What is markfetch?

**markfetch** is a Node.js tool that fetches public HTTP/S URLs and returns clean, readable markdown — indistinguishable from what a human would get by running "Save as Markdown" in a browser. It is designed to provide high-quality content extraction for language models, with a focus on producing output that LLM clients can actually consume reliably.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Core Design Philosophy

markfetch is built around several key principles that differentiate it from generic fetching solutions:

| Principle | Description |
|-----------|-------------|
| **Single-channel output** | Returns markdown in `content[0].text` only — no `structuredContent` that some LLM clients drop |
| **Real-browser fingerprint** | Uses HTTP/2 transport with a coherent Chrome header set and `Sec-CH-UA-*` client hints |
| **Reader-View extraction** | Leverages Mozilla's Readability library to extract the main article content |
| **Zero-config defaults** | Works out of the box with sensible defaults |
| **Deterministic errors** | 8 structured error codes for reliable error handling |

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Architecture Overview

markfetch follows an adapter pattern with a unified core:

```mermaid
graph TD
    A[User / LLM Client] --> B[Adapter Layer]
    B --> C{Invocation Mode}
    C -->|CLI args| D[cli.ts]
    C -->|MCP stdio| E[mcp.ts]
    D --> F[core.ts - fetchMarkdown]
    E --> F
    F --> G[HTTP Fetch - undici]
    G --> H[Readability Extraction]
    H --> I[Turndown Conversion]
    I --> J[Markdown Output]
```

### Core Components

| Component | File | Responsibility |
|-----------|------|----------------|
| **Core Pipeline** | `src/core.ts` | URL fetching, HTML parsing, content extraction, markdown conversion, error throwing |
| **CLI Adapter** | `src/cli.ts` | Command-line argument parsing, stdout/stderr output |
| **MCP Adapter** | `src/mcp.ts` | Model Context Protocol stdio server, tool registration |
| **Write Sandbox** | `src/sandbox.ts` | Path validation for file saves |

Source: [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts), [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts), [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)

## Two Operating Modes

### CLI Mode

The command-line interface accepts a URL and outputs markdown to stdout:

```bash
markfetch https://en.wikipedia.org/wiki/Markdown
```

Options include:
- `-o, --output <path>` — Save markdown to a file
- `-V, --version` — Print version
- `-h, --help` — Print usage

The CLI respects the same environment variables as the MCP mode and resolves relative output paths against the current working directory.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md), [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)

### MCP Mode

The Model Context Protocol server provides a single tool `fetch_markdown(url, savePath?)` for integration with LLM clients like Claude Code, Cursor, or Goose:

```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}
```

The MCP mode has additional security features:
- **Write sandbox**: File saves are restricted to allowed write roots
- **Lazy loading**: The CLI adapter is never loaded in MCP mode, ensuring `console.log` is never reachable

Source: [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts), [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)

## Content Extraction Pipeline

The markdown conversion process involves several stages:

```mermaid
graph LR
    A[HTML Response] --> B[Decode Encoded Tags]
    B --> C[Ensure Base Href]
    C --> D[Rewrite for Readability]
    D --> E[Readability Parse]
    E --> F[Turndown Convert]
    F --> G[Prune Empty Headings]
    G --> H[Clean Markdown]
```

### Extraction Details

1. **Encoded Tag Decoding**: Handles HTML entities like `&lt;code&gt;` in code blocks
2. **Base Href Injection**: Ensures relative URLs become absolute using the canonical URL
3. **Pre-processing Rewrites**: Handles footnotes, `<details>` elements, and MediaWiki-specific structures
4. **Readability Parsing**: Extracts main article content using Mozilla Readability with `keepClasses: true` to preserve language hints on code blocks
5. **Markdown Conversion**: Uses Turndown with a custom escape function to avoid noisy backslash escapes
6. **Heading Pruning**: Removes empty headings left by stripped interactive widgets

Source: [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)

## Error Handling

markfetch provides 8 deterministic error codes:

| Error Code | Meaning |
|------------|---------|
| `network_error` | DNS, TCP, TLS failure, or unexpected fetcher error |
| `http_error` | Non-2xx status from upstream |
| `timeout` | Request exceeded `MARKFETCH_TIMEOUT_MS` |
| `unsupported_content_type` | Response is not `text/html` or `application/xhtml+xml` |
| `extraction_failed` | Readability found no article content (typical for SPAs) |
| `too_large` | Content exceeded `MARKFETCH_MAX_BYTES` |
| `save_failed` | File write failed (permission, missing directory) |
| `save_forbidden` | `savePath` resolves outside allowed write roots |

All errors are thrown uniformly from `core.ts` as `MarkfetchError` and caught by adapters for translation to their respective output formats.
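
The throw-in-core, translate-in-adapter pattern described above can be sketched as follows. Only the `MarkfetchError` name, the eight codes, and the `[code] message` format come from this page; the class fields and the `formatError` helper are assumptions made for illustration, and the real class in `src/core.ts` may look different.

```typescript
// Hypothetical sketch: the real class in src/core.ts may carry different fields.
type ErrorCode =
  | "network_error" | "http_error" | "timeout" | "unsupported_content_type"
  | "extraction_failed" | "too_large" | "save_failed" | "save_forbidden";

class MarkfetchError extends Error {
  readonly code: ErrorCode;

  constructor(code: ErrorCode, message: string) {
    super(message);
    this.name = "MarkfetchError";
    this.code = code;
  }
}

// Adapter-side translation (hypothetical helper). Known errors become
// "[code] message"; anything else is reported as network_error, an assumption
// based on the "unexpected fetcher error" wording in the table above.
function formatError(err: unknown): string {
  if (err instanceof MarkfetchError) return `[${err.code}] ${err.message}`;
  const message = err instanceof Error ? err.message : String(err);
  return `[network_error] ${message}`;
}
```

A CLI adapter could write the formatted string to stderr, while an MCP adapter could place the same string in its tool response.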

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Configuration

| Variable | Default | Purpose |
|----------|---------|---------|
| `MARKFETCH_TIMEOUT_MS` | `30000` | Per-request timeout in milliseconds |
| `MARKFETCH_MAX_BYTES` | `5000000` | Cap on response body and extracted markdown |
| `MARKFETCH_USER_AGENT` | Chrome 130 string | Browser fingerprint; must be Chrome UA |
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | `os.tmpdir()` + `process.cwd()` | MCP-only; allowed file save paths |

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## What markfetch Is Not

Understanding the boundaries helps set correct expectations:

| Limitation | Explanation |
|------------|-------------|
| **Not a crawler** | One URL in, one document out. No recursion, `robots.txt` parsing, or rate limiting. |
| **Not authenticated** | Anonymous fetch only. Pages behind login walls return public content or `http_error`. |
| **Not a JS renderer** | Pure client-rendered SPAs with no static HTML return `extraction_failed`. SPAs with server-rendered content will extract what they ship. |

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Requirements

- **Node.js ≥ 24**
- **npm** for installation

## Quick Start

```bash
# Install globally
npm i -g markfetch

# Fetch a URL
markfetch https://en.wikipedia.org/wiki/Markdown

# Save to file
markfetch https://example.com/article -o output.md
```

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Version History

| Version | Date | Key Changes |
|---------|------|-------------|
| 0.6.0 | 2026-05-13 | Write sandbox, `save_forbidden` error, CI matrix expansion |
| 0.5.0 | 2026-05-12 | CLI mode with lazy-loading dispatcher |
| 0.4.1 | 2026-05-11 | Bug fixes and documentation improvements |
| 0.4.0 | 2026-05-10 | MCP server with single `fetch_markdown` tool |

Source: [CHANGELOG.md](https://github.com/vasylenko/markfetch/blob/main/CHANGELOG.md)

---

<a id='quickstart'></a>

## Quick Start Guide

### Related Pages

Related topics: [Introduction](#introduction), [CLI Usage](#cli-usage), [MCP Server Integration](#mcp-server)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
- [package.json](https://github.com/vasylenko/markfetch/blob/main/package.json)
- [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)
- [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)
</details>

# Quick Start Guide

markfetch is a tool that fetches URLs and returns clean markdown output. It operates as both a CLI command and an MCP (Model Context Protocol) server, making it suitable for AI agents like Claude Code, Codex, and Gemini CLI.

## Installation

### Prerequisites

- Node.js ≥ 24. Source: [package.json:8]()

### CLI Installation (Global)

```bash
npm i -g markfetch
```

After installation, the `markfetch` command is available globally. Source: [README.md:38]()

### CLI Installation (npx)

For one-off usage without global installation:

```bash
npx -y markfetch <url>
```

### MCP Server Setup

Add markfetch to your MCP client configuration. The setup varies by client.

#### Claude Code

```bash
claude mcp add --scope user markfetch -- npx -y markfetch
```

#### Codex

```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}
```

#### Gemini CLI

```bash
gemini mcp add -s user markfetch npx -y markfetch
```

#### Cursor / Goose / Other stdio-MCP Clients

```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}
```

Source: [README.md:46-69]()

## CLI Usage

### Basic Fetch

```bash
markfetch <url>
```

The fetched markdown is printed to stdout. Source: [src/cli.ts:18]()

### Save to File

```bash
markfetch <url> -o <path>
```

Use `-o` or `--output` to save markdown to a file. Relative paths resolve against the current working directory. Source: [src/cli.ts:12-15]()

Example:

```bash
markfetch https://en.wikipedia.org/wiki/Markdown -o output.md
```

### Help and Version

```bash
markfetch --help
markfetch --version
```

## MCP Tool Usage

### Tool Name

`fetch_markdown`

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | Absolute http(s) URL to fetch. The server follows redirects automatically. No authentication headers, cookies, or session state are sent. |
| `savePath` | string | No | Absolute filesystem path. When provided, the fetched markdown is written to this path instead of returned in the response. |

Source: [src/mcp.ts:22-33]()

### Return Value

The tool returns markdown content in `content[0].text`. No `structuredContent` field is used, which keeps the output visible to MCP clients that forward only `content[]` to the model. Source: [README.md:18-21]()
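
The single-channel return value can be pictured with a small sketch. The `content` array of typed items and the optional `isError` flag follow the general MCP tool-result shape; the helper names `toToolResult` and `toToolError` are hypothetical, and whether markfetch sets `isError` is an assumption.

```typescript
// Minimal shape of a text-only tool result.
interface TextContent {
  type: "text";
  text: string;
}

interface ToolResult {
  content: TextContent[];
  isError?: boolean;
}

// Hypothetical helper: wraps markdown in a single text item so the payload
// always lands in content[0].text.
function toToolResult(markdown: string): ToolResult {
  return { content: [{ type: "text", text: markdown }] };
}

// Hypothetical helper: reports a failure through the same single channel,
// using the "[code] message" format.
function toToolError(code: string, message: string): ToolResult {
  return {
    content: [{ type: "text", text: `[${code}] ${message}` }],
    isError: true,
  };
}
```

Because both success and failure use `content[0].text`, a client that ignores every other field still sees the full payload.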

## Environment Configuration

| Variable | Default | Purpose |
|----------|---------|---------|
| `MARKFETCH_TIMEOUT_MS` | `30000` | Per-request timeout in milliseconds |
| `MARKFETCH_MAX_BYTES` | `5000000` | Cap on response body and extracted markdown (5MB) |
| `MARKFETCH_USER_AGENT` | Pinned Chrome 130 string | Override the User-Agent header. Must be a Chrome UA string. |
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | `os.tmpdir()` + `process.cwd()` | MCP-only. Colon-delimited (POSIX) or semicolon-delimited (Windows) list of absolute paths permitted for `savePath` writes. |

Source: [README.md:99-103]()

### Passing Environment Variables to MCP

```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_TIMEOUT_MS": "60000"
      }
    }
  }
}
```

## Error Handling

Errors are returned with deterministic codes in the format `[code] message`:

| Code | Meaning |
|------|---------|
| `network_error` | DNS, TCP, or TLS failure |
| `http_error` | Upstream returned a non-2xx status |
| `timeout` | Request exceeded `MARKFETCH_TIMEOUT_MS` |
| `unsupported_content_type` | Response was not `text/html` or `application/xhtml+xml` |
| `extraction_failed` | Readability found no article content (typical for pure client-rendered SPAs) |
| `too_large` | Response body or extracted markdown exceeded `MARKFETCH_MAX_BYTES` |
| `save_failed` | `writeFile` failed (missing directory, permission denied) |
| `save_forbidden` | `savePath` resolves outside the allowed write roots |

Errors go to stderr with non-zero exit status in CLI mode. Source: [README.md:72-85]()

## Quick Workflow

```mermaid
graph TD
    A[Start markfetch] --> B{Arguments provided?}
    B -->|Yes, URL argument| C[CLI Mode]
    B -->|No arguments| D[MCP Server Mode]
    C --> E[Fetch URL]
    D --> F[Wait for MCP request]
    E --> G{Output path specified?}
    F --> H[Receive fetch_markdown request]
    G -->|No| I[Print to stdout]
    G -->|Yes, -o path| J[Write to file]
    H --> L[Return markdown in MCP tool response]
    J --> K[Print confirmation to stdout]
```

## Use Cases

| Use Case | Recommended Mode | Command/Config |
|----------|-------------------|----------------|
| One-time URL fetch in shell | CLI | `markfetch <url>` |
| Batch processing with shell scripts | CLI + `-o` | `markfetch <url> -o out.md` |
| AI agent web content retrieval | MCP | Configure in client |
| Large document bypass inline limits | MCP + `savePath` | Set `savePath` to local file |

## Limitations

- **Not a crawler**: No recursion, no `robots.txt` parsing. One URL in, one document out. Source: [README.md:89-91]()
- **Not authenticated**: Anonymous fetch only. Pages behind login walls return whatever the public response is. Source: [README.md:93-95]()
- **Not a JS renderer**: Pure client-rendered SPAs with no static HTML return `extraction_failed`. Source: [README.md:97-99]()

---

<a id='processing-pipeline'></a>

## Processing Pipeline

### Related Pages

Related topics: [Introduction](#introduction), [HTTP/2 Fingerprinting](#http-fingerprinting), [Error Handling](#error-handling)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
- [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)
- [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)
- [src/sandbox.ts](https://github.com/vasylenko/markfetch/blob/main/src/sandbox.ts)
- [package.json](https://github.com/vasylenko/markfetch/blob/main/package.json)
- [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
</details>

# Processing Pipeline

## Overview

The Processing Pipeline is the core data flow engine in markfetch. It transforms raw HTML fetched from a URL into clean, readable markdown suitable for consumption by AI agents and language models. The pipeline is intentionally single-purpose — one URL in, one markdown document out — with no recursion, pagination, or client-side JavaScript rendering.

The pipeline operates identically whether invoked via CLI or MCP adapter, ensuring consistent behavior across both interfaces.

Source: [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)

## Architecture

The pipeline is composed of four stages executed sequentially: HTTP fetch, article extraction, markdown conversion, and size validation with output routing.

```mermaid
graph TD
    A[URL Input] --> B[HTTP Fetch]
    B --> C{HTML Valid?}
    C -->|No| D[Error: network_error / http_error / timeout]
    C -->|Yes| E[Content-Type Check]
    E -->|Non-HTML| F[Error: unsupported_content_type]
    E -->|HTML| G[Extract Article]
    G -->|No Content| H[Error: extraction_failed]
    G -->|Extracted| I[Convert to Markdown]
    I --> J{Size Check}
    J -->|Exceeds Limit| K[Error: too_large]
    J -->|Valid| L{Save Path?}
    L -->|Yes| M[Write to File / Error: save_forbidden / save_failed]
    L -->|No| N[Return Markdown]
```

Each stage performs validation and may abort with a deterministic error code, ensuring failures are predictable and actionable.

Source: [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)

## Stage 1: HTTP Fetch

The fetch stage retrieves raw HTML from the target URL using Node.js `fetch` with a real-browser fingerprint.

### Transport Configuration

| Setting | Value | Purpose |
|---------|-------|---------|
| Protocol | HTTP/2 | Modern web fingerprint |
| User-Agent | Chrome 130 (pinned) | Realistic browser identification |
| Client Hints | Sec-CH-UA-* headers | Derived from User-Agent at startup |
| Timeout | `MARKFETCH_TIMEOUT_MS` (default: 30000ms) | Per-request budget |

The User-Agent string is validated at startup. Non-Chrome strings fail fast to prevent fingerprint inconsistencies that could trigger bot detection.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

### Error Conditions

| Code | Trigger |
|------|---------|
| `network_error` | DNS failure, TCP failure, TLS error, unexpected fetcher error |
| `http_error` | Non-2xx HTTP status code |
| `timeout` | Request exceeds `MARKFETCH_TIMEOUT_MS` |

Redirects are followed automatically by the underlying HTTP client.
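
The three fetch-stage codes can be told apart with a small classifier. The sketch below is an assumption about how such a mapping might look, not the logic in `src/core.ts`; `classifyFetchFailure` and `classifyStatus` are hypothetical names. It relies on the standard behavior that `AbortSignal.timeout()` aborts with an error named `TimeoutError`.

```typescript
type FetchStageCode = "network_error" | "http_error" | "timeout";

// Maps a rejected fetch() to a code. An error named "TimeoutError" is treated
// as a timeout; every other rejection is treated as network_error.
function classifyFetchFailure(err: unknown): FetchStageCode {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  return name === "TimeoutError" ? "timeout" : "network_error";
}

// Maps a completed response to http_error when the status is outside 2xx.
// Returns null when the status is acceptable.
function classifyStatus(status: number): FetchStageCode | null {
  return status >= 200 && status < 300 ? null : "http_error";
}
```

A caller would pass `signal: AbortSignal.timeout(timeoutMs)` to `fetch`, route any rejection through `classifyFetchFailure`, and check `response.status` with `classifyStatus`.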

## Stage 2: Article Extraction

Article extraction identifies and isolates the main content from the fetched HTML, stripping navigation, sidebars, footers, and other boilerplate.

### Technology Stack

| Component | Library | Purpose |
|-----------|---------|---------|
| HTML Parser | `linkedom` | Parses HTML into a DOM-like structure |
| Extraction | `readability` (Mozilla) | Identifies main article content |
| Configuration | `keepClasses: true` | Preserves code block language hints |

Node.js does not ship a built-in `DOMParser`, so `linkedom` supplies the DOM implementation. It behaves consistently across Node.js versions and environments.

Source: [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)

### Pre-Extraction Rewrites

Before Readability processes the document, the pipeline applies targeted HTML rewrites to normalize content and improve extraction quality:

```typescript
function rewriteForReadability(document: Document): void {
  // Normalize code blocks (pre and code elements)
  // Convert aside elements to sections
  // Expand details/summary elements
  // Flatten MediaWiki heading wrappers
}
```

Specific transformations include:

| Transform | Target | Action |
|-----------|--------|--------|
| Code block normalization | `<pre>`, `<code>` | Standardize encoding artifacts |
| Base href injection | `<head>` / `<html>` | Ensure absolute URLs after redirects |
| Aside conversion | `<aside>` with footnote roles | Convert to `<section>` |
| Details expansion | `<details>`, `<summary>` | Inline content |
| Heading unwrapping | `div.mw-heading` | Remove MediaWiki wrappers |

### Base Href Handling

Readability and linkedom leave relative URLs unresolved unless a `<base>` element exists. The pipeline injects the post-redirect canonical URL to ensure all hrefs and srcs resolve correctly:

```typescript
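function ensureBaseHref(html: string, url: string): string {
  // Escape characters that could terminate the attribute value early.
  const safeUrl = url.replaceAll("&", "&amp;").replaceAll('"', "&quot;");
  // Remove any existing <base> so the injected one is the only one.
  const stripped = html.replaceAll(/<base\s[^>]*>/gi, "");
  // Inject <base href="..."> into <head> or <html> (a sketch; src/core.ts may differ).
  const base = `<base href="${safeUrl}">`;
  const anchor = /<head(\s[^>]*)?>/i.test(stripped) ? /<head(\s[^>]*)?>/i : /<html(\s[^>]*)?>/i;
  return stripped.replace(anchor, (tag) => tag + base);
}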
```

### Error Conditions

| Code | Trigger |
|------|---------|
| `unsupported_content_type` | Response is not `text/html` or `application/xhtml+xml` |
| `extraction_failed` | Readability returned empty content (typical for client-rendered SPAs) |

Source: [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
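
A content-type gate like the one behind `unsupported_content_type` might look like the sketch below. The function name `isSupportedContentType` is hypothetical, and treating a missing header as unsupported is an assumption; the two accepted media types come from the table above.

```typescript
const SUPPORTED_TYPES = ["text/html", "application/xhtml+xml"];

// Strips parameters such as "; charset=utf-8" and compares case-insensitively.
// A missing header is treated as unsupported (an assumption).
function isSupportedContentType(header: string | null): boolean {
  if (!header) return false;
  const mediaType = (header.split(";")[0] ?? "").trim().toLowerCase();
  return SUPPORTED_TYPES.includes(mediaType);
}
```

With this check, `text/html; charset=utf-8` passes while `application/json` and `application/pdf` are rejected before any parsing work starts.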

## Stage 3: Markdown Conversion

The conversion stage transforms extracted HTML into clean markdown using Turndown with custom rules.

### Technology Stack

| Component | Library | Notes |
|-----------|---------|-------|
| HTML-to-MD | `turndown` | Configured with GFM rules |
| Code fences | Custom rule | Preserves `class="language-X"` as hint |

### Custom Escape Behavior

Turndown's default escape mechanism inserts backslashes before certain character sequences that might be misinterpreted as markdown. The pipeline removes two categories of unnecessary escapes:

| Pattern | Before | After | Rationale |
|---------|--------|-------|-----------|
| Intraword underscores | `\_` | `_` | Intraword underscores are valid |
| Mid-line dash/equals | `\-X`, `\=X` | `-X`, `=X` | Not list markers or underlines when alphanumeric follows |

This prevents the output from containing visible escape characters that don't affect rendering.
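
The two relaxations in the table can be approximated with a pair of regular expressions. This is a post-processing sketch under stated assumptions, not the custom Turndown escape function the project uses; `relaxEscapes` is a hypothetical name, and "alphanumeric" is taken to mean ASCII letters and digits.

```typescript
// Removes backslash escapes that do not affect rendering.
function relaxEscapes(text: string): string {
  return text
    // "\_" between two alphanumerics: snake\_case becomes snake_case.
    .replace(/([A-Za-z0-9])\\_(?=[A-Za-z0-9])/g, "$1_")
    // "\-" or "\=" directly followed by an alphanumeric: \-v becomes -v.
    .replace(/\\([-=])(?=[A-Za-z0-9])/g, "$1");
}
```

Escapes that still matter are left alone: a leading `\_` (which could start emphasis) and `\-` followed by a space (which could start a list item) both survive.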

### Empty Heading Pruning

The conversion includes iterative pruning of empty headings — headings immediately followed by another heading with no body content. This commonly occurs when Readability strips interactive widgets (browser-compat tables, spec diagrams) but leaves the surrounding heading structure.
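
One way to picture the iterative pruning is the markdown-level sketch below. It is an assumption-laden illustration, not the project's implementation: `pruneEmptyHeadings` is a hypothetical name, a heading counts as empty when the next non-blank line is a heading of the same or shallower level (or the document ends), and lines inside fenced code blocks that start with `#` are not special-cased.

```typescript
// Drops ATX headings that have no body, repeating until nothing changes,
// because removing one heading can leave its parent empty too.
function pruneEmptyHeadings(markdown: string): string {
  // Returns 1-6 for an ATX heading line, 0 for anything else.
  const level = (line: string): number => {
    const m = /^(#{1,6})\s/.exec(line);
    return m ? (m[1] ?? "").length : 0;
  };

  let lines = markdown.split("\n");
  for (;;) {
    const kept: string[] = [];
    lines.forEach((line, i) => {
      const current = level(line);
      if (current === 0) {
        kept.push(line);
        return;
      }
      const next = lines.slice(i + 1).find((l) => l.trim() !== "");
      const nextLevel = next === undefined ? 0 : level(next);
      const empty = next === undefined || (nextLevel > 0 && nextLevel <= current);
      if (!empty) kept.push(line);
    });
    if (kept.length === lines.length) break;
    lines = kept;
  }
  // Collapse the blank-line runs left behind by removed headings.
  return lines.join("\n").replace(/\n{3,}/g, "\n\n");
}
```

A heading followed by a deeper heading is kept on the first pass, since the deeper section may still have content; it is removed on a later pass only if all of its subsections turn out to be empty.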

### Title Handling

| Condition | Output |
|-----------|--------|
| Content starts with `<h1>` | Use content heading, no duplicate |
| Content lacks heading | Prepend `# {title}` from Readability |
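
The two rows above amount to a small conditional. The sketch below applies it to the converted markdown rather than to the extracted HTML, which is an assumption; `withTitle` is a hypothetical name.

```typescript
// Prepends "# {title}" only when the content does not already open with an H1.
function withTitle(markdown: string, title: string | null): string {
  const startsWithH1 = /^#\s/.test(markdown.trimStart());
  if (startsWithH1 || title === null || title.trim() === "") return markdown;
  return `# ${title.trim()}\n\n${markdown}`;
}
```

A content body that opens with `## Section` would still get the title prepended, since only a level-one heading counts as a duplicate in this sketch.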

### Output Format

```markdown
# Page Title (if not already in content)

Article body with clean markdown conversion...
```

## Stage 4: Size Validation and Output

### Size Limits

| Limit | Environment Variable | Default |
|-------|---------------------|---------|
| Response body | `MARKFETCH_MAX_BYTES` | 5,000,000 bytes |
| Extracted markdown | Same variable | Same default |

The pipeline checks both the raw HTTP response size and the final markdown size against this cap.

### Error Conditions

| Code | Trigger |
|------|---------|
| `too_large` | Body or markdown exceeds `MARKFETCH_MAX_BYTES` |
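
Because the cap is expressed in bytes, a check on string length alone would undercount non-ASCII text. The sketch below measures UTF-8 bytes; that the project measures the markdown this way is an assumption, and `exceedsCap` is a hypothetical name.

```typescript
// Compares the UTF-8 encoded size of the text with the cap.
// "\u00e9" (e with acute accent) has length 1 as a string but occupies 2 bytes.
function exceedsCap(text: string, maxBytes: number): boolean {
  return new TextEncoder().encode(text).length > maxBytes;
}
```

The same function could be applied to the final markdown, while the raw response body is more naturally capped by counting bytes as they arrive.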

### Output Routing

| Mode | Destination | Behavior |
|------|-------------|----------|
| No `savePath` | Return value | `markdown` field contains content |
| `savePath` (MCP) | File system | `savedTo` field contains path |
| `savePath` (CLI) | File system | Confirmation to stdout |

## Write Sandbox (MCP Only)

When used as an MCP tool with a `savePath` parameter, writes are confined to an allowed set of root directories.

### Default Roots

| Platform | Roots |
|----------|-------|
| POSIX | `os.tmpdir()`, `process.cwd()` |
| Windows | Same, case-insensitive comparison |

### Configuration

`MARKFETCH_ALLOWED_WRITE_ROOTS` overrides the defaults entirely. Paths use platform delimiters:

| Platform | Delimiter | Example |
|----------|-----------|---------|
| POSIX | `:` | `/Users/me/out:/tmp` |
| Windows | `;` | `C:\Users\me\out;C:\Temp` |

### Error Conditions

| Code | Trigger |
|------|---------|
| `save_forbidden` | `savePath` resolves outside allowed roots |
| `save_failed` | `writeFile` failed (permissions, missing directory) |

The sandbox applies only to MCP mode. The CLI has no restrictions — the human at the shell is the security boundary.

Source: [src/sandbox.ts](https://github.com/vasylenko/markfetch/blob/main/src/sandbox.ts)
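
A containment check of the kind the sandbox performs can be sketched with `node:path`. This is an illustration under assumptions, not the code in `src/sandbox.ts`: the function names are hypothetical, and symlink resolution is not handled here.

```typescript
import * as path from "node:path";

// Splits a MARKFETCH_ALLOWED_WRITE_ROOTS-style value using the platform
// delimiter (":" on POSIX, ";" on Windows) and drops empty entries.
function parseRoots(value: string, delimiter: string = path.delimiter): string[] {
  return value
    .split(delimiter)
    .map((p) => p.trim())
    .filter((p) => p !== "");
}

// True when target is the root itself or lies beneath it after resolution.
function isInsideRoot(target: string, root: string): boolean {
  const rel = path.relative(path.resolve(root), path.resolve(target));
  if (rel === "") return true;
  return rel !== ".." && !rel.startsWith(".." + path.sep) && !path.isAbsolute(rel);
}

// A write is allowed when the target sits inside at least one root.
function isWriteAllowed(target: string, roots: string[]): boolean {
  return roots.some((root) => isInsideRoot(target, root));
}
```

Resolving both paths before comparing them is what defeats `..` traversal: `/tmp/../etc/passwd` resolves to `/etc/passwd`, which is outside `/tmp`. A plain string-prefix test would also wrongly accept `/tmpfoo` for the root `/tmp`, which the relative-path comparison avoids.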

## Error Codes Reference

The pipeline returns exactly eight deterministic error codes:

| Code | Stage | Description |
|------|-------|-------------|
| `network_error` | Fetch | DNS/TCP/TLS failure |
| `http_error` | Fetch | Non-2xx status |
| `timeout` | Fetch | Exceeded timeout budget |
| `unsupported_content_type` | Fetch | Not HTML/XHTML |
| `extraction_failed` | Extract | Readability found no content |
| `too_large` | Convert/Validate | Exceeded size cap |
| `save_forbidden` | Output | Path outside sandbox |
| `save_failed` | Output | File write failed |

All errors use the format `[code] message` for easy parsing by consuming tools.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
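
A consuming tool could split the `[code] message` format with a single regular expression. The sketch below assumes codes consist of lowercase letters and underscores, which matches the eight codes listed above; `parseErrorLine` is a hypothetical helper, not part of markfetch.

```typescript
interface ParsedError {
  code: string;
  message: string;
}

// Returns the code and message, or null when the line is not in
// "[code] message" form.
function parseErrorLine(line: string): ParsedError | null {
  const m = /^\[([a-z_]+)\]\s+(.*)$/.exec(line.trim());
  if (!m) return null;
  return { code: m[1] ?? "", message: m[2] ?? "" };
}
```

A shell wrapper or agent could branch on `code` (for example, retrying on `timeout` but not on `save_forbidden`) without parsing free-form text.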

## Data Flow Summary

```mermaid
graph LR
    A[URL] --> B[HTTP Fetch]
    B --> C{HTML?}
    C -->|Yes| D[HTML Rewrites]
    C -->|No| E[Error]
    D --> F[Readability]
    F --> G[Extracted Article]
    G --> H[Turndown]
    H --> I[Size Check]
    I -->|OK| J[Output]
    I -->|Large| K[Error]
    J --> L{savePath?}
    L -->|No| M[Return Markdown]
    L -->|Yes| N[Write File]
```

## Pipeline Entry Points

### CLI Adapter

The CLI adapter (`src/cli.ts`) parses arguments and delegates to the core pipeline:

```typescript
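import { resolve } from "node:path";

// Resolve -o against the current working directory; omit savePath when -o is absent.
const { markdown, bytes, savedTo } = await fetchMarkdown({
  url,
  savePath: options.output ? resolve(process.cwd(), options.output) : undefined,
});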
```

Output behavior:
- With `-o`: prints `Saved N bytes to <path>` to stdout
- Without `-o`: writes raw markdown to stdout via `process.stdout.write`

Errors print to stderr with `[code] message` format.

### MCP Adapter

The MCP adapter (`src/mcp.ts`) registers the `fetch_markdown` tool and calls the core pipeline:

```typescript
server.registerTool("fetch_markdown", {
  description: "Fetch a single public HTTP/S URL...",
  inputSchema: {
    url: z.string().url(),
    savePath: z.string().refine(isAbsolute).optional()
  }
});
```

Output is always returned via `content[0].text`, never `structuredContent`, ensuring compatibility with clients that only forward `content[]`.

Source: [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts), [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)

## Configuration Options

| Variable | Default | Applies To | Purpose |
|----------|---------|------------|---------|
| `MARKFETCH_TIMEOUT_MS` | `30000` | Both | Per-request timeout |
| `MARKFETCH_MAX_BYTES` | `5000000` | Both | Size cap |
| `MARKFETCH_USER_AGENT` | Chrome 130 | Both | Browser fingerprint |
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | tmpdir + cwd | MCP only | Write sandbox roots |

All variables are validated at startup with fail-fast behavior — invalid values terminate the process immediately with a stderr message.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
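
Fail-fast validation of the numeric variables might be sketched as follows. The variable names and defaults come from the table above; the `Config` shape, the helper names, and the positive-integer rule are assumptions, and the environment is passed in as a plain object so the sketch can be exercised without touching `process.env`.

```typescript
interface Config {
  timeoutMs: number;
  maxBytes: number;
}

// Returns the fallback when the variable is unset or empty, and throws on
// anything that is not a positive integer.
function parsePositiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Hypothetical loader: validates everything once, before any request is made.
function loadConfig(env: Record<string, string | undefined>): Config {
  return {
    timeoutMs: parsePositiveInt("MARKFETCH_TIMEOUT_MS", env["MARKFETCH_TIMEOUT_MS"], 30000),
    maxBytes: parsePositiveInt("MARKFETCH_MAX_BYTES", env["MARKFETCH_MAX_BYTES"], 5000000),
  };
}
```

Calling `loadConfig(process.env)` once at startup and exiting with the thrown message on stderr would give the fail-fast behavior described above.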

## Dependencies

| Package | Version | Role |
|---------|---------|------|
| `linkedom` | runtime | HTML parsing |
| `readability` | runtime | Article extraction |
| `turndown` | runtime | HTML-to-markdown |
| `turndown-plugin-gfm` | runtime | GitHub Flavored Markdown |
| `commander` | runtime | CLI argument parsing |
| `@modelcontextprotocol/sdk` | runtime | MCP server framework |

Node.js ≥ 24 is required. The pipeline relies on the runtime's native `fetch` and its header handling.

Source: [package.json](https://github.com/vasylenko/markfetch/blob/main/package.json)

---

<a id='http-fingerprinting'></a>

## HTTP/2 Fingerprinting

### Related Pages

Related topics: [Processing Pipeline](#processing-pipeline), [Environment Variables](#environment-variables)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
- [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)
- [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)
- [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
- [CHANGELOG.md](https://github.com/vasylenko/markfetch/blob/main/CHANGELOG.md)
- [package.json](https://github.com/vasylenko/markfetch/blob/main/package.json)
</details>

# HTTP/2 Fingerprinting

## Overview

HTTP/2 Fingerprinting is a technique used by markfetch to mimic real browser traffic when fetching web pages. Instead of making requests that appear to come from a typical HTTP library (like curl or a basic fetch implementation), markfetch generates HTTP/2 requests with headers and client hints that closely match those of an actual Chrome browser session.

This approach serves two critical purposes:

1. **Bypass anti-bot measures**: Many websites use fingerprinting to detect and block automated scrapers. By presenting a header set consistent with a genuine Chrome browser, markfetch is less likely to trigger these defenses.
2. **Access SEO-rendered content**: Sites that serve different content to bots vs. browsers will return the full article content when markfetch requests arrive with Chrome-like fingerprints.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Architecture

```mermaid
graph TD
    A[URL Request] --> B{Adapter Type?}
    B -->|MCP| C[src/mcp.ts]
    B -->|CLI| D[src/cli.ts]
    C --> E[src/core.ts - fetchMarkdown]
    D --> E
    E --> F[Undici Dispatcher]
    F --> G[HTTP/2 Transport]
    G --> H[Sec-CH-UA-* Client Hints]
    G --> I[Chrome Headers]
    H --> J[Upstream Server]
    I --> J
    J --> K[HTML Response]
    K --> L[Readability Parser]
    L --> M[Markdown Output]
```

## Implementation Details

### User Agent String

The default user agent is a pinned Chrome 130 string. This can be overridden via the `MARKFETCH_USER_AGENT` environment variable, but must be a valid Chrome UA string.

| Environment Variable | Default Value | Purpose |
|---|---|---|
| `MARKFETCH_USER_AGENT` | Pinned Chrome 130 string | Override the browser fingerprint UA |

**Constraint**: The UA string must be a Chrome browser UA. Non-Chrome strings fail fast at startup because `Sec-CH-UA-*` client hints are derived from the UA at initialization time.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

### Client Hints Generation

When the server starts, markfetch parses the `MARKFETCH_USER_AGENT` value and derives `Sec-CH-UA-*` client hint headers from it. These hints are sent with every HTTP/2 request and include:

- `Sec-CH-UA` — Browser brand and version
- `Sec-CH-UA-Mobile` — Mobile indicator
- `Sec-CH-UA-Platform` — Operating system

```mermaid
graph LR
    A[MARKFETCH_USER_AGENT<br/>Chrome 130] --> B[Startup<br/>Initialization]
    B --> C[Sec-CH-UA Header<br/>Derived Value]
    B --> D[Sec-CH-UA-Mobile<br/>Derived Value]
    B --> E[Sec-CH-UA-Platform<br/>Derived Value]
    C --> F[Every HTTP/2<br/>Request]
    D --> F
    E --> F
```

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
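
Deriving the three hints from a Chrome UA string might look like the sketch below. It is an illustration, not the project's derivation: `deriveClientHints` is a hypothetical name, the platform detection is simplified, Chromium-based browsers such as Edge are not distinguished, and the `"Not?A_Brand"` entry is a placeholder because the exact placeholder brand Chrome sends varies by version.

```typescript
interface ClientHints {
  "Sec-CH-UA": string;
  "Sec-CH-UA-Mobile": string;
  "Sec-CH-UA-Platform": string;
}

// Fails fast on non-Chrome strings, since the hints cannot be derived
// without a Chrome major version.
function deriveClientHints(userAgent: string): ClientHints {
  const match = /Chrome\/(\d+)\./.exec(userAgent);
  if (!match) throw new Error("MARKFETCH_USER_AGENT must be a Chrome UA string");
  const major = match[1] ?? "";

  // Android is checked before the Linux fallback because Android UA strings
  // also contain "Linux".
  const platform = /Windows/.test(userAgent)
    ? "Windows"
    : /Macintosh|Mac OS X/.test(userAgent)
      ? "macOS"
      : /Android/.test(userAgent)
        ? "Android"
        : "Linux";

  return {
    // Placeholder brand list; the third entry is an assumption.
    "Sec-CH-UA": `"Chromium";v="${major}", "Google Chrome";v="${major}", "Not?A_Brand";v="99"`,
    "Sec-CH-UA-Mobile": /Mobile/.test(userAgent) ? "?1" : "?0",
    "Sec-CH-UA-Platform": `"${platform}"`,
  };
}
```

Computing the hints once at startup keeps the User-Agent and the `Sec-CH-UA-*` headers in agreement for every request, which is the coherence the fingerprint depends on.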

### HTTP/2 Transport

Markfetch uses the undici HTTP client library with HTTP/2 protocol support. The HTTP/2 transport is selected automatically by undici when the server supports it, enabling:

- Multiplexed requests over a single connection
- Header compression
- Server push capabilities

The combination of HTTP/2 transport and a coherent Chrome header set produces a request fingerprint that closely resembles an ordinary Chrome session.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

### Request Flow

```mermaid
sequenceDiagram
    participant Client
    participant Markfetch
    participant Undici
    participant Server

    Client->>Markfetch: fetch_markdown(url)
    Markfetch->>Markfetch: Validate MARKFETCH_USER_AGENT
    Markfetch->>Undici: Dispatch with Chrome headers
    Undici->>Server: GET /path over HTTP/2<br/>Sec-CH-UA: "Chromium"
    Undici->>Server: Sec-CH-UA-Mobile: ?0
    Undici->>Server: Sec-CH-UA-Platform: "Windows"
    Server->>Undici: HTTP/2 200 OK<br/>text/html
    Undici->>Markfetch: HTML Content
    Markfetch->>Markfetch: Apply Readability
    Markfetch->>Markfetch: Convert to Markdown
    Markfetch->>Client: Clean Markdown
```

## Configuration

### Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| `MARKFETCH_TIMEOUT_MS` | `30000` | Per-request timeout in milliseconds |
| `MARKFETCH_MAX_BYTES` | `5000000` | Cap on response body and extracted markdown |
| `MARKFETCH_USER_AGENT` | Pinned Chrome 130 | Browser fingerprint override |

### Validation

All environment variables are validated at startup. Invalid values cause the process to fail fast on stderr with descriptive error messages, rather than producing confusing per-request errors.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Integration Points

### MCP Adapter

The MCP server (`src/mcp.ts`) uses the core fetch pipeline, which includes the HTTP/2 fingerprinting. The tool description presents the tool to clients as follows:

> Fetch a single public HTTP/S URL and return its main article content as clean markdown. Best for articles, documentation, blog posts, news, and reference pages. Non-HTML responses return `unsupported_content_type`.

Source: [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)

### CLI Adapter

The CLI adapter (`src/cli.ts`) also uses the same core fetch pipeline, ensuring consistent HTTP/2 fingerprinting behavior whether invoked via MCP or command line:

```bash
markfetch https://en.wikipedia.org/wiki/Markdown
```

Source: [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)

## Version History

| Version | Date | Change |
|---|---|---|
| 0.4.0 | 2026-05-10 | HTTP/2 fingerprinting feature added with Sec-CH-UA-* client hints |
| 0.5.0 | 2026-05-12 | CLI mode added with same fingerprinting behavior |
| 0.6.0 | Current | Enhanced write sandbox and validation |

Source: [CHANGELOG.md](https://github.com/vasylenko/markfetch/blob/main/CHANGELOG.md)

## Limitations

### SPA Handling

Pure client-rendered Single Page Applications (SPAs) with no static HTML content return `extraction_failed`. Sites that ship server-rendered or SEO-prerendered HTML will extract whatever static content they expose, including when accessed with Chrome fingerprints.

### Authentication

Markfetch performs anonymous fetches only — no cookie jar, no auth headers, no session reuse. Pages behind login walls return whatever the public response is, usually surfaced as `http_error`.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Security Considerations

Because the HTTP/2 fingerprinting approach makes requests look like ordinary browser traffic, responsibility for appropriate use rests with the operator. The documentation explicitly states:

> Use it on URLs whose targets you have permission to fetch, and respect the terms of service of any site you query. The maintainer assumes no liability for misuse.

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

---

<a id='cli-usage'></a>

## CLI Usage

### Related Pages

Related topics: [Quick Start Guide](#quickstart), [MCP Server Integration](#mcp-server), [Write Sandbox Security](#write-sandbox)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)
- [src/index.ts](https://github.com/vasylenko/markfetch/blob/main/src/index.ts)
- [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
- [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
- [package.json](https://github.com/vasylenko/markfetch/blob/main/package.json)
- [CHANGELOG.md](https://github.com/vasylenko/markfetch/blob/main/CHANGELOG.md)
</details>

# CLI Usage

The markfetch CLI provides a command-line interface for fetching URLs and converting their content to clean markdown. It operates as one of two execution surfaces—the other being the MCP (Model Context Protocol) stdio server—with both sharing the same underlying core pipeline.

## Overview

The CLI accepts a URL as its primary argument and outputs the converted markdown to stdout or to a specified file. It was introduced in version 0.5.0 as a way to make markfetch accessible from standard shell environments, pipelines, and scripts.

| Aspect | Details |
|--------|---------|
| Entry Point | `markfetch <url>` |
| Output | stdout (default) or file via `-o` |
| Version | 0.6.0 |
| Runtime | Node.js ≥ 24 |
| Distribution | npm package |

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Architecture

The CLI is implemented as an adapter layer that delegates to the shared core. When the process is invoked with arguments, the dispatcher in `index.ts` lazy-loads the CLI adapter; bare invocation (zero arguments) routes to the MCP server instead.

```mermaid
graph TD
    A["markfetch CLI Invocation<br/>process.argv.length > 2"] --> B["src/index.ts<br/>Dispatcher"]
    B --> C["src/cli.ts<br/>CLI Adapter"]
    C --> D["src/core.ts<br/>fetchMarkdown()"]
    D --> E["src/sandbox.ts<br/>Write Validation"]
    D --> F["HTTP Fetch + Readability + Turndown"]
    
    G["Bare Invocation<br/>process.argv.length === 2"] --> H["src/mcp.ts<br/>MCP Server"]
```

Source: [src/cli.ts:39-47](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)

## Command Syntax

```bash
markfetch <url> [options]
```

### Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `<url>` | Yes | Absolute http(s) URL to fetch |

### Options

| Flag | Description |
|------|-------------|
| `-o, --output <path>` | Save markdown to a file (absolute or relative path). Default is stdout. |
| `-V, --version` | Print version and exit |
| `-h, --help` | Print usage and exit |

Source: [src/cli.ts:23-30](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)

## Output Behavior

The CLI maintains strict separation between its output channels:

| Scenario | Channel | Content |
|----------|---------|---------|
| Raw markdown (no `-o`) | stdout | Raw markdown body via `process.stdout.write()` |
| File output (`-o`) | stdout | Confirmation: `Saved N bytes to <path>` |
| Any error | stderr | `[code] message` |

The raw markdown is written using `process.stdout.write()` rather than `console.log()` to preserve trailing whitespace in the output—matching the exact bytes the MCP adapter would emit in `content[0].text`.

Source: [src/cli.ts:50-58](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)

## Error Handling

Errors are written to stderr with a deterministic format: `[code] message`. The process exits with a non-zero status code.

```typescript
process.exitCode = 1;
console.error(`[${code}] ${message}`);
```

The CLI uses `process.exitCode` (not `process.exit()`) to ensure pending output drains before the process exits—important when stdout is piped to a slow consumer.

Source: [src/cli.ts:58-62](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)

### Error Codes

| Code | Meaning |
|------|---------|
| `network_error` | DNS / TCP / TLS failure |
| `http_error` | Upstream returned a non-2xx status |
| `timeout` | Request exceeded `MARKFETCH_TIMEOUT_MS` |
| `unsupported_content_type` | Response was not HTML |
| `extraction_failed` | No extractable article content |
| `too_large` | Response or markdown exceeded `MARKFETCH_MAX_BYTES` |
| `save_failed` | File write failed (permission denied, etc.) |

Note: `save_forbidden` is MCP-only and does not apply to CLI (no sandbox).

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Path Resolution

The CLI resolves relative output paths against the current working directory before passing them to the core:

```typescript
const savePath = options.output
  ? resolve(process.cwd(), options.output)
  : undefined;
```

Tilde expansion is intentionally **not** performed—the shell expands `~/foo` before argv reaches the process, and a quoted literal `'~/foo'` should produce a file named `~/foo` in cwd (standard tool behavior).

Source: [src/cli.ts:32-39](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)
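The resolution rule described above can be illustrated with a minimal sketch that uses only `node:path`. The helper name `resolveOutputPath` and its optional `cwd` parameter are hypothetical and are not taken from `src/cli.ts`.

```typescript
import { resolve } from "node:path";

// Hypothetical helper (not from src/cli.ts): mirrors the resolution rule described above.
function resolveOutputPath(
  output: string | undefined,
  cwd: string = process.cwd(),
): string | undefined {
  // No -o flag: markdown goes to stdout, so there is no path to resolve.
  if (output === undefined) return undefined;
  // path.resolve leaves absolute paths unchanged and anchors relative ones at cwd.
  // It performs no tilde expansion, so "~/foo" is treated as a relative path.
  return resolve(cwd, output);
}

console.log(resolveOutputPath("notes/out.md")); // <cwd>/notes/out.md
console.log(resolveOutputPath("~/foo"));        // <cwd>/~/foo, a directory literally named "~"
```

An unquoted `~/foo` never reaches this code as a tilde, because the shell has already replaced it with the home directory.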

## Environment Variables

These environment variables apply to both CLI and MCP modes:

| Variable | Default | Purpose |
|----------|---------|---------|
| `MARKFETCH_TIMEOUT_MS` | `30000` | Per-request timeout in ms |
| `MARKFETCH_MAX_BYTES` | `5000000` | Cap on response body and extracted markdown |
| `MARKFETCH_USER_AGENT` | Chrome 130 string | Override User-Agent header |

The CLI adapter imports `fetchMarkdown` and `classifyError` from the core module, which validates these environment variables at startup.

Source: [src/cli.ts:15](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts) and [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

## File Structure

The project source is organized into adapter modules:

```
src/
├── index.ts    # Dispatcher (lazy-loads cli.ts or mcp.ts)
├── core.ts     # Shared pipeline and errors
├── cli.ts      # CLI adapter (commander-based)
└── mcp.ts      # MCP stdio server adapter
```

The lazy-import pattern ensures that `cli.ts` code (which calls `console.log`) is never loaded when running in MCP mode, preserving the "stdout is reserved for MCP frames" invariant structurally.

Source: [CHANGELOG.md](https://github.com/vasylenko/markfetch/blob/main/CHANGELOG.md) and [src/cli.ts:1-13](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)

## Installation

Install globally via npm:

```bash
npm i -g markfetch
```

Or use via npx without installation:

```bash
npx -y markfetch <url>
```

The `bin` entry in `package.json` points to `dist/index.js`:

```json
{
  "bin": {
    "markfetch": "dist/index.js"
  }
}
```

Source: [package.json:16-18](https://github.com/vasylenko/markfetch/blob/main/package.json)

## Usage Examples

### Basic fetch to stdout

```bash
markfetch https://en.wikipedia.org/wiki/Markdown
```

### Save to file

```bash
markfetch https://example.com/article -o output.md
```

### With timeout override

```bash
MARKFETCH_TIMEOUT_MS=60000 markfetch https://slow-site.example.com
```

### Pipeline to another tool

```bash
markfetch https://example.com/doc | grep -A5 "## Installation"
```

Source: [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)

---

<a id='mcp-server'></a>

## MCP Server Integration

### Related Pages

Related topics: [Quick Start Guide](#quickstart), [CLI Usage](#cli-usage), [Write Sandbox Security](#write-sandbox)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)
- [src/index.ts](https://github.com/vasylenko/markfetch/blob/main/src/index.ts)
- [package.json](https://github.com/vasylenko/markfetch/blob/main/package.json)
- [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)
- [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
</details>

# MCP Server Integration

## Overview

The MCP (Model Context Protocol) Server Integration is the primary interface for AI agents to fetch web content as clean markdown. Markfetch exposes a single MCP tool `fetch_markdown` that accepts a URL and returns extracted markdown content, enabling language models like Claude to access web information through a standardized protocol.

The MCP server operates as a stdio-based server, meaning it communicates exclusively through standard input and standard output streams. This design ensures the server integrates seamlessly with MCP clients including Claude Desktop, Claude Code, Cursor, and Goose.

## Architecture

### Entry Point Dispatcher

The `src/index.ts` file implements an argv-discriminated dispatcher that determines whether to start the MCP server or the CLI based on the presence of command-line arguments:

```typescript
if (process.argv.length === 2) {
  await import("./mcp.js");
} else {
  await import("./cli.js");
}
```

**Source: [src/index.ts:26-29](https://github.com/vasylenko/markfetch/blob/main/src/index.ts)**

When `process.argv.length === 2`, the process was invoked without arguments—this is the standard pattern MCP clients use when spawning a server. Any extra argument (URL, flags, `--help`) routes to the CLI adapter.

### Module Isolation

The dynamic import pattern ensures complete module isolation:

```mermaid
graph TD
    A[markfetch entry] --> B{argv.length === 2?}
    B -->|Yes| C[Lazy import: mcp.ts]
    B -->|No| D[Lazy import: cli.ts]
    C --> E["@modelcontextprotocol/sdk loaded"]
    D --> F[commander loaded]
    E -.-> G[Never reaches console.log]
    F -.-> H[Can use console.log]
```

**Source: [src/index.ts:18-22](https://github.com/vasylenko/markfetch/blob/main/src/index.ts)**

This architecture enforces the "stdout is reserved for MCP frames" invariant structurally—the MCP path never imports `cli.ts`, so code that calls `console.log` is literally unreachable from the MCP execution path.

## MCP Server Implementation

### Server Initialization

The MCP server is initialized using the `@modelcontextprotocol/sdk` package:

```typescript
const server = new McpServer({ name: "markfetch", version: "0.6.0" });
```

**Source: [src/mcp.ts:20](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)**

### Tool Registration

The server registers a single tool `fetch_markdown` with a Zod-based input schema:

```typescript
server.registerTool(
  "fetch_markdown",
  {
    description: "Fetch a single public HTTP/S URL and return its main article content as clean markdown...",
    inputSchema: {
      url: z.string().url().describe("Absolute http(s) URL of the page to fetch..."),
      savePath: z.string().refine(isAbsolute, "savePath must be an absolute filesystem path").optional().describe("Optional. When provided...")
    }
  },
  async ({ url, savePath }) => {
    // Implementation
  }
);
```

**Source: [src/mcp.ts:22-47](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)**

### Tool Input Schema

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | Absolute http(s) URL of the page to fetch. The server follows redirects automatically. No authentication headers, cookies, or session state are sent. |
| `savePath` | string | No | Optional absolute filesystem path. When provided, the fetched markdown is written to this path instead of returned inline. |

The `url` parameter is validated using Zod's `.url()` method to ensure a valid URL format. The `savePath` parameter must be an absolute path, enforced by the `.refine(isAbsolute, ...)` check.

### Response Format

The tool returns a response in this structure:

```typescript
{
  content: [{ type: "text", text: "markdown content or [errorcode] message" }],
  isError: boolean
}
```

**Source: [src/mcp.ts:8-12](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)**

## Error Handling

### Error Code System

The MCP adapter uses a uniform error code system with 8 deterministic codes:

| Error Code | Description | Source |
|------------|-------------|--------|
| `network_error` | DNS/TCP/TLS failure or unexpected internal error | core.ts |
| `http_error` | Upstream returned non-2xx status | core.ts |
| `timeout` | Per-request budget exceeded | core.ts |
| `unsupported_content_type` | Response was not text/html or application/xhtml+xml | core.ts |
| `extraction_failed` | Readability returned no article content | core.ts |
| `too_large` | Response or markdown exceeded MARKFETCH_MAX_BYTES | core.ts |
| `save_failed` | writeFile failed (permission denied, missing directory) | core.ts |
| `save_forbidden` | savePath resolves outside allowed write roots | src/mcp.ts |

### Error Result Factory

```typescript
function errorResult(code: ErrorCode, message: string) {
  return {
    content: [{ type: "text" as const, text: `[${code}] ${message}` }],
    isError: true,
  };
}
```

**Source: [src/mcp.ts:8-12](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)**

### Error Propagation Pattern

In version 0.5.0, error handling was refactored so that core functions now `throw MarkfetchError` instead of returning error results inline. Both the MCP and CLI adapters catch these exceptions and convert them to their respective output formats.

**Source: [CHANGELOG.md:19-21](https://github.com/vasylenko/markfetch/blob/main/CHANGELOG.md)**
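The throw-and-convert pattern described above can be sketched as follows. Only the name `MarkfetchError` and the `[code] message` format come from the source. The class shape and the helpers `classify`, `toMcpResult`, and `toCliLine` are assumptions made for illustration, not the project's actual code.

```typescript
// The eight codes listed in the error code table.
type ErrorCode =
  | "network_error" | "http_error" | "timeout" | "unsupported_content_type"
  | "extraction_failed" | "too_large" | "save_failed" | "save_forbidden";

// Assumed shape: the source only states that core functions throw MarkfetchError.
class MarkfetchError extends Error {
  readonly code: ErrorCode;
  constructor(code: ErrorCode, message: string) {
    super(message);
    this.name = "MarkfetchError";
    this.code = code;
  }
}

// Hypothetical classifier: anything that is not a MarkfetchError falls back to
// network_error, matching the "unexpected internal error" wording in the table.
function classify(err: unknown): { code: ErrorCode; message: string } {
  if (err instanceof MarkfetchError) return { code: err.code, message: err.message };
  return { code: "network_error", message: err instanceof Error ? err.message : String(err) };
}

// MCP adapter side: the error becomes a tool result with isError set.
function toMcpResult(err: unknown) {
  const { code, message } = classify(err);
  return { content: [{ type: "text" as const, text: `[${code}] ${message}` }], isError: true };
}

// CLI adapter side: the same text is written to stderr as a single line.
function toCliLine(err: unknown): string {
  const { code, message } = classify(err);
  return `[${code}] ${message}`;
}

try {
  throw new MarkfetchError("http_error", "upstream returned 404");
} catch (err) {
  console.error(toCliLine(err)); // [http_error] upstream returned 404
}
```

Keeping classification in one place means both adapters emit byte-identical error text, which is the property the uniform code system is meant to provide.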

## Write Sandbox (MCP-Specific)

The MCP server implements a write sandbox that restricts `savePath` operations to a set of allowed root directories.

### Default Allowed Roots

By default, the allowed set is:
- `os.tmpdir()` (system temp directory)
- `process.cwd()` (current working directory)

Each path is resolved via `fs.realpath` at startup to handle symlinks.

### Configuration

The `MARKFETCH_ALLOWED_WRITE_ROOTS` environment variable overrides the default set entirely:

```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
      }
    }
  }
}
```

**Source: [README.md:89-100](https://github.com/vasylenko/markfetch/blob/main/README.md)**

### Security Rationale

The sandbox is MCP-only by design. The CLI is unrestricted because "a human at the shell is the security boundary." The asymmetry exists because the MCP tool is driven by a language model, which may be steered by content from a page it just fetched.

**Source: [README.md:102-104](https://github.com/vasylenko/markfetch/blob/main/README.md)**

## Request Flow

```mermaid
sequenceDiagram
    participant Client as MCP Client
    participant MCP as MCP Server
    participant Core as fetchMarkdown()
    participant Fetch as HTTP Fetcher

    Client->>MCP: fetch_markdown({url, savePath?})
    MCP->>Core: fetchMarkdown({url, savePath})
    Core->>Fetch: GET url (with Chrome fingerprint)
    Fetch-->>Core: HTML response
    Core->>Core: Readability parsing
    Core->>Core: Turndown conversion
    alt savePath provided
        Core->>Core: Write to file (within sandbox)
    end
    Core-->>MCP: {markdown, bytes, savedTo?}
    MCP-->>Client: {content: [{text: markdown}], isError: false}
```

## Environment Configuration

| Variable | Default | Purpose | MCP-Specific |
|----------|---------|---------|--------------|
| `MARKFETCH_TIMEOUT_MS` | `30000` | Per-request timeout in ms | No |
| `MARKFETCH_MAX_BYTES` | `5000000` | Cap on response body and extracted markdown | No |
| `MARKFETCH_USER_AGENT` | Chrome 130 string | Override the User-Agent header | No |
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | `os.tmpdir()` + `process.cwd()` | Permitted write roots for savePath | **Yes** |

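**Source: [src/mcp.ts:1-5](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts), [README.md:68-75](https://github.com/vasylenko/markfetch/blob/main/README.md)**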

## Integration with Clients

### Claude Desktop / Claude Code

```bash
claude mcp add --scope user markfetch -- npx -y markfetch
```

**Source: [README.md:40-43](https://github.com/vasylenko/markfetch/blob/main/README.md)**

### Codex

```bash
codex mcp add markfetch -- npx -y markfetch
```

**Source: [README.md:46-48](https://github.com/vasylenko/markfetch/blob/main/README.md)**

### Manual Configuration

```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}
```

**Source: [README.md:52-58](https://github.com/vasylenko/markfetch/blob/main/README.md)**

## Dependencies

The MCP server depends on:

| Package | Version | Purpose |
|---------|---------|---------|
| `@modelcontextprotocol/sdk` | ^1.29.0 | MCP protocol implementation |
| `zod` | ^3.0.0 | Input schema validation |
| `@mozilla/readability` | ^0.5.0 | Article extraction |
| `turndown` | ^7.0.0 | HTML to Markdown conversion |
| `undici` | ^8.2.0 | HTTP client |
| `linkedom` | ^0.18.0 | DOM parsing |

**Source: [package.json:36-47](https://github.com/vasylenko/markfetch/blob/main/package.json)**

---

<a id='environment-variables'></a>

## Environment Variables

### Related Pages

Related topics: [HTTP/2 Fingerprinting](#http-fingerprinting), [Write Sandbox Security](#write-sandbox), [Error Handling](#error-handling)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
- [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)
- [src/sandbox.ts](https://github.com/vasylenko/markfetch/blob/main/src/sandbox.ts)
- [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
</details>

# Environment Variables

markfetch uses environment variables to configure runtime behavior at startup. These variables control network timeouts, response size limits, HTTP fingerprinting, and file write permissions for the MCP server.

## Overview

Environment variables in markfetch serve as the primary configuration mechanism. Unlike per-request options, these settings apply globally to every operation and are validated once at process startup. This fail-fast design prevents misconfiguration from producing confusing per-request errors later.

```mermaid
graph TD
    A[Process Start] --> B[Validate MARKFETCH_TIMEOUT_MS]
    A --> C[Validate MARKFETCH_MAX_BYTES]
    A --> D[Validate MARKFETCH_USER_AGENT]
    A --> E[Build MARKFETCH_ALLOWED_WRITE_ROOTS]
    B --> F{Valid?}
    C --> F
    D --> F
    E --> F
    F -->|Yes| G[Server Ready]
    F -->|No| H[Exit with stderr error]
```

All validation occurs before the server begins accepting requests. Invalid values cause immediate process termination with a descriptive error message written to stderr.

## Configuration Variables

### MARKFETCH_TIMEOUT_MS

| Property | Value |
|----------|-------|
| Default | `30000` (30 seconds) |
| Purpose | Per-request timeout in milliseconds |
| Type | Positive integer |

Controls the maximum duration allowed for any single HTTP request, including DNS resolution, TCP connection, TLS handshake, and response body transfer.

```typescript
const config = {
  timeoutMs: intEnv("MARKFETCH_TIMEOUT_MS", 30_000),
};
```

Validation rejects non-positive integers, non-integer values, and non-finite numbers (NaN, Infinity). A malformed value produces:

```
[core] Error: Invalid MARKFETCH_TIMEOUT_MS="abc" — expected a positive integer.
```

Source: [src/core.ts:1-50](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
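The validation rules above can be sketched with a small helper. The name `intEnv` appears in the excerpt, but the body below is an assumption rather than the code in `src/core.ts`. The injectable `env` parameter, the handling of empty strings, and the exact error wording are all hypothetical.

```typescript
// Hypothetical reimplementation of an intEnv-style helper; the real one lives in src/core.ts.
function intEnv(
  name: string,
  fallback: number,
  env: Record<string, string | undefined> = process.env,
): number {
  const raw = env[name];
  // Assumption: an unset or empty variable means "use the documented default".
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  // Number.isInteger is false for NaN, Infinity, and fractional values.
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`Invalid ${name}=${JSON.stringify(raw)}; expected a positive integer.`);
  }
  return value;
}

// Evaluated once at module load, so a bad value stops the process before any request runs.
const timeoutMs = intEnv("MARKFETCH_TIMEOUT_MS", 30_000);
console.log(timeoutMs);
```

Calling the helper at module scope is what turns a typo in the environment into a startup failure instead of a per-request surprise.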

### MARKFETCH_MAX_BYTES

| Property | Value |
|----------|-------|
| Default | `5000000` (5 MB, about 4.77 MiB) |
| Purpose | Cap on response body and extracted markdown |
| Type | Positive integer |

Both the raw HTTP response body and the final extracted markdown are checked against this limit. If either exceeds the cap, the operation returns a `too_large` error.

```typescript
const config = {
  maxBytes: intEnv("MARKFETCH_MAX_BYTES", 5_000_000),
};
```

Source: [src/core.ts:1-50](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
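Since the cap is expressed in bytes, a check on the extracted markdown has to measure encoded size, not string length. The sketch below assumes UTF-8 encoding. The helper name `exceedsCap` is hypothetical and is not part of markfetch.

```typescript
// Hypothetical helper: reports whether a string is larger than the byte cap.
function exceedsCap(text: string, maxBytes: number): boolean {
  // Measure UTF-8 bytes, not UTF-16 code units: "é" is one code unit but two bytes.
  return Buffer.byteLength(text, "utf8") > maxBytes;
}

console.log(exceedsCap("hello", 5)); // false: 5 bytes
console.log(exceedsCap("héllo", 5)); // true: 6 bytes
```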

### MARKFETCH_USER_AGENT

| Property | Value |
|----------|-------|
| Default | `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36` |
| Purpose | HTTP User-Agent header and Sec-CH-UA-* client hints |
| Type | String (must contain a `Chrome/<major version>` token) |

The User-Agent string determines both the HTTP header sent to servers and the derived Sec-CH-UA-* client hints. The hints are derived at startup and remain fixed for the process lifetime.

```mermaid
graph LR
    A[MARKFETCH_USER_AGENT] --> B[deriveClientHints]
    B --> C[Sec-CH-UA]
    B --> D[Sec-CH-UA-Mobile]
    B --> E[Sec-CH-UA-Platform]
    A --> F[User-Agent Header]
```

```typescript
function deriveClientHints(ua: string): {
  brands: string;
  mobile: string;
  platform: string;
} {
  const versionMatch = /\bChrome\/(\d+)/.exec(ua);
  if (!versionMatch) {
    throw new Error(
      `Invalid MARKFETCH_USER_AGENT=${JSON.stringify(ua)} — expected a Chrome User-Agent containing "Chrome/..."`
    );
  }
  // ...
}
```

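The UA must contain a Chrome version string. Non-Chrome UAs fail fast at startup, because a User-Agent that disagrees with the derived client hints would increase the risk of bot detection.

Source: [src/core.ts:1-50](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)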
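The excerpt above elides the derivation itself. One plausible way to fill it in is sketched below. The function name `deriveClientHintsSketch`, the brand list layout (including the `Not?A_Brand` placeholder entry), and the platform matching rules are assumptions for illustration and may not match `src/core.ts`.

```typescript
interface ClientHints {
  brands: string;   // value for Sec-CH-UA
  mobile: string;   // value for Sec-CH-UA-Mobile
  platform: string; // value for Sec-CH-UA-Platform
}

// Sketch only: derives the three hint values from a Chrome User-Agent string.
function deriveClientHintsSketch(ua: string): ClientHints {
  const versionMatch = /\bChrome\/(\d+)/.exec(ua);
  if (!versionMatch) {
    throw new Error(`Invalid MARKFETCH_USER_AGENT=${JSON.stringify(ua)}`);
  }
  const major = versionMatch[1];
  // Assumed layout: real Chrome also emits a placeholder brand whose text varies by version.
  const brands = `"Chromium";v="${major}", "Google Chrome";v="${major}", "Not?A_Brand";v="99"`;
  // Structured-field booleans: ?1 is true, ?0 is false.
  const mobile = /\bMobile\b/.test(ua) ? "?1" : "?0";
  // Android is tested before Linux because Android UAs contain both tokens.
  let platform = '"Unknown"';
  if (/Android/.test(ua)) platform = '"Android"';
  else if (/Windows NT/.test(ua)) platform = '"Windows"';
  else if (/Macintosh|Mac OS X/.test(ua)) platform = '"macOS"';
  else if (/CrOS/.test(ua)) platform = '"Chrome OS"';
  else if (/Linux/.test(ua)) platform = '"Linux"';
  return { brands, mobile, platform };
}

const hints = deriveClientHintsSketch(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
);
console.log(hints.mobile, hints.platform); // ?0 "Windows"
```

Deriving all three values from the one User-Agent string keeps the header set internally consistent, which is the stated reason for rejecting non-Chrome values.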

## Write Sandbox (MCP-Only)

### MARKFETCH_ALLOWED_WRITE_ROOTS

| Property | Value |
|----------|-------|
| Default | `os.tmpdir() ∪ process.cwd()` |
| Purpose | Restrict MCP `savePath` writes to specific directories |
| Type | Platform-delimiter-separated absolute paths |
| Platform | POSIX: `:` delimiter; Windows: `;` delimiter |
| Mode | MCP-only (CLI has no sandbox) |

This variable applies exclusively to the MCP server mode. The CLI operates without restriction, treating the human at the shell as the security boundary.

```mermaid
graph TD
    A[MCP savePath request] --> B{Path inside allowed roots?}
    B -->|Yes| C[Write file]
    B -->|No| D[Return save_forbidden error]
    C --> E[Confirmation to client]
    D --> F[No file created]
```

When set, the value **replaces** the defaults entirely rather than merging with them. To retain access to the default directories, include them explicitly:

```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
      }
    }
  }
}
```

On Windows:

```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "C:\\Users\\me\\markfetch-out;C:\\Users\\me\\AppData\\Local\\Temp"
      }
    }
  }
}
```

### Validation Rules

Each entry in the list must be:

1. An absolute path (relative paths fail fast)
2. An existing directory at startup
3. Resolved through symlinks for containment checks

```typescript
function buildAllowedRoots(envValue?: string): string[] {
  // ...
}
```

Symlinks pointing outside the sandbox are blocked. The canonicalized path flows from the containment check into `writeFile`, ensuring the file is created exactly at the validated location.

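Source: [src/sandbox.ts:1-50](https://github.com/vasylenko/markfetch/blob/main/src/sandbox.ts), [src/mcp.ts:1-50](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)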

## Error Codes

When environment variable validation fails, markfetch writes to stderr and exits with a non-zero status:

| Failure Type | Trigger | Process Outcome |
|--------------|---------|-----------------|
| Startup failure | Invalid `MARKFETCH_TIMEOUT_MS` | Exits non-zero |
| Startup failure | Invalid `MARKFETCH_MAX_BYTES` | Exits non-zero |
| Startup failure | Non-Chrome `MARKFETCH_USER_AGENT` | Exits non-zero |
| Startup failure | Malformed `MARKFETCH_ALLOWED_WRITE_ROOTS` | Exits non-zero |
| Request-scoped error | `save_forbidden` (MCP only) | Keeps running; error returned to the client with `isError: true` |

Startup failures caused by invalid environment values (e.g., `MARKFETCH_TIMEOUT_MS="abc"`) differ from request-scoped errors such as `http_error` or `timeout`: environment misconfiguration is always fatal at startup, whereas request-scoped errors are reported per call.

## Environment Variable Summary

| Variable | Default | Scope | Purpose |
|----------|---------|-------|---------|
| `MARKFETCH_TIMEOUT_MS` | `30000` | Both | Request timeout in ms |
| `MARKFETCH_MAX_BYTES` | `5000000` | Both | Response and markdown size cap |
| `MARKFETCH_USER_AGENT` | Chrome 130 string | Both | HTTP fingerprint |
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | tmpdir + cwd | MCP only | Write sandbox boundaries |

## Configuration Priority

Environment variables set at process startup take precedence over all other configuration. There is no runtime override mechanism—changing these values requires restarting the server.

```mermaid
graph TD
    A[Environment Variable] --> B[Validated at Startup]
    B --> C[Stored in config object]
    C --> D[Used by core.ts pipeline]
    D --> E[HTTP Request]
    D --> F[File Write]
    D --> G[Response Validation]
```

## Security Considerations

The write sandbox exists because the MCP tool is driven by a language model, which may be steered by content from a page it just fetched. Without sandboxing, a malicious page could induce the model to request writes outside the expected directories.

The CLI intentionally has no sandbox—direct human invocation at the shell establishes the trust boundary.

Source: [README.md:1-100](https://github.com/vasylenko/markfetch/blob/main/README.md)

---

<a id='write-sandbox'></a>

## Write Sandbox Security

### Related Pages

Related topics: [MCP Server Integration](#mcp-server), [Environment Variables](#environment-variables), [Error Handling](#error-handling)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [src/sandbox.ts](https://github.com/vasylenko/markfetch/blob/main/src/sandbox.ts)
- [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)
- [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)
- [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
- [CHANGELOG.md](https://github.com/vasylenko/markfetch/blob/main/CHANGELOG.md)
- [package.json](https://github.com/vasylenko/markfetch/blob/main/package.json)
</details>

# Write Sandbox Security

## Overview

The Write Sandbox is a security mechanism in markfetch that restricts filesystem writes initiated via the MCP (Model Context Protocol) interface to a configurable set of allowed root directories. This protection prevents a language model, which may be influenced by fetched content, from writing files to arbitrary locations on the host system.

The sandbox enforces path containment by resolving symlinks and comparing canonicalized paths against the configured allowed roots. Any attempted write outside the sandbox boundary returns a `save_forbidden` error and the file is never created.

## Purpose and Scope

### Security Boundary

The sandbox exists because MCP tools are driven by a language model that can be steered by content from pages it fetches. Without containment:

- A malicious or compromised webpage could instruct the LLM to write files to sensitive locations (e.g., `~/.ssh/authorized_keys`, `~/.bashrc`)
- Path traversal attempts via symlinks could escape expected boundaries
- Untrusted fetched content could modify configuration files or inject malicious code

The CLI mode intentionally has **no sandbox**. A human at the shell is considered the security boundary, as the user has direct control over command invocation and can review output before it reaches any model.

### Scope Limitations

| Scope | Sandboxed? |
|-------|------------|
| MCP server (`fetch_markdown` tool) | Yes |
| CLI mode (`markfetch <url>`) | No |
| Direct `node` execution | No |

Source: [README.md:68-70](https://github.com/vasylenko/markfetch/blob/main/README.md)

## Configuration

### Environment Variable

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | String | `os.tmpdir()` + `process.cwd()` | Path-delimiter-separated list of absolute paths permitted as MCP `savePath` write roots |

### Path Delimiters

The delimiter varies by platform:

| Platform | Delimiter | Example |
|----------|-----------|---------|
| POSIX (Linux, macOS) | `:` | `/tmp:/home/user/markfetch-out` |
| Windows | `;` | `C:\Users\me\markfetch-out;C:\Temp` |

### Behavior Rules

1. **Replacement, not merge**: When set, the variable replaces the defaults entirely. To retain access to `os.tmpdir()` or `process.cwd()`, explicitly include them.

2. **Validation at startup**: Malformed values (non-absolute entries, nonexistent directories) cause the server to fail fast on stderr.

3. **Realpath resolution**: Each root is resolved once via `fs.realpath` at startup to canonicalize symlinks.

Source: [README.md:71-89](https://github.com/vasylenko/markfetch/blob/main/README.md)
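The three behavior rules above can be sketched with Node's standard library. This is an illustration under stated assumptions, not the code in `src/sandbox.ts`: the name `buildAllowedRootsSketch`, the error messages, and the decision to skip empty list entries are all hypothetical.

```typescript
import { realpathSync, statSync } from "node:fs";
import { tmpdir } from "node:os";
import { delimiter, isAbsolute } from "node:path";

// Sketch only: the real buildAllowedRoots may differ in signature and messages.
function buildAllowedRootsSketch(
  env: Record<string, string | undefined> = process.env,
): string[] {
  const raw = env["MARKFETCH_ALLOWED_WRITE_ROOTS"];
  // Unset: use the documented defaults. Set: the value replaces them entirely.
  // path.delimiter is ":" on POSIX and ";" on Windows.
  const entries = raw === undefined
    ? [tmpdir(), process.cwd()]
    : raw.split(delimiter).filter((entry) => entry.length > 0); // assumption: skip empty entries
  if (entries.length === 0) {
    throw new Error("MARKFETCH_ALLOWED_WRITE_ROOTS lists no directories");
  }
  return entries.map((entry) => {
    if (!isAbsolute(entry)) {
      throw new Error(`Allowed write root must be an absolute path: ${entry}`);
    }
    // realpathSync throws when the path does not exist, which gives fail-fast behavior.
    const canonical = realpathSync(entry);
    if (!statSync(canonical).isDirectory()) {
      throw new Error(`Allowed write root is not a directory: ${entry}`);
    }
    return canonical;
  });
}

console.log(buildAllowedRootsSketch({})); // canonical forms of os.tmpdir() and process.cwd()
```

Canonicalizing the roots once at startup means every later containment check compares like with like, even when the temp directory is itself a symlink.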

### Configuration Example

**POSIX:**
```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
      }
    }
  }
}
```

**Windows:**
```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "C:\\Users\\me\\markfetch-out;C:\\Users\\me\\AppData\\Local\\Temp"
      }
    }
  }
}
```

## Security Model

### Path Resolution Flow

```mermaid
graph TD
    A[User provides savePath] --> B{Is path absolute?}
    B -->|No| E[Error: savePath must be absolute]
    B -->|Yes| C[Resolve via fs.realpath]
    C --> D{Is resolved path inside allowed roots?}
    D -->|Yes| F[Allow write to resolved path]
    D -->|No| G[Return save_forbidden error]
    
    H[Allowed roots from env] --> I[Realpath-resolved at startup]
    I --> D
```

### Symlink Handling

The sandbox protects against symlink-based escapes:

1. **Resolve before check**: Symlinks are resolved via `fs.realpath` before containment validation
2. **Reuse the canonical path at write time**: The canonicalized path from the validation check flows directly into `writeFile`
3. **No lexical comparison**: A path like `<sandbox>/link/..` is not compared lexically against the roots—it's resolved first, then validated

This prevents attacks where a symlink planted inside the sandbox points outside, collapsing lexically for the check but resolving to an external location at write time.

Source: [CHANGELOG.md:17-25](https://github.com/vasylenko/markfetch/blob/main/CHANGELOG.md)

### Platform-Specific Behaviors

| Platform | Case Sensitivity | Notes |
|----------|------------------|-------|
| Linux/macOS | Case-sensitive | Paths must match exactly |
| Windows | Case-insensitive | `C:\Users\Bob` and `c:\users\bob` are equivalent |

On Windows, the containment check lowercases both the root and target paths before comparison.

Source: [src/sandbox.ts:28-30](https://github.com/vasylenko/markfetch/blob/main/src/sandbox.ts)

## Core Implementation

### API Design

The sandbox module exposes two primary functions:

```typescript
function buildAllowedRoots(env: Record<string, string | undefined>): string[]
function validateSavePath(
  savePath: string,
  roots: string[]
): { ok: boolean; resolved?: string; reason?: string }
```

### `buildAllowedRoots()`

Parses `MARKFETCH_ALLOWED_WRITE_ROOTS` from environment variables:

| Parameter | Type | Description |
|-----------|------|-------------|
| `env` | `Record<string, string \| undefined>` | Process environment variables |

| Return Type | Description |
|-------------|-------------|
| `string[]` | Array of absolute, realpath-resolved directory paths |

**Logic:**
1. If `MARKFETCH_ALLOWED_WRITE_ROOTS` is unset: return `[os.tmpdir(), process.cwd()]`
2. If set: split by platform delimiter, validate each is absolute and exists
3. Resolve each via `fs.realpath` for canonical form
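The logic above can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/sandbox.ts`: the name `buildAllowedRootsSketch`, the treatment of an empty value as unset, and the error wording are all hypothetical.

```typescript
import { realpathSync } from "node:fs";
import { tmpdir } from "node:os";
import { delimiter, isAbsolute } from "node:path";

function buildAllowedRootsSketch(
  env: Record<string, string | undefined>,
): string[] {
  const raw = env.MARKFETCH_ALLOWED_WRITE_ROOTS;
  // Assumption: an empty value is treated like an unset one.
  const candidates =
    raw === undefined || raw.trim() === ""
      ? [tmpdir(), process.cwd()]
      : raw.split(delimiter).filter((entry) => entry !== "");

  return candidates.map((dir) => {
    if (!isAbsolute(dir)) {
      throw new Error(`Allowed write root must be absolute: ${JSON.stringify(dir)}`);
    }
    // realpathSync throws if the path does not exist, which doubles as the
    // existence check; the result is the canonical form used for comparison.
    return realpathSync(dir);
  });
}

console.log(buildAllowedRootsSketch({}));
```

`path.delimiter` is `:` on POSIX and `;` on Windows, which matches the platform delimiters described for this variable.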

### `validateSavePath()`

Validates a save path is within allowed roots:

| Parameter | Type | Description |
|-----------|------|-------------|
| `savePath` | `string` | The requested save path |
| `roots` | `string[]` | Allowed root directories |

| Return Type | Description |
|-------------|-------------|
| `{ ok: true, resolved: string }` | Path is allowed; `resolved` is the canonicalized path for writing |
| `{ ok: false, reason: string }` | Path is outside sandbox; `reason` describes the violation |

**Validation steps:**
1. Resolve `savePath` via `fs.realpath`
2. For each root, compute relative path from root to resolved target
3. If relative path is empty (same directory) or does not start with `..` and is not absolute: allow
4. Otherwise: reject with reason listing allowed roots

Source: [src/sandbox.ts:1-50](https://github.com/vasylenko/markfetch/blob/main/src/sandbox.ts)
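Steps 2 to 4 can be sketched with `path.relative`. The sketch below is a hedged illustration: `checkContainment` is a hypothetical name, inputs are assumed to be absolute, already canonicalized POSIX paths, and it tests for a leading `..` path segment rather than a raw string prefix (so a file literally named `..notes.md` inside a root is not rejected), which is a refinement of the rule as written above.

```typescript
import { posix as path } from "node:path";

type SandboxResult =
  | { ok: true; resolved: string }
  | { ok: false; reason: string };

function checkContainment(resolved: string, roots: string[]): SandboxResult {
  for (const root of roots) {
    const rel = path.relative(root, resolved);
    const escapes = rel === ".." || rel.startsWith("../");
    // Empty means "the root itself"; otherwise the path must stay below it.
    if (rel === "" || (!escapes && !path.isAbsolute(rel))) {
      return { ok: true, resolved };
    }
  }
  const list = roots.map((r) => `'${r}'`).join(", ");
  return {
    ok: false,
    reason: `'${resolved}' is outside the allowed write roots: [${list}]`,
  };
}

console.log(checkContainment("/tmp/out/page.md", ["/tmp/out"]));
console.log(checkContainment("/tmp/outside/page.md", ["/tmp/out"]));
```

Comparing relative paths instead of string prefixes is what keeps `/tmp/outside` from being mistaken for a child of `/tmp/out`.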

## Error Handling

### Error Codes

| Code | Condition | Response |
|------|-----------|----------|
| `save_forbidden` | `savePath` resolves outside allowed roots | No file written; MCP returns error |
| `save_failed` | `savePath` is valid but `writeFile` fails | No file written; MCP returns error |

### Error Message Format

A `save_forbidden` rejection is reported in the following format:
```
[save_forbidden] '<path>' is outside the allowed write roots: ['/allowed/root1', '/allowed/root2']
```

This provides:
- The attempted path
- The reason for rejection
- The list of allowed roots for debugging

Source: [src/mcp.ts:8-13](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)

## MCP Integration

### Tool Schema

```typescript
server.registerTool("fetch_markdown", {
  inputSchema: {
    url: z.string().url().describe("..."),
    savePath: z.string()
      .refine(isAbsolute, "savePath must be an absolute filesystem path")
      .optional()
      .describe("Optional. When provided, the fetched markdown is written to this absolute filesystem path...")
  }
});
```

### Validation Flow

1. MCP adapter receives `savePath` parameter
2. Validates path is absolute (via Zod schema)
3. Calls `validateSavePath(savePath, allowedRoots)`
4. If `ok: false`: throw `MarkfetchError` with `save_forbidden` code
5. If `ok: true`: use `resolved` path for `writeFile`

Source: [src/mcp.ts:24-35](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)

## Architecture Diagram

```mermaid
graph LR
    subgraph MCP_Client
        A[LLM sends fetch_markdown with savePath]
    end
    
    subgraph MCP_Server
        B[src/mcp.ts - MCP adapter]
        C[src/core.ts - fetchMarkdown]
        D[src/sandbox.ts - validateSavePath]
    end
    
    subgraph File_System
        E[fs.realpath resolution]
        F[fs.writeFile]
    end
    
    A --> B
    B -->|validate path| D
    D -->|resolve symlink| E
    E -->|check containment| D
    D -->|ok: true| C
    C -->|write markdown| F
    
    D -->|ok: false| B
    B -->|save_forbidden| A
```

## CLI vs MCP Behavior

| Aspect | CLI Mode | MCP Mode |
|--------|----------|----------|
| Write sandbox | None | Enforced |
| Path validation | Not performed | Required |
| Symlink resolution | Not performed | Required |
| `savePath` parameter | Optional, `-o` flag | Optional, tool parameter |
| Relative path resolution | Resolves against cwd | Not allowed (must be absolute) |

The CLI adapter resolves relative paths internally for convenience, but the MCP adapter requires absolute paths and enforces the sandbox.

Source: [src/cli.ts:6-18](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)

## Security Considerations

### Attack Vectors Mitigated

1. **Path traversal**: `../../etc/passwd` is resolved before checking
2. **Symlink escape**: `<sandbox>/link_to_external` is resolved and rejected
3. **Case confusion (Windows)**: `C:\Users\Bob` equals `c:\users\bob`
4. **Tilde expansion**: Not performed; shell expands `~` before argv reaches process

### Remaining Trust Boundaries

| Boundary | Description |
|-------------|-------------|
| Filesystem permissions | Sandbox does not override OS file permissions |
| Network | Does not prevent network-based attacks |
| Content injection | Does not sanitize markdown content before writing |

## Related Files

| File | Role |
|------|------|
| `src/sandbox.ts` | Core sandbox validation logic |
| `src/mcp.ts` | MCP server adapter, uses sandbox |
| `src/cli.ts` | CLI adapter, no sandbox |
| `src/core.ts` | Core fetch pipeline |
| `README.md` | User documentation and configuration |
| `CHANGELOG.md` | Historical security fix for symlink escape |

## Changelog

| Version | Change |
|---------|--------|
| 0.6.0 | Current release; write sandbox and `save_forbidden` error introduced |
| 0.5.0 | CLI mode added (unrestricted by design) |
| < 0.5.0 | MCP-only |

Source: [package.json:3](https://github.com/vasylenko/markfetch/blob/main/package.json)

---

<a id='error-handling'></a>

## Error Handling

### Related Pages

Related topics: [Processing Pipeline](#processing-pipeline), [Write Sandbox Security](#write-sandbox), [Environment Variables](#environment-variables)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
- [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)
- [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)
- [src/sandbox.ts](https://github.com/vasylenko/markfetch/blob/main/src/sandbox.ts)
- [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
- [CHANGELOG.md](https://github.com/vasylenko/markfetch/blob/main/CHANGELOG.md)
</details>

# Error Handling

markfetch implements a deterministic, structured error handling system that provides consistent error reporting across both CLI and MCP interfaces. All errors are categorized into specific codes that enable precise failure diagnosis and appropriate recovery strategies.

## Error Code Reference

markfetch defines eight deterministic error codes that cover all failure scenarios. Each code is designed to be actionable, helping callers understand exactly what went wrong and how to respond.

| Error Code | Meaning | Typical Cause |
|---|---|---|
| `network_error` | DNS, TCP, or TLS failure | Firewall blocking, network unavailable, invalid hostname |
| `http_error` | Non-2xx HTTP response | 404 page not found, 403 forbidden, 500 server error |
| `timeout` | Request exceeded `MARKFETCH_TIMEOUT_MS` | Slow server, large page, network latency |
| `unsupported_content_type` | Response is not HTML | Binary files, JSON APIs, PDF documents |
| `extraction_failed` | Readability found no article content | Pure client-rendered SPAs with no static HTML |
| `too_large` | Body or markdown exceeded `MARKFETCH_MAX_BYTES` | Very large articles with embedded media |
| `save_failed` | File write operation failed | Missing parent directory, permission denied |
| `save_forbidden` | Save path outside allowed write roots | Path traverses symlink outside sandbox |

Source: [README.md](README.md)

## Error Architecture

The error handling system follows a layered architecture where core validation and error creation happen in `src/core.ts`, while each adapter (CLI and MCP) provides interface-specific error formatting and reporting.

```mermaid
graph TD
    A[Request] --> B[core.ts Validation]
    B --> C{Error Condition?}
    C -->|No| D[Successful Fetch]
    C -->|Yes| E[MarkfetchError Thrown]
    E --> F[Adapter Layer]
    F --> G[CLI Adapter]
    F --> H[MCP Adapter]
    G --> I["stderr: [code] message"]
    H --> J["content[0].text: [code] message"]
    J --> K[isError: true]
```

Source: [src/core.ts](src/core.ts), [src/cli.ts](src/cli.ts), [src/mcp.ts](src/mcp.ts)

## MarkfetchError Class

The central error type is `MarkfetchError`, which encapsulates both the error code and human-readable message. This class serves as the single error type thrown throughout the application.

```typescript
class MarkfetchError extends Error {
  constructor(
    public readonly code: ErrorCode,
    message: string,
  ) {
    super(message); // Error already stores `message`
    this.name = "MarkfetchError";
  }
}
```

Source: [src/core.ts:1-100](src/core.ts)

## Environment Variable Validation

markfetch validates configuration environment variables at startup to fail fast on misconfiguration rather than producing confusing per-request errors.

| Variable | Default | Validation Rules |
|---|---|---|
| `MARKFETCH_TIMEOUT_MS` | `30000` | Positive integer |
| `MARKFETCH_MAX_BYTES` | `5000000` | Positive integer |
| `MARKFETCH_USER_AGENT` | Chrome 130 UA string | Must contain a `Chrome/VERSION` token |

The `intEnv` function performs validation:

```typescript
function intEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw == null || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isFinite(n) || !Number.isInteger(n) || n <= 0) {
    throw new Error(
      `Invalid ${name}=${JSON.stringify(raw)} — expected a positive integer.`,
    );
  }
  return n;
}
```

Source: [src/core.ts:1-100](src/core.ts)

### User-Agent Validation

`MARKFETCH_USER_AGENT` must be a valid Chrome User-Agent string. This requirement exists because the `Sec-CH-UA-*` client hints are derived from the User-Agent at startup, and a User-Agent that disagrees with its client hints is a stronger bot signal than either value alone.

```typescript
function deriveClientHints(ua: string): {
  brands: string;
  mobile: string;
  platform: string;
} {
  const versionMatch = /\bChrome\/(\d+)/.exec(ua);
  if (!versionMatch) {
    throw new Error(
      `Invalid MARKFETCH_USER_AGENT=${JSON.stringify(ua)} — expected a Chrome User-Agent containing "Chrome/VERSION".`,
    );
  }
  // ...
}
```

Source: [src/core.ts:1-100](src/core.ts)

## CLI Error Handling

The CLI adapter catches errors thrown from core and formats them for stderr output. Error output follows a consistent `[code] message` format that matches the MCP error format exactly.

```typescript
try {
  const { markdown, bytes, savedTo } = await fetchMarkdown({
    url,
    savePath,
  });
  // ... success handling
} catch (err) {
  const { code, message } = classifyError(err);
  console.error(`[${code}] ${message}`);
  // Use exitCode so pending output drains before process exits
  process.exitCode = 1;
}
```

Source: [src/cli.ts:1-50](src/cli.ts)

### CLI Exit Codes

| Scenario | Exit Code | Output |
|---|---|---|
| Success (stdout) | 0 | Raw markdown |
| Success (save to file) | 0 | `Saved X bytes to /path` |
| Any error | 1 | `[code] message` to stderr |

The use of `process.exitCode = 1` (rather than `process.exit(1)`) ensures pending stdout/stderr output drains before the process terminates, which is important when stdout is piped to a slow consumer.

Source: [src/cli.ts:1-50](src/cli.ts)

## MCP Error Handling

The MCP adapter returns errors in a format compatible with the MCP protocol. Errors appear in the `content[0].text` field with `isError: true` set.

```typescript
function errorResult(code: ErrorCode, message: string) {
  return {
    content: [{ type: "text" as const, text: `[${code}] ${message}` }],
    isError: true,
  };
}
```

Source: [src/mcp.ts:1-50](src/mcp.ts)

### MCP Response Structure for Errors

```json
{
  "content": [
    {
      "type": "text",
      "text": "[network_error] DNS lookup failed"
    }
  ],
  "isError": true
}
```

Source: [src/mcp.ts:1-50](src/mcp.ts)

## Write Sandbox Errors

The MCP interface enforces a write sandbox that restricts file saves to configured root directories. Errors occur when `savePath` resolves to a location outside the allowed roots.

```typescript
export function checkWritePath(
  target: string,
  roots: string[],
): { ok: true; resolved: string } | { ok: false; reason: string } {
  // ... validation logic
  return {
    ok: false,
    reason: `'${reattached}' is outside the allowed write roots: [${roots.map((r) => `'${r}'`).join(", ")}]`,
  };
}
```

Source: [src/sandbox.ts:1-100](src/sandbox.ts)

### Allowed Write Roots Configuration

| Platform | Default Roots | Delimiter |
|---|---|---|
| POSIX | `os.tmpdir()` + `process.cwd()` | `:` |
| Windows | `os.tmpdir()` + `process.cwd()` | `;` |

Override with `MARKFETCH_ALLOWED_WRITE_ROOTS` environment variable. When set, this **replaces** the defaults entirely rather than merging.

Source: [README.md](README.md)

### Symlink Handling

The sandbox correctly resolves symlinks to prevent escape attempts like `<sandbox>/link/../out.md` where `link` points outside the sandbox. The canonicalized path flows from the containment check into `writeFile`, ensuring the file is created exactly at the validated location.

Source: [CHANGELOG.md](CHANGELOG.md), [src/sandbox.ts:1-100](src/sandbox.ts)

## Error Classification

The `classifyError` function normalizes different error types into the `MarkfetchError` format used throughout the system:

```typescript
function classifyError(err: unknown): { code: string; message: string } {
  if (err instanceof MarkfetchError) {
    return { code: err.code, message: err.message };
  }
  if (err instanceof Error) {
    return { code: "network_error", message: err.message };
  }
  return { code: "network_error", message: String(err) };
}
```

Source: [src/core.ts:1-100](src/core.ts)

### Error Source Mapping

| Error Source | Code Produced |
|---|---|
| `MarkfetchError` instances | Original code preserved |
| `Error` instances | `network_error` |
| Non-Error values | `network_error` with string coercion |

## Unified Error Flow

Version 0.5.0 introduced a refactoring where three inline `return errorResult(...)` sites in the MCP handler were converted to throw `MarkfetchError` from core uniformly. Both adapters now catch and convert errors consistently.

This architectural change ensures that both CLI and MCP interfaces produce identical error codes and messages for the same failure conditions.

Source: [CHANGELOG.md](CHANGELOG.md)

## Best Practices for Error Handling

### For MCP Clients

1. Check `isError` field in the response object
2. Parse the `content[0].text` field for the `[code] message` format
3. Handle `extraction_failed` gracefully for client-rendered SPAs
4. Use `savePath` parameter for large responses to avoid tool-result truncation
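
Steps 1 and 2 can be sketched as a small client-side helper. This is a minimal sketch under stated assumptions: `ToolResult` is a reduced, hypothetical view of a tool result, `parseToolError` is not part of markfetch, and the `"unknown"` fallback code is invented for text that lacks the `[code]` prefix.

```typescript
// Assumed shape of a tool result, reduced to the fields used here.
interface ToolResult {
  content: Array<{ type: string; text?: string }>;
  isError?: boolean;
}

// Hypothetical helper: returns the code and message of an error result,
// or null when the result is not an error.
function parseToolError(
  result: ToolResult,
): { code: string; message: string } | null {
  if (!result.isError) return null;
  const text = result.content[0]?.text ?? "";
  const match = /^\[([a-z_]+)\]\s*([\s\S]*)$/.exec(text);
  if (!match) return { code: "unknown", message: text };
  return { code: match[1] ?? "unknown", message: match[2] ?? "" };
}

const parsed = parseToolError({
  content: [{ type: "text", text: "[network_error] DNS lookup failed" }],
  isError: true,
});
console.log(parsed);
```

A caller can then branch on `code`, for example retrying on `timeout` while reporting `extraction_failed` to the user.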

### For CLI Consumers

1. Redirect stderr to capture error codes
2. Parse `[code] message` format from stderr
3. Use `markfetch <url> 2>&1 >/dev/null | head -1` to print only the error line (stdout is discarded, so markdown output cannot be mistaken for an error)

### For Save Operations

1. Always use absolute paths for `savePath`
2. Verify `MARKFETCH_ALLOWED_WRITE_ROOTS` includes your target directory
3. Check for `save_forbidden` before `save_failed` in error handling logic

---

<a id='development'></a>

## Development Guide

### Related Pages

Related topics: [Introduction](#introduction), [Quick Start Guide](#quickstart)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [package.json](https://github.com/vasylenko/markfetch/blob/main/package.json)
- [src/core.ts](https://github.com/vasylenko/markfetch/blob/main/src/core.ts)
- [src/cli.ts](https://github.com/vasylenko/markfetch/blob/main/src/cli.ts)
- [src/mcp.ts](https://github.com/vasylenko/markfetch/blob/main/src/mcp.ts)
- [src/sandbox.ts](https://github.com/vasylenko/markfetch/blob/main/src/sandbox.ts)
- [README.md](https://github.com/vasylenko/markfetch/blob/main/README.md)
</details>

# Development Guide

This guide provides comprehensive information for developers who want to understand, extend, or contribute to markfetch.

## Overview

markfetch is a Node.js tool that fetches URLs and converts web content to clean markdown. It operates in two modes:

1. **CLI Mode** - Command-line interface for shell integration
2. **MCP Mode** - Model Context Protocol server for AI agent integration

The project requires Node.js ≥ 24 and is distributed as an npm package. Source: [package.json:8]()

## Architecture

```mermaid
graph TD
    A[User Input] --> B{process.argv has user arguments?}
    B -->|One or more| C[CLI Adapter]
    B -->|None| D[MCP Adapter]
    
    C --> E[src/cli.ts]
    D --> F[src/mcp.ts]
    
    E --> G[src/core.ts]
    F --> G
    
    G --> H[undici HTTP Client]
    G --> I[linkedom HTML Parser]
    G --> J["@mozilla/readability"]
    G --> K[turndown]
    
    H --> L[HTTP Response]
    I --> M[DOM Document]
    J --> N[Extracted Article]
    K --> O[Markdown Output]
```

### Core Pipeline (src/core.ts)

The core module implements the main fetch-and-convert pipeline. It orchestrates:

| Component | Role |
|-----------|------|
| `undici` | HTTP/2 transport with Chrome-like fingerprinting |
| `linkedom` | HTML parsing to DOM |
| `@mozilla/readability` | Article content extraction |
| `turndown` | HTML to markdown conversion |

Source: [src/core.ts:1-50]()

### Adapters (src/cli.ts & src/mcp.ts)

The adapters and the shared core are split across the following files:

| File | Purpose |
|------|---------|
| `src/core.ts` | Pipeline + errors (shared logic) |
| `src/mcp.ts` | MCP stdio server adapter |
| `src/cli.ts` | CLI argv parser + dispatcher |
| `src/index.ts` | Lazy-import dispatcher based on `process.argv.length` |

Source: [README.md:95-100]()

The lazy-import dispatcher ensures `console.log` calls in `cli.ts` are never reachable from the MCP path, maintaining the invariant that stdout is reserved for MCP frames. Source: [CHANGELOG.md:45-47]()
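
A dispatcher of this kind can be sketched as below. Everything here is an assumption made for illustration: the `pickAdapter` name, the `argv.length > 2` threshold (based on `argv[0]` being the Node binary and `argv[1]` the script), and the module paths in the comment. The actual `src/index.ts` may differ.

```typescript
type Adapter = "cli" | "mcp";

// Hypothetical decision function: anything beyond the first two argv
// entries counts as a user-supplied argument.
function pickAdapter(argv: readonly string[]): Adapter {
  return argv.length > 2 ? "cli" : "mcp";
}

// In an entry point, the chosen adapter would be loaded lazily so the other
// module is never evaluated (module paths are illustrative):
//
//   if (pickAdapter(process.argv) === "cli") {
//     await import("./cli.js");
//   } else {
//     await import("./mcp.js");
//   }

console.log(pickAdapter(["node", "markfetch"]));
console.log(pickAdapter(["node", "markfetch", "https://example.com"]));
```

Because the CLI module is only imported on the CLI branch, none of its top-level code runs when the process is acting as an MCP server.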

## Setting Up the Development Environment

### Prerequisites

- Node.js ≥ 24
- npm or yarn

### Installation

```bash
# Clone the repository
git clone https://github.com/vasylenko/markfetch.git
cd markfetch

# Install dependencies
npm install
```

### Available Scripts

| Script | Command | Purpose |
|--------|---------|---------|
| `dev` | `npm run dev` | Run source directly with tsx (no build required) |
| `build` | `npm run build` | Compile TypeScript to JavaScript |
| `test` | `npm run test` | Run test suite with tsx |
| `inspect` | `npm run inspect` | Launch MCP inspector for debugging |

Source: [package.json:21-28]()

### Build Process

The build process consists of two steps, both triggered by a single command:

```bash
# Compile TypeScript; npm then runs the `postbuild` lifecycle script automatically
npm run build
```

The postbuild script (`scripts/postbuild.mjs`) performs additional transformations after TypeScript compilation. Source: [package.json:26]()

## Project Structure

```
markfetch/
├── src/
│   ├── index.ts      # Entry point with argv dispatcher
│   ├── core.ts       # Core fetch/extract/convert pipeline
│   ├── cli.ts        # CLI adapter using commander
│   ├── mcp.ts        # MCP stdio server
│   └── sandbox.ts    # Write path sandboxing
├── dist/             # Compiled JavaScript output
├── tests/            # Test fixtures and test files
├── scripts/
│   └── postbuild.mjs # Post-compilation transformations
└── docs/
    └── SPEC.md       # Detailed specification
```

## Configuration

### Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `MARKFETCH_TIMEOUT_MS` | `30000` | Per-request timeout in milliseconds |
| `MARKFETCH_MAX_BYTES` | `5000000` | Cap on response body and extracted markdown |
| `MARKFETCH_USER_AGENT` | Chrome 130 string | Override the User-Agent header |
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | `os.tmpdir()` + `process.cwd()` | MCP-only write sandbox roots |

Source: [README.md:60-66]()

### Configuration Precedence

1. Environment variables set at startup
2. Command-line flags (CLI mode)
3. MCP tool parameters (MCP mode)

## Core API

### fetchMarkdown Function

The main function exported from `core.ts`:

```typescript
interface FetchOptions {
  url: string;
  savePath?: string;
}

interface FetchResult {
  markdown: string;
  bytes: number;
  savedTo?: string;
}
```

### Error Handling

The core module defines eight deterministic error codes:

| Code | Meaning |
|------|---------|
| `network_error` | DNS/TCP/TLS failure |
| `http_error` | Non-2xx HTTP status |
| `timeout` | Request timeout exceeded |
| `unsupported_content_type` | Not `text/html` or `application/xhtml+xml` |
| `extraction_failed` | Readability found no article content |
| `too_large` | Response or markdown exceeded size cap |
| `save_failed` | File write failed (permissions, missing directory) |
| `save_forbidden` | Path outside allowed write roots |

Source: [README.md:71-80]()

Errors are thrown as `MarkfetchError` from core uniformly and caught by adapters for conversion. Source: [CHANGELOG.md:49-51]()

## Extending the Pipeline

### Adding New HTML Rewrites

The `rewriteForReadability()` function in `core.ts` handles pre-extraction HTML transformations:

```typescript
function rewriteForReadability(document: Document): void {
  // Transform <aside class="footnote-brackets"> to <section>
  // Flatten <details> elements
  // Replace div.mw-heading with their heading children
}
```

To add a new rewrite rule, append it to the body of this function. Source: [src/core.ts:120-160]()

### Customizing Markdown Conversion

The `TURNDOWN` instance is configured with:

| Plugin/Option | Purpose |
|---------------|---------|
| `gfm` plugin | GitHub Flavored Markdown support |
| `keepClasses: true` | Preserve `class="language-X"` for code fences |
| Custom escape | Handle `-`/`=` after inline elements |

Source: [src/core.ts:50-90]()

### Modifying Error Handling

Error handling flows through the `MarkfetchError` class in core:

1. Core throws `MarkfetchError` with code and message
2. Adapters catch and format for their protocol
3. CLI: writes `[code] message` to stderr
4. MCP: returns `{ content: [...], isError: true }`

Source: [src/cli.ts:35-42]() and [src/mcp.ts:15-20]()
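
The four steps above can be sketched with a local stand-in for the error class. The names `MarkfetchErrorSketch`, `formatError`, `toCliOutput`, and `toMcpResult` are hypothetical and exist only to show how one shared formatter keeps both adapters in agreement.

```typescript
// Local stand-in for the error class in src/core.ts (shape assumed).
class MarkfetchErrorSketch extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}

// One formatter shared by both adapters guarantees identical text.
function formatError(err: MarkfetchErrorSketch): string {
  return `[${err.code}] ${err.message}`;
}

// CLI-style conversion: the line for stderr plus the exit code.
function toCliOutput(err: MarkfetchErrorSketch): { stderr: string; exitCode: number } {
  return { stderr: formatError(err), exitCode: 1 };
}

// MCP-style conversion: a text content item flagged as an error.
function toMcpResult(err: MarkfetchErrorSketch) {
  return {
    content: [{ type: "text" as const, text: formatError(err) }],
    isError: true,
  };
}

const sample = new MarkfetchErrorSketch("timeout", "Request exceeded 30000 ms");
console.log(toCliOutput(sample));
console.log(JSON.stringify(toMcpResult(sample)));
```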

## Write Sandbox

The MCP adapter enforces write path restrictions:

```mermaid
graph TD
    A[MCP savePath] --> B{Absolute path?}
    B -->|No| C[Refine fails: savePath must be absolute]
    B -->|Yes| D{Inside allowed roots?}
    D -->|Yes| E[Write file]
    D -->|No| F[Return save_forbidden error]
```

### Configuring Allowed Roots

Set the environment variable with platform delimiter:

```bash
# POSIX
export MARKFETCH_ALLOWED_WRITE_ROOTS="/tmp:/home/user/docs"

# Windows (cmd.exe): quote the whole assignment so the quotes are not stored in the value
set "MARKFETCH_ALLOWED_WRITE_ROOTS=C:\Users\me\docs;C:\temp"
```

The sandbox check resolves symlinks and applies case-folding on Windows. Source: [src/sandbox.ts:20-40]()

## Testing

### Running Tests

```bash
npm test
```

### Test Structure

Tests use the Node.js built-in test runner (`--test` flag) with tsx for TypeScript support. Source: [package.json:27]()

### Writing New Tests

1. Place test files in `tests/` directory
2. Use `*.test.ts` naming pattern
3. Run with `tsx --test tests/*.test.ts`
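
A new test file following these steps might look like the sketch below. The file name `tests/format.test.ts` and the `formatError` function are hypothetical; a real test would import the function under test from `src/` instead of defining a stand-in.

```typescript
import { strictEqual } from "node:assert";
import { test } from "node:test";

// Stand-in for a function under test (assumption: real tests import from src/).
function formatError(code: string, message: string): string {
  return `[${code}] ${message}`;
}

test("formats errors as [code] message", () => {
  strictEqual(formatError("timeout", "took too long"), "[timeout] took too long");
});
```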

## MCP Inspector

Debug MCP integration using the official inspector:

```bash
npm run inspect
```

This launches the MCP inspector at `http://localhost:6274` where you can:
- Test tool calls interactively
- Inspect request/response frames
- Verify schema validation

Source: [package.json:27]()

## Dependencies

### Production Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| `@modelcontextprotocol/sdk` | ^1.29.0 | MCP server implementation |
| `@mozilla/readability` | ^0.5.0 | Article extraction |
| `commander` | ^14.0.3 | CLI argument parsing |
| `linkedom` | ^0.18.0 | HTML parsing |
| `turndown` | ^7.0.0 | HTML to markdown |
| `turndown-plugin-gfm` | ^1.0.2 | GFM support |
| `undici` | ^8.2.0 | HTTP client |
| `zod` | ^3.0.0 | Schema validation |

### Development Dependencies

| Package | Purpose |
|---------|---------|
| `@types/node` | Node.js type definitions |
| `@types/turndown` | Turndown type definitions |
| `tsx` | TypeScript execution |
| `typescript` | TypeScript compiler |

Source: [package.json:30-50]()

## Version History

| Version | Date | Key Changes |
|---------|------|-------------|
| 0.6.0 | 2026-05-13 | Write sandbox, Windows CI, save_forbidden error |
| 0.5.0 | 2026-05-12 | CLI mode, commander dependency |
| 0.4.1 | 2026-05-11 | README rewrite, bin path fix |
| 0.4.0 | 2026-05-10 | MCP server with fetch_markdown tool |

Source: [CHANGELOG.md:1-60]()

## Contributing Guidelines

### Code Standards

- All source in TypeScript under `src/`
- Build output to `dist/` via `npm run build`
- Tests in `tests/` with `*.test.ts` pattern
- No runtime `console.log` in MCP path (enforced by lazy-import structure)

### Pull Request Checklist

- [ ] Run `npm run build` successfully
- [ ] Run `npm test` with all tests passing
- [ ] Update CHANGELOG.md with changes
- [ ] Ensure documentation reflects new behavior

### Release Process

```bash
npm run prepublishOnly
```

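npm runs the `prepublishOnly` script automatically before `npm publish`, so the package is always built before release; the command above triggers the same step manually. Source: [package.json:29]()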

---


## Doramagic Pitfall Log

Project: vasylenko/markfetch

Summary: 7 potential pitfalls found, 0 of them high/blocking; highest priority: installation pitfall, source evidence: v0.4.1.

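## 1. Installation pitfall · Source evidence: v0.4.1

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: v0.4.1
- User impact: May raise the cost of trial and production adoption for new users.
- Suggested check: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_749b65614f7b40e0b524f4e932cd4aca | https://github.com/vasylenko/markfetch/releases/tag/v0.4.1 | The source discussion mentions Node-related conditions that should be rechecked before installation or trial.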

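## 2. Capability pitfall · Capability assessment relies on assumptions

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: If the assumption does not hold, users will not get the promised capability.
- Suggested check: Turn the assumption into a downstream verification checklist.
- Safeguard: Assumptions must become verification items; they cannot be stated as fact until verified.
- Evidence: capability.assumptions | github_repo:1234238440 | https://github.com/vasylenko/markfetch | README/documentation is current enough for a first validation pass.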

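## 3. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: `last_activity_observed` is not recorded.
- User impact: New, abandoned, and active projects get lumped together, lowering trust in the recommendation.
- Suggested check: Add GitHub signals for recent commits, releases, and issue/PR responsiveness.
- Safeguard: While maintenance activity is unknown, the recommendation must not be rated high-trust.
- Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | last_activity_observed missing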

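## 4. Security/permissions pitfall · Downstream validation found risk items

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: Downstream has already requested a review; this must not be played down on the page.
- Suggested check: Route to the security/permissions governance review queue.
- Safeguard: While downstream risk exists, the review/recommendation must stay downgraded.
- Evidence: downstream_validation.risk_items | github_repo:1234238440 | https://github.com/vasylenko/markfetch | no_demo; severity=medium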

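## 5. Security/permissions pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: The risk affects whether the project is suitable for ordinary users to install.
- Suggested check: Record the risk on the boundary card and confirm whether manual review is needed.
- Safeguard: Scoring risks must appear on the boundary card, not only as an internal score.
- Evidence: risks.scoring_risks | github_repo:1234238440 | https://github.com/vasylenko/markfetch | no_demo; severity=medium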

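## 6. Maintenance pitfall · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: Users cannot tell whether anyone will maintain the project when they hit a problem.
- Suggested check: Sample recent issues/PRs to see whether they go unhandled for long periods.
- Safeguard: When issue/PR responsiveness is unknown, a maintenance risk must be flagged.
- Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | issue_or_pr_quality=unknown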

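## 7. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: Install commands and documentation may lag behind the code, raising the chance of pitfalls.
- Suggested check: Confirm that the latest release/tag matches the README install command.
- Safeguard: When release cadence is unknown or stale, installation instructions must note possible drift.
- Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | release_recency=unknown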

<!-- canonical_name: vasylenko/markfetch; human_manual_source: deepwiki_human_wiki -->
