markfetch Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

markfetch

Related topics: Quick Start Guide, Processing Pipeline

Introduction

What is markfetch?

markfetch is a Node.js tool that fetches public HTTP/S URLs and returns clean, readable markdown — indistinguishable from what a human would get by running "Save as Markdown" in a browser. It is designed to provide high-quality content extraction for language models, with a focus on producing output that LLM clients can actually consume reliably.

Sources: README.md

Core Design Philosophy

markfetch is built around several key principles that differentiate it from generic fetching solutions:

Principle	Description
Single-channel output	Returns markdown in `content[0].text` only — no `structuredContent` that some LLM clients drop
Real-browser fingerprint	Uses HTTP/2 transport with a coherent Chrome header set and `Sec-CH-UA-*` client hints
Reader-View extraction	Leverages Mozilla's Readability library to extract the main article content
Zero-config defaults	Works out of the box with sensible defaults
Deterministic errors	8 structured error codes for reliable error handling

Sources: README.md

Architecture Overview

markfetch follows an adapter pattern with a unified core:

graph TD
    A[User / LLM Client] --> B[Adapter Layer]
    B --> C{Invocation Mode}
    C -->|CLI args| D[cli.ts]
    C -->|MCP stdio| E[mcp.ts]
    D --> F[core.ts - fetchMarkdown]
    E --> F
    F --> G[HTTP Fetch - undici]
    G --> H[Readability Extraction]
    H --> I[Turndown Conversion]
    I --> J[Markdown Output]

Core Components

Component	File	Responsibility
Core Pipeline	`src/core.ts`	URL fetching, HTML parsing, content extraction, markdown conversion, error throwing
CLI Adapter	`src/cli.ts`	Command-line argument parsing, stdout/stderr output
MCP Adapter	`src/mcp.ts`	Model Context Protocol stdio server, tool registration
Write Sandbox	`src/sandbox.ts`	Path validation for file saves

Sources: src/core.ts, src/cli.ts, src/mcp.ts

Two Operating Modes

CLI Mode

The command-line interface accepts a URL and outputs markdown to stdout:

markfetch https://en.wikipedia.org/wiki/Markdown

Options include:

-o, --output <path> — Save markdown to a file
-V, --version — Print version
-h, --help — Print usage

The CLI respects the same environment variables as the MCP mode and resolves relative output paths against the current working directory.

Sources: README.md, src/cli.ts

MCP Mode

The Model Context Protocol server provides a single tool fetch_markdown(url, savePath?) for integration with LLM clients like Claude Code, Cursor, or Goose:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

The MCP mode has additional security features:

Write sandbox: File saves are restricted to allowed write roots
Lazy loading: The CLI adapter is never loaded in MCP mode, ensuring console.log is never reachable

Sources: src/mcp.ts, src/cli.ts

Content Extraction Pipeline

The markdown conversion process involves several stages:

graph LR
    A[HTML Response] --> B[Decode Encoded Tags]
    B --> C[Ensure Base Href]
    C --> D[Rewrite for Readability]
    D --> E[Readability Parse]
    E --> F[Turndown Convert]
    F --> G[Prune Empty Headings]
    G --> H[Clean Markdown]

Extraction Details

Encoded Tag Decoding: Handles entity-encoded tags such as `&lt;code&gt;` inside code blocks
Base Href Injection: Ensures relative URLs become absolute using the canonical URL
Pre-processing Rewrites: Handles footnotes, `

Quick Start Guide

markfetch is a tool that fetches URLs and returns clean markdown output. It operates as both a CLI command and an MCP (Model Context Protocol) server, making it suitable for AI agents like Claude Code, Codex, and Gemini CLI.

Installation

Prerequisites

Node.js ≥ 24 Sources: package.json:8

CLI Installation (Global)

npm i -g markfetch

After installation, the markfetch command is available globally. Sources: README.md:38

CLI Installation (npx)

For one-off usage without global installation:

npx -y markfetch <url>

MCP Server Setup

Add markfetch to your MCP client configuration. The setup varies by client.

Claude Code

claude mcp add --scope user markfetch -- npx -y markfetch

Codex

codex mcp add markfetch -- npx -y markfetch

Gemini CLI

gemini mcp add -s user markfetch npx -y markfetch

Cursor / Goose / Other stdio-MCP Clients

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

Sources: README.md:46-69

CLI Usage

Basic Fetch

markfetch <url>

The fetched markdown is printed to stdout. Sources: src/cli.ts:18

Save to File

markfetch <url> -o <path>

Use -o or --output to save markdown to a file. Relative paths resolve against the current working directory. Sources: src/cli.ts:12-15

Example:

markfetch https://en.wikipedia.org/wiki/Markdown -o output.md

Help and Version

markfetch --help
markfetch --version

MCP Tool Usage

Tool Name

fetch_markdown

Parameters

Parameter	Type	Required	Description
`url`	string	Yes	Absolute http(s) URL to fetch. The server follows redirects automatically. No authentication headers, cookies, or session state are sent.
`savePath`	string	No	Absolute filesystem path. When provided, the fetched markdown is written to this path instead of returned in the response.

Sources: src/mcp.ts:22-33

Return Value

The tool returns markdown content in content[0].text. No structuredContent field is used, so clients that forward only one of the two channels to the model cannot lose the markdown. Sources: README.md:18-21

Environment Configuration

Variable	Default	Purpose
`MARKFETCH_TIMEOUT_MS`	`30000`	Per-request timeout in milliseconds
`MARKFETCH_MAX_BYTES`	`5000000`	Cap on response body and extracted markdown (5MB)
`MARKFETCH_USER_AGENT`	Pinned Chrome 130 string	Override the User-Agent header. Must be a Chrome UA string.
`MARKFETCH_ALLOWED_WRITE_ROOTS`	`os.tmpdir()` + `process.cwd()`	MCP-only. Colon-delimited (POSIX) or semicolon-delimited (Windows) list of absolute paths permitted for `savePath` writes.

Sources: README.md:99-103

Passing Environment Variables to MCP

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_TIMEOUT_MS": "60000"
      }
    }
  }
}

Error Handling

Errors are returned with deterministic codes in the format [code] message:

Code	Meaning
`network_error`	DNS, TCP, or TLS failure
`http_error`	Upstream returned a non-2xx status
`timeout`	Request exceeded `MARKFETCH_TIMEOUT_MS`
`unsupported_content_type`	Response was not `text/html` or `application/xhtml+xml`
`extraction_failed`	Readability found no article content (typical for pure client-rendered SPAs)
`too_large`	Response body or extracted markdown exceeded `MARKFETCH_MAX_BYTES`
`save_failed`	`writeFile` failed (missing directory, permission denied)
`save_forbidden`	`savePath` resolves outside the allowed write roots

Errors go to stderr with non-zero exit status in CLI mode. Sources: README.md:72-85

Quick Workflow

graph TD
    A[Start markfetch] --> B{Arguments provided?}
    B -->|Yes, URL argument| C[CLI Mode]
    B -->|No arguments| D[MCP Server Mode]
    C --> E[Fetch URL]
    D --> F[Wait for MCP request]
    E --> G{Output path specified?}
    F --> H[Receive fetch_markdown request]
    G -->|No| I[Print markdown to stdout]
    G -->|Yes, -o path| J[Write to file]
    J --> K[Print confirmation to stdout]
    H --> L[Return markdown in MCP response]

Use Cases

Use Case	Recommended Mode	Command/Config
One-time URL fetch in shell	CLI	`markfetch <url>`
Batch processing with shell scripts	CLI + `-o`	`markfetch <url> -o out.md`
AI agent web content retrieval	MCP	Configure in client
Large documents that exceed inline limits	MCP + `savePath`	Set `savePath` to local file

Limitations

Not a crawler: No recursion, no robots.txt parsing. One URL in, one document out. Sources: README.md:89-91
Not authenticated: Anonymous fetch only. Pages behind login walls return whatever the public response is. Sources: README.md:93-95
Not a JS renderer: Pure client-rendered SPAs with no static HTML return extraction_failed. Sources: README.md:97-99

Sources: README.md

Processing Pipeline

Related topics: Introduction, HTTP/2 Fingerprinting, Error Handling

Overview

The Processing Pipeline is the core data flow engine in markfetch. It transforms raw HTML fetched from a URL into clean, readable markdown suitable for consumption by AI agents and language models. The pipeline is intentionally single-purpose — one URL in, one markdown document out — with no recursion, pagination, or client-side JavaScript rendering.

The pipeline operates identically whether invoked via CLI or MCP adapter, ensuring consistent behavior across both interfaces.

Sources: src/core.ts

Architecture

The pipeline is composed of three primary stages executed sequentially:

graph TD
    A[URL Input] --> B[HTTP Fetch]
    B --> C{Fetch succeeded?}
    C -->|No| D[Error: network_error / http_error / timeout]
    C -->|Yes| E[Content-Type Check]
    E -->|Non-HTML| F[Error: unsupported_content_type]
    E -->|HTML| G[Extract Article]
    G -->|No Content| H[Error: extraction_failed]
    G -->|Extracted| I[Convert to Markdown]
    I --> J{Size Check}
    J -->|Exceeds Limit| K[Error: too_large]
    J -->|Valid| L{Save Path?}
    L -->|Yes| M[Write to File / Error: save_forbidden / save_failed]
    L -->|No| N[Return Markdown]

Each stage performs validation and may abort with a deterministic error code, ensuring failures are predictable and actionable.

Sources: src/core.ts
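
The order of the post-fetch checks in the diagram can be sketched as a pure function. This is a minimal illustration, not the code in src/core.ts: the `checkResponse` name, the `ResponseInfo` shape, and the way the Content-Type header is normalized are all assumptions. Only the error codes, the accepted MIME types, and the 5,000,000-byte default come from this manual. The manual also applies the size cap to the extracted markdown, which this sketch leaves out.

```typescript
// Hypothetical sketch of the checks that follow a completed fetch.
type CheckCode = "http_error" | "unsupported_content_type" | "too_large";

interface ResponseInfo {
  status: number;
  contentType: string | null; // raw Content-Type header value
  bodyBytes: number;
}

const HTML_TYPES = ["text/html", "application/xhtml+xml"];

function checkResponse(res: ResponseInfo, maxBytes = 5_000_000): CheckCode | null {
  // Any status outside 200-299 counts as an upstream failure.
  if (res.status < 200 || res.status > 299) return "http_error";

  // Drop parameters such as "; charset=utf-8" before comparing.
  const [first = ""] = (res.contentType ?? "").split(";");
  const mime = first.trim().toLowerCase();
  if (!HTML_TYPES.includes(mime)) return "unsupported_content_type";

  if (res.bodyBytes > maxBytes) return "too_large";
  return null; // all checks passed
}
```

Returning a code instead of throwing keeps the sketch easy to test; the manual states that the real core throws errors that the adapters catch.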

Stage 1: HTTP Fetch

The fetch stage retrieves raw HTML from the target URL using the undici HTTP client with a real-browser fingerprint.

Transport Configuration

Setting	Value	Purpose
Protocol	HTTP/2	Modern web fingerprint
User-Agent	Chrome 130 (pinned)	Realistic browser identification
Client Hints	Sec-CH-UA-* headers	Derived from User-Agent at startup
Timeout	`MARKFETCH_TIMEOUT_MS` (default: 30000ms)	Per-request budget

The User-Agent string is validated at startup. Non-Chrome strings fail fast to prevent fingerprint inconsistencies that could trigger bot detection.

Sources: README.md

Error Conditions

Code	Trigger
`network_error`	DNS failure, TCP failure, TLS error, unexpected fetcher error
`http_error`	Non-2xx HTTP status code
`timeout`	Request exceeds `MARKFETCH_TIMEOUT_MS`

Redirects are followed automatically by the underlying HTTP client.

Stage 2: Article Extraction

Article extraction identifies and isolates the main content from the fetched HTML, stripping navigation, sidebars, footers, and other boilerplate.

Technology Stack

Component	Library	Purpose
HTML Parser	`linkedom`	Parses HTML into a DOM-like structure
Extraction	`@mozilla/readability`	Identifies main article content
Configuration	`keepClasses: true`	Preserves code block language hints

Node.js has no built-in DOMParser, so linkedom supplies the DOM implementation that Readability needs and behaves consistently across Node.js versions and environments.

Sources: src/core.ts

Pre-Extraction Rewrites

Before Readability processes the document, the pipeline applies targeted HTML rewrites to normalize content and improve extraction quality:

function rewriteForReadability(document: Document): void {
  // Normalize code blocks (pre and code elements)
  // Convert aside elements to sections
  // Expand details/summary elements
  // Flatten MediaWiki heading wrappers
}

Specific transformations include:

Transform	Target	Action
Code block normalization	`<pre>`, `<code>`	Standardize encoding artifacts
Base href injection	`<head>` / `<html>`	Ensure absolute URLs after redirects
Aside conversion	`<aside>` with footnote roles	Convert to `<section>`
Details expansion	`<details>` / `<summary>`	Expand
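
Why base href injection matters can be seen with the standard `URL` constructor, which resolves a link against a base the same way a browser does. The sketch below only illustrates that resolution step. The helper name `absolutize` and the example URLs are hypothetical, and the real rewrite works on the parsed document rather than on strings.

```typescript
// Resolve a link against a base URL, as a browser would with <base href>.
// Returns null when the combination cannot be parsed as a URL.
function absolutize(href: string, baseHref: string): string | null {
  try {
    return new URL(href, baseHref).href;
  } catch {
    return null;
  }
}

// After a redirect, the final URL (not the requested one) is the right base.
const finalUrl = "https://example.com/docs/guide/intro.html";

console.log(absolutize("../images/a.png", finalUrl)); // https://example.com/docs/images/a.png
console.log(absolutize("/about", finalUrl)); // https://example.com/about
console.log(absolutize("https://other.example/x", finalUrl)); // https://other.example/x
```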

HTTP/2 Fingerprinting

Overview

HTTP/2 Fingerprinting is a technique used by markfetch to mimic real browser traffic when fetching web pages. Instead of making requests that appear to come from a typical HTTP library (like curl or a basic fetch implementation), markfetch generates HTTP/2 requests with headers and client hints that closely match those of an actual Chrome browser session.

This approach serves two critical purposes:

Bypass anti-bot measures: Many websites use fingerprinting to detect and block automated scrapers. By presenting headers that match a genuine Chrome browser, markfetch is less likely to trigger these defenses.
Access SEO-rendered content: Sites that serve different content to bots and browsers are more likely to return the full article content when requests arrive with Chrome-like fingerprints.

Sources: README.md

Architecture

graph TD
    A[URL Request] --> B{Adapter Type?}
    B -->|MCP| C[src/mcp.ts]
    B -->|CLI| D[src/cli.ts]
    C --> E[src/core.ts - fetchMarkdown]
    D --> E
    E --> F[Undici Dispatcher]
    F --> G[HTTP/2 Transport]
    G --> H[Sec-CH-UA-* Client Hints]
    G --> I[Chrome Headers]
    H --> J[Upstream Server]
    I --> J
    J --> K[HTML Response]
    K --> L[Readability Parser]
    L --> M[Markdown Output]

Implementation Details

User Agent String

The default user agent is a pinned Chrome 130 string. This can be overridden via the MARKFETCH_USER_AGENT environment variable, but must be a valid Chrome UA string.

Environment Variable	Default Value	Purpose
`MARKFETCH_USER_AGENT`	Pinned Chrome 130 string	Override the browser fingerprint UA

Constraint: The UA string must be a Chrome browser UA. Non-Chrome strings fail fast at startup because Sec-CH-UA-* client hints are derived from the UA at initialization time.

Sources: README.md

Client Hints Generation

When the server starts, markfetch parses the MARKFETCH_USER_AGENT value and derives Sec-CH-UA-* client hint headers from it. These hints are sent with every HTTP/2 request and include:

Sec-CH-UA — Browser brand and version
Sec-CH-UA-Mobile — Mobile indicator
Sec-CH-UA-Platform — Operating system

graph LR
    A[MARKFETCH_USER_AGENT<br/>Chrome 130] --> B[Startup<br/>Initialization]
    B --> C[Sec-CH-UA Header<br/>Derived Value]
    B --> D[Sec-CH-UA-Mobile<br/>Derived Value]
    B --> E[Sec-CH-UA-Platform<br/>Derived Value]
    C --> F[Every HTTP/2<br/>Request]
    D --> F
    E --> F

Sources: README.md
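
The startup derivation described above could look roughly like the following. This is a sketch under stated assumptions, not markfetch's implementation: the function name `deriveClientHints`, the regular expressions, the platform mapping, and the brand list layout are all illustrative. The manual only states that the three headers are derived from the UA and that non-Chrome strings fail fast.

```typescript
// Hypothetical derivation of client hints from a Chrome User-Agent string.
interface ClientHints {
  "sec-ch-ua": string;
  "sec-ch-ua-mobile": string;
  "sec-ch-ua-platform": string;
}

function deriveClientHints(userAgent: string): ClientHints {
  const match = /Chrome\/(\d+)\./.exec(userAgent);
  // Edge and Opera UAs also contain "Chrome/", so reject their tokens too.
  if (!match || /\b(Edg|OPR)\//.test(userAgent)) {
    throw new Error("MARKFETCH_USER_AGENT must be a Chrome User-Agent string");
  }
  const major = match[1];

  // Android UAs also mention Linux, so Android is tested first.
  const platform = /Windows/.test(userAgent) ? "Windows"
    : /Android/.test(userAgent) ? "Android"
    : /Mac OS X/.test(userAgent) ? "macOS"
    : "Linux";

  return {
    // Illustrative brand list; real Chrome also includes a GREASE brand entry.
    "sec-ch-ua": `"Chromium";v="${major}", "Google Chrome";v="${major}"`,
    // Structured-header boolean: ?1 for mobile, ?0 otherwise.
    "sec-ch-ua-mobile": /Mobile/.test(userAgent) ? "?1" : "?0",
    "sec-ch-ua-platform": `"${platform}"`,
  };
}
```

Deriving the hints once from the UA, rather than configuring them separately, is what keeps the header set coherent: the version and platform cannot disagree with the User-Agent header.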

HTTP/2 Transport

markfetch uses the undici HTTP client library with HTTP/2 support. HTTP/2 is used when the server supports it, which brings the protocol's standard features:

Multiplexed requests over a single connection
Header compression
Server push capabilities

Together, the HTTP/2 transport and a coherent Chrome header set produce a request profile that closely resembles a real Chrome session.

Sources: README.md

Request Flow

sequenceDiagram
    participant Client
    participant Markfetch
    participant Undici
    participant Server

    Note over Markfetch: MARKFETCH_USER_AGENT validated at startup
    Client->>Markfetch: fetch_markdown(url)
    Markfetch->>Undici: Dispatch with Chrome headers
    Undici->>Server: GET /path over HTTP/2<br/>Sec-CH-UA: "Chromium"<br/>Sec-CH-UA-Mobile: ?0<br/>Sec-CH-UA-Platform: "Windows"
    Server->>Undici: HTTP/2 200 OK<br/>text/html
    Undici->>Markfetch: HTML Content
    Markfetch->>Markfetch: Apply Readability
    Markfetch->>Markfetch: Convert to Markdown
    Markfetch->>Client: Clean Markdown

Configuration

Environment Variables

Variable	Default	Purpose
`MARKFETCH_TIMEOUT_MS`	`30000`	Per-request timeout in milliseconds
`MARKFETCH_MAX_BYTES`	`5000000`	Cap on response body and extracted markdown
`MARKFETCH_USER_AGENT`	Pinned Chrome 130	Browser fingerprint override

Validation

All environment variables are validated at startup. Invalid values cause the process to fail fast on stderr with descriptive error messages, rather than producing confusing per-request errors.

Sources: README.md

Integration Points

MCP Adapter

The MCP server (src/mcp.ts) uses the core fetch pipeline, which includes the HTTP/2 fingerprinting. Its tool description summarizes what the tool does:

Fetch a single public HTTP/S URL and return its main article content as clean markdown. Best for articles, documentation, blog posts, news, and reference pages. Non-HTML responses return unsupported_content_type.

Sources: src/mcp.ts

CLI Adapter

The CLI adapter (src/cli.ts) also uses the same core fetch pipeline, ensuring consistent HTTP/2 fingerprinting behavior whether invoked via MCP or command line:

markfetch https://en.wikipedia.org/wiki/Markdown

Sources: src/cli.ts

Version History

Version	Date	Change
0.4.0	2026-05-10	HTTP/2 fingerprinting feature added with Sec-CH-UA-* client hints
0.5.0	2026-05-12	CLI mode added with same fingerprinting behavior
0.6.0	Current	Enhanced write sandbox and validation

Sources: CHANGELOG.md

Limitations

SPA Handling

Pure client-rendered Single Page Applications (SPAs) with no static HTML content return extraction_failed. Sites that ship server-rendered or SEO-prerendered HTML will extract whatever static content they expose, including when accessed with Chrome fingerprints.

Authentication

Markfetch performs anonymous fetches only — no cookie jar, no auth headers, no session reuse. Pages behind login walls return whatever the public response is, usually surfaced as http_error.

Sources: README.md

Security Considerations

Because HTTP/2 fingerprinting makes requests look like ordinary browser traffic, responsibility for appropriate use rests with the user. The documentation explicitly states:

Use it on URLs whose targets you have permission to fetch, and respect the terms of service of any site you query. The maintainer assumes no liability for misuse.

Sources: README.md

Sources: src/core.ts

CLI Usage

Related topics: Quick Start Guide, MCP Server Integration, Write Sandbox Security

The markfetch CLI provides a command-line interface for fetching URLs and converting their content to clean markdown. It operates as one of two execution surfaces—the other being the MCP (Model Context Protocol) stdio server—with both sharing the same underlying core pipeline.

Overview

The CLI accepts a URL as its primary argument and outputs the converted markdown to stdout or to a specified file. It was introduced in version 0.5.0 as a way to make markfetch accessible from standard shell environments, pipelines, and scripts.

Aspect	Details
Entry Point	`markfetch <url>`
Output	stdout (default) or file via `-o`
Version	0.6.0
Runtime	Node.js ≥ 24
Distribution	npm package

Sources: README.md

Architecture

The CLI is implemented as an adapter layer that delegates to the shared core. When the process is invoked with arguments, the dispatcher in index.ts lazy-loads the CLI adapter; bare invocation (zero arguments) routes to the MCP server instead.

graph TD
    A["markfetch CLI Invocation<br/>process.argv.length > 2"] --> B["src/index.ts<br/>Dispatcher"]
    B --> C["src/cli.ts<br/>CLI Adapter"]
    C --> D["src/core.ts<br/>fetchMarkdown()"]
    D --> F["HTTP Fetch + Readability + Turndown"]

    G["Bare Invocation<br/>process.argv.length === 2"] --> H["src/mcp.ts<br/>MCP Server"]
    H --> E["src/sandbox.ts<br/>Write Validation"]

Sources: src/cli.ts:39-47

Command Syntax

markfetch <url> [options]

Arguments

Argument	Required	Description
`<url>`	Yes	Absolute http(s) URL to fetch

Options

Flag	Description
`-o, --output <path>`	Save markdown to a file (absolute or relative path). Default is stdout.
`-V, --version`	Print version and exit
`-h, --help`	Print usage and exit

Sources: src/cli.ts:23-30

Output Behavior

The CLI maintains strict separation between its output channels:

Scenario	Channel	Content
Raw markdown (no `-o`)	stdout	Raw markdown body via `process.stdout.write()`
File output (`-o`)	stdout	Confirmation: `Saved N bytes to <path>`
Any error	stderr	`[code] message`

The raw markdown is written using process.stdout.write() rather than console.log() to preserve trailing whitespace in the output—matching the exact bytes the MCP adapter would emit in content[0].text.

Sources: src/cli.ts:50-58

Error Handling

Errors are written to stderr with a deterministic format: [code] message. The process exits with a non-zero status code.

process.exitCode = 1;
console.error(`[${code}] ${message}`);

The CLI uses process.exitCode (not process.exit()) to ensure pending output drains before the process exits—important when stdout is piped to a slow consumer.

Sources: src/cli.ts:58-62

Error Codes

Code	Meaning
`network_error`	DNS / TCP / TLS failure
`http_error`	Upstream returned a non-2xx status
`timeout`	Request exceeded `MARKFETCH_TIMEOUT_MS`
`unsupported_content_type`	Response was not HTML
`extraction_failed`	No extractable article content
`too_large`	Response or markdown exceeded `MARKFETCH_MAX_BYTES`
`save_failed`	File write failed (permission denied, etc.)

Note: save_forbidden is MCP-only and does not apply to CLI (no sandbox).

Sources: README.md

Path Resolution

The CLI resolves relative output paths against the current working directory before passing them to the core:

const savePath = options.output
  ? resolve(process.cwd(), options.output)
  : undefined;

Tilde expansion is intentionally not performed—the shell expands ~/foo before argv reaches the process, and a quoted literal '~/foo' should produce a file named ~/foo in cwd (standard tool behavior).

Sources: src/cli.ts:32-39
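
The resolution rules can be checked with a small stand-alone sketch. The `resolveOutput` helper is hypothetical, and it uses `path.posix` only so that the results are the same on every host OS; the excerpt above uses the platform default `resolve`.

```typescript
import * as path from "node:path";

// Hypothetical helper mirroring the CLI's rule: relative paths resolve
// against cwd, absolute paths pass through, and "~" is never expanded.
function resolveOutput(cwd: string, output: string | undefined): string | undefined {
  return output ? path.posix.resolve(cwd, output) : undefined;
}

console.log(resolveOutput("/work", "out.md")); // /work/out.md
console.log(resolveOutput("/work", "/tmp/out.md")); // /tmp/out.md
console.log(resolveOutput("/work", "~/foo")); // /work/~/foo
console.log(resolveOutput("/work", undefined)); // undefined
```

The third call shows the quoted-tilde case: the literal `~` becomes a directory name under the working directory.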

Environment Variables

These environment variables apply to both CLI and MCP modes:

Variable	Default	Purpose
`MARKFETCH_TIMEOUT_MS`	`30000`	Per-request timeout in ms
`MARKFETCH_MAX_BYTES`	`5000000`	Cap on response body and extracted markdown
`MARKFETCH_USER_AGENT`	Chrome 130 string	Override User-Agent header

The CLI adapter imports fetchMarkdown and classifyError from the core module, which validates these environment variables at startup.

Sources: src/cli.ts:15 and README.md

File Structure

The project source is organized into a dispatcher, a shared core, two adapters, and the write sandbox:

src/
├── index.ts    # Dispatcher (lazy-loads cli.ts or mcp.ts)
├── core.ts     # Shared pipeline and errors
├── cli.ts      # CLI adapter (commander-based)
├── mcp.ts      # MCP stdio server adapter
└── sandbox.ts  # Write sandbox (path validation for file saves)

The lazy-import pattern ensures that cli.ts code (which calls console.log) is never loaded when running in MCP mode, preserving the "stdout is reserved for MCP frames" invariant structurally.

Sources: CHANGELOG.md and src/cli.ts:1-13

Installation

Install globally via npm:

npm i -g markfetch

Or use via npx without installation:

npx -y markfetch <url>

The bin entry in package.json points to dist/index.js:

{
  "bin": {
    "markfetch": "dist/index.js"
  }
}

Sources: package.json:16-18

Usage Examples

Basic fetch to stdout

markfetch https://en.wikipedia.org/wiki/Markdown

Save to file

markfetch https://example.com/article -o output.md

With timeout override

MARKFETCH_TIMEOUT_MS=60000 markfetch https://slow-site.example.com

Pipeline to another tool

markfetch https://example.com/doc | grep -A5 "## Installation"

Sources: README.md

MCP Server Integration

Related topics: Quick Start Guide, CLI Usage, Write Sandbox Security

Overview

The MCP (Model Context Protocol) Server Integration is the primary interface for AI agents to fetch web content as clean markdown. Markfetch exposes a single MCP tool fetch_markdown that accepts a URL and returns extracted markdown content, enabling language models like Claude to access web information through a standardized protocol.

The MCP server operates as a stdio-based server, meaning it communicates exclusively through standard input and standard output streams. This design ensures the server integrates seamlessly with MCP clients including Claude Desktop, Claude Code, Cursor, and Goose.

Architecture

Entry Point Dispatcher

The src/index.ts file implements an argv-discriminated dispatcher that determines whether to start the MCP server or the CLI based on the presence of command-line arguments:

if (process.argv.length === 2) {
  await import("./mcp.js");
} else {
  await import("./cli.js");
}

Sources: src/index.ts:26-29

When process.argv.length === 2, the process was invoked without arguments—this is the standard pattern MCP clients use when spawning a server. Any extra argument (URL, flags, --help) routes to the CLI adapter.
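
The routing decision can be isolated as a pure function to make the rule easy to see. The `selectAdapter` name and the `Adapter` type are hypothetical; the real dispatcher performs the dynamic imports shown above.

```typescript
type Adapter = "mcp" | "cli";

// argv[0] is the Node binary and argv[1] is the script path, so a bare
// invocation has exactly two entries. Anything longer goes to the CLI.
function selectAdapter(argv: readonly string[]): Adapter {
  return argv.length === 2 ? "mcp" : "cli";
}

console.log(selectAdapter(["node", "index.js"])); // mcp
console.log(selectAdapter(["node", "index.js", "https://example.com"])); // cli
console.log(selectAdapter(["node", "index.js", "--help"])); // cli
```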

Module Isolation

The dynamic import pattern ensures complete module isolation:

graph TD
    A[markfetch entry] --> B{argv.length === 2?}
    B -->|Yes| C[Lazy import: mcp.ts]
    B -->|No| D[Lazy import: cli.ts]
    C --> E[@modelcontextprotocol/sdk loaded]
    D --> F[commander loaded]
    E -.-> G[Never reaches console.log]
    F -.-> H[Can use console.log]

Sources: src/index.ts:18-22

This architecture enforces the "stdout is reserved for MCP frames" invariant structurally—the MCP path never imports cli.ts, so code that calls console.log is literally unreachable from the MCP execution path.

MCP Server Implementation

Server Initialization

The MCP server is initialized using the @modelcontextprotocol/sdk package:

const server = new McpServer({ name: "markfetch", version: "0.6.0" });

Sources: src/mcp.ts:20

Tool Registration

The server registers a single tool fetch_markdown with a Zod-based input schema:

server.registerTool(
  "fetch_markdown",
  {
    description: "Fetch a single public HTTP/S URL and return its main article content as clean markdown...",
    inputSchema: {
      url: z.string().url().describe("Absolute http(s) URL of the page to fetch..."),
      savePath: z.string().refine(isAbsolute, "savePath must be an absolute filesystem path").optional().describe("Optional. When provided...")
    }
  },
  async ({ url, savePath }) => {
    // Implementation
  }
);

Sources: src/mcp.ts:22-47

Tool Input Schema

Parameter	Type	Required	Description
`url`	string	Yes	Absolute http(s) URL of the page to fetch. The server follows redirects automatically. No authentication headers, cookies, or session state are sent.
`savePath`	string	No	Optional absolute filesystem path. When provided, the fetched markdown is written to this path instead of returned inline.

The url parameter is validated using Zod's .url() method to ensure a valid URL format. The savePath parameter must be an absolute path, enforced by the .refine(isAbsolute, ...) check.

Response Format

The tool returns a response in this structure:

{
  content: [{ type: "text", text: "markdown content or [code] message" }],
  isError: boolean
}

Sources: src/mcp.ts:8-12

Error Handling

Error Code System

The MCP adapter uses a uniform error code system with 8 deterministic codes:

Error Code	Description	Source
`network_error`	DNS/TCP/TLS failure or unexpected internal error	core.ts
`http_error`	Upstream returned non-2xx status	core.ts
`timeout`	Per-request budget exceeded	core.ts
`unsupported_content_type`	Response was not text/html or application/xhtml+xml	core.ts
`extraction_failed`	Readability returned no article content	core.ts
`too_large`	Response or markdown exceeded MARKFETCH_MAX_BYTES	core.ts
`save_failed`	writeFile failed (permission denied, missing directory)	core.ts
`save_forbidden`	savePath resolves outside allowed write roots	src/mcp.ts

Error Result Factory

function errorResult(code: ErrorCode, message: string) {
  return {
    content: [{ type: "text" as const, text: `[${code}] ${message}` }],
    isError: true,
  };
}

Sources: src/mcp.ts:8-12
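
On the receiving side, the `[code] message` text can be split back into its parts. The sketch below is an assumption about how a client might do this and is not part of markfetch: `parseErrorText`, `ParsedError`, and the regular expression are hypothetical, while the eight codes are the ones listed in the table above.

```typescript
const ERROR_CODES = [
  "network_error", "http_error", "timeout", "unsupported_content_type",
  "extraction_failed", "too_large", "save_failed", "save_forbidden",
] as const;

type ErrorCode = (typeof ERROR_CODES)[number];

interface ParsedError {
  code: ErrorCode;
  message: string;
}

// Returns null when the text is not a recognized "[code] message" string,
// for example when it is ordinary markdown.
function parseErrorText(text: string): ParsedError | null {
  const match = /^\[([a-z_]+)\] ([\s\S]*)$/.exec(text);
  if (!match) return null;
  const rawCode = match[1];
  const code = ERROR_CODES.find((known) => known === rawCode);
  if (!code) return null;
  return { code, message: match[2] ?? "" };
}
```

A client would normally check `isError` first and only then parse the text, since markdown that happens to start with a bracketed word is not an error.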

Error Propagation Pattern

In version 0.5.0, error handling was refactored so that core functions now throw MarkfetchError instead of returning error results inline. Both the MCP and CLI adapters catch these exceptions and convert them to their respective output formats.

Sources: CHANGELOG.md:19-21

Write Sandbox (MCP-Specific)

The MCP server implements a write sandbox that restricts savePath operations to a set of allowed root directories.

Default Allowed Roots

By default, the allowed set is:

os.tmpdir() (system temp directory)
process.cwd() (current working directory)

Each path is resolved via fs.realpath at startup to handle symlinks.

Configuration

The MARKFETCH_ALLOWED_WRITE_ROOTS environment variable overrides the default set entirely:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
      }
    }
  }
}

Sources: README.md:89-100

Security Rationale

The sandbox is MCP-only by design. The CLI is unrestricted because "a human at the shell is the security boundary." The asymmetry exists because the MCP tool is driven by a language model, which may be steered by content from a page it just fetched.

Sources: README.md:102-104
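
A containment check of the kind the sandbox performs can be sketched as follows. This is an illustration under assumptions, not the code in src/sandbox.ts: `parseRoots` and `isInsideRoots` are hypothetical names, `path.posix` is used so the demo behaves the same on every OS, and symlink resolution (which the manual says is done with fs.realpath) is left out.

```typescript
import * as path from "node:path";

// Split a delimited list of roots; ":" is the POSIX delimiter, and real
// code would likely use path.delimiter to cover Windows as well.
function parseRoots(value: string, delimiter = ":"): string[] {
  return value.split(delimiter).map((p) => p.trim()).filter((p) => p !== "");
}

// True when savePath, once normalized, lies strictly inside one of the roots.
function isInsideRoots(savePath: string, roots: readonly string[]): boolean {
  const target = path.posix.resolve(savePath);
  return roots.some((root) => {
    const rel = path.posix.relative(path.posix.resolve(root), target);
    // A relative path that climbs upward means the target is outside the root.
    return rel !== "" && rel !== ".." && !rel.startsWith("../");
  });
}

const roots = parseRoots("/Users/me/markfetch-out:/tmp");
console.log(isInsideRoots("/tmp/notes/page.md", roots)); // true
console.log(isInsideRoots("/tmp/../etc/passwd", roots)); // false
console.log(isInsideRoots("/tmpfoo/page.md", roots)); // false
```

Comparing relative paths instead of string prefixes is what rejects `/tmpfoo/page.md`, which shares a prefix with `/tmp` but is not inside it.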

Request Flow

sequenceDiagram
    participant Client as MCP Client
    participant MCP as MCP Server
    participant Core as fetchMarkdown()
    participant Fetch as HTTP Fetcher

    Client->>MCP: fetch_markdown({url, savePath?})
    MCP->>Core: fetchMarkdown({url, savePath})
    Core->>Fetch: GET url (with Chrome fingerprint)
    Fetch-->>Core: HTML response
    Core->>Core: Readability parsing
    Core->>Core: Turndown conversion
    alt savePath provided
        Core->>Core: Write to file (within sandbox)
    end
    Core-->>MCP: {markdown, bytes, savedTo?}
    MCP-->>Client: {content: [{text: markdown}], isError: false}

Environment Configuration

Variable	Default	Purpose	MCP-Specific
`MARKFETCH_TIMEOUT_MS`	`30000`	Per-request timeout in ms	No
`MARKFETCH_MAX_BYTES`	`5000000`	Cap on response body and extracted markdown	No
`MARKFETCH_USER_AGENT`	Chrome 130 string	Override the User-Agent header	No
`MARKFETCH_ALLOWED_WRITE_ROOTS`	`os.tmpdir()` + `process.cwd()`	Permitted write roots for savePath	Yes

Sources: src/mcp.ts:1-5, README.md:68-75

Integration with Clients

Claude Desktop / Claude Code

claude mcp add --scope user markfetch -- npx -y markfetch

Sources: README.md:40-43

Codex

codex mcp add markfetch -- npx -y markfetch

Sources: README.md:46-48

Manual Configuration

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

Sources: README.md:52-58

Dependencies

The MCP server depends on:

Package	Version	Purpose
`@modelcontextprotocol/sdk`	^1.29.0	MCP protocol implementation
`zod`	^3.0.0	Input schema validation
`@mozilla/readability`	^0.5.0	Article extraction
`turndown`	^7.0.0	HTML to Markdown conversion
`undici`	^8.2.0	HTTP client
`linkedom`	^0.18.0	DOM parsing

Sources: package.json:36-47

Source: https://github.com/vasylenko/markfetch / Human Manual

Environment Variables

Related topics: HTTP/2 Fingerprinting, Write Sandbox Security, Error Handling

markfetch uses environment variables to configure runtime behavior at startup. These variables control network timeouts, response size limits, HTTP fingerprinting, and file write permissions for the MCP server.

Overview

Environment variables in markfetch serve as the primary configuration mechanism. Unlike per-request options, these settings apply globally to every operation and are validated once at process startup. This fail-fast design prevents misconfiguration from producing confusing per-request errors later.

graph TD
    A[Process Start] --> B[Validate MARKFETCH_TIMEOUT_MS]
    A --> C[Validate MARKFETCH_MAX_BYTES]
    A --> D[Validate MARKFETCH_USER_AGENT]
    A --> E[Build MARKFETCH_ALLOWED_WRITE_ROOTS]
    B --> F{Valid?}
    C --> F
    D --> F
    E --> F
    F -->|Yes| G[Server Ready]
    F -->|No| H[Exit with stderr error]

All validation occurs before the server begins accepting requests. Invalid values cause immediate process termination with a descriptive error message written to stderr.

Configuration Variables

MARKFETCH_TIMEOUT_MS

Property	Value
Default	`30000` (30 seconds)
Purpose	Per-request timeout in milliseconds
Type	Positive integer

Controls the maximum duration allowed for any single HTTP request, including DNS resolution, TCP connection, TLS handshake, and response body transfer.

const config = {
  timeoutMs: intEnv("MARKFETCH_TIMEOUT_MS", 30_000),
};

Validation rejects non-positive integers, non-integer values, and non-finite numbers (NaN, Infinity). A malformed value produces:

[core] Error: Invalid MARKFETCH_TIMEOUT_MS="abc" — expected a positive integer.

Sources: src/core.ts:1-50

MARKFETCH_MAX_BYTES

Property	Value
Default	`5000000` (5 MB, about 4.77 MiB)
Purpose	Cap on response body and extracted markdown
Type	Positive integer

Both the raw HTTP response body and the final extracted markdown are checked against this limit. If either exceeds the cap, the operation returns a too_large error.

const config = {
  maxBytes: intEnv("MARKFETCH_MAX_BYTES", 5_000_000),
};

Sources: src/core.ts:1-50
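
Because the limit is expressed in bytes, multi-byte characters count for more than one unit. The sketch below shows one way such a check could look; `exceedsCap` is a hypothetical helper, and measuring the UTF-8 encoded length is an assumption about how the cap is applied.

```typescript
// Hypothetical helper: true when the UTF-8 encoding of `text` is larger
// than `maxBytes`. Assumes the cap is measured in encoded bytes.
function exceedsCap(text: string, maxBytes: number): boolean {
  return Buffer.byteLength(text, "utf8") > maxBytes;
}

const maxBytes = 5_000_000; // the default MARKFETCH_MAX_BYTES

// 5,000,000 one-byte characters sit exactly at the cap.
console.log(exceedsCap("a".repeat(5_000_000), maxBytes)); // false

// 2,500,001 two-byte characters encode to 5,000,002 bytes.
console.log(exceedsCap("é".repeat(2_500_001), maxBytes)); // true
```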

MARKFETCH_USER_AGENT

Property	Value
Default	`Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36`
Purpose	HTTP User-Agent header and Sec-CH-UA-* client hints
Type	String (must contain `Chrome/<major version>`)

The User-Agent string determines both the HTTP header sent to servers and the derived Sec-CH-UA-* client hints. The hints are derived at startup and remain fixed for the process lifetime.

graph LR
    A[MARKFETCH_USER_AGENT] --> B[deriveClientHints]
    B --> C[Sec-CH-UA]
    B --> D[Sec-CH-UA-Mobile]
    B --> E[Sec-CH-UA-Platform]
    A --> F[User-Agent Header]

function deriveClientHints(ua: string): {
  brands: string;
  mobile: string;
  platform: string;
} {
  const versionMatch = /\bChrome\/(\d+)/.exec(ua);
  if (!versionMatch) {
    throw new Error(
      `Invalid MARKFETCH_USER_AGENT=${JSON.stringify(ua)} — expected a Chrome User-Agent containing "Chrome/..."`
    );
  }
  // ...
}

The UA must contain a Chrome version string. Non-Chrome UAs fail fast at startup to prevent fingerprinting mismatches that would increase bot detection.

Sources: src/core.ts:1-50
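
The derivation elided in the snippet above can be approximated as follows. This is a sketch under stated assumptions: `sketchClientHints` is a hypothetical name, and the brand list and platform detection rules are illustrative guesses rather than the values markfetch actually emits.

```typescript
interface ClientHints {
  brands: string;   // value for Sec-CH-UA
  mobile: string;   // value for Sec-CH-UA-Mobile
  platform: string; // value for Sec-CH-UA-Platform
}

// Hypothetical re-creation of the derivation step.
function sketchClientHints(ua: string): ClientHints {
  const versionMatch = /\bChrome\/(\d+)/.exec(ua);
  if (!versionMatch) {
    throw new Error(`Not a Chrome User-Agent: ${JSON.stringify(ua)}`);
  }
  const major = versionMatch[1];

  // Assumed brand list; real Chrome varies the placeholder brand by version.
  const brands = `"Chromium";v="${major}", "Google Chrome";v="${major}", "Not?A_Brand";v="99"`;

  // Structured-header boolean: ?1 for mobile, ?0 otherwise.
  const mobile = /\bMobile\b/.test(ua) ? "?1" : "?0";

  // Android UAs also contain "Linux", so Android is checked first.
  let platform = '"Unknown"';
  if (ua.includes("Windows")) platform = '"Windows"';
  else if (ua.includes("Android")) platform = '"Android"';
  else if (ua.includes("Macintosh")) platform = '"macOS"';
  else if (ua.includes("Linux")) platform = '"Linux"';

  return { brands, mobile, platform };
}

const defaultUa =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36";
console.log(sketchClientHints(defaultUa));
```

Deriving all three values from one string keeps the header set internally consistent, which is the point of rejecting non-Chrome User-Agents at startup.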

Write Sandbox (MCP-Only)

MARKFETCH_ALLOWED_WRITE_ROOTS

Property	Value
Default	`os.tmpdir() ∪ process.cwd()`
Purpose	Restrict MCP `savePath` writes to specific directories
Type	Platform-delimiter-separated absolute paths
Platform	POSIX: `:` delimiter; Windows: `;` delimiter
Mode	MCP-only (CLI has no sandbox)

This variable applies exclusively to the MCP server mode. The CLI operates without restriction, treating the human at the shell as the security boundary.

graph TD
    A[MCP savePath request] --> B{Path inside allowed roots?}
    B -->|Yes| C[Write file]
    B -->|No| D[Return save_forbidden error]
    C --> E[Confirmation to client]
    D --> F[No file created]

When set, the value replaces the defaults entirely rather than merging with them. To retain access to the default directories, include them explicitly:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
      }
    }
  }
}

On Windows:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "C:\\Users\\me\\markfetch-out;C:\\Users\\me\\AppData\\Local\\Temp"
      }
    }
  }
}
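
Because a custom value replaces the defaults, it can help to build the string programmatically when the defaults should stay available. The following sketch assumes a hypothetical `withDefaultRoots` helper; note that `process.cwd()` here is the working directory of whichever process runs the helper, which may differ from the directory the MCP client launches the server in.

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: append the documented defaults to custom roots and
// join them with the platform delimiter (":" on POSIX, ";" on Windows).
function withDefaultRoots(extraRoots: string[]): string {
  return [...extraRoots, os.tmpdir(), process.cwd()].join(path.delimiter);
}

// The result can be pasted into the "env" block of the client configuration.
console.log(withDefaultRoots(["/Users/me/markfetch-out"]));
```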

Validation Rules

Each entry in the list is validated as follows:

It must be an absolute path (relative paths fail fast)
It must be an existing directory at startup
It is resolved through symlinks before containment checks

function buildAllowedRoots(envValue?: string): string[] {
  // ...
}

Symlinks pointing outside the sandbox are blocked. The canonicalized path flows from the containment check into writeFile, ensuring the file is created exactly at the validated location.

Sources: src/sandbox.ts:1-50, src/mcp.ts:1-50

Error Codes

When environment variable validation fails, markfetch writes to stderr and exits with a non-zero status. The sandbox's runtime error is listed for contrast:

Failure Class	Trigger	Exit Status
Startup failure	Invalid MARKFETCH_TIMEOUT_MS	Non-zero
Startup failure	Invalid MARKFETCH_MAX_BYTES	Non-zero
Startup failure	Non-Chrome MARKFETCH_USER_AGENT	Non-zero
Startup failure	Malformed MARKFETCH_ALLOWED_WRITE_ROOTS	Non-zero
Runtime error	`save_forbidden` (MCP only)	None (returned to the MCP client with `isError: true`)

Startup failures caused by invalid environment values (e.g., MARKFETCH_TIMEOUT_MS="abc") differ from request-scoped errors like http_error or timeout. Environment misconfiguration is always fatal at startup.

Environment Variable Summary

Variable	Default	Scope	Purpose
`MARKFETCH_TIMEOUT_MS`	`30000`	Both	Request timeout in ms
`MARKFETCH_MAX_BYTES`	`5000000`	Both	Response and markdown size cap
`MARKFETCH_USER_AGENT`	Chrome 130 string	Both	HTTP fingerprint
`MARKFETCH_ALLOWED_WRITE_ROOTS`	tmpdir + cwd	MCP only	Write sandbox boundaries

Configuration Priority

Environment variables set at process startup take precedence over all other configuration. There is no runtime override mechanism—changing these values requires restarting the server.

graph TD
    A[Environment Variable] --> B[Validated at Startup]
    B --> C[Stored in config object]
    C --> D[Used by core.ts pipeline]
    D --> E[HTTP Request]
    D --> F[File Write]
    D --> G[Response Validation]

Security Considerations

The write sandbox exists because the MCP tool is driven by a language model, which may be steered by content from a page it just fetched. Without sandboxing, a malicious page could induce the model to request writes outside expected directories.

The CLI intentionally has no sandbox—direct human invocation at the shell establishes the trust boundary.

Sources: README.md:1-100

Sources: src/core.ts:1-50

Write Sandbox Security

Related topics: MCP Server Integration, Environment Variables, Error Handling

Overview

The Write Sandbox is a security mechanism in markfetch that restricts filesystem writes initiated via the MCP (Model Context Protocol) interface to a configurable set of allowed root directories. This protection prevents a language model, which may be influenced by fetched content, from writing files to arbitrary locations on the host system.

The sandbox enforces path containment by resolving symlinks and comparing canonicalized paths against the configured allowed roots. Any attempted write outside the sandbox boundary returns a save_forbidden error and the file is never created.

Purpose and Scope

Security Boundary

The sandbox exists because MCP tools are driven by a language model that can be steered by content from pages it fetches. Without containment:

A malicious or compromised webpage could instruct the LLM to write files to sensitive locations (e.g., ~/.ssh/authorized_keys, ~/.bashrc)
Path traversal attempts via symlinks could escape expected boundaries
Untrusted fetched content could modify configuration files or inject malicious code

The CLI mode intentionally has no sandbox. A human at the shell is considered the security boundary, as the user has direct control over command invocation and can review output before it reaches any model.

Scope Limitations

Scope	Sandboxed?
MCP server (`fetch_markdown` tool)	Yes
CLI mode (`markfetch <url>`)	No
Direct `node` execution	No

Sources: README.md:68-70

Configuration

Environment Variable

Variable	Type	Default	Description
`MARKFETCH_ALLOWED_WRITE_ROOTS`	String	`os.tmpdir()` + `process.cwd()`	Path-delimiter-separated list of absolute paths permitted as MCP `savePath` write roots

Path Delimiters

The delimiter varies by platform:

Platform	Delimiter	Example
POSIX (Linux, macOS)	`:`	`/tmp:/home/user/markfetch-out`
Windows	`;`	`C:\Users\me\markfetch-out;C:\Temp`

Behavior Rules

Replacement, not merge: When set, the variable replaces the defaults entirely. To retain access to os.tmpdir() or process.cwd(), explicitly include them.

Validation at startup: Malformed values (non-absolute entries, nonexistent directories) cause the server to fail fast on stderr.

Realpath resolution: Each root is resolved once via fs.realpath at startup to canonicalize symlinks.

Sources: README.md:71-89

Configuration Example

POSIX:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
      }
    }
  }
}

Windows:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "C:\\Users\\me\\markfetch-out;C:\\Users\\me\\AppData\\Local\\Temp"
      }
    }
  }
}

Security Model

Path Resolution Flow

graph TD
    A[User provides savePath] --> B{Is path absolute?}
    B -->|No| E[Error: savePath must be absolute]
    B -->|Yes| C[Resolve via fs.realpath]
    C --> D{Is resolved path inside allowed roots?}
    D -->|Yes| F[Allow write to resolved path]
    D -->|No| G[Return save_forbidden error]
    
    H[Allowed roots from env] --> I[Realpath-resolved at startup]
    I --> D

Symlink Handling

The sandbox protects against symlink-based escapes:

Resolve before check: Symlinks are resolved via fs.realpath before containment validation
Reuse at write time: The canonicalized path from the validation check flows directly into writeFile, so the path is not resolved a second time
No lexical comparison: A path like <sandbox>/link/.. is not compared lexically against the roots—it's resolved first, then validated

This prevents attacks where a symlink planted inside the sandbox points outside, collapsing lexically for the check but resolving to an external location at write time.

Sources: CHANGELOG.md:17-25
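
The difference between lexical and resolved paths can be demonstrated with a throwaway directory tree. The sketch below is an illustration, not the project's implementation: `canonicalTarget` is a hypothetical helper that canonicalizes the parent directory and reattaches the file name (the file itself may not exist yet), and the use of `fs.realpathSync.native` is an assumption chosen so that the operating system resolves the path. It creates symlinks, so it is intended for POSIX systems.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: resolve the parent directory through the OS, then
// reattach the final path segment.
function canonicalTarget(savePath: string): string {
  const parent = fs.realpathSync.native(path.dirname(savePath));
  return path.join(parent, path.basename(savePath));
}

// Build <base>/sandbox/link -> <base>/outside/deep in a temp directory.
const base = fs.realpathSync.native(
  fs.mkdtempSync(path.join(os.tmpdir(), "markfetch-demo-")),
);
const sandbox = path.join(base, "sandbox");
const outside = path.join(base, "outside");
fs.mkdirSync(sandbox);
fs.mkdirSync(path.join(outside, "deep"), { recursive: true });
fs.symlinkSync(path.join(outside, "deep"), path.join(sandbox, "link"));

// <sandbox>/link/../out.md, built without normalizing the ".." away.
const tricky = [sandbox, "link", "..", "out.md"].join(path.sep);

const lexical = path.normalize(tricky);    // collapses to <base>/sandbox/out.md
const canonical = canonicalTarget(tricky); // resolves to <base>/outside/out.md

console.log({ lexical, canonical });
fs.rmSync(base, { recursive: true, force: true });
```

A containment check run against `lexical` would pass, while the write would land at `canonical`; checking and writing the same canonical string closes that gap.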

Platform-Specific Behaviors

Platform	Case Sensitivity	Notes
Linux/macOS	Case-sensitive	Paths must match exactly
Windows	Case-insensitive	`C:\Users\Bob` and `c:\users\bob` are equivalent

On Windows, the containment check lowercases both the root and target paths before comparison.

Sources: src/sandbox.ts:28-30

Core Implementation

API Design

The sandbox module exposes two primary functions:

function buildAllowedRoots(env: Record<string, string | undefined>): string[]
function validateSavePath(
  savePath: string,
  roots: string[]
): { ok: boolean; resolved?: string; reason?: string }

`buildAllowedRoots()`

Parses MARKFETCH_ALLOWED_WRITE_ROOTS from environment variables:

Parameter	Type	Description
`env`	`Record<string, string | undefined>`	Process environment variables

Return Type	Description
`string[]`	Array of absolute, realpath-resolved directory paths

Logic:

If MARKFETCH_ALLOWED_WRITE_ROOTS is unset: return [os.tmpdir(), process.cwd()]
If set: split by platform delimiter, validate each is absolute and exists
Resolve each via fs.realpath for canonical form
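
The logic above can be sketched as follows. This is a minimal approximation under stated assumptions: `sketchBuildAllowedRoots` is a hypothetical name, it takes the raw variable value rather than the whole environment, and the error wording is invented for illustration.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical approximation of the root-building step.
function sketchBuildAllowedRoots(raw: string | undefined): string[] {
  // Unset: fall back to the documented defaults.
  const entries =
    raw === undefined ? [os.tmpdir(), process.cwd()] : raw.split(path.delimiter);

  return entries.map((entry) => {
    if (!path.isAbsolute(entry)) {
      throw new Error(`Write root ${JSON.stringify(entry)} is not an absolute path.`);
    }
    // throwIfNoEntry: false yields undefined instead of throwing on ENOENT.
    const stat = fs.statSync(entry, { throwIfNoEntry: false });
    if (!stat?.isDirectory()) {
      throw new Error(`Write root ${JSON.stringify(entry)} is not an existing directory.`);
    }
    // Canonicalize once so later containment checks compare like with like.
    return fs.realpathSync.native(entry);
  });
}

console.log(sketchBuildAllowedRoots(undefined));
```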

`validateSavePath()`

Validates a save path is within allowed roots:

Parameter	Type	Description
`savePath`	`string`	The requested save path
`roots`	`string[]`	Allowed root directories

Return Type	Description
`{ ok: true, resolved: string }`	Path is allowed; `resolved` is the canonicalized path for writing
`{ ok: false, reason: string }`	Path is outside sandbox; `reason` describes the violation

Validation steps:

Resolve savePath via fs.realpath
For each root, compute relative path from root to resolved target
If relative path is empty (same directory) or does not start with .. and is not absolute: allow
Otherwise: reject with reason listing allowed roots

Sources: src/sandbox.ts:1-50
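
The relative-path containment test in the steps above can be sketched on its own, separate from symlink resolution. `isInside` is a hypothetical helper, and the optional third parameter exists only so the example can exercise POSIX and Windows rules on any machine; both inputs are assumed to be canonicalized already.

```typescript
import * as path from "node:path";

// Hypothetical containment check over already-canonicalized paths.
function isInside(
  root: string,
  target: string,
  p: path.PlatformPath = process.platform === "win32" ? path.win32 : path.posix,
): boolean {
  // Case folding applies to Windows paths only.
  const fold = (s: string) => (p === path.win32 ? s.toLowerCase() : s);
  const rel = p.relative(fold(root), fold(target));

  if (rel === "") return true;         // target is the root itself
  if (p.isAbsolute(rel)) return false; // e.g. a different drive on Windows
  // Reject ".." and "../x", but allow names that merely begin with dots.
  return rel !== ".." && !rel.startsWith(".." + p.sep);
}

console.log(isInside("/tmp", "/tmp/notes/out.md", path.posix)); // true
console.log(isInside("/tmp", "/etc/passwd", path.posix));       // false
console.log(isInside("C:\\Users\\Bob", "c:\\users\\bob\\out.md", path.win32)); // true
```

Comparing on the relative path rather than a string prefix avoids treating `/tmpfoo` as if it were inside `/tmp`.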

Error Handling

Error Codes

Code	Condition	Response
`save_forbidden`	`savePath` resolves outside allowed roots	No file written; MCP returns error
`save_failed`	`savePath` is valid but `writeFile` fails	No file written; MCP returns error

Error Message Format

A save_forbidden error uses the format:

[save_forbidden] '<path>' is outside the allowed write roots: ['/allowed/root1', '/allowed/root2']

This provides:

The attempted path
The reason for rejection
The list of allowed roots for debugging

Sources: src/mcp.ts:8-13

MCP Integration

Tool Schema

server.registerTool("fetch_markdown", {
  inputSchema: {
    url: z.string().url().describe("..."),
    savePath: z.string()
      .refine(isAbsolute, "savePath must be an absolute filesystem path")
      .optional()
      .describe("Optional. When provided, the fetched markdown is written to this absolute filesystem path...")
  }
});

Validation Flow

MCP adapter receives savePath parameter
Validates path is absolute (via Zod schema)
Calls validateSavePath(savePath, allowedRoots)
If ok: false: throw MarkfetchError with save_forbidden code
If ok: true: use resolved path for writeFile

Sources: src/mcp.ts:24-35

Architecture Diagram

graph LR
    subgraph MCP_Client
        A[LLM sends fetch_markdown with savePath]
    end
    
    subgraph MCP_Server
        B[src/mcp.ts - MCP adapter]
        C[src/core.ts - fetchMarkdown]
        D[src/sandbox.ts - validateSavePath]
    end
    
    subgraph File_System
        E[fs.realpath resolution]
        F[fs.writeFile]
    end
    
    A --> B
    B -->|validate path| D
    D -->|resolve symlink| E
    E -->|check containment| D
    D -->|ok: true| C
    C -->|write markdown| F
    
    D -->|ok: false| B
    B -->|save_forbidden| A

CLI vs MCP Behavior

Aspect	CLI Mode	MCP Mode
Write sandbox	None	Enforced
Path validation	Not performed	Required
Symlink resolution	Not performed	Required
`savePath` parameter	Optional, `-o` flag	Optional, tool parameter
Relative path resolution	Resolves against cwd	Not allowed (must be absolute)

The CLI adapter resolves relative paths internally for convenience, but the MCP adapter requires absolute paths and enforces the sandbox.

Sources: src/cli.ts:6-18

Security Considerations

Attack Vectors Mitigated

Path traversal: ../../etc/passwd is resolved before checking
Symlink escape: <sandbox>/link_to_external is resolved and rejected
Case confusion (Windows): C:\Users\Bob equals c:\users\bob
Tilde expansion: Not performed; shell expands ~ before argv reaches process

Remaining Trust Boundaries

Boundary	Description
Filesystem permissions	Sandbox does not override OS file permissions
Network	Does not prevent network-based attacks
Content injection	Does not sanitize markdown content before writing

File	Role
`src/sandbox.ts`	Core sandbox validation logic
`src/mcp.ts`	MCP server adapter, uses sandbox
`src/cli.ts`	CLI adapter, no sandbox
`src/core.ts`	Core fetch pipeline
`README.md`	User documentation and configuration
`CHANGELOG.md`	Historical security fix for symlink escape

Changelog

Version	Change
0.6.0	Current release; write sandbox and `save_forbidden` error introduced
0.5.0	CLI mode added (unrestricted by design)
< 0.5.0	MCP-only

Sources: package.json:3

Sources: README.md:68-70

Error Handling

Related topics: Processing Pipeline, Write Sandbox Security, Environment Variables

markfetch implements a deterministic, structured error handling system that provides consistent error reporting across both CLI and MCP interfaces. All errors are categorized into specific codes that enable precise failure diagnosis and appropriate recovery strategies.

Error Code Reference

markfetch defines eight deterministic error codes that cover all failure scenarios. Each code is designed to be actionable, helping callers understand exactly what went wrong and how to respond.

Error Code	Meaning	Typical Cause
`network_error`	DNS, TCP, or TLS failure	Firewall blocking, network unavailable, invalid hostname
`http_error`	Non-2xx HTTP response	404 page not found, 403 forbidden, 500 server error
`timeout`	Request exceeded `MARKFETCH_TIMEOUT_MS`	Slow server, large page, network latency
`unsupported_content_type`	Response is not HTML	Binary files, JSON APIs, PDF documents
`extraction_failed`	Readability found no article content	Pure client-rendered SPAs with no static HTML
`too_large`	Body or markdown exceeded `MARKFETCH_MAX_BYTES`	Very large articles with embedded media
`save_failed`	File write operation failed	Missing parent directory, permission denied
`save_forbidden`	Save path outside allowed write roots	Path traverses symlink outside sandbox

Sources: README.md

Error Architecture

The error handling system follows a layered architecture where core validation and error creation happen in src/core.ts, while each adapter (CLI and MCP) provides interface-specific error formatting and reporting.

graph TD
    A[Request] --> B[core.ts Validation]
    B --> C{Error Condition?}
    C -->|No| D[Successful Fetch]
    C -->|Yes| E[MarkfetchError Thrown]
    E --> F[Adapter Layer]
    F --> G[CLI Adapter]
    F --> H[MCP Adapter]
    G --> I["stderr: [code] message"]
    H --> J["content[0].text: [code] message"]
    J --> K[isError: true]

Sources: src/core.ts, src/cli.ts, src/mcp.ts

MarkfetchError Class

The central error type is MarkfetchError, which encapsulates both the error code and human-readable message. This class serves as the single error type thrown throughout the application.

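class MarkfetchError extends Error {
  constructor(
    public readonly code: ErrorCode,
    message: string,
  ) {
    // Extending Error keeps stack traces and `instanceof Error` checks working
    super(message);
    this.name = "MarkfetchError";
  }
}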

Sources: src/core.ts:1-100

Environment Variable Validation

markfetch validates configuration environment variables at startup to fail fast on misconfiguration rather than producing confusing per-request errors.

Variable	Default	Validation Rules
`MARKFETCH_TIMEOUT_MS`	`30000`	Positive integer
`MARKFETCH_MAX_BYTES`	`5000000`	Positive integer
`MARKFETCH_USER_AGENT`	Chrome 130 UA string	Must contain `Chrome/<major version>`

The intEnv function performs validation:

function intEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw == null || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isFinite(n) || !Number.isInteger(n) || n <= 0) {
    throw new Error(
      `Invalid ${name}=${JSON.stringify(raw)} — expected a positive integer.`,
    );
  }
  return n;
}

Sources: src/core.ts:1-100
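
A few inputs illustrate what this validation accepts and rejects. The sketch below restates the same checks in a hypothetical `intFrom` helper that takes the environment as a parameter, so it can be exercised without touching `process.env`; the helper name, its parameter, and the error wording are assumptions for illustration.

```typescript
// Hypothetical variant of the validation above with an injectable environment.
function intFrom(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number,
): number {
  const raw = env[name];
  if (raw == null || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isFinite(n) || !Number.isInteger(n) || n <= 0) {
    throw new Error(`Invalid ${name}=${JSON.stringify(raw)}: expected a positive integer.`);
  }
  return n;
}

const name = "MARKFETCH_TIMEOUT_MS";
console.log(intFrom({}, name, 30_000));                 // 30000 (unset)
console.log(intFrom({ [name]: "" }, name, 30_000));     // 30000 (empty)
console.log(intFrom({ [name]: "1500" }, name, 30_000)); // 1500
// Number() also parses exponent and hex notation, so these pass the checks:
console.log(intFrom({ [name]: "1e3" }, name, 30_000));  // 1000
console.log(intFrom({ [name]: "0x10" }, name, 30_000)); // 16

for (const bad of ["abc", "1.5", "0", "-5", "Infinity"]) {
  try {
    intFrom({ [name]: bad }, name, 30_000);
  } catch (err) {
    console.log((err as Error).message);
  }
}
```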

User-Agent Validation

The MARKFETCH_USER_AGENT value must be a valid Chrome User-Agent string. This requirement exists because Sec-CH-UA-* client hints are derived from the User-Agent at startup, and a User-Agent that disagrees with its client hints is a stronger bot signal than a missing one.

function deriveClientHints(ua: string): {
  brands: string;
  mobile: string;
  platform: string;
} {
  const versionMatch = /\bChrome\/(\d+)/.exec(ua);
  if (!versionMatch) {
    throw new Error(
      `Invalid MARKFETCH_USER_AGENT=${JSON.stringify(ua)} — expected a Chrome User-Agent containing "Chrome/VERSION".`,
    );
  }
  // ...
}

Sources: src/core.ts:1-100

CLI Error Handling

The CLI adapter catches errors thrown from core and formats them for stderr output. Error output follows a consistent [code] message format that matches the MCP error format exactly.

try {
  const { markdown, bytes, savedTo } = await fetchMarkdown({
    url,
    savePath,
  });
  // ... success handling
} catch (err) {
  const { code, message } = classifyError(err);
  console.error(`[${code}] ${message}`);
  // Use exitCode so pending output drains before process exits
  process.exitCode = 1;
}

Sources: src/cli.ts:1-50

CLI Exit Codes

Scenario	Exit Code	Output
Success (stdout)	0	Raw markdown
Success (save to file)	0	`Saved X bytes to /path`
Any error	1	`[code] message` to stderr

The use of process.exitCode = 1 (rather than process.exit(1)) ensures pending stdout/stderr output drains before the process terminates, which is important when stdout is piped to a slow consumer.

Sources: src/cli.ts:1-50

MCP Error Handling

The MCP adapter returns errors in a format compatible with the MCP protocol. Errors appear in the content[0].text field with isError: true set.

function errorResult(code: ErrorCode, message: string) {
  return {
    content: [{ type: "text" as const, text: `[${code}] ${message}` }],
    isError: true,
  };
}

Sources: src/mcp.ts:1-50

MCP Response Structure for Errors

{
  "content": [
    {
      "type": "text",
      "text": "[network_error] DNS lookup failed"
    }
  ],
  "isError": true
}

Sources: src/mcp.ts:1-50

Write Sandbox Errors

The MCP interface enforces a write sandbox that restricts file saves to configured root directories. Errors occur when savePath resolves to a location outside the allowed roots.

export function checkWritePath(
  target: string,
  roots: string[],
): { ok: true; resolved: string } | { ok: false; reason: string } {
  // ... validation logic
  return {
    ok: false,
    reason: `'${reattached}' is outside the allowed write roots: [${roots.map((r) => `'${r}'`).join(", ")}]`,
  };
}

Sources: src/sandbox.ts:1-100

Allowed Write Roots Configuration

Platform	Default Roots	Delimiter
POSIX	`os.tmpdir()` + `process.cwd()`	`:`
Windows	`os.tmpdir()` + `process.cwd()`	`;`

Override with MARKFETCH_ALLOWED_WRITE_ROOTS environment variable. When set, this replaces the defaults entirely rather than merging.

Sources: README.md

Symlink Handling

The sandbox correctly resolves symlinks to prevent escape attempts like <sandbox>/link/../out.md where link points outside the sandbox. The canonicalized path flows from the containment check into writeFile, ensuring the file is created exactly at the validated location.

Sources: CHANGELOG.md, src/sandbox.ts:1-100

Error Classification

The classifyError function normalizes different error types into the MarkfetchError format used throughout the system:

function classifyError(err: unknown): { code: string; message: string } {
  if (err instanceof MarkfetchError) {
    return { code: err.code, message: err.message };
  }
  if (err instanceof Error) {
    return { code: "network_error", message: err.message };
  }
  return { code: "network_error", message: String(err) };
}

Sources: src/core.ts:1-100

Error Source Mapping

Error Source	Code Produced
`MarkfetchError` instances	Original code preserved
`Error` instances	`network_error`
Non-Error values	`network_error` with string coercion

Unified Error Flow

Version 0.5.0 introduced a refactoring where three inline return errorResult(...) sites in the MCP handler were converted to throw MarkfetchError from core uniformly. Both adapters now catch and convert errors consistently.

This architectural change ensures that both CLI and MCP interfaces produce identical error codes and messages for the same failure conditions.

Sources: CHANGELOG.md

Best Practices for Error Handling

For MCP Clients

Check isError field in the response object
Parse the content[0].text field for the [code] message format
Handle extraction_failed gracefully for client-rendered SPAs
Use savePath parameter for large responses to avoid tool-result truncation
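
The first two points can be sketched as a small client-side parser. The types and the `parseToolResult` name are hypothetical, and the fallback code `unknown` is an assumption for text that does not match the `[code] message` format.

```typescript
// Hypothetical client-side view of a tool result.
interface ToolResult {
  content: { type: string; text: string }[];
  isError?: boolean;
}

type Parsed =
  | { ok: true; markdown: string }
  | { ok: false; code: string; message: string };

function parseToolResult(result: ToolResult): Parsed {
  const text = result.content[0]?.text ?? "";
  if (!result.isError) return { ok: true, markdown: text };

  // Error text has the shape "[code] message".
  const match = /^\[([a-z_]+)\]\s*([\s\S]*)$/.exec(text);
  if (!match) return { ok: false, code: "unknown", message: text };
  const [, code = "unknown", message = ""] = match;
  return { ok: false, code, message };
}

const parsed = parseToolResult({
  content: [{ type: "text", text: "[network_error] DNS lookup failed" }],
  isError: true,
});
console.log(parsed); // { ok: false, code: 'network_error', message: 'DNS lookup failed' }
```

Branching on `code` rather than on the message text keeps client logic stable if the wording of a message changes.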

For CLI Consumers

Redirect stderr to capture error codes
Parse [code] message format from stderr
Use markfetch <url> 2>&1 >/dev/null | head -1 to read only the error line from stderr

For Save Operations

Always use absolute paths for savePath
Verify MARKFETCH_ALLOWED_WRITE_ROOTS includes your target directory
Check for save_forbidden before save_failed in error handling logic

Sources: README.md

Development Guide

Related topics: Introduction, Quick Start Guide

This guide provides comprehensive information for developers who want to understand, extend, or contribute to markfetch.

Overview

markfetch is a Node.js tool that fetches URLs and converts web content to clean markdown. It operates in two modes:

CLI Mode - Command-line interface for shell integration
MCP Mode - Model Context Protocol server for AI agent integration

The project requires Node.js ≥ 24 and is distributed as an npm package. Sources: package.json:8

Architecture

graph TD
    A[User Input] --> B{process.argv.length}
    B -->|One or more user args| C[CLI Adapter]
    B -->|Zero args| D[MCP Adapter]
    
    C --> E[src/cli.ts]
    D --> F[src/mcp.ts]
    
    E --> G[src/core.ts]
    F --> G
    
    G --> H[undici HTTP Client]
    G --> I[linkedom HTML Parser]
    G --> J[@mozilla/readability]
    G --> K[turndown]
    
    H --> L[HTTP Response]
    I --> M[DOM Document]
    J --> N[Extracted Article]
    K --> O[Markdown Output]

Core Pipeline (src/core.ts)

The core module implements the main fetch-and-convert pipeline. It orchestrates:

Component	Role
`undici`	HTTP/2 transport with Chrome-like fingerprinting
`linkedom`	HTML parsing to DOM
`@mozilla/readability`	Article content extraction
`turndown`	HTML to markdown conversion

Sources: src/core.ts:1-50

Adapters (src/cli.ts & src/mcp.ts)

The dispatcher, adapters, and shared core are split across four files:

File	Purpose
`src/core.ts`	Pipeline + errors (shared logic)
`src/mcp.ts`	MCP stdio server adapter
`src/cli.ts`	CLI argv parser + dispatcher
`src/index.ts`	Lazy-import dispatcher based on `process.argv.length`

Sources: README.md:95-100

The lazy-import dispatcher ensures console.log calls in cli.ts are never reachable from the MCP path, maintaining the invariant that stdout is reserved for MCP frames. Sources: CHANGELOG.md:45-47

Setting Up the Development Environment

Prerequisites

Node.js ≥ 24
npm or yarn

Installation

# Clone the repository
git clone https://github.com/vasylenko/markfetch.git
cd markfetch

# Install dependencies
npm install

Available Scripts

Script	Command	Purpose
`dev`	`npm run dev`	Run source directly with tsx (no build required)
`build`	`npm run build`	Compile TypeScript to JavaScript
`test`	`npm run test`	Run test suite with tsx
`inspect`	`npm run inspect`	Launch MCP inspector for debugging

Sources: package.json:21-28

Build Process

The build runs in two steps, both triggered by a single command:

# Compile TypeScript; npm then runs the postbuild script automatically
npm run build

The postbuild script (scripts/postbuild.mjs) performs additional transformations after TypeScript compilation. Sources: package.json:26

Project Structure

markfetch/
├── src/
│   ├── index.ts      # Entry point with argv dispatcher
│   ├── core.ts       # Core fetch/extract/convert pipeline
│   ├── cli.ts        # CLI adapter using commander
│   ├── mcp.ts        # MCP stdio server
│   └── sandbox.ts    # Write path sandboxing
├── dist/             # Compiled JavaScript output
├── tests/            # Test fixtures and test files
├── scripts/
│   └── postbuild.mjs # Post-compilation transformations
└── docs/
    └── SPEC.md       # Detailed specification

Configuration

Environment Variables

Variable	Default	Purpose
`MARKFETCH_TIMEOUT_MS`	`30000`	Per-request timeout in milliseconds
`MARKFETCH_MAX_BYTES`	`5000000`	Cap on response body and extracted markdown
`MARKFETCH_USER_AGENT`	Chrome 130 string	Override the User-Agent header
`MARKFETCH_ALLOWED_WRITE_ROOTS`	`os.tmpdir()` + `process.cwd()`	MCP-only write sandbox roots

Sources: README.md:60-66

Configuration Precedence

Environment variables set at startup
Command-line flags (CLI mode)
MCP tool parameters (MCP mode)

Core API

fetchMarkdown Function

The main function exported from core.ts:

interface FetchOptions {
  url: string;
  savePath?: string;
}

interface FetchResult {
  markdown: string;
  bytes: number;
  savedTo?: string;
}

Error Handling

The core module defines eight deterministic error codes:

Code	Meaning
`network_error`	DNS/TCP/TLS failure
`http_error`	Non-2xx HTTP status
`timeout`	Request timeout exceeded
`unsupported_content_type`	Not `text/html` or `application/xhtml+xml`
`extraction_failed`	Readability found no article content
`too_large`	Response or markdown exceeded size cap
`save_failed`	File write failed (permissions, missing directory)
`save_forbidden`	Path outside allowed write roots

Sources: README.md:71-80

Errors are thrown as MarkfetchError from core uniformly and caught by adapters for conversion. Sources: CHANGELOG.md:49-51

Extending the Pipeline

Adding New HTML Rewrites

The rewriteForReadability() function in core.ts handles pre-extraction HTML transformations:

function rewriteForReadability(document: Document): void {
  // Transform <aside class="footnote-brackets"> to <section>
  // Flatten <details> elements
  // Replace div.mw-heading with their heading children
}

To add new rewrite rules, append them to the body of this function. Sources: src/core.ts:120-160

Customizing Markdown Conversion

The TURNDOWN instance is configured with:

Plugin/Option	Purpose
`gfm` plugin	GitHub Flavored Markdown support
`keepClasses: true`	Preserve `class="language-X"` for code fences
Custom escape	Handle `-`/`=` after inline elements

Sources: src/core.ts:50-90

Modifying Error Handling

Error handling flows through the MarkfetchError class in core:

Core throws MarkfetchError with code and message
Adapters catch and format for their protocol
CLI: writes [code] message to stderr
MCP: returns { content: [...], isError: true }

Sources: src/cli.ts:35-42 and src/mcp.ts:15-20

Write Sandbox

The MCP adapter enforces write path restrictions:

graph TD
    A[MCP savePath] --> B{Absolute path?}
    B -->|No| C[Refine fails: savePath must be absolute]
    B -->|Yes| D{Inside allowed roots?}
    D -->|Yes| E[Write file]
    D -->|No| F[Return save_forbidden error]

Configuring Allowed Roots

Set the MARKFETCH_ALLOWED_WRITE_ROOTS environment variable, separating roots with the platform's path delimiter (: on POSIX, ; on Windows):

# POSIX
export MARKFETCH_ALLOWED_WRITE_ROOTS="/tmp:/home/user/docs"

# Windows (cmd.exe): quote the whole assignment so the quotes are not stored in the value
set "MARKFETCH_ALLOWED_WRITE_ROOTS=C:\Users\me\docs;C:\temp"

The sandbox check resolves symlinks and applies case-folding on Windows. Sources: src/sandbox.ts:20-40
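
The decision flow in the diagram can be approximated with the sketch below. It is an illustration under stated assumptions, not the contents of src/sandbox.ts: parseAllowedRoots and checkSavePath are hypothetical names, POSIX path rules are used throughout so the behavior is the same on any host, and the symlink resolution and Windows case-folding mentioned above are deliberately left out.

```typescript
import { posix as path } from "node:path";

// Hypothetical helper: split the raw variable value into normalized absolute roots.
function parseAllowedRoots(raw: string | undefined, delimiter: string = path.delimiter): string[] {
  return (raw ?? "")
    .split(delimiter)
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0 && path.isAbsolute(entry))
    .map((entry) => path.normalize(entry));
}

type SandboxVerdict = "ok" | "not_absolute" | "save_forbidden";

// Hypothetical check mirroring the diagram: absolute first, then containment.
function checkSavePath(savePath: string, roots: string[]): SandboxVerdict {
  if (!path.isAbsolute(savePath)) return "not_absolute";
  const target = path.normalize(savePath); // collapses "." and ".." segments
  const inside = roots.some((root) => {
    const rel = path.relative(root, target);
    // Inside means: a non-empty relative path that does not climb out of the root.
    return rel !== "" && rel !== ".." && !rel.startsWith(".." + path.sep);
  });
  return inside ? "ok" : "save_forbidden";
}

// Example wiring (not executed here):
// const roots = parseAllowedRoots(process.env.MARKFETCH_ALLOWED_WRITE_ROOTS);
```

Comparing with path.relative rather than a string prefix avoids treating /tmpfoo as if it were inside /tmp.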

Testing

Running Tests

npm test

Test Structure

Tests use the Node.js built-in test runner (--test flag) with tsx for TypeScript support. Sources: package.json:27

Writing New Tests

Place test files in the tests/ directory
Use the *.test.ts naming pattern
Run them with tsx --test tests/*.test.ts
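
A test file following these conventions could look like the sketch below. The file name tests/bytes.test.ts and the utf8ByteLength helper are hypothetical and stand in for whatever core function is actually under test; only the node:test runner and the naming pattern come from the text above.

```typescript
// tests/bytes.test.ts (hypothetical file name)
import { test } from "node:test";
import { strictEqual } from "node:assert";

// Hypothetical unit under test, defined inline so the example is self-contained.
// A real test would import the function from src/ instead.
function utf8ByteLength(markdown: string): number {
  return new TextEncoder().encode(markdown).length;
}

test("utf8ByteLength counts ASCII characters as one byte each", () => {
  strictEqual(utf8ByteLength("abc"), 3);
});

test("utf8ByteLength counts multi-byte characters by encoded size", () => {
  // "é" encodes to two bytes in UTF-8, so the total is 6.
  strictEqual(utf8ByteLength("héllo"), 6);
});
```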

MCP Inspector

Debug MCP integration using the official inspector:

npm run inspect

This launches the MCP inspector at http://localhost:6274 where you can:

Test tool calls interactively
Inspect request/response frames
Verify schema validation

Sources: package.json:27

Dependencies

Production Dependencies

Package	Version	Purpose
`@modelcontextprotocol/sdk`	^1.29.0	MCP server implementation
`@mozilla/readability`	^0.5.0	Article extraction
`commander`	^14.0.3	CLI argument parsing
`linkedom`	^0.18.0	HTML parsing
`turndown`	^7.0.0	HTML to markdown
`turndown-plugin-gfm`	^1.0.2	GFM support
`undici`	^8.2.0	HTTP client
`zod`	^3.0.0	Schema validation

Development Dependencies

Package	Purpose
`@types/node`	Node.js type definitions
`@types/turndown`	Turndown type definitions
`tsx`	TypeScript execution
`typescript`	TypeScript compiler

Sources: package.json:30-50

Version History

Version	Date	Key Changes
0.6.0	2026-05-13	Write sandbox, Windows CI, save_forbidden error
0.5.0	2026-05-12	CLI mode, commander dependency
0.4.1	2026-05-11	README rewrite, bin path fix
0.4.0	2026-05-10	MCP server with fetch_markdown tool

Sources: CHANGELOG.md:1-60

Contributing Guidelines

Code Standards

All source in TypeScript under src/
Build output to dist/ via npm run build
Tests in tests/ with *.test.ts pattern
No runtime console.log in MCP path (enforced by lazy-import structure)

Pull Request Checklist

[ ] Run npm run build successfully
[ ] Run npm test with all tests passing
[ ] Update CHANGELOG.md with changes
[ ] Ensure documentation reflects new behavior

Release Process

npm run prepublishOnly

The prepublishOnly script runs the build; npm also invokes it automatically before npm publish. Sources: package.json:29

Sources: src/core.ts:1-50

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Doramagic extracted 7 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: v0.4.1

Severity: medium
Finding: Installation risk is backed by a source signal: v0.4.1. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/vasylenko/markfetch/releases/tag/v0.4.1

2. Capability assumption: README/documentation is current enough for a first validation pass.

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: The project should not be treated as fully validated until this signal is reviewed.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: capability.assumptions | github_repo:1234238440 | https://github.com/vasylenko/markfetch | README/documentation is current enough for a first validation pass.

3. Maintenance risk: Maintainer activity is unknown

Severity: medium
Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | last_activity_observed missing

4. Security or permission risk: no_demo

Severity: medium
Finding: no_demo
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: downstream_validation.risk_items | github_repo:1234238440 | https://github.com/vasylenko/markfetch | no_demo; severity=medium

5. Security or permission risk: no_demo

Severity: medium
Finding: no_demo
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: risks.scoring_risks | github_repo:1234238440 | https://github.com/vasylenko/markfetch | no_demo; severity=medium

6. Maintenance risk: issue_or_pr_quality=unknown

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | issue_or_pr_quality=unknown

7. Maintenance risk: release_recency=unknown

Severity: low
Finding: release_recency=unknown.
User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | release_recency=unknown

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 2 (the number of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using markfetch with real data or production workflows.

v0.4.1 - github / github_release
README/documentation is current enough for a first validation pass. - GitHub / issue

Source: Project Pack community evidence and pitfall evidence