Doramagic Project Pack · Human Manual

markfetch

Related topics: Quick Start Guide, Processing Pipeline

Introduction

What is markfetch?

markfetch is a Node.js tool that fetches public HTTP/S URLs and returns clean, readable markdown — indistinguishable from what a human would get by running "Save as Markdown" in a browser. It is designed to provide high-quality content extraction for language models, with a focus on producing output that LLM clients can actually consume reliably.

Sources: README.md

Core Design Philosophy

markfetch is built around several key principles that differentiate it from generic fetching solutions:

| Principle | Description |
| --- | --- |
| Single-channel output | Returns markdown in `content[0].text` only, with no `structuredContent` mirror that some LLM clients would forward in place of the text |
| Real-browser fingerprint | Uses HTTP/2 transport with a coherent Chrome header set and `Sec-CH-UA-*` client hints |
| Reader-View extraction | Uses Mozilla's Readability library to extract the main article content |
| Zero-config defaults | Works out of the box with sensible defaults |
| Deterministic errors | 8 structured error codes for reliable error handling |

Sources: README.md

Architecture Overview

markfetch follows an adapter pattern with a unified core:

graph TD
    A[User / LLM Client] --> B[Adapter Layer]
    B --> C{Invocation Mode}
    C -->|CLI args| D[cli.ts]
    C -->|MCP stdio| E[mcp.ts]
    D --> F[core.ts - fetchMarkdown]
    E --> F
    F --> G[HTTP Fetch - undici]
    G --> H[Readability Extraction]
    H --> I[Turndown Conversion]
    I --> J[Markdown Output]

Core Components

| Component | File | Responsibility |
| --- | --- | --- |
| Core Pipeline | `src/core.ts` | URL fetching, HTML parsing, content extraction, markdown conversion, error throwing |
| CLI Adapter | `src/cli.ts` | Command-line argument parsing, stdout/stderr output |
| MCP Adapter | `src/mcp.ts` | Model Context Protocol stdio server, tool registration |
| Write Sandbox | `src/sandbox.ts` | Path validation for file saves |

Sources: src/core.ts, src/cli.ts, src/mcp.ts

Two Operating Modes

CLI Mode

The command-line interface accepts a URL and outputs markdown to stdout:

markfetch https://en.wikipedia.org/wiki/Markdown

Options include:

  • -o, --output <path> — Save markdown to a file
  • -V, --version — Print version
  • -h, --help — Print usage

The CLI respects the same environment variables as the MCP mode and resolves relative output paths against the current working directory.

Sources: README.md, src/cli.ts

MCP Mode

The Model Context Protocol server provides a single tool fetch_markdown(url, savePath?) for integration with LLM clients like Claude Code, Cursor, or Goose:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

The MCP mode has additional security features:

  • Write sandbox: File saves are restricted to allowed write roots
  • Lazy loading: The CLI adapter is never loaded in MCP mode, ensuring console.log is never reachable

Sources: src/mcp.ts, src/cli.ts

Content Extraction Pipeline

The markdown conversion process involves several stages:

graph LR
    A[HTML Response] --> B[Decode Encoded Tags]
    B --> C[Ensure Base Href]
    C --> D[Rewrite for Readability]
    D --> E[Readability Parse]
    E --> F[Turndown Convert]
    F --> G[Prune Empty Headings]
    G --> H[Clean Markdown]

Extraction Details

  1. Encoded Tag Decoding: Handles HTML entities like &lt;code&gt; in code blocks
  2. Base Href Injection: Ensures relative URLs become absolute using the canonical URL
  3. Pre-processing Rewrites: Handles footnotes, `

Quick Start Guide

markfetch is a tool that fetches URLs and returns clean markdown output. It operates as both a CLI command and an MCP (Model Context Protocol) server, making it suitable for AI agents like Claude Code, Codex, and Gemini CLI.

Installation

Prerequisites

Node.js ≥ 24, which ships with the npm and npx commands used in the steps that follow.

CLI Installation (Global)

npm i -g markfetch

After installation, the markfetch command is available globally. Sources: README.md:38

CLI Installation (npx)

For one-off usage without global installation:

npx -y markfetch <url>

MCP Server Setup

Add markfetch to your MCP client configuration. The setup varies by client.

#### Claude Code

claude mcp add --scope user markfetch -- npx -y markfetch

#### Codex

codex mcp add markfetch -- npx -y markfetch

#### Gemini CLI

gemini mcp add -s user markfetch npx -y markfetch

#### Cursor / Goose / Other stdio-MCP Clients

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

Sources: README.md:46-69

CLI Usage

Basic Fetch

markfetch <url>

The fetched markdown is printed to stdout. Sources: src/cli.ts:18

Save to File

markfetch <url> -o <path>

Use -o or --output to save markdown to a file. Relative paths resolve against the current working directory. Sources: src/cli.ts:12-15

Example:

markfetch https://en.wikipedia.org/wiki/Markdown -o output.md

Help and Version

markfetch --help
markfetch --version

MCP Tool Usage

Tool Name

fetch_markdown

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | Yes | Absolute http(s) URL to fetch. The server follows redirects automatically. No authentication headers, cookies, or session state are sent. |
| `savePath` | string | No | Absolute filesystem path. When provided, the fetched markdown is written to this path instead of being returned in the response. |

Sources: src/mcp.ts:22-33

Return Value

The tool returns markdown content in content[0].text and never sets a structuredContent field. Some MCP clients forward only structuredContent to the model when it is present, so omitting it ensures the markdown always reaches the model. Sources: README.md:18-21

Environment Configuration

| Variable | Default | Purpose |
| --- | --- | --- |
| `MARKFETCH_TIMEOUT_MS` | `30000` | Per-request timeout in milliseconds |
| `MARKFETCH_MAX_BYTES` | `5000000` | Cap on response body and extracted markdown (5 MB) |
| `MARKFETCH_USER_AGENT` | Pinned Chrome 130 string | Override the User-Agent header. Must be a Chrome UA string. |
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | `os.tmpdir()` + `process.cwd()` | MCP-only. Colon-delimited (POSIX) or semicolon-delimited (Windows) list of absolute paths permitted for `savePath` writes. |

Sources: README.md:99-103
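A byte cap such as MARKFETCH_MAX_BYTES is not the same as a character count, because non-ASCII characters take more than one byte. The sketch below illustrates the difference. It assumes the cap is measured in UTF-8 bytes, and the helper name exceedsByteCap is hypothetical rather than taken from the markfetch source.

```typescript
// Hypothetical helper, not part of markfetch.
// Assumption: the cap counts UTF-8 bytes, not JavaScript string length.
const DEFAULT_MAX_BYTES = 5_000_000; // default listed in the table above

function exceedsByteCap(text: string, maxBytes: number = DEFAULT_MAX_BYTES): boolean {
  // TextEncoder always encodes to UTF-8, so multi-byte characters count fully.
  const byteLength = new TextEncoder().encode(text).length;
  return byteLength > maxBytes;
}
```

Under this assumption, a five-character string containing one accented letter occupies six bytes and would exceed a five-byte cap.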

Passing Environment Variables to MCP

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_TIMEOUT_MS": "60000"
      }
    }
  }
}

Error Handling

Errors are returned with deterministic codes in the format [code] message:

| Code | Meaning |
| --- | --- |
| `network_error` | DNS, TCP, or TLS failure |
| `http_error` | Upstream returned a non-2xx status |
| `timeout` | Request exceeded `MARKFETCH_TIMEOUT_MS` |
| `unsupported_content_type` | Response was not `text/html` or `application/xhtml+xml` |
| `extraction_failed` | Readability found no article content (typical for pure client-rendered SPAs) |
| `too_large` | Response body or extracted markdown exceeded `MARKFETCH_MAX_BYTES` |
| `save_failed` | `writeFile` failed (missing directory, permission denied) |
| `save_forbidden` | `savePath` resolves outside the allowed write roots |

Errors go to stderr with non-zero exit status in CLI mode. Sources: README.md:72-85
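Because every error uses the [code] message shape, a caller can split it mechanically. The following is a minimal sketch of such a parser; parseErrorLine and ParsedError are hypothetical names introduced for illustration, and only the eight codes come from this manual.

```typescript
// The eight codes listed in the error table.
const ERROR_CODES = [
  "network_error", "http_error", "timeout", "unsupported_content_type",
  "extraction_failed", "too_large", "save_failed", "save_forbidden",
] as const;
type ErrorCode = (typeof ERROR_CODES)[number];

interface ParsedError {
  code: ErrorCode;
  message: string;
}

// Hypothetical helper: splits "[code] message" into its parts, or returns
// null when the line does not start with one of the known codes.
function parseErrorLine(line: string): ParsedError | null {
  const match = /^\[([a-z_]+)\]\s*([\s\S]*)$/.exec(line.trim());
  if (!match) return null;
  const code = match[1];
  if ((ERROR_CODES as readonly string[]).indexOf(code) === -1) return null;
  return { code: code as ErrorCode, message: match[2] };
}
```

A script wrapping the CLI could apply this to stderr to decide, for example, whether a failure is worth retrying.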

Quick Workflow

graph TD
    A[Start markfetch] --> B{Arguments provided?}
    B -->|Yes, URL argument| C[CLI Mode]
    B -->|No arguments| D[MCP Server Mode]
    C --> E[Fetch URL]
    D --> F[Wait for MCP request]
    E --> G{Output path specified?}
    F --> H[Receive fetch_markdown request]
    G -->|No| I[Print markdown to stdout]
    G -->|Yes, -o path| J[Write to file]
    J --> K[Print confirmation to stdout]
    H --> M[Fetch URL]
    M --> L["Return markdown in content[0].text"]

Use Cases

| Use Case | Recommended Mode | Command/Config |
| --- | --- | --- |
| One-time URL fetch in shell | CLI | `markfetch <url>` |
| Batch processing with shell scripts | CLI + `-o` | `markfetch <url> -o out.md` |
| AI agent web content retrieval | MCP | Configure in client |
| Large documents that exceed inline response limits | MCP + `savePath` | Set `savePath` to a local file |

Limitations

  • Not a crawler: No recursion, no robots.txt parsing. One URL in, one document out. Sources: README.md:89-91
  • Not authenticated: Anonymous fetch only. Pages behind login walls return whatever the public response is. Sources: README.md:93-95
  • Not a JS renderer: Pure client-rendered SPAs with no static HTML return extraction_failed. Sources: README.md:97-99

Sources: README.md

Processing Pipeline

Related topics: Introduction, HTTP/2 Fingerprinting, Error Handling

Overview

The Processing Pipeline is the core data flow engine in markfetch. It transforms raw HTML fetched from a URL into clean, readable markdown suitable for consumption by AI agents and language models. The pipeline is intentionally single-purpose — one URL in, one markdown document out — with no recursion, pagination, or client-side JavaScript rendering.

The pipeline operates identically whether invoked via CLI or MCP adapter, ensuring consistent behavior across both interfaces.

Sources: src/core.ts

Architecture

The pipeline is composed of three primary stages executed sequentially:

graph TD
    A[URL Input] --> B[HTTP Fetch]
    B --> C{Fetch succeeded?}
    C -->|No| D[Error: network_error / http_error / timeout]
    C -->|Yes| E[Content-Type Check]
    E -->|Non-HTML| F[Error: unsupported_content_type]
    E -->|HTML| G[Extract Article]
    G -->|No Content| H[Error: extraction_failed]
    G -->|Extracted| I[Convert to Markdown]
    I --> J{Size Check}
    J -->|Exceeds Limit| K[Error: too_large]
    J -->|Valid| L{Save Path?}
    L -->|Yes| M[Write to File / Error: save_forbidden / save_failed]
    L -->|No| N[Return Markdown]

Each stage performs validation and may abort with a deterministic error code, ensuring failures are predictable and actionable.

Sources: src/core.ts

Stage 1: HTTP Fetch

The fetch stage retrieves raw HTML from the target URL using the undici HTTP client, presenting a real-browser fingerprint.

Transport Configuration

| Setting | Value | Purpose |
| --- | --- | --- |
| Protocol | HTTP/2 | Modern web fingerprint |
| User-Agent | Chrome 130 (pinned) | Realistic browser identification |
| Client Hints | `Sec-CH-UA-*` headers | Derived from User-Agent at startup |
| Timeout | `MARKFETCH_TIMEOUT_MS` (default: 30000 ms) | Per-request budget |

The User-Agent string is validated at startup. Non-Chrome strings fail fast to prevent fingerprint inconsistencies that could trigger bot detection.

Sources: README.md

Error Conditions

| Code | Trigger |
| --- | --- |
| `network_error` | DNS failure, TCP failure, TLS error, unexpected fetcher error |
| `http_error` | Non-2xx HTTP status code |
| `timeout` | Request exceeds `MARKFETCH_TIMEOUT_MS` |

Redirects are followed automatically by the underlying HTTP client.
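Once a response arrives, two checks decide whether it may proceed to extraction: the status must be 2xx and the Content-Type must be HTML or XHTML. The sketch below shows one way to express them. The function name checkResponse is hypothetical, and the real pipeline may order or implement these checks differently.

```typescript
type ResponseError = "http_error" | "unsupported_content_type";

// Hypothetical helper: classifies a response by status and Content-Type.
// Returns null when the response may proceed to article extraction.
function checkResponse(status: number, contentType: string | null): ResponseError | null {
  if (status < 200 || status > 299) return "http_error";
  // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
  const mediaType = (contentType ?? "").split(";")[0].trim().toLowerCase();
  if (mediaType !== "text/html" && mediaType !== "application/xhtml+xml") {
    return "unsupported_content_type";
  }
  return null;
}
```

Comparing only the media type, without its parameters, keeps a header like text/html; charset=utf-8 from being rejected.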

Stage 2: Article Extraction

Article extraction identifies and isolates the main content from the fetched HTML, stripping navigation, sidebars, footers, and other boilerplate.

Technology Stack

| Component | Library | Purpose |
| --- | --- | --- |
| HTML Parser | linkedom | Parses HTML into a DOM-like structure |
| Extraction | @mozilla/readability | Identifies main article content |
| Configuration | `keepClasses: true` | Preserves code block language hints |

Node.js has no built-in DOMParser, so linkedom provides the DOM implementation that Readability operates on, which also keeps behavior consistent across Node.js versions and environments.

Sources: src/core.ts

Pre-Extraction Rewrites

Before Readability processes the document, the pipeline applies targeted HTML rewrites to normalize content and improve extraction quality:

function rewriteForReadability(document: Document): void {
  // Normalize code blocks (pre and code elements)
  // Convert aside elements to sections
  // Expand details/summary elements
  // Flatten MediaWiki heading wrappers
}

Specific transformations include:

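| Transform | Target | Action |
| --- | --- | --- |
| Code block normalization | `<pre>`, `<code>` | Standardize encoding artifacts |
| Base href injection | `<head>` / `<html>` | Ensure absolute URLs after redirects |
| Aside conversion | `<aside>` with footnote roles | Convert to `<section>` |
| Details expansion | `<details>`, `<summary>` | Expand collapsed content |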
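Base href injection matters because relative links must be resolved against the URL the fetch actually ended on, not the one originally requested. The resolution rule itself is the standard WHATWG URL algorithm, sketched below. The helper absolutize is hypothetical; markfetch injects a base element and lets the DOM resolve links rather than calling a function like this.

```typescript
// Hypothetical helper: resolves a link found in the page against the final
// (post-redirect) URL, using the standard URL constructor.
function absolutize(href: string, finalUrl: string): string {
  return new URL(href, finalUrl).href;
}
```

For example, absolutize("/about", "https://example.com/docs/page") yields "https://example.com/about", while an already absolute href is returned unchanged.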

HTTP/2 Fingerprinting

Overview

HTTP/2 Fingerprinting is a technique used by markfetch to mimic real browser traffic when fetching web pages. Instead of making requests that appear to come from a typical HTTP library (like curl or a basic fetch implementation), markfetch generates HTTP/2 requests with headers and client hints that closely match those of an actual Chrome browser session.

This approach serves two critical purposes:

  1. Bypass anti-bot measures: Many websites use fingerprinting to detect and block automated clients. By sending headers that match those of a genuine Chrome browser, markfetch is less likely to trigger these defenses.
  2. Access SEO-rendered content: Sites that serve different content to bots and to browsers are more likely to return the full article when requests carry a Chrome-like fingerprint.

Sources: README.md

Architecture

graph TD
    A[URL Request] --> B{Adapter Type?}
    B -->|MCP| C[src/mcp.ts]
    B -->|CLI| D[src/cli.ts]
    C --> E[src/core.ts - fetchMarkdown]
    D --> E
    E --> F[Undici Dispatcher]
    F --> G[HTTP/2 Transport]
    G --> H[Sec-CH-UA-* Client Hints]
    G --> I[Chrome Headers]
    H --> J[Upstream Server]
    I --> J
    J --> K[HTML Response]
    K --> L[Readability Parser]
    L --> M[Markdown Output]

Implementation Details

User Agent String

The default user agent is a pinned Chrome 130 string. This can be overridden via the MARKFETCH_USER_AGENT environment variable, but must be a valid Chrome UA string.

| Environment Variable | Default Value | Purpose |
| --- | --- | --- |
| `MARKFETCH_USER_AGENT` | Pinned Chrome 130 string | Override the browser fingerprint UA |

Constraint: The UA string must be a Chrome browser UA. Non-Chrome strings fail fast at startup because Sec-CH-UA-* client hints are derived from the UA at initialization time.

Sources: README.md

Client Hints Generation

When the server starts, markfetch parses the MARKFETCH_USER_AGENT value and derives Sec-CH-UA-* client hint headers from it. These hints are sent with every HTTP/2 request and include:

  • Sec-CH-UA: Browser brand and version
  • Sec-CH-UA-Mobile: Mobile indicator
  • Sec-CH-UA-Platform: Operating system

graph LR
    A[MARKFETCH_USER_AGENT<br/>Chrome 130] --> B[Startup<br/>Initialization]
    B --> C[Sec-CH-UA Header<br/>Derived Value]
    B --> D[Sec-CH-UA-Mobile<br/>Derived Value]
    B --> E[Sec-CH-UA-Platform<br/>Derived Value]
    C --> F[Every HTTP/2<br/>Request]
    D --> F
    E --> F

Sources: README.md
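To make the derivation concrete, here is a minimal sketch of turning a Chrome User-Agent string into the three hint headers. Everything in it is an assumption made for illustration: the function name deriveClientHints, the platform detection rules, and the exact brand list (including the "Not?A_Brand" entry) are not taken from the markfetch source.

```typescript
interface ClientHints {
  "Sec-CH-UA": string;
  "Sec-CH-UA-Mobile": string;
  "Sec-CH-UA-Platform": string;
}

// Hypothetical helper: derives client hints from a Chrome User-Agent string.
// The brand list format is illustrative, not copied from markfetch.
function deriveClientHints(userAgent: string): ClientHints {
  const version = /Chrome\/(\d+)\./.exec(userAgent);
  if (!version) {
    // Mirrors the fail-fast rule: a non-Chrome UA cannot yield coherent hints.
    throw new Error("MARKFETCH_USER_AGENT must be a Chrome User-Agent string");
  }
  const major = version[1];
  const platform = /Windows/.test(userAgent) ? "Windows"
    : /Macintosh|Mac OS X/.test(userAgent) ? "macOS"
    : /Android/.test(userAgent) ? "Android"
    : "Linux";
  return {
    "Sec-CH-UA": `"Chromium";v="${major}", "Google Chrome";v="${major}", "Not?A_Brand";v="99"`,
    "Sec-CH-UA-Mobile": /Mobile/.test(userAgent) ? "?1" : "?0",
    "Sec-CH-UA-Platform": `"${platform}"`,
  };
}
```

Deriving all three headers from one string is what keeps the header set coherent: the version in Sec-CH-UA can never disagree with the version in User-Agent.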

HTTP/2 Transport

Markfetch uses the undici HTTP client library with HTTP/2 protocol support. The HTTP/2 transport is selected automatically by undici when the server supports it, enabling:

  • Multiplexed requests over a single connection
  • Header compression
  • Server push capabilities

The combination of HTTP/2 transport and a coherent Chrome header set produces a request profile that closely resembles an ordinary Chrome browsing session.

Sources: README.md

Request Flow

sequenceDiagram
    participant Client
    participant Markfetch
    participant Undici
    participant Server

    Client->>Markfetch: fetch_markdown(url)
    Markfetch->>Markfetch: Validate MARKFETCH_USER_AGENT
    Markfetch->>Undici: Dispatch with Chrome headers
    Undici->>Server: GET /path HTTP/2<br/>Sec-CH-UA: "Chromium"<br/>Sec-CH-UA-Mobile: ?0<br/>Sec-CH-UA-Platform: "Windows"
    Server->>Undici: HTTP/2 200 OK<br/>text/html
    Undici->>Markfetch: HTML Content
    Markfetch->>Markfetch: Apply Readability
    Markfetch->>Markfetch: Convert to Markdown
    Markfetch->>Client: Clean Markdown

Configuration

Environment Variables

| Variable | Default | Purpose |
| --- | --- | --- |
| `MARKFETCH_TIMEOUT_MS` | `30000` | Per-request timeout in milliseconds |
| `MARKFETCH_MAX_BYTES` | `5000000` | Cap on response body and extracted markdown |
| `MARKFETCH_USER_AGENT` | Pinned Chrome 130 | Browser fingerprint override |

Validation

All environment variables are validated at startup. Invalid values cause the process to fail fast on stderr with descriptive error messages, rather than producing confusing per-request errors.

Sources: README.md

Integration Points

MCP Adapter

The MCP server (src/mcp.ts) uses the core fetch pipeline, which includes the HTTP/2 fingerprinting. The fingerprinting is transparent to clients; the tool description they see covers only what the tool returns:

Fetch a single public HTTP/S URL and return its main article content as clean markdown. Best for articles, documentation, blog posts, news, and reference pages. Non-HTML responses return unsupported_content_type.

Sources: src/mcp.ts

CLI Adapter

The CLI adapter (src/cli.ts) also uses the same core fetch pipeline, ensuring consistent HTTP/2 fingerprinting behavior whether invoked via MCP or command line:

markfetch https://en.wikipedia.org/wiki/Markdown

Sources: src/cli.ts

Version History

| Version | Date | Change |
| --- | --- | --- |
| 0.4.0 | 2026-05-10 | HTTP/2 fingerprinting feature added with `Sec-CH-UA-*` client hints |
| 0.5.0 | 2026-05-12 | CLI mode added with same fingerprinting behavior |
| 0.6.0 | Current | Enhanced write sandbox and validation |

Sources: CHANGELOG.md

Limitations

SPA Handling

Pure client-rendered Single Page Applications (SPAs) with no static HTML content return extraction_failed. Sites that ship server-rendered or SEO-prerendered HTML will extract whatever static content they expose, including when accessed with Chrome fingerprints.

Authentication

Markfetch performs anonymous fetches only — no cookie jar, no auth headers, no session reuse. Pages behind login walls return whatever the public response is, usually surfaced as http_error.

Sources: README.md

Security Considerations

Because HTTP/2 fingerprinting makes requests look like ordinary browser traffic, responsibility for appropriate use rests with the operator. The documentation explicitly states:

Use it on URLs whose targets you have permission to fetch, and respect the terms of service of any site you query. The maintainer assumes no liability for misuse.

Sources: README.md

Sources: src/core.ts

CLI Usage

Related topics: Quick Start Guide, MCP Server Integration, Write Sandbox Security

The markfetch CLI provides a command-line interface for fetching URLs and converting their content to clean markdown. It operates as one of two execution surfaces—the other being the MCP (Model Context Protocol) stdio server—with both sharing the same underlying core pipeline.

Overview

The CLI accepts a URL as its primary argument and outputs the converted markdown to stdout or to a specified file. It was introduced in version 0.5.0 as a way to make markfetch accessible from standard shell environments, pipelines, and scripts.

| Aspect | Details |
| --- | --- |
| Entry Point | `markfetch <url>` |
| Output | stdout (default) or file via `-o` |
| Version | 0.6.0 |
| Runtime | Node.js ≥ 24 |
| Distribution | npm package |

Sources: README.md

Architecture

The CLI is implemented as an adapter layer that delegates to the shared core. When the process is invoked with arguments, the dispatcher in index.ts lazy-loads the CLI adapter; bare invocation (zero arguments) routes to the MCP server instead.

graph TD
    A["markfetch CLI Invocation<br/>process.argv.length > 2"] --> B["src/index.ts<br/>Dispatcher"]
    B --> C["src/cli.ts<br/>CLI Adapter"]
    C --> D["src/core.ts<br/>fetchMarkdown()"]
    D --> E["src/sandbox.ts<br/>Write Validation"]
    D --> F["HTTP Fetch + Readability + Turndown"]
    
    G["Bare Invocation<br/>process.argv.length === 2"] --> H["src/mcp.ts<br/>MCP Server"]

Sources: src/cli.ts:39-47

Command Syntax

markfetch <url> [options]

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| `<url>` | Yes | Absolute http(s) URL to fetch |

Options

| Flag | Description |
| --- | --- |
| `-o, --output <path>` | Save markdown to a file (absolute or relative path). Default is stdout. |
| `-V, --version` | Print version and exit |
| `-h, --help` | Print usage and exit |

Sources: src/cli.ts:23-30

Output Behavior

The CLI maintains strict separation between its output channels:

| Scenario | Channel | Content |
| --- | --- | --- |
| Raw markdown (no `-o`) | stdout | Raw markdown body via `process.stdout.write()` |
| File output (`-o`) | stdout | Confirmation: `Saved N bytes to <path>` |
| Any error | stderr | `[code] message` |

The raw markdown is written using process.stdout.write() rather than console.log(), which would append a trailing newline. The bytes on stdout therefore match exactly what the MCP adapter would emit in content[0].text.

Sources: src/cli.ts:50-58
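The channel rules in the table can be sketched as a single routing function. This is an illustration only: the names report, CliOutcome, and Writer are hypothetical, and the writers are injected so the routing can be exercised without touching the real process streams.

```typescript
type CliOutcome =
  | { kind: "markdown"; markdown: string }
  | { kind: "saved"; bytes: number; path: string }
  | { kind: "error"; code: string; message: string };

type Writer = (chunk: string) => void;

// Hypothetical helper: routes each outcome to the channel named in the table
// and returns the exit code the process should end with.
function report(outcome: CliOutcome, stdout: Writer, stderr: Writer): number {
  switch (outcome.kind) {
    case "markdown":
      stdout(outcome.markdown); // written verbatim, no newline appended
      return 0;
    case "saved":
      stdout(`Saved ${outcome.bytes} bytes to ${outcome.path}\n`);
      return 0;
    case "error":
      stderr(`[${outcome.code}] ${outcome.message}\n`);
      return 1;
  }
}
```

In a real CLI the writers would wrap process.stdout.write and process.stderr.write, and the returned number would be assigned to process.exitCode.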

Error Handling

Errors are written to stderr with a deterministic format: [code] message. The process exits with a non-zero status code.

process.exitCode = 1;
console.error(`[${code}] ${message}`);

The CLI uses process.exitCode (not process.exit()) to ensure pending output drains before the process exits—important when stdout is piped to a slow consumer.

Sources: src/cli.ts:58-62

Error Codes

| Code | Meaning |
| --- | --- |
| `network_error` | DNS / TCP / TLS failure |
| `http_error` | Upstream returned a non-2xx status |
| `timeout` | Request exceeded `MARKFETCH_TIMEOUT_MS` |
| `unsupported_content_type` | Response was not HTML |
| `extraction_failed` | No extractable article content |
| `too_large` | Response or markdown exceeded `MARKFETCH_MAX_BYTES` |
| `save_failed` | File write failed (permission denied, etc.) |

Note: save_forbidden is MCP-only and does not apply to CLI (no sandbox).

Sources: README.md

Path Resolution

The CLI resolves relative output paths against the current working directory before passing them to the core:

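import { resolve } from "node:path";
// Relative -o values resolve against cwd; absolute paths pass through unchanged.
const savePath = options.output
  ? resolve(process.cwd(), options.output)
  : undefined;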

Tilde expansion is intentionally not performed—the shell expands ~/foo before argv reaches the process, and a quoted literal '~/foo' should produce a file named ~/foo in cwd (standard tool behavior).

Sources: src/cli.ts:32-39
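The three cases described above (relative path, absolute path, and a literal tilde) can be checked with a small sketch. The helper resolveOutput is hypothetical, and it uses posix.resolve so that results are identical on every platform; the CLI itself uses the platform-native resolve.

```typescript
import { posix } from "node:path";

// Hypothetical helper mirroring the CLI's path resolution.
// posix.resolve keeps the results platform-independent for illustration.
function resolveOutput(cwd: string, output?: string): string | undefined {
  return output ? posix.resolve(cwd, output) : undefined;
}
```

With a working directory of /work, the value out.md becomes /work/out.md, /tmp/x.md stays as it is, and a quoted ~/foo becomes /work/~/foo because no tilde expansion takes place.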

Environment Variables

These environment variables apply to both CLI and MCP modes:

| Variable | Default | Purpose |
| --- | --- | --- |
| `MARKFETCH_TIMEOUT_MS` | `30000` | Per-request timeout in ms |
| `MARKFETCH_MAX_BYTES` | `5000000` | Cap on response body and extracted markdown |
| `MARKFETCH_USER_AGENT` | Chrome 130 string | Override User-Agent header |

The CLI adapter imports fetchMarkdown and classifyError from the core module, which validates these environment variables at startup.

Sources: src/cli.ts:15 and README.md

File Structure

The project source is organized into adapter modules:

src/
├── index.ts    # Dispatcher (lazy-loads cli.ts or mcp.ts)
├── core.ts     # Shared pipeline and errors
├── cli.ts      # CLI adapter (commander-based)
├── mcp.ts      # MCP stdio server adapter
└── sandbox.ts  # Write sandbox (path validation for file saves)

The lazy-import pattern ensures that cli.ts code (which calls console.log) is never loaded when running in MCP mode, preserving the "stdout is reserved for MCP frames" invariant structurally.

Sources: CHANGELOG.md and src/cli.ts:1-13

Installation

Install globally via npm:

npm i -g markfetch

Or use via npx without installation:

npx -y markfetch <url>

The bin entry in package.json points to dist/index.js:

{
  "bin": {
    "markfetch": "dist/index.js"
  }
}

Sources: package.json:16-18

Usage Examples

Basic fetch to stdout

markfetch https://en.wikipedia.org/wiki/Markdown

Save to file

markfetch https://example.com/article -o output.md

With timeout override

MARKFETCH_TIMEOUT_MS=60000 markfetch https://slow-site.example.com

Pipeline to another tool

markfetch https://example.com/doc | grep -A5 "## Installation"

Sources: README.md


MCP Server Integration

Related topics: Quick Start Guide, CLI Usage, Write Sandbox Security

Overview

The MCP (Model Context Protocol) Server Integration is the primary interface for AI agents to fetch web content as clean markdown. Markfetch exposes a single MCP tool fetch_markdown that accepts a URL and returns extracted markdown content, enabling language models like Claude to access web information through a standardized protocol.

The MCP server operates as a stdio-based server, meaning it communicates exclusively through standard input and standard output streams. This design ensures the server integrates seamlessly with MCP clients including Claude Desktop, Claude Code, Cursor, and Goose.

Architecture

Entry Point Dispatcher

The src/index.ts file implements an argv-discriminated dispatcher that determines whether to start the MCP server or the CLI based on the presence of command-line arguments:

if (process.argv.length === 2) {
  await import("./mcp.js");
} else {
  await import("./cli.js");
}

Sources: src/index.ts:26-29

When process.argv.length === 2, the process was invoked without arguments—this is the standard pattern MCP clients use when spawning a server. Any extra argument (URL, flags, --help) routes to the CLI adapter.
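The dispatch rule reduces to a one-line decision on the length of argv. The sketch below isolates that decision as a pure function so it can be checked directly; selectMode and Mode are hypothetical names, and the real dispatcher performs the dynamic import rather than returning a label.

```typescript
type Mode = "mcp" | "cli";

// Hypothetical helper: argv[0] is the node binary and argv[1] is the script,
// so a length of exactly 2 means no user-supplied arguments.
function selectMode(argv: readonly string[]): Mode {
  return argv.length === 2 ? "mcp" : "cli";
}
```

Calling selectMode(process.argv) at startup would give "mcp" for a bare npx -y markfetch and "cli" as soon as a URL or flag is present.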

Module Isolation

The dynamic import pattern ensures complete module isolation:

graph TD
    A[markfetch entry] --> B{argv.length === 2?}
    B -->|Yes| C[Lazy import: mcp.ts]
    B -->|No| D[Lazy import: cli.ts]
    C --> E[@modelcontextprotocol/sdk loaded]
    D --> F[commander loaded]
    E -.-> G[Never reaches console.log]
    F -.-> H[Can use console.log]

Sources: src/index.ts:18-22

This architecture enforces the "stdout is reserved for MCP frames" invariant structurally—the MCP path never imports cli.ts, so code that calls console.log is literally unreachable from the MCP execution path.

MCP Server Implementation

Server Initialization

The MCP server is initialized using the @modelcontextprotocol/sdk package:

const server = new McpServer({ name: "markfetch", version: "0.6.0" });

Sources: src/mcp.ts:20

Tool Registration

The server registers a single tool fetch_markdown with a Zod-based input schema:

import { isAbsolute } from "node:path";
import { z } from "zod";

// Register the single tool; `server` is the McpServer instance created above.
server.registerTool(
  "fetch_markdown",
  {
    description: "Fetch a single public HTTP/S URL and return its main article content as clean markdown...",
    inputSchema: {
      url: z.string().url().describe("Absolute http(s) URL of the page to fetch..."),
      savePath: z.string().refine(isAbsolute, "savePath must be an absolute filesystem path").optional().describe("Optional. When provided...")
    }
  },
  async ({ url, savePath }) => {
    // Implementation
  }
);

Sources: src/mcp.ts:22-47

Tool Input Schema

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | Yes | Absolute http(s) URL of the page to fetch. The server follows redirects automatically. No authentication headers, cookies, or session state are sent. |
| `savePath` | string | No | Optional absolute filesystem path. When provided, the fetched markdown is written to this path instead of returned inline. |

The url parameter is validated using Zod's .url() method to ensure a valid URL format. The savePath parameter must be an absolute path, enforced by the .refine(isAbsolute, ...) check.
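For readers unfamiliar with Zod, the same two checks can be approximated in plain TypeScript. This is a stand-in, not the schema markfetch uses: validateInput is a hypothetical name, the explicit http/https protocol check is an addition that Zod's .url() does not perform, and posix.isAbsolute is used so the example behaves the same on every platform.

```typescript
import { posix } from "node:path";

interface FetchInput {
  url: string;
  savePath?: string;
}

// Hypothetical stand-in for the Zod schema: returns a list of problems,
// which is empty when the input is acceptable.
function validateInput(input: FetchInput): string[] {
  const problems: string[] = [];
  try {
    const parsed = new URL(input.url);
    // Extra check beyond URL syntax: only http and https are accepted.
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      problems.push("url must use http or https");
    }
  } catch (e) {
    problems.push("url must be a valid absolute URL");
  }
  if (input.savePath !== undefined && !posix.isAbsolute(input.savePath)) {
    problems.push("savePath must be an absolute filesystem path");
  }
  return problems;
}
```

Collecting problems into a list, rather than throwing on the first one, lets a caller report every invalid field at once.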

Response Format

The tool returns a response in this structure:

{
  content: [{ type: "text", text: "markdown content or [code] message" }],
  isError: boolean
}

Sources: src/mcp.ts:8-12

Error Handling

Error Code System

The MCP adapter uses a uniform error code system with 8 deterministic codes:

| Error Code | Description | Source |
| --- | --- | --- |
| `network_error` | DNS/TCP/TLS failure or unexpected internal error | `src/core.ts` |
| `http_error` | Upstream returned non-2xx status | `src/core.ts` |
| `timeout` | Per-request budget exceeded | `src/core.ts` |
| `unsupported_content_type` | Response was not `text/html` or `application/xhtml+xml` | `src/core.ts` |
| `extraction_failed` | Readability returned no article content | `src/core.ts` |
| `too_large` | Response or markdown exceeded `MARKFETCH_MAX_BYTES` | `src/core.ts` |
| `save_failed` | `writeFile` failed (permission denied, missing directory) | `src/core.ts` |
| `save_forbidden` | `savePath` resolves outside allowed write roots | `src/mcp.ts` |

Error Result Factory

function errorResult(code: ErrorCode, message: string) {
  return {
    content: [{ type: "text" as const, text: `[${code}] ${message}` }],
    isError: true,
  };
}

Sources: src/mcp.ts:8-12

Error Propagation Pattern

In version 0.5.0, error handling was refactored so that core functions now throw MarkfetchError instead of returning error results inline. Both the MCP and CLI adapters catch these exceptions and convert them to their respective output formats.

Sources: CHANGELOG.md:19-21
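The throw-then-convert pattern can be sketched as one error class and two small converters, one per adapter. The class shape below is an assumption: the manual names MarkfetchError but does not show its fields, and toCliLine and toMcpResult are hypothetical names.

```typescript
// Sketch only: the real MarkfetchError may carry different fields.
class MarkfetchError extends Error {
  readonly code: string;

  constructor(code: string, message: string) {
    super(message);
    this.name = "MarkfetchError";
    this.code = code;
  }
}

// CLI adapter view: one stderr line in "[code] message" form.
function toCliLine(err: MarkfetchError): string {
  return `[${err.code}] ${err.message}`;
}

// MCP adapter view: the same text wrapped in a tool result with isError set.
function toMcpResult(err: MarkfetchError) {
  return {
    content: [{ type: "text" as const, text: `[${err.code}] ${err.message}` }],
    isError: true,
  };
}
```

Keeping the core free of output formatting means both adapters render the same failure from a single thrown value, so the code and message cannot drift apart between CLI and MCP.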

Write Sandbox (MCP-Specific)

The MCP server implements a write sandbox that restricts savePath operations to a set of allowed root directories.

Default Allowed Roots

By default, the allowed set is:

  • os.tmpdir() (system temp directory)
  • process.cwd() (current working directory)

Each path is resolved via fs.realpath at startup to handle symlinks.
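A containment check of this kind is usually written with path.relative rather than a string prefix test, since a prefix test would wrongly accept /tmpfoo for the root /tmp. The sketch below shows the lexical part only. The name isInsideRoots is hypothetical, POSIX path rules are used for a platform-independent example, and symlink resolution (the fs.realpath step mentioned above) is deliberately left out.

```typescript
import { posix } from "node:path";

// Hypothetical helper: true when target lies inside (or equals) one of roots.
// Lexical check only; it does not resolve symlinks the way fs.realpath would.
function isInsideRoots(target: string, roots: readonly string[]): boolean {
  const resolved = posix.resolve(target);
  return roots.some((root) => {
    const rel = posix.relative(posix.resolve(root), resolved);
    // Inside means the relative path never climbs out of the root.
    return rel === "" || (rel !== ".." && !rel.startsWith("../") && !posix.isAbsolute(rel));
  });
}
```

A target that fails this check is the situation the save_forbidden error code describes.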

Configuration

The MARKFETCH_ALLOWED_WRITE_ROOTS environment variable overrides the default set entirely:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
      }
    }
  }
}

Sources: README.md:89-100
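Splitting the variable into a list depends on the platform delimiter: a colon on POSIX and a semicolon on Windows, which Node exposes as path.delimiter. The sketch below takes the platform as a parameter so both cases can be shown. The name parseWriteRoots is hypothetical, and trimming whitespace and dropping empty entries are assumptions about how lenient the real parser is.

```typescript
// Hypothetical helper: splits MARKFETCH_ALLOWED_WRITE_ROOTS into a list.
// Assumption: surrounding whitespace is trimmed and empty entries are dropped.
function parseWriteRoots(value: string, platform: "posix" | "win32"): string[] {
  const delimiter = platform === "win32" ? ";" : ":";
  return value
    .split(delimiter)
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}
```

The Windows delimiter has to differ because drive letters such as C:\ already contain a colon.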

Security Rationale

The sandbox is MCP-only by design. The CLI is unrestricted because "a human at the shell is the security boundary." The asymmetry exists because the MCP tool is driven by a language model, which may be steered by content from a page it just fetched.

Sources: README.md:102-104

Request Flow

sequenceDiagram
    participant Client as MCP Client
    participant MCP as MCP Server
    participant Core as fetchMarkdown()
    participant Fetch as HTTP Fetcher

    Client->>MCP: fetch_markdown({url, savePath?})
    MCP->>Core: fetchMarkdown({url, savePath})
    Core->>Fetch: GET url (with Chrome fingerprint)
    Fetch-->>Core: HTML response
    Core->>Core: Readability parsing
    Core->>Core: Turndown conversion
    alt savePath provided
        Core->>Core: Write to file (within sandbox)
    end
    Core-->>MCP: {markdown, bytes, savedTo?}
    MCP-->>Client: {content: [{text: markdown}], isError: false}

Environment Configuration

| Variable | Default | Purpose | MCP-Specific |
| --- | --- | --- | --- |
| `MARKFETCH_TIMEOUT_MS` | `30000` | Per-request timeout in ms | No |
| `MARKFETCH_MAX_BYTES` | `5000000` | Cap on response body and extracted markdown | No |
| `MARKFETCH_USER_AGENT` | Chrome 130 string | Override the User-Agent header | No |
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | `os.tmpdir()` + `process.cwd()` | Permitted write roots for `savePath` | Yes |

Sources: src/mcp.ts:1-5, README.md:68-75

Integration with Clients

Claude Desktop / Claude Code

claude mcp add --scope user markfetch -- npx -y markfetch

Sources: README.md:40-43

Codex

codex mcp add markfetch -- npx -y markfetch

Sources: README.md:46-48

Manual Configuration

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

Sources: README.md:52-58

Dependencies

The MCP server depends on:

| Package | Version | Purpose |
| --- | --- | --- |
| @modelcontextprotocol/sdk | ^1.29.0 | MCP protocol implementation |
| zod | ^3.0.0 | Input schema validation |
| @mozilla/readability | ^0.5.0 | Article extraction |
| turndown | ^7.0.0 | HTML to Markdown conversion |
| undici | ^8.2.0 | HTTP client |
| linkedom | ^0.18.0 | DOM parsing |

Sources: package.json:36-47

Source: https://github.com/vasylenko/markfetch / Human Manual

Environment Variables

Related topics: HTTP/2 Fingerprinting, Write Sandbox Security, Error Handling

markfetch uses environment variables to configure runtime behavior at startup. These variables control network timeouts, response size limits, HTTP fingerprinting, and file write permissions for the MCP server.

Overview

Environment variables in markfetch serve as the primary configuration mechanism. Unlike per-request options, these settings apply globally to every operation and are validated once at process startup. This fail-fast design prevents misconfiguration from producing confusing per-request errors later.

graph TD
    A[Process Start] --> B[Validate MARKFETCH_TIMEOUT_MS]
    A --> C[Validate MARKFETCH_MAX_BYTES]
    A --> D[Validate MARKFETCH_USER_AGENT]
    A --> E[Build MARKFETCH_ALLOWED_WRITE_ROOTS]
    B --> F{Valid?}
    C --> F
    D --> F
    E --> F
    F -->|Yes| G[Server Ready]
    F -->|No| H[Exit with stderr error]

All validation occurs before the server begins accepting requests. Invalid values cause immediate process termination with a descriptive error message written to stderr.

Configuration Variables

MARKFETCH_TIMEOUT_MS

| Property | Value |
| --- | --- |
| Default | 30000 (30 seconds) |
| Purpose | Per-request timeout in milliseconds |
| Type | Positive integer |

Controls the maximum duration allowed for any single HTTP request, including DNS resolution, TCP connection, TLS handshake, and response body transfer.

const config = {
  timeoutMs: intEnv("MARKFETCH_TIMEOUT_MS", 30_000),
};

Validation rejects non-positive integers, non-integer values, and non-finite numbers (NaN, Infinity). A malformed value produces:

[core] Error: Invalid MARKFETCH_TIMEOUT_MS="abc" — expected a positive integer.

Sources: src/core.ts:1-50

MARKFETCH_MAX_BYTES

| Property | Value |
| --- | --- |
| Default | 5000000 (5 MB, about 4.77 MiB) |
| Purpose | Cap on response body and extracted markdown |
| Type | Positive integer |

Both the raw HTTP response body and the final extracted markdown are checked against this limit. If either exceeds the cap, the operation returns a too_large error.

const config = {
  maxBytes: intEnv("MARKFETCH_MAX_BYTES", 5_000_000),
};

Sources: src/core.ts:1-50
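
The double check described above can be sketched as follows. This is a minimal illustration rather than the code in src/core.ts: the helper `assertWithinCap` and the `TooLargeError` class are hypothetical names, and the sketch assumes sizes are measured in UTF-8 bytes.

```typescript
// Hypothetical error type standing in for markfetch's too_large error.
class TooLargeError extends Error {
  readonly code = "too_large";
}

// Hypothetical helper: throws when `text` exceeds `maxBytes` UTF-8 bytes.
function assertWithinCap(text: string, maxBytes: number, what: string): void {
  const bytes = Buffer.byteLength(text, "utf8");
  if (bytes > maxBytes) {
    throw new TooLargeError(`${what} is ${bytes} bytes, over the ${maxBytes}-byte cap`);
  }
}

// The cap applies twice: once to the raw body, once to the extracted markdown.
const maxBytes = 5_000_000;
assertWithinCap("<html><body><p>Hello</p></body></html>", maxBytes, "response body");
assertWithinCap("Hello", maxBytes, "markdown");
```

Measuring bytes rather than string length matters for non-ASCII pages, where one character can occupy several bytes.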

MARKFETCH_USER_AGENT

| Property | Value |
| --- | --- |
| Default | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 |
| Purpose | HTTP User-Agent header and Sec-CH-UA-* client hints |
| Type | String (must contain a "Chrome/<version>" token) |

The User-Agent string determines both the HTTP header sent to servers and the derived Sec-CH-UA-* client hints. The hints are derived at startup and remain fixed for the process lifetime.

graph LR
    A[MARKFETCH_USER_AGENT] --> B[deriveClientHints]
    B --> C[Sec-CH-UA]
    B --> D[Sec-CH-UA-Mobile]
    B --> E[Sec-CH-UA-Platform]
    A --> F[User-Agent Header]

function deriveClientHints(ua: string): {
  brands: string;
  mobile: string;
  platform: string;
} {
  const versionMatch = /\bChrome\/(\d+)/.exec(ua);
  if (!versionMatch) {
    throw new Error(
      `Invalid MARKFETCH_USER_AGENT=${JSON.stringify(ua)} — expected a Chrome User-Agent containing "Chrome/..."`
    );
  }
  // ...
}

The UA must contain a Chrome version string. Non-Chrome UAs fail fast at startup to prevent fingerprinting mismatches that would increase bot detection.

Sources: src/core.ts:1-50
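
The body of deriveClientHints is elided in the excerpt above, so the following is only a sketch of how such a derivation could work. The function name `sketchClientHints`, the brand list, and the platform mapping are all assumptions made for illustration, not markfetch's actual values.

```typescript
// Hypothetical sketch; the real deriveClientHints body is not shown in the source.
interface ClientHints {
  brands: string;   // value for Sec-CH-UA
  mobile: string;   // value for Sec-CH-UA-Mobile
  platform: string; // value for Sec-CH-UA-Platform
}

function sketchClientHints(ua: string): ClientHints {
  const versionMatch = /\bChrome\/(\d+)/.exec(ua);
  if (!versionMatch) {
    throw new Error(`Invalid User-Agent ${JSON.stringify(ua)}: no "Chrome/<major>" token`);
  }
  const major = versionMatch[1];
  // Assumed platform mapping based on common UA substrings.
  // Android is tested before Linux because Android UAs also contain "Linux".
  const platform = /Windows/.test(ua) ? "Windows"
    : /Android/.test(ua) ? "Android"
    : /Mac OS X/.test(ua) ? "macOS"
    : /Linux/.test(ua) ? "Linux"
    : "Unknown";
  return {
    // Assumed brand list; real Chrome additionally sends a GREASE brand entry.
    brands: `"Chromium";v="${major}", "Google Chrome";v="${major}"`,
    mobile: /Mobile/.test(ua) ? "?1" : "?0",
    platform: `"${platform}"`,
  };
}
```

Deriving all three hints from one string keeps the headers mutually consistent, which is the property the fail-fast check protects.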

Write Sandbox (MCP-Only)

MARKFETCH_ALLOWED_WRITE_ROOTS

| Property | Value |
| --- | --- |
| Default | os.tmpdir() ∪ process.cwd() |
| Purpose | Restrict MCP savePath writes to specific directories |
| Type | Platform-delimiter-separated absolute paths |
| Platform | POSIX: `:` delimiter; Windows: `;` delimiter |
| Mode | MCP-only (CLI has no sandbox) |

This variable applies exclusively to the MCP server mode. The CLI operates without restriction, treating the human at the shell as the security boundary.

graph TD
    A[MCP savePath request] --> B{Path inside allowed roots?}
    B -->|Yes| C[Write file]
    B -->|No| D[Return save_forbidden error]
    C --> E[Confirmation to client]
    D --> F[No file created]

When set, the value replaces the defaults entirely rather than merging with them. To retain access to the default directories, include them explicitly:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
      }
    }
  }
}

On Windows:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "C:\\Users\\me\\markfetch-out;C:\\Users\\me\\AppData\\Local\\Temp"
      }
    }
  }
}

Validation Rules

Each entry in the list must be:

  1. An absolute path (relative paths fail fast)
  2. An existing directory at startup
  3. Resolvable through symlinks, so that containment checks run against the canonical path

function buildAllowedRoots(envValue?: string): string[] {
  // ...
}

Symlinks pointing outside the sandbox are blocked. The canonicalized path flows from the containment check into writeFile, ensuring the file is created exactly at the validated location.

Sources: src/sandbox.ts:1-50, src/mcp.ts:1-50
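
Since the body of buildAllowedRoots is elided above, here is a minimal sketch of the three validation rules under stated assumptions. The function name `sketchAllowedRoots` and the error wording are hypothetical; only the Node.js standard-library calls are real.

```typescript
import { realpathSync, statSync } from "node:fs";
import { tmpdir } from "node:os";
import { delimiter, isAbsolute } from "node:path";

// Hypothetical sketch of the rules above; not the code in src/sandbox.ts.
function sketchAllowedRoots(envValue?: string): string[] {
  // Unset or empty: fall back to the documented defaults.
  const entries = envValue
    ? envValue.split(delimiter).filter((entry) => entry !== "")
    : [tmpdir(), process.cwd()];
  return entries.map((entry) => {
    if (!isAbsolute(entry)) {
      throw new Error(`Invalid write root ${JSON.stringify(entry)}: expected an absolute path`);
    }
    // realpathSync throws when the path does not exist, which gives fail-fast behavior.
    const resolved = realpathSync(entry);
    if (!statSync(resolved).isDirectory()) {
      throw new Error(`Invalid write root ${JSON.stringify(entry)}: not a directory`);
    }
    return resolved;
  });
}
```

Using `path.delimiter` gives `:` on POSIX and `;` on Windows without any platform branching.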

Error Codes

When environment variable validation fails, markfetch writes to stderr and exits with a non-zero status:

| Error Code | Trigger | Exit Status |
| --- | --- | --- |
| Startup failure | Invalid MARKFETCH_TIMEOUT_MS | Non-zero |
| Startup failure | Invalid MARKFETCH_MAX_BYTES | Non-zero |
| Startup failure | Non-Chrome MARKFETCH_USER_AGENT | Non-zero |
| Startup failure | Malformed MARKFETCH_ALLOWED_WRITE_ROOTS | Non-zero |
| Runtime error | save_forbidden (MCP only) | None; returned to the client with isError: true |

Startup failures caused by invalid environment values (e.g., MARKFETCH_TIMEOUT_MS="abc") differ from request-scoped errors like http_error or timeout. Environment misconfiguration is always fatal at startup, whereas request-scoped errors are reported to the caller and the process keeps running.

Environment Variable Summary

| Variable | Default | Scope | Purpose |
| --- | --- | --- | --- |
| MARKFETCH_TIMEOUT_MS | 30000 | Both | Request timeout in ms |
| MARKFETCH_MAX_BYTES | 5000000 | Both | Response and markdown size cap |
| MARKFETCH_USER_AGENT | Chrome 130 string | Both | HTTP fingerprint |
| MARKFETCH_ALLOWED_WRITE_ROOTS | tmpdir + cwd | MCP only | Write sandbox boundaries |

Configuration Priority

Environment variables set at process startup take precedence over all other configuration. There is no runtime override mechanism—changing these values requires restarting the server.

graph TD
    A[Environment Variable] --> B[Validated at Startup]
    B --> C[Stored in config object]
    C --> D[Used by core.ts pipeline]
    D --> E[HTTP Request]
    D --> F[File Write]
    D --> G[Response Validation]

Security Considerations

The write sandbox exists because the MCP tool is driven by a language model, which may be steered by content from a page it just fetched. Without sandboxing, a malicious page could induce the model to request writes outside expected directories.

The CLI intentionally has no sandbox—direct human invocation at the shell establishes the trust boundary.

Sources: README.md:1-100

Sources: src/core.ts:1-50

Write Sandbox Security

Related topics: MCP Server Integration, Environment Variables, Error Handling

Overview

The Write Sandbox is a security mechanism in markfetch that restricts filesystem writes initiated via the MCP (Model Context Protocol) interface to a configurable set of allowed root directories. This protection prevents a language model, which may be influenced by fetched content, from writing files to arbitrary locations on the host system.

The sandbox enforces path containment by resolving symlinks and comparing canonicalized paths against the configured allowed roots. Any attempted write outside the sandbox boundary returns a save_forbidden error and the file is never created.

Purpose and Scope

Security Boundary

The sandbox exists because MCP tools are driven by a language model that can be steered by content from pages it fetches. Without containment:

  • A malicious or compromised webpage could instruct the LLM to write files to sensitive locations (e.g., ~/.ssh/authorized_keys, ~/.bashrc)
  • Path traversal attempts via symlinks could escape expected boundaries
  • Untrusted fetched content could modify configuration files or inject malicious code

The CLI mode intentionally has no sandbox. A human at the shell is considered the security boundary, as the user has direct control over command invocation and can review output before it reaches any model.

Scope Limitations

| Scope | Sandboxed? |
| --- | --- |
| MCP server (fetch_markdown tool) | Yes |
| CLI mode (markfetch <url>) | No |
| Direct node execution | No |

Sources: README.md:68-70

Configuration

Environment Variable

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| MARKFETCH_ALLOWED_WRITE_ROOTS | String | os.tmpdir() + process.cwd() | Path-delimiter-separated list of absolute paths permitted as MCP savePath write roots |

Path Delimiters

The delimiter varies by platform:

| Platform | Delimiter | Example |
| --- | --- | --- |
| POSIX (Linux, macOS) | : | /tmp:/home/user/markfetch-out |
| Windows | ; | C:\Users\me\markfetch-out;C:\Temp |

Behavior Rules

  1. Replacement, not merge: When set, the variable replaces the defaults entirely. To retain access to os.tmpdir() or process.cwd(), explicitly include them.
  2. Validation at startup: Malformed values (non-absolute entries, nonexistent directories) cause the server to fail fast on stderr.
  3. Realpath resolution: Each root is resolved once via fs.realpath at startup to canonicalize symlinks.

Sources: README.md:71-89

Configuration Example

POSIX:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
      }
    }
  }
}

Windows:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "MARKFETCH_ALLOWED_WRITE_ROOTS": "C:\\Users\\me\\markfetch-out;C:\\Users\\me\\AppData\\Local\\Temp"
      }
    }
  }
}

Security Model

Path Resolution Flow

graph TD
    A[User provides savePath] --> B{Is path absolute?}
    B -->|No| E[Error: savePath must be absolute]
    B -->|Yes| C[Resolve via fs.realpath]
    C --> D{Is resolved path inside allowed roots?}
    D -->|Yes| F[Allow write to resolved path]
    D -->|No| G[Return save_forbidden error]
    
    H[Allowed roots from env] --> I[Realpath-resolved at startup]
    I --> D

The sandbox protects against symlink-based escapes:

  1. Resolve before check: Symlinks are resolved via fs.realpath before containment validation
  2. Reuse the resolved path at write time: The canonicalized path from the validation check flows directly into writeFile, so the path is not resolved a second time
  3. No lexical comparison: A path like <sandbox>/link/.. is not compared lexically against the roots. It is resolved first, then validated

This prevents attacks where a symlink planted inside the sandbox points outside, collapsing lexically for the check but resolving to an external location at write time.

Sources: CHANGELOG.md:17-25
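
The gap between the lexical view and the filesystem view can be reproduced in a small experiment. This sketch assumes a POSIX-like system where the process may create symlinks; every directory name in it is made up for illustration.

```typescript
import { mkdirSync, mkdtempSync, realpathSync, symlinkSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, normalize, resolve, sep } from "node:path";

// Layout: <base>/sandbox/link is a symlink to <base>/outside/deep
const base = realpathSync(mkdtempSync(join(tmpdir(), "sandbox-demo-")));
const sandbox = join(base, "sandbox");
const deep = join(base, "outside", "deep");
mkdirSync(sandbox);
mkdirSync(deep, { recursive: true });
symlinkSync(deep, join(sandbox, "link"), "dir");

// Built by hand because path.join would already collapse the ".." segment.
const requested = [sandbox, "link", "..", "out.md"].join(sep);

// Lexical view: ".." cancels "link", so the path appears to stay in the sandbox.
const lexical = normalize(requested);

// Filesystem view: the symlink is followed first, then ".." is applied.
const actual = join(resolve(realpathSync(join(sandbox, "link")), ".."), "out.md");

const lexicalInside = lexical.startsWith(sandbox + sep);
const actualInside = actual.startsWith(sandbox + sep);
console.log({ lexicalInside, actualInside });
```

The lexical check reports the path as inside the sandbox while the resolved path lands in `<base>/outside`, which is why containment has to be decided on the resolved path.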

Platform-Specific Behaviors

| Platform | Case Sensitivity | Notes |
| --- | --- | --- |
| Linux/macOS | Case-sensitive | Paths must match exactly |
| Windows | Case-insensitive | C:\Users\Bob and c:\users\bob are equivalent |

On Windows, the containment check lowercases both the root and target paths before comparison.

Sources: src/sandbox.ts:28-30

Core Implementation

API Design

The sandbox module exposes two primary functions:

function buildAllowedRoots(env: Record<string, string | undefined>): string[]
function validateSavePath(
  savePath: string,
  roots: string[]
): { ok: boolean; resolved?: string; reason?: string }

`buildAllowedRoots()`

Parses MARKFETCH_ALLOWED_WRITE_ROOTS from environment variables:

| Parameter | Type | Description |
| --- | --- | --- |
| env | `Record<string, string \| undefined>` | Process environment variables |

| Return Type | Description |
| --- | --- |
| string[] | Array of absolute, realpath-resolved directory paths |

Logic:

  1. If MARKFETCH_ALLOWED_WRITE_ROOTS is unset: return [os.tmpdir(), process.cwd()]
  2. If set: split by platform delimiter, validate each is absolute and exists
  3. Resolve each via fs.realpath for canonical form

`validateSavePath()`

Validates a save path is within allowed roots:

| Parameter | Type | Description |
| --- | --- | --- |
| savePath | string | The requested save path |
| roots | string[] | Allowed root directories |

| Return Type | Description |
| --- | --- |
| { ok: true, resolved: string } | Path is allowed; resolved is the canonicalized path for writing |
| { ok: false, reason: string } | Path is outside sandbox; reason describes the violation |

Validation steps:

  1. Resolve savePath via fs.realpath
  2. For each root, compute relative path from root to resolved target
  3. If relative path is empty (same directory) or does not start with .. and is not absolute: allow
  4. Otherwise: reject with reason listing allowed roots
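
Steps 2 to 4 can be sketched with `path.relative`. This is an illustrative helper, not the code in src/sandbox.ts: the name `isInsideRoot` is hypothetical, both arguments are assumed to be canonicalized already, and the case folding mirrors the Windows behavior described earlier.

```typescript
import { isAbsolute, relative, sep } from "node:path";

// Hypothetical helper; `root` and `target` are assumed to be realpath-resolved.
function isInsideRoot(
  root: string,
  target: string,
  caseInsensitive: boolean = process.platform === "win32",
): boolean {
  const a = caseInsensitive ? root.toLowerCase() : root;
  const b = caseInsensitive ? target.toLowerCase() : target;
  const rel = relative(a, b);
  // Empty relative path: the target is the root itself.
  if (rel === "") return true;
  // A leading ".." segment climbs out of the root. Comparing whole segments
  // avoids rejecting a legitimate file whose name merely begins with "..".
  if (rel === ".." || rel.startsWith(".." + sep)) return false;
  // An absolute result means the target is on a different drive (Windows).
  return !isAbsolute(rel);
}
```

A caller would test the resolved target against each allowed root and accept the write as soon as one root contains it.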

Sources: src/sandbox.ts:1-50

Error Handling

Error Codes

| Code | Condition | Response |
| --- | --- | --- |
| save_forbidden | savePath resolves outside allowed roots | No file written; MCP returns error |
| save_failed | savePath is valid but writeFile fails | No file written; MCP returns error |

Error Message Format

A save_forbidden error uses the following format:

[save_forbidden] '<path>' is outside the allowed write roots: ['/allowed/root1', '/allowed/root2']

This provides:

  • The attempted path
  • The reason for rejection
  • The list of allowed roots for debugging

Sources: src/mcp.ts:8-13

MCP Integration

Tool Schema

server.registerTool("fetch_markdown", {
  inputSchema: {
    url: z.string().url().describe("..."),
    savePath: z.string()
      .refine(isAbsolute, "savePath must be an absolute filesystem path")
      .optional()
      .describe("Optional. When provided, the fetched markdown is written to this absolute filesystem path...")
  }
});

Validation Flow

  1. MCP adapter receives savePath parameter
  2. Validates path is absolute (via Zod schema)
  3. Calls validateSavePath(savePath, allowedRoots)
  4. If ok: false: throw MarkfetchError with save_forbidden code
  5. If ok: true: use resolved path for writeFile

Sources: src/mcp.ts:24-35

Architecture Diagram

graph LR
    subgraph MCP_Client
        A[LLM sends fetch_markdown with savePath]
    end
    
    subgraph MCP_Server
        B[src/mcp.ts - MCP adapter]
        C[src/core.ts - fetchMarkdown]
        D[src/sandbox.ts - validateSavePath]
    end
    
    subgraph File_System
        E[fs.realpath resolution]
        F[fs.writeFile]
    end
    
    A --> B
    B -->|validate path| D
    D -->|resolve symlink| E
    E -->|check containment| D
    D -->|ok: true| C
    C -->|write markdown| F
    
    D -->|ok: false| B
    B -->|save_forbidden| A

CLI vs MCP Behavior

| Aspect | CLI Mode | MCP Mode |
| --- | --- | --- |
| Write sandbox | None | Enforced |
| Path validation | Not performed | Required |
| Symlink resolution | Not performed | Required |
| savePath parameter | Optional, -o flag | Optional, tool parameter |
| Relative path resolution | Resolves against cwd | Not allowed (must be absolute) |

The CLI adapter resolves relative paths internally for convenience, but the MCP adapter requires absolute paths and enforces the sandbox.

Sources: src/cli.ts:6-18

Security Considerations

Attack Vectors Mitigated

  1. Path traversal: ../../etc/passwd is resolved before checking
  2. Symlink escape: <sandbox>/link_to_external is resolved and rejected
  3. Case confusion (Windows): C:\Users\Bob equals c:\users\bob
  4. Tilde expansion: Not performed; shell expands ~ before argv reaches process

Remaining Trust Boundaries

| Trust Level | Description |
| --- | --- |
| Filesystem permissions | Sandbox does not override OS file permissions |
| Network | Does not prevent network-based attacks |
| Content injection | Does not sanitize markdown content before writing |

| File | Role |
| --- | --- |
| src/sandbox.ts | Core sandbox validation logic |
| src/mcp.ts | MCP server adapter, uses sandbox |
| src/cli.ts | CLI adapter, no sandbox |
| src/core.ts | Core fetch pipeline |
| README.md | User documentation and configuration |
| CHANGELOG.md | Historical security fix for symlink escape |

Changelog

| Version | Change |
| --- | --- |
| 0.6.0 | Current release; write sandbox and save_forbidden error |
| 0.5.0 | CLI mode added (unrestricted by design) |
| < 0.5.0 | MCP-only |

Sources: package.json:3

Sources: README.md:68-70

Error Handling

Related topics: Processing Pipeline, Write Sandbox Security, Environment Variables

markfetch implements a deterministic, structured error handling system that provides consistent error reporting across both CLI and MCP interfaces. All errors are categorized into specific codes that enable precise failure diagnosis and appropriate recovery strategies.

Error Code Reference

markfetch defines eight deterministic error codes that cover all failure scenarios. Each code is designed to be actionable, helping callers understand exactly what went wrong and how to respond.

| Error Code | Meaning | Typical Cause |
| --- | --- | --- |
| network_error | DNS, TCP, or TLS failure | Firewall blocking, network unavailable, invalid hostname |
| http_error | Non-2xx HTTP response | 404 page not found, 403 forbidden, 500 server error |
| timeout | Request exceeded MARKFETCH_TIMEOUT_MS | Slow server, large page, network latency |
| unsupported_content_type | Response is not HTML | Binary files, JSON APIs, PDF documents |
| extraction_failed | Readability found no article content | Pure client-rendered SPAs with no static HTML |
| too_large | Body or markdown exceeded MARKFETCH_MAX_BYTES | Very large articles with embedded media |
| save_failed | File write operation failed | Missing parent directory, permission denied |
| save_forbidden | Save path outside allowed write roots | Path traverses symlink outside sandbox |

Sources: README.md

Error Architecture

The error handling system follows a layered architecture where core validation and error creation happen in src/core.ts, while each adapter (CLI and MCP) provides interface-specific error formatting and reporting.

graph TD
    A[Request] --> B[core.ts Validation]
    B --> C{Error Condition?}
    C -->|No| D[Successful Fetch]
    C -->|Yes| E[MarkfetchError Thrown]
    E --> F[Adapter Layer]
    F --> G[CLI Adapter]
    F --> H[MCP Adapter]
    G --> I[stderr: [code] message]
    H --> J[content[0].text: [code] message]
    J --> K[isError: true]

Sources: src/core.ts, src/cli.ts, src/mcp.ts

MarkfetchError Class

The central error type is MarkfetchError, which encapsulates both the error code and human-readable message. This class serves as the single error type thrown throughout the application.

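// Simplified view of the error type. Extending Error keeps stack traces
// and lets generic `instanceof Error` checks keep working.
class MarkfetchError extends Error {
  constructor(
    public readonly code: ErrorCode,
    message: string,
  ) {
    super(message);
    this.name = "MarkfetchError";
  }
}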

Sources: src/core.ts:1-100

Environment Variable Validation

markfetch validates configuration environment variables at startup to fail fast on misconfiguration rather than producing confusing per-request errors.

| Variable | Default | Validation Rules |
| --- | --- | --- |
| MARKFETCH_TIMEOUT_MS | 30000 | Positive integer |
| MARKFETCH_MAX_BYTES | 5000000 | Positive integer |
| MARKFETCH_USER_AGENT | Chrome 130 UA string | Must contain a "Chrome/<version>" token |

The intEnv function performs validation:

function intEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw == null || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isFinite(n) || !Number.isInteger(n) || n <= 0) {
    throw new Error(
      `Invalid ${name}=${JSON.stringify(raw)} — expected a positive integer.`,
    );
  }
  return n;
}

Sources: src/core.ts:1-100

User-Agent Validation

The MARKFETCH_USER_AGENT value must be a valid Chrome User-Agent string. This requirement exists because Sec-CH-UA-* client hints are derived from the User-Agent at startup, and a mismatch between the two creates a stronger bot signal.

function deriveClientHints(ua: string): {
  brands: string;
  mobile: string;
  platform: string;
} {
  const versionMatch = /\bChrome\/(\d+)/.exec(ua);
  if (!versionMatch) {
    throw new Error(
      `Invalid MARKFETCH_USER_AGENT=${JSON.stringify(ua)} — expected a Chrome User-Agent containing "Chrome/VERSION".`,
    );
  }
  // ...
}

Sources: src/core.ts:1-100

CLI Error Handling

The CLI adapter catches errors thrown from core and formats them for stderr output. Error output follows a consistent [code] message format that matches the MCP error format exactly.

try {
  const { markdown, bytes, savedTo } = await fetchMarkdown({
    url,
    savePath,
  });
  // ... success handling
} catch (err) {
  const { code, message } = classifyError(err);
  console.error(`[${code}] ${message}`);
  // Use exitCode so pending output drains before process exits
  process.exitCode = 1;
}

Sources: src/cli.ts:1-50

CLI Exit Codes

| Scenario | Exit Code | Output |
| --- | --- | --- |
| Success (stdout) | 0 | Raw markdown |
| Success (save to file) | 0 | Saved X bytes to /path |
| Any error | 1 | [code] message to stderr |

The use of process.exitCode = 1 (rather than process.exit(1)) ensures pending stdout/stderr output drains before the process terminates, which is important when stdout is piped to a slow consumer.

Sources: src/cli.ts:1-50

MCP Error Handling

The MCP adapter returns errors in a format compatible with the MCP protocol. Errors appear in the content[0].text field with isError: true set.

function errorResult(code: ErrorCode, message: string) {
  return {
    content: [{ type: "text" as const, text: `[${code}] ${message}` }],
    isError: true,
  };
}

Sources: src/mcp.ts:1-50

MCP Response Structure for Errors

{
  "content": [
    {
      "type": "text",
      "text": "[network_error] DNS lookup failed"
    }
  ],
  "isError": true
}

Sources: src/mcp.ts:1-50

Write Sandbox Errors

The MCP interface enforces a write sandbox that restricts file saves to configured root directories. Errors occur when savePath resolves to a location outside the allowed roots.

export function checkWritePath(
  target: string,
  roots: string[],
): { ok: true; resolved: string } | { ok: false; reason: string } {
  // ... validation logic
  return {
    ok: false,
    reason: `'${reattached}' is outside the allowed write roots: [${roots.map((r) => `'${r}'`).join(", ")}]`,
  };
}

Sources: src/sandbox.ts:1-100

Allowed Write Roots Configuration

| Platform | Default Roots | Delimiter |
| --- | --- | --- |
| POSIX | os.tmpdir() + process.cwd() | : |
| Windows | os.tmpdir() + process.cwd() | ; |

Override with the MARKFETCH_ALLOWED_WRITE_ROOTS environment variable. When set, this replaces the defaults entirely rather than merging.

Sources: README.md

The sandbox correctly resolves symlinks to prevent escape attempts like <sandbox>/link/../out.md where link points outside the sandbox. The canonicalized path flows from the containment check into writeFile, ensuring the file is created exactly at the validated location.

Sources: CHANGELOG.md, src/sandbox.ts:1-100

Error Classification

The classifyError function normalizes different error types into the MarkfetchError format used throughout the system:

function classifyError(err: unknown): { code: string; message: string } {
  if (err instanceof MarkfetchError) {
    return { code: err.code, message: err.message };
  }
  if (err instanceof Error) {
    return { code: "network_error", message: err.message };
  }
  return { code: "network_error", message: String(err) };
}

Sources: src/core.ts:1-100

Error Source Mapping

| Error Source | Code Produced |
| --- | --- |
| MarkfetchError instances | Original code preserved |
| Error instances | network_error |
| Non-Error values | network_error with string coercion |

Unified Error Flow

Version 0.5.0 introduced a refactoring where three inline return errorResult(...) sites in the MCP handler were converted to throw MarkfetchError from core uniformly. Both adapters now catch and convert errors consistently.

This architectural change ensures that both CLI and MCP interfaces produce identical error codes and messages for the same failure conditions.

Sources: CHANGELOG.md

Best Practices for Error Handling

For MCP Clients

  1. Check isError field in the response object
  2. Parse the content[0].text field for the [code] message format
  3. Handle extraction_failed gracefully for client-rendered SPAs
  4. Use savePath parameter for large responses to avoid tool-result truncation
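
The first two steps can be sketched as a small client-side helper. The names `ToolResult` and `parseMarkfetchError` are hypothetical, and the `"unknown"` fallback code is an assumption for text that does not follow the `[code] message` format.

```typescript
// Shape of the relevant part of an MCP tool result, as described above.
interface ToolResult {
  content: { type: string; text: string }[];
  isError?: boolean;
}

// Hypothetical helper: returns the parsed error, or null when the call succeeded.
function parseMarkfetchError(
  result: ToolResult,
): { code: string; message: string } | null {
  if (!result.isError) return null;
  const text = result.content[0]?.text ?? "";
  const match = /^\[([a-z_]+)\]\s*([\s\S]*)$/.exec(text);
  // Fall back to a generic code if the text does not follow "[code] message".
  return match
    ? { code: match[1], message: match[2] }
    : { code: "unknown", message: text };
}

const parsed = parseMarkfetchError({
  content: [{ type: "text", text: "[network_error] DNS lookup failed" }],
  isError: true,
});
console.log(parsed?.code);
```

Branching on the parsed `code` lets a client treat extraction_failed differently from transient failures such as timeout.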

For CLI Consumers

  1. Redirect stderr to capture error codes
  2. Parse [code] message format from stderr
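  3. Use markfetch <url> 2>&1 >/dev/null | head -1 to capture only the error line; discarding stdout keeps markdown output from being mistaken for an error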

For Save Operations

  1. Always use absolute paths for savePath
  2. Verify MARKFETCH_ALLOWED_WRITE_ROOTS includes your target directory
  3. Check for save_forbidden before save_failed in error handling logic

Sources: README.md

Development Guide

Related topics: Introduction, Quick Start Guide

This guide provides comprehensive information for developers who want to understand, extend, or contribute to markfetch.

Overview

markfetch is a Node.js tool that fetches URLs and converts web content to clean markdown. It operates in two modes:

  1. CLI Mode - Command-line interface for shell integration
  2. MCP Mode - Model Context Protocol server for AI agent integration

The project requires Node.js ≥ 24 and is distributed as an npm package. Sources: package.json:8

Architecture

graph TD
    A[User Input] --> B{process.argv.length}
    B -->|One or more user args| C[CLI Adapter]
    B -->|Zero args| D[MCP Adapter]
    
    C --> E[src/cli.ts]
    D --> F[src/mcp.ts]
    
    E --> G[src/core.ts]
    F --> G
    
    G --> H[undici HTTP Client]
    G --> I[linkedom HTML Parser]
    G --> J[@mozilla/readability]
    G --> K[turndown]
    
    H --> L[HTTP Response]
    I --> M[DOM Document]
    J --> N[Extracted Article]
    K --> O[Markdown Output]

Core Pipeline (src/core.ts)

The core module implements the main fetch-and-convert pipeline. It orchestrates:

| Component | Role |
| --- | --- |
| undici | HTTP/2 transport with Chrome-like fingerprinting |
| linkedom | HTML parsing to DOM |
| @mozilla/readability | Article content extraction |
| turndown | HTML to markdown conversion |

Sources: src/core.ts:1-50

Adapters (src/cli.ts & src/mcp.ts)

The adapters and the shared core are split across the following files:

| File | Purpose |
| --- | --- |
| src/core.ts | Pipeline + errors (shared logic) |
| src/mcp.ts | MCP stdio server adapter |
| src/cli.ts | CLI argv parser + dispatcher |
| src/index.ts | Lazy-import dispatcher based on process.argv.length |

Sources: README.md:95-100

The lazy-import dispatcher ensures console.log calls in cli.ts are never reachable from the MCP path, maintaining the invariant that stdout is reserved for MCP frames. Sources: CHANGELOG.md:45-47
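
The dispatch decision can be sketched as a pure function. The name `chooseMode` and the module paths in the trailing comment are assumptions for illustration; the actual src/index.ts may be organized differently.

```typescript
type Mode = "cli" | "mcp";

// Hypothetical helper. process.argv always begins with the node binary and the
// script path, so user-supplied arguments start at index 2.
function chooseMode(argv: readonly string[]): Mode {
  return argv.length > 2 ? "cli" : "mcp";
}

console.log(chooseMode(process.argv));

// A dispatcher built on this would load only the adapter it needs, e.g.
//   if (chooseMode(process.argv) === "cli") await import("./cli.js");
//   else await import("./mcp.js");
// so the CLI module is never evaluated when running as an MCP server.
```

Because the import happens inside the branch, code paths that print to stdout in the CLI adapter cannot run in MCP mode.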

Setting Up the Development Environment

Prerequisites

  • Node.js ≥ 24
  • npm or yarn

Installation

# Clone the repository
git clone https://github.com/vasylenko/markfetch.git
cd markfetch

# Install dependencies
npm install

Available Scripts

| Script | Command | Purpose |
| --- | --- | --- |
| dev | npm run dev | Run source directly with tsx (no build required) |
| build | npm run build | Compile TypeScript to JavaScript |
| test | npm run test | Run test suite with tsx |
| inspect | npm run inspect | Launch MCP inspector for debugging |

Sources: package.json:21-28

Build Process

The build process consists of two steps, both triggered by a single command:

# Compile TypeScript; npm then runs the postbuild script automatically
npm run build

# Optional: re-run only the post-build step
npm run postbuild

The postbuild script (scripts/postbuild.mjs) performs additional transformations after TypeScript compilation. Sources: package.json:26

Project Structure

markfetch/
├── src/
│   ├── index.ts      # Entry point with argv dispatcher
│   ├── core.ts       # Core fetch/extract/convert pipeline
│   ├── cli.ts        # CLI adapter using commander
│   ├── mcp.ts        # MCP stdio server
│   └── sandbox.ts    # Write path sandboxing
├── dist/             # Compiled JavaScript output
├── tests/            # Test fixtures and test files
├── scripts/
│   └── postbuild.mjs # Post-compilation transformations
└── docs/
    └── SPEC.md       # Detailed specification

Configuration

Environment Variables

| Variable | Default | Purpose |
| --- | --- | --- |
| MARKFETCH_TIMEOUT_MS | 30000 | Per-request timeout in milliseconds |
| MARKFETCH_MAX_BYTES | 5000000 | Cap on response body and extracted markdown |
| MARKFETCH_USER_AGENT | Chrome 130 string | Override the User-Agent header |
| MARKFETCH_ALLOWED_WRITE_ROOTS | os.tmpdir() + process.cwd() | MCP-only write sandbox roots |

Sources: README.md:60-66

Configuration Precedence

  1. Environment variables set at startup
  2. Command-line flags (CLI mode)
  3. MCP tool parameters (MCP mode)

Core API

fetchMarkdown Function

The main function exported from core.ts:

interface FetchOptions {
  url: string;
  savePath?: string;
}

interface FetchResult {
  markdown: string;
  bytes: number;
  savedTo?: string;
}

Error Handling

The core module defines eight deterministic error codes:

| Code | Meaning |
| --- | --- |
| network_error | DNS/TCP/TLS failure |
| http_error | Non-2xx HTTP status |
| timeout | Request timeout exceeded |
| unsupported_content_type | Not text/html or application/xhtml+xml |
| extraction_failed | Readability found no article content |
| too_large | Response or markdown exceeded size cap |
| save_failed | File write failed (permissions, missing directory) |
| save_forbidden | Path outside allowed write roots |

Sources: README.md:71-80

Errors are thrown as MarkfetchError from core uniformly and caught by adapters for conversion. Sources: CHANGELOG.md:49-51

Extending the Pipeline

Adding New HTML Rewrites

The rewriteForReadability() function in core.ts handles pre-extraction HTML transformations:

function rewriteForReadability(document: Document): void {
  // Transform <aside class="footnote-brackets"> to <section>
  // Flatten <details> elements
  // Replace div.mw-heading with their heading children
}

To add new rewrite rules, append them to the body of this function so they run before Readability extraction. Sources: src/core.ts:120-160

Customizing Markdown Conversion

The TURNDOWN instance is configured with:

| Plugin/Option | Purpose |
| --- | --- |
| gfm plugin | GitHub Flavored Markdown support |
| keepClasses: true | Preserve class="language-X" for code fences |
| Custom escape | Handle -/= after inline elements |

Sources: src/core.ts:50-90

Modifying Error Handling

Error handling flows through the MarkfetchError class in core:

  1. Core throws MarkfetchError with code and message
  2. Adapters catch and format for their protocol
  3. CLI: writes [code] message to stderr
  4. MCP: returns { content: [...], isError: true }

Sources: src/cli.ts:35-42, src/mcp.ts:15-20

Write Sandbox

The MCP adapter enforces write path restrictions:

graph TD
    A[MCP savePath] --> B{Absolute path?}
    B -->|No| C[Refine fails: savePath must be absolute]
    B -->|Yes| D{Inside allowed roots?}
    D -->|Yes| E[Write file]
    D -->|No| F[Return save_forbidden error]

Configuring Allowed Roots

Set the MARKFETCH_ALLOWED_WRITE_ROOTS environment variable, separating roots with the platform path delimiter (: on POSIX, ; on Windows):

# POSIX
export MARKFETCH_ALLOWED_WRITE_ROOTS="/tmp:/home/user/docs"

# Windows (cmd.exe; quote the whole assignment so the quotes are not stored in the value)
set "MARKFETCH_ALLOWED_WRITE_ROOTS=C:\Users\me\docs;C:\temp"

The sandbox check resolves symlinks and applies case-folding on Windows. Sources: src/sandbox.ts:20-40
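To make the root check concrete, here is a minimal lexical sketch using only node:path. The function names `parseAllowedRoots` and `isInsideAllowedRoots` are hypothetical and are not taken from src/sandbox.ts. The sketch also omits symlink resolution, which a real check would need (for example via fs.realpathSync) before comparing paths.

```typescript
import * as path from "node:path";

// Split the env value on the platform delimiter and normalize each root.
// `p` defaults to the host platform; pass path.posix or path.win32 to
// exercise the other platform's rules.
function parseAllowedRoots(
  raw: string | undefined,
  p: path.PlatformPath = path,
): string[] {
  if (!raw) return [];
  return raw
    .split(p.delimiter)
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => p.resolve(entry));
}

// Lexical containment check (assumption: symlinks are NOT resolved here).
function isInsideAllowedRoots(
  target: string,
  roots: string[],
  p: path.PlatformPath = path,
): boolean {
  if (!p.isAbsolute(target)) return false;
  // Case-fold only when using Windows path rules.
  const fold = (s: string): string => (p.sep === "\\" ? s.toLowerCase() : s);
  const resolved = fold(p.resolve(target));
  return roots.some((root) => {
    const rel = p.relative(fold(root), resolved);
    return (
      rel !== "" &&
      rel !== ".." &&
      !rel.startsWith(".." + p.sep) &&
      !p.isAbsolute(rel)
    );
  });
}
```

Comparing via `relative` rather than a string prefix avoids treating /tmpfoo as being inside /tmp. A caller would map a `false` result to the save_forbidden error.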

Testing

Running Tests

npm test

Test Structure

Tests use the Node.js built-in test runner (--test flag) with tsx for TypeScript support. Sources: package.json:27

Writing New Tests

  1. Place test files in the tests/ directory
  2. Use the *.test.ts naming pattern
  3. Run with tsx --test tests/*.test.ts
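A new test file following the steps above might look like the sketch below. It uses node:test and node:assert, which ship with Node.js. The `formatCliError` helper is a hypothetical stand-in defined inline so the example is self-contained; a real test would import the function under test from src/ instead.

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";

// Hypothetical stand-in for a function imported from src/.
function formatCliError(code: string, message: string): string {
  return `[${code}] ${message}`;
}

test("formats errors as [code] message", () => {
  assert.equal(
    formatCliError("timeout", "request timed out"),
    "[timeout] request timed out",
  );
});

test("keeps the message text unchanged", () => {
  const line = formatCliError("http_error", "status 404");
  assert.ok(line.endsWith("status 404"));
});
```

Saved as tests/format.test.ts (an illustrative file name), it would be picked up by the tests/*.test.ts glob.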

MCP Inspector

Debug MCP integration using the official inspector:

npm run inspect

This launches the MCP inspector at http://localhost:6274 where you can:

  • Test tool calls interactively
  • Inspect request/response frames
  • Verify schema validation

Sources: package.json:27

Dependencies

Production Dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| @modelcontextprotocol/sdk | ^1.29.0 | MCP server implementation |
| @mozilla/readability | ^0.5.0 | Article extraction |
| commander | ^14.0.3 | CLI argument parsing |
| linkedom | ^0.18.0 | HTML parsing |
| turndown | ^7.0.0 | HTML to markdown |
| turndown-plugin-gfm | ^1.0.2 | GFM support |
| undici | ^8.2.0 | HTTP client |
| zod | ^3.0.0 | Schema validation |

Development Dependencies

| Package | Purpose |
| --- | --- |
| @types/node | Node.js type definitions |
| @types/turndown | Turndown type definitions |
| tsx | TypeScript execution |
| typescript | TypeScript compiler |

Sources: package.json:30-50

Version History

| Version | Date | Key Changes |
| --- | --- | --- |
| 0.6.0 | 2026-05-13 | Write sandbox, Windows CI, save_forbidden error |
| 0.5.0 | 2026-05-12 | CLI mode, commander dependency |
| 0.4.1 | 2026-05-11 | README rewrite, bin path fix |
| 0.4.0 | 2026-05-10 | MCP server with fetch_markdown tool |

Sources: CHANGELOG.md:1-60

Contributing Guidelines

Code Standards

  • All source in TypeScript under src/
  • Build output to dist/ via npm run build
  • Tests in tests/ with *.test.ts pattern
  • No runtime console.log in MCP path (enforced by lazy-import structure)

Pull Request Checklist

  • [ ] Run npm run build successfully
  • [ ] Run npm test with all tests passing
  • [ ] Update CHANGELOG.md with changes
  • [ ] Ensure documentation reflects new behavior

Release Process

npm run prepublishOnly

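The prepublishOnly script runs the build. npm also invokes it automatically before npm publish, so running it by hand is only needed to verify the build ahead of a release. Sources: package.json:29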

Sources: src/core.ts:1-50

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Doramagic extracted 7 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: v0.4.1

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v0.4.1. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vasylenko/markfetch/releases/tag/v0.4.1

2. Capability assumption: README/documentation is current enough for a first validation pass.

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: capability.assumptions | github_repo:1234238440 | https://github.com/vasylenko/markfetch | README/documentation is current enough for a first validation pass.

3. Maintenance risk: Maintainer activity is unknown

  • Severity: medium
  • Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | last_activity_observed missing

4. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: downstream_validation.risk_items | github_repo:1234238440 | https://github.com/vasylenko/markfetch | no_demo; severity=medium

5. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: risks.scoring_risks | github_repo:1234238440 | https://github.com/vasylenko/markfetch | no_demo; severity=medium

6. Maintenance risk: issue_or_pr_quality=unknown

  • Severity: low
  • Finding: issue_or_pr_quality=unknown.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | issue_or_pr_quality=unknown

7. Maintenance risk: release_recency=unknown

  • Severity: low
  • Finding: release_recency=unknown.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | release_recency=unknown

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 2. Count of project-level external discussion links exposed on this manual page.

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using markfetch with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence