Doramagic Project Pack · Human Manual
markfetch
Related topics: Quick Start Guide, Processing Pipeline
Introduction
What is markfetch?
markfetch is a Node.js tool that fetches public HTTP/S URLs and returns clean, readable markdown — indistinguishable from what a human would get by running "Save as Markdown" in a browser. It is designed to provide high-quality content extraction for language models, with a focus on producing output that LLM clients can actually consume reliably.
Sources: README.md
Core Design Philosophy
markfetch is built around several key principles that differentiate it from generic fetching solutions:
| Principle | Description |
|---|---|
| Single-channel output | Returns markdown in content[0].text only — no structuredContent that some LLM clients drop |
| Real-browser fingerprint | Uses HTTP/2 transport with a coherent Chrome header set and Sec-CH-UA-* client hints |
| Reader-View extraction | Leverages Mozilla's Readability library to extract the main article content |
| Zero-config defaults | Works out of the box with sensible defaults |
| Deterministic errors | 8 structured error codes for reliable error handling |
Sources: README.md
Architecture Overview
markfetch follows an adapter pattern with a unified core:
graph TD
A[User / LLM Client] --> B[Adapter Layer]
B --> C{Invocation Mode}
C -->|CLI args| D[cli.ts]
C -->|MCP stdio| E[mcp.ts]
D --> F[core.ts - fetchMarkdown]
E --> F
F --> G[HTTP Fetch - undici]
G --> H[Readability Extraction]
H --> I[Turndown Conversion]
I --> J[Markdown Output]
Core Components
| Component | File | Responsibility |
|---|---|---|
| Core Pipeline | src/core.ts | URL fetching, HTML parsing, content extraction, markdown conversion, error throwing |
| CLI Adapter | src/cli.ts | Command-line argument parsing, stdout/stderr output |
| MCP Adapter | src/mcp.ts | Model Context Protocol stdio server, tool registration |
| Write Sandbox | src/sandbox.ts | Path validation for file saves |
Sources: src/core.ts, src/cli.ts, src/mcp.ts
Two Operating Modes
CLI Mode
The command-line interface accepts a URL and outputs markdown to stdout:
markfetch https://en.wikipedia.org/wiki/Markdown
Options include:
- -o, --output <path>: Save markdown to a file
- -V, --version: Print version
- -h, --help: Print usage
The CLI respects the same environment variables as the MCP mode and resolves relative output paths against the current working directory.
Sources: README.md, src/cli.ts
MCP Mode
The Model Context Protocol server provides a single tool fetch_markdown(url, savePath?) for integration with LLM clients like Claude Code, Cursor, or Goose:
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"]
}
}
}
The MCP mode has additional security features:
- Write sandbox: File saves are restricted to allowed write roots
- Lazy loading: The CLI adapter is never loaded in MCP mode, ensuring console.log is never reachable
Sources: src/mcp.ts, src/cli.ts
Content Extraction Pipeline
The markdown conversion process involves several stages:
graph LR
A[HTML Response] --> B[Decode Encoded Tags]
B --> C[Ensure Base Href]
C --> D[Rewrite for Readability]
D --> E[Readability Parse]
E --> F[Turndown Convert]
F --> G[Prune Empty Headings]
G --> H[Clean Markdown]
Extraction Details
- Encoded Tag Decoding: Handles HTML entities like <code> in code blocks
- Base Href Injection: Ensures relative URLs become absolute using the canonical URL
- Pre-processing Rewrites: Handles footnotes, `
Quick Start Guide
markfetch is a tool that fetches URLs and returns clean markdown output. It operates as both a CLI command and an MCP (Model Context Protocol) server, making it suitable for AI agents like Claude Code, Codex, and Gemini CLI.
Installation
Prerequisites
- Node.js ≥ 24 Sources: package.json:8
CLI Installation (Global)
npm i -g markfetch
After installation, the markfetch command is available globally. Sources: README.md:38
CLI Installation (npx)
For one-off usage without global installation:
npx -y markfetch <url>
MCP Server Setup
Add markfetch to your MCP client configuration. The setup varies by client.
#### Claude Code
claude mcp add --scope user markfetch -- npx -y markfetch
#### Codex
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"]
}
}
}
#### Gemini CLI
gemini mcp add -s user markfetch npx -y markfetch
#### Cursor / Goose / Other stdio-MCP Clients
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"]
}
}
}
Sources: README.md:46-69
CLI Usage
Basic Fetch
markfetch <url>
The fetched markdown is printed to stdout. Sources: src/cli.ts:18
Save to File
markfetch <url> -o <path>
Use -o or --output to save markdown to a file. Relative paths resolve against the current working directory. Sources: src/cli.ts:12-15
Example:
markfetch https://en.wikipedia.org/wiki/Markdown -o output.md
Help and Version
markfetch --help
markfetch --version
MCP Tool Usage
Tool Name
fetch_markdown
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Absolute http(s) URL to fetch. The server follows redirects automatically. No authentication headers, cookies, or session state are sent. |
| savePath | string | No | Absolute filesystem path. When provided, the fetched markdown is written to this path instead of returned in the response. |
Sources: src/mcp.ts:22-33
Return Value
The tool returns markdown content in content[0].text. No structuredContent field is used, so the result does not depend on a field that some MCP clients drop before it reaches the model. Sources: README.md:18-21
Environment Configuration
| Variable | Default | Purpose |
|---|---|---|
| MARKFETCH_TIMEOUT_MS | 30000 | Per-request timeout in milliseconds |
| MARKFETCH_MAX_BYTES | 5000000 | Cap on response body and extracted markdown (5 MB) |
| MARKFETCH_USER_AGENT | Pinned Chrome 130 string | Override the User-Agent header. Must be a Chrome UA string. |
| MARKFETCH_ALLOWED_WRITE_ROOTS | os.tmpdir() + process.cwd() | MCP-only. Colon-delimited (POSIX) or semicolon-delimited (Windows) list of absolute paths permitted for savePath writes. |
Sources: README.md:99-103
Passing Environment Variables to MCP
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"],
"env": {
"MARKFETCH_TIMEOUT_MS": "60000"
}
}
}
}
Error Handling
Errors are returned with deterministic codes in the format [code] message:
| Code | Meaning |
|---|---|
| network_error | DNS, TCP, or TLS failure |
| http_error | Upstream returned a non-2xx status |
| timeout | Request exceeded MARKFETCH_TIMEOUT_MS |
| unsupported_content_type | Response was not text/html or application/xhtml+xml |
| extraction_failed | Readability found no article content (typical for pure client-rendered SPAs) |
| too_large | Response body or extracted markdown exceeded MARKFETCH_MAX_BYTES |
| save_failed | writeFile failed (missing directory, permission denied) |
| save_forbidden | savePath resolves outside the allowed write roots |
Errors go to stderr with non-zero exit status in CLI mode. Sources: README.md:72-85
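A caller that consumes markfetch output can split the [code] message format back into its parts. The sketch below is only an illustration under stated assumptions: parseMarkfetchError is a hypothetical helper that is not part of markfetch, and the code list is copied from the table above.

```typescript
// The eight error codes listed in the table above.
const ERROR_CODES = [
  "network_error",
  "http_error",
  "timeout",
  "unsupported_content_type",
  "extraction_failed",
  "too_large",
  "save_failed",
  "save_forbidden",
] as const;

type ErrorCode = (typeof ERROR_CODES)[number];

interface ParsedError {
  code: ErrorCode;
  message: string;
}

// Hypothetical helper: returns null when the text is not a
// "[code] message" string with one of the known codes.
function parseMarkfetchError(text: string): ParsedError | null {
  const match = /^\[([a-z_]+)\] ([\s\S]*)$/.exec(text);
  if (match === null) return null;
  const code = ERROR_CODES.find((known) => known === match[1]);
  if (code === undefined) return null;
  return { code, message: match[2] ?? "" };
}
```

Matching against a fixed list keeps ordinary markdown that happens to start with a bracket (for example a link) from being mistaken for an error.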
Quick Workflow
graph TD
A[Start markfetch] --> B{Arguments provided?}
B -->|Yes, URL argument| C[CLI Mode]
B -->|No arguments| D[MCP Server Mode]
C --> E[Fetch URL]
D --> F[Wait for MCP request]
E --> G{Output path specified?}
F --> H[Receive fetch_markdown request]
G -->|No| I[Print to stdout]
G -->|Yes, -o path| J[Write to file]
H --> I
J --> K[Return confirmation]
I --> L[Return markdown content]
K --> L
Use Cases
| Use Case | Recommended Mode | Command/Config |
|---|---|---|
| One-time URL fetch in shell | CLI | markfetch <url> |
| Batch processing with shell scripts | CLI + -o | markfetch <url> -o out.md |
| AI agent web content retrieval | MCP | Configure in client |
| Large document bypass inline limits | MCP + savePath | Set savePath to local file |
Limitations
- Not a crawler: No recursion, no robots.txt parsing. One URL in, one document out. Sources: README.md:89-91
- Not authenticated: Anonymous fetch only. Pages behind login walls return whatever the public response is. Sources: README.md:93-95
- Not a JS renderer: Pure client-rendered SPAs with no static HTML return extraction_failed. Sources: README.md:97-99
Sources: README.md
Processing Pipeline
Related topics: Introduction, HTTP/2 Fingerprinting, Error Handling
Overview
The Processing Pipeline is the core data flow engine in markfetch. It transforms raw HTML fetched from a URL into clean, readable markdown suitable for consumption by AI agents and language models. The pipeline is intentionally single-purpose — one URL in, one markdown document out — with no recursion, pagination, or client-side JavaScript rendering.
The pipeline operates identically whether invoked via CLI or MCP adapter, ensuring consistent behavior across both interfaces.
Sources: src/core.ts
Architecture
The pipeline is composed of three primary stages executed sequentially:
graph TD
A[URL Input] --> B[HTTP Fetch]
B --> C{HTML Valid?}
C -->|No| D[Error: network_error / http_error / timeout]
C -->|Yes| E[Content-Type Check]
E -->|Non-HTML| F[Error: unsupported_content_type]
E -->|HTML| G[Extract Article]
G -->|No Content| H[Error: extraction_failed]
G -->|Extracted| I[Convert to Markdown]
I --> J{Size Check}
J -->|Exceeds Limit| K[Error: too_large]
J -->|Valid| L{Save Path?}
L -->|Yes| M[Write to File / Error: save_forbidden / save_failed]
L -->|No| N[Return Markdown]
Each stage performs validation and may abort with a deterministic error code, ensuring failures are predictable and actionable.
Sources: src/core.ts
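Two of the checks in the diagram, the content-type gate and the size cap, can be sketched as small functions that throw a coded error. This is a minimal illustration, not the code in src/core.ts: StageError, assertHtml, and assertWithinLimit are hypothetical names, and the real pipeline may structure these checks differently.

```typescript
type StageErrorCode = "unsupported_content_type" | "too_large";

// Hypothetical error type carrying one of the deterministic codes.
class StageError extends Error {
  code: StageErrorCode;
  constructor(code: StageErrorCode, message: string) {
    super(message);
    this.name = "StageError";
    this.code = code;
  }
}

// Content-Type check: only text/html and application/xhtml+xml pass.
function assertHtml(contentType: string | null): void {
  // Drop parameters such as "; charset=utf-8" before comparing.
  const [first = ""] = (contentType ?? "").split(";");
  const mime = first.trim().toLowerCase();
  if (mime !== "text/html" && mime !== "application/xhtml+xml") {
    throw new StageError("unsupported_content_type", `Unsupported content type: ${mime || "(none)"}`);
  }
}

// Size check: measured in UTF-8 bytes, not in characters.
function assertWithinLimit(markdown: string, maxBytes: number): void {
  const bytes = new TextEncoder().encode(markdown).length;
  if (bytes > maxBytes) {
    throw new StageError("too_large", `${bytes} bytes exceeds the limit of ${maxBytes}`);
  }
}
```

Throwing at the first failed check means later stages never run on invalid input, which is what keeps each failure tied to a single code.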
Stage 1: HTTP Fetch
The fetch stage retrieves raw HTML from the target URL using Node.js fetch with a real-browser fingerprint.
Transport Configuration
| Setting | Value | Purpose |
|---|---|---|
| Protocol | HTTP/2 | Modern web fingerprint |
| User-Agent | Chrome 130 (pinned) | Realistic browser identification |
| Client Hints | Sec-CH-UA-* headers | Derived from User-Agent at startup |
| Timeout | MARKFETCH_TIMEOUT_MS (default: 30000ms) | Per-request budget |
The User-Agent string is validated at startup. Non-Chrome strings fail fast to prevent fingerprint inconsistencies that could trigger bot detection.
Sources: README.md
Error Conditions
| Code | Trigger |
|---|---|
| network_error | DNS failure, TCP failure, TLS error, unexpected fetcher error |
| http_error | Non-2xx HTTP status code |
| timeout | Response exceeds MARKFETCH_TIMEOUT_MS |
Redirects are followed automatically by the underlying HTTP client.
Stage 2: Article Extraction
Article extraction identifies and isolates the main content from the fetched HTML, stripping navigation, sidebars, footers, and other boilerplate.
Technology Stack
| Component | Library | Purpose |
|---|---|---|
| HTML Parser | linkedom | Parses HTML into a DOM-like structure |
| Extraction | readability (Mozilla) | Identifies main article content |
| Configuration | keepClasses: true | Preserves code block language hints |
linkedom supplies the DOM that Readability needs, since Node.js has no built-in DOMParser, and it behaves consistently across Node.js versions and environments.
Sources: src/core.ts
Pre-Extraction Rewrites
Before Readability processes the document, the pipeline applies targeted HTML rewrites to normalize content and improve extraction quality:
function rewriteForReadability(document: Document): void {
// Normalize code blocks (pre and code elements)
// Convert aside elements to sections
// Expand details/summary elements
// Flatten MediaWiki heading wrappers
}
Specific transformations include:
| Transform | Target | Action |
|---|---|---|
| Code block normalization | <pre>, <code> | Standardize encoding artifacts |
| Base href injection | <head> / <html> | Ensure absolute URLs after redirects |
| Aside conversion | <aside> with footnote roles | Convert to <section> |
| Details expansion | `
HTTP/2 Fingerprinting
Overview
HTTP/2 Fingerprinting is a technique used by markfetch to mimic real browser traffic when fetching web pages. Instead of making requests that appear to come from a typical HTTP library (like curl or a basic fetch implementation), markfetch generates HTTP/2 requests with headers and client hints that closely match those of an actual Chrome browser session.
This approach serves two critical purposes:
- Bypass anti-bot measures: Many websites employ fingerprinting techniques to detect and block automated scrapers. By presenting headers that closely match those of a genuine Chrome browser, markfetch is less likely to trigger these defenses.
- Access SEO-rendered content: Sites that serve different content to bots vs. browsers will return the full article content when markfetch requests arrive with Chrome-like fingerprints.
Sources: README.md
Architecture
graph TD
A[URL Request] --> B{Adapter Type?}
B -->|MCP| C[src/mcp.ts]
B -->|CLI| D[src/cli.ts]
C --> E[src/core.ts - fetchMarkdown]
D --> E
E --> F[Undici Dispatcher]
F --> G[HTTP/2 Transport]
G --> H[Sec-CH-UA-* Client Hints]
G --> I[Chrome Headers]
H --> J[Upstream Server]
I --> J
J --> K[HTML Response]
K --> L[Readability Parser]
L --> M[Markdown Output]
Implementation Details
User Agent String
The default user agent is a pinned Chrome 130 string. This can be overridden via the MARKFETCH_USER_AGENT environment variable, but must be a valid Chrome UA string.
| Environment Variable | Default Value | Purpose |
|---|---|---|
| MARKFETCH_USER_AGENT | Pinned Chrome 130 string | Override the browser fingerprint UA |
Constraint: The UA string must be a Chrome browser UA. Non-Chrome strings fail fast at startup because Sec-CH-UA-* client hints are derived from the UA at initialization time.
Sources: README.md
Client Hints Generation
When the server starts, markfetch parses the MARKFETCH_USER_AGENT value and derives Sec-CH-UA-* client hint headers from it. These hints are sent with every HTTP/2 request and include:
- Sec-CH-UA: Browser brand and version
- Sec-CH-UA-Mobile: Mobile indicator
- Sec-CH-UA-Platform: Operating system
graph LR
A[MARKFETCH_USER_AGENT<br/>Chrome 130] --> B[Startup<br/>Initialization]
B --> C[Sec-CH-UA Header<br/>Derived Value]
B --> D[Sec-CH-UA-Mobile<br/>Derived Value]
B --> E[Sec-CH-UA-Platform<br/>Derived Value]
C --> F[Every HTTP/2<br/>Request]
D --> F
E --> F
Sources: README.md
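The derivation described above can be sketched as a pure function from a User-Agent string to the three hint headers. This is an assumption-laden illustration rather than markfetch's code: deriveClientHints is a hypothetical name, the brand list (including the "Not?A_Brand" placeholder entry) is illustrative, and the platform detection is deliberately simple.

```typescript
interface ClientHints {
  "sec-ch-ua": string;
  "sec-ch-ua-mobile": string;
  "sec-ch-ua-platform": string;
}

// Hypothetical helper: derive Sec-CH-UA-* values from a Chrome UA string,
// failing fast when the string does not identify Chrome.
function deriveClientHints(userAgent: string): ClientHints {
  const major = /Chrome\/(\d+)/.exec(userAgent)?.[1];
  if (major === undefined) {
    throw new Error("User-Agent must be a Chrome UA string");
  }

  // Android is tested before Linux because Android UAs also contain "Linux".
  let platform = "Unknown";
  if (/Windows/.test(userAgent)) platform = "Windows";
  else if (/Android/.test(userAgent)) platform = "Android";
  else if (/Macintosh|Mac OS X/.test(userAgent)) platform = "macOS";
  else if (/Linux/.test(userAgent)) platform = "Linux";

  return {
    // Illustrative brand list; real Chrome varies the placeholder brand by version.
    "sec-ch-ua": `"Chromium";v="${major}", "Google Chrome";v="${major}", "Not?A_Brand";v="99"`,
    // Structured-header boolean: ?1 for mobile, ?0 otherwise.
    "sec-ch-ua-mobile": /Mobile/.test(userAgent) ? "?1" : "?0",
    "sec-ch-ua-platform": `"${platform}"`,
  };
}
```

Deriving all three headers from one string is what keeps the header set coherent: a Windows User-Agent can never be paired with a macOS platform hint.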
HTTP/2 Transport
Markfetch uses the undici HTTP client library with HTTP/2 protocol support. The HTTP/2 transport is selected automatically by undici when the server supports it, enabling:
- Multiplexed requests over a single connection
- Header compression
- Server push capabilities
The combination of HTTP/2 transport and a coherent Chrome header set creates a fingerprint that closely resembles the requests visible in Chrome DevTools during a normal browsing session.
Sources: README.md
Request Flow
sequenceDiagram
participant Client
participant Markfetch
participant Undici
participant Server
Client->>Markfetch: fetch_markdown(url)
Markfetch->>Markfetch: Validate MARKFETCH_USER_AGENT
Markfetch->>Undici: Dispatch with Chrome headers
Undici->>Server: HTTP/2 connection<br/>Sec-CH-UA: "Chromium"
Undici->>Server: Sec-CH-UA-Mobile: ?0
Undici->>Server: Sec-CH-UA-Platform: "Windows"
Undici->>Server: GET /path HTTP/2
Server->>Undici: HTTP/2 200 OK<br/>text/html
Undici->>Markfetch: HTML Content
Markfetch->>Markfetch: Apply Readability
Markfetch->>Markfetch: Convert to Markdown
Markfetch->>Client: Clean Markdown
Configuration
Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| MARKFETCH_TIMEOUT_MS | 30000 | Per-request timeout in milliseconds |
| MARKFETCH_MAX_BYTES | 5000000 | Cap on response body and extracted markdown |
| MARKFETCH_USER_AGENT | Pinned Chrome 130 | Browser fingerprint override |
Validation
All environment variables are validated at startup. Invalid values cause the process to fail fast on stderr with descriptive error messages, rather than producing confusing per-request errors.
Sources: README.md
Integration Points
MCP Adapter
The MCP server (src/mcp.ts) uses the core fetch pipeline, which includes the HTTP/2 fingerprinting. The tool description summarizes what the tool fetches and returns:
Fetch a single public HTTP/S URL and return its main article content as clean markdown. Best for articles, documentation, blog posts, news, and reference pages. Non-HTML responses return unsupported_content_type.
Sources: src/mcp.ts
CLI Adapter
The CLI adapter (src/cli.ts) also uses the same core fetch pipeline, ensuring consistent HTTP/2 fingerprinting behavior whether invoked via MCP or command line:
markfetch https://en.wikipedia.org/wiki/Markdown
Sources: src/cli.ts
Version History
| Version | Date | Change |
|---|---|---|
| 0.4.0 | 2026-05-10 | HTTP/2 fingerprinting feature added with Sec-CH-UA-* client hints |
| 0.5.0 | 2026-05-12 | CLI mode added with same fingerprinting behavior |
| 0.6.0 | Current | Enhanced write sandbox and validation |
Sources: CHANGELOG.md
Limitations
SPA Handling
Pure client-rendered Single Page Applications (SPAs) with no static HTML content return extraction_failed. Sites that ship server-rendered or SEO-prerendered HTML will extract whatever static content they expose, including when accessed with Chrome fingerprints.
Authentication
Markfetch performs anonymous fetches only — no cookie jar, no auth headers, no session reuse. Pages behind login walls return whatever the public response is, usually surfaced as http_error.
Sources: README.md
Security Considerations
The HTTP/2 fingerprinting approach makes requests appear legitimate, which raises responsibility concerns. The documentation explicitly states:
Use it on URLs whose targets you have permission to fetch, and respect the terms of service of any site you query. The maintainer assumes no liability for misuse.
Sources: README.md
Sources: src/core.ts
CLI Usage
Related topics: Quick Start Guide, MCP Server Integration, Write Sandbox Security
The markfetch CLI provides a command-line interface for fetching URLs and converting their content to clean markdown. It operates as one of two execution surfaces—the other being the MCP (Model Context Protocol) stdio server—with both sharing the same underlying core pipeline.
Overview
The CLI accepts a URL as its primary argument and outputs the converted markdown to stdout or to a specified file. It was introduced in version 0.5.0 as a way to make markfetch accessible from standard shell environments, pipelines, and scripts.
| Aspect | Details |
|---|---|
| Entry Point | markfetch <url> |
| Output | stdout (default) or file via -o |
| Version | 0.6.0 |
| Runtime | Node.js ≥ 24 |
| Distribution | npm package |
Sources: README.md
Architecture
The CLI is implemented as an adapter layer that delegates to the shared core. When the process is invoked with arguments, the dispatcher in index.ts lazy-loads the CLI adapter; bare invocation (zero arguments) routes to the MCP server instead.
graph TD
A["markfetch CLI Invocation<br/>process.argv.length > 2"] --> B["src/index.ts<br/>Dispatcher"]
B --> C["src/cli.ts<br/>CLI Adapter"]
C --> D["src/core.ts<br/>fetchMarkdown()"]
D --> E["src/sandbox.ts<br/>Write Validation"]
D --> F["HTTP Fetch + Readability + Turndown"]
G["Bare Invocation<br/>process.argv.length === 2"] --> H["src/mcp.ts<br/>MCP Server"]
Sources: src/cli.ts:39-47
Command Syntax
markfetch <url> [options]
Arguments
| Argument | Required | Description |
|---|---|---|
| <url> | Yes | Absolute http(s) URL to fetch |
Options
| Flag | Description |
|---|---|
| -o, --output <path> | Save markdown to a file (absolute or relative path). Default is stdout. |
| -V, --version | Print version and exit |
| -h, --help | Print usage and exit |
Sources: src/cli.ts:23-30
Output Behavior
The CLI maintains strict separation between its output channels:
| Scenario | Channel | Content |
|---|---|---|
| Raw markdown (no -o) | stdout | Raw markdown body via process.stdout.write() |
| File output (-o) | stdout | Confirmation: Saved N bytes to <path> |
| Any error | stderr | [code] message |
The raw markdown is written using process.stdout.write() rather than console.log(), which would append a trailing newline. The output therefore matches the exact bytes the MCP adapter would emit in content[0].text.
Sources: src/cli.ts:50-58
Error Handling
Errors are written to stderr with a deterministic format: [code] message. The process exits with a non-zero status code.
process.exitCode = 1;
console.error(`[${code}] ${message}`);
The CLI uses process.exitCode (not process.exit()) to ensure pending output drains before the process exits—important when stdout is piped to a slow consumer.
Sources: src/cli.ts:58-62
Error Codes
| Code | Meaning |
|---|---|
| network_error | DNS / TCP / TLS failure |
| http_error | Upstream returned a non-2xx status |
| timeout | Request exceeded MARKFETCH_TIMEOUT_MS |
| unsupported_content_type | Response was not HTML |
| extraction_failed | No extractable article content |
| too_large | Response or markdown exceeded MARKFETCH_MAX_BYTES |
| save_failed | File write failed (permission denied, etc.) |
Note: save_forbidden is MCP-only and does not apply to CLI (no sandbox).
Sources: README.md
Path Resolution
The CLI resolves relative output paths against the current working directory before passing them to the core:
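import { resolve } from "node:path";
// options.output holds the -o/--output value when it was provided
const savePath = options.output
  ? resolve(process.cwd(), options.output)
  : undefined;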
Tilde expansion is intentionally not performed—the shell expands ~/foo before argv reaches the process, and a quoted literal '~/foo' should produce a file named ~/foo in cwd (standard tool behavior).
Sources: src/cli.ts:32-39
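The resolution rules above, including the literal-tilde case, can be checked with a small sketch. resolveOutputPath is a hypothetical helper that takes the working directory as a parameter, and it uses the POSIX flavour of the path module so the results are the same on every platform; the real CLI calls resolve() against process.cwd() as shown in the excerpt.

```typescript
import { posix } from "node:path";

// Hypothetical helper mirroring the excerpt above, with cwd passed in
// explicitly so the behavior is easy to inspect.
function resolveOutputPath(cwd: string, output: string | undefined): string | undefined {
  return output ? posix.resolve(cwd, output) : undefined;
}

// A relative path is joined onto the working directory.
const relativeCase = resolveOutputPath("/home/me/project", "out.md");
// An absolute path is returned unchanged.
const absoluteCase = resolveOutputPath("/home/me/project", "/tmp/x.md");
// A literal "~/foo" (quoted in the shell) is not expanded; it names a
// directory called "~" inside the working directory.
const tildeCase = resolveOutputPath("/home/me/project", "~/foo");
```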
Environment Variables
These environment variables apply to both CLI and MCP modes:
| Variable | Default | Purpose |
|---|---|---|
| MARKFETCH_TIMEOUT_MS | 30000 | Per-request timeout in ms |
| MARKFETCH_MAX_BYTES | 5000000 | Cap on response body and extracted markdown |
| MARKFETCH_USER_AGENT | Chrome 130 string | Override User-Agent header |
The CLI adapter imports fetchMarkdown and classifyError from the core module, which validates these environment variables at startup.
Sources: src/cli.ts:15 and README.md
File Structure
The project source is organized into adapter modules:
src/
├── index.ts # Dispatcher (lazy-loads cli.ts or mcp.ts)
├── core.ts # Shared pipeline and errors
├── cli.ts # CLI adapter (commander-based)
└── mcp.ts # MCP stdio server adapter
The lazy-import pattern ensures that cli.ts code (which calls console.log) is never loaded when running in MCP mode, preserving the "stdout is reserved for MCP frames" invariant structurally.
Sources: CHANGELOG.md and src/cli.ts:1-13
Installation
Install globally via npm:
npm i -g markfetch
Or use via npx without installation:
npx -y markfetch <url>
The bin entry in package.json points to dist/index.js:
{
"bin": {
"markfetch": "dist/index.js"
}
}
Sources: package.json:16-18
Usage Examples
Basic fetch to stdout
markfetch https://en.wikipedia.org/wiki/Markdown
Save to file
markfetch https://example.com/article -o output.md
With timeout override
MARKFETCH_TIMEOUT_MS=60000 markfetch https://slow-site.example.com
Pipeline to another tool
markfetch https://example.com/doc | grep -A5 "## Installation"
Sources: README.md
MCP Server Integration
Related topics: Quick Start Guide, CLI Usage, Write Sandbox Security
Overview
The MCP (Model Context Protocol) Server Integration is the primary interface for AI agents to fetch web content as clean markdown. Markfetch exposes a single MCP tool fetch_markdown that accepts a URL and returns extracted markdown content, enabling language models like Claude to access web information through a standardized protocol.
The MCP server operates as a stdio-based server, meaning it communicates exclusively through standard input and standard output streams. This design ensures the server integrates seamlessly with MCP clients including Claude Desktop, Claude Code, Cursor, and Goose.
Architecture
Entry Point Dispatcher
The src/index.ts file implements an argv-discriminated dispatcher that determines whether to start the MCP server or the CLI based on the presence of command-line arguments:
if (process.argv.length === 2) {
await import("./mcp.js");
} else {
await import("./cli.js");
}
Sources: src/index.ts:26-29
When process.argv.length === 2, the process was invoked without arguments—this is the standard pattern MCP clients use when spawning a server. Any extra argument (URL, flags, --help) routes to the CLI adapter.
Module Isolation
The dynamic import pattern ensures complete module isolation:
graph TD
A[markfetch entry] --> B{argv.length === 2?}
B -->|Yes| C[Lazy import: mcp.ts]
B -->|No| D[Lazy import: cli.ts]
C --> E[@modelcontextprotocol/sdk loaded]
D --> F[commander loaded]
E -.-> G[Never reaches console.log]
F -.-> H[Can use console.log]
Sources: src/index.ts:18-22
This architecture enforces the "stdout is reserved for MCP frames" invariant structurally—the MCP path never imports cli.ts, so code that calls console.log is literally unreachable from the MCP execution path.
MCP Server Implementation
Server Initialization
The MCP server is initialized using the @modelcontextprotocol/sdk package:
const server = new McpServer({ name: "markfetch", version: "0.6.0" });
Sources: src/mcp.ts:20
Tool Registration
The server registers a single tool fetch_markdown with a Zod-based input schema:
server.registerTool(
"fetch_markdown",
{
description: "Fetch a single public HTTP/S URL and return its main article content as clean markdown...",
inputSchema: {
url: z.string().url().describe("Absolute http(s) URL of the page to fetch..."),
savePath: z.string().refine(isAbsolute, "savePath must be an absolute filesystem path").optional().describe("Optional. When provided...")
}
},
async ({ url, savePath }) => {
// Implementation
}
);
Sources: src/mcp.ts:22-47
Tool Input Schema
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Absolute http(s) URL of the page to fetch. The server follows redirects automatically. No authentication headers, cookies, or session state are sent. |
| savePath | string | No | Optional absolute filesystem path. When provided, the fetched markdown is written to this path instead of returned inline. |
The url parameter is validated using Zod's .url() method to ensure a valid URL format. The savePath parameter must be an absolute path, enforced by the .refine(isAbsolute, ...) check.
Response Format
The tool returns a response in this structure:
{
content: [{ type: "text", text: "markdown content or [code] message" }],
isError: boolean
}
Sources: src/mcp.ts:8-12
Error Handling
Error Code System
The MCP adapter uses a uniform error code system with 8 deterministic codes:
| Error Code | Description | Source |
|---|---|---|
| network_error | DNS/TCP/TLS failure or unexpected internal error | core.ts |
| http_error | Upstream returned non-2xx status | core.ts |
| timeout | Per-request budget exceeded | core.ts |
| unsupported_content_type | Response was not text/html or application/xhtml+xml | core.ts |
| extraction_failed | Readability returned no article content | core.ts |
| too_large | Response or markdown exceeded MARKFETCH_MAX_BYTES | core.ts |
| save_failed | writeFile failed (permission denied, missing directory) | core.ts |
| save_forbidden | savePath resolves outside allowed write roots | src/mcp.ts |
Error Result Factory
function errorResult(code: ErrorCode, message: string) {
return {
content: [{ type: "text" as const, text: `[${code}] ${message}` }],
isError: true,
};
}
Sources: src/mcp.ts:8-12
Error Propagation Pattern
In version 0.5.0, error handling was refactored so that core functions now throw MarkfetchError instead of returning error results inline. Both the MCP and CLI adapters catch these exceptions and convert them to their respective output formats.
Sources: CHANGELOG.md:19-21
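The throw-then-convert pattern described above might look roughly like the sketch below. Only the MarkfetchError name and the [code] message format come from this manual; the class body, the toToolResult helper, and the synchronous run callback are assumptions made to keep the example small (the real pipeline is asynchronous).

```typescript
type ErrorCode =
  | "network_error" | "http_error" | "timeout" | "unsupported_content_type"
  | "extraction_failed" | "too_large" | "save_failed" | "save_forbidden";

// Assumed shape: an Error subclass that carries one deterministic code.
class MarkfetchError extends Error {
  code: ErrorCode;
  constructor(code: ErrorCode, message: string) {
    super(message);
    this.name = "MarkfetchError";
    this.code = code;
  }
}

interface ToolResult {
  content: { type: "text"; text: string }[];
  isError: boolean;
}

// Hypothetical adapter step: run the core, then map success or a thrown
// error onto the single-channel MCP result shape.
function toToolResult(run: () => string): ToolResult {
  try {
    return { content: [{ type: "text", text: run() }], isError: false };
  } catch (err) {
    // Anything that is not a MarkfetchError is reported as network_error,
    // matching the "unexpected internal error" wording in the table above.
    const code: ErrorCode = err instanceof MarkfetchError ? err.code : "network_error";
    const message = err instanceof Error ? err.message : String(err);
    return { content: [{ type: "text", text: `[${code}] ${message}` }], isError: true };
  }
}
```

A CLI adapter would catch the same exception and print the same `[code] message` text to stderr, so the two surfaces differ only in where the string goes.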
Write Sandbox (MCP-Specific)
The MCP server implements a write sandbox that restricts savePath operations to a set of allowed root directories.
Default Allowed Roots
By default, the allowed set is:
- os.tmpdir() (system temp directory)
- process.cwd() (current working directory)
Each path is resolved via fs.realpath at startup to handle symlinks.
Configuration
The MARKFETCH_ALLOWED_WRITE_ROOTS environment variable overrides the default set entirely:
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"],
"env": {
"MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
}
}
}
}
Sources: README.md:89-100
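How a colon-delimited root list could be parsed and enforced is sketched below. This is an illustration under stated assumptions, not the markfetch sandbox: parseWriteRoots and isInsideRoots are hypothetical names, only POSIX paths are handled, and the fs.realpath symlink resolution mentioned above is left out.

```typescript
import { posix as path } from "node:path";

// Split a MARKFETCH_ALLOWED_WRITE_ROOTS-style value into normalized roots.
function parseWriteRoots(raw: string, delimiter: string = ":"): string[] {
  return raw
    .split(delimiter)
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => path.resolve(entry));
}

// True when savePath, after normalization, lies strictly inside a root.
function isInsideRoots(savePath: string, roots: string[]): boolean {
  const target = path.resolve(savePath);
  return roots.some((root) => {
    const rel = path.relative(root, target);
    // A relative path that climbs out of the root starts with "..".
    const escapes = rel === ".." || rel.startsWith("../");
    return rel !== "" && !escapes && !path.isAbsolute(rel);
  });
}

const roots = parseWriteRoots("/Users/me/markfetch-out:/tmp");
```

Comparing normalized paths with path.relative, rather than a string prefix test, is what rejects both /tmp/../etc/passwd and a sibling directory such as /tmpfoo.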
Security Rationale
The sandbox is MCP-only by design. The CLI is unrestricted because "a human at the shell is the security boundary." The asymmetry exists because the MCP tool is driven by a language model, which may be steered by content from a page it just fetched.
Sources: README.md:102-104
Request Flow
sequenceDiagram
participant Client as MCP Client
participant MCP as MCP Server
participant Core as fetchMarkdown()
participant Fetch as HTTP Fetcher
Client->>MCP: fetch_markdown({url, savePath?})
MCP->>Core: fetchMarkdown({url, savePath})
Core->>Fetch: GET url (with Chrome fingerprint)
Fetch-->>Core: HTML response
Core->>Core: Readability parsing
Core->>Core: Turndown conversion
alt savePath provided
Core->>Core: Write to file (within sandbox)
end
Core-->>MCP: {markdown, bytes, savedTo?}
MCP-->>Client: {content: [{text: markdown}], isError: false}
Environment Configuration
| Variable | Default | Purpose | MCP-Specific |
|---|---|---|---|
| MARKFETCH_TIMEOUT_MS | 30000 | Per-request timeout in ms | No |
| MARKFETCH_MAX_BYTES | 5000000 | Cap on response body and extracted markdown | No |
| MARKFETCH_USER_AGENT | Chrome 130 string | Override the User-Agent header | No |
| MARKFETCH_ALLOWED_WRITE_ROOTS | os.tmpdir() + process.cwd() | Permitted write roots for savePath | Yes |
Sources: src/mcp.ts:1-5, README.md:68-75
Integration with Clients
Claude Desktop / Claude Code
claude mcp add --scope user markfetch -- npx -y markfetch
Sources: README.md:40-43
Codex
codex mcp add markfetch -- npx -y markfetch
Sources: README.md:46-48
Manual Configuration
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"]
}
}
}
Sources: README.md:52-58
Dependencies
The MCP server depends on:
| Package | Version | Purpose |
|---|---|---|
| @modelcontextprotocol/sdk | ^1.29.0 | MCP protocol implementation |
| zod | ^3.0.0 | Input schema validation |
| @mozilla/readability | ^0.5.0 | Article extraction |
| turndown | ^7.0.0 | HTML to Markdown conversion |
| undici | ^8.2.0 | HTTP client |
| linkedom | ^0.18.0 | DOM parsing |
Sources: package.json:36-47
Source: https://github.com/vasylenko/markfetch / Human Manual
Environment Variables
Related topics: HTTP/2 Fingerprinting, Write Sandbox Security, Error Handling
markfetch uses environment variables to configure runtime behavior at startup. These variables control network timeouts, response size limits, HTTP fingerprinting, and file write permissions for the MCP server.
Overview
Environment variables in markfetch serve as the primary configuration mechanism. Unlike per-request options, these settings apply globally to every operation and are validated once at process startup. This fail-fast design prevents misconfiguration from producing confusing per-request errors later.
graph TD
A[Process Start] --> B[Validate MARKFETCH_TIMEOUT_MS]
A --> C[Validate MARKFETCH_MAX_BYTES]
A --> D[Validate MARKFETCH_USER_AGENT]
A --> E[Build MARKFETCH_ALLOWED_WRITE_ROOTS]
B --> F{Valid?}
C --> F
D --> F
E --> F
F -->|Yes| G[Server Ready]
F -->|No| H[Exit with stderr error]
All validation occurs before the server begins accepting requests. Invalid values cause immediate process termination with a descriptive error message written to stderr.
Configuration Variables
MARKFETCH_TIMEOUT_MS
| Property | Value |
|---|---|
| Default | 30000 (30 seconds) |
| Purpose | Per-request timeout in milliseconds |
| Type | Positive integer |
Controls the maximum duration allowed for any single HTTP request, including DNS resolution, TCP connection, TLS handshake, and response body transfer.
const config = {
timeoutMs: intEnv("MARKFETCH_TIMEOUT_MS", 30_000),
};
Validation rejects non-positive integers, non-integer values, and non-finite numbers (NaN, Infinity). A malformed value produces:
[core] Error: Invalid MARKFETCH_TIMEOUT_MS="abc" — expected a positive integer.
Sources: src/core.ts:1-50
MARKFETCH_MAX_BYTES
| Property | Value |
|---|---|
| Default | 5000000 (5 MB, about 4.77 MiB) |
| Purpose | Cap on response body and extracted markdown |
| Type | Positive integer |
Both the raw HTTP response body and the final extracted markdown are checked against this limit. If either exceeds the cap, the operation returns a too_large error.
const config = {
maxBytes: intEnv("MARKFETCH_MAX_BYTES", 5_000_000),
};
Sources: src/core.ts:1-50
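The two-stage size check described above can be sketched as follows. This is a minimal illustration rather than markfetch's actual code: `assertWithinCap` and `SizeError` are hypothetical names, and measuring size in UTF-8 bytes with `Buffer.byteLength` is an assumption about how the cap is applied.

```typescript
// Hypothetical sketch: names and structure are illustrative, not taken from src/core.ts.
class SizeError extends Error {
  readonly code = "too_large";
}

// Returns the UTF-8 byte length of `text`, or throws when it exceeds `maxBytes`.
function assertWithinCap(label: string, text: string, maxBytes: number): number {
  const bytes = Buffer.byteLength(text, "utf8");
  if (bytes > maxBytes) {
    throw new SizeError(`${label} is ${bytes} bytes, over the ${maxBytes}-byte cap`);
  }
  return bytes;
}

const maxBytes = 5_000_000; // default MARKFETCH_MAX_BYTES
// Stage 1 checks the raw response body; stage 2 checks the extracted markdown.
assertWithinCap("response body", "<html>...</html>", maxBytes);
const markdownBytes = assertWithinCap("markdown", "# Title\n\nBody", maxBytes);
```

Counting bytes rather than characters matters for non-ASCII pages, where one character can occupy several bytes.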
MARKFETCH_USER_AGENT
| Property | Value |
|---|---|
| Default | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 |
| Purpose | HTTP User-Agent header and Sec-CH-UA-* client hints |
| Type | String (must contain "Chrome/" followed by a major version number) |
The User-Agent string determines both the HTTP header sent to servers and the derived Sec-CH-UA-* client hints. The hints are derived at startup and remain fixed for the process lifetime.
graph LR
A[MARKFETCH_USER_AGENT] --> B[deriveClientHints]
B --> C[Sec-CH-UA]
B --> D[Sec-CH-UA-Mobile]
B --> E[Sec-CH-UA-Platform]
A --> F[User-Agent Header]
function deriveClientHints(ua: string): {
brands: string;
mobile: string;
platform: string;
} {
const versionMatch = /\bChrome\/(\d+)/.exec(ua);
if (!versionMatch) {
throw new Error(
`Invalid MARKFETCH_USER_AGENT=${JSON.stringify(ua)} — expected a Chrome User-Agent containing "Chrome/..."`
);
}
// ...
}
The UA must contain a Chrome version string. Non-Chrome UAs fail fast at startup to prevent fingerprinting mismatches that would increase bot detection.
Sources: src/core.ts:1-50
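To make the derivation concrete, here is a hedged sketch of how the three hint values could be computed from a Chrome User-Agent. The function name `sketchClientHints`, the platform mapping, and the placeholder GREASE brand are all assumptions for illustration; the elided body of `deriveClientHints` in src/core.ts may differ.

```typescript
// Hypothetical sketch of UA-to-client-hint derivation; not the code in src/core.ts.
interface ClientHints {
  brands: string;   // value for Sec-CH-UA
  mobile: string;   // value for Sec-CH-UA-Mobile
  platform: string; // value for Sec-CH-UA-Platform
}

function sketchClientHints(ua: string): ClientHints {
  const versionMatch = /\bChrome\/(\d+)/.exec(ua);
  if (!versionMatch) {
    throw new Error(`Invalid User-Agent ${JSON.stringify(ua)}: expected "Chrome/" plus a version`);
  }
  const major = versionMatch[1];
  // Assumed platform mapping; a real implementation may cover more cases.
  const platform = /Windows/.test(ua) ? "Windows"
    : /Macintosh|Mac OS X/.test(ua) ? "macOS"
    : /Android/.test(ua) ? "Android"
    : "Linux";
  return {
    // The GREASE brand differs between Chrome releases; this one is a placeholder.
    brands: `"Chromium";v="${major}", "Google Chrome";v="${major}", "Not?A_Brand";v="99"`,
    mobile: /Mobile/.test(ua) ? "?1" : "?0",
    platform: `"${platform}"`,
  };
}

const hints = sketchClientHints(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
);
```

Deriving every hint from the same string keeps the header set coherent, which is the reason a non-Chrome User-Agent is rejected rather than sent with mismatched hints.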
Write Sandbox (MCP-Only)
MARKFETCH_ALLOWED_WRITE_ROOTS
| Property | Value |
|---|---|
| Default | os.tmpdir() ∪ process.cwd() |
| Purpose | Restrict MCP savePath writes to specific directories |
| Type | Platform-delimiter-separated absolute paths |
| Platform | POSIX: `:` delimiter; Windows: `;` delimiter |
| Mode | MCP-only (CLI has no sandbox) |
This variable applies exclusively to the MCP server mode. The CLI operates without restriction, treating the human at the shell as the security boundary.
graph TD
A[MCP savePath request] --> B{Path inside allowed roots?}
B -->|Yes| C[Write file]
B -->|No| D[Return save_forbidden error]
C --> E[Confirmation to client]
D --> F[No file created]
When set, the value replaces the defaults entirely rather than merging with them. To retain access to the default directories, include them explicitly:
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"],
"env": {
"MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
}
}
}
}
On Windows:
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"],
"env": {
"MARKFETCH_ALLOWED_WRITE_ROOTS": "C:\\Users\\me\\markfetch-out;C:\\Users\\me\\AppData\\Local\\Temp"
}
}
}
}
Validation Rules
Each entry in the list must be:
- An absolute path (relative paths fail fast)
- An existing directory at startup
- Resolved through symlinks for containment checks
function buildAllowedRoots(envValue?: string): string[] {
// ...
}
Symlinks pointing outside the sandbox are blocked. The canonicalized path flows from the containment check into writeFile, ensuring the file is created exactly at the validated location.
Sources: src/sandbox.ts:1-50, src/mcp.ts:1-50
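Since the body of `buildAllowedRoots` is elided above, the following sketch shows one way the validation rules could be implemented. It is an assumption-laden illustration: `sketchAllowedRoots` is a hypothetical name, the error wording is invented, and treating an empty value like an unset one is a guess rather than documented behavior.

```typescript
import { realpathSync, statSync } from "node:fs";
import { tmpdir } from "node:os";
import { delimiter, isAbsolute } from "node:path";

// Hypothetical sketch; src/sandbox.ts may differ in structure and error wording.
function sketchAllowedRoots(envValue: string | undefined): string[] {
  // Unset (or, by assumption, empty): fall back to the documented defaults.
  const entries = envValue
    ? envValue.split(delimiter).filter((entry) => entry !== "")
    : [tmpdir(), process.cwd()];
  return entries.map((entry) => {
    if (!isAbsolute(entry)) {
      throw new Error(
        `Invalid MARKFETCH_ALLOWED_WRITE_ROOTS entry ${JSON.stringify(entry)}: expected an absolute path`,
      );
    }
    // realpathSync throws for a missing path, so nonexistent roots also fail fast.
    const resolved = realpathSync(entry);
    if (!statSync(resolved).isDirectory()) {
      throw new Error(
        `Invalid MARKFETCH_ALLOWED_WRITE_ROOTS entry ${JSON.stringify(entry)}: not a directory`,
      );
    }
    return resolved;
  });
}

const defaultRoots = sketchAllowedRoots(undefined);
```

`path.delimiter` is `:` on POSIX and `;` on Windows, which matches the platform-specific separators listed in the table above.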
Error Codes
When environment variable validation fails, markfetch writes to stderr and exits with a non-zero status:
| Error Code | Trigger | Exit Status |
|---|---|---|
| Startup failure | Invalid MARKFETCH_TIMEOUT_MS | Non-zero |
| Startup failure | Invalid MARKFETCH_MAX_BYTES | Non-zero |
| Startup failure | Non-Chrome MARKFETCH_USER_AGENT | Non-zero |
| Startup failure | Malformed MARKFETCH_ALLOWED_WRITE_ROOTS | Non-zero |
| Runtime error | save_forbidden (MCP only) | None; returned to the client as a tool error |
Startup errors from invalid environment values (e.g., MARKFETCH_TIMEOUT_MS="abc") differ from request-scoped errors like http_error or timeout. Environment misconfiguration is always fatal at startup, whereas request-scoped errors are reported per request.
Environment Variable Summary
| Variable | Default | Scope | Purpose |
|---|---|---|---|
| `MARKFETCH_TIMEOUT_MS` | 30000 | Both | Request timeout in ms |
| `MARKFETCH_MAX_BYTES` | 5000000 | Both | Response and markdown size cap |
| `MARKFETCH_USER_AGENT` | Chrome 130 string | Both | HTTP fingerprint |
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | tmpdir + cwd | MCP only | Write sandbox boundaries |
Configuration Priority
Environment variables set at process startup take precedence over all other configuration. There is no runtime override mechanism—changing these values requires restarting the server.
graph TD
A[Environment Variable] --> B[Validated at Startup]
B --> C[Stored in config object]
C --> D[Used by core.ts pipeline]
D --> E[HTTP Request]
D --> F[File Write]
D --> G[Response Validation]
Security Considerations
The write sandbox exists because the MCP tool is driven by a language model, which may be steered by content from a page it just fetched. Without sandboxing, a malicious page could induce the model to request writes outside expected directories.
The CLI intentionally has no sandbox—direct human invocation at the shell establishes the trust boundary.
Sources: README.md:1-100
Sources: src/core.ts:1-50
Write Sandbox Security
Related topics: MCP Server Integration, Environment Variables, Error Handling
Overview
The Write Sandbox is a security mechanism in markfetch that restricts filesystem writes initiated via the MCP (Model Context Protocol) interface to a configurable set of allowed root directories. This protection prevents a language model, which may be influenced by fetched content, from writing files to arbitrary locations on the host system.
The sandbox enforces path containment by resolving symlinks and comparing canonicalized paths against the configured allowed roots. Any attempted write outside the sandbox boundary returns a save_forbidden error and the file is never created.
Purpose and Scope
Security Boundary
The sandbox exists because MCP tools are driven by a language model that can be steered by content from pages it fetches. Without containment:
- A malicious or compromised webpage could instruct the LLM to write files to sensitive locations (e.g., `~/.ssh/authorized_keys`, `~/.bashrc`)
- Path traversal attempts via symlinks could escape expected boundaries
- Untrusted fetched content could modify configuration files or inject malicious code
The CLI mode intentionally has no sandbox. A human at the shell is considered the security boundary, as the user has direct control over command invocation and can review output before it reaches any model.
Scope Limitations
| Scope | Sandboxed? |
|---|---|
| MCP server (`fetch_markdown` tool) | Yes |
| CLI mode (`markfetch <url>`) | No |
| Direct `node` execution | No |
Sources: README.md:68-70
Configuration
Environment Variable
| Variable | Type | Default | Description |
|---|---|---|---|
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | String | os.tmpdir() + process.cwd() | Path-delimiter-separated list of absolute paths permitted as MCP savePath write roots |
Path Delimiters
The delimiter varies by platform:
| Platform | Delimiter | Example |
|---|---|---|
| POSIX (Linux, macOS) | `:` | `/tmp:/home/user/markfetch-out` |
| Windows | `;` | `C:\Users\me\markfetch-out;C:\Temp` |
Behavior Rules
- Replacement, not merge: When set, the variable replaces the defaults entirely. To retain access to `os.tmpdir()` or `process.cwd()`, explicitly include them.
- Validation at startup: Malformed values (non-absolute entries, nonexistent directories) cause the server to fail fast on stderr.
- Realpath resolution: Each root is resolved once via `fs.realpath` at startup to canonicalize symlinks.
Sources: README.md:71-89
Configuration Example
POSIX:
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"],
"env": {
"MARKFETCH_ALLOWED_WRITE_ROOTS": "/Users/me/markfetch-out:/tmp"
}
}
}
}
Windows:
{
"mcpServers": {
"markfetch": {
"command": "npx",
"args": ["-y", "markfetch"],
"env": {
"MARKFETCH_ALLOWED_WRITE_ROOTS": "C:\\Users\\me\\markfetch-out;C:\\Users\\me\\AppData\\Local\\Temp"
}
}
}
}
Security Model
Path Resolution Flow
graph TD
A[User provides savePath] --> B{Is path absolute?}
B -->|No| E[Error: savePath must be absolute]
B -->|Yes| C[Resolve via fs.realpath]
C --> D{Is resolved path inside allowed roots?}
D -->|Yes| F[Allow write to resolved path]
D -->|No| G[Return save_forbidden error]
H[Allowed roots from env] --> I[Realpath-resolved at startup]
I --> D
Symlink Handling
The sandbox protects against symlink-based escapes:
- Resolve before check: Symlinks are resolved via `fs.realpath` before containment validation
- Reuse at write time: The canonicalized path from the validation check flows directly into `writeFile`, so the write targets exactly the path that was validated
- No lexical comparison: A path like `<sandbox>/link/..` is not compared lexically against the roots; it is resolved first, then validated
This prevents attacks where a symlink planted inside the sandbox points outside, collapsing lexically for the check but resolving to an external location at write time.
Sources: CHANGELOG.md:17-25
Platform-Specific Behaviors
| Platform | Case Sensitivity | Notes |
|---|---|---|
| Linux/macOS | Case-sensitive | Paths must match exactly |
| Windows | Case-insensitive | C:\Users\Bob and c:\users\bob are equivalent |
On Windows, the containment check lowercases both the root and target paths before comparison.
Sources: src/sandbox.ts:28-30
Core Implementation
API Design
The sandbox module exposes two primary functions:
function buildAllowedRoots(env: Record<string, string | undefined>): string[]
function validateSavePath(
savePath: string,
roots: string[]
): { ok: boolean; resolved?: string; reason?: string }
`buildAllowedRoots()`
Parses MARKFETCH_ALLOWED_WRITE_ROOTS from environment variables:
| Parameter | Type | Description |
|---|---|---|
| `env` | `Record<string, string \| undefined>` | Process environment variables |
| Return Type | Description |
|---|---|
| `string[]` | Array of absolute, realpath-resolved directory paths |
Logic:
- If `MARKFETCH_ALLOWED_WRITE_ROOTS` is unset: return `[os.tmpdir(), process.cwd()]`
- If set: split by platform delimiter, validate each is absolute and exists
- Resolve each via `fs.realpath` for canonical form
`validateSavePath()`
Validates a save path is within allowed roots:
| Parameter | Type | Description |
|---|---|---|
| `savePath` | string | The requested save path |
| `roots` | string[] | Allowed root directories |
| Return Type | Description |
|---|---|
| `{ ok: true, resolved: string }` | Path is allowed; resolved is the canonicalized path for writing |
| `{ ok: false, reason: string }` | Path is outside sandbox; reason describes the violation |
Validation steps:
- Resolve `savePath` via `fs.realpath`
- For each root, compute relative path from root to resolved target
- If relative path is empty (same directory) or does not start with `..` and is not absolute: allow
- Otherwise: reject with reason listing allowed roots
Sources: src/sandbox.ts:1-50
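The validation steps above can be sketched with `path.relative`. This is an illustration under stated assumptions, not the code in src/sandbox.ts: `sketchCheckWritePath` and `canonicalize` are hypothetical names, and resolving the parent directory when the target file does not exist yet is an assumed way to handle new files.

```typescript
import { mkdtempSync, realpathSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, dirname, isAbsolute, join, relative, sep } from "node:path";

type PathCheck = { ok: true; resolved: string } | { ok: false; reason: string };

// Hypothetical helper: canonicalize a path whose final component may not exist yet.
function canonicalize(target: string): string {
  try {
    return realpathSync(target); // existing file or symlink: resolve it fully
  } catch {
    // Not created yet (assumption): resolve the parent and re-attach the file name.
    return join(realpathSync(dirname(target)), basename(target));
  }
}

function sketchCheckWritePath(target: string, roots: string[]): PathCheck {
  let resolved: string;
  try {
    resolved = canonicalize(target);
  } catch {
    return { ok: false, reason: `cannot resolve the parent directory of '${target}'` };
  }
  // Windows paths are compared case-insensitively.
  const fold = (p: string) => (process.platform === "win32" ? p.toLowerCase() : p);
  for (const root of roots) {
    const rel = relative(fold(root), fold(resolved));
    const escapes = rel === ".." || rel.startsWith(`..${sep}`) || isAbsolute(rel);
    if (!escapes) return { ok: true, resolved };
  }
  const list = roots.map((r) => `'${r}'`).join(", ");
  return { ok: false, reason: `'${resolved}' is outside the allowed write roots: [${list}]` };
}

// Usage: a scratch directory stands in for a configured root.
const sandboxRoot = realpathSync(mkdtempSync(join(tmpdir(), "markfetch-sketch-")));
const inside = sketchCheckWritePath(join(sandboxRoot, "page.md"), [sandboxRoot]);
const outside = sketchCheckWritePath(join(sandboxRoot, "..", "page.md"), [sandboxRoot]);
```

The caller would write to `resolved` rather than to the original input. A production check would also need to consider edge cases this sketch ignores, such as dangling symlinks.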
Error Handling
Error Codes
| Code | Condition | Response |
|---|---|---|
| `save_forbidden` | savePath resolves outside allowed roots | No file written; MCP returns error |
| `save_failed` | savePath is valid but writeFile fails | No file written; MCP returns error |
Error Message Format
All sandbox errors return the format:
[save_forbidden] '<path>' is outside the allowed write roots: ['/allowed/root1', '/allowed/root2']
This provides:
- The attempted path
- The reason for rejection
- The list of allowed roots for debugging
Sources: src/mcp.ts:8-13
MCP Integration
Tool Schema
server.registerTool("fetch_markdown", {
inputSchema: {
url: z.string().url().describe("..."),
savePath: z.string()
.refine(isAbsolute, "savePath must be an absolute filesystem path")
.optional()
.describe("Optional. When provided, the fetched markdown is written to this absolute filesystem path...")
}
});
Validation Flow
- MCP adapter receives `savePath` parameter
- Validates path is absolute (via Zod schema)
- Calls `validateSavePath(savePath, allowedRoots)`
- If `ok: false`: throw `MarkfetchError` with `save_forbidden` code
- If `ok: true`: use `resolved` path for `writeFile`
Sources: src/mcp.ts:24-35
Architecture Diagram
graph LR
subgraph MCP_Client
A[LLM sends fetch_markdown with savePath]
end
subgraph MCP_Server
B[src/mcp.ts - MCP adapter]
C[src/core.ts - fetchMarkdown]
D[src/sandbox.ts - validateSavePath]
end
subgraph File_System
E[fs.realpath resolution]
F[fs.writeFile]
end
A --> B
B -->|validate path| D
D -->|resolve symlink| E
E -->|check containment| D
D -->|ok: true| C
C -->|write markdown| F
D -->|ok: false| B
B -->|save_forbidden| A
CLI vs MCP Behavior
| Aspect | CLI Mode | MCP Mode |
|---|---|---|
| Write sandbox | None | Enforced |
| Path validation | Not performed | Required |
| Symlink resolution | Not performed | Required |
| `savePath` parameter | Optional, `-o` flag | Optional, tool parameter |
| Relative path resolution | Resolves against cwd | Not allowed (must be absolute) |
The CLI adapter resolves relative paths internally for convenience, but the MCP adapter requires absolute paths and enforces the sandbox.
Sources: src/cli.ts:6-18
Security Considerations
Attack Vectors Mitigated
- Path traversal: `../../etc/passwd` is resolved before checking
- Symlink escape: `<sandbox>/link_to_external` is resolved and rejected
- Case confusion (Windows): `C:\Users\Bob` equals `c:\users\bob`
- Tilde expansion: Not performed; shell expands `~` before argv reaches process
Remaining Trust Boundaries
| Trust Level | Description |
|---|---|
| Filesystem permissions | Sandbox does not override OS file permissions |
| Network | Does not prevent network-based attacks |
| Content injection | Does not sanitize markdown content before writing |
Related Files
| File | Role |
|---|---|
| `src/sandbox.ts` | Core sandbox validation logic |
| `src/mcp.ts` | MCP server adapter, uses sandbox |
| `src/cli.ts` | CLI adapter, no sandbox |
| `src/core.ts` | Core fetch pipeline |
| `README.md` | User documentation and configuration |
| `CHANGELOG.md` | Historical security fix for symlink escape |
Changelog
| Version | Change |
|---|---|
| 0.6.0 | Current release with full sandbox implementation |
| 0.5.0 | CLI mode added (unrestricted by design) |
| < 0.5.0 | MCP-only, sandbox introduced |
Sources: package.json:3
Sources: README.md:68-70
Error Handling
Related topics: Processing Pipeline, Write Sandbox Security, Environment Variables
markfetch implements a deterministic, structured error handling system that provides consistent error reporting across both CLI and MCP interfaces. All errors are categorized into specific codes that enable precise failure diagnosis and appropriate recovery strategies.
Error Code Reference
markfetch defines eight deterministic error codes that cover all failure scenarios. Each code is designed to be actionable, helping callers understand exactly what went wrong and how to respond.
| Error Code | Meaning | Typical Cause |
|---|---|---|
| `network_error` | DNS, TCP, or TLS failure | Firewall blocking, network unavailable, invalid hostname |
| `http_error` | Non-2xx HTTP response | 404 page not found, 403 forbidden, 500 server error |
| `timeout` | Request exceeded MARKFETCH_TIMEOUT_MS | Slow server, large page, network latency |
| `unsupported_content_type` | Response is not HTML | Binary files, JSON APIs, PDF documents |
| `extraction_failed` | Readability found no article content | Pure client-rendered SPAs with no static HTML |
| `too_large` | Body or markdown exceeded MARKFETCH_MAX_BYTES | Very large articles with embedded media |
| `save_failed` | File write operation failed | Missing parent directory, permission denied |
| `save_forbidden` | Save path outside allowed write roots | Path traverses symlink outside sandbox |
Sources: README.md
Error Architecture
The error handling system follows a layered architecture where core validation and error creation happen in src/core.ts, while each adapter (CLI and MCP) provides interface-specific error formatting and reporting.
graph TD
A[Request] --> B[core.ts Validation]
B --> C{Error Condition?}
C -->|No| D[Successful Fetch]
C -->|Yes| E[MarkfetchError Thrown]
E --> F[Adapter Layer]
F --> G[CLI Adapter]
F --> H[MCP Adapter]
G --> I["stderr: [code] message"]
H --> J["content[0].text: [code] message"]
J --> K[isError: true]
Sources: src/core.ts, src/cli.ts, src/mcp.ts
MarkfetchError Class
The central error type is MarkfetchError, which encapsulates both the error code and human-readable message. This class serves as the single error type thrown throughout the application.
class MarkfetchError {
constructor(
public readonly code: ErrorCode,
public readonly message: string
) {}
}
Sources: src/core.ts:1-100
Environment Variable Validation
markfetch validates configuration environment variables at startup to fail fast on misconfiguration rather than producing confusing per-request errors.
| Variable | Default | Validation Rules |
|---|---|---|
| `MARKFETCH_TIMEOUT_MS` | 30000 | Positive integer |
| `MARKFETCH_MAX_BYTES` | 5000000 | Positive integer |
| `MARKFETCH_USER_AGENT` | Chrome 130 UA string | Must contain "Chrome/" followed by a major version number |
The intEnv function performs validation:
function intEnv(name: string, fallback: number): number {
const raw = process.env[name];
if (raw == null || raw === "") return fallback;
const n = Number(raw);
if (!Number.isFinite(n) || !Number.isInteger(n) || n <= 0) {
throw new Error(
`Invalid ${name}=${JSON.stringify(raw)} — expected a positive integer.`,
);
}
return n;
}
Sources: src/core.ts:1-100
User-Agent Validation
The MARKFETCH_USER_AGENT value must be a valid Chrome User-Agent string. This requirement exists because Sec-CH-UA-* client hints are derived from the User-Agent at startup, and a mismatch between the two would be a stronger bot signal than either one alone.
function deriveClientHints(ua: string): {
brands: string;
mobile: string;
platform: string;
} {
const versionMatch = /\bChrome\/(\d+)/.exec(ua);
if (!versionMatch) {
throw new Error(
`Invalid MARKFETCH_USER_AGENT=${JSON.stringify(ua)} — expected a Chrome User-Agent containing "Chrome/VERSION".`,
);
}
// ...
}
Sources: src/core.ts:1-100
CLI Error Handling
The CLI adapter catches errors thrown from core and formats them for stderr output. Error output follows a consistent [code] message format that matches the MCP error format exactly.
try {
const { markdown, bytes, savedTo } = await fetchMarkdown({
url,
savePath,
});
// ... success handling
} catch (err) {
const { code, message } = classifyError(err);
console.error(`[${code}] ${message}`);
// Use exitCode so pending output drains before process exits
process.exitCode = 1;
}
Sources: src/cli.ts:1-50
CLI Exit Codes
| Scenario | Exit Code | Output |
|---|---|---|
| Success (stdout) | 0 | Raw markdown |
| Success (save to file) | 0 | Saved X bytes to /path |
| Any error | 1 | [code] message to stderr |
The use of process.exitCode = 1 (rather than process.exit(1)) ensures pending stdout/stderr output drains before the process terminates, which is important when stdout is piped to a slow consumer.
Sources: src/cli.ts:1-50
MCP Error Handling
The MCP adapter returns errors in a format compatible with the MCP protocol. Errors appear in the content[0].text field with isError: true set.
function errorResult(code: ErrorCode, message: string) {
return {
content: [{ type: "text" as const, text: `[${code}] ${message}` }],
isError: true,
};
}
Sources: src/mcp.ts:1-50
MCP Response Structure for Errors
{
"content": [
{
"type": "text",
"text": "[network_error] DNS lookup failed"
}
],
"isError": true
}
Sources: src/mcp.ts:1-50
Write Sandbox Errors
The MCP interface enforces a write sandbox that restricts file saves to configured root directories. Errors occur when savePath resolves to a location outside the allowed roots.
export function checkWritePath(
target: string,
roots: string[],
): { ok: true; resolved: string } | { ok: false; reason: string } {
// ... validation logic
return {
ok: false,
reason: `'${reattached}' is outside the allowed write roots: [${roots.map((r) => `'${r}'`).join(", ")}]`,
};
}
Sources: src/sandbox.ts:1-100
Allowed Write Roots Configuration
| Platform | Default Roots | Delimiter |
|---|---|---|
| POSIX | os.tmpdir() + process.cwd() | : |
| Windows | os.tmpdir() + process.cwd() | ; |
Override with MARKFETCH_ALLOWED_WRITE_ROOTS environment variable. When set, this replaces the defaults entirely rather than merging.
Sources: README.md
Symlink Handling
The sandbox correctly resolves symlinks to prevent escape attempts like <sandbox>/link/../out.md where link points outside the sandbox. The canonicalized path flows from the containment check into writeFile, ensuring the file is created exactly at the validated location.
Sources: CHANGELOG.md, src/sandbox.ts:1-100
Error Classification
The classifyError function normalizes different error types into the MarkfetchError format used throughout the system:
function classifyError(err: unknown): { code: string; message: string } {
if (err instanceof MarkfetchError) {
return { code: err.code, message: err.message };
}
if (err instanceof Error) {
return { code: "network_error", message: err.message };
}
return { code: "network_error", message: String(err) };
}
Sources: src/core.ts:1-100
Error Source Mapping
| Error Source | Code Produced |
|---|---|
| `MarkfetchError` instances | Original code preserved |
| `Error` instances | network_error |
| Non-Error values | network_error with string coercion |
Unified Error Flow
Version 0.5.0 introduced a refactoring where three inline return errorResult(...) sites in the MCP handler were converted to throw MarkfetchError from core uniformly. Both adapters now catch and convert errors consistently.
This architectural change ensures that both CLI and MCP interfaces produce identical error codes and messages for the same failure conditions.
Sources: CHANGELOG.md
Best Practices for Error Handling
For MCP Clients
- Check `isError` field in the response object
- Parse the `content[0].text` field for the `[code] message` format
- Handle `extraction_failed` gracefully for client-rendered SPAs
- Use `savePath` parameter for large responses to avoid tool-result truncation
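A client following the first two points might parse the error text like this. The sketch is illustrative: `parseToolError` and the `ToolResult` type are hypothetical names that mirror the error response structure documented above, and the `"unknown"` fallback code is an assumption, not one of markfetch's eight codes.

```typescript
// Hypothetical client-side helper; not part of markfetch or the MCP SDK.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

function parseToolError(result: ToolResult): { code: string; message: string } | null {
  if (!result.isError) return null; // success: content[0].text holds markdown
  const text = result.content[0]?.text ?? "";
  const match = /^\[([a-z_]+)\]\s*([\s\S]*)$/.exec(text);
  // Fall back to a generic code if the text does not follow "[code] message".
  return match ? { code: match[1], message: match[2] } : { code: "unknown", message: text };
}

const parsed = parseToolError({
  content: [{ type: "text", text: "[network_error] DNS lookup failed" }],
  isError: true,
});
```

Branching on the parsed code lets a client, for example, treat `extraction_failed` as an expected outcome for client-rendered pages instead of a hard failure.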
For CLI Consumers
- Redirect stderr to capture error codes
- Parse `[code] message` format from stderr
- Use `markfetch <url> 2>&1 >/dev/null | head -1` to get only the error line from stderr, without the markdown on stdout
For Save Operations
- Always use absolute paths for `savePath`
- Verify `MARKFETCH_ALLOWED_WRITE_ROOTS` includes your target directory
- Check for `save_forbidden` before `save_failed` in error handling logic
Sources: README.md
Development Guide
Related topics: Introduction, Quick Start Guide
This guide provides comprehensive information for developers who want to understand, extend, or contribute to markfetch.
Overview
markfetch is a Node.js tool that fetches URLs and converts web content to clean markdown. It operates in two modes:
- CLI Mode - Command-line interface for shell integration
- MCP Mode - Model Context Protocol server for AI agent integration
The project requires Node.js ≥ 24 and is distributed as an npm package. Sources: package.json:8
Architecture
graph TD
A[User Input] --> B{process.argv.length}
B -->|user args present| C[CLI Adapter]
B -->|no user args| D[MCP Adapter]
C --> E[src/cli.ts]
D --> F[src/mcp.ts]
E --> G[src/core.ts]
F --> G
G --> H[undici HTTP Client]
G --> I[linkedom HTML Parser]
G --> J[@mozilla/readability]
G --> K[turndown]
H --> L[HTTP Response]
I --> M[DOM Document]
J --> N[Extracted Article]
K --> O[Markdown Output]
Core Pipeline (src/core.ts)
The core module implements the main fetch-and-convert pipeline. It orchestrates:
| Component | Role |
|---|---|
| `undici` | HTTP/2 transport with Chrome-like fingerprinting |
| `linkedom` | HTML parsing to DOM |
| `@mozilla/readability` | Article content extraction |
| `turndown` | HTML to markdown conversion |
Sources: src/core.ts:1-50
Adapters (src/cli.ts & src/mcp.ts)
The source is structured into the following files:
| File | Purpose |
|---|---|
| `src/core.ts` | Pipeline + errors (shared logic) |
| `src/mcp.ts` | MCP stdio server adapter |
| `src/cli.ts` | CLI argv parser + dispatcher |
| `src/index.ts` | Lazy-import dispatcher based on process.argv.length |
Sources: README.md:95-100
The lazy-import dispatcher ensures console.log calls in cli.ts are never reachable from the MCP path, maintaining the invariant that stdout is reserved for MCP frames. Sources: CHANGELOG.md:45-47
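The idea behind the lazy-import dispatcher can be sketched as below. This is not the contents of src/index.ts: `dispatch`, `Loader`, and the stub loaders are hypothetical, and the module paths in the closing comment are assumed. The one fact relied on is that `process.argv` always begins with the Node binary and the script path, so user arguments start at index 2.

```typescript
// Hypothetical sketch of an argv-based lazy dispatcher; the real src/index.ts may differ.
type Loader = () => Promise<unknown>;

async function dispatch(
  argv: readonly string[],
  loaders: { cli: Loader; mcp: Loader },
): Promise<"cli" | "mcp"> {
  // argv[0] is the node binary and argv[1] the script, so length > 2 means user args.
  if (argv.length > 2) {
    await loaders.cli(); // only now is the CLI module loaded
    return "cli";
  }
  await loaders.mcp(); // the CLI module is never loaded on this path
  return "mcp";
}

// Stub loaders record which adapter would have been imported.
const loaded: string[] = [];
const stubLoaders = {
  cli: async () => { loaded.push("cli"); },
  mcp: async () => { loaded.push("mcp"); },
};

// In a real entry point the loaders would be dynamic imports, for example:
//   dispatch(process.argv, { cli: () => import("./cli.js"), mcp: () => import("./mcp.js") });
```

Because each adapter is loaded only inside its own branch, code that prints to stdout in the CLI module cannot run while the process is serving MCP frames.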
Setting Up the Development Environment
Prerequisites
- Node.js ≥ 24
- npm or yarn
Installation
# Clone the repository
git clone https://github.com/vasylenko/markfetch.git
cd markfetch
# Install dependencies
npm install
Available Scripts
| Script | Command | Purpose |
|---|---|---|
| `dev` | npm run dev | Run source directly with tsx (no build required) |
| `build` | npm run build | Compile TypeScript to JavaScript |
| `test` | npm run test | Run test suite with tsx |
| `inspect` | npm run inspect | Launch MCP inspector for debugging |
Sources: package.json:21-28
Build Process
The build process consists of two steps:
# Compile TypeScript
npm run build
# Post-build script (automatically runs after build)
npm run postbuild
The postbuild script (scripts/postbuild.mjs) performs additional transformations after TypeScript compilation. Sources: package.json:26
Project Structure
markfetch/
├── src/
│ ├── index.ts # Entry point with argv dispatcher
│ ├── core.ts # Core fetch/extract/convert pipeline
│ ├── cli.ts # CLI adapter using commander
│ ├── mcp.ts # MCP stdio server
│ └── sandbox.ts # Write path sandboxing
├── dist/ # Compiled JavaScript output
├── tests/ # Test fixtures and test files
├── scripts/
│ └── postbuild.mjs # Post-compilation transformations
└── docs/
└── SPEC.md # Detailed specification
Configuration
Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| `MARKFETCH_TIMEOUT_MS` | 30000 | Per-request timeout in milliseconds |
| `MARKFETCH_MAX_BYTES` | 5000000 | Cap on response body and extracted markdown |
| `MARKFETCH_USER_AGENT` | Chrome 130 string | Override the User-Agent header |
| `MARKFETCH_ALLOWED_WRITE_ROOTS` | os.tmpdir() + process.cwd() | MCP-only write sandbox roots |
Sources: README.md:60-66
Configuration Precedence
- Environment variables set at startup
- Command-line flags (CLI mode)
- MCP tool parameters (MCP mode)
Core API
fetchMarkdown Function
The main function exported from core.ts:
interface FetchOptions {
url: string;
savePath?: string;
}
interface FetchResult {
markdown: string;
bytes: number;
savedTo?: string;
}
Error Handling
The core module defines eight deterministic error codes:
| Code | Meaning |
|---|---|
| `network_error` | DNS/TCP/TLS failure |
| `http_error` | Non-2xx HTTP status |
| `timeout` | Request timeout exceeded |
| `unsupported_content_type` | Not text/html or application/xhtml+xml |
| `extraction_failed` | Readability found no article content |
| `too_large` | Response or markdown exceeded size cap |
| `save_failed` | File write failed (permissions, missing directory) |
| `save_forbidden` | Path outside allowed write roots |
Sources: README.md:71-80
Errors are thrown as MarkfetchError from core uniformly and caught by adapters for conversion. Sources: CHANGELOG.md:49-51
Extending the Pipeline
Adding New HTML Rewrites
The rewriteForReadability() function in core.ts handles pre-extraction HTML transformations:
function rewriteForReadability(document: Document): void {
// Transform <aside class="footnote-brackets"> to <section>
// Flatten <details> elements
// Replace div.mw-heading with their heading children
}
To add new rewrite rules, append to this function before the return statement. Sources: src/core.ts:120-160
Customizing Markdown Conversion
The TURNDOWN instance is configured with:
| Plugin/Option | Purpose |
|---|---|
| `gfm` plugin | GitHub Flavored Markdown support |
| `keepClasses: true` | Preserve class="language-X" for code fences |
| Custom escape | Handle -/= after inline elements |
Sources: src/core.ts:50-90
Modifying Error Handling
Error handling flows through the MarkfetchError class in core:
- Core throws `MarkfetchError` with code and message
- Adapters catch and format for their protocol
- CLI: writes `[code] message` to stderr
- MCP: returns `{ content: [...], isError: true }`
Sources: src/cli.ts:35-42 and src/mcp.ts:15-20
Write Sandbox
The MCP adapter enforces write path restrictions:
graph TD
A[MCP savePath] --> B{Absolute path?}
B -->|No| C[Refine fails: savePath must be absolute]
B -->|Yes| D{Inside allowed roots?}
D -->|Yes| E[Write file]
D -->|No| F[Return save_forbidden error]
Configuring Allowed Roots
Set the environment variable with platform delimiter:
# POSIX
export MARKFETCH_ALLOWED_WRITE_ROOTS="/tmp:/home/user/docs"
# Windows (cmd.exe); quoting the whole assignment keeps the quotes out of the value
set "MARKFETCH_ALLOWED_WRITE_ROOTS=C:\Users\me\docs;C:\temp"
The sandbox check resolves symlinks and applies case folding on Windows. Sources: src/sandbox.ts:20-40
Testing
Running Tests
npm test
Test Structure
Tests use Node.js built-in test runner (--test flag) with tsx for TypeScript support. Sources: package.json:27
Writing New Tests
- Place test files in `tests/` directory
- Use `*.test.ts` naming pattern
- Run with `tsx --test tests/*.test.ts`
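A new test file following these conventions might look like the sketch below, which uses only the built-in `node:test` and `node:assert` modules. The file name (for example `tests/example.test.ts`) and the `parsePositiveInt` function are hypothetical; a real test would import the function under test from `src/` instead of defining it inline.

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";

// Hypothetical stand-in for a function under test, modeled on the intEnv validation rules.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  if (raw == null || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`Invalid value ${JSON.stringify(raw)}: expected a positive integer.`);
  }
  return n;
}

test("falls back to the default when the variable is unset", () => {
  assert.equal(parsePositiveInt(undefined, 30_000), 30_000);
});

test("rejects non-numeric input", () => {
  assert.throws(() => parsePositiveInt("abc", 30_000), /positive integer/);
});
```

Keeping each test focused on one rule makes a failure message point directly at the behavior that regressed.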
MCP Inspector
Debug MCP integration using the official inspector:
npm run inspect
This launches the MCP inspector at http://localhost:6274 where you can:
- Test tool calls interactively
- Inspect request/response frames
- Verify schema validation
Sources: package.json:27
Dependencies
Production Dependencies
| Package | Version | Purpose |
|---|---|---|
| @modelcontextprotocol/sdk | ^1.29.0 | MCP server implementation |
| @mozilla/readability | ^0.5.0 | Article extraction |
| commander | ^14.0.3 | CLI argument parsing |
| linkedom | ^0.18.0 | HTML parsing |
| turndown | ^7.0.0 | HTML to markdown |
| turndown-plugin-gfm | ^1.0.2 | GFM support |
| undici | ^8.2.0 | HTTP client |
| zod | ^3.0.0 | Schema validation |
Development Dependencies
| Package | Purpose |
|---|---|
| @types/node | Node.js type definitions |
| @types/turndown | Turndown type definitions |
| tsx | TypeScript execution |
| typescript | TypeScript compiler |
Sources: package.json:30-50
Version History
| Version | Date | Key Changes |
|---|---|---|
| 0.6.0 | 2026-05-13 | Write sandbox, Windows CI, save_forbidden error |
| 0.5.0 | 2026-05-12 | CLI mode, commander dependency |
| 0.4.1 | 2026-05-11 | README rewrite, bin path fix |
| 0.4.0 | 2026-05-10 | MCP server with fetch_markdown tool |
Sources: CHANGELOG.md:1-60
Contributing Guidelines
Code Standards
- All source in TypeScript under src/
- Build output to dist/ via npm run build
- Tests in tests/ with *.test.ts pattern
- No runtime console.log in MCP path (enforced by lazy-import structure)
Pull Request Checklist
- [ ] Run npm run build successfully
- [ ] Run npm test with all tests passing
- [ ] Update CHANGELOG.md with changes
- [ ] Ensure documentation reflects new behavior
Release Process
npm run prepublishOnly
This runs the build automatically before npm publish. Sources: package.json:29
Sources: src/core.ts:1-50
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Doramagic extracted 7 source-linked risk signals. Review them before installing or handing real data to the project.
1. Installation risk: v0.4.1
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.4.1. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/vasylenko/markfetch/releases/tag/v0.4.1
2. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | github_repo:1234238440 | https://github.com/vasylenko/markfetch | README/documentation is current enough for a first validation pass.
3. Maintenance risk: Maintainer activity is unknown
- Severity: medium
- Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | last_activity_observed missing
4. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: downstream_validation.risk_items | github_repo:1234238440 | https://github.com/vasylenko/markfetch | no_demo; severity=medium
5. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: risks.scoring_risks | github_repo:1234238440 | https://github.com/vasylenko/markfetch | no_demo; severity=medium
6. Maintenance risk: issue_or_pr_quality=unknown
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | issue_or_pr_quality=unknown
7. Maintenance risk: release_recency=unknown
- Severity: low
- Finding: release_recency=unknown.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | github_repo:1234238440 | https://github.com/vasylenko/markfetch | release_recency=unknown
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using markfetch with real data or production workflows.
- v0.4.1 - github / github_release
- README/documentation is current enough for a first validation pass. - GitHub / issue
Source: Project Pack community evidence and pitfall evidence