Doramagic Project Pack · Human Manual

Scrapling

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

Scrapling Overview and System Architecture

Related topics: Fetchers and the Stealth Engine, Spider Framework and Crawling Infrastructure, AI Integration, CLI, Interactive Shell, and Agent Skill

Introduction and Purpose

Scrapling is an adaptive web scraping framework that scales from a single HTTP request to a full, concurrent crawl. As described in the project README, the framework "handles everything from a single request to a full-scale crawl," with three distinguishing capabilities: an adaptive parser that relocates elements when a site redesigns, fetchers that bypass modern anti-bot systems (including Cloudflare Turnstile) out of the box, and a Scrapy-like spider framework for multi-session crawls. Source: README.md:1-15

The framework is positioned for both ad-hoc scripters and production-scale operators. The README frames it as "Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone." Source: README.md:10-15. Performance is a primary design goal: the parser benchmark (averaging 100+ runs) shows Scrapling at 2.02 ms versus 1584.31 ms for BeautifulSoup with lxml, a ~784× speedup. Source: README.md:55-70
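
The speedup figure follows directly from the two timings quoted above. As a quick arithmetic check (using only the numbers from the README benchmark, nothing measured here):

```python
# Benchmark figures quoted from the README (milliseconds per run).
scrapling_ms = 2.02
bs4_lxml_ms = 1584.31

# Speedup is the slower time divided by the faster time.
speedup = bs4_lxml_ms / scrapling_ms
print(f"{speedup:.0f}x")
```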

High-Level Architecture

Scrapling's design separates concerns into four cooperating subsystems: a parser engine, a layered set of fetchers, a spider framework, and a CLI/MCP integration layer. The following diagram shows how user code flows through these subsystems and the typical escalation path between fetchers.

flowchart TD
    A[User Code: Python or CLI] --> B{Choose Fetcher Layer}
    B -->|Lightweight HTTP| C[Fetcher / FetcherSession / AsyncFetcher]
    B -->|JS rendering required| D[DynamicFetcher / DynamicSession]
    B -->|Anti-bot bypass required| E[StealthyFetcher / StealthySession]
    C --> F[Response / Page Object]
    D --> F
    E --> F
    F --> G[Adaptive Parser<br/>css / xpath / regex]
    G -->|auto_save=True| H[(Element Store)]
    G -->|adaptive=True| I[Relocated Elements]
    F --> J[Spider Framework]
    J --> K[Concurrent Crawl +<br/>Pause/Resume + Proxy Rotation]
    H --> L[Structured Output]
    I --> L
    K --> L

The escalation guide shown in the examples README makes this layering explicit: start with Fetcher/FetcherSession, escalate to DynamicSession when JavaScript is required, escalate again to StealthySession when the site blocks the request, and finally use the Spider for multi-page crawls. Source: agent-skill/Scrapling-Skill/examples/README.md:40-50
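
The escalation ladder can be read as a small decision function. The sketch below is illustrative only: `choose_tier` and its three boolean parameters are hypothetical names, and the returned strings are simply the class names mentioned in the guide, not calls into Scrapling.

```python
def choose_tier(needs_js: bool, blocked: bool, multi_page: bool) -> str:
    """Return the name of the cheapest tier that covers the stated needs.

    Assumption: a multi-page job always goes to the Spider, which can then
    route individual requests to whichever session it needs.
    """
    if multi_page:
        return "Spider"
    if blocked:
        return "StealthySession"
    if needs_js:
        return "DynamicSession"
    return "FetcherSession"


# A static page that is not blocked stays on the cheapest tier.
tier = choose_tier(needs_js=False, blocked=False, multi_page=False)
```

Checking `blocked` before `needs_js` reflects the ordering in the guide: a blocked site needs the stealth tier regardless of whether it also renders with JavaScript.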

Core Components

Parser Engine

The parser is an lxml-based engine that exposes a Scrapy/Parsel-compatible API. The README notes that Scrapling includes "Code adapted from Parsel (BSD License), used for translator submodule," indicating that the CSS/XPath translator layer is reused rather than reimplemented. Source: README.md:175-180. Three features are highlighted in the README: enhanced text processing with regex and cleaning helpers, auto selector generation that produces robust CSS/XPath selectors, and a familiar API that uses the same pseudo-elements as Scrapy/Parsel. Source: README.md:5-15

The adaptive element-finding feature is the parser's most distinctive capability. Setting auto_save=True on a selector call persists the matched element's signature; later, when the same call is invoked with adaptive=True, the parser uses element similarity scoring to relocate the element even after structural changes. The benchmark table shows this similarity search at 2.39 ms versus 12.45 ms for AutoScraper, a 5.2× speedup. Source: README.md:75-85
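
To make the idea of "similarity scoring" concrete, here is a minimal standard-library sketch. It is not Scrapling's algorithm: the `signature` and `relocate` helpers, the string-based signature format, and the 0.6 threshold are all assumptions chosen for illustration, with `difflib.SequenceMatcher` standing in for whatever scoring the parser really uses.

```python
from difflib import SequenceMatcher
from typing import Optional


def signature(tag: str, attrs: dict, text: str) -> str:
    """Flatten an element's identifying properties into one comparable string."""
    attr_part = " ".join(f"{k}={v}" for k, v in sorted(attrs.items()))
    return f"{tag}|{attr_part}|{text.strip()}"


def relocate(saved: str, candidates: list, threshold: float = 0.6) -> Optional[str]:
    """Return the candidate most similar to the saved signature, if close enough."""
    def score(candidate: str) -> float:
        return SequenceMatcher(None, saved, candidate).ratio()

    best = max(candidates, key=score, default=None)
    if best is None or score(best) < threshold:
        return None
    return best


# Signature stored on the first run (the role played by auto_save=True).
saved = signature("div", {"class": "price", "id": "p1"}, "$19.99")

# After a redesign the tag and class changed, but id and text survived.
after_redesign = [
    signature("span", {"class": "product-price", "id": "p1"}, "$19.99"),
    signature("a", {"href": "/cart"}, "Add to cart"),
]
match = relocate(saved, after_redesign)
```

The point of the sketch is the two-phase shape: store a signature once, then rank candidates against it later instead of relying on an exact selector match.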

Fetcher Layer

Four fetchers are exposed through scrapling.fetchers: Fetcher, AsyncFetcher, StealthyFetcher, and DynamicFetcher. Session classes (FetcherSession, StealthySession, AsyncStealthySession, DynamicSession, AsyncDynamicSession) reuse a connection or browser across requests. The basic usage in the README demonstrates both patterns. Source: README.md:25-40

The stealth fetcher is the most powerful member of this layer. As shown in the README's example, StealthyFetcher.fetch(..., headless=True, network_idle=True) can be used to fetch pages "under the radar." A known limitation tracked in the community is that the embedded Turnstile variant can cause StealthyFetcher to wait indefinitely. Source: community issue #100. This is an open failure mode worth noting for users scraping Turnstile-protected pages.

Spider Framework

The spider framework is a Scrapy-compatible API with start_urls, async parse callbacks, and Request/Response objects. Key features listed in the README include configurable concurrency limits, per-domain throttling, download delays, multi-session support (HTTP and stealth headless browsers in a single spider, routed by session ID), checkpoint-based pause/resume (Ctrl+C triggers a graceful shutdown; restart resumes from the last checkpoint), and a streaming mode via async for item in spider.stream(). Source: README.md:90-115

The spiders also implement automatic blocked-request detection and retry, and an optional robots_txt_obey flag that respects Disallow, Crawl-delay, and Request-rate directives with per-domain policies. Source: README.md:115-120
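
The three directives named above can be explored with Python's standard `urllib.robotparser`, independent of Scrapling. The robots.txt content and the `example-bot` user agent below are made up for the example; how Scrapling itself parses and caches these rules is not shown here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt exercising all three directives.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Request-rate: 5/10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("example-bot", "https://example.com/quotes/")
blocked = parser.can_fetch("example-bot", "https://example.com/private/data")
delay = parser.crawl_delay("example-bot")    # seconds to wait between requests
rate = parser.request_rate("example-bot")    # named tuple: (requests, seconds)
```

A crawler that honours these values would skip the disallowed URL, wait at least `delay` seconds between requests to the host, and keep to `rate.requests` requests per `rate.seconds` seconds.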

CLI, MCP, and Agent Skills

The CLI is the entry point for installation and quick extraction. After pip install "scrapling[fetchers]", users run scrapling install (or scrapling install --force) to download browsers, system dependencies, and fingerprint-manipulation libraries. The scrapling extract subcommand supports three fetch modes (get, fetch, and stealthy-fetch) and writes Markdown or plain-text output, with optional --css-selector filtering and --impersonate chrome for TLS fingerprint spoofing. Source: README.md:140-160

The project also ships an Agent Skill package under agent-skill/ that "encapsulates almost all of the documentation website's content in Markdown" and conforms to the AgentSkill specification. The skill is consumable by OpenClaw, Claude Code, and other agentic tools, and is published on Clawhub, where it can be installed with clawhub install scrapling-official. Source: agent-skill/README.md:1-15. Community discussion in issue #142 requested explicit OpenClaw support; the published skill is the implementation of that request.

Installation and Observability

Scrapling requires Python 3.10 or higher. The base pip install scrapling only includes the parser engine; fetchers, spiders, and CLI tools require pip install "scrapling[fetchers]" followed by scrapling install to fetch browser binaries. Source: README.md:80-105

A LoggerProxy is provided in scrapling.core.utils._utils to expose the standard logging.Logger interface through a context variable, enabling per-task log routing in async crawls. The default logger uses a single console handler with the format [%(asctime)s] %(levelname)s: %(message)s at INFO level. Source: scrapling/core/utils/_utils.py:20-55
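
The proxy-over-context-variable pattern described above can be sketched with the standard library alone. This is an illustration of the technique, not the code in `_utils.py`: `make_logger`, `_current`, and the `sketch.*` logger names are hypothetical, and only the format string and INFO level are taken from the text.

```python
import logging
from contextvars import ContextVar

LOG_FORMAT = "[%(asctime)s] %(levelname)s: %(message)s"  # format quoted in the text


def make_logger(name: str) -> logging.Logger:
    """Build an INFO-level logger with a single console handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    return logger


# The context variable holds whichever logger is active for the current task.
_current = ContextVar("current_logger", default=make_logger("sketch.default"))


class LoggerProxy:
    """Forward every attribute access to the logger in the current context."""

    def __getattr__(self, attr):
        return getattr(_current.get(), attr)


log = LoggerProxy()

default_name = log.name                      # resolved through the default logger
token = _current.set(make_logger("sketch.task"))
task_name = log.name                         # same proxy, different target
_current.reset(token)                        # restore the previous logger
```

Because each asyncio task gets its own copy of the context, a crawl can set a different logger per task while all call sites keep using the same `log` object.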

Community-Driven Roadmap Items

Several open feature requests signal where the architecture is evolving. Issue #159 requests a "listen to browser requests and get responses" capability, which would extend the fetcher layer with a request-interception hook (useful for sites that expose data only via XHR/fetch calls). Issue #82 requests automatic pagination-URL detection, which would extend the spider's link-following logic. Source: community issues #159, #82. Users evaluating Scrapling for production use should track these for future releases.

See Also

  • Selection Methods and Adaptive Parser
  • Fetcher Selection Guide
  • Spider Architecture
  • Proxy Rotation and Blocking Detection
  • CLI Reference
  • MCP Server Integration

Source: https://github.com/D4Vinci/Scrapling / Human Manual

Fetchers and the Stealth Engine

Related topics: Scrapling Overview and System Architecture, Spider Framework and Crawling Infrastructure

1. Purpose and Scope

The "Fetchers" subsystem is the network boundary of Scrapling. It encapsulates everything between a Python call and a remote HTTP/HTTPS endpoint, providing three escalating levels of fidelity:

  1. A pure-HTTP fetcher based on curl_cffi with TLS fingerprint impersonation.
  2. A real browser fetcher driven by Playwright/CDP for JavaScript-rendered pages.
  3. A stealth-tuned browser fetcher that injects patches, rotates fingerprints, and can solve Cloudflare Turnstile challenges without third-party services.

The "Stealth Engine" is the subset of that stack used by StealthyFetcher / StealthySession. It re-uses the browser automation engine but layers anti-detection logic, optional captcha solving, and adaptive behavior on top. The two concepts are intentionally overlapping: a fetcher is the *interface* a developer calls, while the stealth engine is the *implementation strategy* used to satisfy anti-bot systems. Source: README.md and scrapling/fetchers/__init__.py.

2. The Fetcher Hierarchy

All fetchers return a Response-like object, but they differ in cost and capability. The community-maintained skill recommends an "escalation" approach: start with the cheapest fetcher and only step up when the page refuses to load correctly. Source: agent-skill/Scrapling-Skill/examples/README.md.

flowchart TD
    A[Need data from URL] --> B{Plain HTTP enough?}
    B -- Yes --> C[Fetcher / FetcherSession<br/>curl_cffi TLS impersonation]
    B -- No --> D{JS rendering needed?}
    D -- Yes --> E[DynamicFetcher / DynamicSession<br/>Playwright/Chromium]
    D -- No --> F{Anti-bot blocking?}
    F -- Yes --> G[StealthyFetcher / StealthySession<br/>Stealth Engine + Turnstile solver]
    C --> H[Response]
    E --> H
    G --> H
    H --> I[Adaptive Parser<br/>auto_save / adaptive=True]

The same escalation model is exposed in the CLI as extract get, extract fetch, and extract stealthy-fetch. Source: README.md.
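
In code, the escalation model amounts to a fallback chain: try the cheapest tier, and move up only when it returns nothing usable. The sketch below uses plain callables as stand-ins for the three tiers; `fetch_with_escalation` and the three fake fetch functions are hypothetical and make no network calls.

```python
from typing import Callable, Optional


def fetch_with_escalation(
    url: str, tiers: list
) -> tuple:
    """Try each (name, fetch) pair in order and return the first success."""
    for name, fetch in tiers:
        html: Optional[str] = fetch(url)
        if html:  # an empty or None body counts as a failed tier
            return name, html
    raise RuntimeError(f"all tiers failed for {url}")


# Stand-ins: the two cheaper tiers fail, the stealth tier succeeds.
def plain_http(url: str) -> Optional[str]:
    return None


def browser(url: str) -> Optional[str]:
    return None


def stealth_browser(url: str) -> Optional[str]:
    return "<html>ok</html>"


TIERS: list = [
    ("get", plain_http),
    ("fetch", browser),
    ("stealthy-fetch", stealth_browser),
]
used, body = fetch_with_escalation("https://example.com", TIERS)
```

The tier labels mirror the CLI subcommand names so the mapping to extract get, extract fetch, and extract stealthy-fetch stays visible.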

2.1 `Fetcher` and `FetcherSession`: Static HTTP

Fetcher is the lightweight entry point. It performs a single request through the curl_cffi transport, optionally impersonating a real browser's TLS fingerprint (e.g. impersonate='chrome'). When you need connection reuse (cookies, keep-alive, multiple requests in a loop), wrap it in FetcherSession and use session.get(...). Source: scrapling/fetchers/requests.py.

The session also accepts a stealthy_headers=True flag, which is a cheap compromise: it does not run a browser, but it attaches header order and values that mimic a recent Chrome build. This is often sufficient against naive bot detectors and is significantly cheaper than launching Chromium. Source: README.md.

2.2 `DynamicFetcher` and `DynamicSession`: Browser Automation

When the target page renders content client-side (SPAs, infinite scroll, lazy-loaded sections), you need a real browser. DynamicFetcher drives a headless Chromium via Playwright/CDP and returns the post-render DOM. Important toggles include headless, network_idle (wait for the network to settle), and disable_resources (skip stylesheets/images to save bandwidth). The async session counterpart is AsyncDynamicSession. Source: scrapling/fetchers/chrome.py and scrapling/engines/static.py.

2.3 `StealthyFetcher` and `StealthySession`: The Stealth Engine

This is the highest-cost, highest-success-rate tier. It launches a stealth-patched Chromium (Camoufox/Playwright under the hood) and applies a chain of anti-fingerprint techniques before navigation. Key features:

  • solve_cloudflare=True: solves Cloudflare Turnstile (and related human-challenge pages) in-page, returning the cleared DOM. Source: README.md.
  • headless=True: runs the browser in headless mode while preserving the same fingerprint set as headed mode.
  • adaptive=True: a class-level flag that, when combined with auto_save=True on parser calls, lets Scrapling persist the *current* element layout and re-locate the element after a future redesign.
  • google_search=False: disables Scrapling's automatic Google search fallback (a convenience path used to reach heavily-shielded pages via a search referrer).

Known limitation (community issue #100): On the "embedded" variant of the Cloudflare Turnstile widget, StealthyFetcher can wait indefinitely because the embedded iframe does not emit the same challenge-completion signal that the standalone widget does. The workaround is to pass a stricter wait_selector or to use DynamicFetcher with a manual sleep, but a first-class fix is still pending. Source: community issue #100.

3. Architecture and Data Flow

When a StealthyFetcher.fetch(...) call is made, the request travels through the following layers:

sequenceDiagram
    participant U as User Code
    participant F as StealthyFetcher
    participant E as Stealth Engine
    participant B as Chromium (CDP)
    participant W as Target Site
    U->>F: fetch(url, solve_cloudflare=True)
    F->>E: configure stealth args, fingerprints
    E->>B: launch patched browser context
    B->>W: GET url (with spoofed fingerprint)
    W-->>B: HTML + Turnstile iframe
    E->>B: detect challenge, run solver
    B-->>W: post challenge token
    W-->>B: cleared HTML
    B-->>E: serialized DOM
    E-->>F: Response
    F-->>U: Response (page.css, page.find_by_text, ...)

The stealth engine itself is a thin orchestrator: it does not invent new browser behavior, it composes existing tools (browser launch arguments, header sets, fingerprint databases) into a single, reproducible configuration. Constants such as default timeouts, user-agent pools, and challenge-detection selectors live in scrapling/engines/constants.py, and the logger used to emit diagnostic events is configured in scrapling/core/utils/_utils.py. The setup_logger() helper is wrapped in lru_cache(1, typed=True) to enforce a singleton pattern across the process.
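
The singleton effect of `lru_cache(1, typed=True)` on a zero-argument function is easy to demonstrate in isolation. The function body below is a hypothetical stand-in (the `sketch.singleton` logger name and handler setup are assumptions); only the decorator usage mirrors what the text describes.

```python
import logging
from functools import lru_cache


@lru_cache(1, typed=True)
def setup_logger() -> logging.Logger:
    """Configure the logger once; later calls return the cached object."""
    logger = logging.getLogger("sketch.singleton")
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("[%(asctime)s] %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


first = setup_logger()
second = setup_logger()  # served from the cache, body does not run again
```

Without the cache, every call would attach another handler and each log line would be printed multiple times; with it, the handler list stays at exactly one entry.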

4. Common Failure Modes and Mitigations

| Symptom | Likely Cause | Mitigation |
|---|---|---|
| StealthyFetcher hangs on Turnstile | Embedded widget variant doesn't emit completion signal (issue #100) | Use DynamicFetcher with explicit wait_selector; track upstream fix |
| Browser launches but page returns blank | JS-rendered SPA blocked by anti-bot | Escalate Fetcher → DynamicFetcher → StealthyFetcher |
| TLS impersonation rejected | Server pins a specific browser version | Update with scrapling install --force to refresh fingerprints (per release v0.4.9 notes) |
| Need to inspect intermediate requests | StealthyFetcher does not currently expose a request-interception hook (issue #159) | Drop to DynamicFetcher and use Playwright's page.on("request", ...) directly |
| Crawling paginated listings | Manual response.follow() required (issue #82) | Build a Spider subclass with an explicit next-page check |

Issue #159 ("listening to browser requests and getting responses") is an active feature request that, if implemented, would add a first-class event hook to the stealth engine so that callers can filter traffic without dropping to raw Playwright. Issue #82 ("automatic pagination detection") is orthogonal to the stealth engine but lives in the same response-handling layer.

See Also

  • Adaptive Parser (auto-save and adaptive selection)
  • Spider Framework (multi-session routing, pause/resume)
  • CLI: scrapling extract and scrapling install
  • Agent Skill (OpenClaw / Claude Code integration, issue #142)

Spider Framework and Crawling Infrastructure

Related topics: Scrapling Overview and System Architecture, Fetchers and the Stealth Engine

Overview and Purpose

The Spider Framework is the highest-level orchestration layer of Scrapling. While the fetchers (Fetcher, StealthyFetcher, DynamicFetcher) handle individual page requests and the parser handles selection, the spider subsystem exists to scale those primitives into concurrent, stateful, multi-page crawls. It is the recommended tool when a job requires more than one HTTP exchange, multiple session profiles, or resumable execution (Source: README.md).

The framework is described in the README as "A Full Crawling Framework" and exposes a Scrapy-like API: developers subclass Spider, define start_urls and an async parse(response) callback, and the engine takes over scheduling, dispatching, and persistence (Source: README.md). The official Agent Skill documentation reinforces this by positioning the Spider as the top rung of an escalation ladder:

get / FetcherSession
  └─ If JS required → fetch / DynamicSession
       └─ If blocked → stealthy-fetch / StealthySession
            └─ If multi-page → Spider

(Source: agent-skill/Scrapling-Skill/examples/README.md)

Core Components and Architecture

The subsystem is organised into a small set of cooperating modules under scrapling/spiders/:

| Module | Responsibility |
|---|---|
| __init__.py | Public re-exports of Spider, Request, and Response for from scrapling.spiders import … |
| spider.py | User-facing Spider base class with class-level configuration (name, start_urls, concurrent_requests) |
| engine.py | Async execution loop that drives request scheduling, dispatches fetches, and invokes callbacks |
| request.py | The Request object used to enqueue URLs and carry session/priority metadata |
| session.py | Session Manager that registers named sessions (e.g. fast, stealth) and routes requests by ID |
| scheduler.py | Queue, deduplication, and throttling logic |

(Source: scrapling/spiders/__init__.py, scrapling/spiders/spider.py, scrapling/spiders/engine.py, scrapling/spiders/request.py, scrapling/spiders/session.py, scrapling/spiders/scheduler.py)

Request Lifecycle

flowchart LR
    A[Spider subclass<br/>start_urls] --> B[Scheduler<br/>dedup + throttle]
    B --> C[Engine<br/>async loop]
    C --> D{Session Manager<br/>pick by id}
    D -->|fast| E[FetcherSession]
    D -->|stealth| F[AsyncStealthySession]
    E --> G[Response]
    F --> G[Response]
    G --> H[parse callback]
    H -->|yield dict| I[Items collection]
    H -->|yield Request| B
    H -->|Ctrl+C| J[Checkpoint<br/>pause/resume]
    I --> K[result.items.to_json/.to_jsonl]

The spider's parse() coroutine returns plain Python dicts for terminal items, or Request objects that are pushed back into the scheduler for further traversal. This is the documented pattern in the README's multi-page quotes example (Source: README.md).
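
The loop in the diagram (schedule, fetch, parse, re-enqueue) can be shown with a toy crawler. Everything here is a stand-in: `Request` is a hypothetical dataclass rather than Scrapling's class, `PAGES` fakes a two-page site, and `parse` is synchronous for brevity where the real callback is async.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for a follow-up request."""
    url: str


# Fake site: url -> (quotes on the page, next-page url or None).
PAGES = {
    "/page/1": (["quote A", "quote B"], "/page/2"),
    "/page/2": (["quote C"], "/page/1"),  # links back, to exercise dedup
}


def parse(url: str) -> Iterator[Union[dict, Request]]:
    """Yield dicts for items and Request objects for further traversal."""
    quotes, next_url = PAGES[url]
    for quote in quotes:
        yield {"quote": quote, "page": url}
    if next_url:
        yield Request(next_url)


def crawl(start_urls: list) -> list:
    queue = deque(start_urls)
    seen = set(start_urls)          # scheduler-style deduplication
    items: list = []
    while queue:
        url = queue.popleft()
        for result in parse(url):
            if isinstance(result, Request):
                if result.url not in seen:
                    seen.add(result.url)
                    queue.append(result.url)
            else:
                items.append(result)
    return items


items = crawl(["/page/1"])
```

The back-link from page 2 to page 1 is dropped by the `seen` set, which is why the crawl terminates after visiting each page once.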

Configuration and Common Patterns

Concurrency and Throttling

Spider.concurrent_requests (default shown in the README example is 10) controls the size of the in-flight pool. Per-domain throttling and download delays are applied by the scheduler so a single aggressive target does not starve other hosts (Source: README.md).
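
A common way to get both behaviours is a global semaphore for the in-flight pool plus one lock per domain. The asyncio sketch below illustrates that combination under assumed names (`crawl`, `download_delay`, the example URLs); it does not reproduce Scrapling's scheduler, and `asyncio.sleep` stands in for the network round trip.

```python
import asyncio
from urllib.parse import urlparse


async def crawl(urls: list, concurrent_requests: int = 2,
                download_delay: float = 0.01) -> tuple:
    semaphore = asyncio.Semaphore(concurrent_requests)  # global in-flight cap
    domain_locks: dict = {}                             # one lock per host
    stats = {"in_flight": 0, "peak": 0}

    async def fetch(url: str) -> str:
        domain = urlparse(url).netloc
        lock = domain_locks.setdefault(domain, asyncio.Lock())
        # Take the domain lock first so tasks queued behind a busy host
        # do not occupy a global slot while they wait.
        async with lock:
            async with semaphore:
                stats["in_flight"] += 1
                stats["peak"] = max(stats["peak"], stats["in_flight"])
                await asyncio.sleep(download_delay)  # stand-in for I/O + delay
                stats["in_flight"] -= 1
        return url

    fetched = await asyncio.gather(*(fetch(u) for u in urls))
    return list(fetched), stats["peak"]


URLS = [
    "https://a.example/1", "https://a.example/2", "https://a.example/3",
    "https://b.example/1", "https://b.example/2",
    "https://c.example/1",
]
fetched, peak = asyncio.run(crawl(URLS))
```

The lock ordering is the design choice worth noting: acquiring the semaphore first would let three queued requests to one slow host hold every slot and stall the other hosts.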

Multi-Session Routing

Spiders can mix fast HTTP and stealthy headless browser sessions under a single crawl. The README's MultiSessionSpider example overrides configure_sessions(self, manager) to register profiles:

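from scrapling.fetchers import FetcherSession, AsyncStealthySession

# Method of a Spider subclass: register a cheap HTTP session and a lazy stealth session.
def configure_sessions(self, manager):
    manager.add("fast", FetcherSession(impersonate="chrome"))
    manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)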

The lazy=True flag defers browser startup until the first request that actually needs it. Subsequent Request(url, session="stealth") calls are routed accordingly (Source: README.md).
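
The deferred-startup behaviour can be modelled with a few lines of plain Python. This is a conceptual sketch only: `SessionManager`, `FakeBrowserSession`, and their `start`/`get` methods are hypothetical and are not the classes in scrapling/spiders/session.py.

```python
class FakeBrowserSession:
    """Hypothetical stand-in for a session that is expensive to start."""

    def __init__(self) -> None:
        self.started = False

    def start(self) -> None:
        self.started = True  # a real session would launch a browser here


class SessionManager:
    """Register sessions by ID; start lazy ones only on first use."""

    def __init__(self) -> None:
        self._sessions: dict = {}

    def add(self, session_id: str, session: FakeBrowserSession, lazy: bool = False) -> None:
        self._sessions[session_id] = session
        if not lazy:
            session.start()

    def get(self, session_id: str) -> FakeBrowserSession:
        session = self._sessions[session_id]
        if not session.started:  # first request routed to a lazy session
            session.start()
        return session


manager = SessionManager()
fast, stealth = FakeBrowserSession(), FakeBrowserSession()
manager.add("fast", fast)
manager.add("stealth", stealth, lazy=True)

started_before_use = stealth.started   # still False: nothing has needed it yet
manager.get("stealth")                 # first routed request triggers startup
```

A crawl that never yields a request for the stealth session therefore never pays the browser launch cost.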

Streaming and Export

Long-running crawls can be consumed incrementally with async for item in spider.stream(), which surfaces real-time stats suitable for UIs or pipelines. Terminal export is built in: result.items.to_json("quotes.json") and result.items.to_jsonl(...) write the collected dicts without extra plumbing (Source: README.md).
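
Consuming an async stream and writing JSON Lines looks roughly like this with the standard library. The `stream` generator below is a hypothetical stand-in for spider.stream(), and the export helper is written by hand rather than calling result.items.to_jsonl(...).

```python
import asyncio
import io
import json


async def stream():
    """Stand-in for a spider stream: yield items as they become available."""
    for n in range(3):
        await asyncio.sleep(0)  # give control back to the event loop
        yield {"id": n, "text": f"quote {n}"}


async def export_jsonl(out) -> int:
    """Write one JSON object per line as items arrive; return the count."""
    count = 0
    async for item in stream():
        out.write(json.dumps(item) + "\n")
        count += 1
    return count


buffer = io.StringIO()
written = asyncio.run(export_jsonl(buffer))
```

JSON Lines suits streaming because each line is a complete document, so a partially written file is still readable up to the last finished line.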

Persistence, Robots, and Caching

  • Pause & Resume: Pressing Ctrl+C triggers a graceful shutdown that writes a checkpoint; restarting the spider resumes from the same state (Source: README.md).
  • Robots.txt: Setting robots_txt_obey=True makes the scheduler honour Disallow, Crawl-delay, and Request-rate directives with per-domain caching (Source: README.md).
  • Development Mode: The first run caches responses to disk; later runs replay them so parse() can be iterated without re-hitting target servers (Source: README.md).
  • Blocked Request Detection: The engine can auto-retry requests that look blocked, with user-customisable heuristics (Source: README.md).
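
The pause-and-resume behaviour from the first bullet can be sketched as "catch the interrupt, write the pending queue to disk, reload it on restart". The file layout, the `run` and `resume` helpers, and the `interrupt_after` parameter (which simulates Ctrl+C) are all assumptions for illustration, not Scrapling's checkpoint format.

```python
import json
import os
import tempfile
from collections import deque
from typing import Optional


def run(queue: deque, done: list, checkpoint_path: str,
        interrupt_after: Optional[int] = None) -> bool:
    """Process the queue; on interrupt, save state and return False."""
    processed = 0
    try:
        while queue:
            if interrupt_after is not None and processed == interrupt_after:
                raise KeyboardInterrupt  # simulate the user pressing Ctrl+C
            done.append(queue.popleft())
            processed += 1
    except KeyboardInterrupt:
        with open(checkpoint_path, "w") as fh:
            json.dump({"pending": list(queue), "done": done}, fh)
        return False
    return True


def resume(checkpoint_path: str) -> tuple:
    """Reload the pending queue and finished list from a checkpoint file."""
    with open(checkpoint_path) as fh:
        state = json.load(fh)
    return deque(state["pending"]), state["done"]


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first_run_finished = run(deque(["/1", "/2", "/3"]), [], path, interrupt_after=1)

queue, done = resume(path)                 # restart from the saved state
second_run_finished = run(queue, done, path)
```

The important property is that the interrupted URL is still in `pending` when the checkpoint is written, so no work is lost or repeated across the restart.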

Logging

Framework diagnostics are emitted through a dedicated scrapling logger configured by setup_logger() in core utilities. The logger is exposed via a ContextVar and a LoggerProxy, which lets the spider engine swap log handlers per async context without mutating global state (Source: scrapling/core/utils/_utils.py).

Community Considerations and Limitations

The Spider Framework deliberately leaves a few high-level concerns to the developer. Community issue #82 ("Add functionality to automatically detect pagination URLs") notes that spiders currently follow pagination only when the developer explicitly yields a Request via response.follow(); there is no automatic discovery of next-page links (Source: README.md follow example, community issue #82).

A related limitation surfaced in issue #100 is that StealthyFetcher can hang indefinitely on certain embedded Cloudflare Turnstile variants. Because the spider engine awaits the fetcher it dispatches, a hung Turnstile session blocks the entire in-flight pool for that session id. The standard workaround is to keep the stealth session lazy=True and isolate its use behind per-request timeouts or custom blocked-request detection logic (Source: community issue #100; README.md).
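
A per-request timeout of the kind mentioned above can be built with `asyncio.wait_for`. In this sketch `hung_fetch` is a hypothetical coroutine that imitates a browser session stuck on a challenge; how Scrapling exposes its own timeout settings is not covered here.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def hung_fetch(url: str) -> str:
    """Stand-in for a stealth fetch that never completes."""
    await asyncio.sleep(3600)
    return "<html>never reached</html>"


async def fetch_with_timeout(
    fetch: Callable[[str], Awaitable[str]], url: str, timeout: float
) -> Optional[str]:
    """Return the page body, or None if the fetch exceeds the timeout."""
    try:
        return await asyncio.wait_for(fetch(url), timeout=timeout)
    except asyncio.TimeoutError:
        return None  # caller can retry, escalate, or mark the URL as blocked


result = asyncio.run(fetch_with_timeout(hung_fetch, "https://example.com", 0.05))
```

`wait_for` cancels the inner coroutine when the deadline passes, so the slot it held in the in-flight pool is released instead of staying blocked.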

Issue #159 ("listening to browser requests and getting responses") requests a browser-side network interceptor; this is a spider-adjacent feature that does not currently exist in the spider module surface and would require engine-level hooks to be implemented safely (Source: community issue #159).

Finally, v0.4.9 added a --version CLI flag and refreshed all browser fingerprints, reminding users to run scrapling install --force so spider-managed stealth sessions pick up the new artefacts (Source: README release notes, README.md).

See Also

  • Fetchers and Session Management: the HTTP/stealth/dynamic primitives spiders route through
  • Adaptive Parser and Selector Engine: selection methods used inside parse()
  • CLI and MCP Server: scrapling install --force, the interactive shell, and MCP integration
  • Agent Skill for Scrapling: pre-packaged documentation for Claude Code / OpenClaw

AI Integration, CLI, Interactive Shell, and Agent Skill

Related topics: Scrapling Overview and System Architecture, Fetchers and the Stealth Engine

Scrapling ships with a layered tooling surface that wraps its adaptive parser and fetchers for three audiences: developers who want a terminal workflow (the scrapling CLI), engineers who want a REPL inside Python (scrapling.core.shell), and AI agents or assistants that need a structured, machine-readable description of the library (the agent-skill package and the MCP server). This page describes how these four surfaces (CLI, shell, Agent Skill, and MCP server) fit together and where to look in the source.

1. Command-Line Interface (`scrapling`)

The CLI entry point is implemented in scrapling/cli.py. It is the supported way to bootstrap the heavy browser/fingerprint dependencies that the fetchers need.

1.1 `scrapling install`

The headline subcommand downloads all browser engines and their system dependencies that DynamicFetcher and StealthyFetcher require at runtime. Source: scrapling/cli.py:1-50.

pip install "scrapling[fetchers]"
scrapling install          # normal install
scrapling install --force  # force reinstall of browsers/fingerprints

Release v0.4.9 explicitly instructs users to run scrapling install --force after upgrading because all bundled browser binaries and fingerprints were refreshed. The CLI is also invokable programmatically, which is useful in CI and inside the interactive shell:

from scrapling.cli import install

install([], standalone_mode=False)           # normal install
install(["--force"], standalone_mode=False)  # force reinstall

Source: README.md:installation section.

1.2 `--version` flag and ergonomics

A --version flag was added to the CLI by a community contributor (@ETM-Code) in v0.4.9. Source: README.md:release notes v0.4.9. The CLI also exposes the MCP server installation entry point used by the AI integration, summarized in the badge block of the README.

2. Interactive Python Shell (`scrapling.core.shell`)

The interactive shell lives in scrapling/core/shell.py and its companion signature module scrapling/core/_shell_signatures.py. It is a Python REPL pre-configured with Scrapling's parser, fetchers, and helpers so users can prototype selectors and fetches interactively.

Supporting utilities, including colored output and command parsing, are kept in scrapling/core/utils/_shell.py. The _shell_signatures.py file declares the type signatures that the shell prints as help text when the user types a partial command, mirroring the API surface of the parser (Selector, Adaptor, and the text-processing methods).

The shell is intended for ad-hoc exploration: paste a URL, run page.css(...), inspect results, and iterate. Because it shares the same logger proxy defined in scrapling/core/utils/_utils.py, output is consistent with the CLI and spiders.

3. AI Integration and the Agent Skill

The AI surface has two parts: a programmatic integration and a packaged Agent Skill that follows the AgentSkill specification.

3.1 Programmatic AI module

scrapling/core/ai.py provides the AI-facing helpers (used internally by the MCP server and externally by users who want to embed Scrapling in LLM toolchains). It is exposed via the optional [ai] extra:

pip install "scrapling[ai]"

This extra enables the MCP server referenced from the README's documentation index (https://scrapling.readthedocs.io/en/latest/ai/mcp-server.html). Source: README.md:documentation links.

3.2 Agent Skill (`agent-skill/`)

The agent-skill/ directory ships a ready-to-install Agent Skill bundle. Per agent-skill/README.md, the skill:

  • Aligns with the AgentSkill specification, so it is readable by OpenClaw, Claude Code, and other agentic tools.
  • Encapsulates almost all of the documentation site content in Markdown, so the agent does not need to guess.
  • Is installable from a direct ZIP URL or via Clawhub.

clawhub install scrapling-official

This directly addresses the most engaged community request, Issue #142 "OpenClaw Support", which asked for first-class OpenClaw skill packaging. The skill also bundles runnable examples in agent-skill/Scrapling-Skill/examples/README.md, each example pairing with one of the four escalation tiers.

3.3 Translator (CSS ↔ XPath)

A small but important AI-adjacent utility is the selector translator at scrapling/core/translator.py. It is adapted from Parsel (BSD-licensed) and converts between CSS and XPath selectors. This is useful both for humans migrating selectors and for AI agents that need a normalized representation when reasoning over Scrapling's Selector objects.
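
To give a feel for what a CSS-to-XPath translation produces, here is a deliberately tiny sketch that handles only `tag`, `#id`, and `.class` in a single compound selector. It is not the Parsel-derived translator: `css_to_xpath` and its regular expression are hypothetical, and real translators cover combinators, attribute selectors, and pseudo-elements that this one rejects.

```python
import re

# Matches an optional tag, then an optional #id, then an optional .class.
_SIMPLE = re.compile(
    r"^(?P<tag>[a-zA-Z][\w-]*)?(?:#(?P<id>[\w-]+))?(?:\.(?P<cls>[\w-]+))?$"
)


def css_to_xpath(selector: str) -> str:
    """Translate a very small subset of CSS into an equivalent XPath."""
    match = _SIMPLE.match(selector.strip())
    if not match or not any(match.groupdict().values()):
        raise ValueError(f"unsupported selector: {selector!r}")

    xpath = "//" + (match["tag"] or "*")
    if match["id"]:
        xpath += f"[@id='{match['id']}']"
    if match["cls"]:
        # Pad with spaces so 'quote' does not match 'quote-box'.
        xpath += (
            "[contains(concat(' ', normalize-space(@class), ' '), "
            f"' {match['cls']} ')]"
        )
    return xpath


by_class = css_to_xpath("div.quote")
by_id = css_to_xpath("#main")
```

The class test is the instructive part: XPath 1.0 has no native class matching, so the attribute is normalised and padded with spaces before a substring check.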

4. Component Map and Community Context

The diagram below shows how the four surfaces share the same parser and fetcher core, and how the Agent Skill packages documentation for LLM agents.

flowchart LR
  subgraph Users
    Dev[Developer]
    REPL[Power user]
    Agent[AI Agent / OpenClaw / Claude Code]
  end
  CLI[scrapling CLI<br/>scrapling/cli.py]
  Shell[Interactive shell<br/>scrapling/core/shell.py]
  AI[AI module + MCP<br/>scrapling/core/ai.py]
  Skill[Agent Skill bundle<br/>agent-skill/]
  Trans[Translator<br/>scrapling/core/translator.py]
  Core[Parser + Fetchers core]
  Dev --> CLI
  REPL --> Shell
  Agent --> Skill
  Agent --> AI
  CLI --> Core
  Shell --> Core
  AI --> Core
  Skill -.docs.-> Agent
  Trans --> Core

Community context worth keeping in mind:

  • #142 (OpenClaw Support): The agent-skill/ package with the clawhub install scrapling-official workflow is the direct response to this request. Source: agent-skill/README.md.
  • #159 (Listening to browser requests): Tied to the fetcher/stealthy surface rather than the AI surface; it is noted here as adjacent work.
  • #82 (Automatic pagination): Could be exposed through the Agent Skill as a documented helper in a future release; not yet shipped.

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 12 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/294

2. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/350

3. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/299

4. Installation risk: Installation risk requires verification

  • Severity: medium
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/348

5. Capability evidence risk: Capability evidence risk requires verification

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: capability.assumptions | https://github.com/D4Vinci/Scrapling

6. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/349

7. Maintenance risk: Maintenance risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/D4Vinci/Scrapling

8. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: downstream_validation.risk_items | https://github.com/D4Vinci/Scrapling

9. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: risks.scoring_risks | https://github.com/D4Vinci/Scrapling

10. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/295

11. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: issue_or_pr_quality=unknown.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/D4Vinci/Scrapling

12. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: release_recency=unknown.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/D4Vinci/Scrapling
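
Every item above recommends the same check: reproduce the install and quickstart path in an isolated environment. A minimal sketch of that check follows, using only the Python standard library. It assumes the package is published on PyPI as scrapling and imports under the same name; the helper names (venv_python, install_commands, reproduce_install) are hypothetical and not part of Scrapling. The official README may list extra installation steps (for example, optional extras or browser downloads) that this sketch does not cover.

```python
from __future__ import annotations

import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def venv_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_commands(
    env_dir: Path, package: str = "scrapling", module: str = "scrapling"
) -> list[list[str]]:
    """Build the commands to run inside the throwaway environment.

    The package and module names are assumptions; adjust them to match
    the install instructions in the official README.
    """
    py = str(venv_python(env_dir))
    return [
        [py, "-m", "pip", "install", "--upgrade", "pip"],
        [py, "-m", "pip", "install", package],
        # Smoke test: the import must succeed in the fresh environment.
        [py, "-c", f"import {module}; print('import ok')"],
    ]


def reproduce_install(package: str = "scrapling", module: str = "scrapling") -> bool:
    """Create a temporary venv, install the package, and try importing it.

    Returns True only if every step exits with status 0. The environment
    is deleted afterwards, so nothing leaks into the host interpreter.
    """
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "env"
        venv.create(env_dir, with_pip=True)
        for cmd in install_commands(env_dir, package, module):
            if subprocess.run(cmd).returncode != 0:
                return False
    return True
```

Calling reproduce_install() needs network access and returns True when all three steps succeed. A False result is a cue to compare the failing step against the linked issues before continuing.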

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (the count of project-level external discussion links exposed on this manual page).

Use: review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using Scrapling with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence