Doramagic Project Pack · Human Manual
Scrapling
🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!
Scrapling Overview and System Architecture
Related topics: Fetchers and the Stealth Engine, Spider Framework and Crawling Infrastructure, AI Integration, CLI, Interactive Shell, and Agent Skill
Introduction and Purpose
Scrapling is an adaptive web scraping framework that scales from a single HTTP request to a full, concurrent crawl. As described in the project README, the framework "handles everything from a single request to a full-scale crawl," with three distinguishing capabilities: an adaptive parser that relocates elements when a site redesigns, fetchers that bypass modern anti-bot systems (including Cloudflare Turnstile) out of the box, and a Scrapy-like spider framework for multi-session crawls. Source: README.md:1-15
The framework is positioned for both ad-hoc scripters and production-scale operators. The README frames it as "Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone." Source: README.md:10-15 Performance is a primary design goal: the parser benchmark (averaging 100+ runs) shows Scrapling at 2.02 ms versus 1584.31 ms for BeautifulSoup with lxml, a ~784× speedup. Source: README.md:55-70
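The quoted speedup is simply the ratio of the two timings. A quick check, using only the figures cited above:

```python
# Timings quoted from the README benchmark, in milliseconds.
scrapling_ms = 2.02
bs4_lxml_ms = 1584.31

# Speedup is the slower time divided by the faster time.
speedup = bs4_lxml_ms / scrapling_ms
print(f"~{speedup:.0f}x")
```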
High-Level Architecture
Scrapling's design separates concerns into four cooperating subsystems: a parser engine, a layered set of fetchers, a spider framework, and a CLI/MCP integration layer. The following diagram shows how user code flows through these subsystems and the typical escalation path between fetchers.
flowchart TD
A[User Code: Python or CLI] --> B{Choose Fetcher Layer}
B -->|Lightweight HTTP| C[Fetcher / FetcherSession / AsyncFetcher]
B -->|JS rendering required| D[DynamicFetcher / DynamicSession]
B -->|Anti-bot bypass required| E[StealthyFetcher / StealthySession]
C --> F[Response / Page Object]
D --> F
E --> F
F --> G[Adaptive Parser<br/>css / xpath / regex]
G -->|auto_save=True| H[(Element Store)]
G -->|adaptive=True| I[Relocated Elements]
F --> J[Spider Framework]
J --> K[Concurrent Crawl +<br/>Pause/Resume + Proxy Rotation]
H --> L[Structured Output]
I --> L
K --> L
The escalation guide shown in the examples README makes this layering explicit: start with Fetcher/FetcherSession, escalate to DynamicSession when JavaScript is required, escalate again to StealthySession when the site blocks the request, and finally use the Spider for multi-page crawls. Source: agent-skill/Scrapling-Skill/examples/README.md:40-50
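The escalation order can be sketched as a loop over tiers, cheapest first. This is a standard-library illustration only: the `Tier` type, `fetch_with_escalation`, and the three stand-in fetch functions are hypothetical and do not call Scrapling.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical tier: a label plus a callable returning HTML, or None on failure.
Tier = Tuple[str, Callable[[str], Optional[str]]]


def fetch_with_escalation(url: str, tiers: Sequence[Tier]) -> Tuple[str, str]:
    """Try each tier in order and return (tier_name, html) from the first success."""
    for name, fetch in tiers:
        html = fetch(url)
        if html:  # None or an empty string means this tier was not enough
            return name, html
    raise RuntimeError(f"all tiers failed for {url}")


# Stand-ins that simulate a page whose content only appears after JS rendering.
def plain_http(url: str) -> Optional[str]:
    return None


def browser(url: str) -> Optional[str]:
    return "<html><body>rendered</body></html>"


def stealth_browser(url: str) -> Optional[str]:
    return "<html><body>rendered (stealth)</body></html>"


TIERS = [("http", plain_http), ("dynamic", browser), ("stealth", stealth_browser)]
print(fetch_with_escalation("https://example.com", TIERS))
```

In practice each tier would wrap the corresponding fetcher, and the success test would inspect the response rather than check for a non-empty string.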
Core Components
Parser Engine
The parser is an lxml-based engine that exposes a Scrapy/Parsel-compatible API. The README notes that Scrapling includes "Code adapted from Parsel (BSD License) - Used for translator submodule," indicating that the CSS/XPath translator layer is reused rather than reimplemented. Source: README.md:175-180 Three features are highlighted in the README: enhanced text processing with regex and cleaning helpers, auto selector generation that produces robust CSS/XPath selectors, and a familiar API that uses the same pseudo-elements as Scrapy/Parsel. Source: README.md:5-15
The adaptive element-finding feature is the parser's most distinctive capability. Setting auto_save=True on a selector call persists the matched element's signature; later, when the same call is invoked with adaptive=True, the parser uses element similarity scoring to relocate the element even after structural changes. The benchmark table shows this similarity search at 2.39 ms versus 12.45 ms for AutoScraper, a 5.2× speedup. Source: README.md:75-85
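To make the idea of similarity-based relocation concrete, here is a minimal sketch using Jaccard similarity over element tokens. It is not Scrapling's scoring algorithm: the flat element records, `tokens`, `similarity`, `relocate`, and the 0.25 threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Set

Element = Dict[str, object]  # hypothetical record: tag, parent, classes, text


def tokens(el: Element) -> Set[str]:
    """Turn an element record into a set of comparable tokens."""
    toks = {f"tag:{el['tag']}", f"parent:{el['parent']}"}
    toks.update(f"class:{c}" for c in el["classes"])
    toks.update(f"word:{w.lower()}" for w in str(el["text"]).split())
    return toks


def similarity(a: Element, b: Element) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def relocate(saved: Element, candidates: List[Element], threshold: float = 0.25) -> Optional[Element]:
    """Return the candidate most similar to the saved element, if it is similar enough."""
    best = max(candidates, key=lambda el: similarity(saved, el), default=None)
    if best is not None and similarity(saved, best) >= threshold:
        return best
    return None


# Signature saved before the redesign.
saved = {"tag": "span", "parent": "div", "classes": ["price", "highlight"], "text": "Price: $19.99"}

# Elements found on the redesigned page.
page = [
    {"tag": "a", "parent": "nav", "classes": ["link"], "text": "Home"},
    {"tag": "span", "parent": "section", "classes": ["price", "sale"], "text": "Price: $17.49"},
    {"tag": "span", "parent": "footer", "classes": ["copyright"], "text": "All rights reserved"},
]
match = relocate(saved, page)
```

Here the second candidate shares three of nine distinct tokens with the saved signature, so it wins even though its parent, one class, and its text value changed.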
Fetcher Layer
Four fetchers are exposed through scrapling.fetchers: Fetcher, AsyncFetcher, StealthyFetcher, and DynamicFetcher. Session classes (FetcherSession, StealthySession, AsyncStealthySession, DynamicSession, AsyncDynamicSession) reuse a connection or browser across requests. The basic usage in the README demonstrates both patterns. Source: README.md:25-40
The stealth fetcher is the most powerful member of this layer. As shown in the README's example, StealthyFetcher.fetch(..., headless=True, network_idle=True) can be used to fetch pages "under the radar." A known limitation tracked in the community is that the embedded Turnstile variant can cause StealthyFetcher to wait indefinitely. Source: community issue #100 This is an open failure mode worth noting for users scraping Turnstile-protected pages.
Spider Framework
The spider framework is a Scrapy-compatible API with start_urls, async parse callbacks, and Request/Response objects. Key features listed in the README include configurable concurrency limits, per-domain throttling, download delays, multi-session support (HTTP and stealth headless browsers in a single spider, routed by session ID), checkpoint-based pause/resume (Ctrl+C triggers a graceful shutdown; restart resumes from the last checkpoint), and a streaming mode via async for item in spider.stream(). Source: README.md:90-115
The spiders also implement automatic blocked-request detection and retry, and an optional robots_txt_obey flag that respects Disallow, Crawl-delay, and Request-rate directives with per-domain policies. Source: README.md:115-120
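The three directives can be explored with Python's standard `urllib.robotparser`. This sketch shows what honoring them involves; it makes no claim about how Scrapling parses robots.txt internally, and the rules and the `my-crawler` user agent are made up.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt covering the three directives mentioned above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Request-rate: 5/10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Disallow: may this agent fetch a given URL?
allowed = parser.can_fetch("my-crawler", "https://example.com/quotes/page/1")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")

# Crawl-delay: seconds to wait between requests.
delay = parser.crawl_delay("my-crawler")

# Request-rate: a named tuple of (requests, seconds).
rate = parser.request_rate("my-crawler")
```

A per-domain policy then amounts to keeping one parsed result per host and consulting it before each request is scheduled.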
CLI, MCP, and Agent Skills
The CLI is the entry point for installation and quick extraction. After pip install "scrapling[fetchers]", users run scrapling install (or scrapling install --force) to download browsers, system dependencies, and fingerprint-manipulation libraries. The scrapling extract subcommand supports three fetch modes (get, fetch, and stealthy-fetch) and writes Markdown or plain-text output, with optional --css-selector filtering and --impersonate chrome for TLS fingerprint spoofing. Source: README.md:140-160
The project also ships an Agent Skill package under agent-skill/ that "encapsulates almost all of the documentation website's content in Markdown" and conforms to the AgentSkill specification. The skill is consumable by OpenClaw, Claude Code, and other agentic tools, and is published on Clawhub where it can be installed with clawhub install scrapling-official. Source: agent-skill/README.md:1-15 Community discussion in issue #142 requested explicit OpenClaw support; the published skill is the implementation of that request.
Installation and Observability
Scrapling requires Python 3.10 or higher. The base pip install scrapling only includes the parser engine; fetchers, spiders, and CLI tools require pip install "scrapling[fetchers]" followed by scrapling install to fetch browser binaries. Source: README.md:80-105
A LoggerProxy is provided in scrapling.core.utils._utils to expose the standard logging.Logger interface through a context variable, enabling per-task log routing in async crawls. The default logger uses a single console handler with the format [%(asctime)s] %(levelname)s: %(message)s at INFO level. Source: scrapling/core/utils/_utils.py:20-55
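The proxy-over-context-variable pattern can be sketched with the standard library alone. The class body, the `sketch.*` logger names, and both helper functions below are illustrative assumptions, not a copy of Scrapling's `LoggerProxy`.

```python
import logging
from contextvars import ContextVar, copy_context

_default = logging.getLogger("sketch.default")
_current_logger = ContextVar("current_logger", default=_default)


class LoggerProxy:
    """Forward every attribute lookup to the logger active in the current context."""

    def __getattr__(self, name):
        return getattr(_current_logger.get(), name)


log = LoggerProxy()


def active_logger_name() -> str:
    """Report which underlying logger the proxy currently resolves to."""
    return log.name


def run_in_task_context(task_logger_name: str) -> str:
    """Bind a task-specific logger inside a copied context and report what the proxy sees."""

    def body() -> str:
        _current_logger.set(logging.getLogger(task_logger_name))
        return active_logger_name()

    # The copied context isolates the set() call from the caller's context.
    return copy_context().run(body)
```

Because asyncio tasks each run in their own copy of the context, the same mechanism lets concurrent crawls route logs separately without touching global logger state.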
Community-Driven Roadmap Items
Several open feature requests signal where the architecture is evolving. Issue #159 requests a "listen to browser requests and get responses" capability, which would extend the fetcher layer with a request-interception hook (useful for sites that expose data only via XHR/fetch calls). Issue #82 requests automatic pagination-URL detection, which would extend the spider's link-following logic. Source: community issues #159, #82 Users evaluating Scrapling for production use should track these for future releases.
See Also
- Selection Methods and Adaptive Parser
- Fetcher Selection Guide
- Spider Architecture
- Proxy Rotation and Blocking Detection
- CLI Reference
- MCP Server Integration
Source: https://github.com/D4Vinci/Scrapling / Human Manual
Fetchers and the Stealth Engine
Related topics: Scrapling Overview and System Architecture, Spider Framework and Crawling Infrastructure
1. Purpose and Scope
The "Fetchers" subsystem is the network boundary of Scrapling. It encapsulates everything between a Python call and a remote HTTP/HTTPS endpoint, providing three escalating levels of fidelity:
- A pure-HTTP fetcher based on curl_cffi with TLS fingerprint impersonation.
- A real browser fetcher driven by Playwright/CDP for JavaScript-rendered pages.
- A stealth-tuned browser fetcher that injects patches, rotates fingerprints, and can solve Cloudflare Turnstile challenges without third-party services.
The "Stealth Engine" is the subset of that stack used by StealthyFetcher / StealthySession. It re-uses the browser automation engine but layers anti-detection logic, optional captcha solving, and adaptive behavior on top. The two concepts are intentionally overlapping: a fetcher is the *interface* a developer calls, while the stealth engine is the *implementation strategy* used to satisfy anti-bot systems. Source: README.md and scrapling/fetchers/__init__.py.
2. The Fetcher Hierarchy
All fetchers return a Response-like object, but they differ in cost and capability. The community-maintained skill recommends an "escalation" approach: start with the cheapest fetcher and only step up when the page refuses to load correctly. Source: agent-skill/Scrapling-Skill/examples/README.md.
flowchart TD
A[Need data from URL] --> B{Plain HTTP enough?}
B -- Yes --> C[Fetcher / FetcherSession<br/>curl_cffi TLS impersonation]
B -- No --> D{JS rendering needed?}
D -- Yes --> E[DynamicFetcher / DynamicSession<br/>Playwright/Chromium]
D -- No --> F{Anti-bot blocking?}
F -- Yes --> G[StealthyFetcher / StealthySession<br/>Stealth Engine + Turnstile solver]
C --> H[Response]
E --> H
G --> H
H --> I[Adaptive Parser<br/>auto_save / adaptive=True]
The same escalation model is exposed in the CLI as extract get, extract fetch, and extract stealthy-fetch. Source: README.md.
2.1 `Fetcher` and `FetcherSession`: Static HTTP
Fetcher is the lightweight entry point. It performs a single request through the curl_cffi transport, optionally impersonating a real browser's TLS fingerprint (e.g. impersonate='chrome'). When you need connection reuse (cookies, keep-alive, multiple requests in a loop), wrap it in FetcherSession and use session.get(...). Source: scrapling/fetchers/requests.py.
The session also accepts a stealthy_headers=True flag, which is a cheap compromise: it does not run a browser, but it attaches header order and values that mimic a recent Chrome build. This is often sufficient against naive bot detectors and is significantly cheaper than launching Chromium. Source: README.md.
2.2 `DynamicFetcher` and `DynamicSession`: Browser Automation
When the target page renders content client-side (SPAs, infinite scroll, lazy-loaded sections), you need a real browser. DynamicFetcher drives a headless Chromium via Playwright/CDP and returns the post-render DOM. Important toggles include headless, network_idle (wait for the network to settle), and disable_resources (skip stylesheets/images to save bandwidth). Asynchronous use goes through DynamicFetcher.async_fetch or the AsyncDynamicSession class. Source: scrapling/fetchers/chrome.py and scrapling/engines/static.py.
2.3 `StealthyFetcher` and `StealthySession`: The Stealth Engine
This is the highest-cost, highest-success-rate tier. It launches a stealth-patched Chromium (Camoufox/Playwright under the hood) and applies a chain of anti-fingerprint techniques before navigation. Key features:
- solve_cloudflare=True: solves Cloudflare Turnstile (and related human-challenge pages) in-page, returning the cleared DOM. Source: README.md.
- headless=True: runs the browser in headless mode while preserving the same fingerprint set as headed mode.
- adaptive=True: a class-level flag that, when combined with auto_save=True on parser calls, lets Scrapling persist the *current* element layout and re-locate the element after a future redesign.
- google_search=False: disables Scrapling's automatic Google search fallback (a convenience path used to reach heavily-shielded pages via a search referrer).
Known limitation (community issue #100): On the "embedded" variant of the Cloudflare Turnstile widget, StealthyFetcher can wait indefinitely because the embedded iframe does not emit the same challenge-completion signal that the standalone widget does. The workaround is to pass a stricter wait_selector or to use DynamicFetcher with a manual sleep, but a first-class fix is still pending. Source: community issue #100.
3. Architecture and Data Flow
When a StealthyFetcher.fetch(...) call is made, the request travels through the following layers:
sequenceDiagram
participant U as User Code
participant F as StealthyFetcher
participant E as Stealth Engine
participant B as Chromium (CDP)
participant W as Target Site
U->>F: fetch(url, solve_cloudflare=True)
F->>E: configure stealth args, fingerprints
E->>B: launch patched browser context
B->>W: GET url (with spoofed fingerprint)
W-->>B: HTML + Turnstile iframe
E->>B: detect challenge, run solver
B-->>W: post challenge token
W-->>B: cleared HTML
B-->>E: serialized DOM
E-->>F: Response
F-->>U: Response (page.css, page.find_by_text, ...)
The stealth engine itself is a thin orchestrator: it does not invent new browser behavior but composes existing tools (browser launch arguments, header sets, fingerprint databases) into a single, reproducible configuration. Constants such as default timeouts, user-agent pools, and challenge-detection selectors live in scrapling/engines/constants.py, and the logger used to emit diagnostic events is configured in scrapling/core/utils/_utils.py. The setup_logger() helper is wrapped in lru_cache(1, typed=True) to enforce a singleton pattern across the process.
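The caching trick is easy to reproduce in isolation. In this sketch only the decorator usage mirrors what the text describes; the function body and the `sketch.singleton` logger name are assumptions.

```python
import logging
from functools import lru_cache


@lru_cache(1, typed=True)
def setup_logger() -> logging.Logger:
    """Build the logger on the first call; later calls return the cached object."""
    logger = logging.getLogger("sketch.singleton")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("[%(asctime)s] %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger


first = setup_logger()
second = setup_logger()
```

Without the cache, every call would attach another handler and each message would be printed multiple times; with it, the setup body runs exactly once per process.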
4. Common Failure Modes and Mitigations
| Symptom | Likely Cause | Mitigation |
|---|---|---|
| StealthyFetcher hangs on Turnstile | Embedded widget variant doesn't emit completion signal (issue #100) | Use DynamicFetcher with explicit wait_selector; track upstream fix |
| Browser launches but page returns blank | JS-rendered SPA blocked by anti-bot | Escalate Fetcher → DynamicFetcher → StealthyFetcher |
| TLS impersonation rejected | Server pins a specific browser version | Update with scrapling install --force to refresh fingerprints (per release v0.4.9 notes) |
| Need to inspect intermediate requests | StealthyFetcher does not currently expose a request-interception hook (issue #159) | Drop to DynamicFetcher and use Playwright's page.on("request", ...) directly |
| Crawling paginated listings | Manual response.follow() required (issue #82) | Build a Spider subclass with an explicit next-page check |
Issue #159 ("listening to browser requests and getting responses") is an active feature request that, if implemented, would add a first-class event hook to the stealth engine so that callers can filter traffic without dropping to raw Playwright. Issue #82 ("automatic pagination detection") is orthogonal to the stealth engine but lives in the same response-handling layer.
See Also
- Adaptive Parser (auto-save and adaptive selection)
- Spider Framework (multi-session routing, pause/resume)
- CLI: scrapling extract and scrapling install
- Agent Skill (OpenClaw / Claude Code integration, issue #142)
Spider Framework and Crawling Infrastructure
Related topics: Scrapling Overview and System Architecture, Fetchers and the Stealth Engine
Overview and Purpose
The Spider Framework is the highest-level orchestration layer of Scrapling. While the fetchers (Fetcher, StealthyFetcher, DynamicFetcher) handle individual page requests and the parser handles selection, the spider subsystem exists to scale those primitives into concurrent, stateful, multi-page crawls. It is the recommended tool when a job requires more than one HTTP exchange, multiple session profiles, or resumable execution (Source: README.md).
The framework is described in the README as "A Full Crawling Framework" and exposes a Scrapy-like API: developers subclass Spider, define start_urls and an async parse(response) callback, and the engine takes over scheduling, dispatching, and persistence (Source: README.md). The official Agent Skill documentation reinforces this by positioning the Spider as the top rung of an escalation ladder:
get / FetcherSession
  └─ If JS required → fetch / DynamicSession
       └─ If blocked → stealthy-fetch / StealthySession
            └─ If multi-page → Spider
(Source: agent-skill/Scrapling-Skill/examples/README.md)
Core Components and Architecture
The subsystem is organised into a small set of cooperating modules under scrapling/spiders/:
| Module | Responsibility |
|---|---|
| `__init__.py` | Public re-exports of Spider, Request, and Response for from scrapling.spiders import … |
| `spider.py` | User-facing Spider base class with class-level configuration (name, start_urls, concurrent_requests) |
| `engine.py` | Async execution loop that drives request scheduling, dispatches fetches, and invokes callbacks |
| `request.py` | The Request object used to enqueue URLs and carry session/priority metadata |
| `session.py` | Session Manager that registers named sessions (e.g. fast, stealth) and routes requests by ID |
| `scheduler.py` | Queue, deduplication, and throttling logic |
(Source: scrapling/spiders/__init__.py, scrapling/spiders/spider.py, scrapling/spiders/engine.py, scrapling/spiders/request.py, scrapling/spiders/session.py, scrapling/spiders/scheduler.py)
Request Lifecycle
flowchart LR
A[Spider subclass<br/>start_urls] --> B[Scheduler<br/>dedup + throttle]
B --> C[Engine<br/>async loop]
C --> D{Session Manager<br/>pick by id}
D -->|fast| E[FetcherSession]
D -->|stealth| F[AsyncStealthySession]
E --> G[Response]
F --> G[Response]
G --> H[parse callback]
H -->|yield dict| I[Items collection]
H -->|yield Request| B
H -->|Ctrl+C| J[Checkpoint<br/>pause/resume]
I --> K[result.items.to_json/.to_jsonl]
The spider's async parse() callback yields plain Python dicts for terminal items, or Request objects that are pushed back into the scheduler for further traversal. This is the documented pattern in the README's multi-page quotes example (Source: README.md).
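The loop behind this lifecycle can be illustrated with a synchronous, standard-library sketch: a queue, a dedup set, and a callback that yields either items or follow-up requests. The `Request` dataclass, `crawl`, and the fake two-page site are hypothetical stand-ins, not Scrapling classes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for a follow-up request."""
    url: str


def crawl(start_urls, fetch, parse):
    """Drain the queue, collecting dict items and scheduling yielded Requests."""
    queue, seen, items = deque(), set(), []

    def schedule(url):
        # Dedup: each URL enters the queue at most once.
        if url not in seen:
            seen.add(url)
            queue.append(Request(url))

    for url in start_urls:
        schedule(url)
    while queue:
        request = queue.popleft()
        body = fetch(request.url)
        for output in parse(request.url, body):
            if isinstance(output, Request):
                schedule(output.url)
            else:
                items.append(output)
    return items


# Fake site: page 1 links to page 2, and page 2 links back to page 1.
SITE = {"/p1": ("quote one", "/p2"), "/p2": ("quote two", "/p1")}


def fake_fetch(url):
    return url  # the "body" is just the key into SITE for this sketch


def parse(url, body):
    text, next_url = SITE[body]
    yield {"url": url, "text": text}
    yield Request(next_url)


items = crawl(["/p1"], fake_fetch, parse)
```

The back-link from page 2 to page 1 is dropped by the dedup set, which is what keeps a crawl over cyclic links from running forever.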
Configuration and Common Patterns
Concurrency and Throttling
Spider.concurrent_requests (default shown in the README example is 10) controls the size of the in-flight pool. Per-domain throttling and download delays are applied by the scheduler so a single aggressive target does not starve other hosts (Source: README.md).
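A bounded in-flight pool is commonly built on `asyncio.Semaphore`. The sketch below shows that general technique and does not claim to match the spider engine's implementation; `bounded_gather` and the `Probe` fake fetcher are illustrative names.

```python
import asyncio


async def bounded_gather(urls, fetch, limit=10):
    """Run fetch(url) for every URL with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def worker(url):
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(worker(u) for u in urls))


class Probe:
    """Fake fetcher that records the peak number of simultaneous calls."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def fetch(self, url):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # simulated network latency
        self.active -= 1
        return url.upper()


probe = Probe()
results = asyncio.run(bounded_gather([f"u{i}" for i in range(8)], probe.fetch, limit=3))
```

Eight requests are issued, but the probe never observes more than three running at once, and `gather` still returns results in input order.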
Multi-Session Routing
Spiders can mix fast HTTP and stealthy headless browser sessions under a single crawl. The README's MultiSessionSpider example overrides configure_sessions(self, manager) to register profiles:
def configure_sessions(self, manager):
    # Register a cheap HTTP session and a browser session that starts on first use.
    manager.add("fast", FetcherSession(impersonate="chrome"))
    manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
The lazy=True flag defers browser startup until the first request that actually needs it. Subsequent Request(url, session="stealth") calls are routed accordingly (Source: README.md).
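Deferred startup can be modeled with a small registry that stores factories and builds a session only when it is first requested. Unlike the README example, which passes session instances, this hypothetical `LazySessionManager` takes zero-argument factories so the deferral is visible without a browser.

```python
from typing import Any, Callable, Dict


class LazySessionManager:
    """Hypothetical registry: eager sessions are built in add(), lazy ones on first get()."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Any]] = {}
        self._sessions: Dict[str, Any] = {}

    def add(self, session_id: str, factory: Callable[[], Any], lazy: bool = False) -> None:
        self._factories[session_id] = factory
        if not lazy:
            self._sessions[session_id] = factory()

    def get(self, session_id: str) -> Any:
        # Build the session on demand if it has not been started yet.
        if session_id not in self._sessions:
            self._sessions[session_id] = self._factories[session_id]()
        return self._sessions[session_id]

    def started(self, session_id: str) -> bool:
        return session_id in self._sessions


manager = LazySessionManager()
manager.add("fast", lambda: "http-session")
manager.add("stealth", lambda: "browser-session", lazy=True)
```

A crawl that never routes a request to the stealth session therefore never pays the cost of launching it.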
Streaming and Export
Long-running crawls can be consumed incrementally with async for item in spider.stream(), which surfaces real-time stats suitable for UIs or pipelines. Terminal export is built in: result.items.to_json("quotes.json") and result.items.to_jsonl(...) write the collected dicts without extra plumbing (Source: README.md).
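The consumption pattern looks like this with a plain async generator. `stream_items` is a hypothetical stand-in for `spider.stream()`, and the JSON Lines writer is a generic sketch rather than the library's exporter.

```python
import asyncio
import io
import json


async def stream_items():
    """Hypothetical async generator that yields items as they become available."""
    for i in range(3):
        await asyncio.sleep(0)  # hand control back to the event loop between items
        yield {"id": i, "text": f"quote {i}"}


async def export_jsonl(out) -> int:
    """Write one JSON object per line as items arrive; return how many were written."""
    count = 0
    async for item in stream_items():
        out.write(json.dumps(item) + "\n")
        count += 1
    return count


buffer = io.StringIO()
written = asyncio.run(export_jsonl(buffer))
lines = buffer.getvalue().splitlines()
```

Each item is written as soon as it is yielded, so memory use stays flat regardless of how long the crawl runs.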
Persistence, Robots, and Caching
- Pause & Resume: Pressing Ctrl+C triggers a graceful shutdown that writes a checkpoint; restarting the spider resumes from the same state (Source: README.md).
- Robots.txt: Setting robots_txt_obey=True makes the scheduler honour Disallow, Crawl-delay, and Request-rate directives with per-domain caching (Source: README.md).
- Development Mode: The first run caches responses to disk; later runs replay them so parse() can be iterated without re-hitting target servers (Source: README.md).
- Blocked Request Detection: The engine can auto-retry requests that look blocked, with user-customisable heuristics (Source: README.md).
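Checkpoint-based resume boils down to persisting the pending queue and the dedup set, then reloading them on the next start. The file layout and the `save_checkpoint`, `load_checkpoint`, and `run` helpers below are assumptions for illustration; Scrapling's checkpoint format is not described in the source.

```python
import json
import os
import tempfile


def save_checkpoint(path, pending, seen):
    """Persist the unfinished queue and the dedup set so a later run can continue."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"pending": list(pending), "seen": sorted(seen)}, fh)


def load_checkpoint(path):
    """Return (pending, seen), or empty state when no checkpoint exists."""
    if not os.path.exists(path):
        return [], set()
    with open(path, encoding="utf-8") as fh:
        state = json.load(fh)
    return state["pending"], set(state["seen"])


def run(path, start_urls, stop_after):
    """Process up to `stop_after` URLs, then checkpoint, as if interrupted."""
    pending, seen = load_checkpoint(path)
    if not pending and not seen:  # fresh start: nothing saved yet
        pending, seen = list(start_urls), set(start_urls)
    done = []
    while pending and len(done) < stop_after:
        done.append(pending.pop(0))
    save_checkpoint(path, pending, seen)
    return done


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "crawl.json")
    urls = ["/a", "/b", "/c"]
    first_run = run(ckpt, urls, stop_after=2)    # "interrupted" after two pages
    second_run = run(ckpt, urls, stop_after=10)  # resumes with what was left
```

The second run processes only the remaining URL because the saved `seen` set tells it the crawl had already started.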
Logging
Framework diagnostics are emitted through a dedicated scrapling logger configured by setup_logger() in core utilities. The logger is exposed via a ContextVar and a LoggerProxy, which lets the spider engine swap log handlers per async context without mutating global state (Source: scrapling/core/utils/_utils.py).
Community Considerations and Limitations
The Spider Framework deliberately leaves a few high-level concerns to the developer. Community issue #82 ("Add functionality to automatically detect pagination URLs") notes that spiders currently follow pagination only when the developer explicitly yields a Request via response.follow(); there is no automatic discovery of next-page links (Source: README.md follow example, community issue #82).
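Until automatic discovery exists, an explicit next-page check is straightforward to write. This standard-library sketch looks for a link marked rel="next"; inside a spider the same check would more naturally use the response's own selectors. `NextLinkFinder` and `find_next_page` are hypothetical helpers.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> or <link> tag whose rel contains 'next'."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.href is not None or tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "next" in rel and attr.get("href"):
            self.href = attr["href"]


def find_next_page(html: str, base_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None when no such link exists."""
    finder = NextLinkFinder()
    finder.feed(html)
    return urljoin(base_url, finder.href) if finder.href else None


sample = '<ul><li class="next"><a rel="next" href="/page/2/">Next</a></li></ul>'
next_url = find_next_page(sample, "https://example.com/page/1/")
```

Sites that do not mark pagination with rel="next" need a site-specific selector instead, which is part of why fully automatic detection is hard.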
A related limitation surfaced in issue #100 is that StealthyFetcher can hang indefinitely on certain embedded Cloudflare Turnstile variants. Because the spider engine awaits the fetcher it dispatches, a hung Turnstile session blocks the entire in-flight pool for that session id. The standard workaround is to keep the stealth session lazy=True and isolate its use behind per-request timeouts or custom blocked-request detection logic (Source: community issue #100; README.md).
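A per-request timeout can be imposed from the outside with `asyncio.wait_for`. The sketch uses hypothetical `hanging_fetch` and `quick_fetch` coroutines in place of a real session call; whether a cancelled browser fetch releases its page cleanly is an assumption that should be verified against the installed version.

```python
import asyncio


async def fetch_with_timeout(fetch, url, timeout):
    """Return the fetch result, or None if it does not finish within `timeout` seconds."""
    try:
        return await asyncio.wait_for(fetch(url), timeout=timeout)
    except asyncio.TimeoutError:
        return None  # the caller can retry, escalate, or mark the URL as blocked


async def hanging_fetch(url):
    await asyncio.sleep(3600)  # simulates a challenge page that never completes


async def quick_fetch(url):
    await asyncio.sleep(0)
    return f"ok:{url}"


async def demo():
    return (
        await fetch_with_timeout(hanging_fetch, "https://example.com/a", timeout=0.05),
        await fetch_with_timeout(quick_fetch, "https://example.com/b", timeout=0.05),
    )


timed_out, completed = asyncio.run(demo())
```

The hung request is cancelled after 50 ms and the second request proceeds normally, so one stuck page no longer holds a slot in the pool indefinitely.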
Issue #159 ("listening to browser requests and getting responses") requests a browser-side network interceptor; this is a spider-adjacent feature that does not currently exist in the spider module surface and would require engine-level hooks to be implemented safely (Source: community issue #159).
Finally, v0.4.9 added a --version CLI flag and refreshed all browser fingerprints, reminding users to run scrapling install --force so spider-managed stealth sessions pick up the new artefacts (Source: README release notes, README.md).
See Also
- Fetchers and Session Management: the HTTP/stealth/dynamic primitives spiders route through
- Adaptive Parser and Selector Engine: selection methods used inside parse()
- CLI and MCP Server: scrapling install --force, the interactive shell, and MCP integration
- Agent Skill for Scrapling: pre-packaged documentation for Claude Code / OpenClaw
AI Integration, CLI, Interactive Shell, and Agent Skill
Related topics: Scrapling Overview and System Architecture, Fetchers and the Stealth Engine
Scrapling ships with a layered tooling surface that wraps its adaptive parser and fetchers for three audiences: developers who want a terminal workflow (the scrapling CLI), engineers who want a REPL inside Python (scrapling.core.shell), and AI agents or assistants that need a structured, machine-readable description of the library (the agent-skill package and the MCP server). This page describes how those four surfaces (CLI, shell, Agent Skill, and MCP server) fit together and where to look in the source.
1. Command-Line Interface (`scrapling`)
The CLI entry point is implemented in scrapling/cli.py. It is the supported way to bootstrap the heavy browser/fingerprint dependencies that the fetchers need.
1.1 `scrapling install`
The headline subcommand downloads all browser engines and their system dependencies that DynamicFetcher and StealthyFetcher require at runtime. Source: scrapling/cli.py:1-50.
pip install "scrapling[fetchers]"
scrapling install # normal install
scrapling install --force # force reinstall of browsers/fingerprints
Release v0.4.9 explicitly instructs users to run scrapling install --force after upgrading because all bundled browser binaries and fingerprints were refreshed. The CLI is also invokable programmatically, which is useful in CI and inside the interactive shell:
from scrapling.cli import install
install([], standalone_mode=False) # normal install
install(["--force"], standalone_mode=False) # force reinstall
Source: README.md:installation section.
1.2 `--version` flag and ergonomics
A --version flag was added to the CLI by a community contributor (@ETM-Code) in v0.4.9. Source: README.md:release notes v0.4.9. The CLI also exposes the MCP server installation entry point used by the AI integration, summarized in the badge block of the README.
2. Interactive Python Shell (`scrapling.core.shell`)
The interactive shell lives in scrapling/core/shell.py and its companion signature module scrapling/core/_shell_signatures.py. It is a Python REPL pre-configured with Scrapling's parser, fetchers, and helpers so users can prototype selectors and fetches interactively.
Supporting utilities, including colored output and command parsing, are kept in scrapling/core/utils/_shell.py. The _shell_signatures.py file declares the type signatures that the shell prints as help text when the user types a partial command, mirroring the API surface of the parser (Selector, Adaptor, and the text-processing methods).
The shell is intended for ad-hoc exploration: paste a URL, run page.css(...), inspect results, and iterate. Because it shares the same logger proxy defined in scrapling/core/utils/_utils.py, output is consistent with the CLI and spiders.
3. AI Integration and the Agent Skill
The AI surface has two parts: a programmatic integration and a packaged Agent Skill that follows the AgentSkill specification.
3.1 Programmatic AI module
scrapling/core/ai.py provides the AI-facing helpers (used internally by the MCP server and externally by users who want to embed Scrapling in LLM toolchains). It is exposed via the optional [ai] extra:
pip install "scrapling[ai]"
This extra enables the MCP server referenced from the README's documentation index (https://scrapling.readthedocs.io/en/latest/ai/mcp-server.html). Source: README.md:documentation links.
3.2 Agent Skill (`agent-skill/`)
The agent-skill/ directory ships a ready-to-install Agent Skill bundle. Per agent-skill/README.md, the skill:
- Aligns with the AgentSkill specification, so it is readable by OpenClaw, Claude Code, and other agentic tools.
- Encapsulates almost all of the documentation site content in Markdown, so the agent does not need to guess.
- Is installable from a direct ZIP URL or via Clawhub.
clawhub install scrapling-official
This directly addresses the most engaged community request, Issue #142 "OpenClaw Support", which asked for first-class OpenClaw skill packaging. The skill also bundles runnable examples in agent-skill/Scrapling-Skill/examples/README.md, each example pairing with one of the four escalation tiers.
3.3 Translator (CSS → XPath)
A small but important AI-adjacent utility is the selector translator at scrapling/core/translator.py. It is adapted from Parsel (BSD-licensed) and converts CSS selectors into XPath expressions. This is useful both for humans migrating selectors and for AI agents that need a normalized representation when reasoning over Scrapling's Selector objects.
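To show what such a translation involves, here is a deliberately tiny converter that handles only an optional tag followed by `.class` and `#id` parts. It is a from-scratch sketch, not the Parsel-derived translator; the `css_to_xpath` name and its output style are assumptions.

```python
import re

# Matches an optional tag name followed by any number of .class / #id parts.
_SIMPLE = re.compile(r"^(?P<tag>[a-zA-Z][\w-]*)?(?P<rest>(?:[.#][\w-]+)*)$")


def css_to_xpath(selector: str) -> str:
    """Translate a simple CSS selector (tag, .class, #id) into an XPath expression."""
    m = _SIMPLE.match(selector.strip())
    if not m or not (m.group("tag") or m.group("rest")):
        raise ValueError(f"unsupported selector: {selector!r}")
    tag = m.group("tag") or "*"
    conditions = []
    for kind, name in re.findall(r"([.#])([\w-]+)", m.group("rest")):
        if kind == "#":
            conditions.append(f"@id='{name}'")
        else:
            # Whole-word class match: pad with spaces so 'quote' does not match 'quotes'.
            conditions.append(f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')")
    return f"//{tag}" + "".join(f"[{c}]" for c in conditions)


print(css_to_xpath("div.quote"))
print(css_to_xpath("#main"))
```

Combinators, attribute selectors, and pseudo-elements are out of scope here, which is why reusing a mature translator is preferable to writing one.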
4. Component Map and Community Context
The diagram below shows how the four surfaces share the same parser and fetcher core, and how the Agent Skill packages documentation for LLM agents.
flowchart LR
subgraph Users
Dev[Developer]
REPL[Power user]
Agent[AI Agent / OpenClaw / Claude Code]
end
CLI[scrapling CLI<br/>scrapling/cli.py]
Shell[Interactive shell<br/>scrapling/core/shell.py]
AI[AI module + MCP<br/>scrapling/core/ai.py]
Skill[Agent Skill bundle<br/>agent-skill/]
Trans[Translator<br/>scrapling/core/translator.py]
Core[Parser + Fetchers core]
Dev --> CLI
REPL --> Shell
Agent --> Skill
Agent --> AI
CLI --> Core
Shell --> Core
AI --> Core
Skill -.docs.-> Agent
Trans --> Core
Community context worth keeping in mind:
- #142 (OpenClaw Support): The agent-skill/ package with the clawhub install scrapling-official workflow is the direct response to this request. Source: agent-skill/README.md.
- #159 (Listening to browser requests): Tied to the fetcher/stealthy surface rather than the AI surface; noted here as adjacent work.
- #82 (Automatic pagination): Could be exposed through the Agent Skill as a documented helper in a future release; not yet shipped.
See Also
- Fetchers and Sessions: the runtime layer the CLI installs browsers for.
- Parser and Selection Methods: what the interactive shell preloads.
- Spider Architecture: the multi-session crawling framework.
- MCP Server: the LLM-facing server backed by scrapling/core/ai.py.
- Agent Skill specification: format followed by agent-skill/Scrapling-Skill/.
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 12 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/294
2. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/350
3. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/299
4. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/348
5. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/D4Vinci/Scrapling
6. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/349
7. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/D4Vinci/Scrapling
8. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/D4Vinci/Scrapling
9. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | https://github.com/D4Vinci/Scrapling
10. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/295
11. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/D4Vinci/Scrapling
12. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: release_recency=unknown
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/D4Vinci/Scrapling
Source: Doramagic discovery, validation, and Project Pack records
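The recommended check repeated throughout the log (reproduce the install in an isolated environment) can be sketched with the Python standard library alone. This is a minimal sketch, not an official procedure: it assumes the package installs from PyPI under the name `scrapling` and imports under the same name, and the helper names `venv_python`, `install_commands`, and `reproduce_install` are hypothetical.

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def venv_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_commands(python: Path, package: str = "scrapling") -> list[list[str]]:
    """Build the install and import smoke-test commands (no side effects).

    Assumption: the PyPI name and the import name are both `package`.
    """
    return [
        [str(python), "-m", "pip", "install", package],
        [str(python), "-c", f"import {package}"],
    ]


def reproduce_install(package: str = "scrapling") -> bool:
    """Create a throwaway venv, install the package, and try importing it.

    Needs network access; the environment is deleted when the function returns.
    """
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "env"
        venv.create(env_dir, with_pip=True)
        for cmd in install_commands(venv_python(env_dir), package):
            if subprocess.run(cmd).returncode != 0:
                return False
    return True
```

Calling `reproduce_install()` returns `True` only if both the install and the import succeed. Because the environment lives in a temporary directory, a failed run leaves nothing behind on the host interpreter. Browser-based fetchers may need extra setup steps that this sketch does not cover.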
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using Scrapling with real data or production workflows.
- No context persistence for code loaded using init_script - github / github_issue
- [[BUG] [Good First Issue] LinkExtractor never filters .tar.gz links](https://github.com/D4Vinci/Scrapling/issues/349) - github / github_issue
- linux chrome auto closed? - github / github_issue
- Allow specifying a custom Chromium browser - github / github_issue
- Community source 5 - github / github_issue
- [[Bug]Session-level proxy silently ignored, leaks real IP](https://github.com/D4Vinci/Scrapling/issues/295) - github / github_issue
- agent skill marketplace - github / github_issue
- [[Feature Request] Add `--version` flag to CLI](https://github.com/D4Vinci/Scrapling/issues/299) - github / github_issue
- Bug Report: `init_script` + `user_data_dir` causes `ERR_NAME_NOT_RESOLV…` - github / github_issue
- Release v0.4.9 - github / github_release
- Release v0.4.8 - github / github_release
- Release v0.4.7 - github / github_release
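One of the linked issues reports that `LinkExtractor` never filters `.tar.gz` links. Until that report is verified against your installed version, a crawl can defensively post-filter extracted URLs. The sketch below is an illustration under stated assumptions, not Scrapling's implementation: `DENIED_SUFFIXES`, `has_denied_suffix`, and `filter_links` are hypothetical names, and the deny list is an example. It matches on the full path suffix because single-suffix checks such as `os.path.splitext` see only `.gz` in `archive.tar.gz`.

```python
from urllib.parse import urlparse

# Hypothetical deny list for illustration; not taken from Scrapling's source.
DENIED_SUFFIXES = (".tar.gz", ".zip", ".pdf")


def has_denied_suffix(url: str, denied: tuple[str, ...] = DENIED_SUFFIXES) -> bool:
    """Return True when the URL path ends with a denied suffix.

    Only the path is inspected, so query strings and fragments are ignored,
    and the comparison is case-insensitive.
    """
    path = urlparse(url).path.lower()
    return path.endswith(denied)  # str.endswith accepts a tuple of suffixes


def filter_links(urls: list[str]) -> list[str]:
    """Keep only the URLs whose path does not end with a denied suffix."""
    return [u for u in urls if not has_denied_suffix(u)]
```

Applying `filter_links` to the URLs a spider has collected, before scheduling them, keeps archive downloads out of the queue regardless of how the upstream extractor behaves.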
Source: Project Pack community evidence and pitfall evidence