# https://github.com/D4Vinci/Scrapling Project Manual

Generated at: 2026-06-19 20:24:15 UTC

## Table of Contents

- [Scrapling Overview and System Architecture](#page-1)
- [Fetchers and the Stealth Engine](#page-2)
- [Spider Framework and Crawling Infrastructure](#page-3)
- [AI Integration, CLI, Interactive Shell, and Agent Skill](#page-4)

<a id='page-1'></a>

## Scrapling Overview and System Architecture

### Related Pages

Related topics: [Fetchers and the Stealth Engine](#page-2), [Spider Framework and Crawling Infrastructure](#page-3), [AI Integration, CLI, Interactive Shell, and Agent Skill](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)
- [agent-skill/README.md](https://github.com/D4Vinci/Scrapling/blob/main/agent-skill/README.md)
- [agent-skill/Scrapling-Skill/examples/README.md](https://github.com/D4Vinci/Scrapling/blob/main/agent-skill/Scrapling-Skill/examples/README.md)
- [scrapling/core/utils/_utils.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/utils/_utils.py)
</details>

# Scrapling Overview and System Architecture

## Introduction and Purpose

Scrapling is an adaptive web scraping framework that scales from a single HTTP request to a full, concurrent crawl. As described in the project README, the framework "handles everything from a single request to a full-scale crawl," with three distinguishing capabilities: an adaptive parser that relocates elements when a site redesigns, fetchers that bypass modern anti-bot systems (including Cloudflare Turnstile) out of the box, and a Scrapy-like spider framework for multi-session crawls. Source: [README.md:1-15]()

The framework is positioned for both ad-hoc scripters and production-scale operators. The README frames it as "Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone." Source: [README.md:10-15]() Performance is a primary design goal: the parser benchmark (averaging 100+ runs) shows Scrapling at 2.02 ms versus 1584.31 ms for BeautifulSoup with lxml, a ~784× speedup. Source: [README.md:55-70]()

## High-Level Architecture

Scrapling's design separates concerns into four cooperating subsystems: a parser engine, a layered set of fetchers, a spider framework, and a CLI/MCP integration layer. The following diagram shows how user code flows through these subsystems and the typical escalation path between fetchers.

```mermaid
flowchart TD
    A[User Code: Python or CLI] --> B{Choose Fetcher Layer}
    B -->|Lightweight HTTP| C[Fetcher / FetcherSession / AsyncFetcher]
    B -->|JS rendering required| D[DynamicFetcher / DynamicSession]
    B -->|Anti-bot bypass required| E[StealthyFetcher / StealthySession]
    C --> F[Response / Page Object]
    D --> F
    E --> F
    F --> G[Adaptive Parser<br/>css / xpath / regex]
    G -->|auto_save=True| H[(Element Store)]
    G -->|adaptive=True| I[Relocated Elements]
    F --> J[Spider Framework]
    J --> K[Concurrent Crawl +<br/>Pause/Resume + Proxy Rotation]
    H --> L[Structured Output]
    I --> L
    K --> L
```

The escalation guide shown in the examples README makes this layering explicit: start with `Fetcher`/`FetcherSession`, escalate to `DynamicSession` when JavaScript is required, escalate again to `StealthySession` when the site blocks the request, and finally use the `Spider` for multi-page crawls. Source: [agent-skill/Scrapling-Skill/examples/README.md:40-50]()
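
The escalation path can be sketched as a small helper that tries each tier in order and stops at the first usable result. This is an illustration only and not part of Scrapling: `fetch_with_escalation`, the `looks_ok` check, and the stub tiers are hypothetical names. In real use each callable would presumably wrap a fetcher call such as `Fetcher.get` or `StealthyFetcher.fetch`.

```python
from typing import Callable, Iterable, Optional, Tuple

# A tier is a (label, fetch_function) pair. The fetch function takes a URL and
# returns page text, or raises on failure. All names here are hypothetical.
Tier = Tuple[str, Callable[[str], str]]


def fetch_with_escalation(
    url: str,
    tiers: Iterable[Tier],
    looks_ok: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each tier in order and return (label, body) for the first good result."""
    for label, fetch in tiers:
        try:
            body = fetch(url)
        except Exception:
            continue  # treat any error as "escalate to the next tier"
        if looks_ok(body):
            return label, body
    return None, None


# Stub tiers standing in for real fetchers, cheapest first.
def plain_http(url: str) -> str:
    return "<html><body></body></html>"  # empty shell: JavaScript was required


def browser(url: str) -> str:
    raise RuntimeError("blocked by anti-bot page")


def stealth_browser(url: str) -> str:
    return "<html><body><h1>Products</h1></body></html>"


tiers = [("http", plain_http), ("dynamic", browser), ("stealth", stealth_browser)]
label, body = fetch_with_escalation("https://example.com", tiers, lambda b: "<h1>" in b)
print(label)  # stealth
```

Keeping the tiers as plain callables means the ordering and the success check stay in one place, which makes the cost trade-off explicit.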

## Core Components

### Parser Engine

The parser is an lxml-based engine that exposes a Scrapy/Parsel-compatible API. The README notes that Scrapling includes "Code adapted from Parsel (BSD License) — Used for translator submodule," indicating that the CSS/XPath translator layer is reused rather than reimplemented. Source: [README.md:175-180]() Three features are highlighted in the README: enhanced text processing with regex and cleaning helpers, auto selector generation that produces robust CSS/XPath selectors, and a familiar API that uses the same pseudo-elements as Scrapy/Parsel. Source: [README.md:5-15]()

The adaptive element-finding feature is the parser's most distinctive capability. Setting `auto_save=True` on a selector call persists the matched element's signature; later, when the same call is invoked with `adaptive=True`, the parser uses element similarity scoring to relocate the element even after structural changes. The benchmark table shows this similarity search at 2.39 ms versus 12.45 ms for AutoScraper, a 5.2× speedup. Source: [README.md:75-85]()
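
The idea of saving a signature and later relocating by similarity can be illustrated with a toy model. This sketch does not reproduce Scrapling's scoring, which this page does not describe. It substitutes `difflib.SequenceMatcher` from the standard library, and the `signature` and `relocate` helpers and the dict-based elements are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import List


def signature(element: dict) -> str:
    """Flatten a simplified element (tag, attributes, text) into one string."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(element["attrs"].items()))
    return f'{element["tag"]} {attrs} {element["text"]}'


def relocate(saved: dict, candidates: List[dict]) -> dict:
    """Return the candidate whose signature is most similar to the saved one."""
    target = signature(saved)
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, target, signature(c)).ratio(),
    )


# "Save" step: remember what the element looked like before the redesign.
saved = {"tag": "div", "attrs": {"class": "product", "id": "p1"}, "text": "Blue Widget $10"}

# After a redesign the class name changed, so a `.product` selector finds nothing.
new_page = [
    {"tag": "nav", "attrs": {"class": "menu"}, "text": "Home About"},
    {"tag": "div", "attrs": {"class": "product-card", "id": "p1"}, "text": "Blue Widget $12"},
    {"tag": "footer", "attrs": {"class": "legal"}, "text": "Copyright"},
]

best = relocate(saved, new_page)
print(best["attrs"]["class"])  # product-card
```

The saved signature acts as a fallback key: when the original selector stops matching, the closest surviving element is chosen instead of returning nothing.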

### Fetcher Layer

Four fetchers are exposed through `scrapling.fetchers`: `Fetcher`, `AsyncFetcher`, `StealthyFetcher`, and `DynamicFetcher`. Session classes reuse a connection or browser across requests: `FetcherSession` for HTTP (usable in both sync and async code), `StealthySession` and `AsyncStealthySession` for the stealth browser, and `DynamicSession` and `AsyncDynamicSession` for the plain browser. The basic usage in the README demonstrates both the one-off and the session pattern. Source: [README.md:25-40]()

The stealth fetcher is the most powerful member of this layer. As shown in the README's example, `StealthyFetcher.fetch(..., headless=True, network_idle=True)` can be used to fetch pages "under the radar." A known limitation tracked in the community is that the embedded Turnstile variant can cause `StealthyFetcher` to wait indefinitely. Source: [community issue #100]() This is an open failure mode worth noting for users scraping Turnstile-protected pages.

### Spider Framework

The spider framework is a Scrapy-compatible API with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects. Key features listed in the README include configurable concurrency limits, per-domain throttling, download delays, multi-session support (HTTP and stealth headless browsers in a single spider, routed by session ID), checkpoint-based pause/resume (Ctrl+C triggers a graceful shutdown; restart resumes from the last checkpoint), and a streaming mode via `async for item in spider.stream()`. Source: [README.md:90-115]()

Spiders also implement automatic blocked-request detection and retry, and an optional `robots_txt_obey` flag that respects `Disallow`, `Crawl-delay`, and `Request-rate` directives with per-domain caching. Source: [README.md:115-120]()
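
To make the three robots.txt directives concrete, the following sketch parses a small robots.txt with Python's standard `urllib.robotparser`. It shows what the directives mean, not how Scrapling evaluates them internally. The robots.txt content and the `demo-bot` user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Request-rate: 5/10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse text directly; no network access

agent = "demo-bot"  # hypothetical user-agent string
allowed_public = parser.can_fetch(agent, "https://example.com/catalog")
allowed_private = parser.can_fetch(agent, "https://example.com/private/report")
delay = parser.crawl_delay(agent)  # seconds to wait between requests
rate = parser.request_rate(agent)  # named tuple: (requests, seconds)

print(allowed_public, allowed_private, delay, rate.requests, rate.seconds)
# True False 2 5 10
```

A crawler that obeys these rules would skip `/private/`, wait at least two seconds between requests, and send no more than five requests per ten seconds to that host.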

### CLI, MCP, and Agent Skills

The CLI is the entry point for installation and quick extraction. After `pip install "scrapling[fetchers]"`, users run `scrapling install` (or `scrapling install --force`) to download browsers, system dependencies, and fingerprint-manipulation libraries. The `scrapling extract` subcommand supports three fetch modes — `get`, `fetch`, and `stealthy-fetch` — and writes Markdown or plain-text output, with optional `--css-selector` filtering and `--impersonate chrome` for TLS fingerprint spoofing. Source: [README.md:140-160]()
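
When `scrapling extract` is driven from a script, building the argument list explicitly avoids shell-quoting mistakes. The helper below is a hypothetical sketch. `build_extract_command` is not part of Scrapling, and the positional order (mode, URL, output file) is an assumption based on the README examples. It does not check which flags each mode accepts.

```python
import shlex
from typing import List, Optional


def build_extract_command(
    mode: str,
    url: str,
    output_file: str,
    css_selector: Optional[str] = None,
    impersonate: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `scrapling extract`; nothing is executed here."""
    if mode not in {"get", "fetch", "stealthy-fetch"}:
        raise ValueError(f"unknown fetch mode: {mode}")
    argv = ["scrapling", "extract", mode, url, output_file]
    if css_selector:
        argv += ["--css-selector", css_selector]
    if impersonate:
        argv += ["--impersonate", impersonate]
    return argv


argv = build_extract_command(
    "get",
    "https://example.com",
    "content.md",
    css_selector="#main",
    impersonate="chrome",
)
print(shlex.join(argv))  # shell-safe rendering, useful for logging
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once Scrapling and its browsers are installed.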

The project also ships an Agent Skill package under `agent-skill/` that "encapsulates almost all of the documentation website's content in Markdown" and conforms to the [AgentSkill](https://agentskills.io/specification) specification. The skill is consumable by OpenClaw, Claude Code, and other agentic tools, and is published on [Clawhub](https://clawhub.ai/D4Vinci/scrapling-official) where it can be installed with `clawhub install scrapling-official`. Source: [agent-skill/README.md:1-15]() Community discussion in issue #142 requested explicit OpenClaw support; the published skill is the implementation of that request.

## Installation and Observability

Scrapling requires Python 3.10 or higher. The base `pip install scrapling` only includes the parser engine; fetchers, spiders, and CLI tools require `pip install "scrapling[fetchers]"` followed by `scrapling install` to fetch browser binaries. Source: [README.md:80-105]()

A `LoggerProxy` is provided in `scrapling.core.utils._utils` to expose the standard `logging.Logger` interface through a context variable, enabling per-task log routing in async crawls. The default logger uses a single console handler with the format `[%(asctime)s] %(levelname)s: %(message)s` at INFO level. Source: [scrapling/core/utils/_utils.py:20-55]()
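
The proxy-over-context-variable pattern can be sketched with the standard library alone. This is a minimal model of the idea rather than Scrapling's actual `LoggerProxy`: the `_current_logger` variable, the `demo` logger names, and the `worker` coroutine are assumptions for illustration.

```python
import asyncio
import logging
from contextvars import ContextVar

_default = logging.getLogger("demo")
_current_logger = ContextVar("current_logger", default=_default)


class LoggerProxy:
    """Forward attribute access to whichever logger is active in this context."""

    def __getattr__(self, name: str):
        return getattr(_current_logger.get(), name)


log = LoggerProxy()


async def worker(task_name: str) -> str:
    # Each asyncio task runs in its own copy of the context, so this
    # assignment is invisible to sibling tasks and to the caller.
    _current_logger.set(logging.getLogger(f"demo.{task_name}"))
    await asyncio.sleep(0)  # yield control so the tasks interleave
    return log.name  # resolved through the proxy


async def main() -> list:
    return await asyncio.gather(worker("a"), worker("b"))


names = asyncio.run(main())
print(names, log.name)  # ['demo.a', 'demo.b'] demo
```

Because each task sees its own logger through the same `log` object, library code can keep a single module-level name while callers route output per task.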

## Community-Driven Roadmap Items

Several open feature requests signal where the architecture is evolving. Issue #159 requests a "listen to browser requests and get responses" capability, which would extend the fetcher layer with a request-interception hook (useful for sites that expose data only via XHR/fetch calls). Issue #82 requests automatic pagination-URL detection, which would extend the spider's link-following logic. Source: [community issues #159, #82]() Users evaluating Scrapling for production use should track these for future releases.

## See Also

- [Selection Methods and Adaptive Parser](selection-methods.md)
- [Fetcher Selection Guide](fetchers.md)
- [Spider Architecture](spider-architecture.md)
- [Proxy Rotation and Blocking Detection](proxy-rotation.md)
- [CLI Reference](cli-overview.md)
- [MCP Server Integration](mcp-server.md)

---

<a id='page-2'></a>

## Fetchers and the Stealth Engine

### Related Pages

Related topics: [Scrapling Overview and System Architecture](#page-1), [Spider Framework and Crawling Infrastructure](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [scrapling/fetchers/__init__.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/fetchers/__init__.py)
- [scrapling/fetchers/requests.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/fetchers/requests.py)
- [scrapling/fetchers/chrome.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/fetchers/chrome.py)
- [scrapling/fetchers/stealth_chrome.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/fetchers/stealth_chrome.py)
- [scrapling/engines/static.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/engines/static.py)
- [scrapling/engines/constants.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/engines/constants.py)
- [scrapling/core/utils/_utils.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/utils/_utils.py)
- [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)
- [agent-skill/Scrapling-Skill/examples/README.md](https://github.com/D4Vinci/Scrapling/blob/main/agent-skill/Scrapling-Skill/examples/README.md)
</details>

# Fetchers and the Stealth Engine

## 1. Purpose and Scope

The "Fetchers" subsystem is the network boundary of Scrapling. It encapsulates everything between a Python call and a remote HTTP/HTTPS endpoint, providing three escalating levels of fidelity:

1. A pure-HTTP fetcher based on `curl_cffi` with TLS fingerprint impersonation.
2. A real browser fetcher driven by Playwright/CDP for JavaScript-rendered pages.
3. A stealth-tuned browser fetcher that injects patches, rotates fingerprints, and can solve Cloudflare Turnstile challenges without third-party services.

The "Stealth Engine" is the subset of that stack used by `StealthyFetcher` / `StealthySession`. It re-uses the browser automation engine but layers anti-detection logic, optional captcha solving, and adaptive behavior on top. The two concepts are intentionally overlapping: a fetcher is the *interface* a developer calls, while the stealth engine is the *implementation strategy* used to satisfy anti-bot systems. Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md) and [scrapling/fetchers/__init__.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/fetchers/__init__.py).

## 2. The Fetcher Hierarchy

All fetchers return a `Response`-like object, but they differ in cost and capability. The examples shipped with the project's Agent Skill recommend an "escalation" approach: start with the cheapest fetcher and step up only when the page fails to load correctly. Source: [agent-skill/Scrapling-Skill/examples/README.md](https://github.com/D4Vinci/Scrapling/blob/main/agent-skill/Scrapling-Skill/examples/README.md).

```mermaid
flowchart TD
    A[Need data from URL] --> B{Plain HTTP enough?}
    B -- Yes --> C[Fetcher / FetcherSession<br/>curl_cffi TLS impersonation]
    B -- No --> D{JS rendering needed?}
    D -- Yes --> E[DynamicFetcher / DynamicSession<br/>Playwright/Chromium]
    D -- No --> F{Anti-bot blocking?}
    F -- Yes --> G[StealthyFetcher / StealthySession<br/>Stealth Engine + Turnstile solver]
    E -. still blocked .-> G
    C --> H[Response]
    E --> H
    G --> H
    H --> I[Adaptive Parser<br/>auto_save / adaptive=True]
```

The same escalation model is exposed in the CLI as `extract get`, `extract fetch`, and `extract stealthy-fetch`. Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md).

### 2.1 `Fetcher` and `FetcherSession` — Static HTTP

`Fetcher` is the lightweight entry point. It performs a single request through the `curl_cffi` transport, optionally impersonating a real browser's TLS fingerprint (e.g. `impersonate='chrome'`). When you need connection reuse — cookies, keep-alive, multiple requests in a loop — wrap it in `FetcherSession` and use `session.get(...)`. Source: [scrapling/fetchers/requests.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/fetchers/requests.py).

The session also accepts a `stealthy_headers=True` flag, which is a cheap compromise: it does not run a browser, but it attaches header order and values that mimic a recent Chrome build. This is often sufficient against naive bot detectors and is significantly cheaper than launching Chromium. Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md).
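
A session-based fetch, following the README's basic usage, might look like the sketch below. The README example passes `stealthy_headers=True` on the request call, and this sketch does the same. The `scrape_quotes` wrapper is a hypothetical name, and calling it needs network access and an installed `scrapling[fetchers]`.

```python
from typing import List


def scrape_quotes(url: str = "https://quotes.toscrape.com/") -> List[str]:
    """Fetch one page through a reusable HTTP session and return the quote texts."""
    # Imported lazily so this module still loads where Scrapling is not installed.
    from scrapling.fetchers import FetcherSession

    # The session keeps cookies and connections alive across requests.
    with FetcherSession(impersonate="chrome") as session:
        page = session.get(url, stealthy_headers=True)
        return page.css(".quote .text::text").getall()
```

Further `session.get(...)` calls inside the same `with` block would reuse the connection, which is the main reason to prefer the session over repeated one-off requests.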

### 2.2 `DynamicFetcher` and `DynamicSession` — Browser Automation

When the target page renders content client-side (SPAs, infinite scroll, lazy-loaded sections), you need a real browser. `DynamicFetcher` drives a headless Chromium via Playwright/CDP and returns the post-render DOM. Important toggles include `headless`, `network_idle` (wait for the network to settle), and `disable_resources` (skip stylesheets/images to save bandwidth). For async code, use `DynamicFetcher.async_fetch(...)` or the `AsyncDynamicSession` class. Source: [scrapling/fetchers/chrome.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/fetchers/chrome.py) and [scrapling/engines/static.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/engines/static.py).

### 2.3 `StealthyFetcher` and `StealthySession` — The Stealth Engine

This is the highest-cost, highest-success-rate tier. It launches a stealth-patched Chromium and applies a chain of anti-fingerprint techniques before navigation. Key features:

- `solve_cloudflare=True` — solves Cloudflare Turnstile (and related human-challenge pages) in-page, returning the cleared DOM. Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md).
- `headless=True` — runs the browser in headless mode while preserving the same fingerprint set as headed mode.
- `adaptive=True` — enables adaptive relocation. The README sets it at class level (`StealthyFetcher.adaptive = True`); a selector call made with `auto_save=True` then persists the element's signature, and a later call with `adaptive=True` relocates the element after a redesign.
- `google_search=False` — turns off the default behaviour of setting the `Referer` header as if the request arrived from a Google search for the target domain.

**Known limitation (community issue #100):** On the "embedded" variant of the Cloudflare Turnstile widget, `StealthyFetcher` can wait indefinitely because the embedded iframe does not emit the same challenge-completion signal that the standalone widget does. The workaround is to pass a stricter `wait_selector` or to use `DynamicFetcher` with a manual sleep, but a first-class fix is still pending. Source: community issue #100.
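
Until a fix lands, one defensive option is to bound how long the caller waits for any single fetch. The helper below is a generic standard-library sketch, not a Scrapling feature. `call_with_deadline` and the stub functions are hypothetical, and in practice `fn` would wrap the real fetch call. A stuck worker thread cannot be killed, so this limits only the caller's wait.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_deadline(fn: Callable[[], T], seconds: float) -> Optional[T]:
    """Run fn in a worker thread; return None if it has not finished in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        return None
    finally:
        # Do not block on a stuck worker; the thread keeps running in the background.
        pool.shutdown(wait=False)


def quick() -> str:
    return "cleared page"


def stuck() -> str:
    time.sleep(0.5)  # stands in for a fetch that never sees the completion signal
    return "too late"


print(call_with_deadline(quick, 1.0))   # cleared page
print(call_with_deadline(stuck, 0.05))  # None
```

A `None` result can then trigger a fallback, such as retrying with a different fetcher or recording the URL for later.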

## 3. Architecture and Data Flow

When a `StealthyFetcher.fetch(...)` call is made, the request travels through the following layers:

```mermaid
sequenceDiagram
    participant U as User Code
    participant F as StealthyFetcher
    participant E as Stealth Engine
    participant B as Chromium (CDP)
    participant W as Target Site
    U->>F: fetch(url, solve_cloudflare=True)
    F->>E: configure stealth args, fingerprints
    E->>B: launch patched browser context
    B->>W: GET url (with spoofed fingerprint)
    W-->>B: HTML + Turnstile iframe
    E->>B: detect challenge, run solver
    B->>W: post challenge token
    W-->>B: cleared HTML
    B-->>E: serialized DOM
    E-->>F: Response
    F-->>U: Response (page.css, page.find_by_text, ...)
```

The stealth engine itself is a thin orchestrator: it does not invent new browser behavior, it composes existing tools (browser launch arguments, header sets, fingerprint databases) into a single, reproducible configuration. Constants such as default timeouts, user-agent pools, and challenge-detection selectors live in [scrapling/engines/constants.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/engines/constants.py), and the logger used to emit diagnostic events is configured in [scrapling/core/utils/_utils.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/utils/_utils.py) — the `setup_logger()` helper is wrapped in `lru_cache(1, typed=True)` to enforce a singleton pattern across the process.
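
The caching trick mentioned above is easy to reproduce in isolation. The sketch below shows how `lru_cache` turns a setup function into a run-once initializer. The logger name `demo_scraper` is hypothetical, and the body is a simplified stand-in for the real `setup_logger()`.

```python
import logging
from functools import lru_cache


@lru_cache(1, typed=True)
def setup_logger() -> logging.Logger:
    """Create and configure the logger once; later calls return the cached object."""
    logger = logging.getLogger("demo_scraper")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("[%(asctime)s] %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger


first = setup_logger()
second = setup_logger()
print(first is second, len(first.handlers))  # True 1
```

Without the cache, every call would attach another handler to the same named logger and each message would be printed several times.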

## 4. Common Failure Modes and Mitigations

| Symptom | Likely Cause | Mitigation |
|---|---|---|
| `StealthyFetcher` hangs on Turnstile | Embedded widget variant doesn't emit completion signal (issue #100) | Pass an explicit `wait_selector`, or fall back to `DynamicFetcher` with a manual wait; track upstream fix |
| Browser launches but page returns blank | JS-rendered SPA blocked by anti-bot | Escalate from `DynamicFetcher` to `StealthyFetcher` |
| TLS impersonation rejected | Server pins a specific browser version | Update with `scrapling install --force` to refresh fingerprints (per release v0.4.9 notes) |
| Need to inspect intermediate requests | StealthyFetcher does not currently expose a request-interception hook (issue #159) | Drop to `DynamicFetcher` and use Playwright's `page.on("request", ...)` directly |
| Crawling paginated listings | Manual `response.follow()` required (issue #82) | Build a `Spider` subclass with an explicit next-page check |

Issue #159 ("listening to browser requests and getting responses") is an active feature request that, if implemented, would add a first-class event hook to the stealth engine so that callers can filter traffic without dropping to raw Playwright. Issue #82 ("automatic pagination detection") is orthogonal to the stealth engine but lives in the same response-handling layer.

## See Also

- Adaptive Parser (auto-save and adaptive selection)
- Spider Framework (multi-session routing, pause/resume)
- CLI: `scrapling extract` and `scrapling install`
- Agent Skill (OpenClaw / Claude Code integration — issue #142)

---

<a id='page-3'></a>

## Spider Framework and Crawling Infrastructure

### Related Pages

Related topics: [Scrapling Overview and System Architecture](#page-1), [Fetchers and the Stealth Engine](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [scrapling/spiders/__init__.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/__init__.py)
- [scrapling/spiders/spider.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/spider.py)
- [scrapling/spiders/engine.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/engine.py)
- [scrapling/spiders/request.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/request.py)
- [scrapling/spiders/session.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/session.py)
- [scrapling/spiders/scheduler.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/scheduler.py)
- [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)
- [agent-skill/Scrapling-Skill/examples/README.md](https://github.com/D4Vinci/Scrapling/blob/main/agent-skill/Scrapling-Skill/examples/README.md)
- [scrapling/core/utils/_utils.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/utils/_utils.py)
</details>

# Spider Framework and Crawling Infrastructure

## Overview and Purpose

The Spider Framework is the highest-level orchestration layer of Scrapling. While the fetchers (`Fetcher`, `StealthyFetcher`, `DynamicFetcher`) handle individual page requests and the parser handles selection, the spider subsystem exists to scale those primitives into concurrent, stateful, multi-page crawls. It is the recommended tool when a job requires more than one HTTP exchange, multiple session profiles, or resumable execution (Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).

The framework is described in the README as "A Full Crawling Framework" and exposes a Scrapy-like API: developers subclass `Spider`, define `start_urls` and an `async parse(response)` callback, and the engine takes over scheduling, dispatching, and persistence (Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)). The official Agent Skill documentation reinforces this by positioning the Spider as the top rung of an escalation ladder:

```
get / FetcherSession
  └─ If JS required → fetch / DynamicSession
       └─ If blocked → stealthy-fetch / StealthySession
            └─ If multi-page → Spider
```
(Source: [agent-skill/Scrapling-Skill/examples/README.md](https://github.com/D4Vinci/Scrapling/blob/main/agent-skill/Scrapling-Skill/examples/README.md))

## Core Components and Architecture

The subsystem is organised into a small set of cooperating modules under `scrapling/spiders/`:

| Module | Responsibility |
|--------|----------------|
| `__init__.py` | Public re-exports of `Spider`, `Request`, and `Response` for `from scrapling.spiders import …` |
| `spider.py` | User-facing `Spider` base class with class-level configuration (`name`, `start_urls`, `concurrent_requests`) |
| `engine.py` | Async execution loop that drives request scheduling, dispatches fetches, and invokes callbacks |
| `request.py` | The `Request` object used to enqueue URLs and carry session/priority metadata |
| `session.py` | Session Manager that registers named sessions (e.g. `fast`, `stealth`) and routes requests by ID |
| `scheduler.py` | Queue, deduplication, and throttling logic |

(Source: [scrapling/spiders/__init__.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/__init__.py), [scrapling/spiders/spider.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/spider.py), [scrapling/spiders/engine.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/engine.py), [scrapling/spiders/request.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/request.py), [scrapling/spiders/session.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/session.py), [scrapling/spiders/scheduler.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/spiders/scheduler.py))

### Request Lifecycle

```mermaid
flowchart LR
    A[Spider subclass<br/>start_urls] --> B[Scheduler<br/>dedup + throttle]
    B --> C[Engine<br/>async loop]
    C --> D{Session Manager<br/>pick by id}
    D -->|fast| E[FetcherSession]
    D -->|stealth| F[AsyncStealthySession]
    E --> G[Response]
    F --> G[Response]
    G --> H[parse callback]
    H -->|yield dict| I[Items collection]
    H -->|yield Request| B
    C -->|Ctrl+C| J[Checkpoint<br/>pause/resume]
    I --> K[result.items.to_json/.to_jsonl]
```

The spider's `parse()` callback is an async generator: it yields plain Python dicts for scraped items, and `Request` objects that are pushed back into the scheduler for further traversal. This is the documented pattern in the README's multi-page quotes example (Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).
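
The lifecycle can be modelled in a few lines of plain `asyncio`. This is a deliberately simplified, sequential model and not Scrapling's engine. The `Request` dataclass, the in-memory `PAGES` site, and the `crawl` loop are stand-ins invented for the example.

```python
import asyncio
from collections import deque
from dataclasses import dataclass
from typing import AsyncIterator, Dict, List, Union

# Fake site: URL -> (items on the page, link to the next page or None).
PAGES: Dict[str, tuple] = {
    "/page/1": (["quote A", "quote B"], "/page/2"),
    "/page/2": (["quote C"], None),
}


@dataclass(frozen=True)
class Request:
    url: str


async def parse(url: str) -> AsyncIterator[Union[dict, Request]]:
    """Yield one dict per item, then a Request for the next page if there is one."""
    items, next_url = PAGES[url]
    for text in items:
        yield {"text": text, "page": url}
    if next_url:
        yield Request(next_url)


async def crawl(start_urls: List[str]) -> List[dict]:
    queue, seen, items = deque(start_urls), set(start_urls), []
    while queue:
        url = queue.popleft()
        async for output in parse(url):
            if isinstance(output, Request):
                if output.url not in seen:  # scheduler-style deduplication
                    seen.add(output.url)
                    queue.append(output.url)
            else:
                items.append(output)  # terminal item
    return items


items = asyncio.run(crawl(["/page/1"]))
print(len(items))  # 3
```

The loop only needs to distinguish two output types: items are collected, requests go back into the queue.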

## Configuration and Common Patterns

### Concurrency and Throttling

`Spider.concurrent_requests` (set to `10` in the README example) caps the number of requests in flight at once. Per-domain throttling and download delays are applied by the scheduler so a single aggressive target does not starve other hosts (Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).
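
The effect of a concurrency cap can be demonstrated with an `asyncio.Semaphore`. This sketch covers only the in-flight limit, not per-domain throttling. The `fetch` and `crawl` functions and the simulated latency are assumptions for illustration and say nothing about how Scrapling's scheduler is built.

```python
import asyncio
from typing import List


async def fetch(url: str, limit: asyncio.Semaphore, stats: dict) -> str:
    async with limit:  # wait here while the pool is full
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated network latency
        stats["in_flight"] -= 1
        return f"done:{url}"


async def crawl(urls: List[str], concurrent_requests: int) -> dict:
    limit = asyncio.Semaphore(concurrent_requests)
    stats = {"in_flight": 0, "peak": 0}
    stats["results"] = await asyncio.gather(*(fetch(u, limit, stats) for u in urls))
    return stats


urls = [f"https://example.com/{i}" for i in range(10)]
stats = asyncio.run(crawl(urls, concurrent_requests=3))
print(stats["peak"], len(stats["results"]))  # 3 10
```

All ten requests complete, but never more than three run at the same time.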

### Multi-Session Routing

Spiders can mix fast HTTP and stealthy headless browser sessions under a single crawl. The README's `MultiSessionSpider` example overrides `configure_sessions(self, manager)` to register profiles:

```python
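from scrapling.fetchers import FetcherSession, AsyncStealthySession

# Method of a Spider subclass; the rest of the class body is omitted here.
def configure_sessions(self, manager):
    manager.add("fast", FetcherSession(impersonate="chrome"))  # plain HTTP session
    manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)  # browser starts on first use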
```
The `lazy=True` flag defers browser startup until the first request that actually needs it. Requests are then routed by session ID, for example `Request(url, sid="stealth")` (Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).
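
The idea behind lazy registration can be shown with a small registry that stores factories and only calls them on first use. This is a conceptual sketch. `LazySessionRegistry` is hypothetical, and Scrapling's manager receives session objects rather than factories, so its internal mechanism may differ.

```python
from typing import Any, Callable, Dict, List


class LazySessionRegistry:
    """Map session IDs to sessions, creating lazy ones on first use."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Any]] = {}
        self._live: Dict[str, Any] = {}

    def add(self, session_id: str, factory: Callable[[], Any], lazy: bool = False) -> None:
        self._factories[session_id] = factory
        if not lazy:
            self._live[session_id] = factory()  # eager sessions start immediately

    def get(self, session_id: str) -> Any:
        if session_id not in self._live:
            self._live[session_id] = self._factories[session_id]()  # first real use
        return self._live[session_id]


started: List[str] = []  # records the order in which pretend sessions start


def make_session(kind: str) -> Callable[[], str]:
    def factory() -> str:
        started.append(kind)
        return f"{kind}-session"
    return factory


registry = LazySessionRegistry()
registry.add("fast", make_session("http"))
registry.add("stealth", make_session("browser"), lazy=True)
started_after_setup = list(started)   # only the eager session has started
registry.get("stealth")               # first request that needs the browser
print(started_after_setup, started)   # ['http'] ['http', 'browser']
```

Deferring the expensive session means a crawl that never reaches a protected page never pays the browser startup cost.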

### Streaming and Export

Long-running crawls can be consumed incrementally with `async for item in spider.stream()`, which surfaces real-time stats suitable for UIs or pipelines. Terminal export is built in: `result.items.to_json("quotes.json")` and `result.items.to_jsonl(...)` write the collected dicts without extra plumbing (Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).
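
Consuming items as they arrive and writing them as JSON Lines can be sketched without Scrapling. The `stream` generator below is a stand-in for a real crawl, and `export_jsonl` is a hypothetical helper. Only the `async for` consumption pattern mirrors the text above.

```python
import asyncio
import io
import json
from typing import AsyncIterator, TextIO


async def stream() -> AsyncIterator[dict]:
    """Stand-in for a crawl that produces items over time."""
    for i, text in enumerate(["quote A", "quote B", "quote C"]):
        await asyncio.sleep(0)  # pretend to wait on the network
        yield {"id": i, "text": text}


async def export_jsonl(out: TextIO) -> int:
    """Write each item as one JSON line as soon as it arrives."""
    count = 0
    async for item in stream():
        out.write(json.dumps(item) + "\n")
        count += 1
    return count


buffer = io.StringIO()
written = asyncio.run(export_jsonl(buffer))
lines = buffer.getvalue().splitlines()
print(written, json.loads(lines[0])["text"])  # 3 quote A
```

Writing one line per item keeps memory use flat and leaves a usable partial file if the crawl is interrupted.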

### Persistence, Robots, and Caching

- **Pause & Resume**: Pressing `Ctrl+C` triggers a graceful shutdown that writes a checkpoint; restarting the spider resumes from the same state (Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).
- **Robots.txt**: Setting `robots_txt_obey=True` makes the scheduler honour `Disallow`, `Crawl-delay`, and `Request-rate` directives with per-domain caching (Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).
- **Development Mode**: The first run caches responses to disk; later runs replay them so `parse()` can be iterated without re-hitting target servers (Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).
- **Blocked Request Detection**: The engine can auto-retry requests that look blocked, with user-customisable heuristics (Source: [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).
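
Checkpoint-based pause and resume comes down to persisting the pending queue and the set of seen URLs. The sketch below is a minimal stand-in and not Scrapling's checkpoint format. The file layout, function names, and the `stop_after` switch that simulates `Ctrl+C` are all assumptions.

```python
import json
import os
import tempfile
from typing import List, Optional, Set, Tuple


def save_checkpoint(path: str, pending: List[str], seen: Set[str]) -> None:
    """Write crawl state via a temp file so a partial write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"pending": pending, "seen": sorted(seen)}, fh)
    os.replace(tmp, path)


def load_checkpoint(path: str, start_urls: List[str]) -> Tuple[List[str], Set[str]]:
    """Resume from a checkpoint if one exists, otherwise start fresh."""
    if not os.path.exists(path):
        return list(start_urls), set(start_urls)
    with open(path, encoding="utf-8") as fh:
        state = json.load(fh)
    return state["pending"], set(state["seen"])


def crawl(path: str, start_urls: List[str], stop_after: Optional[int] = None) -> List[str]:
    pending, seen = load_checkpoint(path, start_urls)
    processed: List[str] = []
    try:
        while pending:
            if stop_after is not None and len(processed) >= stop_after:
                raise KeyboardInterrupt  # simulate the user pressing Ctrl+C
            processed.append(pending.pop(0))
    except KeyboardInterrupt:
        save_checkpoint(path, pending, seen)  # graceful shutdown
    else:
        if os.path.exists(path):
            os.remove(path)  # finished cleanly; nothing left to resume
    return processed


checkpoint = os.path.join(tempfile.mkdtemp(), "crawl.json")
urls = ["/a", "/b", "/c"]
first_run = crawl(checkpoint, urls, stop_after=1)  # interrupted after one URL
second_run = crawl(checkpoint, urls)               # picks up the remaining URLs
print(first_run, second_run)  # ['/a'] ['/b', '/c']
```

The second run ignores `start_urls` because a checkpoint exists, which is what keeps already-processed pages from being fetched again.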

### Logging

Framework diagnostics are emitted through a dedicated `scrapling` logger configured by `setup_logger()` in core utilities. The logger is exposed via a `ContextVar` and a `LoggerProxy`, which lets the spider engine swap log handlers per async context without mutating global state (Source: [scrapling/core/utils/_utils.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/utils/_utils.py)).

## Community Considerations and Limitations

The Spider Framework deliberately leaves a few high-level concerns to the developer. Community issue #82 ("Add functionality to automatically detect pagination URLs") notes that spiders currently follow pagination only when the developer explicitly yields a `Request` via `response.follow()`; there is no automatic discovery of next-page links (Source: [README.md follow example](https://github.com/D4Vinci/Scrapling/blob/main/README.md), community issue #82).
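
An explicit next-page check can be as small as looking for a `rel="next"` link and resolving it against the current URL. The sketch below uses only the standard library. `RelNextFinder` and `next_page_url` are hypothetical helpers, and many sites mark pagination differently, so the rule would need adapting per target.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class RelNextFinder(HTMLParser):
    """Remember the href of the first <a> or <link> tag carrying rel="next"."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.href is not None or tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        if "next" in (attributes.get("rel") or "").split():
            self.href = attributes.get("href")


def next_page_url(current_url: str, html: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    finder = RelNextFinder()
    finder.feed(html)
    return urljoin(current_url, finder.href) if finder.href else None


page_one = '<ul><li><a rel="next" href="/page/2/">Next</a></li></ul>'
last_page = '<ul><li><a href="/page/1/">Previous</a></li></ul>'

print(next_page_url("https://example.com/page/1/", page_one))   # https://example.com/page/2/
print(next_page_url("https://example.com/page/2/", last_page))  # None
```

Inside a spider, the returned URL would be yielded as a new request, and a `None` result ends the traversal.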

A related limitation surfaced in issue #100 is that `StealthyFetcher` can hang indefinitely on certain embedded Cloudflare Turnstile variants. Because the spider engine awaits each fetch it dispatches, a hung Turnstile fetch can tie up concurrency slots for the session ID it was routed to. A practical mitigation is to send only the pages that need it through the stealth session (registered with `lazy=True`) and to guard those requests with timeouts or custom blocked-request detection logic (Source: community issue #100; [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).

Issue #159 ("listening to browser requests and getting responses") requests a browser-side network interceptor; this is a spider-adjacent feature that does not currently exist in the spider module surface and would require engine-level hooks to be implemented safely (Source: community issue #159).

Finally, v0.4.9 added a `--version` CLI flag and refreshed all browser fingerprints, reminding users to run `scrapling install --force` so spider-managed stealth sessions pick up the new artefacts (Source: README release notes, [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)).

## See Also

- [Fetchers and Session Management](fetchers-and-sessions.md) — the HTTP/stealth/dynamic primitives spiders route through
- [Adaptive Parser and Selector Engine](parser-engine.md) — selection methods used inside `parse()`
- [CLI and MCP Server](cli-and-mcp.md) — `scrapling install --force`, the interactive shell, and MCP integration
- [Agent Skill for Scrapling](agent-skill.md) — pre-packaged documentation for Claude Code / OpenClaw

---

<a id='page-4'></a>

## AI Integration, CLI, Interactive Shell, and Agent Skill

### Related Pages

Related topics: [Scrapling Overview and System Architecture](#page-1), [Fetchers and the Stealth Engine](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [scrapling/cli.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/cli.py)
- [scrapling/core/ai.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/ai.py)
- [scrapling/core/shell.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/shell.py)
- [scrapling/core/_shell_signatures.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/_shell_signatures.py)
- [scrapling/core/translator.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py)
- [scrapling/core/utils/_shell.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/utils/_shell.py)
- [agent-skill/README.md](https://github.com/D4Vinci/Scrapling/blob/main/agent-skill/README.md)
- [agent-skill/Scrapling-Skill/examples/README.md](https://github.com/D4Vinci/Scrapling/blob/main/agent-skill/Scrapling-Skill/examples/README.md)
- [README.md](https://github.com/D4Vinci/Scrapling/blob/main/README.md)
</details>

# AI Integration, CLI, Interactive Shell, and Agent Skill

Scrapling ships with a layered tooling surface that wraps its adaptive parser and fetchers for three different audiences: developers who want a terminal workflow (`scrapling` CLI), engineers who want a REPL inside Python (`scrapling.core.shell`), and AI agents / assistants that need a structured, machine-readable description of the library (the `agent-skill` package and MCP server). This page describes how those four surfaces fit together and where to look in the source.

## 1. Command-Line Interface (`scrapling`)

The CLI entry point is implemented in [scrapling/cli.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/cli.py). It is the supported way to bootstrap the heavy browser/fingerprint dependencies that the fetchers need.

### 1.1 `scrapling install`

The headline subcommand downloads all browser engines and their system dependencies that `DynamicFetcher` and `StealthyFetcher` require at runtime. Source: [scrapling/cli.py:1-50]().

```bash
pip install "scrapling[fetchers]"
scrapling install           # normal install
scrapling install --force   # force reinstall of browsers/fingerprints
```

Release v0.4.9 explicitly instructs users to run `scrapling install --force` after upgrading because all bundled browser binaries and fingerprints were refreshed. The CLI is also invokable programmatically, which is useful in CI and inside the interactive shell:

```python
from scrapling.cli import install

install([], standalone_mode=False)           # normal install
install(["--force"], standalone_mode=False)  # force reinstall
```

Source: [README.md:installation section]().

### 1.2 `--version` flag and ergonomics

A `--version` flag was added to the CLI by a community contributor (@ETM-Code) in v0.4.9. Source: [README.md:release notes v0.4.9](). The CLI also exposes the MCP server installation entry point used by the AI integration, summarized in the badge block of the README.

## 2. Interactive Python Shell (`scrapling.core.shell`)

The interactive shell lives in [scrapling/core/shell.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/shell.py) and its companion signature module [scrapling/core/_shell_signatures.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/_shell_signatures.py). It is a Python REPL pre-configured with Scrapling's parser, fetchers, and helpers so users can prototype selectors and fetches interactively.

Supporting utilities, including colored output and command parsing, are kept in [scrapling/core/utils/_shell.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/utils/_shell.py). The `_shell_signatures.py` file declares the type signatures that the shell prints as help text when the user types a partial command, mirroring the API surface of the parser (`Selector`, `Adaptor`, and the text-processing methods).
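
The idea of printing a callable's signature as help text can be illustrated generically with Python's `inspect` module. This is a minimal sketch, not Scrapling's implementation: the `fetch` function and the `help_line` helper are hypothetical stand-ins, and the real shell reads its signatures from `_shell_signatures.py`.

```python
import inspect
from typing import Callable


def fetch(url: str, timeout: int = 30, stealth: bool = False) -> str:
    """Hypothetical stand-in for a shell command; not part of Scrapling."""
    return f"fetched {url}"


def help_line(func: Callable) -> str:
    """Render a one-line help string such as `name(arg: type = default) -> type`."""
    return f"{func.__name__}{inspect.signature(func)}"


print(help_line(fetch))
# fetch(url: str, timeout: int = 30, stealth: bool = False) -> str
```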

The shell is intended for ad-hoc exploration: paste a URL, run `page.css(...)`, inspect results, and iterate. Because it shares the same logger proxy defined in [scrapling/core/utils/_utils.py](), output is consistent with the CLI and spiders.

## 3. AI Integration and the Agent Skill

The AI surface has two parts: a programmatic integration and a packaged Agent Skill that follows the [AgentSkill specification](https://agentskills.io/specification).

### 3.1 Programmatic AI module

[scrapling/core/ai.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/ai.py) provides the AI-facing helpers (used internally by the MCP server and externally by users who want to embed Scrapling in LLM toolchains). It is exposed via the optional `[ai]` extra:

```bash
pip install "scrapling[ai]"
```

This extra enables the MCP server referenced from the README's documentation index (`https://scrapling.readthedocs.io/en/latest/ai/mcp-server.html`). Source: [README.md:documentation links]().

### 3.2 Agent Skill (`agent-skill/`)

The `agent-skill/` directory ships a ready-to-install Agent Skill bundle. Per [agent-skill/README.md](https://github.com/D4Vinci/Scrapling/blob/main/agent-skill/README.md), the skill:

- Aligns with the `AgentSkill` specification, so it is readable by **OpenClaw**, **Claude Code**, and other agentic tools.
- Encapsulates almost all of the documentation site content in Markdown, so the agent does not need to guess.
- Is installable from a direct ZIP URL or via Clawhub.

```bash
clawhub install scrapling-official
```

This directly addresses the most engaged community request, **Issue #142 "OpenClaw Support"**, which asked for first-class OpenClaw skill packaging. The skill also bundles runnable examples in [agent-skill/Scrapling-Skill/examples/README.md](https://github.com/D4Vinci/Scrapling/blob/main/agent-skill/Scrapling-Skill/examples/README.md), each example pairing with one of the four escalation tiers.

### 3.3 Translator (CSS → XPath)

A small but important AI-adjacent utility is the selector translator at [scrapling/core/translator.py](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py). It is adapted from Parsel (BSD-licensed) and translates CSS selectors into equivalent XPath expressions. This is useful both for humans migrating selectors and for AI agents that need a normalized representation when reasoning over Scrapling's `Selector` objects.
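
To make the translation concrete, here is a toy sketch that maps a very small CSS subset (tag names, `.class`, `#id`, and the descendant and child combinators) to XPath. It is not the Parsel-derived code in `translator.py`, which covers far more of the CSS grammar; the function names are hypothetical and only the standard library is used.

```python
import re

# One compound selector: optional tag (or *), then any number of #id / .class parts.
_COMPOUND = re.compile(r"(?P<tag>[A-Za-z][\w-]*|\*)?(?P<rest>(?:[#.][\w-]+)*)")


def compound_to_xpath(compound: str) -> str:
    """Translate one compound selector such as `div.quote#first` to an XPath step."""
    match = _COMPOUND.fullmatch(compound)
    if not compound or match is None:
        raise ValueError(f"unsupported selector part: {compound!r}")
    step = match.group("tag") or "*"
    for kind, name in re.findall(r"([#.])([\w-]+)", match.group("rest")):
        if kind == "#":
            step += f"[@id='{name}']"
        else:
            # Match a whole class token inside a space-separated class attribute.
            step += f"[contains(concat(' ', normalize-space(@class), ' '), ' {name} ')]"
    return step


def css_to_xpath(css: str) -> str:
    """Translate a selector built from compounds, whitespace, and `>` into XPath."""
    tokens = re.findall(r">|[^\s>]+", css.strip())
    if not tokens:
        raise ValueError("empty selector")
    xpath, axis = "", "//"  # whitespace means "any descendant"
    for token in tokens:
        if token == ">":
            axis = "/"  # child combinator: direct children only
            continue
        xpath += axis + compound_to_xpath(token)
        axis = "//"
    if axis == "/":
        raise ValueError("selector ends with a dangling '>'")
    return xpath


print(css_to_xpath("ul#menu li"))  # //ul[@id='menu']//li
print(css_to_xpath("div > p"))     # //div/p
```

The class predicate uses the usual `concat`/`normalize-space` idiom so that `.item` does not accidentally match `class="items"`.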

## 4. Component Map and Community Context

The diagram below shows how the four surfaces share the same parser and fetcher core, and how the Agent Skill packages documentation for LLM agents.

```mermaid
flowchart LR
  subgraph Users
    Dev[Developer]
    REPL[Power user]
    Agent[AI Agent / OpenClaw / Claude Code]
  end
  CLI[scrapling CLI<br/>scrapling/cli.py]
  Shell[Interactive shell<br/>scrapling/core/shell.py]
  AI[AI module + MCP<br/>scrapling/core/ai.py]
  Skill[Agent Skill bundle<br/>agent-skill/]
  Trans[Translator<br/>scrapling/core/translator.py]
  Core[Parser + Fetchers core]
  Dev --> CLI
  REPL --> Shell
  Agent --> Skill
  Agent --> AI
  CLI --> Core
  Shell --> Core
  AI --> Core
  Skill -.docs.-> Agent
  Trans --> Core
```

Community context worth keeping in mind:

- **#142 (OpenClaw Support)**: The `agent-skill/` package with the `clawhub install scrapling-official` workflow is the direct response to this request. Source: [agent-skill/README.md]().
- **#159 (Listening to browser requests)**: Tied to the fetcher/stealthy surface rather than the AI surface; it is noted here only as adjacent work.
- **#82 (Automatic pagination)**: Could be exposed through the Agent Skill as a documented helper in a future release; not yet shipped.

## See Also

- [Fetchers and Sessions](https://scrapling.readthedocs.io/en/latest/fetching/choosing.html) — the runtime layer the CLI installs browsers for.
- [Parser and Selection Methods](https://scrapling.readthedocs.io/en/latest/parsing/selection.html) — what the interactive shell preloads.
- [Spider Architecture](https://scrapling.readthedocs.io/en/latest/spiders/architecture.html) — the multi-session crawling framework.
- [MCP Server](https://scrapling.readthedocs.io/en/latest/ai/mcp-server.html) — the LLM-facing server backed by `scrapling/core/ai.py`.
- [Agent Skill specification](https://agentskills.io/specification) — format followed by `agent-skill/Scrapling-Skill/`.

---


## Pitfall Log

Project: D4Vinci/Scrapling

Summary: Found 12 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/294

## 2. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/350

## 3. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/299

## 4. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/348

## 5. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/D4Vinci/Scrapling

## 6. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/349

## 7. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/D4Vinci/Scrapling

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: No demo is available (`no_demo`).
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/D4Vinci/Scrapling

## 9. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: No demo is available (`no_demo`).
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/D4Vinci/Scrapling

## 10. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/D4Vinci/Scrapling/issues/295

## 11. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `issue_or_pr_quality=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/D4Vinci/Scrapling

## 12. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `release_recency=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/D4Vinci/Scrapling

<!-- canonical_name: D4Vinci/Scrapling; human_manual_source: deepwiki_human_wiki -->
