Doramagic Project Pack · Human Manual

crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Overview, Architecture, and Package Layout

Related topics: Crawler Hierarchy and HTTP Clients, Browser Pool, Launchers, and Fingerprinting, Storage, Sessions, Proxies, Autoscaling, and CLI


Crawlee is a Node.js / TypeScript library for building reliable web scrapers and crawlers. The monorepo hosts a layered set of packages: a thin meta-package (crawlee) that re-exports the rest, a core engine, a pluggable browser-pool for headless automation, and crawler packages that bind the engine to specific HTTP or browser libraries. The latest published line is the 3.17.x series (current at 3.17.0, released 2026-06-04), with backports and fixes continuing through 3.15.x. Source: package.json:1-120

Purpose and Scope

Crawlee's purpose is to give developers a single, batteries-included API for crawling the web — from static HTML scraped with Cheerio, to fully rendered pages driven by Playwright or Puppeteer, to AI-driven flows using Stagehand. The official site frames it as "one API for headless and HTTP": users can switch between HTTP-only and headless crawlers without rewriting their handler code, and the Adaptive crawler can decide at runtime whether JavaScript rendering is required. Source: website/src/pages/js.js:1-80

Beyond crawling, the project publishes ready-made project templates so new users can scaffold a scraper with one command. Templates are available for Cheerio, Playwright, Puppeteer, and Camoufox, in both JavaScript and TypeScript variants. Source: packages/templates/templates/cheerio-ts/README.md:1-10 Source: packages/templates/templates/playwright-ts/README.md:1-10 Source: packages/templates/templates/puppeteer-ts/README.md:1-10 Source: packages/templates/templates/camoufox-ts/README.md:1-10

Repository Layout and Package Organization

The repository is a Yarn 4 + Lerna monorepo with Turborepo for task running, declared at the root. Node 24.17.0 is pinned via Volta, and Playwright 1.61.0 and Puppeteer 24.36.1 are aligned in resolutions to keep browser engines consistent across packages. Source: package.json:1-120

The packages/ directory contains the published libraries. The umbrella package crawlee simply re-exports a curated set of first-party packages (@crawlee/basic, @crawlee/browser, @crawlee/browser-pool, @crawlee/cheerio, @crawlee/cli, @crawlee/core, @crawlee/http, @crawlee/jsdom, @crawlee/linkedom, @crawlee/playwright, @crawlee/puppeteer, and others), all versioned together at 3.17.0. Source: packages/crawlee/package.json:1-80

Specialized packages add capabilities beyond core:

| Package | Role | Notable Dependency |
| --- | --- | --- |
| @crawlee/browser-pool | Lifecycle management for headless browsers | playwright, puppeteer |
| @crawlee/stagehand | AI-driven crawling via Stagehand | @browserbasehq/stagehand v3, zod |
| @crawlee/templates | Project scaffolds for new crawlers | none |

Source: packages/stagehand-crawler/package.json:1-60 Source: packages/browser-pool/src/browser-pool.ts:1-40

Community discussion: an open proposal asks whether @crawlee/memory-storage should be merged into @crawlee/core to remove an extra dependency from the install graph (issue #3756). Separately, users have asked for a way to disable ImpitHttpClient's connection cache (issue #3769) and reported that Bun runtime support is still partial because browser-pool and memory-storage lag behind (issue #2046).

Core Architecture

At runtime, every crawler ultimately drives an HTTP fetch (static rendering) or a browser page (dynamic rendering). The shared core package supplies the request queue, request list, autoscaling, retry, statistics, and storage abstractions; crawler packages consume those abstractions and add their own transport layer.

For browser-based crawlers, @crawlee/browser-pool sits underneath BrowserCrawler, PlaywrightCrawler, and PuppeteerCrawler. BrowserPool orchestrates launching and retiring browsers, exposing lifecycle hooks (preLaunchHooks, postLaunchHooks, prePageCreateHooks, postPageCreateHooks, prePageCloseHooks, postPageCloseHooks) so user code can tweak launchOptions, register contexts, or perform cleanup. Source: packages/browser-pool/src/browser-pool.ts:1-80

Each live browser is wrapped by a BrowserController — an abstract class that holds the underlying automation-library browser, the BrowserPlugin that launched it, the LaunchContext, and an optional proxyTier for tiered proxy rotation. Concrete PuppeteerController and PlaywrightController subclasses add only library-specific private methods. Source: packages/browser-pool/src/abstract-classes/browser-controller.ts:1-60

flowchart TD
    User[User code / handler] --> Crawler[Crawler class<br/>HTTP / Browser]
    Crawler --> Core["@crawlee/core<br/>RequestQueue, Autoscaler, Stats"]
    Crawler -->|HTTP path| HttpCrawler[HttpCrawler / CheerioCrawler]
    Crawler -->|Headless path| BrowserCrawler[BrowserCrawler]
    BrowserCrawler --> Pool["@crawlee/browser-pool<br/>BrowserPool"]
    Pool --> Ctrl[BrowserController]
    Ctrl --> Plugin[BrowserPlugin<br/>Playwright / Puppeteer]
    Plugin --> Browser[Chromium / Firefox / WebKit]
    Crawler -.optional AI.-> Stagehand["@crawlee/stagehand<br/>LLM-driven"]

Crawler-specific utility functions build on top of the page abstraction. The Playwright utils include a compileScript helper that compiles user-supplied JavaScript into a function, runs it in a separate VM context with { page, request } in scope, and throws if the compiled result is not a function. Source: packages/playwright-crawler/src/internals/utils/playwright-utils.ts:1-60 Puppeteer provides a parallel utility surface (e.g. helpers that click elements and intercept the resulting navigations on JS-heavy pages), documented in its utils module. Source: packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts:1-40

The Stagehand package layers an LLM-driven page.act(), page.extract() (Zod-typed), and page.observe() on top of BrowserCrawler. apiKey semantics depend on the env option: under LOCAL it is an OpenAI/Anthropic/Google key, while under BROWSERBASE it is a Browserbase key. Source: packages/stagehand-crawler/README.md:1-40

Project Templates and Getting Started

Each crawler ships a paired TypeScript and JavaScript template under packages/templates/templates/. All templates declare the umbrella crawlee package (^3.0.0) and tsx for dev runs, with typescript ~6.0.0 and @types/node ^24.0.0 on the TypeScript side. The build pipeline is intentionally minimal: start:dev runs through tsx, build invokes tsc, and start:prod executes the compiled output with Node. Source: packages/templates/templates/empty-ts/package.json:1-25

Template READMEs redirect users to the corresponding crawlee.dev guides (e.g. the Cheerio crawler tutorial, the Playwright examples page, and the PuppeteerCrawler class reference), so the templates double as curated entry points to the broader documentation set. Source: packages/templates/templates/cheerio-js/README.md:1-10 Source: packages/templates/templates/playwright-js/README.md:1-10 Source: packages/templates/templates/puppeteer-js/README.md:1-10

See Also

  • Browser Pool and BrowserController internals
  • Request Queue and Autoscaler in @crawlee/core
  • AdaptiveCrawler rendering-type detection
  • StagehandCrawler AI integration

Source: https://github.com/apify/crawlee / Human Manual

Crawler Hierarchy and HTTP Clients

Related topics: Overview, Architecture, and Package Layout, Browser Pool, Launchers, and Fingerprinting, Storage, Sessions, Proxies, Autoscaling, and CLI


Crawlee organizes its crawlers as an inheritance tree rooted in a single foundation class. Every crawler, from the lightweight HTTP scrapers to the full browser automation crawlers, descends from BasicCrawler and reuses the same request-queue, autoscaling, retry, and storage infrastructure. On top of that foundation sit two parallel branches: the HTTP-based crawlers (fast, no browser involved) and the browser-based crawlers (real browsers driven by Puppeteer or Playwright). Choosing between them is the most consequential architectural decision a Crawlee user makes.

Overview and Workspace Layout

The repository is a Yarn/Turbo monorepo. Source: package.json:1-30. The root package.json declares "workspaces": ["packages/*"], and every workspace package shares the version 3.17.0; turbo run build builds them together. Each crawler ships as its own npm package under the @crawlee/* scope:

| Package | Purpose |
| --- | --- |
| @crawlee/core | Foundation: Request, RequestList, RequestQueue, autoscaling, and storage abstractions shared by every crawler. Source: packages/core/package.json:1-15 |
| @crawlee/basic | BasicCrawler exports for users who want full control over fetching. Source: packages/basic-crawler/README.md:1-10 |
| @crawlee/http | HttpCrawler, the base for all non-browser HTTP crawlers. Source: packages/http-crawler/package.json:1-10 |
| @crawlee/cheerio | CheerioCrawler: HTTP + cheerio parsing. Source: packages/cheerio-crawler/README.md:1-10 |
| @crawlee/browser | BrowserCrawler: headless browser base class. Source: packages/browser-crawler/README.md:1-10 |
| @crawlee/puppeteer | PuppeteerCrawler built on Puppeteer. Source: packages/puppeteer-crawler/README.md:1-10 |
| @crawlee/playwright | PlaywrightCrawler built on Playwright. Source: packages/playwright-crawler/package.json:1-10 |
| @crawlee/stagehand | AI-driven browser automation via Stagehand. Source: packages/stagehand-crawler/README.md:1-15 |
| crawlee | Meta-package re-exporting everything for a single-install experience. Source: packages/crawlee/package.json:1-15 |

Crawler Class Hierarchy

The class hierarchy is intentionally narrow: one root, two specializations, and a handful of concrete crawlers on each side.

classDiagram
    class BasicCrawler {
        +requestHandler
        +requestList
        +requestQueue
        +autoscaling
    }
    class HttpCrawler {
        +httpClient
        +sendRequest()
    }
    class CheerioCrawler {
        +cheerio parsing
    }
    class JSDOMCrawler {
        +jsdom parsing
    }
    class BrowserCrawler {
        +browserPool
        +pre/postLaunchHooks
    }
    class PuppeteerCrawler {
        +PuppeteerCrawler
    }
    class PlaywrightCrawler {
        +PlaywrightCrawler
    }
    class StagehandCrawler {
        +page.act()
        +page.extract()
        +page.observe()
    }
    BasicCrawler <|-- HttpCrawler
    BasicCrawler <|-- BrowserCrawler
    HttpCrawler <|-- CheerioCrawler
    HttpCrawler <|-- JSDOMCrawler
    BrowserCrawler <|-- PuppeteerCrawler
    BrowserCrawler <|-- PlaywrightCrawler
    BrowserCrawler <|-- StagehandCrawler

BasicCrawler "invokes the user-provided requestHandler for each Request object," reads URLs from a RequestList or RequestQueue, and handles retries, statistics, and concurrency. Source: packages/basic-crawler/README.md:1-10. It is described as "a low-level tool that requires the user to implement the page download and data extraction functionality themselves." Source: packages/basic-crawler/README.md:5-10.
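The request loop described above can be pictured with a deliberately tiny model. This is a sketch under stated assumptions, not BasicCrawler itself: the names ToyRequest, ToyHandler, and runToyCrawler are hypothetical, the handler is synchronous, and there is no concurrency, whereas the real crawler is asynchronous and autoscaled.

```typescript
// Toy model only: names mirror the concepts in the text, not Crawlee's classes.
interface ToyRequest {
    url: string;
    retryCount: number;
}

// A handler signals failure by throwing.
type ToyHandler = (request: ToyRequest) => void;

function runToyCrawler(urls: string[], handler: ToyHandler, maxRetries = 2) {
    const queue: ToyRequest[] = urls.map((url) => ({ url, retryCount: 0 }));
    const stats = { finished: 0, failed: 0, retries: 0 };

    while (queue.length > 0) {
        const request = queue.shift()!;
        try {
            handler(request); // user-provided extraction logic
            stats.finished++;
        } catch {
            if (request.retryCount < maxRetries) {
                request.retryCount++;
                stats.retries++;
                queue.push(request); // re-enqueue for another attempt
            } else {
                stats.failed++; // retries exhausted
            }
        }
    }
    return stats;
}
```

The point of the sketch is the division of labor: the loop owns queueing, retries, and statistics, while the handler only knows how to process one request.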

HTTP-Based Crawlers

HttpCrawler (in @crawlee/http) is the HTTP specialization. It owns the network layer: it selects the httpClient implementation (e.g. ImpitHttpClient or GotScrapingHttpClient) and applies timeouts, retries, and proxy rotation. The two most prominent subclasses are:

  • CheerioCrawler — "downloads each URL using a plain HTTP request, parses the HTML content using Cheerio and then invokes the user-provided requestHandler to extract page data using a jQuery-like interface." Source: packages/cheerio-crawler/README.md:1-10.
  • JSDOMCrawler — parses responses with a full jsdom DOM, useful when client-side scripts must be evaluated server-side.

The official guidance is unambiguous: if the target site does not require JavaScript, "consider using CheerioCrawler, which downloads the pages using raw HTTP requests and is about 10x faster." Source: packages/browser-crawler/README.md:3-8 and packages/puppeteer-crawler/README.md:3-8.

HTTP Client Considerations

The HTTP client layer sits between HttpCrawler.sendRequest() and the network. Recent release history shows this layer is actively maintained:

  • v3.15.0 fixed a bug so that "ImpitHttpClient respects the internal Request timeout." Source: community context for v3.15.0 release.
  • Community issue #3769 requests a cacheClients option on ImpitHttpClient to disable connection caching in getClient(). Source: community context. The request targets the @crawlee/impit-client package and was still open in the captured context.
  • v3.17.0 added "network timeouts to discoverValidSitemaps to prevent indefinite hangs." Source: v3.17.0 release.

A separate robustness concern tracked by the community: a crawler can hang indefinitely if started with malformed requestLike input rather than failing fast. Source: issue #3764. Operationally this means callers should validate request shapes before handing them to crawler.run().
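One way to act on that advice is a small pre-flight check that separates well-formed inputs from malformed ones before the crawler starts. The sketch below is an assumption-laden illustration: RequestLike is a simplified local type rather than Crawlee's definition, the helper names are hypothetical, and restricting URLs to http/https is a choice made here, not a rule stated by the library.

```typescript
// Simplified local type; not Crawlee's own request-like definition.
type RequestLike = string | { url: string; [key: string]: unknown };

// Pulls the candidate URL out of a string or an object with a `url` field.
function extractUrl(input: unknown): unknown {
    if (typeof input === 'string') return input;
    if (typeof input === 'object' && input !== null) {
        return (input as { url?: unknown }).url;
    }
    return undefined;
}

function isValidRequestLike(input: unknown): input is RequestLike {
    const url = extractUrl(input);
    if (typeof url !== 'string') return false;
    try {
        const parsed = new URL(url); // throws on relative or malformed URLs
        return parsed.protocol === 'http:' || parsed.protocol === 'https:';
    } catch {
        return false;
    }
}

// Splits inputs so the caller can log or reject the invalid ones up front.
function partitionRequests(inputs: unknown[]): { valid: RequestLike[]; invalid: unknown[] } {
    const valid: RequestLike[] = [];
    const invalid: unknown[] = [];
    for (const input of inputs) {
        if (isValidRequestLike(input)) valid.push(input);
        else invalid.push(input);
    }
    return { valid, invalid };
}
```

A caller would pass only the valid entries on to the crawler and decide separately whether a non-empty invalid list should abort the run.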

Browser-Based Crawlers

BrowserCrawler is the browser counterpart of HttpCrawler. It pulls a browser instance from a browserPool, runs the user handler inside a Playwright/Puppeteer Page context, and tears the page down afterward. Source: packages/browser-crawler/README.md:1-10.

  • PuppeteerCrawler — "uses headless Chrome to download web pages and extract data" via Puppeteer. Source: packages/puppeteer-crawler/README.md:1-8.
  • PlaywrightCrawler — the Playwright equivalent, exposed as a separate package. Source: packages/playwright-crawler/package.json:1-8.
  • StagehandCrawler — "AI-powered web crawling using Stagehand ... for natural language browser automation. The enhanced page object offers page.act() to perform actions with plain English, page.extract() to get structured data with Zod schemas, and page.observe() to discover available actions." Source: packages/stagehand-crawler/README.md:1-10.

StagehandCrawler requires an LLM API key when run locally (env: 'LOCAL') or a Browserbase key when run against the managed cloud (env: 'BROWSERBASE'). Source: packages/stagehand-crawler/README.md:11-22. The project README explicitly recommends PlaywrightCrawler for sites with stable selectors because it is "faster and doesn't require AI API keys." Source: packages/stagehand-crawler/README.md:5-8.

Recent fixes that affect browser crawlers include: not retiring browsers with long-running pre|postLaunchHooks prematurely (v3.14.0), correctly applying launchOptions with useIncognitoPages (v3.15.2), and respecting storage-class config to avoid memory leaks (v3.15.1).

Choosing a Crawler and Selecting a Template

The repository ships one template per major crawler in packages/templates/templates/* (e.g. cheerio-js, cheerio-ts, playwright-ts, puppeteer-ts). Each template README is a two-line pointer to the official docs and examples. Source: packages/templates/templates/playwright-ts/README.md:1-7, packages/templates/templates/cheerio-ts/README.md:1-7, packages/templates/templates/puppeteer-ts/README.md:1-7.

Decision rule of thumb, derived from the package READMEs:

  • Static HTML, no JS, maximum throughput → CheerioCrawler.
  • Static HTML but need a real DOM → JSDOMCrawler.
  • JS-rendered pages with stable selectors → PuppeteerCrawler or PlaywrightCrawler.
  • JS-rendered pages with fragile / changing selectors and an LLM budget → StagehandCrawler.
  • Custom fetching (e.g. third-party API) → BasicCrawler.
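The rule of thumb above can be written down as a small decision function. This is only a sketch: SiteProfile and its field names are hypothetical, PlaywrightCrawler stands in for "PuppeteerCrawler or PlaywrightCrawler", and falling back to PlaywrightCrawler when selectors are fragile but no LLM budget exists is an assumption the list does not cover.

```typescript
// Hypothetical description of a target site; not part of Crawlee's API.
interface SiteProfile {
    needsJavaScript: boolean;    // content only appears after client-side rendering
    needsRealDom: boolean;       // handler relies on DOM APIs rather than jQuery-style selectors
    selectorsAreStable: boolean; // markup rarely changes
    hasLlmBudget: boolean;       // an LLM API key and budget are available
    customFetching: boolean;     // data comes from a custom source, e.g. a third-party API
}

type CrawlerChoice =
    | 'BasicCrawler'
    | 'CheerioCrawler'
    | 'JSDOMCrawler'
    | 'PlaywrightCrawler'
    | 'StagehandCrawler';

function chooseCrawler(site: SiteProfile): CrawlerChoice {
    if (site.customFetching) return 'BasicCrawler';
    if (!site.needsJavaScript) {
        return site.needsRealDom ? 'JSDOMCrawler' : 'CheerioCrawler';
    }
    // JS-rendered pages: prefer selector-based automation unless selectors
    // are fragile and an LLM budget exists.
    if (!site.selectorsAreStable && site.hasLlmBudget) return 'StagehandCrawler';
    return 'PlaywrightCrawler';
}
```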


Browser Pool, Launchers, and Fingerprinting

Related topics: Overview, Architecture, and Package Layout, Crawler Hierarchy and HTTP Clients, Storage, Sessions, Proxies, Autoscaling, and CLI


Overview and Purpose

@crawlee/browser-pool is a small, extensible library for controlling multiple headless browsers concurrently through a unified API. It addresses a recurring operational concern: running tasks in many headless browsers and pages without manually managing browser launches, crashes, restarts, and the rest of the browser/page lifecycle (Source: packages/browser-pool/README.md).

The library supports both Puppeteer and Playwright out of the box, and can be extended with custom plugins. It is consumed by Crawlee's higher-level crawlers (PuppeteerCrawler, PlaywrightCrawler, AdaptivePlaywrightCrawler, and StagehandCrawler) to manage browser instances transparently while user code focuses on page-level data extraction.

The root project pins browser-automation dependencies at known compatible versions in package.json (for example playwright-core: 1.61.0 and @puppeteer/browsers: ^3.0.4) so that all downstream crawlers use a coherent set of browser engines (Source: package.json).

Core Architecture

The Browser Pool is organized around three primary abstractions:

  1. BrowserPool — the central orchestrator that tracks active LaunchContext and BrowserController instances and routes new page requests to the right browser.
  2. BrowserPlugin — a thin adapter wrapping a specific automation library (Puppeteer, Playwright, custom).
  3. BrowserController — an abstract handle that mediates all browser-level operations (closing, retrieving pages, creating contexts).

flowchart LR
    User[User code / Crawler] -->|newPage| BP[BrowserPool]
    BP --> LC[LaunchContext]
    LC --> LP[preLaunchHooks]
    LP --> BL[Browser launch]
    BL --> BC[BrowserController]
    BC --> PL[postLaunchHooks]
    PL -->|newPage| Pg[Page]
    BC -->|close| Ret[Retire browser]

The BrowserPool constructor accepts a list of BrowserPlugin instances, allowing the pool to manage multiple browser engines simultaneously. A typical helper method, newPageWithEachPlugin, opens a page in every configured engine in parallel, which is useful for cross-browser testing (Source: packages/browser-pool/src/browser-pool.ts).

LaunchContext

LaunchContext holds information about a single browser launch. It exposes the resolved launchOptions, the proxy URL/tier, and any user-supplied custom values added through the extend function. This is the recommended place to store browser-scoped values such as session IDs (Source: packages/browser-pool/src/launch-context.ts).

Important flags that can be set on a LaunchContext include:

| Option | Effect |
| --- | --- |
| browserPerProxy | If true, the pool launches a fresh browser per proxy URL. Improves isolation but may cause excessive browser spawning. |
| useIncognitoPages | Each page uses its own browser context, destroyed on close. |
| experimentalContainers | Persistent contexts (cache reuse); works best with Firefox and is unstable on Chromium. |
| userDataDir | Path to a User Data Directory for cookies and local storage. |
| proxyUrl / proxyTier | Routing metadata consumed by the proxy chain. |
| ignoreSslErrors | Ignores TLS errors from upstream proxy, useful with self-signed HTTPS proxies. |

The pool assigns each LaunchContext an id equal to the id of the page that triggered the launch, which makes log correlation straightforward (Source: packages/browser-pool/src/launch-context.ts).

Browser Plugins and Launchers

BrowserPlugin is the extension point that wraps a specific automation library. Each plugin provides a launch function returning a BrowserController, and a newPage function for creating a page within an already-launched browser.

Two official plugins are shipped:

  • PlaywrightPlugin — wraps playwright.chromium, playwright.firefox, or playwright.webkit.
  • PuppeteerPlugin — wraps a Puppeteer installation.

The BrowserPool class accepts an array of plugins and will round-robin across them if multiple are configured. Pages can be requested with a specific plugin via the browserPlugin option in BrowserPoolNewPageOptions (Source: packages/browser-pool/src/abstract-classes/browser-plugin.ts).
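The selection behavior described above (rotate through plugins unless the caller names one) can be sketched with a generic picker. This is a toy, not BrowserPool code: RoundRobinPicker and ToyPlugin are hypothetical names, and the choice that an explicit request does not advance the rotation is an assumption made for the example.

```typescript
// Stand-in for a BrowserPlugin; only the identity matters here.
interface ToyPlugin {
    name: string;
}

class RoundRobinPicker<T> {
    private nextIndex = 0;
    private readonly items: readonly T[];

    constructor(items: readonly T[]) {
        if (items.length === 0) throw new Error('At least one item is required.');
        this.items = items;
    }

    // Returns the explicitly requested item if given, otherwise the next in rotation.
    pick(requested?: T): T {
        if (requested !== undefined) {
            if (!this.items.includes(requested)) {
                throw new Error('Requested item is not registered.');
            }
            return requested; // assumption: does not advance the rotation
        }
        const item = this.items[this.nextIndex];
        this.nextIndex = (this.nextIndex + 1) % this.items.length;
        return item;
    }
}
```

With two registered plugins, consecutive calls to pick() alternate between them, while pick(somePlugin) plays the role of the browserPlugin option.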

The StagehandCrawler extends BrowserCrawler and adds an AI-driven layer on top of the standard pool: a StagehandController exposes page.act(), page.extract(), and page.observe() for natural-language browser interaction. The apiKey is interpreted as an LLM provider key when env: 'LOCAL' (the default) or as a Browserbase API key when env: 'BROWSERBASE' (Source: packages/stagehand-crawler/README.md).

Lifecycle Hooks and Fingerprinting

The pool exposes several lifecycle hooks configurable in the BrowserPool constructor options: preLaunchHooks, postLaunchHooks, prePageCreateHooks, postPageCreateHooks, prePageCloseHooks, and postPageCloseHooks. Each hook receives the page ID plus the relevant LaunchContext or BrowserController and can mutate launch options, attach listeners, or schedule cleanup (Source: packages/browser-pool/src/browser-pool.ts).
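To make the ordering of those six hooks concrete, the following toy simulation fires them in the sequence their names imply. It is a sketch, not the BrowserPool implementation: ToyHook receives only a page ID (the real hooks also receive a LaunchContext or BrowserController), hooks are synchronous here, and the assumption that launch hooks run only when a new browser has to be started is labeled in the code.

```typescript
// Toy model of hook ordering; signatures are simplified to a page ID.
type ToyHook = (pageId: string) => void;

interface ToyHooks {
    preLaunchHooks?: ToyHook[];
    postLaunchHooks?: ToyHook[];
    prePageCreateHooks?: ToyHook[];
    postPageCreateHooks?: ToyHook[];
    prePageCloseHooks?: ToyHook[];
    postPageCloseHooks?: ToyHook[];
}

function simulatePageLifecycle(pageId: string, hooks: ToyHooks, browserAlreadyRunning: boolean): void {
    const run = (list?: ToyHook[]) => (list ?? []).forEach((hook) => hook(pageId));

    // Assumption: launch hooks fire only when a new browser must be started.
    if (!browserAlreadyRunning) {
        run(hooks.preLaunchHooks);  // last chance to adjust launch options
        // ... browser launch would happen here ...
        run(hooks.postLaunchHooks); // browser is up
    }
    run(hooks.prePageCreateHooks);
    // ... page creation ...
    run(hooks.postPageCreateHooks);
    // ... user handler works with the page ...
    run(hooks.prePageCloseHooks);
    // ... page close ...
    run(hooks.postPageCloseHooks);
}
```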

Fingerprinting is integrated through BrowserFingerprintWithHeaders from fingerprint-generator, which is stored in LaunchContext alongside launchOptions, so a generated fingerprint and its matching header set stay tied to the browser launch they belong to. Separately, the Puppeteer utils module provides helpers such as enqueueLinksByClickingElements, which clicks navigation triggers on JavaScript-heavy pages and intercepts the resulting navigations (Source: packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts).

For Playwright, the compileScript helper allows executing arbitrary user-supplied function bodies inside a vm.runInNewContext sandbox with a deliberately emptied prototype chain, while still receiving the live page and request objects. This sandbox is not a fully secure boundary, so the function is intended for sanitized or trusted code only (Source: packages/playwright-crawler/src/internals/utils/playwright-utils.ts).
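The technique in that paragraph can be sketched with Node's built-in vm module. This is a minimal illustration under stated assumptions, not the library's helper: compileScriptSketch, ScriptArgs, and CompiledScript are hypothetical names, the wrapper is synchronous to keep the example short, and the page and request values are plain objects rather than live Playwright and Crawlee instances.

```typescript
import * as vm from 'node:vm';

// Simplified stand-ins for the live objects the real helper receives.
interface ScriptArgs {
    page: unknown;
    request: unknown;
}
type CompiledScript = (args: ScriptArgs) => unknown;

function compileScriptSketch(scriptBody: string): CompiledScript {
    // Wrap the user-supplied body so that `page` and `request` are in scope.
    const source = `({ page, request }) => {${scriptBody}}`;

    // Object.create(null) gives the new context a global object with no prototype chain.
    const compiled: unknown = vm.runInNewContext(source, Object.create(null));

    if (typeof compiled !== 'function') {
        throw new Error('Compilation result is not a function!');
    }
    return (args) => (compiled as CompiledScript)(args);
}
```

Because the compiled function still receives real objects from the caller, the empty context limits accidental access to globals but does not isolate untrusted code, which matches the caution in the text.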

Known Caveats and Community Issues

Several community-reported issues are relevant when working with browser automation under the pool:

  • Bun runtime compatibility — Running Playwright/Puppeteer crawlers under Bun currently has unresolved issues in browser-pool and memory-storage integration (Source: Issue #2046).
  • Premature browser retirement — Long-running preLaunchHooks or postLaunchHooks could cause the pool to retire browsers too early. This was fixed in v3.14.0 to respect hook execution time (Source: v3.14.0 release notes).
  • launchOptions with useIncognitoPages — A bug caused launch options to be ignored when incognito pages were enabled. Resolved in v3.15.2 (Source: v3.15.2 release notes).
  • Memory leak in storage classes — A v3.15.1 fix corrected the storage class configuration to avoid memory leaks when many browser contexts are active (Source: v3.15.1 release notes).

See Also

  • Stagehand Crawler — AI-driven browser automation built on top of the Browser Pool.
  • Puppeteer Crawler — PuppeteerCrawler integration using the Browser Pool.
  • Playwright Crawler — PlaywrightCrawler integration using the Browser Pool.
  • Cheerio Crawler — A non-browser HTTP crawler for static sites.
  • Adaptive Crawler — Automatically decides between HTTP and headless rendering.


Storage, Sessions, Proxies, Autoscaling, and CLI

Related topics: Overview, Architecture, and Package Layout, Crawler Hierarchy and HTTP Clients, Browser Pool, Launchers, and Fingerprinting


Crawlee is a scalable web crawling and scraping library for Node.js, published as a Yarn-workspaces monorepo under packages/* (package.json). Five cross-cutting subsystems form the operational backbone of every crawler built on top of it: persistent Storage for input and output data, Sessions for cookie/user-data reuse, Proxies for outbound IP rotation, Autoscaling for adapting concurrency to runtime load, and the project-level CLI scripts used to build and ship the library. The sections below describe each subsystem using only the evidence available in the repository.

Architecture Overview

The following diagram shows how the five subsystems are layered on top of the crawler core. Storage is mounted by every crawler; Sessions and Proxies ride on the browser launch path; Autoscaling sits above the request queue; the CLI orchestrates the build pipeline.

flowchart TB
    CLI["CLI scripts<br/>(package.json)"] --> Build["turbo run build<br/>lerna workspaces"]
    Build --> Core["@crawlee/core"]
    Core --> Storage["Storage Manager<br/>Dataset / KV Store /<br/>RequestQueue / RequestList"]
    Core --> Auto["Autoscaling<br/>(AutoscaledPool)"]
    Auto --> RQ["Request Queue"]
    Storage --> RQ
    Storage --> DS["Datasets / KV Stores"]
    BrowserPool["@crawlee/browser-pool"] --> LaunchCtx["LaunchContext<br/>(userDataDir,<br/>proxyUrl, useIncognitoPages)"]
    LaunchCtx --> Sessions["Sessions"]
    LaunchCtx --> Proxies["Proxies<br/>(proxyUrl, proxyTier)"]

Storage

Crawlee's storage layer is implemented as a set of pluggable classes managed by a central StorageManager. The four built-in storage types live under packages/core/src/storages/ in dataset.ts, key_value_store.ts, request_list.ts, and request_queue.ts, alongside storage_manager.ts itself (packages/core/src/storages/storage_manager.ts). Datasets append structured records, Key-Value Stores hold named blobs (e.g. request state, screenshots), and RequestLists/RequestQueues track URLs to visit.
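The three storage roles differ mainly in their write semantics, which a few toy in-memory classes can illustrate. These are sketches only: the class names are hypothetical, nothing is persisted, and the queue deduplicates by exact URL string, which is a simplification chosen for the example.

```typescript
// Toy in-memory models of the storage roles; not Crawlee's classes.
class ToyDataset<T> {
    private readonly items: T[] = [];

    // Append-only: records are added, never overwritten.
    pushData(item: T): void {
        this.items.push(item);
    }

    getData(): readonly T[] {
        return this.items;
    }
}

class ToyKeyValueStore {
    private readonly records = new Map<string, unknown>();

    // Named blobs: writing the same key replaces the previous value.
    setValue(key: string, value: unknown): void {
        this.records.set(key, value);
    }

    getValue(key: string): unknown {
        return this.records.get(key);
    }
}

class ToyRequestQueue {
    private readonly seen = new Set<string>();
    private readonly pending: string[] = [];

    // Returns false when the URL was already enqueued (simplified deduplication).
    addRequest(url: string): boolean {
        if (this.seen.has(url)) return false;
        this.seen.add(url);
        this.pending.push(url);
        return true;
    }

    fetchNextRequest(): string | undefined {
        return this.pending.shift();
    }
}
```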

In-process storage is provided by the @crawlee/memory-storage companion package. An open proposal asks that this package be merged directly into @crawlee/core to remove the extra dependency (issue #3756). A related fix in v3.15.1, "use correct config for storage classes to avoid memory leaks", targeted the configuration surface of these storage classes, which indicates that they depend on receiving the correct runtime configuration (v3.15.1). The got-scraping dependency declared in packages/http-crawler/package.json belongs to the HTTP transport of the HTTP crawlers rather than to the storage layer.

Sessions

Session isolation for browser crawlers is configured through the browser-pool LaunchContext. Three fields on LaunchContext govern session behaviour (packages/browser-pool/src/launch-context.ts):

| Field | Purpose |
| --- | --- |
| useIncognitoPages | By default, pages share the same browser context. When true, each page uses its own context that is destroyed on close or crash. |
| experimentalContainers | Persistent contexts for cache reuse. Works best with Firefox; unstable on Chromium. Marked @experimental. |
| userDataDir | Path to a User Data Directory that stores cookies and local storage for reuse across runs. |

The BrowserController mirrors the proxy fields it was launched with (proxyTier, proxyUrl) onto its instance so downstream code can introspect the resolved configuration (packages/browser-pool/src/abstract-classes/browser-controller.ts). A v3.15.2 release note, "correctly apply launchOptions with useIncognitoPages", shows that the option is used together with custom launch options (v3.15.2), and the v3.15.1 release fixed storage class configuration to avoid memory leaks (v3.15.1).

Proxies

Proxy configuration is declared on LaunchContext and observed on BrowserController. From the source, LaunchContext exposes proxyUrl and proxyTier; BrowserController records proxyTier as undefined when no tiered proxy is used, and proxyUrl is set every time the controller uses a proxy — including tiered proxies (packages/browser-pool/src/launch-context.ts, packages/browser-pool/src/abstract-classes/browser-controller.ts).

Two related fields connect proxies to the rest of the system:

  • browserPerProxy: when true, the crawler respects the per-request proxy URL generated for a given request, aligning browser-based crawlers with HttpCrawler. The docstring warns this can cause Crawlee to launch too many browser instances (packages/browser-pool/src/launch-context.ts).
  • proxyUrls: a list that may contain null, meaning the crawler skips proxying for that entry. This null-tolerance was added explicitly in v3.15.0 (v3.15.0).
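The null-tolerant list can be illustrated with a toy rotation in which a null slot means "connect directly". This is a sketch, not the library's proxy configuration: ToyProxyRotation is a hypothetical name, and simple round-robin selection is an assumption, since the text does not say how entries are chosen.

```typescript
// Toy rotation over a proxy list in which `null` means "no proxy for this slot".
class ToyProxyRotation {
    private index = 0;
    private readonly proxyUrls: ReadonlyArray<string | null>;

    constructor(proxyUrls: ReadonlyArray<string | null>) {
        if (proxyUrls.length === 0) throw new Error('proxyUrls must not be empty.');
        this.proxyUrls = proxyUrls;
    }

    // Returns the next proxy URL, or undefined when the slot is null (direct connection).
    next(): string | undefined {
        const entry = this.proxyUrls[this.index];
        this.index = (this.index + 1) % this.proxyUrls.length;
        return entry ?? undefined;
    }
}
```

A caller would treat an undefined result as "send this request without a proxy" and any string as the proxy URL to use.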

The companion HTTP client, ImpitHttpClient, ships its own connection-cache configuration. Community issue #3769 requests a cacheClients toggle to disable that cache for users who want strict per-request connection isolation (issue #3769).

Autoscaling and CLI

Autoscaling is a headline feature on Crawlee's website. The homepage renders an "Auto scaling" card that links to /js/docs/guides/scaling-crawlers, advertising that crawlers "dynamically scale based on available resources and current load" (website/src/pages/js.js). Underneath, autoscaling is implemented by the AutoscaledPool class in @crawlee/core, which monitors system memory and CPU (the getMemoryInfoV2 and getCurrentCpuTicksV2 helpers are exported from @crawlee/utils for this purpose; see packages/utils/src/index.ts) and adjusts concurrency accordingly.
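The idea of adjusting concurrency to memory and CPU load can be sketched as a single scaling step. Everything here is hypothetical: the option names, the 0.9 overload threshold, and the step rule are assumptions for illustration and do not describe the library's autoscaling implementation.

```typescript
// Toy scaling rule; option names are illustrative only.
interface ToyScalingOptions {
    minConcurrency: number;
    maxConcurrency: number;
    stepRatio: number; // fraction of current concurrency to add or remove per step
}

interface ToySystemSnapshot {
    memoryUsedRatio: number; // 0..1
    cpuUsedRatio: number;    // 0..1
}

function nextConcurrency(
    current: number,
    snapshot: ToySystemSnapshot,
    options: ToyScalingOptions,
    overloadThreshold = 0.9,
): number {
    const overloaded =
        snapshot.memoryUsedRatio > overloadThreshold || snapshot.cpuUsedRatio > overloadThreshold;

    // Always move by at least one worker so small pools can still scale.
    const step = Math.max(1, Math.ceil(current * options.stepRatio));
    const proposed = overloaded ? current - step : current + step;

    // Clamp to the configured bounds.
    return Math.min(options.maxConcurrency, Math.max(options.minConcurrency, proposed));
}
```

Calling this function on a timer with fresh snapshots would make concurrency drift upward while the system is healthy and back off as soon as either resource crosses the threshold.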

The repository's script surface is minimal but central to the development workflow; it is distinct from the @crawlee/cli package that the umbrella crawlee package re-exports. The root package.json defines a Yarn-workspaces monorepo with packages/* as workspaces, managed by lerna@^9.0.7 and turbo@^2.1.0. The relevant scripts are:

| Script | Command |
| --- | --- |
| build | turbo run build && node ./scripts/typescript_fixes.mjs |
| ci:build | Lerna/CI-aware build |
| clean | turbo run clean && rimraf .turbo packages/*/.turbo packages/*/*.tsbuildinfo |
| prepublishOnly | turbo run copy |
| postinstall | npx husky install |

These scripts orchestrate TypeScript compilation, monorepo-wide caching via Turborepo, and Husky hook installation across all crawler packages (browser, cheerio, http, playwright, puppeteer, stagehand, templates) (package.json).

See Also

  • Browser Pool and Launch Context
  • HTTP Crawler and ImpitHttpClient
  • Templates and Project Scaffolding
  • Autoscaling Guide


Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.


Found 22 structured pitfall item(s), including 1 high/blocking item(s). Top priority: Maintenance risk - Maintenance risk requires verification.

1. Maintenance risk: Maintenance risk requires verification

  • Severity: high
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/apify/crawlee/issues/2046

2. Configuration risk: Configuration risk requires verification

  • Severity: medium
  • Finding: Developers should check this configuration risk before relying on the project: v3.15.1
  • User impact: Upgrade or migration may change expected behavior: v3.15.1
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v3.15.1. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.1

3. Configuration risk: Configuration risk requires verification

  • Severity: medium
  • Finding: Developers should check this configuration risk before relying on the project: v3.16.0
  • User impact: Upgrade or migration may change expected behavior: v3.16.0
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v3.16.0. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.16.0

4. Capability evidence risk: Capability evidence risk requires verification

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: capability.assumptions | https://github.com/apify/crawlee

5. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Developers should check this runtime risk before relying on the project: v3.15.0
  • User impact: Upgrade or migration may change expected behavior: v3.15.0
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v3.15.0. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.0

6. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Developers should check this runtime risk before relying on the project: v3.15.3
  • User impact: Upgrade or migration may change expected behavior: v3.15.3
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v3.15.3. Context: Observed when using playwright
  • Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.3

7. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Developers should check this runtime risk before relying on the project: v3.17.0
  • User impact: Upgrade or migration may change expected behavior: v3.17.0
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v3.17.0. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.17.0

8. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/apify/crawlee/issues/3764

9. Maintenance risk: Maintenance risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | https://github.com/apify/crawlee

10. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: downstream_validation.risk_items | https://github.com/apify/crawlee

11. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: risks.scoring_risks | https://github.com/apify/crawlee

12. Capability evidence risk: Capability evidence risk requires verification

  • Severity: low
  • Finding: Developers should check this capability risk before relying on the project: Allow disabling ImpitHttpClient client cache
  • User impact: Developers may hit a documented source-backed failure mode: Allow disabling ImpitHttpClient client cache
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Allow disabling ImpitHttpClient client cache. Context: Observed when using node
  • Evidence: failure_mode_cluster:github_issue | https://github.com/apify/crawlee/issues/3769

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using crawlee with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence