Doramagic Project Pack · Human Manual
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Overview, Architecture, and Package Layout
Related topics: Crawler Hierarchy and HTTP Clients, Browser Pool, Launchers, and Fingerprinting, Storage, Sessions, Proxies, Autoscaling, and CLI
Crawlee is a Node.js / TypeScript library for building reliable web scrapers and crawlers. The monorepo hosts a layered set of packages: a thin meta-package (crawlee) that re-exports the rest, a core engine, a pluggable browser-pool for headless automation, and crawler packages that bind the engine to specific HTTP or browser libraries. The latest published line is the 3.17.x series (current at 3.17.0, released 2026-06-04), with backports and fixes continuing through 3.15.x. Source: package.json:1-120
Purpose and Scope
Crawlee's purpose is to give developers a single, batteries-included API for crawling the web — from static HTML scraped with Cheerio, to fully rendered pages driven by Playwright or Puppeteer, to AI-driven flows using Stagehand. The official site frames it as "one API for headless and HTTP": users can switch between HTTP-only and headless crawlers without rewriting their handler code, and the Adaptive crawler can decide at runtime whether JavaScript rendering is required. Source: website/src/pages/js.js:1-80
Beyond crawling, the project publishes ready-made project templates so new users can scaffold a scraper with one command. Templates are available for Cheerio, Playwright, Puppeteer, and Camoufox, in both JavaScript and TypeScript variants. Source: packages/templates/templates/cheerio-ts/README.md:1-10 Source: packages/templates/templates/playwright-ts/README.md:1-10 Source: packages/templates/templates/puppeteer-ts/README.md:1-10 Source: packages/templates/templates/camoufox-ts/README.md:1-10
Repository Layout and Package Organization
The repository is a Yarn 4 + Lerna monorepo with Turborepo for task running, declared at the root. Node 24.17.0 is pinned via Volta, and Playwright 1.61.0 and Puppeteer 24.36.1 are aligned in resolutions to keep browser engines consistent across packages. Source: package.json:1-120
The packages/ directory contains the published libraries. The umbrella package crawlee simply re-exports a curated set of first-party packages (@crawlee/basic, @crawlee/browser, @crawlee/browser-pool, @crawlee/cheerio, @crawlee/cli, @crawlee/core, @crawlee/http, @crawlee/jsdom, @crawlee/linkedom, @crawlee/playwright, @crawlee/puppeteer, and others), all versioned together at 3.17.0. Source: packages/crawlee/package.json:1-80
Specialized packages add capabilities beyond core:
| Package | Role | Notable Dependency |
|---|---|---|
| @crawlee/browser-pool | Lifecycle management for headless browsers | playwright, puppeteer |
| @crawlee/stagehand | AI-driven crawling via Stagehand | @browserbasehq/stagehand v3, zod |
| @crawlee/templates | Project scaffolds for new crawlers | none |
Source: packages/stagehand-crawler/package.json:1-60 Source: packages/browser-pool/src/browser-pool.ts:1-40
Community discussion: an open proposal asks whether @crawlee/memory-storage should be merged into @crawlee/core to remove an extra dependency from the install graph (issue #3756). Separately, users have asked for a way to disable ImpitHttpClient's connection cache (issue #3769) and reported that Bun runtime support is still partial because browser-pool and memory-storage lag behind (issue #2046).
Core Architecture
At runtime, every crawler ultimately drives an HTTP fetch (static rendering) or a browser page (dynamic rendering). The shared core package supplies the request queue, request list, autoscaling, retry, statistics, and storage abstractions; crawler packages consume those abstractions and add their own transport layer.
For browser-based crawlers, @crawlee/browser-pool sits underneath BrowserCrawler, PlaywrightCrawler, and PuppeteerCrawler. BrowserPool orchestrates launching and retiring browsers, exposing lifecycle hooks (preLaunchHooks, postLaunchHooks, prePageCreateHooks, prePageCloseHooks, postPageCloseHooks) so user code can tweak launchOptions, register contexts, or perform cleanup. Source: packages/browser-pool/src/browser-pool.ts:1-80
Each live browser is wrapped by a BrowserController — an abstract class that holds the underlying automation-library browser, the BrowserPlugin that launched it, the LaunchContext, and an optional proxyTier for tiered proxy rotation. Concrete PuppeteerController and PlaywrightController subclasses add only library-specific private methods. Source: packages/browser-pool/src/abstract-classes/browser-controller.ts:1-60
flowchart TD
User[User code / handler] --> Crawler[Crawler class<br/>HTTP / Browser]
Crawler --> Core["@crawlee/core<br/>RequestQueue, Autoscaler, Stats"]
Crawler -->|HTTP path| HttpCrawler[HttpCrawler / CheerioCrawler]
Crawler -->|Headless path| BrowserCrawler[BrowserCrawler]
BrowserCrawler --> Pool["@crawlee/browser-pool<br/>BrowserPool"]
Pool --> Ctrl[BrowserController]
Ctrl --> Plugin[BrowserPlugin<br/>Playwright / Puppeteer]
Plugin --> Browser[Chromium / Firefox / WebKit]
Crawler -.optional AI.-> Stagehand["@crawlee/stagehand<br/>LLM-driven"]
Crawler-specific utility functions build on top of the page abstraction. The Playwright utils include a compileScript helper that compiles user-supplied JavaScript into a function, runs it in a sandboxed VM context with { page, request } in scope, and throws if the compiled result is not a function. Source: packages/playwright-crawler/src/internals/utils/playwright-utils.ts:1-60 Puppeteer provides a parallel utility surface (e.g. intercept-and-click helpers for JS-heavy pages) documented in its utils module. Source: packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts:1-40
The Stagehand package layers an LLM-driven page.act(), page.extract() (Zod-typed), and page.observe() on top of BrowserCrawler. apiKey semantics depend on the env option: under LOCAL it is an OpenAI/Anthropic/Google key, while under BROWSERBASE it is a Browserbase key. Source: packages/stagehand-crawler/README.md:1-40
Project Templates and Getting Started
Each crawler ships a paired TypeScript and JavaScript template under packages/templates/templates/. All templates declare the umbrella crawlee package (^3.0.0) and tsx for dev runs, with typescript ~6.0.0 and @types/node ^24.0.0 on the TypeScript side. The build pipeline is intentionally minimal: start:dev runs through tsx, build invokes tsc, and start:prod executes the compiled output with Node. Source: packages/templates/templates/empty-ts/package.json:1-25
Template READMEs redirect users to the corresponding crawlee.dev guides (e.g. the Cheerio crawler tutorial, the Playwright examples page, and the PuppeteerCrawler class reference), so the templates double as curated entry points to the broader documentation set. Source: packages/templates/templates/cheerio-js/README.md:1-10 Source: packages/templates/templates/playwright-js/README.md:1-10 Source: packages/templates/templates/puppeteer-js/README.md:1-10
See Also
- Browser Pool and BrowserController internals
- Request Queue and Autoscaler in @crawlee/core
- AdaptiveCrawler rendering-type detection
- StagehandCrawler AI integration
Source: https://github.com/apify/crawlee / Human Manual
Crawler Hierarchy and HTTP Clients
Related topics: Overview, Architecture, and Package Layout, Browser Pool, Launchers, and Fingerprinting, Storage, Sessions, Proxies, Autoscaling, and CLI
Crawlee organizes its crawlers as an inheritance tree rooted in a single foundation class. Every crawler, from the lightweight HTTP scrapers to the full browser automation crawlers, descends from BasicCrawler and reuses the same request-queue, autoscaling, retry, and storage infrastructure. On top of that foundation sit two parallel branches: the HTTP-based crawlers (fast, no JavaScript execution) and the browser-based crawlers (a full headless browser such as Chromium, Firefox, or WebKit). Choosing between them is the most consequential architectural decision a Crawlee user makes.
Overview and Workspace Layout
The repository is a Yarn/Turbo monorepo. The root package.json declares "workspaces": ["packages/*"], all packages share version 3.17.0, and the workspace is built with turbo run build. Source: package.json:1-30. Each crawler ships as its own npm package under the @crawlee/* scope:
| Package | Purpose |
|---|---|
| @crawlee/core | Foundation: BasicCrawler, Request, RequestList, RequestQueue, storage clients. Source: packages/core/package.json:1-15 |
| @crawlee/basic | BasicCrawler exports for users who want full control over fetching. Source: packages/basic-crawler/README.md:1-10 |
| @crawlee/http | HttpCrawler, the base for all non-browser HTTP crawlers. Source: packages/http-crawler/package.json:1-10 |
| @crawlee/cheerio | CheerioCrawler: HTTP + cheerio parsing. Source: packages/cheerio-crawler/README.md:1-10 |
| @crawlee/browser | BrowserCrawler: headless browser base class. Source: packages/browser-crawler/README.md:1-10 |
| @crawlee/puppeteer | PuppeteerCrawler built on Puppeteer. Source: packages/puppeteer-crawler/README.md:1-10 |
| @crawlee/playwright | PlaywrightCrawler built on Playwright. Source: packages/playwright-crawler/package.json:1-10 |
| @crawlee/stagehand | AI-driven browser automation via Stagehand. Source: packages/stagehand-crawler/README.md:1-15 |
| crawlee | Meta-package re-exporting everything for a single-install experience. Source: packages/crawlee/package.json:1-15 |
Crawler Class Hierarchy
The class hierarchy is intentionally narrow: one root, two specializations, and a handful of concrete crawlers on each side.
classDiagram
class BasicCrawler {
+requestHandler
+requestList
+requestQueue
+autoscaling
}
class HttpCrawler {
+httpClient
+sendRequest()
}
class CheerioCrawler {
+cheerio parsing
}
class JSDOMCrawler {
+jsdom parsing
}
class BrowserCrawler {
+browserPool
+pre/postLaunchHooks
}
class PuppeteerCrawler {
+PuppeteerCrawler
}
class PlaywrightCrawler {
+PlaywrightCrawler
}
class StagehandCrawler {
+page.act()
+page.extract()
+page.observe()
}
BasicCrawler <|-- HttpCrawler
BasicCrawler <|-- BrowserCrawler
HttpCrawler <|-- CheerioCrawler
HttpCrawler <|-- JSDOMCrawler
BrowserCrawler <|-- PuppeteerCrawler
BrowserCrawler <|-- PlaywrightCrawler
BrowserCrawler <|-- StagehandCrawler
BasicCrawler "invokes the user-provided requestHandler for each Request object," reads URLs from a RequestList or RequestQueue, and handles retries, statistics, and concurrency. Source: packages/basic-crawler/README.md:1-10. It is described as "a low-level tool that requires the user to implement the page download and data extraction functionality themselves." Source: packages/basic-crawler/README.md:5-10.
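The split between a root class that owns the queue and retries and a subclass that only adds a transport step can be sketched in plain TypeScript. This is a simplified, synchronous illustration under stated assumptions: MiniBasicCrawler, MiniHttpCrawler, MiniRequest, and the maxRetries default are hypothetical names and values, not Crawlee's classes or options, and real Crawlee handlers are asynchronous.

```typescript
// Hypothetical sketch, not Crawlee's implementation: the root class owns the
// queue and the retry loop, and a subclass only decides what the handler sees.
interface MiniRequest {
  url: string;
  retryCount: number;
}

type MiniHandler<C> = (context: C) => void;

abstract class MiniBasicCrawler<C> {
  readonly failed: string[] = [];
  private readonly handler: MiniHandler<C>;
  private readonly maxRetries: number;
  private queue: MiniRequest[] = [];

  // maxRetries = 2 is an assumed default, chosen for illustration only.
  constructor(handler: MiniHandler<C>, maxRetries = 2) {
    this.handler = handler;
    this.maxRetries = maxRetries;
  }

  // Subclasses build the handler context (raw request, body, page, ...).
  protected abstract buildContext(request: MiniRequest): C;

  run(urls: string[]): void {
    this.queue = urls.map((url) => ({ url, retryCount: 0 }));
    while (this.queue.length > 0) {
      const request = this.queue.shift() as MiniRequest;
      try {
        this.handler(this.buildContext(request));
      } catch {
        if (request.retryCount < this.maxRetries) {
          // Re-enqueue at the end of the queue with an incremented counter.
          this.queue.push({ url: request.url, retryCount: request.retryCount + 1 });
        } else {
          this.failed.push(request.url);
        }
      }
    }
  }
}

// The "HTTP" specialization adds a body to the context; the fetch is faked here.
class MiniHttpCrawler extends MiniBasicCrawler<{ request: MiniRequest; body: string }> {
  protected buildContext(request: MiniRequest) {
    return { request, body: `<html>${request.url}</html>` };
  }
}
```

A handler written against the `{ request, body }` context never touches the queue or the retry counter, which is the property the hierarchy is meant to provide.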
HTTP-Based Crawlers
HttpCrawler (in @crawlee/http) is the HTTP specialization. It owns the network layer — choosing the httpClient implementation (e.g. ImpitHttpClient or GotHttpClient) and applying timeouts, retries, and proxy rotation. The two most prominent subclasses are:
- CheerioCrawler: "downloads each URL using a plain HTTP request, parses the HTML content using Cheerio and then invokes the user-provided requestHandler to extract page data using a jQuery-like interface." Source: packages/cheerio-crawler/README.md:1-10.
- JSDOMCrawler: parses responses with a full jsdom DOM, useful when client-side scripts must be evaluated server-side.
The official guidance is unambiguous: if the target site does not require JavaScript, "consider using CheerioCrawler, which downloads the pages using raw HTTP requests and is about 10x faster." Source: packages/browser-crawler/README.md:3-8 and packages/puppeteer-crawler/README.md:3-8.
HTTP Client Considerations
The HTTP client layer sits between HttpCrawler.sendRequest() and the network. Recent release history shows this layer is actively maintained:
- v3.15.0 fixed a bug so that "ImpitHttpClient respects the internal Request timeout." Source: community context for v3.15.0 release.
- Community issue #3769 requests a cacheClients option on ImpitHttpClient to disable connection caching in getClient(). Source: community context. This is a live feature request in the @crawlee/impit-client package as of the captured context.
- v3.17.0 added "network timeouts to discoverValidSitemaps to prevent indefinite hangs." Source: v3.17.0 release.
A separate robustness concern tracked by the community: a crawler can hang indefinitely if started with malformed requestLike input rather than failing fast. Source: issue #3764. Operationally this means callers should validate request shapes before handing them to crawler.run().
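A pre-flight check of the kind suggested above might look like the following sketch. The accepted shapes (a URL string or an object with a url field) and the helper names isValidRequestLike and partitionRequests are assumptions made for illustration; they are not part of Crawlee's API, and the library may accept other shapes.

```typescript
// Hypothetical pre-flight validation; the accepted shapes are assumptions.
type RequestLike = string | { url: string };

function isValidRequestLike(input: unknown): input is RequestLike {
  let url: unknown;
  if (typeof input === "string") {
    url = input;
  } else if (typeof input === "object" && input !== null) {
    url = (input as { url?: unknown }).url;
  }
  // Only absolute http(s) URLs without whitespace pass this simple check.
  return typeof url === "string" && /^https?:\/\/\S+$/i.test(url);
}

// Split the input so that malformed entries can be reported before the run.
function partitionRequests(inputs: unknown[]): { valid: RequestLike[]; invalid: unknown[] } {
  const valid: RequestLike[] = [];
  const invalid: unknown[] = [];
  for (const input of inputs) {
    if (isValidRequestLike(input)) valid.push(input);
    else invalid.push(input);
  }
  return { valid, invalid };
}
```

Failing fast on a non-empty `invalid` list keeps a malformed entry from ever reaching the crawler.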
Browser-Based Crawlers
BrowserCrawler is the browser counterpart of HttpCrawler. It pulls a browser instance from a browserPool, runs the user handler inside a Playwright/Puppeteer Page context, and tears the page down afterward. Source: packages/browser-crawler/README.md:1-10.
- PuppeteerCrawler: "uses headless Chrome to download web pages and extract data" via Puppeteer. Source: packages/puppeteer-crawler/README.md:1-8.
- PlaywrightCrawler: the Playwright equivalent, exposed as a separate package. Source: packages/playwright-crawler/package.json:1-8.
- StagehandCrawler: "AI-powered web crawling using Stagehand ... for natural language browser automation. The enhanced page object offers page.act() to perform actions with plain English, page.extract() to get structured data with Zod schemas, and page.observe() to discover available actions." Source: packages/stagehand-crawler/README.md:1-10.
StagehandCrawler requires an LLM API key when run locally (env: 'LOCAL') or a Browserbase key when run against the managed cloud (env: 'BROWSERBASE'). Source: packages/stagehand-crawler/README.md:11-22. The project README explicitly recommends PlaywrightCrawler for sites with stable selectors because it is "faster and doesn't require AI API keys." Source: packages/stagehand-crawler/README.md:5-8.
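The two meanings of apiKey can be made explicit with a small lookup. This is only a sketch: StagehandEnv and describeApiKey are hypothetical names introduced here, and the provider list repeats what the Stagehand README is cited as saying elsewhere in this manual.

```typescript
// Hypothetical helper; it restates the documented apiKey semantics as code.
type StagehandEnv = "LOCAL" | "BROWSERBASE";

function describeApiKey(env: StagehandEnv): string {
  switch (env) {
    case "LOCAL":
      // Local runs call an LLM provider directly.
      return "LLM provider key (OpenAI, Anthropic, or Google)";
    case "BROWSERBASE":
      // Managed cloud runs authenticate against Browserbase.
      return "Browserbase API key";
  }
}
```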
Recent fixes that affect browser crawlers include: not retiring browsers with long-running pre|postLaunchHooks prematurely (v3.14.0), correctly applying launchOptions with useIncognitoPages (v3.15.2), and respecting storage-class config to avoid memory leaks (v3.15.1).
Choosing a Crawler and Selecting a Template
The repository ships one template per major crawler in packages/templates/templates/* (e.g. cheerio-js, cheerio-ts, playwright-ts, puppeteer-ts). Each template README is a two-line pointer to the official docs and examples. Source: packages/templates/templates/playwright-ts/README.md:1-7, packages/templates/templates/cheerio-ts/README.md:1-7, packages/templates/templates/puppeteer-ts/README.md:1-7.
Decision rule of thumb, derived from the package READMEs:
- Static HTML, no JS, maximum throughput → CheerioCrawler.
- Static HTML but need a real DOM → JSDOMCrawler.
- JS-rendered pages with stable selectors → PuppeteerCrawler or PlaywrightCrawler.
- JS-rendered pages with fragile / changing selectors and an LLM budget → StagehandCrawler.
- Custom fetching (e.g. third-party API) → BasicCrawler.
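The rule of thumb can be written down as a small decision function. The SiteProfile fields and the chooseCrawler name are hypothetical, introduced only to make the ordering of the checks explicit; the function returns a class name as a string and does not import Crawlee.

```typescript
// Hypothetical decision helper; the profile fields are illustrative assumptions.
interface SiteProfile {
  customFetching: boolean;  // data comes from a custom source, e.g. a third-party API
  needsJavaScript: boolean; // page content is rendered client-side
  needsRealDom: boolean;    // static HTML, but DOM APIs are required
  stableSelectors: boolean; // selectors rarely change
  hasLlmBudget: boolean;    // an LLM API key and budget are available
}

type CrawlerName =
  | "BasicCrawler"
  | "CheerioCrawler"
  | "JSDOMCrawler"
  | "PlaywrightCrawler"
  | "StagehandCrawler";

function chooseCrawler(site: SiteProfile): CrawlerName {
  if (site.customFetching) return "BasicCrawler";
  if (!site.needsJavaScript) {
    return site.needsRealDom ? "JSDOMCrawler" : "CheerioCrawler";
  }
  if (!site.stableSelectors && site.hasLlmBudget) return "StagehandCrawler";
  // PuppeteerCrawler satisfies the same rule; Playwright is picked arbitrarily here.
  return "PlaywrightCrawler";
}
```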
See Also
- BasicCrawler request handling and autoscaling: covered in @crawlee/basic package docs.
- Request and RequestQueue model: covered in @crawlee/core package docs.
- Adaptive crawling (HTTP-vs-browser auto-detection): release notes reference AdaptivePlaywrightCrawler fixes in v3.13.10 and v3.16.0.
- HTTP client configuration and ImpitHttpClient issues: see issue #3769 and v3.15.0 release notes.
Browser Pool, Launchers, and Fingerprinting
Related topics: Overview, Architecture, and Package Layout, Crawler Hierarchy and HTTP Clients, Storage, Sessions, Proxies, Autoscaling, and CLI
Overview and Purpose
@crawlee/browser-pool is a small, but powerful and extensible library that allows developers to seamlessly control multiple headless browsers concurrently through a unified API. The package was created to address a recurring operational concern: executing tasks in many headless browsers and their pages without having to manually manage browser launches, crashes, restarts, and the entire browser/page lifecycle (Source: packages/browser-pool/README.md).
The library supports both Puppeteer and Playwright out of the box, and can be extended with custom plugins. It is consumed by Crawlee's higher-level crawlers (PuppeteerCrawler, PlaywrightCrawler, AdaptivePlaywrightCrawler, and StagehandCrawler) to manage browser instances transparently while user code focuses on page-level data extraction.
The root project pins browser-automation dependencies at known compatible versions in package.json (for example playwright-core: 1.61.0 and @puppeteer/browsers: ^3.0.4) so that all downstream crawlers use a coherent set of browser engines (Source: package.json).
Core Architecture
The Browser Pool is organized around three primary abstractions:
- BrowserPool: the central orchestrator that tracks active LaunchContext and BrowserController instances and routes new page requests to the right browser.
- BrowserPlugin: a thin adapter wrapping a specific automation library (Puppeteer, Playwright, custom).
- BrowserController: an abstract handle that mediates all browser-level operations (closing, retrieving pages, creating contexts).
flowchart LR
User[User code / Crawler] -->|newPage| BP[BrowserPool]
BP --> LP[preLaunchHooks]
LP --> BL[Browser launch]
BL --> LC[LaunchContext]
LC --> BC[BrowserController]
BC -->|newPage| Pg[Page]
BC -->|close| Ret[Retire browser]
Ret -->|postLaunchHooks cleanup| BC
The BrowserPool constructor accepts a list of BrowserPlugin instances, allowing the pool to manage multiple browser engines simultaneously. A helper method, newPageWithEachPlugin, opens a page in every configured engine in parallel, which is useful for cross-browser testing (Source: packages/browser-pool/src/browser-pool.ts).
LaunchContext
LaunchContext holds information about a single browser launch. It exposes the resolved launchOptions, the proxy URL/tier, and any user-supplied custom values added through the extend function. This is the recommended place to store browser-scoped values such as session IDs (Source: packages/browser-pool/src/launch-context.ts).
Important flags that can be set on a LaunchContext include:
| Option | Effect |
|---|---|
| browserPerProxy | If true, the pool launches a fresh browser per proxy URL. Improves isolation but may cause excessive browser spawning. |
| useIncognitoPages | Each page uses its own browser context, destroyed on close. |
| experimentalContainers | Persistent contexts (cache reuse); works best with Firefox and is unstable on Chromium. |
| userDataDir | Path to a User Data Directory for cookies and local storage. |
| proxyUrl / proxyTier | Routing metadata consumed by the proxy chain. |
| ignoreSslErrors | Ignores TLS errors from upstream proxy, useful with self-signed HTTPS proxies. |
The pool assigns each LaunchContext an id equal to the id of the page that triggered the launch, which makes log correlation straightforward (Source: packages/browser-pool/src/launch-context.ts).
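The shape of a launch context, including the extend mechanism for browser-scoped values, can be sketched as a small class. MiniLaunchContext, MiniLaunchContextOptions, and getExtra are hypothetical; the sketch stores extended fields in a Map for clarity, which is an assumption and not necessarily how the real LaunchContext keeps them.

```typescript
// Hypothetical miniature of a launch context; the real LaunchContext differs.
interface MiniLaunchContextOptions {
  id: string; // id of the page that triggered the launch
  proxyUrl?: string;
  useIncognitoPages?: boolean;
  userDataDir?: string;
}

class MiniLaunchContext {
  readonly id: string;
  readonly proxyUrl: string | undefined;
  readonly useIncognitoPages: boolean;
  readonly userDataDir: string | undefined;
  private readonly extras = new Map<string, unknown>();

  constructor(options: MiniLaunchContextOptions) {
    this.id = options.id;
    this.proxyUrl = options.proxyUrl;
    // Pages share one browser context unless incognito pages are requested.
    this.useIncognitoPages = options.useIncognitoPages ?? false;
    this.userDataDir = options.userDataDir;
  }

  // Attach browser-scoped values, such as a session id, after construction.
  extend(fields: Record<string, unknown>): void {
    for (const [key, value] of Object.entries(fields)) {
      this.extras.set(key, value);
    }
  }

  getExtra(key: string): unknown {
    return this.extras.get(key);
  }
}
```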
Browser Plugins and Launchers
BrowserPlugin is the extension point that wraps a specific automation library. Each plugin provides a launch function returning a BrowserController, and a newPage function for creating a page within an already-launched browser.
Two official plugins are shipped:
- PlaywrightPlugin: wraps playwright.chromium, playwright.firefox, or playwright.webkit.
- PuppeteerPlugin: wraps a Puppeteer installation.
The BrowserPool class accepts an array of plugins and will round-robin across them if multiple are configured. Pages can be requested with a specific plugin via the browserPlugin option in BrowserPoolNewPageOptions (Source: packages/browser-pool/src/abstract-classes/browser-plugin.ts).
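Round-robin selection with an explicit override can be sketched as follows. MiniPlugin and RoundRobinPicker are hypothetical names; the sketch only illustrates the selection rule described above and does not reproduce BrowserPool's internals. Whether an explicit choice advances the rotation is an assumption made here (it does not).

```typescript
// Hypothetical sketch of round-robin plugin selection with an override.
interface MiniPlugin {
  name: string;
}

class RoundRobinPicker {
  private readonly plugins: MiniPlugin[];
  private next = 0;

  constructor(plugins: MiniPlugin[]) {
    if (plugins.length === 0) throw new Error("At least one plugin is required");
    this.plugins = plugins;
  }

  // An explicitly requested plugin wins; otherwise rotate through the list.
  pick(preferred?: MiniPlugin): MiniPlugin {
    if (preferred) return preferred;
    const plugin = this.plugins[this.next] as MiniPlugin;
    this.next = (this.next + 1) % this.plugins.length;
    return plugin;
  }
}
```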
The StagehandCrawler extends BrowserCrawler and adds an AI-driven layer on top of the standard pool: a StagehandController exposes page.act(), page.extract(), and page.observe() for natural-language browser interaction. The apiKey is interpreted as an LLM provider key when env: 'LOCAL' (the default) or as a Browserbase API key when env: 'BROWSERBASE' (Source: packages/stagehand-crawler/README.md).
Lifecycle Hooks and Fingerprinting
The pool exposes several lifecycle hooks configurable in the BrowserPool constructor options: preLaunchHooks, postLaunchHooks, prePageCreateHooks, postPageCreateHooks, prePageCloseHooks, and postPageCloseHooks. Each hook receives the page ID plus the relevant LaunchContext or BrowserController and can mutate launch options, attach listeners, or schedule cleanup (Source: packages/browser-pool/src/browser-pool.ts).
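How a chain of pre-launch hooks can mutate launch options in order is sketched below. This is a synchronous simplification with hypothetical names (MiniLaunchCtx, PreLaunchHook, runPreLaunchHooks, exampleHooks); the real hooks may be asynchronous, and the option values shown are illustrative only.

```typescript
// Hypothetical, synchronous simplification of a pre-launch hook chain.
interface MiniLaunchCtx {
  launchOptions: Record<string, unknown>;
}

type PreLaunchHook = (pageId: string, ctx: MiniLaunchCtx) => void;

function runPreLaunchHooks(pageId: string, ctx: MiniLaunchCtx, hooks: PreLaunchHook[]): MiniLaunchCtx {
  // Hooks run in array order and mutate the shared context in place,
  // so a later hook sees (and may override) what an earlier one set.
  for (const hook of hooks) {
    hook(pageId, ctx);
  }
  return ctx;
}

const exampleHooks: PreLaunchHook[] = [
  (_pageId, ctx) => {
    ctx.launchOptions.headless = true;
  },
  (pageId, ctx) => {
    // Record which page triggered the launch, e.g. for log correlation.
    ctx.launchOptions.triggeredBy = pageId;
  },
];
```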
Fingerprinting is integrated through BrowserFingerprintWithHeaders from fingerprint-generator, which is stored in LaunchContext alongside launchOptions. The header set produced by the fingerprint generator is applied to requests made by crawlers such as PuppeteerCrawler. Those crawlers also expose page-level utilities such as enqueueLinksByClickingElements, which clicks navigation triggers on JavaScript-heavy pages and intercepts the resulting navigations (Source: packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts).
For Playwright, the compileScript helper allows executing arbitrary user-supplied function bodies inside a vm.runInNewContext sandbox with a deliberately emptied prototype chain, while still receiving the live page and request objects. This sandbox is not a fully secure boundary, so the function is intended for sanitized or trusted code only (Source: packages/playwright-crawler/src/internals/utils/playwright-utils.ts).
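The idea behind such a helper can be sketched with Node's vm module. compileScriptSketch and CompiledScript are hypothetical names, and the wrapper string and error message are assumptions for illustration rather than a copy of the library's code. As noted above, vm is not a security boundary, so the sketch is for trusted input only.

```typescript
import { runInNewContext } from "node:vm";

// Hypothetical re-creation of the idea behind compileScript; trusted input only.
type CompiledScript = (args: { page: unknown; request: unknown }) => Promise<unknown>;

function compileScriptSketch(scriptBody: string): CompiledScript {
  // Wrap the body so that `page` and `request` arrive as named parameters.
  const source = `async ({ page, request }) => {${scriptBody}}`;
  // The context object is created without a prototype.
  const compiled: unknown = runInNewContext(source, Object.create(null));
  // typeof works across contexts, unlike instanceof Function.
  if (typeof compiled !== "function") {
    throw new Error("Compilation result is not a function!");
  }
  return compiled as CompiledScript;
}
```

Calling `compileScriptSketch("return request;")` yields an async function that can later be invoked with live `page` and `request` objects.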
Known Caveats and Community Issues
Several community-reported issues are relevant when working with browser automation under the pool:
- Bun runtime compatibility: Running Playwright/Puppeteer crawlers under Bun currently has unresolved issues in browser-pool and memory-storage integration (Source: Issue #2046).
- Premature browser retirement: Long-running preLaunchHooks or postLaunchHooks could cause the pool to retire browsers too early. This was fixed in v3.14.0 to respect hook execution time (Source: v3.14.0 release notes).
- launchOptions with useIncognitoPages: A bug caused launch options to be ignored when incognito pages were enabled. Resolved in v3.15.2 (Source: v3.15.2 release notes).
- Memory leak in storage classes: A v3.15.1 fix corrected the storage class configuration to avoid memory leaks when many browser contexts are active (Source: v3.15.1 release notes).
See Also
- Stagehand Crawler: AI-driven browser automation built on top of the Browser Pool.
- Puppeteer Crawler: PuppeteerCrawler integration using the Browser Pool.
- Playwright Crawler: PlaywrightCrawler integration using the Browser Pool.
- Cheerio Crawler: A non-browser HTTP crawler for static sites.
- Adaptive Crawler: Automatically decides between HTTP and headless rendering.
Storage, Sessions, Proxies, Autoscaling, and CLI
Related topics: Overview, Architecture, and Package Layout, Crawler Hierarchy and HTTP Clients, Browser Pool, Launchers, and Fingerprinting
Crawlee is a scalable web crawling and scraping library for Node.js, published as a Yarn-workspaces monorepo under packages/* (package.json). Five cross-cutting subsystems form the operational backbone of every crawler built on top of it: persistent Storage for input and output data, Sessions for cookie/user-data reuse, Proxies for outbound IP rotation, Autoscaling for adapting concurrency to runtime load, and the project-level CLI scripts used to build and ship the library. The sections below describe each subsystem using only the evidence available in the repository.
Architecture Overview
The following diagram shows how the five subsystems are layered on top of the crawler core. Storage is mounted by every crawler; Sessions and Proxies ride on the browser launch path; Autoscaling sits above the request queue; the CLI orchestrates the build pipeline.
flowchart TB
CLI["CLI scripts<br/>(package.json)"] --> Build["turbo run build<br/>lerna workspaces"]
Build --> Core["@crawlee/core"]
Core --> Storage["Storage Manager<br/>Dataset / KV Store /<br/>RequestQueue / RequestList"]
Core --> Auto["Autoscaling<br/>(AutoscaledPool)"]
Auto --> RQ["Request Queue"]
Storage --> RQ
Storage --> DS["Datasets / KV Stores"]
BrowserPool["@crawlee/browser-pool"] --> LaunchCtx["LaunchContext<br/>(userDataDir,<br/>proxyUrl, useIncognitoPages)"]
LaunchCtx --> Sessions["Sessions"]
LaunchCtx --> Proxies["Proxies<br/>(proxyUrl, proxyTier)"]
Storage
Crawlee's storage layer is implemented as a set of pluggable classes managed by a central StorageManager. The manager and the four built-in storage types live under packages/core/src/storages/: storage_manager.ts, dataset.ts, key_value_store.ts, request_list.ts, and request_queue.ts (packages/core/src/storages/storage_manager.ts). Datasets append structured records, Key-Value Stores hold named blobs (e.g. request state, screenshots), and RequestLists/RequestQueues track URLs to visit.
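The three storage roles can be illustrated with in-memory miniatures. MiniDataset, MiniKeyValueStore, and MiniRequestQueue are hypothetical classes; the method names echo common Crawlee usage but the signatures are simplified and synchronous, and the de-duplication rule in the queue is an assumption made for the sketch.

```typescript
// Hypothetical in-memory miniatures of the storage roles; not Crawlee's classes.
class MiniDataset<T> {
  private readonly items: T[] = [];

  // Datasets are append-only: records are added, never updated in place.
  pushData(item: T): void {
    this.items.push(item);
  }

  getData(): readonly T[] {
    return this.items;
  }
}

class MiniKeyValueStore {
  private readonly values = new Map<string, unknown>();

  // Named blobs: a later write to the same key replaces the earlier value.
  setValue(key: string, value: unknown): void {
    this.values.set(key, value);
  }

  getValue(key: string): unknown {
    return this.values.get(key);
  }
}

class MiniRequestQueue {
  private readonly pending: string[] = [];
  private readonly seen = new Set<string>();

  // Returns false when the URL was already added (assumed de-duplication rule).
  addRequest(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  fetchNextRequest(): string | undefined {
    return this.pending.shift();
  }
}
```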
In-process storage is provided by the @crawlee/memory-storage companion package. An open proposal asks that this package be merged directly into @crawlee/core to remove the extra dependency (issue #3756). A related fix in v3.15.1, *"use correct config for storage classes to avoid memory leaks"*, targeted the configuration surface of these storage classes, which indicates that storage clients hold runtime configuration that must be passed correctly during construction (v3.15.1). Separately, packages/http-crawler/package.json declares got-scraping, which HTTP crawlers use as a request library rather than as part of the storage layer.
Sessions
Session isolation for browser crawlers is configured through the browser-pool LaunchContext. Three fields on LaunchContext govern session behaviour (packages/browser-pool/src/launch-context.ts):
| Field | Purpose |
|---|---|
| useIncognitoPages | By default, pages share the same browser context. When true, each page uses its own context that is destroyed on close or crash. |
| experimentalContainers | Persistent contexts for cache reuse. Works best with Firefox; unstable on Chromium. Marked @experimental. |
| userDataDir | Path to a User Data Directory that stores cookies and local storage for reuse across runs. |
The BrowserController mirrors the proxy fields it was launched with (proxyTier, proxyUrl) onto its instance so downstream code can introspect the resolved configuration (packages/browser-pool/src/abstract-classes/browser-controller.ts). A v3.15.2 release note — *"correctly apply launchOptions with useIncognitoPages"* — shows the option is exercised in combination with arbitrary launch flags (v3.15.2), and v3.15.1 added the note that storage class configuration must be plumbed through correctly to avoid leaks (v3.15.1).
Proxies
Proxy configuration is declared on LaunchContext and observed on BrowserController. From the source, LaunchContext exposes proxyUrl and proxyTier; BrowserController records proxyTier as undefined when no tiered proxy is used, and proxyUrl is set every time the controller uses a proxy — including tiered proxies (packages/browser-pool/src/launch-context.ts, packages/browser-pool/src/abstract-classes/browser-controller.ts).
Two related fields connect proxies to the rest of the system:
- browserPerProxy: when true, the crawler respects the per-request proxy URL generated for a given request, aligning browser-based crawlers with HttpCrawler. The docstring warns this can cause Crawlee to launch too many browser instances (packages/browser-pool/src/launch-context.ts).
- proxyUrls: a list that may contain null, meaning the crawler skips proxying for that entry. This null-tolerance was added explicitly in v3.15.0 (v3.15.0).
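Rotation over a proxy list that tolerates null entries can be sketched as follows. makeProxyRotator is a hypothetical helper, and the simple round-robin order is an assumption; the sketch only shows how a null entry can be mapped to a direct connection.

```typescript
// Hypothetical rotation over a proxy list; null entries mean "no proxy".
function makeProxyRotator(proxyUrls: Array<string | null>): () => string | undefined {
  if (proxyUrls.length === 0) throw new Error("proxyUrls must not be empty");
  let index = 0;
  return () => {
    const entry = proxyUrls[index % proxyUrls.length] as string | null;
    index += 1;
    // undefined signals that the request should go out without a proxy.
    return entry ?? undefined;
  };
}
```

With `["http://p1.example:8000", null]`, every second call yields `undefined`, so half of the requests bypass the proxy.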
The companion HTTP client, ImpitHttpClient, ships its own connection-cache configuration. Community issue #3769 requests a cacheClients toggle to disable that cache for users who want strict per-request connection isolation (issue #3769).
Autoscaling and CLI
Autoscaling is a headline feature in Crawlee's marketing surface. The website homepage renders an *"Auto scaling"* card that links to /js/docs/guides/scaling-crawlers, advertising that crawlers "dynamically scale based on available resources and current load" (website/src/pages/js.js). Underneath, autoscaling is implemented by the AutoscaledPool class in @crawlee/core, which monitors system memory and CPU and adjusts worker concurrency accordingly. The getMemoryInfoV2 and getCurrentCpuTicksV2 helpers are re-exported from @crawlee/utils for this purpose (packages/utils/src/index.ts).
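A load-based scaling rule of the kind described above can be sketched in a few lines. Everything here is an assumption made for illustration: the names LoadSnapshot, ScalingLimits, and nextConcurrency, the 0.9 overload threshold, and the step size of one are not taken from Crawlee.

```typescript
// Hypothetical scaling rule; threshold and step size are illustrative only.
interface LoadSnapshot {
  memoryRatio: number; // used memory / available memory, in the range 0..1
  cpuRatio: number;    // CPU utilisation, in the range 0..1
}

interface ScalingLimits {
  min: number;
  max: number;
}

function nextConcurrency(
  current: number,
  load: LoadSnapshot,
  limits: ScalingLimits,
  overloadedAbove = 0.9,
): number {
  // Either resource crossing the threshold counts as overload.
  const overloaded = load.memoryRatio > overloadedAbove || load.cpuRatio > overloadedAbove;
  // Step down by one when overloaded, otherwise step up by one.
  const desired = overloaded ? current - 1 : current + 1;
  // Clamp the result to the configured bounds.
  return Math.min(limits.max, Math.max(limits.min, desired));
}
```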
The CLI surface is minimal but central to the development workflow. The root package.json defines a Yarn-workspaces monorepo with packages/* as workspaces, managed by lerna@^9.0.7 and turbo@^2.1.0. The relevant scripts are:
| Script | Command |
|---|---|
| build | turbo run build && node ./scripts/typescript_fixes.mjs |
| ci:build | Lerna/CI-aware build |
| clean | turbo run clean && rimraf .turbo packages/*/.turbo packages/*/*.tsbuildinfo |
| prepublishOnly | turbo run copy |
| postinstall | npx husky install |
These scripts orchestrate TypeScript compilation, monorepo-wide caching via Turborepo, and Husky hook installation across all crawler packages (browser, cheerio, http, playwright, puppeteer, stagehand, templates) (package.json).
See Also
- Browser Pool and Launch Context
- HTTP Crawler and ImpitHttpClient
- Templates and Project Scaffolding
- Autoscaling Guide
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 22 structured pitfall item(s), including 1 high/blocking item(s). Top priority: Maintenance risk - Maintenance risk requires verification.
1. Maintenance risk: Maintenance risk requires verification
- Severity: high
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/apify/crawlee/issues/2046
2. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: v3.15.1
- User impact: Upgrade or migration may change expected behavior: v3.15.1
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v3.15.1. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.1
3. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: v3.16.0
- User impact: Upgrade or migration may change expected behavior: v3.16.0
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v3.16.0. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.16.0
4. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/apify/crawlee
5. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: v3.15.0
- User impact: Upgrade or migration may change expected behavior: v3.15.0
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v3.15.0. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.0
6. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: v3.15.3
- User impact: Upgrade or migration may change expected behavior: v3.15.3
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v3.15.3. Context: Observed when using playwright
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.3
7. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: v3.17.0
- User impact: Upgrade or migration may change expected behavior: v3.17.0
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: v3.17.0. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.17.0
8. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/apify/crawlee/issues/3764
9. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/apify/crawlee
10. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/apify/crawlee
11. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | https://github.com/apify/crawlee
12. Capability evidence risk: Capability evidence risk requires verification
- Severity: low
- Finding: Developers should check this capability risk before relying on the project: Allow disabling ImpitHttpClient client cache
- User impact: Developers may hit a documented source-backed failure mode: Allow disabling ImpitHttpClient client cache
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Allow disabling ImpitHttpClient client cache. Context: Observed when using node
- Evidence: failure_mode_cluster:github_issue | https://github.com/apify/crawlee/issues/3769
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using crawlee with real data or production workflows.
- Allow disabling ImpitHttpClient client cache - github / github_issue
- Add support for Bun runtime - Issue with `browser-pool` and `memory-stor… - github / github_issue
- Crawler hangs forever when given malformed request input (e.g. invalid `… - github / github_issue
- Merge @crawlee/memory-storage package into @crawlee/core - github / github_issue
- v3.17.0 - github / github_release
- v3.16.0 - github / github_release
- v3.15.3 - github / github_release
- v3.15.2 - github / github_release
- v3.15.1 - github / github_release
- v3.15.0 - github / github_release
- v3.14.1 - github / github_release
- v3.14.0 - github / github_release
Source: Project Pack community evidence and pitfall evidence