# https://github.com/apify/crawlee Project Manual

Generated at: 2026-06-22 06:40:05 UTC

## Table of Contents

- [Overview, Architecture, and Package Layout](#page-overview)
- [Crawler Hierarchy and HTTP Clients](#page-crawlers-http)
- [Browser Pool, Launchers, and Fingerprinting](#page-browser-pool)
- [Storage, Sessions, Proxies, Autoscaling, and CLI](#page-runtime-ops)

## Overview, Architecture, and Package Layout

### Related Pages

Related topics: [Crawler Hierarchy and HTTP Clients](#page-crawlers-http), [Browser Pool, Launchers, and Fingerprinting](#page-browser-pool), [Storage, Sessions, Proxies, Autoscaling, and CLI](#page-runtime-ops)

Related Source Files

The following source files were used to generate this page:

- [package.json](https://github.com/apify/crawlee/blob/main/package.json)
- [packages/crawlee/package.json](https://github.com/apify/crawlee/blob/main/packages/crawlee/package.json)
- [packages/stagehand-crawler/package.json](https://github.com/apify/crawlee/blob/main/packages/stagehand-crawler/package.json)
- [packages/stagehand-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/stagehand-crawler/README.md)
- [packages/browser-pool/src/browser-pool.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/browser-pool.ts)
- [packages/browser-pool/src/abstract-classes/browser-controller.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-controller.ts)
- [packages/playwright-crawler/src/internals/utils/playwright-utils.ts](https://github.com/apify/crawlee/blob/main/packages/playwright-crawler/src/internals/utils/playwright-utils.ts)
- [packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts](https://github.com/apify/crawlee/blob/main/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts)
- [packages/templates/templates/empty-ts/package.json](https://github.com/apify/crawlee/blob/main/packages/templates/templates/empty-ts/package.json)
- [packages/templates/templates/cheerio-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/cheerio-ts/README.md)
- [packages/templates/templates/playwright-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/playwright-ts/README.md)
- [packages/templates/templates/puppeteer-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/puppeteer-ts/README.md)
- [packages/templates/templates/camoufox-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/camoufox-ts/README.md)
- [website/src/pages/js.js](https://github.com/apify/crawlee/blob/main/website/src/pages/js.js)

# Overview, Architecture, and Package Layout

Crawlee is a Node.js / TypeScript library for building reliable web scrapers and crawlers. The monorepo hosts a layered set of packages: a thin meta-package (`crawlee`) that re-exports the rest, a `core` engine, a pluggable `browser-pool` for headless automation, and crawler packages that bind the engine to specific HTTP or browser libraries. The latest published line is the `3.17.x` series (current at `3.17.0`, released 2026-06-04), with backports and fixes continuing through `3.15.x`.

Source: [package.json:1-120]()

## Purpose and Scope

Crawlee's purpose is to give developers a single, batteries-included API for crawling the web, from static HTML scraped with Cheerio, to fully rendered pages driven by Playwright or Puppeteer, to AI-driven flows using Stagehand. The official site frames it as "one API for headless and HTTP": users can switch between HTTP-only and headless crawlers without rewriting their handler code, and the Adaptive crawler can decide at runtime whether JavaScript rendering is required.

Source: [website/src/pages/js.js:1-80]()

Beyond crawling, the project publishes ready-made project templates so new users can scaffold a scraper with one command. Templates are available for Cheerio, Playwright, Puppeteer, and Camoufox, in both JavaScript and TypeScript variants.

Source: [packages/templates/templates/cheerio-ts/README.md:1-10]()
Source: [packages/templates/templates/playwright-ts/README.md:1-10]()
Source: [packages/templates/templates/puppeteer-ts/README.md:1-10]()
Source: [packages/templates/templates/camoufox-ts/README.md:1-10]()

## Repository Layout and Package Organization

The repository is a Yarn 4 + Lerna monorepo with Turborepo for task running, declared at the root. Node `24.17.0` is pinned via Volta, and Playwright `1.61.0` and Puppeteer `24.36.1` are aligned in `resolutions` to keep browser engines consistent across packages.

Source: [package.json:1-120]()

The `packages/` directory contains the published libraries. The umbrella package `crawlee` simply re-exports a curated set of first-party packages (`@crawlee/basic`, `@crawlee/browser`, `@crawlee/browser-pool`, `@crawlee/cheerio`, `@crawlee/cli`, `@crawlee/core`, `@crawlee/http`, `@crawlee/jsdom`, `@crawlee/linkedom`, `@crawlee/playwright`, `@crawlee/puppeteer`, and others), all versioned together at `3.17.0`.

Source: [packages/crawlee/package.json:1-80]()

Specialized packages add capabilities beyond core:

| Package | Role | Notable Dependency |
|---|---|---|
| `@crawlee/browser-pool` | Lifecycle management for headless browsers | `playwright`, `puppeteer` |
| `@crawlee/stagehand` | AI-driven crawling via Stagehand | `@browserbasehq/stagehand` v3, `zod` |
| `@crawlee/templates` | Project scaffolds for new crawlers | none |

Source: [packages/stagehand-crawler/package.json:1-60]()
Source: [packages/browser-pool/src/browser-pool.ts:1-40]()

Community discussion: an open proposal asks whether `@crawlee/memory-storage` should be merged into `@crawlee/core` to remove an extra dependency from the install graph (issue #3756). Separately, users have asked for a way to disable ImpitHttpClient's connection cache (issue #3769) and reported that Bun runtime support is still partial because `browser-pool` and `memory-storage` lag behind (issue #2046).

## Core Architecture

At runtime, every crawler ultimately drives an HTTP fetch (static rendering) or a browser page (dynamic rendering). The shared `core` package supplies the request queue, request list, autoscaling, retry, statistics, and storage abstractions; crawler packages consume those abstractions and add their own transport layer.

For browser-based crawlers, `@crawlee/browser-pool` sits underneath `BrowserCrawler`, `PlaywrightCrawler`, and `PuppeteerCrawler`.
`BrowserPool` orchestrates launching and retiring browsers, exposing lifecycle hooks (`preLaunchHooks`, `postLaunchHooks`, `prePageCreateHooks`, `postPageCreateHooks`, `prePageCloseHooks`, `postPageCloseHooks`) so user code can tweak `launchOptions`, register contexts, or perform cleanup.

Source: [packages/browser-pool/src/browser-pool.ts:1-80]()

Each live browser is wrapped by a `BrowserController`: an abstract class that holds the underlying automation-library browser, the `BrowserPlugin` that launched it, the `LaunchContext`, and an optional `proxyTier` for tiered proxy rotation. Concrete `PuppeteerController` and `PlaywrightController` subclasses add only library-specific private methods.

Source: [packages/browser-pool/src/abstract-classes/browser-controller.ts:1-60]()

```mermaid
flowchart TD
    User[User code / handler] --> Crawler["Crawler class<br/>HTTP / Browser"]
    Crawler --> Core["@crawlee/core<br/>RequestQueue, Autoscaler, Stats"]
    Crawler -->|HTTP path| HttpCrawler[HttpCrawler / CheerioCrawler]
    Crawler -->|Headless path| BrowserCrawler[BrowserCrawler]
    BrowserCrawler --> Pool["@crawlee/browser-pool<br/>BrowserPool"]
    Pool --> Ctrl[BrowserController]
    Ctrl --> Plugin["BrowserPlugin<br/>Playwright / Puppeteer"]
    Plugin --> Browser[Chromium / Firefox / WebKit]
    Crawler -.optional AI.-> Stagehand["@crawlee/stagehand<br/>LLM-driven"]
```

Crawler-specific utility functions build on top of the page abstraction. Playwright utils include a `compileScript` helper that compiles user-supplied JavaScript into a function executed in a sandboxed VM with `{ page, request }` in scope; it throws if the compiled result is not a function.

Source: [packages/playwright-crawler/src/internals/utils/playwright-utils.ts:1-60]()

Puppeteer provides a parallel utility surface (e.g. intercept-and-click helpers for JS-heavy pages) documented in its utils module.

Source: [packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts:1-40]()

The Stagehand package layers an LLM-driven `page.act()`, `page.extract()` (Zod-typed), and `page.observe()` on top of `BrowserCrawler`. `apiKey` semantics depend on the `env` option: under `LOCAL` it is an OpenAI/Anthropic/Google key, while under `BROWSERBASE` it is a Browserbase key.

Source: [packages/stagehand-crawler/README.md:1-40]()

## Project Templates and Getting Started

Each crawler ships a paired TypeScript and JavaScript template under `packages/templates/templates/`. All templates declare the umbrella `crawlee` package (`^3.0.0`) and `tsx` for dev runs, with `typescript ~6.0.0` and `@types/node ^24.0.0` on the TypeScript side. The build pipeline is intentionally minimal: `start:dev` runs through `tsx`, `build` invokes `tsc`, and `start:prod` executes the compiled output with Node.

Source: [packages/templates/templates/empty-ts/package.json:1-25]()

Template READMEs redirect users to the corresponding `crawlee.dev` guides (e.g. the Cheerio crawler tutorial, the Playwright examples page, and the PuppeteerCrawler class reference), so the templates double as curated entry points to the broader documentation set.
Source: [packages/templates/templates/cheerio-js/README.md:1-10]()
Source: [packages/templates/templates/playwright-js/README.md:1-10]()
Source: [packages/templates/templates/puppeteer-js/README.md:1-10]()

## See Also

- Browser Pool and BrowserController internals
- Request Queue and Autoscaler in `@crawlee/core`
- AdaptiveCrawler rendering-type detection
- StagehandCrawler AI integration

---

## Crawler Hierarchy and HTTP Clients

### Related Pages

Related topics: [Overview, Architecture, and Package Layout](#page-overview), [Browser Pool, Launchers, and Fingerprinting](#page-browser-pool), [Storage, Sessions, Proxies, Autoscaling, and CLI](#page-runtime-ops)

Related Source Files

The following source files were used to generate this page:

- [packages/basic-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/basic-crawler/README.md)
- [packages/browser-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/browser-crawler/README.md)
- [packages/cheerio-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/cheerio-crawler/README.md)
- [packages/puppeteer-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/puppeteer-crawler/README.md)
- [packages/stagehand-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/stagehand-crawler/README.md)
- [packages/http-crawler/package.json](https://github.com/apify/crawlee/blob/main/packages/http-crawler/package.json)
- [packages/core/package.json](https://github.com/apify/crawlee/blob/main/packages/core/package.json)
- [packages/crawlee/package.json](https://github.com/apify/crawlee/blob/main/packages/crawlee/package.json)
- [package.json](https://github.com/apify/crawlee/blob/main/package.json)
- [packages/utils/package.json](https://github.com/apify/crawlee/blob/main/packages/utils/package.json)
- [packages/templates/templates/cheerio-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/cheerio-ts/README.md)
- [packages/templates/templates/playwright-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/playwright-ts/README.md)
- [packages/templates/templates/puppeteer-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/puppeteer-ts/README.md)

# Crawler Hierarchy and HTTP Clients

Crawlee organizes its crawlers as an inheritance tree rooted in a single foundation class. Every crawler, from the lightweight HTTP scrapers to the full browser automation crawlers, descends from `BasicCrawler` and reuses the same request-queue, autoscaling, retry, and storage infrastructure. On top of that foundation sit two parallel branches: the **HTTP-based crawlers** (fast, no JavaScript execution) and the **browser-based crawlers** (a full headless browser). Choosing between them is the most consequential architectural decision a Crawlee user makes.

## Overview and Workspace Layout

The repository is a Yarn/Turbo monorepo. Source: [package.json:1-30](). The root `package.json` declares `"workspaces": ["packages/*"]`; all workspace packages share the version `3.17.0` and are built together via `turbo run build`. Each crawler ships as its own npm package and is published under the `@crawlee/*` scope:

| Package | Purpose |
| --- | --- |
| `@crawlee/core` | Foundation shared by all crawlers: `Request`, `RequestList`, `RequestQueue`, storage clients. Source: [packages/core/package.json:1-15]() |
| `@crawlee/basic` | `BasicCrawler` exports for users who want full control over fetching. Source: [packages/basic-crawler/README.md:1-10]() |
| `@crawlee/http` | `HttpCrawler`, the base for all non-browser HTTP crawlers. Source: [packages/http-crawler/package.json:1-10]() |
| `@crawlee/cheerio` | `CheerioCrawler`: HTTP + cheerio parsing. Source: [packages/cheerio-crawler/README.md:1-10]() |
| `@crawlee/browser` | `BrowserCrawler`: headless browser base class. Source: [packages/browser-crawler/README.md:1-10]() |
| `@crawlee/puppeteer` | `PuppeteerCrawler` built on Puppeteer. Source: [packages/puppeteer-crawler/README.md:1-10]() |
| `@crawlee/playwright` | `PlaywrightCrawler` built on Playwright. Source: [packages/playwright-crawler/package.json:1-10]() |
| `@crawlee/stagehand` | AI-driven browser automation via Stagehand. Source: [packages/stagehand-crawler/README.md:1-15]() |
| `crawlee` | Meta-package re-exporting everything for a single-install experience. Source: [packages/crawlee/package.json:1-15]() |

## Crawler Class Hierarchy

The class hierarchy is intentionally narrow: one root, two specializations, and a handful of concrete crawlers on each side.

```mermaid
classDiagram
    class BasicCrawler {
        +requestHandler
        +requestList
        +requestQueue
        +autoscaling
    }
    class HttpCrawler {
        +httpClient
        +sendRequest()
    }
    class CheerioCrawler {
        +cheerio parsing
    }
    class JSDOMCrawler {
        +jsdom parsing
    }
    class BrowserCrawler {
        +browserPool
        +pre/postLaunchHooks
    }
    class PuppeteerCrawler {
        +PuppeteerCrawler
    }
    class PlaywrightCrawler {
        +PlaywrightCrawler
    }
    class StagehandCrawler {
        +page.act()
        +page.extract()
        +page.observe()
    }
    BasicCrawler <|-- HttpCrawler
    BasicCrawler <|-- BrowserCrawler
    HttpCrawler <|-- CheerioCrawler
    HttpCrawler <|-- JSDOMCrawler
    BrowserCrawler <|-- PuppeteerCrawler
    BrowserCrawler <|-- PlaywrightCrawler
    BrowserCrawler <|-- StagehandCrawler
```

`BasicCrawler` "invokes the user-provided `requestHandler` for each `Request` object," reads URLs from a `RequestList` or `RequestQueue`, and handles retries, statistics, and concurrency. Source: [packages/basic-crawler/README.md:1-10](). It is described as "a low-level tool that requires the user to implement the page download and data extraction functionality themselves." Source: [packages/basic-crawler/README.md:5-10]().

## HTTP-Based Crawlers

`HttpCrawler` (in `@crawlee/http`) is the HTTP specialization. It owns the network layer: it chooses the `httpClient` implementation (e.g. `ImpitHttpClient` or `GotScrapingHttpClient`) and applies timeouts, retries, and proxy rotation. The two most prominent subclasses are:

- **`CheerioCrawler`**: "downloads each URL using a plain HTTP request, parses the HTML content using Cheerio and then invokes the user-provided `requestHandler` to extract page data using a jQuery-like interface."
  Source: [packages/cheerio-crawler/README.md:1-10]().
- **`JSDOMCrawler`**: parses responses with a full `jsdom` DOM, useful when client-side scripts must be evaluated server-side.

The official guidance is unambiguous: if the target site does not require JavaScript, "consider using `CheerioCrawler`, which downloads the pages using raw HTTP requests and is about **10x faster**." Source: [packages/browser-crawler/README.md:3-8]() and [packages/puppeteer-crawler/README.md:3-8]().

### HTTP Client Considerations

The HTTP client layer sits between `HttpCrawler.sendRequest()` and the network. Recent release history shows this layer is actively maintained:

- v3.15.0 fixed a bug so that "`ImpitHttpClient` respects the internal `Request` timeout." Source: community context for [v3.15.0 release](https://github.com/apify/crawlee/releases/tag/v3.15.0).
- Community issue #3769 requests a `cacheClients` option on `ImpitHttpClient` to disable connection caching in `getClient()`. Source: [community context](https://github.com/apify/crawlee/issues/3769). This is a live feature request in the `@crawlee/impit-client` package as of the captured context.
- v3.17.0 added "network timeouts to `discoverValidSitemaps` to prevent indefinite hangs." Source: [v3.17.0 release](https://github.com/apify/crawlee/releases/tag/v3.17.0).

A separate robustness concern tracked by the community: a crawler can hang indefinitely if started with malformed `requestLike` input rather than failing fast. Source: [issue #3764](https://github.com/apify/crawlee/issues/3764). Operationally this means callers should validate request shapes before handing them to `crawler.run()`.

## Browser-Based Crawlers

`BrowserCrawler` is the browser counterpart of `HttpCrawler`. It pulls a browser instance from a `browserPool`, runs the user handler inside a Playwright/Puppeteer `Page` context, and tears the page down afterward. Source: [packages/browser-crawler/README.md:1-10]().
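
The acquire, handle, and tear-down sequence described above can be sketched without any browser dependency. This is only an illustration under stated assumptions: `SketchPage`, `SketchPagePool`, and `runWithPage` are hypothetical names invented here, not part of Crawlee, and the real `BrowserCrawler` adds navigation hooks, retries, and session handling on top.

```typescript
// Hypothetical stand-ins for a browser page and a pool that hands pages out.
// They are NOT Crawlee types; they only model the shape of the lifecycle.
interface SketchPage {
    goto(url: string): Promise<void>;
    close(): Promise<void>;
}

interface SketchPagePool {
    newPage(): Promise<SketchPage>;
}

// Runs `handler` against a fresh page and always closes the page afterwards.
async function runWithPage<T>(
    pool: SketchPagePool,
    url: string,
    handler: (page: SketchPage) => Promise<T>,
): Promise<T> {
    const page = await pool.newPage(); // 1. acquire a page from the pool
    try {
        await page.goto(url); // 2. navigate to the requested URL
        return await handler(page); // 3. run the user-provided handler
    } finally {
        await page.close(); // 4. tear the page down, even if the handler threw
    }
}
```

The `try`/`finally` is the point of the sketch: a handler error still propagates to the caller (where a crawler could retry the request), but the page is never leaked.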
- **`PuppeteerCrawler`**: "uses headless Chrome to download web pages and extract data" via Puppeteer. Source: [packages/puppeteer-crawler/README.md:1-8]().
- **`PlaywrightCrawler`**: the Playwright equivalent, exposed as a separate package. Source: [packages/playwright-crawler/package.json:1-8]().
- **`StagehandCrawler`**: "AI-powered web crawling using Stagehand ... for natural language browser automation. The enhanced page object offers `page.act()` to perform actions with plain English, `page.extract()` to get structured data with Zod schemas, and `page.observe()` to discover available actions." Source: [packages/stagehand-crawler/README.md:1-10]().

`StagehandCrawler` requires an LLM API key when run locally (`env: 'LOCAL'`) or a Browserbase key when run against the managed cloud (`env: 'BROWSERBASE'`). Source: [packages/stagehand-crawler/README.md:11-22](). The package README explicitly recommends `PlaywrightCrawler` for sites with stable selectors because it is "faster and doesn't require AI API keys." Source: [packages/stagehand-crawler/README.md:5-8]().

Recent fixes that affect browser crawlers include: not retiring browsers with long-running `pre|postLaunchHooks` prematurely (v3.14.0), correctly applying `launchOptions` with `useIncognitoPages` (v3.15.2), and respecting storage-class config to avoid memory leaks (v3.15.1).

## Choosing a Crawler and Selecting a Template

The repository ships one template per major crawler in `packages/templates/templates/*` (e.g. `cheerio-js`, `cheerio-ts`, `playwright-ts`, `puppeteer-ts`). Each template README is a two-line pointer to the official docs and examples. Source: [packages/templates/templates/playwright-ts/README.md:1-7](), [packages/templates/templates/cheerio-ts/README.md:1-7](), [packages/templates/templates/puppeteer-ts/README.md:1-7]().

Decision rule of thumb, derived from the package READMEs:

- Static HTML, no JS, maximum throughput → `CheerioCrawler`.
- Static HTML but need a real DOM → `JSDOMCrawler`.
- JS-rendered pages with stable selectors → `PuppeteerCrawler` or `PlaywrightCrawler`.
- JS-rendered pages with fragile / changing selectors and an LLM budget → `StagehandCrawler`.
- Custom fetching (e.g. third-party API) → `BasicCrawler`.

## See Also

- `BasicCrawler` request handling and autoscaling, covered in [@crawlee/basic package docs](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler).
- `Request` and `RequestQueue` model, covered in [@crawlee/core package docs](https://crawlee.dev/js/api/core/class/Request).
- Adaptive crawling (HTTP-vs-browser auto-detection): release notes reference `AdaptivePlaywrightCrawler` fixes in v3.13.10 and v3.16.0.
- HTTP client configuration and `ImpitHttpClient` issues: see [issue #3769](https://github.com/apify/crawlee/issues/3769) and [v3.15.0 release notes](https://github.com/apify/crawlee/releases/tag/v3.15.0).

---

## Browser Pool, Launchers, and Fingerprinting

### Related Pages

Related topics: [Overview, Architecture, and Package Layout](#page-overview), [Crawler Hierarchy and HTTP Clients](#page-crawlers-http), [Storage, Sessions, Proxies, Autoscaling, and CLI](#page-runtime-ops)

Related Source Files

The following source files were used to generate this page:

- [packages/browser-pool/src/browser-pool.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/browser-pool.ts)
- [packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)
- [packages/browser-pool/README.md](https://github.com/apify/crawlee/blob/main/packages/browser-pool/README.md)
- [packages/browser-pool/src/abstract-classes/browser-plugin.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-plugin.ts)
- [packages/browser-pool/src/abstract-classes/browser-controller.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-controller.ts)
- [packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts](https://github.com/apify/crawlee/blob/main/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts)
- [packages/playwright-crawler/src/internals/utils/playwright-utils.ts](https://github.com/apify/crawlee/blob/main/packages/playwright-crawler/src/internals/utils/playwright-utils.ts)
- [packages/stagehand-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/stagehand-crawler/README.md)
- [package.json](https://github.com/apify/crawlee/blob/main/package.json)

# Browser Pool, Launchers, and Fingerprinting

## Overview and Purpose

`@crawlee/browser-pool` is a small but powerful and extensible library that lets developers control multiple headless browsers concurrently through a unified API. The package was created to address a recurring operational concern: executing tasks in many headless browsers and their pages without having to manually manage browser launches, crashes, restarts, and the entire browser/page lifecycle (Source: [packages/browser-pool/README.md](https://github.com/apify/crawlee/blob/main/packages/browser-pool/README.md)).

The library supports both Puppeteer and Playwright out of the box, and can be extended with custom plugins. It is consumed by Crawlee's higher-level crawlers (`PuppeteerCrawler`, `PlaywrightCrawler`, `AdaptivePlaywrightCrawler`, and `StagehandCrawler`) to manage browser instances transparently while user code focuses on page-level data extraction.

The root project pins browser-automation dependencies at known compatible versions in `package.json` (for example `playwright-core: 1.61.0` and `@puppeteer/browsers: ^3.0.4`) so that all downstream crawlers use a coherent set of browser engines (Source: [package.json](https://github.com/apify/crawlee/blob/main/package.json)).

## Core Architecture

The Browser Pool is organized around three primary abstractions:

1. **`BrowserPool`**: the central orchestrator that tracks active `LaunchContext` and `BrowserController` instances and routes new page requests to the right browser.
2. **`BrowserPlugin`**: a thin adapter wrapping a specific automation library (Puppeteer, Playwright, custom).
3. **`BrowserController`**: an abstract handle that mediates all browser-level operations (closing, retrieving pages, creating contexts).
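
To make the division of labour between these three roles concrete, here is a minimal, dependency-free sketch. It is a deliberately simplified model and not the library's code: `ToyPlugin`, `ToyController`, `ToyPool`, and the `maxPagesPerBrowser` limit are hypothetical names chosen for illustration, and the real classes also deal with hooks, proxies, retirement, and crashes.

```typescript
// Simplified, hypothetical model of the three roles; NOT the @crawlee/browser-pool API.
interface ToyPage {
    id: string;
    browserId: number;
}

// "BrowserController" role: a handle for one launched browser.
class ToyController {
    readonly browserId: number;
    private openPages = 0;

    constructor(browserId: number) {
        this.browserId = browserId;
    }

    get pageCount(): number {
        return this.openPages;
    }

    async newPage(id: string): Promise<ToyPage> {
        this.openPages += 1;
        return { id, browserId: this.browserId };
    }
}

// "BrowserPlugin" role: an adapter that knows how to launch one kind of browser.
class ToyPlugin {
    readonly name: string;
    private launched = 0;

    constructor(name: string) {
        this.name = name;
    }

    async launch(): Promise<ToyController> {
        this.launched += 1;
        return new ToyController(this.launched);
    }
}

// "BrowserPool" role: tracks controllers and routes page requests to them.
class ToyPool {
    private readonly plugin: ToyPlugin;
    private readonly maxPagesPerBrowser: number; // hypothetical capacity limit
    private readonly controllers: ToyController[] = [];
    private pageCounter = 0;

    constructor(plugin: ToyPlugin, maxPagesPerBrowser: number) {
        this.plugin = plugin;
        this.maxPagesPerBrowser = maxPagesPerBrowser;
    }

    get browserCount(): number {
        return this.controllers.length;
    }

    async newPage(): Promise<ToyPage> {
        // Reuse a browser that still has capacity; otherwise launch a new one.
        let controller = this.controllers.find((c) => c.pageCount < this.maxPagesPerBrowser);
        if (!controller) {
            controller = await this.plugin.launch();
            this.controllers.push(controller);
        }
        this.pageCounter += 1;
        return controller.newPage(`page-${this.pageCounter}`);
    }
}
```

With `new ToyPool(new ToyPlugin("chromium"), 2)`, the first two `newPage()` calls share one browser and the third triggers a second launch, which is the routing decision the pool takes off the caller's hands.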
```mermaid
flowchart LR
    User[User code / Crawler] -->|newPage| BP[BrowserPool]
    BP --> LP[preLaunchHooks]
    LP --> BL[Browser launch]
    BL --> LC[LaunchContext]
    LC --> BC[BrowserController]
    BC -->|newPage| Pg[Page]
    BC -->|close| Ret[Retire browser]
    Ret -->|postLaunchHooks cleanup| BC
```

The `BrowserPool` constructor accepts a list of `BrowserPlugin` instances, allowing the pool to manage multiple browser engines simultaneously. A convenience method, `newPageWithEachPlugin`, opens a page in every configured engine in parallel, which is useful for cross-browser testing (Source: [packages/browser-pool/src/browser-pool.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/browser-pool.ts)).

### LaunchContext

`LaunchContext` holds information about a single browser launch. It exposes the resolved `launchOptions`, the proxy URL/tier, and any user-supplied custom values added through the `extend` function. This is the recommended place to store browser-scoped values such as session IDs (Source: [packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)).

Important flags that can be set on a `LaunchContext` include:

| Option | Effect |
| --- | --- |
| `browserPerProxy` | If `true`, the pool launches a fresh browser per proxy URL. Improves isolation but may cause excessive browser spawning. |
| `useIncognitoPages` | Each page uses its own browser context, destroyed on close. |
| `experimentalContainers` | Persistent contexts (cache reuse); works best with Firefox and is unstable on Chromium. |
| `userDataDir` | Path to a User Data Directory for cookies and local storage. |
| `proxyUrl` / `proxyTier` | Routing metadata consumed by the proxy chain. |
| `ignoreSslErrors` | Ignores TLS errors from upstream proxy, useful with self-signed HTTPS proxies. |

The pool assigns each `LaunchContext` an `id` equal to the `id` of the page that triggered the launch, which makes log correlation straightforward (Source: [packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)).

## Browser Plugins and Launchers

`BrowserPlugin` is the extension point that wraps a specific automation library. Each plugin provides a `launch` function returning a `BrowserController`, and a `newPage` function for creating a page within an already-launched browser. Two official plugins are shipped:

- **`PlaywrightPlugin`**: wraps `playwright.chromium`, `playwright.firefox`, or `playwright.webkit`.
- **`PuppeteerPlugin`**: wraps a Puppeteer installation.

The `BrowserPool` class accepts an array of plugins and will round-robin across them if multiple are configured. Pages can be requested with a specific plugin via the `browserPlugin` option in `BrowserPoolNewPageOptions` (Source: [packages/browser-pool/src/abstract-classes/browser-plugin.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-plugin.ts)).

The `StagehandCrawler` extends `BrowserCrawler` and adds an AI-driven layer on top of the standard pool: a `StagehandController` exposes `page.act()`, `page.extract()`, and `page.observe()` for natural-language browser interaction. The `apiKey` is interpreted as an LLM provider key when `env: 'LOCAL'` (the default) or as a Browserbase API key when `env: 'BROWSERBASE'` (Source: [packages/stagehand-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/stagehand-crawler/README.md)).

## Lifecycle Hooks and Fingerprinting

The pool exposes several lifecycle hooks configurable in the `BrowserPool` constructor options: `preLaunchHooks`, `postLaunchHooks`, `prePageCreateHooks`, `postPageCreateHooks`, `prePageCloseHooks`, and `postPageCloseHooks`.
Each hook receives the page ID plus the relevant `LaunchContext` or `BrowserController` and can mutate launch options, attach listeners, or schedule cleanup (Source: [packages/browser-pool/src/browser-pool.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/browser-pool.ts)).

Fingerprinting is integrated through `BrowserFingerprintWithHeaders` from `fingerprint-generator`, which is stored in `LaunchContext` alongside `launchOptions`. The fingerprint carries a matching set of HTTP headers, so the requests a browser makes stay consistent with the identity it presents.

Separately, `PuppeteerCrawler` exposes utilities such as `enqueueLinksByClickingElements`, which clicks navigation triggers on JavaScript-heavy pages and intercepts the resulting navigations (Source: [packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts](https://github.com/apify/crawlee/blob/main/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts)).

For Playwright, the `compileScript` helper allows executing arbitrary user-supplied function bodies inside a `vm.runInNewContext` sandbox with a deliberately emptied prototype chain, while still receiving the live `page` and `request` objects. This sandbox is **not** a fully secure boundary, so the function is intended for sanitized or trusted code only (Source: [packages/playwright-crawler/src/internals/utils/playwright-utils.ts](https://github.com/apify/crawlee/blob/main/packages/playwright-crawler/src/internals/utils/playwright-utils.ts)).

## Known Caveats and Community Issues

Several community-reported issues are relevant when working with browser automation under the pool:

- **Bun runtime compatibility**: Running Playwright/Puppeteer crawlers under Bun currently has unresolved issues in `browser-pool` and `memory-storage` integration (Source: [Issue #2046](https://github.com/apify/crawlee/issues/2046)).
- **Premature browser retirement**: Long-running `preLaunchHooks` or `postLaunchHooks` could cause the pool to retire browsers too early. This was fixed in v3.14.0 to respect hook execution time (Source: [v3.14.0 release notes](https://github.com/apify/crawlee/releases/tag/v3.14.0)).
- **`launchOptions` with `useIncognitoPages`**: A bug caused launch options to be ignored when incognito pages were enabled. Resolved in v3.15.2 (Source: [v3.15.2 release notes](https://github.com/apify/crawlee/releases/tag/v3.15.2)).
- **Memory leak in storage classes**: A v3.15.1 fix corrected the storage class configuration to avoid memory leaks when many browser contexts are active (Source: [v3.15.1 release notes](https://github.com/apify/crawlee/releases/tag/v3.15.1)).

## See Also

- [Stagehand Crawler](./Stagehand-Crawler.md): AI-driven browser automation built on top of the Browser Pool.
- [Puppeteer Crawler](./Puppeteer-Crawler.md): `PuppeteerCrawler` integration using the Browser Pool.
- [Playwright Crawler](./Playwright-Crawler.md): `PlaywrightCrawler` integration using the Browser Pool.
- [Cheerio Crawler](./Cheerio-Crawler.md): A non-browser HTTP crawler for static sites.
- [Adaptive Crawler](./Adaptive-Crawler.md): Automatically decides between HTTP and headless rendering.

---

## Storage, Sessions, Proxies, Autoscaling, and CLI

### Related Pages

Related topics: [Overview, Architecture, and Package Layout](#page-overview), [Crawler Hierarchy and HTTP Clients](#page-crawlers-http), [Browser Pool, Launchers, and Fingerprinting](#page-browser-pool)

Related Source Files

The following source files were used to generate this page:

- [packages/core/src/storages/storage_manager.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/storage_manager.ts)
- [packages/core/src/storages/dataset.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/dataset.ts)
- [packages/core/src/storages/key_value_store.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/key_value_store.ts)
- [packages/core/src/storages/request_list.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/request_list.ts)
- [packages/core/src/storages/request_queue.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/request_queue.ts)
- [packages/browser-pool/src/browser-pool.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/browser-pool.ts)
- [packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)
- [packages/browser-pool/src/abstract-classes/browser-controller.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-controller.ts)
- [package.json](https://github.com/apify/crawlee/blob/main/package.json)
- [website/src/pages/js.js](https://github.com/apify/crawlee/blob/main/website/src/pages/js.js)

# Storage, Sessions, Proxies, Autoscaling, and CLI

Crawlee is a scalable web crawling and scraping library for Node.js, published as a Yarn-workspaces monorepo under `packages/*` ([package.json](https://github.com/apify/crawlee/blob/main/package.json)). Five cross-cutting subsystems form the operational backbone of every crawler built on it: persistent **Storage** for input and output data, **Sessions** for cookie and user-data reuse, **Proxies** for outbound IP rotation, **Autoscaling** for adapting concurrency to runtime load, and the project-level **CLI** scripts used to build and ship the library. The sections below describe each subsystem using only the evidence available in the repository.

## Architecture Overview

The following diagram shows how the five subsystems are layered on top of the crawler core. Storage is used by every crawler; Sessions and Proxies are configured on the browser launch path; Autoscaling sits above the request queue; the CLI scripts orchestrate the build pipeline.

```mermaid
flowchart TB
    CLI["CLI scripts<br/>(package.json)"] --> Build["turbo run build<br/>lerna workspaces"]
    Build --> Core["@crawlee/core"]
    Core --> Storage["Storage Manager<br/>Dataset / KV Store /<br/>RequestQueue / RequestList"]
    Core --> Auto["Autoscaling<br/>(AutoscaledPool)"]
    Auto --> RQ["Request Queue"]
    Storage --> RQ
    Storage --> DS["Datasets / KV Stores"]
    BrowserPool["@crawlee/browser-pool"] --> LaunchCtx["LaunchContext<br/>(userDataDir,<br/>proxyUrl, useIncognitoPages)"]
    LaunchCtx --> Sessions["Sessions"]
    LaunchCtx --> Proxies["Proxies<br/>(proxyUrl, proxyTier)"]
```

## Storage

Crawlee's storage layer is implemented as a set of pluggable classes managed by a central `StorageManager`. The code lives under `packages/core/src/storages/`: `dataset.ts`, `key_value_store.ts`, `request_list.ts`, and `request_queue.ts` hold the four built-in storage types, and `storage_manager.ts` holds the manager itself ([packages/core/src/storages/storage_manager.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/storage_manager.ts)). Datasets append structured records, Key-Value Stores hold named blobs (e.g. request state, screenshots), and RequestLists/RequestQueues track URLs to visit.

In-process storage is provided by the `@crawlee/memory-storage` companion package. An open proposal asks that this package be merged directly into `@crawlee/core` to remove the extra dependency ([issue #3756](https://github.com/apify/crawlee/issues/3756)). A related fix in v3.15.1, *"use correct config for storage classes to avoid memory leaks"*, targeted the configuration of these storage classes, which indicates that storage clients hold runtime configuration that must be passed correctly during construction ([v3.15.1](https://github.com/apify/crawlee/releases/tag/v3.15.1)).

HTTP crawlers declare `got-scraping` as a dependency in [packages/http-crawler/package.json](https://github.com/apify/crawlee/blob/main/packages/http-crawler/package.json); it is the HTTP client they send requests with and is separate from the storage layer.

## Sessions

Session isolation for browser crawlers is configured through the browser-pool `LaunchContext`. Three fields on `LaunchContext` govern session behaviour ([packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)):

| Field | Purpose |
| --- | --- |
| `useIncognitoPages` | By default, pages share the same browser context. When `true`, each page uses its own context that is destroyed on close or crash. |
| `experimentalContainers` | Persistent contexts for cache reuse. Works best with Firefox; unstable on Chromium. Marked `@experimental`. |
| `userDataDir` | Path to a User Data Directory that stores cookies and local storage for reuse across runs. |

The `BrowserController` mirrors the proxy fields it was launched with (`proxyTier`, `proxyUrl`) onto its instance so downstream code can introspect the resolved configuration ([packages/browser-pool/src/abstract-classes/browser-controller.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-controller.ts)). A v3.15.2 release note, *"correctly apply `launchOptions` with `useIncognitoPages`"*, shows that `useIncognitoPages` is expected to work together with custom launch options ([v3.15.2](https://github.com/apify/crawlee/releases/tag/v3.15.2)), and v3.15.1 fixed the storage class configuration so that it is passed through correctly and does not leak memory ([v3.15.1](https://github.com/apify/crawlee/releases/tag/v3.15.1)).

## Proxies

Proxy configuration is declared on `LaunchContext` and observed on `BrowserController`. From the source, `LaunchContext` exposes `proxyUrl` and `proxyTier`; `BrowserController` records `proxyTier` as `undefined` when no tiered proxy is used, and `proxyUrl` is set every time the controller uses a proxy, including tiered proxies ([packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts), [packages/browser-pool/src/abstract-classes/browser-controller.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-controller.ts)).

Two related fields connect proxies to the rest of the system:

- `browserPerProxy`: when `true`, the crawler respects the per-request proxy URL generated for a given request, aligning browser-based crawlers with `HttpCrawler`.
  The docstring warns this can cause Crawlee to launch too many browser instances ([packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)).
- `proxyUrls`: an option of `ProxyConfiguration` rather than `LaunchContext`. The list may contain `null`, meaning the crawler skips proxying for that entry. This null-tolerance was added explicitly in v3.15.0 ([v3.15.0](https://github.com/apify/crawlee/releases/tag/v3.15.0)).

The companion HTTP client, `ImpitHttpClient`, keeps its own client cache. Community issue #3769 requests a `cacheClients` toggle to disable that cache for users who want strict per-request connection isolation ([issue #3769](https://github.com/apify/crawlee/issues/3769)).

## Autoscaling and CLI

Autoscaling is one of the features highlighted on Crawlee's website. The homepage renders an *"Auto scaling"* card that links to `/js/docs/guides/scaling-crawlers`, advertising that crawlers "dynamically scale based on available resources and current load" ([website/src/pages/js.js](https://github.com/apify/crawlee/blob/main/website/src/pages/js.js)).

Underneath, autoscaling is implemented by the `AutoscaledPool` class and its supporting classes in `@crawlee/core`, which monitor system memory and CPU and adjust worker concurrency accordingly. The `getMemoryInfoV2` and `getCurrentCpuTicksV2` helpers are re-exported from `@crawlee/utils` for this purpose ([packages/utils/src/index.ts](https://github.com/apify/crawlee/blob/main/packages/utils/src/index.ts)).

The CLI surface is minimal but central to the development workflow. The root [package.json](https://github.com/apify/crawlee/blob/main/package.json) defines a Yarn-workspaces monorepo with `packages/*` as workspaces, managed by `lerna@^9.0.7` and `turbo@^2.1.0`.
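To make the autoscaling behaviour described earlier in this section more concrete, here is a minimal, self-contained sketch of how concurrency could be nudged up or down from memory and CPU readings. It is not Crawlee's implementation: `LoadSnapshot`, `ScalingLimits`, `nextConcurrency`, and the threshold and step values are all hypothetical names and numbers chosen for illustration.

```typescript
// Hypothetical model of load-based scaling; not taken from Crawlee's source.
interface LoadSnapshot {
  memoryUsedRatio: number; // fraction of available memory in use, 0..1
  cpuUsedRatio: number; // fraction of CPU capacity in use, 0..1
}

interface ScalingLimits {
  minConcurrency: number;
  maxConcurrency: number;
  overloadThreshold: number; // a ratio above this counts as overloaded
  stepRatio: number; // fraction of current concurrency changed per step
}

// Returns the concurrency to use next: one step down when either resource is
// overloaded, otherwise one step up, always clamped to the configured limits.
function nextConcurrency(
  current: number,
  load: LoadSnapshot,
  limits: ScalingLimits,
): number {
  const step = Math.max(1, Math.ceil(current * limits.stepRatio));
  const overloaded =
    load.memoryUsedRatio > limits.overloadThreshold ||
    load.cpuUsedRatio > limits.overloadThreshold;
  const proposed = overloaded ? current - step : current + step;
  return Math.min(
    limits.maxConcurrency,
    Math.max(limits.minConcurrency, proposed),
  );
}

const limits: ScalingLimits = {
  minConcurrency: 1,
  maxConcurrency: 20,
  overloadThreshold: 0.9,
  stepRatio: 0.05,
};

// Plenty of headroom: scale up by one step.
const scaledUp = nextConcurrency(10, { memoryUsedRatio: 0.4, cpuUsedRatio: 0.5 }, limits);
// Memory above the threshold: scale down by one step.
const scaledDown = nextConcurrency(10, { memoryUsedRatio: 0.95, cpuUsedRatio: 0.5 }, limits);

console.log(scaledUp, scaledDown); // 11 9
```

Scaling in small steps and clamping to a minimum and maximum keeps the pool from oscillating sharply when a single reading spikes. Crawlee's own thresholds and step sizes may differ from the values assumed here.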
The relevant scripts are:

| Script | Command |
| --- | --- |
| `build` | `turbo run build && node ./scripts/typescript_fixes.mjs` |
| `ci:build` | Lerna/CI-aware build |
| `clean` | `turbo run clean && rimraf .turbo packages/*/.turbo packages/*/*.tsbuildinfo` |
| `prepublishOnly` | `turbo run copy` |
| `postinstall` | `npx husky install` |

These scripts orchestrate TypeScript compilation, monorepo-wide caching via Turborepo, and Husky hook installation across all crawler packages (browser, cheerio, http, playwright, puppeteer, stagehand, templates) ([package.json](https://github.com/apify/crawlee/blob/main/package.json)).

## See Also

- [Browser Pool and Launch Context](./browser-pool-and-launch-context.md)
- [HTTP Crawler and ImpitHttpClient](./http-crawler.md)
- [Templates and Project Scaffolding](./templates.md)
- [Autoscaling Guide](./autoscaling-guide.md)

---

## Pitfall Log

Project: apify/crawlee

Summary: Found 22 structured pitfall item(s), including 1 high/blocking item(s).

Top priority: Maintenance risk (requires verification).

## 1. Maintenance risk (requires verification)

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/apify/crawlee/issues/2046

## 2. Configuration risk (requires verification)

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: v3.15.1
- User impact: Upgrade or migration may change expected behavior: v3.15.1
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.1

## 3. Configuration risk (requires verification)

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: v3.16.0
- User impact: Upgrade or migration may change expected behavior: v3.16.0
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.16.0

## 4. Capability evidence risk (requires verification)

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/apify/crawlee

## 5. Runtime risk (requires verification)

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v3.15.0
- User impact: Upgrade or migration may change expected behavior: v3.15.0
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.0

## 6. Runtime risk (requires verification)

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v3.15.3
- User impact: Upgrade or migration may change expected behavior: v3.15.3
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.3

## 7. Runtime risk (requires verification)

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v3.17.0
- User impact: Upgrade or migration may change expected behavior: v3.17.0
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.17.0

## 8. Runtime risk (requires verification)

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/apify/crawlee/issues/3764

## 9. Maintenance risk (requires verification)

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/apify/crawlee

## 10. Security or permission risk (requires verification)

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/apify/crawlee

## 11. Security or permission risk (requires verification)

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/apify/crawlee

## 12. Capability evidence risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Allow disabling ImpitHttpClient client cache
- User impact: Developers may hit a documented source-backed failure mode: Allow disabling ImpitHttpClient client cache
- Evidence: failure_mode_cluster:github_issue | https://github.com/apify/crawlee/issues/3769

## 13. Capability evidence risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Crawler hangs forever when given malformed request input (e.g. invalid `userData` shape)
- User impact: Developers may hit a documented source-backed failure mode: Crawler hangs forever when given malformed request input (e.g. invalid `userData` shape)
- Evidence: failure_mode_cluster:github_issue | https://github.com/apify/crawlee/issues/3764

## 14. Runtime risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: Add support for Bun runtime - Issue with `browser-pool` and `memory-storage` packages
- User impact: Developers may hit a documented source-backed failure mode: Add support for Bun runtime - Issue with `browser-pool` and `memory-storage` packages
- Evidence: failure_mode_cluster:github_issue | https://github.com/apify/crawlee/issues/2046

## 15. Runtime risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: Merge @crawlee/memory-storage package into @crawlee/core
- User impact: Developers may hit a documented source-backed failure mode: Merge @crawlee/memory-storage package into @crawlee/core
- Evidence: failure_mode_cluster:github_issue | https://github.com/apify/crawlee/issues/3756

## 16. Runtime risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: v3.15.2
- User impact: Upgrade or migration may change expected behavior: v3.15.2
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.2

## 17. Maintenance risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/apify/crawlee

## 18. Maintenance risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/apify/crawlee

## 19. Maintenance risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v3.13.10
- User impact: Upgrade or migration may change expected behavior: v3.13.10
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.13.10

## 20. Maintenance risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v3.13.9
- User impact: Upgrade or migration may change expected behavior: v3.13.9
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.13.9

## 21. Maintenance risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v3.14.0
- User impact: Upgrade or migration may change expected behavior: v3.14.0
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.14.0

## 22. Maintenance risk (requires verification)

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v3.14.1
- User impact: Upgrade or migration may change expected behavior: v3.14.1
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.14.1