# https://github.com/apify/crawlee Project Manual

Generated at: 2026-06-22 06:40:05 UTC

## Table of Contents

- [Overview, Architecture, and Package Layout](#page-overview)
- [Crawler Hierarchy and HTTP Clients](#page-crawlers-http)
- [Browser Pool, Launchers, and Fingerprinting](#page-browser-pool)
- [Storage, Sessions, Proxies, Autoscaling, and CLI](#page-runtime-ops)

<a id='page-overview'></a>

## Overview, Architecture, and Package Layout

### Related Pages

Related topics: [Crawler Hierarchy and HTTP Clients](#page-crawlers-http), [Browser Pool, Launchers, and Fingerprinting](#page-browser-pool), [Storage, Sessions, Proxies, Autoscaling, and CLI](#page-runtime-ops)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [package.json](https://github.com/apify/crawlee/blob/main/package.json)
- [packages/crawlee/package.json](https://github.com/apify/crawlee/blob/main/packages/crawlee/package.json)
- [packages/stagehand-crawler/package.json](https://github.com/apify/crawlee/blob/main/packages/stagehand-crawler/package.json)
- [packages/stagehand-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/stagehand-crawler/README.md)
- [packages/browser-pool/src/browser-pool.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/browser-pool.ts)
- [packages/browser-pool/src/abstract-classes/browser-controller.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-controller.ts)
- [packages/playwright-crawler/src/internals/utils/playwright-utils.ts](https://github.com/apify/crawlee/blob/main/packages/playwright-crawler/src/internals/utils/playwright-utils.ts)
- [packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts](https://github.com/apify/crawlee/blob/main/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts)
- [packages/templates/templates/empty-ts/package.json](https://github.com/apify/crawlee/blob/main/packages/templates/templates/empty-ts/package.json)
- [packages/templates/templates/cheerio-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/cheerio-ts/README.md)
- [packages/templates/templates/playwright-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/playwright-ts/README.md)
- [packages/templates/templates/puppeteer-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/puppeteer-ts/README.md)
- [packages/templates/templates/camoufox-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/camoufox-ts/README.md)
- [website/src/pages/js.js](https://github.com/apify/crawlee/blob/main/website/src/pages/js.js)
</details>

# Overview, Architecture, and Package Layout

Crawlee is a Node.js / TypeScript library for building reliable web scrapers and crawlers. The monorepo hosts a layered set of packages: a thin meta-package (`crawlee`) that re-exports the rest, a `core` engine, a pluggable `browser-pool` for headless automation, and crawler packages that bind the engine to specific HTTP or browser libraries. The latest published line is the `3.17.x` series (current at `3.17.0`, released 2026-06-04), with backports and fixes continuing through `3.15.x`. Source: [package.json:1-120]()

## Purpose and Scope

Crawlee's purpose is to give developers a single, batteries-included API for crawling the web — from static HTML scraped with Cheerio, to fully rendered pages driven by Playwright or Puppeteer, to AI-driven flows using Stagehand. The official site frames it as "one API for headless and HTTP": users can switch between HTTP-only and headless crawlers without rewriting their handler code, and the Adaptive crawler can decide at runtime whether JavaScript rendering is required. Source: [website/src/pages/js.js:1-80]()

Beyond crawling, the project publishes ready-made project templates so new users can scaffold a scraper with one command. Templates are available for Cheerio, Playwright, Puppeteer, and Camoufox, in both JavaScript and TypeScript variants. Source: [packages/templates/templates/cheerio-ts/README.md:1-10]() Source: [packages/templates/templates/playwright-ts/README.md:1-10]() Source: [packages/templates/templates/puppeteer-ts/README.md:1-10]() Source: [packages/templates/templates/camoufox-ts/README.md:1-10]()

## Repository Layout and Package Organization

The repository is a Yarn 4 + Lerna monorepo with Turborepo for task running, declared at the root. Node `24.17.0` is pinned via Volta, and Playwright `1.61.0` and Puppeteer `24.36.1` are aligned in `resolutions` to keep browser engines consistent across packages. Source: [package.json:1-120]()

The `packages/` directory contains the published libraries. The umbrella package `crawlee` simply re-exports a curated set of first-party packages (`@crawlee/basic`, `@crawlee/browser`, `@crawlee/browser-pool`, `@crawlee/cheerio`, `@crawlee/cli`, `@crawlee/core`, `@crawlee/http`, `@crawlee/jsdom`, `@crawlee/linkedom`, `@crawlee/playwright`, `@crawlee/puppeteer`, and others), all versioned together at `3.17.0`. Source: [packages/crawlee/package.json:1-80]()

Specialized packages add capabilities beyond core:

| Package | Role | Notable Dependency |
|---|---|---|
| `@crawlee/browser-pool` | Lifecycle management for headless browsers | `playwright`, `puppeteer` |
| `@crawlee/stagehand` | AI-driven crawling via Stagehand | `@browserbasehq/stagehand` v3, `zod` |
| `@crawlee/templates` | Project scaffolds for new crawlers | none |

Source: [packages/stagehand-crawler/package.json:1-60]() Source: [packages/browser-pool/src/browser-pool.ts:1-40]()

Community discussion: an open proposal asks whether `@crawlee/memory-storage` should be merged into `@crawlee/core` to remove an extra dependency from the install graph (issue #3756). Separately, users have asked for a way to disable ImpitHttpClient's connection cache (issue #3769) and reported that Bun runtime support is still partial because `browser-pool` and `memory-storage` lag behind (issue #2046).

## Core Architecture

At runtime, every crawler ultimately drives an HTTP fetch (static rendering) or a browser page (dynamic rendering). The shared `core` package supplies the request queue, request list, autoscaling, retry, statistics, and storage abstractions; crawler packages consume those abstractions and add their own transport layer.

For browser-based crawlers, `@crawlee/browser-pool` sits underneath `BrowserCrawler`, `PlaywrightCrawler`, and `PuppeteerCrawler`. `BrowserPool` orchestrates launching and retiring browsers, exposing lifecycle hooks (`preLaunchHooks`, `postLaunchHooks`, `prePageCreateHooks`, `postPageCreateHooks`, `prePageCloseHooks`, `postPageCloseHooks`) so user code can tweak `launchOptions`, register contexts, or perform cleanup. Source: [packages/browser-pool/src/browser-pool.ts:1-80]()

Each live browser is wrapped by a `BrowserController` — an abstract class that holds the underlying automation-library browser, the `BrowserPlugin` that launched it, the `LaunchContext`, and an optional `proxyTier` for tiered proxy rotation. Concrete `PuppeteerController` and `PlaywrightController` subclasses add only library-specific private methods. Source: [packages/browser-pool/src/abstract-classes/browser-controller.ts:1-60]()

```mermaid
flowchart TD
    User[User code / handler] --> Crawler[Crawler class<br/>HTTP / Browser]
    Crawler --> Core["@crawlee/core<br/>RequestQueue, Autoscaler, Stats"]
    Crawler -->|HTTP path| HttpCrawler[HttpCrawler / CheerioCrawler]
    Crawler -->|Headless path| BrowserCrawler[BrowserCrawler]
    BrowserCrawler --> Pool["@crawlee/browser-pool<br/>BrowserPool"]
    Pool --> Ctrl[BrowserController]
    Ctrl --> Plugin[BrowserPlugin<br/>Playwright / Puppeteer]
    Plugin --> Browser[Chromium / Firefox / WebKit]
    Crawler -.optional AI.-> Stagehand["@crawlee/stagehand<br/>LLM-driven"]
```

Crawler-specific utility functions build on top of the page abstraction. The Playwright utilities include a `compileScript` helper that wraps user-supplied JavaScript in a function with `{ page, request }` in scope, evaluates it in a Node.js `vm` context, and throws if the compiled result is not a function. Source: [packages/playwright-crawler/src/internals/utils/playwright-utils.ts:1-60]() Puppeteer provides a parallel utility surface (e.g. intercept-and-click helpers for JS-heavy pages) documented in its utils module. Source: [packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts:1-40]()

The Stagehand package layers an LLM-driven `page.act()`, `page.extract()` (Zod-typed), and `page.observe()` on top of `BrowserCrawler`. `apiKey` semantics depend on the `env` option: under `LOCAL` it is an OpenAI/Anthropic/Google key, while under `BROWSERBASE` it is a Browserbase key. Source: [packages/stagehand-crawler/README.md:1-40]()

## Project Templates and Getting Started

Each crawler ships a paired TypeScript and JavaScript template under `packages/templates/templates/`. All templates declare the umbrella `crawlee` package (`^3.0.0`) and `tsx` for dev runs, with `typescript ~6.0.0` and `@types/node ^24.0.0` on the TypeScript side. The build pipeline is intentionally minimal: `start:dev` runs through `tsx`, `build` invokes `tsc`, and `start:prod` executes the compiled output with Node. Source: [packages/templates/templates/empty-ts/package.json:1-25]()

Template READMEs redirect users to the corresponding `crawlee.dev` guides (e.g. the Cheerio crawler tutorial, the Playwright examples page, and the PuppeteerCrawler class reference), so the templates double as curated entry points to the broader documentation set. Source: [packages/templates/templates/cheerio-js/README.md:1-10]() Source: [packages/templates/templates/playwright-js/README.md:1-10]() Source: [packages/templates/templates/puppeteer-js/README.md:1-10]()

## See Also

- Browser Pool and BrowserController internals
- Request Queue and Autoscaler in `@crawlee/core`
- AdaptiveCrawler rendering-type detection
- StagehandCrawler AI integration

---

<a id='page-crawlers-http'></a>

## Crawler Hierarchy and HTTP Clients

### Related Pages

Related topics: [Overview, Architecture, and Package Layout](#page-overview), [Browser Pool, Launchers, and Fingerprinting](#page-browser-pool), [Storage, Sessions, Proxies, Autoscaling, and CLI](#page-runtime-ops)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/basic-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/basic-crawler/README.md)
- [packages/browser-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/browser-crawler/README.md)
- [packages/cheerio-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/cheerio-crawler/README.md)
- [packages/puppeteer-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/puppeteer-crawler/README.md)
- [packages/stagehand-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/stagehand-crawler/README.md)
- [packages/http-crawler/package.json](https://github.com/apify/crawlee/blob/main/packages/http-crawler/package.json)
- [packages/core/package.json](https://github.com/apify/crawlee/blob/main/packages/core/package.json)
- [packages/crawlee/package.json](https://github.com/apify/crawlee/blob/main/packages/crawlee/package.json)
- [package.json](https://github.com/apify/crawlee/blob/main/package.json)
- [packages/utils/package.json](https://github.com/apify/crawlee/blob/main/packages/utils/package.json)
- [packages/templates/templates/cheerio-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/cheerio-ts/README.md)
- [packages/templates/templates/playwright-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/playwright-ts/README.md)
- [packages/templates/templates/puppeteer-ts/README.md](https://github.com/apify/crawlee/blob/main/packages/templates/templates/puppeteer-ts/README.md)
</details>

# Crawler Hierarchy and HTTP Clients

Crawlee organizes its crawlers as an inheritance tree rooted in a single foundation class. Every crawler, from the lightweight HTTP scrapers to the full browser automation crawlers, descends from `BasicCrawler` and reuses the same request-queue, autoscaling, retry, and storage infrastructure. On top of that foundation sit two parallel branches: the **HTTP-based crawlers** (fast, no JavaScript execution) and the **browser-based crawlers** (a full headless browser such as Chromium, Firefox, or WebKit). Choosing between them is the most consequential architectural decision a Crawlee user makes.

## Overview and Workspace Layout

The repository is a Yarn/Turbo monorepo. Source: [package.json:1-30](). The root `package.json` declares `"workspaces": ["packages/*"]`, the workspace packages are versioned together at `3.17.0`, and `turbo run build` builds them across the workspace. Each crawler ships as its own npm package under the `@crawlee/*` scope:

| Package | Purpose |
| --- | --- |
| `@crawlee/core` | Foundation shared by every crawler: `Request`, `RequestList`, `RequestQueue`, autoscaling, and storage clients. Source: [packages/core/package.json:1-15]() |
| `@crawlee/basic` | `BasicCrawler`, the root crawler class, for users who want full control over fetching. Source: [packages/basic-crawler/README.md:1-10]() |
| `@crawlee/http` | `HttpCrawler`, the base for all non-browser HTTP crawlers. Source: [packages/http-crawler/package.json:1-10]() |
| `@crawlee/cheerio` | `CheerioCrawler` — HTTP + cheerio parsing. Source: [packages/cheerio-crawler/README.md:1-10]() |
| `@crawlee/browser` | `BrowserCrawler` — headless browser base class. Source: [packages/browser-crawler/README.md:1-10]() |
| `@crawlee/puppeteer` | `PuppeteerCrawler` built on Puppeteer. Source: [packages/puppeteer-crawler/README.md:1-10]() |
| `@crawlee/playwright` | `PlaywrightCrawler` built on Playwright. Source: [packages/playwright-crawler/package.json:1-10]() |
| `@crawlee/stagehand` | AI-driven browser automation via Stagehand. Source: [packages/stagehand-crawler/README.md:1-15]() |
| `crawlee` | Meta-package re-exporting everything for a single-install experience. Source: [packages/crawlee/package.json:1-15]() |

## Crawler Class Hierarchy

The class hierarchy is intentionally narrow: one root, two specializations, and a handful of concrete crawlers on each side.

```mermaid
classDiagram
    class BasicCrawler {
        +requestHandler
        +requestList
        +requestQueue
        +autoscaling
    }
    class HttpCrawler {
        +httpClient
        +sendRequest()
    }
    class CheerioCrawler {
        +cheerio parsing
    }
    class JSDOMCrawler {
        +jsdom parsing
    }
    class BrowserCrawler {
        +browserPool
        +pre/postLaunchHooks
    }
    class PuppeteerCrawler {
        +PuppeteerCrawler
    }
    class PlaywrightCrawler {
        +PlaywrightCrawler
    }
    class StagehandCrawler {
        +page.act()
        +page.extract()
        +page.observe()
    }
    BasicCrawler <|-- HttpCrawler
    BasicCrawler <|-- BrowserCrawler
    HttpCrawler <|-- CheerioCrawler
    HttpCrawler <|-- JSDOMCrawler
    BrowserCrawler <|-- PuppeteerCrawler
    BrowserCrawler <|-- PlaywrightCrawler
    BrowserCrawler <|-- StagehandCrawler
```

`BasicCrawler` "invokes the user-provided `requestHandler` for each `Request` object," reads URLs from a `RequestList` or `RequestQueue`, and handles retries, statistics, and concurrency. Source: [packages/basic-crawler/README.md:1-10](). It is described as "a low-level tool that requires the user to implement the page download and data extraction functionality themselves." Source: [packages/basic-crawler/README.md:5-10]().
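The control flow described above can be pictured with a toy loop. This is a minimal sketch under stated assumptions, not Crawlee's implementation: `ToyRequest`, `ToyHandler`, `runToyCrawler`, and the `maxRetries` parameter are hypothetical names, and the real `BasicCrawler` adds concurrency, autoscaling, and persistent storage on top.

```typescript
// Toy model of "invoke the handler for each request, retry on failure".
// All names here are hypothetical; this is not the Crawlee API.
interface ToyRequest {
    url: string;
    retryCount: number;
}

type ToyHandler = (request: ToyRequest) => void;

interface ToyStats {
    finished: string[];
    failed: string[];
}

function runToyCrawler(urls: string[], handler: ToyHandler, maxRetries = 2): ToyStats {
    const queue: ToyRequest[] = urls.map((url) => ({ url, retryCount: 0 }));
    const stats: ToyStats = { finished: [], failed: [] };

    while (queue.length > 0) {
        const request = queue.shift()!;
        try {
            handler(request);
            stats.finished.push(request.url);
        } catch {
            if (request.retryCount < maxRetries) {
                // Re-enqueue at the back of the queue with an incremented retry counter.
                queue.push({ ...request, retryCount: request.retryCount + 1 });
            } else {
                // Retries exhausted: record the failure and move on.
                stats.failed.push(request.url);
            }
        }
    }
    return stats;
}
```

A handler that throws once and then succeeds ends up in `finished`; a handler that always throws is attempted `maxRetries + 1` times before landing in `failed`.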

## HTTP-Based Crawlers

`HttpCrawler` (in `@crawlee/http`) is the HTTP specialization. It owns the network layer — choosing the `httpClient` implementation (e.g. `ImpitHttpClient` or `GotHttpClient`) and applying timeouts, retries, and proxy rotation. The two most prominent subclasses are:

- **`CheerioCrawler`** — "downloads each URL using a plain HTTP request, parses the HTML content using Cheerio and then invokes the user-provided `requestHandler` to extract page data using a jQuery-like interface." Source: [packages/cheerio-crawler/README.md:1-10]().
- **`JSDOMCrawler`** — parses responses with a full `jsdom` DOM, useful when client-side scripts must be evaluated server-side.

The official guidance is unambiguous: if the target site does not require JavaScript, "consider using `CheerioCrawler`, which downloads the pages using raw HTTP requests and is about **10x faster**." Source: [packages/browser-crawler/README.md:3-8]() and [packages/puppeteer-crawler/README.md:3-8]().

### HTTP Client Considerations

The HTTP client layer sits between `HttpCrawler.sendRequest()` and the network. Recent release history shows this layer is actively maintained:

- v3.15.0 fixed a bug so that "`ImpitHttpClient` respects the internal `Request` timeout." Source: community context for [v3.15.0 release](https://github.com/apify/crawlee/releases/tag/v3.15.0).
- Community issue #3769 requests a `cacheClients` option on `ImpitHttpClient` to disable connection caching in `getClient()`. Source: [community context](https://github.com/apify/crawlee/issues/3769). This was an open feature request against the `@crawlee/impit-client` package when this page was generated.
- v3.17.0 added "network timeouts to `discoverValidSitemaps` to prevent indefinite hangs." Source: [v3.17.0 release](https://github.com/apify/crawlee/releases/tag/v3.17.0).

A separate robustness concern tracked by the community: a crawler can hang indefinitely if started with malformed `requestLike` input rather than failing fast. Source: [issue #3764](https://github.com/apify/crawlee/issues/3764). Operationally this means callers should validate request shapes before handing them to `crawler.run()`.
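One way to act on that advice is a pre-flight check that separates usable inputs from malformed ones before the crawler starts. The sketch below is an assumption-laden illustration: `RequestLike`, `isValidRequestLike`, and `partitionRequests` are hypothetical helpers, not Crawlee exports, and restricting URLs to `http:` / `https:` is a choice made for this example.

```typescript
// Hypothetical pre-flight check; not part of Crawlee. It accepts either a URL
// string or an object with a `url` string, and rejects everything else.
type RequestLike = string | { url: string; [key: string]: unknown };

function isHttpUrl(value: string): boolean {
    try {
        const { protocol } = new URL(value);
        return protocol === 'http:' || protocol === 'https:';
    } catch {
        return false; // `new URL` throws on strings it cannot parse
    }
}

function isValidRequestLike(input: unknown): input is RequestLike {
    if (typeof input === 'string') return isHttpUrl(input);
    if (typeof input === 'object' && input !== null && 'url' in input) {
        const { url } = input as { url: unknown };
        return typeof url === 'string' && isHttpUrl(url);
    }
    return false;
}

// Split a mixed batch so the caller can log or reject the invalid entries
// instead of passing them on.
function partitionRequests(inputs: unknown[]): { valid: RequestLike[]; invalid: unknown[] } {
    const valid: RequestLike[] = [];
    const invalid: unknown[] = [];
    for (const input of inputs) {
        if (isValidRequestLike(input)) valid.push(input);
        else invalid.push(input);
    }
    return { valid, invalid };
}
```

Only the `valid` array would then be handed to the crawler; failing fast on a non-empty `invalid` array is a reasonable alternative when bad input indicates a bug upstream.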

## Browser-Based Crawlers

`BrowserCrawler` is the browser counterpart of `HttpCrawler`. It pulls a browser instance from a `browserPool`, runs the user handler inside a Playwright/Puppeteer `Page` context, and tears the page down afterward. Source: [packages/browser-crawler/README.md:1-10]().

- **`PuppeteerCrawler`** — "uses headless Chrome to download web pages and extract data" via Puppeteer. Source: [packages/puppeteer-crawler/README.md:1-8]().
- **`PlaywrightCrawler`** — the Playwright equivalent, exposed as a separate package. Source: [packages/playwright-crawler/package.json:1-8]().
- **`StagehandCrawler`** — "AI-powered web crawling using Stagehand ... for natural language browser automation. The enhanced page object offers `page.act()` to perform actions with plain English, `page.extract()` to get structured data with Zod schemas, and `page.observe()` to discover available actions." Source: [packages/stagehand-crawler/README.md:1-10]().

`StagehandCrawler` requires an LLM API key when run locally (`env: 'LOCAL'`) or a Browserbase key when run against the managed cloud (`env: 'BROWSERBASE'`). Source: [packages/stagehand-crawler/README.md:11-22](). The package README explicitly recommends `PlaywrightCrawler` for sites with stable selectors because it is "faster and doesn't require AI API keys." Source: [packages/stagehand-crawler/README.md:5-8]().
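The `env`-dependent meaning of the key can be restated as a small lookup. This is only a restatement of the rule in code, assuming `LOCAL` is the default environment; `StagehandEnv`, `ApiKeyKind`, and `apiKeyKind` are hypothetical names and not exports of `@crawlee/stagehand`.

```typescript
// Sketch only: these names are illustrative and do not exist in @crawlee/stagehand.
type StagehandEnv = 'LOCAL' | 'BROWSERBASE';
type ApiKeyKind = 'llm-provider-key' | 'browserbase-key';

function apiKeyKind(env: StagehandEnv = 'LOCAL'): ApiKeyKind {
    // LOCAL: the key is sent to the LLM provider (OpenAI, Anthropic, Google).
    // BROWSERBASE: the key authenticates against the Browserbase service.
    return env === 'LOCAL' ? 'llm-provider-key' : 'browserbase-key';
}
```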

Recent fixes that affect browser crawlers include: not retiring browsers with long-running `pre|postLaunchHooks` prematurely (v3.14.0), correctly applying `launchOptions` with `useIncognitoPages` (v3.15.2), and respecting storage-class config to avoid memory leaks (v3.15.1).

## Choosing a Crawler and Selecting a Template

The repository ships templates for each major crawler in `packages/templates/templates/*`, in JavaScript and TypeScript variants (e.g. `cheerio-js`, `cheerio-ts`, `playwright-ts`, `puppeteer-ts`). Each template README is a two-line pointer to the official docs and examples. Source: [packages/templates/templates/playwright-ts/README.md:1-7](), [packages/templates/templates/cheerio-ts/README.md:1-7](), [packages/templates/templates/puppeteer-ts/README.md:1-7]().

Decision rule of thumb, derived from the package READMEs:

- Static HTML, no JS, maximum throughput → `CheerioCrawler`.
- Static HTML but need a real DOM → `JSDOMCrawler`.
- JS-rendered pages with stable selectors → `PuppeteerCrawler` or `PlaywrightCrawler`.
- JS-rendered pages with fragile / changing selectors and an LLM budget → `StagehandCrawler`.
- Custom fetching (e.g. third-party API) → `BasicCrawler`.
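The rule of thumb above can be written as a small decision function. This is a sketch under stated assumptions: the returned strings are real Crawlee class names, but `SiteProfile` and `chooseCrawler` are hypothetical, `PuppeteerCrawler` is an equally valid answer wherever `PlaywrightCrawler` is returned, and falling back to `PlaywrightCrawler` when selectors are fragile but no LLM budget exists is a choice made for this example.

```typescript
// Hypothetical helper encoding the rule of thumb; not part of Crawlee.
interface SiteProfile {
    customFetching: boolean;  // e.g. a third-party API instead of web pages
    needsJavaScript: boolean; // content only appears after client-side rendering
    needsRealDom: boolean;    // static HTML, but handler code expects DOM APIs
    stableSelectors: boolean; // selectors rarely change between visits
    hasLlmBudget: boolean;    // an LLM API key and its cost are acceptable
}

type CrawlerName =
    | 'BasicCrawler'
    | 'CheerioCrawler'
    | 'JSDOMCrawler'
    | 'PlaywrightCrawler'
    | 'StagehandCrawler';

function chooseCrawler(site: SiteProfile): CrawlerName {
    if (site.customFetching) return 'BasicCrawler';
    if (!site.needsJavaScript) {
        return site.needsRealDom ? 'JSDOMCrawler' : 'CheerioCrawler';
    }
    if (!site.stableSelectors && site.hasLlmBudget) return 'StagehandCrawler';
    // Assumption: without an LLM budget, a browser crawler is still the fallback.
    return 'PlaywrightCrawler';
}
```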

## See Also

- `BasicCrawler` request handling and autoscaling — covered in [@crawlee/basic package docs](https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler).
- `Request` and `RequestQueue` model — covered in [@crawlee/core package docs](https://crawlee.dev/js/api/core/class/Request).
- Adaptive crawling (HTTP-vs-browser auto-detection) — release notes reference `AdaptivePlaywrightCrawler` fixes in v3.13.10 and v3.16.0.
- HTTP client configuration and `ImpitHttpClient` issues — see [issue #3769](https://github.com/apify/crawlee/issues/3769) and [v3.15.0 release notes](https://github.com/apify/crawlee/releases/tag/v3.15.0).

---

<a id='page-browser-pool'></a>

## Browser Pool, Launchers, and Fingerprinting

### Related Pages

Related topics: [Overview, Architecture, and Package Layout](#page-overview), [Crawler Hierarchy and HTTP Clients](#page-crawlers-http), [Storage, Sessions, Proxies, Autoscaling, and CLI](#page-runtime-ops)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/browser-pool/src/browser-pool.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/browser-pool.ts)
- [packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)
- [packages/browser-pool/README.md](https://github.com/apify/crawlee/blob/main/packages/browser-pool/README.md)
- [packages/browser-pool/src/abstract-classes/browser-plugin.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-plugin.ts)
- [packages/browser-pool/src/abstract-classes/browser-controller.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-controller.ts)
- [packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts](https://github.com/apify/crawlee/blob/main/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts)
- [packages/playwright-crawler/src/internals/utils/playwright-utils.ts](https://github.com/apify/crawlee/blob/main/packages/playwright-crawler/src/internals/utils/playwright-utils.ts)
- [packages/stagehand-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/stagehand-crawler/README.md)
- [package.json](https://github.com/apify/crawlee/blob/main/package.json)
</details>

# Browser Pool, Launchers, and Fingerprinting

## Overview and Purpose

`@crawlee/browser-pool` is a small but powerful and extensible library that lets developers control multiple headless browsers concurrently through a unified API. It addresses a recurring operational concern: executing tasks in many headless browsers and their pages without manually managing browser launches, crashes, restarts, and the rest of the browser/page lifecycle (Source: [packages/browser-pool/README.md](https://github.com/apify/crawlee/blob/main/packages/browser-pool/README.md)).

The library supports both Puppeteer and Playwright out of the box, and can be extended with custom plugins. It is consumed by Crawlee's higher-level crawlers (`PuppeteerCrawler`, `PlaywrightCrawler`, `AdaptivePlaywrightCrawler`, and `StagehandCrawler`) to manage browser instances transparently while user code focuses on page-level data extraction.

The root project pins browser-automation dependencies at known compatible versions in `package.json` (for example `playwright-core: 1.61.0` and `@puppeteer/browsers: ^3.0.4`) so that all downstream crawlers use a coherent set of browser engines (Source: [package.json](https://github.com/apify/crawlee/blob/main/package.json)).

## Core Architecture

The Browser Pool is organized around three primary abstractions:

1. **`BrowserPool`** — the central orchestrator that tracks active `LaunchContext` and `BrowserController` instances and routes new page requests to the right browser.
2. **`BrowserPlugin`** — a thin adapter wrapping a specific automation library (Puppeteer, Playwright, custom).
3. **`BrowserController`** — an abstract handle that mediates all browser-level operations (closing, retrieving pages, creating contexts).

```mermaid
flowchart LR
    User[User code / Crawler] -->|newPage| BP[BrowserPool]
    BP --> LC[LaunchContext]
    LC --> LP[preLaunchHooks]
    LP --> BL[Browser launch]
    BL --> BC[BrowserController]
    BC --> PL[postLaunchHooks]
    PL -->|newPage| Pg[Page]
    BC -->|close| Ret[Retire browser]
```

The `BrowserPool` constructor accepts a list of `BrowserPlugin` instances, allowing the pool to manage multiple browser engines simultaneously. A typical helper method, `newPageWithEachPlugin`, opens a page in every configured engine in parallel, which is useful for cross-browser testing (Source: [packages/browser-pool/src/browser-pool.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/browser-pool.ts)).

### LaunchContext

`LaunchContext` holds information about a single browser launch. It exposes the resolved `launchOptions`, the proxy URL/tier, and any user-supplied custom values added through the `extend` function. This is the recommended place to store browser-scoped values such as session IDs (Source: [packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)).

Important flags that can be set on a `LaunchContext` include:

| Option | Effect |
| --- | --- |
| `browserPerProxy` | If `true`, the pool launches a fresh browser per proxy URL. Improves isolation but may cause excessive browser spawning. |
| `useIncognitoPages` | Each page uses its own browser context, destroyed on close. |
| `experimentalContainers` | Persistent contexts (cache reuse); works best with Firefox and is unstable on Chromium. |
| `userDataDir` | Path to a User Data Directory for cookies and local storage. |
| `proxyUrl` / `proxyTier` | Routing metadata consumed by the proxy chain. |
| `ignoreSslErrors` | Ignores TLS errors from upstream proxy, useful with self-signed HTTPS proxies. |

The pool assigns each `LaunchContext` an `id` equal to the `id` of the page that triggered the launch, which makes log correlation straightforward (Source: [packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)).

## Browser Plugins and Launchers

`BrowserPlugin` is the extension point that wraps a specific automation library. Each plugin is responsible for launching a browser from a `LaunchContext` and supplying the `BrowserController` that wraps it; pages are then opened through that controller.

Two official plugins are shipped:

- **`PlaywrightPlugin`** — wraps `playwright.chromium`, `playwright.firefox`, or `playwright.webkit`.
- **`PuppeteerPlugin`** — wraps a Puppeteer installation.

The `BrowserPool` class accepts an array of plugins and will round-robin across them if multiple are configured. Pages can be requested with a specific plugin via the `browserPlugin` option in `BrowserPoolNewPageOptions` (Source: [packages/browser-pool/src/abstract-classes/browser-plugin.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-plugin.ts)).
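The selection behaviour described above can be modelled in a few lines. This is a toy sketch, not the pool's source: `ToyPlugin`, `ToyPool`, and `pickPlugin` are hypothetical names, and the assumption that an explicit request leaves the round-robin position untouched is made for this example only.

```typescript
// Toy model of round-robin plugin selection with an optional explicit override.
// Hypothetical names; the real BrowserPool does considerably more.
interface ToyPlugin {
    name: string;
}

class ToyPool {
    private readonly plugins: ToyPlugin[];
    private nextIndex = 0;

    constructor(plugins: ToyPlugin[]) {
        if (plugins.length === 0) throw new Error('At least one plugin is required.');
        this.plugins = plugins;
    }

    pickPlugin(requested?: ToyPlugin): ToyPlugin {
        if (requested) {
            // An explicit request must name a plugin the pool was configured with.
            if (!this.plugins.includes(requested)) {
                throw new Error(`Plugin "${requested.name}" is not registered in this pool.`);
            }
            return requested;
        }
        // Otherwise rotate through the configured plugins in order.
        const plugin = this.plugins[this.nextIndex]!;
        this.nextIndex = (this.nextIndex + 1) % this.plugins.length;
        return plugin;
    }
}
```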

The `StagehandCrawler` extends `BrowserCrawler` and adds an AI-driven layer on top of the standard pool: a `StagehandController` exposes `page.act()`, `page.extract()`, and `page.observe()` for natural-language browser interaction. The `apiKey` is interpreted as an LLM provider key when `env: 'LOCAL'` (the default) or as a Browserbase API key when `env: 'BROWSERBASE'` (Source: [packages/stagehand-crawler/README.md](https://github.com/apify/crawlee/blob/main/packages/stagehand-crawler/README.md)).

## Lifecycle Hooks and Fingerprinting

The pool exposes several lifecycle hooks configurable in the `BrowserPool` constructor options: `preLaunchHooks`, `postLaunchHooks`, `prePageCreateHooks`, `postPageCreateHooks`, `prePageCloseHooks`, and `postPageCloseHooks`. Each hook receives the page ID plus the relevant `LaunchContext` or `BrowserController` and can mutate launch options, attach listeners, or schedule cleanup (Source: [packages/browser-pool/src/browser-pool.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/browser-pool.ts)).
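To make the ordering of those six hooks concrete, the sketch below walks one page through a simulated lifecycle. It is a simplified model under stated assumptions: all `Toy*` types and `simulatePageLifecycle` are hypothetical, the hooks are synchronous here although real hooks may be asynchronous, and every hook receives just a page ID and one target object.

```typescript
// Toy model of hook ordering for a single page; not the BrowserPool implementation.
type ToyHook<T> = (pageId: string, target: T) => void;

interface ToyLaunchContext {
    launchOptions: Record<string, unknown>;
}

interface ToyBrowserController {
    launchContext: ToyLaunchContext;
}

interface ToyPoolHooks {
    preLaunchHooks?: ToyHook<ToyLaunchContext>[];
    postLaunchHooks?: ToyHook<ToyBrowserController>[];
    prePageCreateHooks?: ToyHook<ToyBrowserController>[];
    postPageCreateHooks?: ToyHook<ToyBrowserController>[];
    prePageCloseHooks?: ToyHook<ToyBrowserController>[];
    postPageCloseHooks?: ToyHook<ToyBrowserController>[];
}

function runHooks<T>(hooks: ToyHook<T>[] | undefined, pageId: string, target: T): void {
    for (const hook of hooks ?? []) hook(pageId, target);
}

function simulatePageLifecycle(pageId: string, hooks: ToyPoolHooks): ToyBrowserController {
    const launchContext: ToyLaunchContext = { launchOptions: {} };
    // Before launch: hooks may still mutate the launch options.
    runHooks(hooks.preLaunchHooks, pageId, launchContext);
    // Stand-in for the actual browser launch.
    const controller: ToyBrowserController = { launchContext };
    runHooks(hooks.postLaunchHooks, pageId, controller);
    runHooks(hooks.prePageCreateHooks, pageId, controller);
    // ... the page would be opened here ...
    runHooks(hooks.postPageCreateHooks, pageId, controller);
    // ... the request handler would run here ...
    runHooks(hooks.prePageCloseHooks, pageId, controller);
    // ... the page would be closed here ...
    runHooks(hooks.postPageCloseHooks, pageId, controller);
    return controller;
}
```

In this model a `preLaunchHooks` entry is the only place where changing `launchOptions` can still influence the launch, which mirrors why launch-option tweaks belong in pre-launch hooks.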

Fingerprinting is integrated through `BrowserFingerprintWithHeaders` from `fingerprint-generator`: the generated fingerprint, including its header set, is stored in `LaunchContext` alongside `launchOptions`, so it stays associated with the browser launched from that context. The Puppeteer utilities are a separate concern: helpers such as `enqueueLinksByClickingElements` click navigation triggers on JavaScript-heavy pages and intercept the resulting navigations (Source: [packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts](https://github.com/apify/crawlee/blob/main/packages/puppeteer-crawler/src/internals/utils/puppeteer_utils.ts)).

For Playwright, the `compileScript` helper allows executing arbitrary user-supplied function bodies inside a `vm.runInNewContext` sandbox with a deliberately emptied prototype chain, while still receiving the live `page` and `request` objects. This sandbox is **not** a fully secure boundary, so the function is intended for sanitized or trusted code only (Source: [packages/playwright-crawler/src/internals/utils/playwright-utils.ts](https://github.com/apify/crawlee/blob/main/packages/playwright-crawler/src/internals/utils/playwright-utils.ts)).
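The compile-then-check pattern can be sketched with Node's built-in `vm` module. This is an illustration of the idea rather than the helper's actual source: `compileScriptSketch`, `ScriptArgs`, and `CompiledScript` are hypothetical names, and the error messages are invented for this example.

```typescript
import * as vm from 'node:vm';

// Sketch of "compile a function body, then verify the result is a function".
// Hypothetical names; this is not the helper exported by Crawlee.
interface ScriptArgs {
    page: unknown;
    request: unknown;
}

type CompiledScript = (args: ScriptArgs) => Promise<unknown>;

function compileScriptSketch(
    body: string,
    // A prototype-less object keeps inherited properties out of the new context.
    context: Record<string, unknown> = Object.create(null),
): CompiledScript {
    // Wrap the body so the user code sees `page` and `request` as parameters.
    const source = `async ({ page, request }) => {${body}}`;
    let compiled: unknown;
    try {
        compiled = vm.runInNewContext(source, context);
    } catch (err) {
        throw new Error(`Cannot compile script: ${(err as Error).message}`);
    }
    // A body that closes the wrapper early can make the script evaluate to
    // something other than a function, so the result is checked explicitly.
    if (typeof compiled !== 'function') {
        throw new Error('Compilation result is not a function.');
    }
    return compiled as CompiledScript;
}
```

The explicit `typeof` check matters because the body is spliced into source text: input such as `}; 42 //` turns the script's final value into a number. That is also a reminder that the `vm` module is not a security boundary.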

## Known Caveats and Community Issues

Several community-reported issues are relevant when working with browser automation under the pool:

- **Bun runtime compatibility** — Running Playwright/Puppeteer crawlers under Bun currently has unresolved issues in `browser-pool` and `memory-storage` integration (Source: [Issue #2046](https://github.com/apify/crawlee/issues/2046)).
- **Premature browser retirement** — Long-running `preLaunchHooks` or `postLaunchHooks` could cause the pool to retire browsers too early. This was fixed in v3.14.0 to respect hook execution time (Source: [v3.14.0 release notes](https://github.com/apify/crawlee/releases/tag/v3.14.0)).
- **`launchOptions` with `useIncognitoPages`** — A bug caused launch options to be ignored when incognito pages were enabled. Resolved in v3.15.2 (Source: [v3.15.2 release notes](https://github.com/apify/crawlee/releases/tag/v3.15.2)).
- **Memory leak in storage classes** — A v3.15.1 fix corrected the storage class configuration to avoid memory leaks when many browser contexts are active (Source: [v3.15.1 release notes](https://github.com/apify/crawlee/releases/tag/v3.15.1)).

## See Also

- Stagehand Crawler: AI-driven browser automation built on top of the Browser Pool. See [Crawler Hierarchy and HTTP Clients](#page-crawlers-http).
- Puppeteer Crawler: `PuppeteerCrawler` integration using the Browser Pool. See [Crawler Hierarchy and HTTP Clients](#page-crawlers-http).
- Playwright Crawler: `PlaywrightCrawler` integration using the Browser Pool. See [Crawler Hierarchy and HTTP Clients](#page-crawlers-http).
- Cheerio Crawler: a non-browser HTTP crawler for static sites. See [Crawler Hierarchy and HTTP Clients](#page-crawlers-http).
- Adaptive Crawler: automatically decides between HTTP and headless rendering. See [Overview, Architecture, and Package Layout](#page-overview).

---

<a id='page-runtime-ops'></a>

## Storage, Sessions, Proxies, Autoscaling, and CLI

### Related Pages

Related topics: [Overview, Architecture, and Package Layout](#page-overview), [Crawler Hierarchy and HTTP Clients](#page-crawlers-http), [Browser Pool, Launchers, and Fingerprinting](#page-browser-pool)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/core/src/storages/storage_manager.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/storage_manager.ts)
- [packages/core/src/storages/dataset.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/dataset.ts)
- [packages/core/src/storages/key_value_store.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/key_value_store.ts)
- [packages/core/src/storages/request_list.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/request_list.ts)
- [packages/core/src/storages/request_queue.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/request_queue.ts)
- [packages/browser-pool/src/browser-pool.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/browser-pool.ts)
- [packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)
- [packages/browser-pool/src/abstract-classes/browser-controller.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-controller.ts)
- [package.json](https://github.com/apify/crawlee/blob/main/package.json)
- [website/src/pages/js.js](https://github.com/apify/crawlee/blob/main/website/src/pages/js.js)

</details>

Crawlee is a scalable web crawling and scraping library for Node.js, published as a Yarn-workspaces monorepo under `packages/*` ([package.json](https://github.com/apify/crawlee/blob/main/package.json)). Five cross-cutting subsystems form the operational backbone of every crawler built on top of it: persistent **Storage** for input and output data, **Sessions** for cookie/user-data reuse, **Proxies** for outbound IP rotation, **Autoscaling** for adapting concurrency to runtime load, and the project-level **CLI** scripts used to build and ship the library. The sections below describe each subsystem using only the evidence available in the repository.

## Architecture Overview

The following diagram shows how the five subsystems are layered on top of the crawler core. Storage is mounted by every crawler; Sessions and Proxies ride on the browser launch path; Autoscaling sits above the request queue; the CLI orchestrates the build pipeline.

```mermaid
flowchart TB
    CLI["CLI scripts<br/>(package.json)"] --> Build["turbo run build<br/>lerna workspaces"]
    Build --> Core["@crawlee/core"]
    Core --> Storage["Storage Manager<br/>Dataset / KV Store /<br/>RequestQueue / RequestList"]
    Core --> Auto["Autoscaling<br/>(AutoscaledPool)"]
    Auto --> RQ["Request Queue"]
    Storage --> RQ
    Storage --> DS["Datasets / KV Stores"]
    BrowserPool["@crawlee/browser-pool"] --> LaunchCtx["LaunchContext<br/>(userDataDir,<br/>proxyUrl, useIncognitoPages)"]
    LaunchCtx --> Sessions["Sessions"]
    LaunchCtx --> Proxies["Proxies<br/>(proxyUrl, proxyTier)"]
```

## Storage

Crawlee's storage layer is implemented as a set of pluggable classes managed by a central `StorageManager`. The manager (`storage_manager.ts`) and the four built-in storage types (`dataset.ts`, `key_value_store.ts`, `request_list.ts`, and `request_queue.ts`) live under `packages/core/src/storages/` ([packages/core/src/storages/storage_manager.ts](https://github.com/apify/crawlee/blob/main/packages/core/src/storages/storage_manager.ts)). Datasets append structured records, key-value stores hold named blobs (e.g. request state, screenshots), and request lists and request queues track URLs to visit.

In-process storage is provided by the `@crawlee/memory-storage` companion package. An open proposal asks that this package be merged directly into `@crawlee/core` to remove the extra dependency ([issue #3756](https://github.com/apify/crawlee/issues/3756)). A related fix in v3.15.1, *"use correct config for storage classes to avoid memory leaks"*, targeted how configuration is passed to these storage classes, which indicates that storage instances carry runtime configuration that must be supplied correctly at construction ([v3.15.1](https://github.com/apify/crawlee/releases/tag/v3.15.1)). `got-scraping`, declared in [packages/http-crawler/package.json](https://github.com/apify/crawlee/blob/main/packages/http-crawler/package.json), is the HTTP client used by the HTTP crawlers; it handles outbound requests and is not itself part of the storage layer.
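
The "central manager plus storage classes" arrangement can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `SketchDataset` and `SketchStorageManager` are hypothetical names invented for this example, and the caching behavior shown (one instance per storage name) is an assumption about the general pattern, not a copy of Crawlee's `StorageManager`.

```typescript
// Simplified model of an "open or create" storage cache. All names here
// (SketchDataset, SketchStorageManager) are hypothetical, not Crawlee exports.
class SketchDataset {
    private readonly items: Record<string, unknown>[] = [];

    constructor(readonly name: string) {}

    // Datasets are append-only in this model: records are pushed, never edited.
    pushData(item: Record<string, unknown>): void {
        this.items.push(item);
    }

    getData(): readonly Record<string, unknown>[] {
        return this.items;
    }
}

class SketchStorageManager<T> {
    private readonly cache = new Map<string, T>();

    constructor(private readonly create: (name: string) => T) {}

    // Return the cached instance for `name`, creating it on first use.
    openStorage(name = 'default'): T {
        let storage = this.cache.get(name);
        if (storage === undefined) {
            storage = this.create(name);
            this.cache.set(name, storage);
        }
        return storage;
    }
}

const datasets = new SketchStorageManager((name) => new SketchDataset(name));
const first = datasets.openStorage('products');
first.pushData({ url: 'https://example.com', title: 'Example' });
const second = datasets.openStorage('products');
```

In this model, `second` is the same object as `first`, so the record pushed through one handle is visible through the other. Holding long-lived instances in a cache like this is also why passing the wrong configuration at construction time can have lasting effects.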

## Sessions

Session isolation for browser crawlers is configured through the browser-pool `LaunchContext`. Three fields on `LaunchContext` govern session behaviour ([packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)):

| Field | Purpose |
| --- | --- |
| `useIncognitoPages` | By default, pages share the same browser context. When `true`, each page uses its own context that is destroyed on close or crash. |
| `experimentalContainers` | Persistent contexts for cache reuse. Works best with Firefox; unstable on Chromium. Marked `@experimental`. |
| `userDataDir` | Path to a User Data Directory that stores cookies and local storage for reuse across runs. |

The `BrowserController` mirrors the proxy fields it was launched with (`proxyTier`, `proxyUrl`) onto its instance so downstream code can inspect the resolved configuration ([packages/browser-pool/src/abstract-classes/browser-controller.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-controller.ts)). The v3.15.2 release note *"correctly apply `launchOptions` with `useIncognitoPages`"* records a fix for launch options being ignored when incognito pages were enabled, so the two settings are expected to work together ([v3.15.2](https://github.com/apify/crawlee/releases/tag/v3.15.2)). Separately, v3.15.1 fixed the configuration passed to storage classes to avoid memory leaks ([v3.15.1](https://github.com/apify/crawlee/releases/tag/v3.15.1)).
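
To make the effect of the session fields concrete, the following sketch maps a launch context to a page-isolation strategy. It is an illustrative model only: `SketchLaunchContext`, `ContextStrategy`, and `resolveContextStrategy` are hypothetical names, and the precedence between `useIncognitoPages` and `userDataDir` is an assumption made for this example rather than the browser-pool's actual logic.

```typescript
// Hypothetical subset of the session-related launch fields described above.
interface SketchLaunchContext {
    useIncognitoPages?: boolean;
    experimentalContainers?: boolean;
    userDataDir?: string;
}

type ContextStrategy = 'shared-context' | 'context-per-page' | 'persistent-profile';

// Decide how pages should be isolated. The precedence used here is an
// assumption made for illustration, not the browser-pool's real behavior.
function resolveContextStrategy(ctx: SketchLaunchContext): ContextStrategy {
    // Each page gets its own context that is discarded when the page closes.
    if (ctx.useIncognitoPages) return 'context-per-page';
    // A user data directory keeps cookies and local storage across runs.
    if (ctx.userDataDir !== undefined && ctx.userDataDir !== '') return 'persistent-profile';
    // Default: all pages share one browser context.
    return 'shared-context';
}

const defaults = resolveContextStrategy({});
const isolated = resolveContextStrategy({ useIncognitoPages: true });
const persistent = resolveContextStrategy({ userDataDir: './profile' });
```

Here `defaults` resolves to the shared context, `isolated` to one context per page, and `persistent` to a profile stored on disk.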

## Proxies

Proxy configuration is declared on `LaunchContext` and observed on `BrowserController`. From the source, `LaunchContext` exposes `proxyUrl` and `proxyTier`; `BrowserController` records `proxyTier` as `undefined` when no tiered proxy is used, and `proxyUrl` is set every time the controller uses a proxy — including tiered proxies ([packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts), [packages/browser-pool/src/abstract-classes/browser-controller.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/abstract-classes/browser-controller.ts)).

Two related fields connect proxies to the rest of the system:

- `browserPerProxy`: when `true`, the crawler respects the per-request proxy URL generated for a given request, aligning browser-based crawlers with `HttpCrawler`. The docstring warns this can cause Crawlee to launch too many browser instances ([packages/browser-pool/src/launch-context.ts](https://github.com/apify/crawlee/blob/main/packages/browser-pool/src/launch-context.ts)).
- `proxyUrls`: a list of proxy URLs, supplied through `ProxyConfiguration` rather than `LaunchContext`, that may contain `null`, meaning no proxy is used for that entry. This null-tolerance was added explicitly in v3.15.0 ([v3.15.0](https://github.com/apify/crawlee/releases/tag/v3.15.0)).
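
The meaning of a `null` entry in `proxyUrls` can be shown with a minimal round-robin rotator. This is a sketch, not Crawlee's implementation: `SketchProxyRotator` and its `newUrl` method are hypothetical, the example hostnames are placeholders, and plain round-robin rotation is assumed purely for illustration.

```typescript
// Round-robin over a proxy list in which `null` means "connect directly".
// `SketchProxyRotator` is a hypothetical helper, not part of Crawlee.
class SketchProxyRotator {
    private nextIndex = 0;

    constructor(private readonly proxyUrls: readonly (string | null)[]) {
        if (proxyUrls.length === 0) {
            throw new Error('proxyUrls must contain at least one entry');
        }
    }

    // Returns the next proxy URL, or `undefined` when the entry is `null`.
    newUrl(): string | undefined {
        const entry = this.proxyUrls[this.nextIndex];
        this.nextIndex = (this.nextIndex + 1) % this.proxyUrls.length;
        return entry ?? undefined;
    }
}

const rotator = new SketchProxyRotator([
    'http://proxy-a.example.com:8000',
    null, // this slot bypasses the proxy entirely
    'http://proxy-b.example.com:8000',
]);
const picks = [rotator.newUrl(), rotator.newUrl(), rotator.newUrl(), rotator.newUrl()];
```

The four picks are `proxy-a`, `undefined` (a direct connection), `proxy-b`, and then `proxy-a` again as the rotation wraps around.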

The companion HTTP client, `ImpitHttpClient`, ships its own connection-cache configuration. Community issue #3769 requests a `cacheClients` toggle to disable that cache for users who want strict per-request connection isolation ([issue #3769](https://github.com/apify/crawlee/issues/3769)).

## Autoscaling and CLI

Autoscaling is a headline feature on Crawlee's website. The homepage renders an *"Auto scaling"* card that links to `/js/docs/guides/scaling-crawlers`, advertising that crawlers "dynamically scale based on available resources and current load" ([website/src/pages/js.js](https://github.com/apify/crawlee/blob/main/website/src/pages/js.js)). Underneath, autoscaling is implemented by the `AutoscaledPool` class in `@crawlee/core`, which monitors system memory and CPU and adjusts concurrency accordingly. The `getMemoryInfoV2` and `getCurrentCpuTicksV2` helpers are exported from `@crawlee/utils` for this purpose ([packages/utils/src/index.ts](https://github.com/apify/crawlee/blob/main/packages/utils/src/index.ts)).
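
The idea of adapting concurrency to memory and CPU load can be sketched as a single step function. This is a deliberately simplified model: `LoadSnapshot`, `ScalingLimits`, and `nextConcurrency` are hypothetical names, and the one-step-up, one-step-down rule with a fixed overload threshold is an assumption for illustration, not the algorithm Crawlee uses.

```typescript
// Hypothetical load snapshot; a real crawler would sample these values itself.
interface LoadSnapshot {
    memoryUsedRatio: number; // share of available memory in use, 0..1
    cpuUsedRatio: number; // share of CPU in use, 0..1
}

interface ScalingLimits {
    minConcurrency: number;
    maxConcurrency: number;
    overloadedRatio: number; // above this ratio a resource counts as overloaded
}

// Step concurrency down when any resource is overloaded, otherwise step up,
// always staying within [minConcurrency, maxConcurrency].
function nextConcurrency(current: number, load: LoadSnapshot, limits: ScalingLimits): number {
    const overloaded =
        load.memoryUsedRatio > limits.overloadedRatio || load.cpuUsedRatio > limits.overloadedRatio;
    const desired = overloaded ? current - 1 : current + 1;
    return Math.min(limits.maxConcurrency, Math.max(limits.minConcurrency, desired));
}

const limits: ScalingLimits = { minConcurrency: 1, maxConcurrency: 4, overloadedRatio: 0.9 };
const scaledUp = nextConcurrency(2, { memoryUsedRatio: 0.4, cpuUsedRatio: 0.5 }, limits);
const scaledDown = nextConcurrency(2, { memoryUsedRatio: 0.95, cpuUsedRatio: 0.5 }, limits);
const capped = nextConcurrency(4, { memoryUsedRatio: 0.1, cpuUsedRatio: 0.1 }, limits);
```

With these inputs, `scaledUp` is 3, `scaledDown` is 1, and `capped` stays at the maximum of 4. Changing concurrency by one step per evaluation keeps the model stable at the cost of slower reaction to sudden load changes.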

The CLI surface is minimal but central to the development workflow. The root [package.json](https://github.com/apify/crawlee/blob/main/package.json) defines a Yarn-workspaces monorepo with `packages/*` as workspaces, managed by `lerna@^9.0.7` and `turbo@^2.1.0`. The relevant scripts are:

| Script | Command |
| --- | --- |
| `build` | `turbo run build && node ./scripts/typescript_fixes.mjs` |
| `ci:build` | Lerna/CI-aware build |
| `clean` | `turbo run clean && rimraf .turbo packages/*/.turbo packages/*/*.tsbuildinfo` |
| `prepublishOnly` | `turbo run copy` |
| `postinstall` | `npx husky install` |

These scripts orchestrate TypeScript compilation, monorepo-wide caching via Turborepo, and Husky hook installation across the workspace packages, including the crawler packages (browser, cheerio, http, playwright, puppeteer, stagehand) and the project templates ([package.json](https://github.com/apify/crawlee/blob/main/package.json)).

## See Also

- [Browser Pool and Launch Context](./browser-pool-and-launch-context.md)
- [HTTP Crawler and ImpitHttpClient](./http-crawler.md)
- [Templates and Project Scaffolding](./templates.md)
- [Autoscaling Guide](./autoscaling-guide.md)

---

## Pitfall Log

Project: apify/crawlee

Summary: Found 22 structured pitfall items, including 1 high/blocking item. Top priority: Maintenance risk - Maintenance risk requires verification.

## 1. Maintenance risk - Maintenance risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/apify/crawlee/issues/2046

## 2. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: v3.15.1
- User impact: Upgrade or migration may change expected behavior: v3.15.1
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.1

## 3. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: v3.16.0
- User impact: Upgrade or migration may change expected behavior: v3.16.0
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.16.0

## 4. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/apify/crawlee

## 5. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v3.15.0
- User impact: Upgrade or migration may change expected behavior: v3.15.0
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.0

## 6. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v3.15.3
- User impact: Upgrade or migration may change expected behavior: v3.15.3
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.3

## 7. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: v3.17.0
- User impact: Upgrade or migration may change expected behavior: v3.17.0
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.17.0

## 8. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/apify/crawlee/issues/3764

## 9. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/apify/crawlee

## 10. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: No demo is available for validation (`no_demo`).
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/apify/crawlee

## 11. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: No demo is available for validation (`no_demo`).
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/apify/crawlee

## 12. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Allow disabling ImpitHttpClient client cache
- User impact: Developers may hit a documented source-backed failure mode: Allow disabling ImpitHttpClient client cache
- Evidence: failure_mode_cluster:github_issue | https://github.com/apify/crawlee/issues/3769

## 13. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Crawler hangs forever when given malformed request input (e.g. invalid `userData` shape)
- User impact: Developers may hit a documented source-backed failure mode: Crawler hangs forever when given malformed request input (e.g. invalid `userData` shape)
- Evidence: failure_mode_cluster:github_issue | https://github.com/apify/crawlee/issues/3764

## 14. Runtime risk - Runtime risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: Add support for Bun runtime - Issue with `browser-pool` and `memory-storage` packages
- User impact: Developers may hit a documented source-backed failure mode: Add support for Bun runtime - Issue with `browser-pool` and `memory-storage` packages
- Evidence: failure_mode_cluster:github_issue | https://github.com/apify/crawlee/issues/2046

## 15. Runtime risk - Runtime risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: Merge @crawlee/memory-storage package into @crawlee/core
- User impact: Developers may hit a documented source-backed failure mode: Merge @crawlee/memory-storage package into @crawlee/core
- Evidence: failure_mode_cluster:github_issue | https://github.com/apify/crawlee/issues/3756

## 16. Runtime risk - Runtime risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: v3.15.2
- User impact: Upgrade or migration may change expected behavior: v3.15.2
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.15.2

## 17. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/apify/crawlee

## 18. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/apify/crawlee

## 19. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v3.13.10
- User impact: Upgrade or migration may change expected behavior: v3.13.10
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.13.10

## 20. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v3.13.9
- User impact: Upgrade or migration may change expected behavior: v3.13.9
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.13.9

## 21. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v3.14.0
- User impact: Upgrade or migration may change expected behavior: v3.14.0
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.14.0

## 22. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: v3.14.1
- User impact: Upgrade or migration may change expected behavior: v3.14.1
- Evidence: failure_mode_cluster:github_release | https://github.com/apify/crawlee/releases/tag/v3.14.1

