crawl4ai Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

crawl4ai

Related topics: Installation Guide, Quick Start Guide

Introduction to Crawl4AI

Overview

Crawl4AI is an open-source AI-powered web crawling framework designed to extract structured data from web pages and deliver clean, LLM-ready output. It serves as a modern alternative to traditional web scraping tools by combining intelligent crawling capabilities with AI-driven content extraction and formatting.

The project emphasizes ease of use, providing both programmatic APIs and command-line interfaces for rapid integration into data pipelines, research workflows, and AI applications.

Purpose and Scope

Crawl4AI addresses the fundamental challenge of extracting meaningful data from unstructured web content. While conventional web crawlers focus on fetching page content, Crawl4AI goes further by:

Semantic Understanding: Analyzing page content to identify and extract relevant information based on context rather than rigid selectors
Structured Output: Delivering data in formats optimized for large language model consumption, including Markdown, JSON, and structured extractions
Performance Optimization: Enabling high-throughput crawling with configurable browser automation and connection pooling
Flexibility: Supporting various output strategies including simple crawling, chunked extraction, and memory-aware processing

The scope encompasses web crawling, content extraction, link navigation, media handling, and output formatting—providing an end-to-end solution from URL input to structured data output.

Core Architecture

Crawl4AI follows a modular architecture composed of distinct processing stages:

graph TD
    A[URL Input] --> B[Crawl Strategy]
    B --> C[Browser Automation]
    C --> D[Content Extraction]
    D --> E[AI Processing]
    E --> F[Output Formatter]
    F --> G[Structured Output]
    
    C -->|JS Rendering| H[JavaScript Executor]
    D -->|Media| I[Media Handler]
    E -->|Memory| J[Memory Manager]

Processing Pipeline

Stage	Component	Description
Input	URL Parser	Validates and normalizes target URLs
Crawl	Strategy Engine	Selects crawling approach based on configuration
Render	Browser Pool	Manages headless browser instances
Extract	AI Extractor	Uses ML models to identify relevant content
Format	Output Serializer	Converts to target format (JSON/Markdown/HTML)
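
The Input stage in the table above validates and normalizes target URLs before anything is fetched. As a rough illustration only, that step might look like the standard-library sketch below. The function name `normalize_url` and the specific rules (HTTP/HTTPS only, lowercase host, fragment dropped) are assumptions made for this example, not Crawl4AI's actual URL parser.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: str) -> str:
    """Validate a URL and return a canonical form (hypothetical helper)."""
    parts = urlsplit(raw.strip())
    # urlsplit already lowercases the scheme; reject anything that is not HTTP(S)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"unsupported or malformed URL: {raw!r}")
    # Lowercase the host (this sketch assumes no case-sensitive userinfo),
    # default an empty path to "/", and drop the fragment, which never
    # reaches the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, "")
    )
```

Calling `normalize_url("HTTPS://Example.COM/Docs?q=1#top")` under these rules yields `https://example.com/Docs?q=1`, while a bare `example.com` raises `ValueError` because it has no scheme.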

Key Features

1. AI-Powered Extraction

Crawl4AI leverages machine learning models to understand page content semantically. Rather than relying solely on CSS selectors or XPath expressions, the extractor can identify:

Main article content and metadata
Structured data elements (tables, lists, forms)
Semantic sections and their relationships
Relevant versus boilerplate content

2. Multiple Output Strategies

The framework supports various extraction strategies optimized for different use cases:

Strategy	Use Case	Output
`default`	General purpose	Clean Markdown with metadata
`cosine`	Semantic clustering	Grouped content chunks
`no-cache`	Fresh data	Bypass internal caching
`passive`	Low resource	Minimal processing
`brainless`	Simple fetch	Raw HTML without AI processing
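
To make the idea behind the `cosine` row concrete, the sketch below groups text chunks by cosine similarity of their word-count vectors. It is a deliberately simple stand-in written with the standard library only: the names `cosine` and `group_chunks`, the bag-of-words representation, and the 0.3 threshold are all assumptions for illustration and say nothing about how Crawl4AI implements its strategy.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def group_chunks(chunks, threshold=0.3):
    """Greedily place each chunk in the first group whose representative
    (the group's first chunk) is similar enough; otherwise start a new group."""
    groups = []  # list of (representative vector, member chunks)
    for chunk in chunks:
        vector = Counter(chunk.lower().split())
        for representative, members in groups:
            if cosine(vector, representative) >= threshold:
                members.append(chunk)
                break
        else:
            groups.append((vector, [chunk]))
    return [members for _, members in groups]
```

A real semantic strategy would typically use embeddings rather than raw word counts; the greedy grouping here only shows the shape of the output, namely lists of related chunks.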

3. Browser Automation

Integrated headless browser support enables:

JavaScript rendering for single-page applications
Cookie and session management
Custom headers and authentication
Screenshot capture
PDF generation

4. Media Handling

Crawl4AI processes various media types during extraction:

Images: Download, compress, and embed with alt-text preservation
Videos: Extract metadata and embed URLs
Audio: Handle media references for podcasts and audio content
Documents: Process embedded PDFs and downloadable files

Installation and Setup

Prerequisites

Python 3.9 or higher
Chrome/Chromium browser (for browser automation features)
pip or poetry package manager

Basic Installation

pip install crawl4ai

Development Installation

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e ".[dev]"

Verify Installation

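import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # "async with" is only valid inside a coroutine, so wrap the check in main()
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())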

Basic Usage Patterns

Simple Crawl

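import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # metadata is a dict of page-level fields such as the title
        print(f"Title: {result.metadata.get('title')}")
        print(f"Content: {result.markdown}")

asyncio.run(main())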

Configured Extraction

import asyncio
from crawl4ai import CrawlerRunConfig, AsyncWebCrawler

# Per-crawl settings passed to arun()
config = CrawlerRunConfig(
    mode="aggressive",
    word_count_threshold=10,
    remove_hidden_text=True,
    process_iframes=True,
    scroll_delay=1.0
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )
        print(result.markdown)

asyncio.run(main())

Batch Crawling

import asyncio
from crawl4ai import AsyncWebCrawler

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async def main():
    async with AsyncWebCrawler() as crawler:
        # arun_many() crawls the URLs concurrently and returns one result per URL
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(f"URL: {result.url}")
            print(f"Status: {result.status_code}")

asyncio.run(main())

Configuration Reference

CrawlerRunConfig Parameters

Parameter	Type	Default	Description
`mode`	str	`"default"`	Extraction strategy
`headless`	bool	`True`	Run browser in headless mode
`verbose`	bool	`False`	Enable verbose logging
`text_threshold`	int		Minimum text length filter
`word_count_threshold`	int		Words per chunk threshold
`skip_download_images`	bool	`False`	Skip image downloads
`page_timeout`	int	`30000`	Page load timeout (ms)
`scroll_delay`	float	`0`	Delay between scrolls
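
The `word_count_threshold` row describes a minimum word count for content blocks. A minimal sketch of that kind of filter is shown below; the helper name `filter_blocks` and the choice of "greater than or equal" as the comparison are assumptions for illustration, not the library's implementation.

```python
def filter_blocks(blocks: list[str], word_count_threshold: int) -> list[str]:
    """Keep only blocks with at least `word_count_threshold` whitespace-separated
    words (hypothetical helper; the real comparison may differ)."""
    return [
        block for block in blocks
        if len(block.split()) >= word_count_threshold
    ]
```

With a threshold of 5, short navigation fragments such as "Home" or "Read more" are dropped while full sentences pass through, which is the usual reason to raise this setting on pages with heavy boilerplate.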

Browser Configuration

Parameter	Type	Description
`browser_type`	str	Chromium/Firefox/WebKit
`headless`	bool	Headless mode toggle
`proxy`	dict	Proxy configuration
`user_agent`	str	Custom user agent string

Project Structure

crawl4ai/
├── src/crawl4ai/          # Main package source
│   ├── core/              # Core crawling engine
│   ├── extractors/        # Content extraction strategies
│   ├── formatters/        # Output formatters
│   └── utils/             # Utility functions
├── examples/              # Usage examples
├── tests/                 # Test suite
├── docs/                  # Documentation
├── scripts/               # Build and utility scripts
└── sbom/                  # Software Bill of Materials

Dependencies and SBOM

Crawl4AI maintains a Software Bill of Materials (SBOM) documenting all direct and transitive dependencies. The SBOM is produced in CycloneDX format and regenerated on a best-effort basis by automated scripts.

Regenerating SBOM

./scripts/gen-sbom.sh

The SBOM provides visibility into the project's dependency tree, supporting security audits and license compliance verification.

Extensibility

Custom Extractors

Extend the extraction framework by implementing the base extractor interface:

from crawl4ai.extractors import BaseExtractor

class CustomExtractor(BaseExtractor):
    async def extract(self, html: str, url: str) -> dict:
        # Custom extraction logic
        return {"content": html, "custom_field": "value"}

Output Formatters

Create custom output formats by implementing the formatter interface:

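from crawl4ai import CrawlResult
from crawl4ai.formatters import BaseFormatter

class CustomFormatter(BaseFormatter):
    def format(self, result: CrawlResult) -> str:
        # Custom formatting logic: here, prefix the Markdown with the source URL
        return f"# {result.url}\n\n{result.markdown}"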

Contributing

The project welcomes contributions from the community. Developers interested in contributing should:

Fork the repository
Create a feature branch
Follow the established coding standards
Add tests for new functionality
Submit a pull request with clear documentation

Resources and Documentation

Resource	Location
Source Code	GitHub Repository
Documentation	`/docs` directory
Examples	`/examples` directory
SBOM	`/sbom` directory
Issue Tracker	GitHub Issues

License

Crawl4AI is released under open-source licensing terms. Refer to the LICENSE file in the repository for specific terms and conditions.

Source: https://github.com/unclecode/crawl4ai / Human Manual

Installation Guide

Overview

This guide covers all supported methods for installing crawl4ai, a powerful web crawling and data extraction framework. The installation process handles automatic setup of core dependencies including Playwright for browser automation, supporting multiple installation scenarios from simple pip installations to Docker containerized deployments.

System Requirements

Hardware Requirements

Component	Minimum	Recommended
RAM	4 GB	8 GB
Disk Space	2 GB	5 GB
CPU	2 cores	4+ cores

Software Prerequisites

Requirement	Version	Notes
Python	>= 3.9	Tested up to 3.12
pip	Latest	For pip installations
Docker	20.10+	For Docker installations
Chrome/Chromium	Latest	Auto-installed by Playwright

Installation Methods

Method 1: pip Installation (Recommended)

The simplest and most common installation method uses the pip package manager.

pip install crawl4ai

After pip installation, you must run the post-installation setup to configure browser dependencies:

crawl4ai-setup

This command installs Playwright browsers and configures the necessary system dependencies. Sources: crawl4ai/install.py:1-50

Method 2: Docker Installation

Docker provides an isolated environment with all dependencies pre-configured.

#### Pulling the Official Image

docker pull unclecode/crawl4ai

#### Running with Docker

docker run -d \
  --name crawl4ai \
  -p 8000:8000 \
  unclecode/crawl4ai

#### Building Custom Docker Image

You can build your own image using the provided Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "crawl4ai"]

Sources: Dockerfile

Method 3: Installation from Source

For development or customization purposes, install from the source repository:

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .

Dependencies Management

Core Dependencies

The project defines dependencies in setup.py for package distribution and requirements.txt for development.

setup.py dependencies:

Package	Purpose
playwright	Browser automation
asyncio	Async operations (part of the Python standard library)
aiohttp	HTTP client
beautifulsoup4	HTML parsing
lxml	XML/HTML processing

Sources: setup.py

Runtime Dependency Installation

The crawl4ai/install.py module handles automatic installation of runtime dependencies:

# Key installation steps in install.py
import subprocess
import sys

def install_browsers():
    subprocess.check_call([
        sys.executable, "-m", "playwright", "install", "chromium"
    ])

Sources: crawl4ai/install.py:20-30

Installing Optional Dependencies

Extra	Command	Description
All extras	`pip install crawl4ai[all]`	Install all optional packages
Dev tools	`pip install crawl4ai[dev]`	Development dependencies
LLM support	`pip install crawl4ai[llm]`	Language model integration

Installation Workflow

graph TD
    A[Start Installation] --> B{Installation Method}
    B -->|pip| C[Run pip install]
    B -->|Docker| D[Pull/Build Image]
    B -->|Source| E[Clone Repository]
    C --> F[Run post-install]
    F --> G[Install Playwright Browsers]
    D --> H[Run Container]
    E --> I[Install in Editable Mode]
    I --> F
    G --> J[Verify Installation]
    H --> J
    J --> K{Success?}
    K -->|Yes| L[Installation Complete]
    K -->|No| M[Troubleshoot]
    M --> G

Post-Installation Verification

Verify that crawl4ai is installed correctly by checking the installation status:

python -m crawl4ai --version

Or test the installation programmatically:

import crawl4ai
print(crawl4ai.__version__)
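
If importing the package is undesirable (for example in a pre-flight script), the installed distribution can also be looked up through the standard library's `importlib.metadata`. The sketch below assumes the distribution name matches the pip name `crawl4ai`; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("crawl4ai")
    if found:
        print(f"crawl4ai {found} is installed")
    else:
        print("crawl4ai is not installed in this environment")
```

This only confirms that the Python package is present; it does not check that the Playwright browsers were downloaded.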

Browser Verification

Ensure Playwright browsers are properly installed:

python -m playwright install-deps chromium
python -m playwright install chromium

Environment Configuration

Environment Variables

Variable	Default	Description
`CRAWL4AI_BROWSER_HEADLESS`	`true`	Run browser in headless mode
`CRAWL4AI_MAX_CONCURRENT`	`5`	Maximum concurrent crawls
`CRAWL4AI_CACHE_DIR`	`~/.crawl4ai/cache`	Cache directory path

Configuration File

Create ~/.crawl4ai/config.json for persistent configuration:

{
  "browser": {
    "headless": true,
    "viewport": {"width": 1920, "height": 1080}
  },
  "cache": {
    "enabled": true,
    "ttl": 3600
  }
}
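
A file like the one above usually overrides only some settings, so a loader needs to merge it over defaults. The following sketch shows one way to do that with the standard library. The names `DEFAULTS`, `merge`, and `load_config` are hypothetical, and the default values simply mirror the example file; this is not Crawl4AI's own configuration loader.

```python
import copy
import json
from pathlib import Path

# Defaults mirroring the example file above (assumed for this sketch)
DEFAULTS = {
    "browser": {"headless": True, "viewport": {"width": 1920, "height": 1080}},
    "cache": {"enabled": True, "ttl": 3600},
}


def merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` on a deep copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path="~/.crawl4ai/config.json") -> dict:
    """Load the JSON file if it exists; otherwise return the defaults."""
    config_path = Path(path).expanduser()
    if not config_path.exists():
        return copy.deepcopy(DEFAULTS)
    user_settings = json.loads(config_path.read_text(encoding="utf-8"))
    return merge(DEFAULTS, user_settings)
```

Because the merge is recursive, a file containing only `{"cache": {"ttl": 60}}` changes the TTL while leaving `cache.enabled` and the browser settings at their defaults.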

Common Installation Issues

Issue: Playwright Installation Fails

Solution: Install system dependencies manually:

# Ubuntu/Debian
apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2

# Then retry browser installation
python -m playwright install chromium

Issue: Permission Denied

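Solution: Install into a virtual environment, or pass the --user flag to pip:

python -m venv venv
source venv/bin/activate  # Linux/Mac
pip install crawl4ai

# Alternative without a virtual environment: install into the user site-packages
pip install --user crawl4ai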

Issue: Import Errors After Installation

Solution: Verify Python path and reinstall:

pip uninstall crawl4ai
pip install crawl4ai --force-reinstall

Docker-Specific Configuration

Volume Mounts

Mount local directories for persistent data:

docker run -v /path/to/data:/data unclecode/crawl4ai

Network Configuration

For web crawling behind proxies:

docker run -e HTTP_PROXY=http://proxy:8080 \
           -e HTTPS_PROXY=https://proxy:8080 \
           unclecode/crawl4ai

Upgrading crawl4ai

pip Upgrade

pip install crawl4ai --upgrade

Docker Upgrade

docker pull unclecode/crawl4ai
docker stop crawl4ai
docker rm crawl4ai
docker run -d --name crawl4ai -p 8000:8000 unclecode/crawl4ai

Source Upgrade

git pull origin main
pip install -e . --force-reinstall

Next Steps

After successful installation, proceed to:

Quick Start - Run your first crawl operation
Configuration Guide - Customize crawl behavior
API Reference - Explore available methods and options
Examples - Review usage patterns and best practices

Sources: Dockerfile

Quick Start Guide

Related topics: Async Web Crawler, Markdown Generation

Overview

The Quick Start Guide provides developers with a rapid introduction to using crawl4ai for web crawling and data extraction tasks. It serves as the entry point for users who want to begin extracting structured data from websites within minutes of installation.

Purpose and Scope

The Quick Start Guide is designed to:

Demonstrate the simplest possible usage pattern for crawling web pages
Show how to extract and structure content from HTML pages
Provide copy-paste-ready code examples for immediate experimentation
Bridge the gap between installation and production usage

Installation

Prerequisites

Requirement	Description
Python	Version 3.9 or higher
pip	Latest version recommended
Browser	Chrome/Chromium (for JavaScript rendering)

Installation Command

pip install crawl4ai

Basic Usage Pattern

The fundamental workflow in crawl4ai follows a simple pattern: create a crawler, configure it, call arun() with a URL, and process the result:

graph TD
    A[Create AsyncWebCrawler Instance] --> B[Configure Parameters]
    B --> C[Call arun Method with URL]
    C --> D[Process Result Object]
    D --> E[Extract Content/Markdown/HTML]

Hello World Example

The simplest possible usage demonstrates core functionality:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        
        if result.success:
            print(f"Content: {result.markdown}")
            print(f"Links found: {len(result.links)}")
        else:
            print(f"Crawl failed: {result.error_message}")

asyncio.run(main())

Core Components

AsyncWebCrawler

The primary entry point for all crawling operations:

Parameter	Type	Description
`verbose`	bool	Enable detailed logging output
`headless`	bool	Run browser in headless mode
`browser_type`	str	Specify browser engine

Sources: crawl4ai/__init__.py

CrawlResult Object

The return value from crawler.arun() contains extracted data:

Property	Type	Description
`success`	bool	Whether crawl completed successfully
`markdown`	str	Extracted content as markdown
`html`	str	Raw HTML content
`links`	dict	Dictionary of internal/external links
`media`	dict	Images, videos, and other media
`error_message`	str	Error details if `success` is False
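
The `links` row describes a dictionary that separates internal from external links. The sketch below shows how such a split can be computed from raw `href` values with the standard library. The function name `classify_links`, the key names `"internal"` and `"external"`, and the rule "same hostname means internal" are assumptions for illustration rather than the library's exact behavior.

```python
from urllib.parse import urljoin, urlsplit


def classify_links(page_url: str, hrefs: list[str]) -> dict:
    """Resolve hrefs against the page URL and group them by hostname."""
    base_host = urlsplit(page_url).hostname
    grouped = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parts = urlsplit(absolute)
        # Skip non-web schemes such as mailto: or javascript:
        if parts.scheme not in ("http", "https"):
            continue
        kind = "internal" if parts.hostname == base_host else "external"
        grouped[kind].append(absolute)
    return grouped
```

Note that this treats subdomains (for example `blog.example.com` versus `example.com`) as external; a production crawler may prefer to compare registered domains instead.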

Common Usage Patterns

Pattern 1: Simple Content Extraction

import asyncio
from crawl4ai import AsyncWebCrawler

async def extract_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            word_count_threshold=10
        )
        
        if result.success:
            return result.markdown
        return None

content = asyncio.run(extract_content())

Pattern 2: Batch Crawling

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_multiple(urls):
    async with AsyncWebCrawler() as crawler:
        tasks = [crawler.arun(url=url) for url in urls]
        results = await asyncio.gather(*tasks)
        return [r for r in results if r.success]

urls = ["https://example.com", "https://example.org"]
successful_results = asyncio.run(crawl_multiple(urls))
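
Pattern 2 starts every crawl at once, which can overwhelm a target site or the local machine when the URL list is long. One common way to cap concurrency is an `asyncio.Semaphore`, sketched below. The helper name `gather_bounded` and its `worker` parameter are hypothetical; `worker` stands in for any coroutine function, such as a small wrapper around a crawler call.

```python
import asyncio


async def gather_bounded(items, worker, limit=5):
    """Run `worker(item)` for every item, with at most `limit` running at once.
    Results are returned in the same order as `items`."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        # Each task waits here until one of the `limit` slots is free
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

In the batch example above, the list comprehension could be replaced by `await gather_bounded(urls, lambda u: crawler.arun(url=u), limit=3)`, assuming three simultaneous crawls is an acceptable load.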

Configuration Options

Browser Configuration

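import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Browser-level settings, applied once when the crawler starts
browser_config = BrowserConfig(
    headless=True,
    verbose=False
)

# Per-crawl settings, passed to each arun() call
run_config = CrawlerRunConfig(
    word_count_threshold=10,
    page_timeout=30000
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.success)

asyncio.run(main())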

Error Handling

Always check the success property before accessing extracted content:

result = await crawler.arun(url="https://example.com")

if result.success:
    process_data(result.markdown)
else:
    log_error(f"Crawl failed: {result.error_message}")
    handle_failure()

Next Steps

After completing the Quick Start Guide, users should explore:

Advanced extraction strategies with CSS selectors and XPath
JavaScript-heavy page crawling
Rate limiting and polite crawling practices
Integration with AI/LLM pipelines for content analysis

Sources: crawl4ai/__init__.py

System Architecture

Related topics: Browser Management, Async Web Crawler

Overview

Crawl4AI is a high-performance web crawling framework designed for AI applications. It enables efficient extraction of web content along with metadata, supporting both single-page crawling and large-scale asynchronous crawling operations. The architecture emphasizes separation of concerns between browser management, crawling logic, and result processing.

Core Components

The system is built around three primary modules that work in coordination:

Component	File	Responsibility
AsyncWebCrawler	`crawl4ai/async_webcrawler.py`	Main entry point for crawling operations
BrowserManager	`crawl4ai/browser_manager.py`	Handles browser lifecycle and page interactions
AsyncDispatcher	`crawl4ai/async_dispatcher.py`	Manages concurrent crawling tasks

Component Architecture

graph TD
    A[User / API Client] --> B[AsyncWebCrawler]
    B --> C[BrowserManager]
    B --> D[AsyncDispatcher]
    C --> E[Browser Instance]
    D --> F[Task Queue]
    F --> E
    E --> G[Content Extraction]
    G --> H[Result Models]
    H --> B

AsyncWebCrawler

The AsyncWebCrawler class serves as the primary interface for initiating crawl operations. It accepts configuration parameters and coordinates the crawling workflow.

Key Parameters

Parameter	Type	Description
config	BrowserConfig	Browser configuration for the crawl session
browser_manager	BrowserManager	Shared browser manager instance
dispatcher	AsyncDispatcher	Task dispatcher for async operations

Sources: crawl4ai/async_webcrawler.py

Workflow

graph LR
    A[Initialize Crawler] --> B[Configure Browser]
    B --> C[Create BrowserContext]
    C --> D[Navigate to URL]
    D --> E[Extract Content]
    E --> F[Return CrawlResult]

BrowserManager

The BrowserManager handles the lifecycle of browser instances, managing Chrome/Chromium processes and providing isolated contexts for crawling sessions.

Sources: crawl4ai/browser_manager.py

Browser Lifecycle

graph TD
    A[Launch Browser] --> B[Create Context]
    B --> C[Create Page]
    C --> D[Execute Crawl]
    D --> E[Close Context]
    E --> F[Repeat or Shutdown]
    F --> A

AsyncDispatcher

The AsyncDispatcher enables concurrent crawling operations, managing task queues and coordinating multiple browser contexts for parallel extraction.

Sources: crawl4ai/async_dispatcher.py

Parallel Execution Model

graph TD
    A[URL List] --> B[Dispatcher Queue]
    B --> C[Worker 1]
    B --> D[Worker 2]
    B --> E[Worker N]
    C --> F[Results Aggregator]
    D --> F
    E --> F
    F --> G[Combined Output]
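
The queue-and-workers model in the diagram can be sketched with `asyncio.Queue`. This is a generic illustration of the pattern, not the AsyncDispatcher's code: the function name `dispatch`, the `handler` parameter, and the fixed worker count are assumptions made for the example.

```python
import asyncio


async def dispatch(urls, handler, workers=3):
    """Feed URLs through a queue to a fixed pool of workers and collect
    results in input order."""
    queue = asyncio.Queue()
    for index, url in enumerate(urls):
        queue.put_nowait((index, url))
    results = [None] * len(urls)

    async def worker():
        while True:
            try:
                index, url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no work left for this worker
            results[index] = await handler(url)

    # The gather call plays the role of the "Results Aggregator" in the diagram
    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```

Storing each result at its original index keeps the combined output aligned with the input list even though workers finish in arbitrary order.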

Data Models

Results from crawling operations are structured using Pydantic models defined in models.py.

Sources: crawl4ai/models.py

Model	Purpose
CrawlResult	Container for extracted content and metadata
CrawlerRunConfig	Configuration parameters for crawl sessions

Docker Deployment Architecture

The project includes Docker deployment specifications that containerize the crawling infrastructure.

graph TD
    A[Docker Compose] --> B[Crawl4AI Container]
    A --> C[Redis Cache]
    A --> D[Chrome Browser]
    B --> C
    B --> D

Sources: deploy/docker/ARCHITECTURE.md

Technology Stack

Layer	Technology
Runtime	Python 3.9+
Browser Engine	Chrome/Chromium via Playwright
Async Framework	asyncio
Data Validation	Pydantic
Containerization	Docker

Configuration

The system supports extensive configuration options through CrawlerRunConfig, including:

JavaScript execution toggles
Memory management settings
Request throttling parameters
Content extraction strategies

Dependency Management

The project maintains a Software Bill of Materials (SBOM) for tracking dependencies and ensuring reproducible builds.

Sources: sbom/README.md

To regenerate the SBOM:

./scripts/gen-sbom.sh

Sources: crawl4ai/async_webcrawler.py

Browser Management

Related topics: Anti-Bot Detection and Proxy Management

Overview

Browser Management in crawl4ai provides a comprehensive abstraction layer for controlling and orchestrating browser instances used during web crawling and scraping operations. The system abstracts the complexity of browser automation, allowing users to focus on data extraction rather than browser lifecycle management.

Architecture Overview

The browser management system follows a modular architecture with distinct components that handle specific responsibilities:

graph TD
    A[BrowserManager] --> B[BrowserAdapter]
    A --> C[BrowserProfiler]
    B --> D[Playwright/Chromium]
    C --> E[JS Snippets]
    F[User Request] --> A
    A --> G[Crawled Result]

Core Components

BrowserManager

The central orchestrator responsible for:

Browser instance lifecycle (creation, configuration, teardown)
Session management and isolation
Resource allocation and cleanup
Coordination between adapters and profilers

Key Responsibilities:

Responsibility	Description
Instance Creation	Creates and initializes browser contexts
Configuration	Applies user-defined browser settings
Lifecycle Control	Manages startup and shutdown sequences
Pool Management	Handles browser pool for concurrent operations

Sources: crawl4ai/browser_manager.py

BrowserAdapter

The adapter pattern implementation that provides a consistent interface for interacting with different browser engines (Chromium, Firefox, WebKit) driven through Playwright.

Adapter Features:

Feature	Description
Engine Abstraction	Unified API across browser backends
Command Translation	Converts high-level commands to browser-specific instructions
Response Normalization	Standardizes browser responses

Sources: crawl4ai/browser_adapter.py

BrowserProfiler

Handles JavaScript injection and performance profiling during browser operations.

Profiler Capabilities:

Capability	Purpose
JS Injection	Execute custom JavaScript in page context
Performance Tracking	Monitor page load and execution metrics
Resource Profiling	Track network requests and responses

Sources: crawl4ai/browser_profiler.py

JavaScript Integration

The js_snippet module provides pre-built JavaScript utilities for browser automation tasks:

graph LR
    A[Browser Context] --> B[js_snippet Module]
    B --> C[DOM Manipulation]
    B --> D[Data Extraction]
    B --> E[Event Handling]

Common JS Snippet Categories:

DOM traversal and manipulation
Content extraction
Scroll management
Wait conditions
Network request interception

Sources: crawl4ai/js_snippet

Configuration Options

Browser Launch Parameters

Parameter	Type	Default	Description
headless	bool	true	Run browser in headless mode
args	list	[]	Additional browser arguments
timeout	int	30000	Navigation timeout in milliseconds
viewport	dict	{"width": 1920, "height": 1080}	Browser viewport dimensions
user_agent	str	None	Custom user agent string
proxy	dict	None	Proxy configuration

Context Options

Option	Type	Description
java_script_enabled	bool	Enable or disable JavaScript in the browser context
ignore_https_errors	bool	Ignore SSL certificate errors

Sources: docs/codebase/browser.md

Browser Lifecycle

stateDiagram-v2
    [*] --> Initializing: Create BrowserManager
    Initializing --> Launching: Launch Browser
    Launching --> Ready: Browser Context Created
    Ready --> Navigating: Load URL
    Navigating --> Ready: Page Loaded
    Ready --> Executing: Run JS/Commands
    Executing --> Ready: Commands Complete
    Ready --> Closing: Shutdown Request
    Closing --> [*]: Resources Freed
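
The state diagram above can be encoded as a small transition table, which makes illegal moves (such as navigating after shutdown) fail loudly. The sketch below is a model of the diagram only; the `State`, `TRANSITIONS`, and `Lifecycle` names are invented for this example and do not correspond to classes in the library.

```python
from enum import Enum


class State(Enum):
    INITIALIZING = "Initializing"
    LAUNCHING = "Launching"
    READY = "Ready"
    NAVIGATING = "Navigating"
    EXECUTING = "Executing"
    CLOSING = "Closing"


# Allowed moves, copied edge by edge from the diagram
TRANSITIONS = {
    State.INITIALIZING: {State.LAUNCHING},
    State.LAUNCHING: {State.READY},
    State.READY: {State.NAVIGATING, State.EXECUTING, State.CLOSING},
    State.NAVIGATING: {State.READY},
    State.EXECUTING: {State.READY},
    State.CLOSING: set(),
}


class Lifecycle:
    """Tracks the current state and rejects transitions the diagram lacks."""

    def __init__(self):
        self.state = State.INITIALIZING

    def advance(self, target: State) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.value} -> {target.value}"
            )
        self.state = target
```

Every working state returns to `Ready` before shutdown can begin, which is why `Closing` is reachable only from `Ready` in the table.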

Usage Patterns

Basic Browser Usage

from crawl4ai import BrowserManager

# Initialize browser manager
browser_mgr = BrowserManager(
    headless=True,
    viewport={"width": 1920, "height": 1080}
)

# Create browser context
context = browser_mgr.new_context()

# Use context for crawling
result = await context.goto("https://example.com")

Advanced Configuration

browser_mgr = BrowserManager(
    headless=False,
    args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-dev-shm-usage"
    ],
    timeout=60000,
    user_agent="Custom User Agent"
)

Session Management

The system supports multiple concurrent sessions through isolated browser contexts:

graph TD
    A[BrowserManager] --> B1[Session 1 Context]
    A --> B2[Session 2 Context]
    A --> B3[Session N Context]
    B1 --> C1[Page 1]
    B2 --> C2[Page 2]
    B3 --> C3[Page N]

Error Handling

The browser management system implements comprehensive error handling:

Error Type	Handling Strategy
Navigation Timeout	Retry with exponential backoff
Browser Crash	Automatic restart and context recreation
Resource Exhaustion	Automatic cleanup of stale contexts
Network Errors	Graceful degradation with cached content
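
The "retry with exponential backoff" strategy listed for navigation timeouts can be sketched as follows. This is a generic asyncio pattern under stated assumptions: the helper name `retry_with_backoff`, the default of three attempts, and the doubling factor are illustrative choices, not values taken from the library.

```python
import asyncio


async def retry_with_backoff(
    operation,
    attempts=3,
    base_delay=0.5,
    factor=2.0,
    retry_on=(asyncio.TimeoutError,),
):
    """Await `operation()` until it succeeds, sleeping base_delay,
    base_delay * factor, ... between failed attempts."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)
            delay *= factor
```

Only the exception types in `retry_on` trigger a retry, so programming errors still fail immediately instead of being retried.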

Performance Considerations

Optimization Strategies

Context Reuse: Reuse browser contexts for multiple pages when possible
Lazy Loading: Only load resources when explicitly requested
Resource Limits: Configure memory and CPU limits per context
Connection Pooling: Maintain warm browser instances for rapid access

Memory Management

Strategy	Description
Context Isolation	Each session runs in isolated context
Automatic Cleanup	Temporary files and caches cleared automatically
Resource Limits	Configurable memory caps per browser instance

Sources: crawl4ai/browser_manager.py

Async Web Crawler

Related topics: Markdown Generation, Extraction Strategies

Overview

The Async Web Crawler is the core component of crawl4ai, providing an asynchronous, high-performance web crawling engine built on Python's asyncio framework. It enables concurrent crawling of multiple URLs with built-in caching, configurable extraction strategies, and comprehensive result handling.

The primary purpose of this module is to fetch web pages, extract meaningful content, and return structured results that include HTML, markdown, media assets, metadata, and optional AI-generated summaries. The async design allows for efficient I/O-bound operations, making it suitable for large-scale web scraping projects.

Sources: crawl4ai/async_webcrawler.py:1-50

Architecture

System Components

The async web crawler system consists of several interconnected components that work together to provide a seamless crawling experience.

graph TD
    A[AsyncWebCrawler] --> B[Browser Manager]
    A --> C[Cache Layer]
    A --> D[Extraction Strategy]
    A --> E[Result Processor]
    
    B --> F[Playwright/Chromium]
    C --> G[File System Cache]
    C --> H[Memory Cache]
    
    D --> I[LLM-based Extraction]
    D --> J[CSS/XPath Extraction]
    
    E --> K[CrawlResult]
    E --> L[Raw HTML]
    E --> M[Markdown]

Sources: crawl4ai/async_webcrawler.py:1-30

Core Classes

Class	File	Purpose
`AsyncWebCrawler`	async_webcrawler.py	Main crawler entry point with `arun()` and `arun_many()` methods
`BrowserConfig`	async_configs.py	Configuration for headless browser behavior
`CrawlCache`	cache_context.py	Manages caching strategies for crawled content
`CrawlResult`	types.py	Data model for returning crawl results

Sources: crawl4ai/async_configs.py:1-30

AsyncWebCrawler Class

Initialization

The AsyncWebCrawler class can be initialized with optional configuration parameters:

class AsyncWebCrawler:
    def __init__(
        self,
        config: BrowserConfig | None = None,
        verbose: bool = False
    ) -> None:

Parameters:

Parameter	Type	Default	Description
`config`	`BrowserConfig`	`None`	Browser configuration object
`verbose`	`bool`	`False`	Enable verbose logging output

Sources: crawl4ai/async_webcrawler.py:50-70

Core Methods

#### arun()

Single URL crawling with comprehensive result extraction:

async def arun(
    self,
    url: str,
    config: BrowserConfig | None = None,
    **kwargs
) -> CrawlResult: ...

Parameters:

Parameter	Type	Description
`url`	`str`	Target URL to crawl
`config`	`BrowserConfig`	Override browser configuration

Sources: crawl4ai/async_webcrawler.py:100-150

#### arun_many()

Batch crawling for multiple URLs concurrently:

async def arun_many(
    self,
    urls: list[str],
    config: BrowserConfig | None = None,
    **kwargs
) -> list[CrawlResult]: ...

Parameters:

Parameter	Type	Description
`urls`	`list[str]`	List of target URLs
`config`	`BrowserConfig`	Shared configuration for all URLs

Sources: crawl4ai/async_webcrawler.py:200-250

Context Manager Support

The AsyncWebCrawler implements the async context manager protocol for proper resource cleanup:

async def __aenter__(self) -> "AsyncWebCrawler":
    await self.start()
    return self

async def __aexit__(
    self,
    exc_type, exc_val, exc_tb
) -> None:
    await self.close()

Sources: crawl4ai/async_webcrawler.py:80-100

CrawlResult Data Model

The CrawlResult class encapsulates all information retrieved from a crawled page:

classDiagram
    class CrawlResult {
        +str url
        +str html
        +str markdown
        +list~MediaItem~ media
        +list~Link~ links
        +dict metadata
        +bool success
        +str|None error
        +dict~str, Any~ extracted_content
        +int status_code
        +datetime created_at
    }

Sources: crawl4ai/types.py:1-80

Properties

Property	Type	Description
`url`	`str`	Original request URL
`html`	`str`	Raw HTML content
`markdown`	`str`	Converted markdown content
`media`	`list[MediaItem]`	Extracted images, videos, audio
`links`	`list[Link]`	Internal and external links
`metadata`	`dict`	Page metadata (title, description)
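`success`	`bool`	Whether the crawl succeeded
`error`	`str | None`	Error message if failed
`extracted_content`	`dict[str, Any]`	Structured data produced by the extraction strategy, if one was used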
`status_code`	`int`	HTTP response status code
`created_at`	`datetime`	Timestamp of crawl operation

Sources: crawl4ai/types.py:50-100

Configuration

BrowserConfig

The BrowserConfig class provides fine-grained control over browser behavior:

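from dataclasses import dataclass, field

@dataclass
class BrowserConfig:
    headless: bool = True
    browser_type: str = "chromium"
    # A mutable default must go through default_factory in a dataclass
    viewport_size: dict = field(
        default_factory=lambda: {"width": 1920, "height": 1080}
    )
    user_agent: str | None = None
    verbose: bool = False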

Sources: crawl4ai/async_configs.py:30-80

Configuration Options

Option	Type	Default	Description
`headless`	`bool`	`True`	Run browser in headless mode
`browser_type`	`str`	`"chromium"`	Browser engine (chromium, firefox, webkit)
`viewport_size.width`	`int`	`1920`	Viewport width in pixels
`viewport_size.height`	`int`	`1080`	Viewport height in pixels
`user_agent`	`str | None`	`None`	Custom user agent string
`verbose`	`bool`	`False`	Enable debug output

Sources: crawl4ai/async_configs.py:40-90

Advanced Configuration

Additional crawling parameters can be passed via kwargs:

Parameter	Type	Description
`word_count_threshold`	`int`	Minimum word count for content extraction
`extraction_strategy`	`ExtractionStrategy`	Strategy for content extraction
`cache_mode`	`CacheMode`	Caching behavior (enabled, disabled, bypass, or read-only)
`js_enabled`	`bool`	Enable JavaScript execution
`wait_for`	`str`	CSS selector to wait for before returning
`delay_before_return_html`	`float`	Delay in seconds before capturing HTML

Sources: docs/md_v2/api/async-webcrawler.md:1-60

Caching System

Cache Modes

The crawl4ai framework implements a multi-layered caching strategy:

graph LR
    A[Request] --> B{Memory Cache}
    B -->|Hit| C[Return Cached]
    B -->|Miss| D{File System Cache}
    D -->|Hit| E[Return Cached]
    D -->|Miss| F[Fetch Remote]
    F --> G[Store in Both Layers]

Sources: crawl4ai/cache_context.py:1-50
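
The lookup order in the diagram (memory, then file system, then remote fetch) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not crawl4ai's cache implementation: the class name `TwoLayerCache`, the JSON file layout, and the `fetch` callback are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class TwoLayerCache:
    """Hypothetical two-layer cache: an in-memory dict backed by JSON files."""

    def __init__(self, cache_dir: Path) -> None:
        self.memory: dict[str, str] = {}
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        # Hash the URL so it is always a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get(self, url: str, fetch: Callable[[str], str]) -> tuple[str, str]:
        """Return (content, source); source is 'memory', 'file', or 'remote'."""
        if url in self.memory:
            return self.memory[url], "memory"
        path = self._path(url)
        if path.exists():
            content = json.loads(path.read_text(encoding="utf-8"))["content"]
            self.memory[url] = content  # promote to the faster layer
            return content, "file"
        # Miss in both layers: fetch, then store in both.
        content = fetch(url)
        self.memory[url] = content
        path.write_text(json.dumps({"content": content}), encoding="utf-8")
        return content, "remote"
```

A second `TwoLayerCache` pointed at the same directory starts with an empty memory layer, so its first lookup for a previously fetched URL is served from the file layer.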

CacheMode Enum

Mode	Description
`ENABLED`	Use cache if available, otherwise fetch and cache
`DISABLED`	Always fetch fresh content; the cache is neither read nor written
`BYPASS`	Fetch and update cache but don't read from it
`READ_ONLY`	Only read from cache, never fetch

Sources: crawl4ai/cache_context.py:30-60
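
One way to read the table is as two independent decisions per mode: may the cache be read, and may it be written. The sketch below encodes that reading. It follows the table's descriptions rather than the library's source, and `CacheModeSketch`, `should_read`, and `should_write` are hypothetical names.

```python
from enum import Enum


class CacheModeSketch(Enum):
    """Hypothetical mirror of the four modes listed in the table."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    BYPASS = "bypass"
    READ_ONLY = "read_only"


def should_read(mode: CacheModeSketch) -> bool:
    # Only ENABLED and READ_ONLY consult the cache before fetching.
    return mode in (CacheModeSketch.ENABLED, CacheModeSketch.READ_ONLY)


def should_write(mode: CacheModeSketch) -> bool:
    # ENABLED and BYPASS store freshly fetched content.
    return mode in (CacheModeSketch.ENABLED, CacheModeSketch.BYPASS)
```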

Cache Context Manager

async with cache_context(cache_mode=CacheMode.ENABLED):
    result = await crawler.arun(url="https://example.com")

Sources: crawl4ai/cache_context.py:60-90

Usage Examples

Basic Single URL Crawl

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=BrowserConfig(headless=True)
        )
        
        print(f"Success: {result.success}")
        print(f"Markdown content: {result.markdown[:500]}")

asyncio.run(main())

Sources: crawl4ai/async_webcrawler.py:150-200

Batch Crawling

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)
        
        for result in results:
            print(f"URL: {result.url}, Status: {result.status_code}")

asyncio.run(main())

Sources: docs/md_v2/api/async-webcrawler.md:60-100

With Custom Extraction Strategy

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    config = BrowserConfig(
        headless=True,
        verbose=True
    )
    
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token="your-token"
    )
    
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(
            url="https://news-site.com/article",
            extraction_strategy=strategy
        )
        
        print(result.extracted_content)

asyncio.run(main())

Sources: crawl4ai/async_configs.py:90-130

Error Handling

The crawler returns comprehensive error information through the CrawlResult object:

result = await crawler.arun(url="https://invalid-url.xyz")

if not result.success:
    print(f"Error: {result.error}")
    print(f"Status Code: {result.status_code}")

Error Scenario	`success` Value	`error` Field
Network timeout	`False`	Connection timeout message
Invalid URL	`False`	URL validation error
JavaScript error	`False`	Browser console error
HTTP 404/500	`False`	HTTP status message
Success	`True`	`None`

Sources: crawl4ai/types.py:80-120
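
When crawling in batches, it is usually convenient to separate successes from failures and decide which failures deserve a retry. The sketch below shows one possible approach. `ResultSketch` is a hypothetical stand-in carrying only the fields used here, and the set of retryable status codes is an assumption, not something crawl4ai prescribes.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: rate limiting and server-side errors are worth retrying.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


@dataclass
class ResultSketch:
    """Hypothetical stand-in for the subset of CrawlResult fields used here."""
    url: str
    success: bool
    status_code: Optional[int] = None
    error: Optional[str] = None


def split_results(results: list[ResultSketch]) -> tuple[list[ResultSketch], list[ResultSketch]]:
    """Partition results into (succeeded, failed)."""
    succeeded = [r for r in results if r.success]
    failed = [r for r in results if not r.success]
    return succeeded, failed


def is_retryable(result: ResultSketch) -> bool:
    """A failure with no status (e.g. a timeout) or a retryable status code."""
    if result.success:
        return False
    return result.status_code is None or result.status_code in RETRYABLE_STATUS
```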

Performance Considerations

Async Benefits

The async architecture provides several performance advantages:

Concurrent Requests: Multiple URLs can be crawled simultaneously
Non-blocking I/O: Browser operations don't block other tasks
Resource Efficiency: Single event loop manages all crawling tasks
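
The points above can be illustrated without a browser. The following sketch is not crawl4ai code: `fake_fetch` is a hypothetical stand-in that sleeps instead of performing network I/O, and `crawl_all` shows how a single event loop can keep several requests in flight while a semaphore caps concurrency.

```python
import asyncio


async def fake_fetch(url: str, delay: float = 0.01) -> str:
    """Hypothetical stand-in for a page fetch; sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl_all(urls: list[str], max_concurrency: int = 5) -> list[str]:
    """Fetch all URLs concurrently with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fake_fetch(url)

    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Calling `asyncio.run(crawl_all(urls))` returns one page per URL in input order, regardless of which fetch finished first.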

Best Practices

Practice	Benefit
Use `arun_many()` for batch operations	Reduces connection overhead
Enable caching for repeated URLs	Avoids redundant network requests
Set appropriate `word_count_threshold`	Reduces unnecessary processing
Use `headless=True` in production	Reduces memory usage

Sources: crawl4ai/async_webcrawler.py:250-300

Related Components

Component	File	Relationship
`ExtractionStrategy`	extraction_strategy.py	Defines how content is extracted
`MediaItem`	types.py	Represents extracted media
`Link`	types.py	Represents extracted links
`CacheBackend`	cache_backend.py	Abstract cache implementation

Sources: crawl4ai/types.py:1-30

Summary

The Async Web Crawler is the foundational building block of crawl4ai, providing:

Asynchronous operation for high-performance concurrent crawling
Flexible configuration via BrowserConfig dataclass
Comprehensive result types through CrawlResult model
Multi-layered caching with configurable modes
Extensible extraction via pluggable strategies
Production-ready error handling with detailed error reporting

This architecture enables developers to build scalable web scraping solutions while maintaining clean, readable code patterns familiar to Python async developers.

Sources: crawl4ai/async_webcrawler.py:1-50

Markdown Generation

Related topics: Extraction Strategies, Async Web Crawler

Markdown Generation is a core feature in crawl4ai that transforms raw HTML content into clean, readable Markdown format. This system provides flexible strategies for content extraction, filtering, and conversion with extensive customization options.

Overview

The Markdown Generation system converts web page HTML into structured Markdown text suitable for LLM consumption, RAG systems, or documentation purposes. It offers multiple generation strategies, content filtering capabilities, and fine-grained control over extraction behavior.

Sources: docs/md_v2/core/markdown-generation.md:1-15

Architecture

graph TD
    A[HTML Input] --> B[Content Filter Strategy]
    B --> C[HTML Processing]
    C --> D[Markdown Generation Strategy]
    D --> E[Markdown Output]
    
    F[Configuration] --> B
    F --> D
    
    G[BestProvider] --> B
    G --> D

Core Components

MarkdownGenerationStrategy

The primary abstraction for generating Markdown from HTML content.

Parameter	Type	Default	Description
`provider`	`str`	`"best"`	Content extraction provider
`configs`	`dict`	`{}`	Provider-specific configurations
`strict`	`bool`	`False`	Raise errors on failure
`override_system_prompt`	`str`	`None`	Custom system prompt
`override_user_prompt`	`str`	`None`	Custom user prompt

Sources: crawl4ai/markdown_generation_strategy.py:1-50

ContentFilterStrategy

Abstract base class for filtering and selecting content before Markdown conversion.

Filter	Description
`PruningContentFilter`	Removes low-value content nodes
`BM25ContentFilter`	Uses BM25 ranking for content selection
`OrgAnnContentFilter`	Organic annotation-based filtering

Sources: crawl4ai/content_filter_strategy.py:1-100

Generation Providers

Available Providers

Provider	Description
`best`	Automatically selects optimal provider
`playwright`	Uses Playwright for JavaScript rendering
`curl`	Lightweight extraction via curl
`trafilatura`	Trafilatura library extraction
`lxml`	LXML-based HTML parsing
`readability`	Mozilla Readability algorithm

Sources: crawl4ai/markdown_generation_strategy.py:50-150

Best Provider Selection

The BestProvider class intelligently selects the most appropriate extraction method based on content characteristics.

class BestProvider:
    def get_strategy(self, html: str) -> MarkdownGenerationStrategy:
        # Analyzes HTML and selects optimal provider
        pass

Sources: crawl4ai/markdown_generation_strategy.py:150-200

Workflow

graph LR
    A[Fetch HTML] --> B{Content Filter Enabled?}
    B -->|Yes| C[Apply Filter Strategy]
    B -->|No| D[Skip Filtering]
    C --> E[Generate Markdown]
    D --> E
    E --> F{Post-Processing?}
    F -->|Yes| G[Apply Custom Rules]
    F -->|No| H[Return Result]
    G --> H

Configuration Options

Generator Config

{
    "word_threshold": 50,          # Minimum words per chunk
    "language": "en",              # Content language
    "skip_internal_links": True,   # Ignore internal links
    "content_type": "markdown"     # Output format
}

BM25 Filter Config

{
    "query": "relevant keywords",
    "top_n": 5,                    # Number of chunks
    "use_stem": True               # Apply stemming
}

Sources: crawl4ai/content_filter_strategy.py:100-180
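
To make the `query` and `top_n` settings concrete, here is a small self-contained Okapi BM25 ranking over text chunks. It is a sketch of the general technique, not the `BM25ContentFilter` implementation: the function names are hypothetical, `k1=1.5` and `b=0.75` are commonly used defaults chosen here as an assumption, and stemming (`use_stem`) is not modelled.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_n(query: str, chunks: list[str], top_n: int = 5,
               k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Return up to top_n chunks with a positive BM25 score, best first."""
    docs = [tokenize(c) for c in chunks]
    if not docs:
        return []
    avg_len = (sum(len(d) for d in docs) / len(docs)) or 1.0
    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    n_docs = len(docs)
    query_terms = set(tokenize(query))

    scored = []
    for chunk, doc in zip(chunks, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, chunk))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_n] if score > 0]
```

Chunks that share no term with the query score zero and are dropped, so the result can be shorter than `top_n`.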

HTML to Markdown Conversion

html2text Module

The html2text submodule handles low-level HTML to Markdown conversion.

Method	Purpose
`handle_anchor`	Convert `<a>` tags to `[text](url)`
`handle_image`	Convert `<img>` to `![alt](src)`
`handle_heading`	Convert `<h1>`-`<h6>` to `#` - `######`
`handle_table`	Convert `<table>` to Markdown tables
`handle_code`	Preserve `<code>` and `<pre>` formatting

Sources: crawl4ai/html2text:core.py:1-100
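
The handler-per-tag idea in the table can be sketched with Python's standard `html.parser`. This is an illustration of the approach only, not the html2text module: `MiniMarkdownConverter` and `html_to_markdown` are hypothetical names, and only headings, paragraphs, links, and images are handled.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class MiniMarkdownConverter(HTMLParser):
    """Hypothetical minimal converter for a handful of tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._href = ""

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in HEADINGS:
            # <h2> becomes "## ", and so on.
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "a":
            self._href = attr.get("href") or ""
            self.parts.append("[")
        elif tag == "img":
            self.parts.append(f"![{attr.get('alt') or ''}]({attr.get('src') or ''})")
        elif tag == "p":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append(f"]({self._href})")
        elif tag in HEADINGS or tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html: str) -> str:
    converter = MiniMarkdownConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts).strip()
```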

Conversion Features

Link Preservation: External links converted to Markdown format with titles
Image Extraction: Images extracted with alt text and sources
Table Conversion: HTML tables converted to GFM tables
Code Block Handling: Syntax-aware code block extraction
List Recognition: Ordered and unordered lists properly formatted

Usage Examples

Basic Usage

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            # Provider and per-provider settings, as described above
            markdown_generator={
                "provider": "best",
                "configs": {"word_threshold": 100}
            }
        )
        print(result.markdown)

asyncio.run(main())

With Content Filter

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter

# Keep only the chunks most relevant to the query
filter_strategy = BM25ContentFilter(
    query="getting started installation",
    top_n=10
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com",
            markdown_generator={
                "provider": "playwright",
                "filter_strategy": filter_strategy
            }
        )
        print(result.markdown)

asyncio.run(main())

Advanced Configuration

Custom Prompts

Override system and user prompts for specialized extraction:

markdown_generator = {
    "override_system_prompt": "Extract only technical documentation...",
    "override_user_prompt": "Focus on API endpoints and code examples..."
}

Strict Mode

Enable strict mode to raise exceptions on extraction failures:

markdown_generator = {
    "strict": True,
    "provider": "playwright"
}

Performance Considerations

Aspect	Recommendation
Large Pages	Use `BM25ContentFilter` to reduce content
JavaScript-heavy Sites	Use `playwright` provider
Simple Pages	Use `lxml` or `trafilatura` for speed
Batch Processing	Set appropriate `word_threshold`

Error Handling

The system provides graceful degradation:

Provider Fallback: Falls back to alternative provider on failure
Strict Mode: Raises exceptions when enabled
Partial Results: Returns available content on partial failures

Sources: crawl4ai/extraction_strategy.py:50-120

Related Features

Chunking: Content can be further processed with text chunking strategies
Extraction: Works alongside extraction strategies for structured data
Cache: Generated Markdown can be cached for repeated access

Sources: docs/md_v2/core/markdown-generation.md:1-15

Extraction Strategies

Related topics: Markdown Generation, Deep Crawling Strategies

Extraction Strategies in crawl4ai define how content is parsed, structured, and extracted from crawled web pages. They form the core abstraction layer that determines whether unstructured HTML becomes meaningful, machine-readable data.

Overview

Extraction Strategies handle the transformation pipeline from raw HTML to structured output. The system supports two primary categories:

Category	Use Case	Performance
LLM-based	Complex, semantic extraction	Slower, higher accuracy
No-LLM	Fast, pattern-based extraction	Faster, rule-dependent

Sources: crawl4ai/extraction_strategy.py:1-50

Architecture

graph TD
    A[HTML Content] --> B[Content Scraping Strategy]
    B --> C[Chunking Strategy]
    C --> D{Extraction Strategy}
    D -->|LLM-based| E[LLM Strategy]
    D -->|No-LLM| F[No-LLM Strategy]
    E --> G[Structured JSON/Markdown]
    F --> G
    G --> H[Table Extraction Optional]
    H --> I[Final Output]

The extraction pipeline flows through scraping → chunking → extraction, with table extraction as an optional final step.

Sources: crawl4ai/extraction_strategy.py:50-100

LLM-Based Strategies

LLM strategies leverage large language models for semantic understanding and intelligent content extraction.

Supported Providers

Provider	Model Support	Configuration
OpenAI	GPT-4, GPT-3.5	`OPENAI_API_KEY`
Anthropic	Claude 3, Claude 2	`ANTHROPIC_API_KEY`
Azure OpenAI	Custom deployments	`AZURE_API_KEY`, `AZURE_API_BASE`
Ollama	Local models	`OLLAMA_BASE_URL`

Configuration Parameters

from typing import Optional

class LLMExtractionStrategy:
    def __init__(
        self,
        provider: str = "openai",
        model: str = "gpt-4",
        api_token: Optional[str] = None,
        system_prompt: Optional[str] = None,
        user_prompt: Optional[str] = None,
        extraction_type: str = "block",
        input_format: str = "html",
        instruction: Optional[str] = None,
    ) -> None: ...

Sources: docs/md_v2/extraction/llm-strategies.md

Extraction Types

Type	Description	Best For
`block`	Block-level extraction	Paragraphs, sections
`schema`	Schema-based extraction	Structured data, forms
`custom`	Custom instructions	Specific extraction needs

Sources: crawl4ai/extraction_strategy.py:100-150

No-LLM Strategies

No-LLM strategies provide fast, deterministic extraction without external API dependencies.

Available Strategies

Strategy	Purpose
`NoExtractionStrategy`	Pass-through, no extraction
`JsonCssExtractionStrategy`	CSS selector-based JSON extraction
`RegexExtractionStrategy`	Regex pattern matching
`XPathExtractionStrategy`	XPath-based extraction

Sources: docs/md_v2/extraction/no-llm-strategies.md

JsonCssExtractionStrategy

from crawl4ai import JsonCssExtractionStrategy

strategy = JsonCssExtractionStrategy(
    schema={
        "name": "ProductList",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h2.title", "type": "text"},
            {"name": "price", "selector": "span.price", "type": "text"},
            {"name": "image", "selector": "img", "attribute": "src"}
        ]
    }
)

Sources: crawl4ai/extraction_strategy.py:150-200

Chunking Strategies

Chunking strategies split content into manageable pieces before extraction.

Default Chunking Behavior

graph LR
    A[Large Content] --> B[Character Split]
    B --> C[Overlap Application]
    C --> D[Token Count Check]
    D -->|Under limit| E[Chunk Ready]
    D -->|Over limit| F[Recursive Split]
    F --> E

Configuration Options

Parameter	Type	Default	Description
`chunk_token_size`	int	1000	Target tokens per chunk
`overlap`	int	100	Overlapping tokens between chunks
`max_chunk_size`	int	3000	Hard maximum chunk size
`splitting_regex`	str	`\n\n+`	Regex for splitting points

Sources: crawl4ai/chunking_strategy.py:1-80
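
The interaction between `chunk_token_size` and `overlap` is easiest to see in a sliding-window sketch. The version below is a simplification under stated assumptions: whitespace-separated words stand in for tokens, `splitting_regex` and `max_chunk_size` are not modelled, and `sliding_window_chunks` is a hypothetical name rather than a crawl4ai function.

```python
def sliding_window_chunks(text: str, chunk_token_size: int = 1000,
                          overlap: int = 100) -> list[str]:
    """Split text into word windows that share `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_token_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_token_size")
    tokens = text.split()
    step = chunk_token_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_token_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_token_size >= len(tokens):
            break
    return chunks
```

With ten words, a window of four, and an overlap of one, each chunk starts with the last word of the previous chunk, which preserves context across chunk boundaries.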

Content Scraping Strategy

The content scraping strategy determines initial content extraction from HTML.

graph TD
    A[Raw HTML] --> B{Scraping Strategy}
    B -->|BeautifulSoup| C[Parse DOM]
    B -->|Playwright| D[Dynamic Render]
    B -->|Raw| E[Minimal Processing]
    C --> F[Content Cleaned]
    D --> F
    E --> F

Strategy Selection

Strategy	JavaScript	Speed	Use Case
`BeautifulSoup`	No	Fast	Static pages
`Playwright`	Yes	Medium	SPAs, dynamic content
`RawContent`	No	Fastest	Pre-processed HTML

Sources: crawl4ai/content_scraping_strategy.py:1-60

Table Extraction

Table extraction handles tabular data structures within web pages.

from typing import List, Optional

class TableExtractionStrategy:
    def __init__(
        self,
        table_styles: Optional[List[str]] = None,
        ignore_tables: Optional[List[str]] = None,
        merge_multiple_headers: bool = False,
        extract_header: bool = True,
    ) -> None: ...

Extraction Configuration

Parameter	Type	Description
`table_styles`	List[str]	CSS classes to include as tables
`ignore_tables`	List[str]	CSS classes to exclude
`merge_multiple_headers`	bool	Merge multi-row headers
`extract_header`	bool	Include header row (default: True)

Sources: crawl4ai/table_extraction.py:1-100

Complete Pipeline Example

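import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    LLMExtractionStrategy,
    JsonCssExtractionStrategy,
)

# Schema for the CSS-based extraction (same shape as the JsonCssExtractionStrategy example)
product_schema = {
    "name": "ProductList",
    "baseSelector": "div.product",
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
    ],
}

async def main():
    async with AsyncWebCrawler() as crawler:
        # LLM-based extraction
        llm_result = await crawler.arun(
            url="https://example.com/article",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai",
                model="gpt-4",
                instruction="Extract article title, author, and key points"
            )
        )
        print(llm_result.extracted_content)

        # CSS-based extraction
        css_result = await crawler.arun(
            url="https://example.com/products",
            extraction_strategy=JsonCssExtractionStrategy(schema=product_schema)
        )
        print(css_result.extracted_content)

asyncio.run(main())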

Sources: crawl4ai/extraction_strategy.py:200-250

Strategy Selection Guide

graph TD
    A[Start] --> B{Need semantic understanding?}
    B -->|Yes| C{External API acceptable?}
    B -->|No| D[No-LLM Strategy]
    C -->|Yes| E{Local deployment needed?}
    C -->|No| F[Ollama/Local LLM]
    E -->|Yes| F
    E -->|No| G[OpenAI/Anthropic]
    D --> H{Data has tabular structure?}
    H -->|Yes| I[Add TableExtractionStrategy]
    H -->|No| J[Complete]
    G --> J
    F --> J
    I --> J

Decision Matrix

Requirement	Recommended Strategy
Simple CSS extraction	`JsonCssExtractionStrategy`
Complex semantic parsing	`LLMExtractionStrategy`
High-volume, low-latency	No-LLM strategies
Schema-agnostic	LLM-based strategies
Tabular data focus	`TableExtractionStrategy`

Sources: docs/md_v2/extraction/llm-strategies.md, docs/md_v2/extraction/no-llm-strategies.md

Environment Variables

Variable	Required For	Description
`OPENAI_API_KEY`	OpenAI LLM	API key for GPT models
`ANTHROPIC_API_KEY`	Anthropic LLM	API key for Claude models
`AZURE_API_KEY`	Azure OpenAI	Azure OpenAI API key
`OLLAMA_BASE_URL`	Local LLM	Base URL for Ollama server

Sources: crawl4ai/extraction_strategy.py:250-300

Sources: crawl4ai/extraction_strategy.py:1-50

Deep Crawling Strategies

Related topics: Extraction Strategies, Async Web Crawler

Overview

Deep Crawling Strategies in crawl4ai provide systematic approaches to traverse and extract content from websites beyond a single page. These strategies enable controlled, scalable web crawling by managing URL discovery, prioritization, filtering, and scoring mechanisms. The deep crawling module supports multiple traversal algorithms (BFS, DFS, BFF) with extensible filtering and scoring systems.

Sources: base_strategy.py:1-50

Architecture

graph TD
    A[Seed URLs] --> B[DeepCrawlingStrategy]
    B --> C[URL Filters]
    B --> D[URL Scorers]
    B --> E[Traversal Algorithm]
    C --> F[Valid URLs]
    D --> G[Prioritized URLs]
    E --> H[Crawl Queue]
    G --> H
    H --> I[Crawl4AI Extractor]
    I --> J[Extracted Content]
    J --> K[Links Extracted]
    K --> B

The architecture follows a producer-consumer pattern where the strategy continuously discovers URLs from crawled pages and feeds them back into the crawl queue based on prioritization rules.

Sources: base_strategy.py:50-100

Core Components

Base Strategy

All crawling strategies inherit from DeepCrawlingStrategy, which provides the foundational interface and shared functionality:

Property/Method	Type	Description
`url_scorer`	`URLScorer`	Scores URLs for prioritization
`url_filter`	`URLFilter`	Filters URLs for validity
`keywords`	`Set[str]`	Keywords for relevance matching
`keywords_or`	`Set[str]`	Alternative keyword matching
`max_depth`	`int`	Maximum crawl depth
`included_domains`	`Set[str]`	Allowed domains
`excluded_domains`	`Set[str]`	Blocked domains
`crawl_enabled`	`bool`	Enable/disable crawling
`check_keywords`	`Callable`	Custom keyword validation

Sources: base_strategy.py:100-150

Traversal Algorithms

BFS (Breadth-First Search) Strategy

The BFS strategy explores pages level by level, ensuring comprehensive coverage before going deeper:

graph LR
    A[Level 0: Seed] --> B[Level 1: depth=1]
    B --> C[Level 2: depth=2]
    C --> D[Level 3: depth=3]
    style A fill:#90EE90
    style B fill:#87CEEB
    style C fill:#DDA0DD
    style D fill:#F0E68C

Characteristics:

Systematic exploration of shallow depths first
Ideal for site maps and directory-style sites
Higher memory usage due to large frontier sets
Better for finding all accessible pages at shallower depths

Sources: bfs_strategy.py:1-80

Configuration Parameters:

Parameter	Type	Default	Description
`max_depth`	`int`	`3`	Maximum crawl depth
`max_pages`	`int`	`50`	Maximum pages to crawl
`priority`	`int`	`0`	Base priority score
`include_external`	`bool`	`False`	Allow external domain crawling
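
How `max_depth` and `max_pages` bound a breadth-first traversal can be shown on an in-memory link graph. This is a generic BFS sketch, not the strategy's source: `bfs_crawl` and the `get_links` callback are hypothetical, and no network access is involved.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(seed: str, get_links: Callable[[str], Iterable[str]],
              max_depth: int = 3, max_pages: int = 50) -> list[tuple[str, int]]:
    """Visit pages level by level; returns (url, depth) in visit order."""
    visited = {seed}
    order: list[tuple[str, int]] = []
    queue = deque([(seed, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, which is also why the frontier (and memory use) grows with the width of the site.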

DFS (Depth-First Search) Strategy

The DFS strategy explores as deep as possible before backtracking:

graph TD
    A[Start] --> B[Depth 1]
    B --> C[Depth 2]
    C --> D[Depth 3]
    D --> E[Backtrack]
    E --> F[Next Branch]
    F --> G[Continue Deep]
    style A fill:#90EE90
    style D fill:#FF6B6B

Characteristics:

Deep exploration of specific paths first
Lower memory footprint
Suitable for following navigation chains
Risk of getting stuck in deep site sections

Sources: dfs_strategy.py:1-80

Configuration Parameters:

Parameter	Type	Default	Description
`max_depth`	`int`	`10`	Maximum crawl depth
`max_pages`	`int`	`100`	Maximum pages to crawl
`priority`	`int`	`0`	Base priority score
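
For comparison, a depth-first traversal replaces the queue with a stack. The sketch below uses the same hypothetical `get_links` callback over an in-memory graph; `dfs_crawl` is an illustrative name, not the library's API.

```python
from typing import Callable, Iterable


def dfs_crawl(seed: str, get_links: Callable[[str], Iterable[str]],
              max_depth: int = 10, max_pages: int = 100) -> list[tuple[str, int]]:
    """Follow each branch as deep as allowed before backtracking."""
    visited: set[str] = set()
    order: list[tuple[str, int]] = []
    stack = [(seed, 0)]
    while stack and len(order) < max_pages:
        url, depth = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append((url, depth))
        if depth < max_depth:
            # Push in reverse so the first link on the page is explored first.
            for link in reversed(list(get_links(url))):
                if link not in visited:
                    stack.append((link, depth + 1))
    return order
```

The stack only holds the unexplored siblings along the current path, which is the reason for the lower memory footprint noted above.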

BFF (Best-First with Filters) Strategy

The BFF strategy combines filtering with score-based prioritization, crawling the most relevant pages first:

graph TD
    A[URL Discovered] --> B{URL Filter}
    B -->|Pass| C{Score URL}
    B -->|Fail| X[Skip]
    C --> D[Priority Queue]
    D --> E[Crawl Next Best]
    E --> F[Extract Links]
    F --> A

Characteristics:

Relevance-based crawling using keyword matching
Configurable scoring functions
Filters out irrelevant content early
Most efficient for targeted data extraction

Sources: bff_strategy.py:1-100
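
The filter, score, priority-queue loop in the diagram maps naturally onto a heap. The following is a generic best-first sketch under stated assumptions: `best_first_crawl`, `score`, and `allow` are hypothetical names, and the link graph is supplied by a callback instead of real page fetches.

```python
import heapq
from itertools import count
from typing import Callable, Iterable


def best_first_crawl(seed: str,
                     get_links: Callable[[str], Iterable[str]],
                     score: Callable[[str], float],
                     allow: Callable[[str], bool],
                     max_pages: int = 100) -> list[str]:
    """Always crawl the highest-scoring URL discovered so far."""
    tie = count()  # tie-breaker keeps insertion order for equal scores
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    heap = [(-score(seed), next(tie), seed)]
    seen = {seed}
    order: list[str] = []
    while heap and len(order) < max_pages:
        _, _, url = heapq.heappop(heap)
        order.append(url)
        for link in get_links(url):
            if link in seen or not allow(link):
                continue  # filter before scoring to avoid wasted work
            seen.add(link)
            heapq.heappush(heap, (-score(link), next(tie), link))
    return order
```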

Filtering System

The filtering system determines which URLs are eligible for crawling:

FilterChain

Multiple filters can be chained together for comprehensive URL validation:

from crawl4ai.deep_crawling.filters import FilterChain, SameDomainFilter, ExtensionFilter

filter_chain = FilterChain([
    SameDomainFilter(allowed_domains=["example.com"]),
    ExtensionFilter(excluded_extensions=[".pdf", ".zip"]),
])

Available Filters

Filter	Purpose	Key Parameters
`SameDomainFilter`	Restrict to same domain	`allowed_domains`, `strict`
`ExtensionFilter`	Block by file extension	`excluded_extensions`
`robots.txt Filter`	Respect robots directives	`user_agent`
`RegexFilter`	Custom pattern matching	`patterns`, `exclude`

Sources: filters.py:1-120
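
Conceptually, a filter chain is a conjunction of URL predicates: a URL is crawled only if every filter accepts it. The sketch below expresses that idea with plain functions and `urllib.parse`. The helper names are hypothetical and do not correspond to the classes in the table.

```python
from typing import Callable, Iterable
from urllib.parse import urlparse

UrlPredicate = Callable[[str], bool]


def same_domain(allowed_domains: Iterable[str]) -> UrlPredicate:
    """Accept URLs whose host is an allowed domain or one of its subdomains."""
    allowed = {d.lower() for d in allowed_domains}

    def check(url: str) -> bool:
        host = (urlparse(url).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in allowed)

    return check


def exclude_extensions(extensions: Iterable[str]) -> UrlPredicate:
    """Reject URLs whose path ends with one of the given extensions."""
    exts = tuple(e.lower() for e in extensions)

    def check(url: str) -> bool:
        return not urlparse(url).path.lower().endswith(exts)

    return check


def chain(*predicates: UrlPredicate) -> UrlPredicate:
    """A URL passes only if every predicate accepts it."""
    def check(url: str) -> bool:
        return all(p(url) for p in predicates)

    return check
```

Since `all()` short-circuits, placing the cheapest filter first avoids running the more expensive ones on URLs that are rejected anyway.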

Scoring System

URL scoring determines crawl priority within the queue:

ScorerChain

Scorers can be combined in a chain for multi-factor evaluation:

from crawl4ai.deep_crawling.scorers import ScorerChain, KeywordRelevanceScorer, DepthScorer

scorer = ScorerChain([
    KeywordRelevanceScorer(keywords=["api", "docs"]),
    DepthScorer(max_depth=5, decay=0.5),
])

Available Scorers

Scorer	Function	Parameters
`KeywordRelevanceScorer`	Match keywords in URL/text	`keywords`, `weight`
`DepthScorer`	Penalize deep pages	`max_depth`, `decay`
`FreshnessScorer`	Prefer recent content	`date_field`, `decay`
`CustomScorer`	User-defined scoring	`scoring_fn`

Sources: scorers.py:1-150
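
Multi-factor scoring can be sketched as a sum of independent scoring functions. The formulas here are assumptions made for illustration, not the library's definitions: keyword relevance is taken as the fraction of keywords found in the URL, and the depth score as `decay ** depth`, dropping to zero beyond `max_depth`.

```python
from typing import Callable, Iterable

Scorer = Callable[[str, int], float]  # (url, depth) -> score


def keyword_score(keywords: Iterable[str], weight: float = 1.0) -> Scorer:
    """Assumed formula: weight * (keywords found in URL / total keywords)."""
    kws = [k.lower() for k in keywords]

    def score(url: str, depth: int) -> float:
        if not kws:
            return 0.0
        hits = sum(1 for k in kws if k in url.lower())
        return weight * hits / len(kws)

    return score


def depth_score(max_depth: int = 5, decay: float = 0.5) -> Scorer:
    """Assumed formula: decay ** depth, and 0 beyond max_depth."""
    def score(url: str, depth: int) -> float:
        return 0.0 if depth > max_depth else decay ** depth

    return score


def combine(*scorers: Scorer) -> Scorer:
    """Total score is the sum of the individual scores."""
    def score(url: str, depth: int) -> float:
        return sum(s(url, depth) for s in scorers)

    return score
```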

Usage Examples

Basic BFS Crawling

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.deep_crawling import BFSDistanceStrategy

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            strategy=BFSDistanceStrategy(
                max_depth=3,
                max_pages=50
            )
        )

asyncio.run(main())

Targeted Crawling with BFF

from crawl4ai.deep_crawling import BFFStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling.filters import SameDomainFilter

strategy = BFFStrategy(
    max_depth=5,
    max_pages=100,
    keywords={"documentation", "api", "guide"},
    url_filter=SameDomainFilter(allowed_domains=["example.com"]),
    url_scorer=KeywordRelevanceScorer(keywords={"documentation"}),
)

Sources: docs/md_v2/core/deep-crawling.md:1-100

Workflow States

stateDiagram-v2
    [*] --> Initializing: Start crawl
    Initializing --> Crawling: Load seed URLs
    Crawling --> Processing: Fetch page
    Processing --> Filtering: Extract links
    Filtering --> Scoring: Validate URLs
    Scoring --> Queuing: Rank by priority
    Queuing --> Crawling: Next URL
    Crawling --> [*]: Queue empty or max reached

Strategy Selection Guide

Use Case	Recommended Strategy	Reason
Site mapping	BFS	Comprehensive shallow coverage
Documentation sites	BFS	Find all pages systematically
Article/navigation chains	DFS	Follow deep links naturally
Targeted data extraction	BFF	Prioritize relevant pages
API documentation	BFF	Filter by keyword relevance
Limited resources	DFS	Lower memory footprint

Configuration Reference

Common Parameters

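from dataclasses import dataclass
from typing import Optional, Set

from crawl4ai.deep_crawling.filters import FilterChain
from crawl4ai.deep_crawling.scorers import ScorerChain

@dataclass
class DeepCrawlConfig:
    # Scope (None means no restriction)
    included_domains: Optional[Set[str]] = None
    excluded_domains: Optional[Set[str]] = None
    
    # Limits
    max_depth: int = 3
    max_pages: int = 100
    max_total_pages: int = 1000
    
    # Filtering
    allow_external: bool = False
    check_keywords: bool = True
    
    # Scoring
    scoring: Optional[ScorerChain] = None
    filter: Optional[FilterChain] = None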

Keyword Matching

# AND matching (all keywords must match)
keywords = {"documentation", "api"}

# OR matching (any keyword matches)
keywords_or = {"guide", "tutorial", "docs"}

Sources: base_strategy.py:150-200
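
The AND/OR semantics can be made explicit in a few lines. This is a sketch of one plausible interpretation, assuming case-insensitive substring matching; `matches_keywords` is a hypothetical helper, not part of crawl4ai.

```python
from typing import Iterable


def matches_keywords(text: str,
                     keywords: Iterable[str] = (),
                     keywords_or: Iterable[str] = ()) -> bool:
    """True if every AND keyword matches and, when given, at least one OR keyword."""
    lowered = text.lower()
    or_terms = list(keywords_or)
    all_match = all(k.lower() in lowered for k in keywords)
    any_match = not or_terms or any(k.lower() in lowered for k in or_terms)
    return all_match and any_match
```

An empty AND set always matches, and an empty OR set imposes no constraint, so calling the helper with no keywords accepts every page.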

Advanced Features

Custom Filters

from crawl4ai.deep_crawling.filters import URLFilter

class CustomFilter(URLFilter):
    def should_crawl(self, url: str) -> bool:
        # Custom logic
        return "product" in url and not url.endswith(".jpg")

Custom Scorers

from crawl4ai.deep_crawling.scorers import URLScorer

class PriorityScorer(URLScorer):
    def score(self, url: str, context: dict) -> float:
        base_score = 1.0
        if "important" in url:
            base_score *= 2.0
        return base_score

Sources: filters.py:150-200, scorers.py:150-200

Best Practices

Set reasonable limits: Always configure max_pages and max_depth to prevent runaway crawling
Use BFF for targeted extraction: When you know what content you need, BFF reduces noise
Filter early, score late: Apply filters before scoring to reduce unnecessary processing
Respect robots.txt: Configure filters to respect site crawling directives
Monitor memory usage: BFS uses more memory; switch to DFS for resource-constrained environments
Combine keyword strategies: Use both keywords (AND) and keywords_or (OR) for flexible matching

Anti-Bot Detection and Proxy Management

Overview

Crawl4AI provides anti-bot detection and proxy management features for crawling sites that deploy bot protection. The detector identifies responses that indicate blocking so that an evasion strategy or fallback can be applied, and proxy rotation spreads requests across multiple IP addresses.

Architecture Overview

graph TD
    A[Client Request] --> B[AntiBotDetector]
    B --> C{Bot Detection?}
    C -->|Yes| D[Apply Evasion Strategy]
    C -->|No| E[Direct Request]
    D --> F[ProxySelector]
    F --> G[Rotating Proxies]
    G --> H[Target Website]
    H --> I{Response Valid?}
    I -->|No| J[Fallback Mechanism]
    J --> B
    I -->|Yes| K[Return Content]

Anti-Bot Detection System

Purpose and Scope

The anti-bot detection module (antibot_detector.py) analyzes responses from target websites to determine if bot protection mechanisms have been triggered. When detection occurs, the system can automatically apply evasion strategies or fall back to alternative methods.

Sources: crawl4ai/antibot_detector.py

Detection Strategies

Strategy	Description	Use Case
Header Analysis	Examines HTTP headers for bot detection signals	Standard bot checks
Content Analysis	Scans response content for CAPTCHAs or blocking messages	Challenge pages
Status Code Monitoring	Tracks HTTP status codes indicating blocks	403, 429 responses
JavaScript Challenge Detection	Identifies JS-based bot challenges	Cloudflare, PerimeterX
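A rough sketch of how several of these signals might be combined into one score follows. The function `block_score`, the weights, and the text patterns are assumptions chosen for illustration; they are not taken from antibot_detector.py.

```python
import re

BLOCK_STATUS = {403, 429}
CHALLENGE_PATTERNS = re.compile(
    r"captcha|verify you are human|access denied|checking your browser",
    re.IGNORECASE,
)


def block_score(status: int, headers: dict[str, str], body: str) -> float:
    """Combine three weak signals into a score between 0.0 and 1.0."""
    score = 0.0
    if status in BLOCK_STATUS:                         # status code monitoring
        score += 0.5
    if "retry-after" in {k.lower() for k in headers}:  # header analysis
        score += 0.25
    if CHALLENGE_PATTERNS.search(body):                # content analysis
        score += 0.25
    return min(score, 1.0)


blocked = block_score(403, {"Retry-After": "30"}, "<h1>Access denied</h1>")
clean = block_score(200, {"Content-Type": "text/html"}, "<h1>Welcome</h1>")
```

A caller could compare the score with a threshold of its own choosing to decide whether to treat the response as blocked.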

Key Components

The anti-bot detector integrates with the async configuration system to provide seamless fallback handling:

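# Pseudocode representation based on async_configs.py integration
# (illustrative only; field names may not match the library)
from dataclasses import dataclass

@dataclass
class AntiBotConfig:
    enabled: bool = True              # enable or disable detection
    detection_threshold: float = 0.7  # detection score threshold
    auto_fallback: bool = True        # fall back automatically on detection
    max_retries: int = 3              # maximum retry attempts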

Sources: crawl4ai/async_configs.py

Proxy Management System

Purpose and Scope

The proxy strategy module (proxy_strategy.py) manages proxy rotation, selection, and health checking to maintain request anonymity and distribute load across multiple IP addresses.

Sources: crawl4ai/proxy_strategy.py

Proxy Rotation Strategies

Strategy	Description	Best For
Round Robin	Sequential proxy selection	Even distribution
Random	Random proxy selection	Avoiding pattern detection
Weighted	Prioritize faster/reliable proxies	Performance optimization
Geographic	Match proxy location to target	Region-specific content
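The first three strategies in the table can be sketched with the Python standard library alone. This is a minimal illustration, not the proxy_strategy.py implementation; the reliability weights are made-up values and the example proxy URLs are placeholders.

```python
import itertools
import random

proxies = ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"]

# Round robin: cycle through the list in order.
round_robin = itertools.cycle(proxies)
first_three = [next(round_robin) for _ in range(3)]

# Random: uniform choice on every request.
rng = random.Random(0)  # seeded only to make the sketch reproducible
random_pick = rng.choice(proxies)

# Weighted: favor proxies with a higher (hypothetical) reliability weight.
weights = [0.8, 0.2]
weighted_pick = rng.choices(proxies, weights=weights, k=1)[0]
```

Geographic selection would additionally need a mapping from proxy to region, which is omitted here.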

Configuration Parameters

from dataclasses import dataclass, field
from typing import List

@dataclass
class ProxyConfig:
    proxies: List[str] = field(default_factory=list)  # list of proxy URLs
    rotation_strategy: str = "round_robin"
    health_check_interval: int = 300  # seconds
    timeout: int = 30                 # proxy timeout in seconds
    retry_on_failure: bool = True

Proxy Health Monitoring

The system continuously monitors proxy health through periodic health checks, removing failed proxies from the active pool and re-evaluating them after a cooldown period.
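The remove-then-re-evaluate cycle described above can be sketched as a small pool class. `ProxyPool` and its methods are hypothetical names; a real implementation would probably probe a proxy before readmitting it, whereas this sketch readmits it as soon as the cooldown has elapsed.

```python
import time
from typing import Optional


class ProxyPool:
    """Tracks failed proxies and readmits them after a cooldown."""

    def __init__(self, proxies: list[str], cooldown: float = 300.0):
        self.cooldown = cooldown
        self.active = list(proxies)
        self.failed_at: dict[str, float] = {}

    def mark_failed(self, proxy: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        if proxy in self.active:
            self.active.remove(proxy)
        self.failed_at[proxy] = now

    def refresh(self, now: Optional[float] = None) -> None:
        """Move proxies whose cooldown has elapsed back to the active pool."""
        now = time.monotonic() if now is None else now
        for proxy, failed_time in list(self.failed_at.items()):
            if now - failed_time >= self.cooldown:
                del self.failed_at[proxy]
                self.active.append(proxy)


pool = ProxyPool(
    ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"],
    cooldown=300.0,
)
pool.mark_failed("http://proxy1.example.com:8080", now=0.0)
active_during_cooldown = list(pool.active)
pool.refresh(now=301.0)
active_after_cooldown = list(pool.active)
```

Passing `now` explicitly keeps the example deterministic; in normal use the methods fall back to `time.monotonic()`.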

Integration with Async Configuration

Fallback Mechanisms

When anti-bot detection triggers, the system can automatically switch to fallback modes:

Proxy Fallback: Rotate to a different proxy server
Strategy Fallback: Switch to alternative crawling strategies
User-Agent Fallback: Use different browser fingerprints

Sources: docs/md_v2/advanced/anti-bot-and-fallback.md
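One way to picture the fallback modes is as an ordered list of attempt configurations that are tried until one succeeds. The names `fetch_with_fallbacks` and `fake_fetch`, and the option keys, are hypothetical; the fetcher is a stand-in so the sketch runs without network access.

```python
from typing import Callable, Optional


def fetch_with_fallbacks(
    url: str,
    attempts: list[dict],
    fetch: Callable[[str, dict], Optional[str]],
) -> tuple[Optional[str], int]:
    """Try each attempt configuration in order until one returns content."""
    for index, options in enumerate(attempts):
        content = fetch(url, options)
        if content is not None:
            return content, index
    return None, len(attempts)


def fake_fetch(url: str, options: dict) -> Optional[str]:
    # Stand-in fetcher: pretends that only the second proxy gets through.
    if options.get("proxy") == "http://proxy2.example.com:8080":
        return "<html>ok</html>"
    return None


attempts = [
    {"proxy": "http://proxy1.example.com:8080", "user_agent": "ua-a"},
    {"proxy": "http://proxy2.example.com:8080", "user_agent": "ua-a"},  # proxy fallback
    {"proxy": "http://proxy2.example.com:8080", "user_agent": "ua-b"},  # user-agent fallback
]
content, used_index = fetch_with_fallbacks("https://example.com", attempts, fake_fetch)
```

Ordering the list from cheapest to most expensive fallback keeps the common case fast.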

Configuration Example

anti_bot:
  enabled: true
  detection_sensitivity: "medium"
  auto_fallback: true
  
proxy:
  enabled: true
  strategy: "weighted"
  proxies:
    - "http://proxy1.example.com:8080"
    - "http://proxy2.example.com:8080"
  health_check:
    enabled: true
    interval: 300

Security Considerations

Proxy Security

When configuring proxies, consider the following security aspects:

Proxy Protocol: Use HTTPS proxies to encrypt traffic
Authentication: Implement proxy authentication where supported
Provider Reputation: Use trusted proxy providers
IP Rotation: Avoid predictable IP patterns

Sources: docs/md_v2/advanced/proxy-security.md

Best Practices

Practice	Description
Rate Limiting	Respect target site limits to avoid IP bans
Request Delays	Implement delays between requests
Header Randomization	Vary User-Agent and other headers
Cookie Management	Handle cookies appropriately per session
SSL Verification	Validate SSL certificates for security

Workflow Diagram

sequenceDiagram
    participant Client
    participant AntiBot as AntiBot Detector
    participant ProxyMgr as Proxy Manager
    participant Target as Target Website
    
    Client->>AntiBot: Send Request
    AntiBot->>ProxyMgr: Request Proxy
    ProxyMgr-->>AntiBot: Return Proxy
    AntiBot->>Target: Forward Request via Proxy
    
    alt Bot Detected
        Target-->>AntiBot: Bot Challenge Response
        AntiBot->>ProxyMgr: Request Different Proxy
        ProxyMgr-->>AntiBot: New Proxy
        AntiBot->>Target: Retry with New Proxy
    else Success
        Target-->>Client: Return Content
    end

Error Handling

Common Error Scenarios

Error	Cause	Resolution
`403 Forbidden`	IP blocked	Rotate proxy
`429 Too Many Requests`	Rate limited	Backoff and retry
`CAPTCHA Required`	Bot detected	Switch strategy
`Proxy Timeout`	Proxy unavailable	Health check and replace
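The table can be read as a dispatch from observed failure to recovery action. The sketch below encodes that mapping; the `Action` enum and `choose_action` function are hypothetical, and the CAPTCHA check is a simple substring test rather than real challenge detection.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ROTATE_PROXY = "rotate proxy"
    BACKOFF_AND_RETRY = "backoff and retry"
    SWITCH_STRATEGY = "switch strategy"
    REPLACE_PROXY = "health check and replace"
    NONE = "no action"


def choose_action(
    status: Optional[int], body: str = "", proxy_timed_out: bool = False
) -> Action:
    if proxy_timed_out:               # no response arrived through the proxy
        return Action.REPLACE_PROXY
    if status == 403:                 # IP blocked
        return Action.ROTATE_PROXY
    if status == 429:                 # rate limited
        return Action.BACKOFF_AND_RETRY
    if "captcha" in body.lower():     # bot challenge page
        return Action.SWITCH_STRATEGY
    return Action.NONE


actions = [
    choose_action(403),
    choose_action(429),
    choose_action(200, body="<p>CAPTCHA required</p>"),
    choose_action(None, proxy_timed_out=True),
]
```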

Retry Logic

The system implements exponential backoff for retries:

retry_config = {
    "max_attempts": 3,
    "base_delay": 1.0,      # seconds
    "max_delay": 60.0,       # seconds
    "exponential_base": 2
}
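To make the numbers concrete, the delay schedule implied by this configuration can be computed as follows. The formula (base_delay multiplied by exponential_base raised to the attempt index, capped at max_delay) is the conventional one and is an assumption about how the keys are interpreted; `backoff_delay` is a hypothetical helper.

```python
retry_config = {
    "max_attempts": 3,
    "base_delay": 1.0,   # seconds
    "max_delay": 60.0,   # seconds
    "exponential_base": 2,
}


def backoff_delay(attempt: int, config: dict) -> float:
    """Delay after the given failed attempt (0-indexed), capped at max_delay."""
    raw = config["base_delay"] * config["exponential_base"] ** attempt
    return min(raw, config["max_delay"])


# Three attempts leave two waits: after the first and second failures.
delays = [backoff_delay(n, retry_config) for n in range(retry_config["max_attempts"] - 1)]
capped = backoff_delay(10, retry_config)  # a large attempt index hits the cap
```

Many implementations also add random jitter to each delay so that concurrent clients do not retry in lockstep.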

Summary

Crawl4AI's anti-bot detection and proxy management system pairs block detection with proxy rotation to keep crawls running against protected websites. The integration between AntiBotDetector and ProxyStrategy enables automatic fallback when a request is blocked: the crawler can rotate proxies, switch strategies, or change browser fingerprints before retrying.

Sources: crawl4ai/antibot_detector.py

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: [Bug]: arun() and arun_many() type hinting needs fixing

Severity: high
Finding: Installation risk is backed by a source signal: [Bug]: arun() and arun_many() type hinting needs fixing. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1898

2. Configuration risk: [Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked as failed), no error messages or failure reason…

Severity: high
Finding: Configuration risk is backed by a source signal: [Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked as failed), no error messages or failure reason…. Treat it as a review item until the current version is checked.
User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1949

3. Configuration risk: [Bug]: MCP scrape tools lack wait_until / SPA support that REST API and CLI provide

Severity: high
Finding: Configuration risk is backed by a source signal: [Bug]: MCP scrape tools lack wait_until / SPA support that REST API and CLI provide. Treat it as a review item until the current version is checked.
User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1963

4. Configuration risk: [Bug]: `remove_empty_elements_fast()` drops trailing text when removing empty elements with non-empty .tail

Severity: high
Finding: Configuration risk is backed by a source signal: [Bug]: remove_empty_elements_fast() drops trailing text when removing empty elements with non-empty .tail. Treat it as a review item until the current version is checked.
User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1938

5. Security or permission risk: [Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-3x token overhead for CJK content

Severity: high
Finding: Security or permission risk is backed by a source signal: [Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-3x token overhead for CJK content. Treat it as a review item until the current version is checked.
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1962

6. Installation risk: [Bug] AsyncLogger writes to stdout, breaking MCP stdio transport

Severity: medium
Finding: Installation risk is backed by a source signal: [Bug] AsyncLogger writes to stdout, breaking MCP stdio transport. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1968

7. Installation risk: [Bug]: The install with pip on just about any system rarely works. It requires an env or it only partial installs

Severity: medium
Finding: Installation risk is backed by a source signal: [Bug]: The install with pip on just about any system rarely works. It requires an env or it only partial installs. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1950

8. Installation risk: [Bug]: enable_stealth=True is a silent no-op — StealthAdapter imports symbols that don't exist in playwright-stealth 2.x

Severity: medium
Finding: Installation risk is backed by a source signal: [Bug]: enable_stealth=True is a silent no-op — StealthAdapter imports symbols that don't exist in playwright-stealth 2.x. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1959

9. Installation risk: v0.7.1:Update

Severity: medium
Finding: Installation risk is backed by a source signal: v0.7.1:Update. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/releases/tag/v0.7.1

10. Installation risk: v0.7.2: CI/CD & Dependency Optimization Update

Severity: medium
Finding: Installation risk is backed by a source signal: v0.7.2: CI/CD & Dependency Optimization Update. Treat it as a review item until the current version is checked.
User impact: First-time setup may fail or require extra isolation and rollback planning.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/releases/tag/v0.7.2

11. Configuration risk: [Bug]: Markdown export loses heading hierarchy and table structure

Severity: medium
Finding: Configuration risk is backed by a source signal: [Bug]: Markdown export loses heading hierarchy and table structure. Treat it as a review item until the current version is checked.
User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1964

12. Capability assumption: README/documentation is current enough for a first validation pass.

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: The project should not be treated as fully validated until this signal is reviewed.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: capability.assumptions | github_repo:798201435 | https://github.com/unclecode/crawl4ai | README/documentation is current enough for a first validation pass.

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 project-level external discussion links are exposed on this manual page.

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using crawl4ai with real data or production workflows.

[[Bug] AsyncLogger writes to stdout, breaking MCP stdio transport](https://github.com/unclecode/crawl4ai/issues/1968) - github / github_issue
[[Bug]: Markdown text extraction drops text when element contains empty e](https://github.com/unclecode/crawl4ai/issues/1966) - github / github_issue
[[Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-](https://github.com/unclecode/crawl4ai/issues/1962) - github / github_issue
[[Bug]: MCP scrape tools lack wait_until / SPA support that REST API and](https://github.com/unclecode/crawl4ai/issues/1963) - github / github_issue
[[Bug]: Markdown export loses heading hierarchy and table structure](https://github.com/unclecode/crawl4ai/issues/1964) - github / github_issue
[[Bug]: enable_stealth=True is a silent no-op — StealthAdapter imports sy](https://github.com/unclecode/crawl4ai/issues/1959) - github / github_issue
[[Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked](https://github.com/unclecode/crawl4ai/issues/1949) - github / github_issue
[[Bug]: arun() and arun_many() type hinting needs fixing](https://github.com/unclecode/crawl4ai/issues/1898) - github / github_issue
[[Bug]: The install with pip on just about any system rarely works. It re](https://github.com/unclecode/crawl4ai/issues/1950) - github / github_issue
[[Bug]: remove_empty_elements_fast() drops trailing text when removing](https://github.com/unclecode/crawl4ai/issues/1938) - github / github_issue
Release v0.7.7 - github / github_release
Release v0.7.5 - github / github_release

Source: Project Pack community evidence and pitfall evidence