Doramagic Project Pack · Human Manual

crawl4ai

Related topics: Installation Guide, Quick Start Guide

Introduction to Crawl4AI

Overview

Crawl4AI is an open-source AI-powered web crawling framework designed to extract structured data from web pages and deliver clean, LLM-ready output. It serves as a modern alternative to traditional web scraping tools by combining intelligent crawling capabilities with AI-driven content extraction and formatting.

The project emphasizes ease of use, providing both programmatic APIs and command-line interfaces for rapid integration into data pipelines, research workflows, and AI applications.

Purpose and Scope

Crawl4AI addresses the fundamental challenge of extracting meaningful data from unstructured web content. While conventional web crawlers focus on fetching page content, Crawl4AI goes further by:

  • Semantic Understanding: Analyzing page content to identify and extract relevant information based on context rather than rigid selectors
  • Structured Output: Delivering data in formats optimized for large language model consumption, including Markdown, JSON, and structured extractions
  • Performance Optimization: Enabling high-throughput crawling with configurable browser automation and connection pooling
  • Flexibility: Supporting various output strategies including simple crawling, chunked extraction, and memory-aware processing

The scope encompasses web crawling, content extraction, link navigation, media handling, and output formatting, providing an end-to-end solution from URL input to structured data output.

Core Architecture

Crawl4AI follows a modular architecture composed of distinct processing stages:

graph TD
    A[URL Input] --> B[Crawl Strategy]
    B --> C[Browser Automation]
    C --> D[Content Extraction]
    D --> E[AI Processing]
    E --> F[Output Formatter]
    F --> G[Structured Output]
    
    C -->|JS Rendering| H[JavaScript Executor]
    D -->|Media| I[Media Handler]
    E -->|Memory| J[Memory Manager]

Processing Pipeline

| Stage | Component | Description |
| --- | --- | --- |
| Input | URL Parser | Validates and normalizes target URLs |
| Crawl | Strategy Engine | Selects crawling approach based on configuration |
| Render | Browser Pool | Manages headless browser instances |
| Extract | AI Extractor | Uses ML models to identify relevant content |
| Format | Output Serializer | Converts to target format (JSON/Markdown/HTML) |
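
To illustrate the Input stage in the table above, the following sketch validates and normalizes a URL using only the Python standard library. The function name `normalize_url` and the specific normalization rules are assumptions made for illustration; they are not taken from the Crawl4AI source.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped from a normalized URL
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(raw: str) -> str:
    """Validate a URL and return a normalized form (hypothetical helper).

    This sketch does not handle IPv6 literals or user:password@ prefixes.
    """
    parts = urlsplit(raw.strip())
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported or missing scheme: {raw!r}")
    if not parts.hostname:
        raise ValueError(f"missing host: {raw!r}")

    host = parts.hostname.lower()
    # Keep the port only when it differs from the scheme's default
    if parts.port and parts.port != DEFAULT_PORTS[scheme]:
        host = f"{host}:{parts.port}"

    # An empty path is equivalent to "/"; the fragment is never sent to the server
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))
```

Normalizing before crawling means that two spellings of the same address map to the same key, which is useful for deduplication and caching.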

Key Features

1. AI-Powered Extraction

Crawl4AI leverages machine learning models to understand page content semantically. Rather than relying solely on CSS selectors or XPath expressions, the extractor can identify:

  • Main article content and metadata
  • Structured data elements (tables, lists, forms)
  • Semantic sections and their relationships
  • Relevant versus boilerplate content

2. Multiple Output Strategies

The framework supports various extraction strategies optimized for different use cases:

| Strategy | Use Case | Output |
| --- | --- | --- |
| default | General purpose | Clean Markdown with metadata |
| cosine | Semantic clustering | Grouped content chunks |
| no-cache | Fresh data | Bypass internal caching |
| passive | Low resource | Minimal processing |
| brainless | Simple fetch | Raw HTML without AI processing |
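
The idea behind grouping chunks by cosine similarity can be shown with a small standard-library sketch. It uses bag-of-words vectors and a greedy grouping rule, both chosen for brevity; this is an assumption for illustration and not a description of how the cosine strategy is implemented in Crawl4AI, which may rely on embeddings or a different clustering method. The names `bag_of_words`, `cosine_similarity`, and `group_chunks` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Turn text into a term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_chunks(chunks: list, threshold: float = 0.3) -> list:
    """Greedily place each chunk in the first group whose seed is similar enough."""
    groups = []  # each entry is (seed_vector, [member chunks])
    for chunk in chunks:
        vector = bag_of_words(chunk)
        for seed, members in groups:
            if cosine_similarity(seed, vector) >= threshold:
                members.append(chunk)
                break
        else:
            # No existing group matched, so this chunk seeds a new one
            groups.append((vector, [chunk]))
    return [members for _, members in groups]
```

Chunks that share most of their vocabulary end up in the same group, while unrelated chunks start a group of their own.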

3. Browser Automation

Integrated headless browser support enables:

  • JavaScript rendering for single-page applications
  • Cookie and session management
  • Custom headers and authentication
  • Screenshot capture
  • PDF generation

4. Media Handling

Crawl4AI processes various media types during extraction:

  • Images: Download, compress, and embed with alt-text preservation
  • Videos: Extract metadata and embed URLs
  • Audio: Handle media references for podcasts and audio content
  • Documents: Process embedded PDFs and downloadable files

Installation and Setup

Prerequisites

  • Python 3.9 or higher
  • Chrome/Chromium browser (for browser automation features)
  • pip or poetry package manager

Basic Installation

pip install crawl4ai

Development Installation

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e ".[dev]"

Verify Installation

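import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Crawl one page and print its Markdown to confirm the install works
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())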

Basic Usage Patterns

Simple Crawl

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(f"Title: {result.metadata.get('title')}")
        print(f"Content: {result.markdown}")

asyncio.run(main())

Configured Extraction

import asyncio
from crawl4ai import CrawlerRunConfig, AsyncWebCrawler

config = CrawlerRunConfig(
    mode="aggressive",
    word_count_threshold=10,
    remove_hidden_text=True,
    process_iframes=True,
    scroll_delay=1.0
)

async def main():
    async with AsyncWebCrawler() as crawler:
        # The run configuration is passed per call, not to the crawler itself
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )
        print(result.markdown)

asyncio.run(main())

Batch Crawling

import asyncio
from crawl4ai import AsyncWebCrawler

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async def main():
    async with AsyncWebCrawler() as crawler:
        # arun_many crawls the URLs concurrently and returns one result per URL
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(f"URL: {result.url}")
            print(f"Status: {result.status_code}")

asyncio.run(main())

Configuration Reference

CrawlerRunConfig Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| mode | str | "default" | Extraction strategy |
| headless | bool | True | Run browser in headless mode |
| verbose | bool | False | Enable verbose logging |
| text_threshold | int | | Minimum text length filter |
| word_count_threshold | int | | Words per chunk threshold |
| skip_download_images | bool | False | Skip image downloads |
| page_timeout | int | 30000 | Page load timeout (ms) |
| scroll_delay | float | 0 | Delay between scrolls |

Browser Configuration

| Parameter | Type | Description |
| --- | --- | --- |
| browser_type | str | Chromium/Firefox/WebKit |
| headless | bool | Headless mode toggle |
| proxy | dict | Proxy configuration |
| user_agent | str | Custom user agent string |

Project Structure

crawl4ai/
├── src/crawl4ai/          # Main package source
│   ├── core/              # Core crawling engine
│   ├── extractors/        # Content extraction strategies
│   ├── formatters/        # Output formatters
│   └── utils/             # Utility functions
├── examples/              # Usage examples
├── tests/                 # Test suite
├── docs/                  # Documentation
├── scripts/               # Build and utility scripts
└── sbom/                  # Software Bill of Materials

Dependencies and SBOM

Crawl4AI maintains a comprehensive Software Bill of Materials (SBOM) documenting all direct and transitive dependencies. This SBOM is generated using CycloneDX format and regenerated on a best-effort basis through automated scripts.

Regenerating SBOM

./scripts/gen-sbom.sh

The SBOM provides visibility into the project's dependency tree, supporting security audits and license compliance verification.
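
For audits, a CycloneDX document in its JSON form can be inspected with a few lines of Python. The sketch below lists component names and versions from the top-level `components` array defined by the CycloneDX JSON format. It assumes the SBOM is available as JSON (CycloneDX also has an XML form), and the helper name `list_components` and the inline sample are illustrative only, not the project's actual SBOM.

```python
import json


def list_components(sbom_text: str) -> list:
    """Return sorted (name, version) pairs from a CycloneDX JSON document."""
    document = json.loads(sbom_text)
    if document.get("bomFormat") != "CycloneDX":
        raise ValueError("not a CycloneDX document")
    return sorted(
        (component["name"], component.get("version", ""))
        for component in document.get("components", [])
    )


# A tiny made-up sample for demonstration, not real project data
SAMPLE_SBOM = json.dumps({
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "components": [
        {"type": "library", "name": "package-b", "version": "2.0.0"},
        {"type": "library", "name": "package-a", "version": "1.0.0"},
    ],
})

for name, version in list_components(SAMPLE_SBOM):
    print(f"{name}=={version}")
```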

Extensibility

Custom Extractors

Extend the extraction framework by implementing the base extractor interface:

from crawl4ai.extractors import BaseExtractor

class CustomExtractor(BaseExtractor):
    async def extract(self, html: str, url: str) -> dict:
        # Custom extraction logic
        return {"content": html, "custom_field": "value"}

Output Formatters

Create custom output formats by implementing the formatter interface:

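from crawl4ai.formatters import BaseFormatter
from crawl4ai.models import CrawlResult

class CustomFormatter(BaseFormatter):
    def format(self, result: CrawlResult) -> str:
        # Custom formatting logic: here, the URL followed by the Markdown body
        return f"{result.url}\n\n{result.markdown}"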

Contributing

The project welcomes contributions from the community. Developers interested in contributing should:

  1. Fork the repository
  2. Create a feature branch
  3. Follow the established coding standards
  4. Add tests for new functionality
  5. Submit a pull request with clear documentation

Resources and Documentation

| Resource | Location |
| --- | --- |
| Source Code | GitHub Repository |
| Documentation | /docs directory |
| Examples | /examples directory |
| SBOM | /sbom directory |
| Issue Tracker | GitHub Issues |

License

Crawl4AI is released under open-source licensing terms. Refer to the LICENSE file in the repository for specific terms and conditions.

Source: https://github.com/unclecode/crawl4ai / Human Manual

Installation Guide

Related topics: Quick Start Guide

Overview

This guide covers all supported methods for installing crawl4ai, a powerful web crawling and data extraction framework. The installation process handles automatic setup of core dependencies including Playwright for browser automation, supporting multiple installation scenarios from simple pip installations to Docker containerized deployments.

System Requirements

Hardware Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| RAM | 4 GB | 8 GB |
| Disk Space | 2 GB | 5 GB |
| CPU | 2 cores | 4+ cores |

Software Prerequisites

| Requirement | Version | Notes |
| --- | --- | --- |
| Python | >= 3.9 | Tested up to 3.12 |
| pip | Latest | For pip installations |
| Docker | 20.10+ | For Docker installations |
| Chrome/Chromium | Latest | Auto-installed by Playwright |

Installation Methods

Method 1: pip Installation (Recommended)

The simplest and most common installation method uses the pip package manager.

pip install crawl4ai

After pip installation, you must run the post-installation setup to configure browser dependencies:

python -m crawl4ai install

This command installs Playwright browsers and configures the necessary system dependencies. Sources: crawl4ai/install.py:1-50

Method 2: Docker Installation

Docker provides an isolated environment with all dependencies pre-configured.

#### Pulling the Official Image

docker pull unclecode/crawl4ai

#### Running with Docker

docker run -d \
  --name crawl4ai \
  -p 8000:8000 \
  unclecode/crawl4ai

#### Building Custom Docker Image

You can build your own image using the provided Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "crawl4ai"]

Sources: Dockerfile

Method 3: Installation from Source

For development or customization purposes, install from the source repository:

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .

Dependencies Management

Core Dependencies

The project defines dependencies in setup.py for package distribution and requirements.txt for development.

setup.py dependencies:

| Package | Purpose |
| --- | --- |
| playwright | Browser automation |
| asyncio | Async operations |
| aiohttp | HTTP client |
| beautifulsoup4 | HTML parsing |
| lxml | XML/HTML processing |

Sources: setup.py

Runtime Dependency Installation

The crawl4ai/install.py module handles automatic installation of runtime dependencies:

# Key installation steps in install.py
import subprocess
import sys

def install_browsers():
    subprocess.check_call([
        sys.executable, "-m", "playwright", "install", "chromium"
    ])

Sources: crawl4ai/install.py:20-30

Installing Optional Dependencies

| Extra | Command | Description |
| --- | --- | --- |
| All extras | pip install crawl4ai[all] | Install all optional packages |
| Dev tools | pip install crawl4ai[dev] | Development dependencies |
| LLM support | pip install crawl4ai[llm] | Language model integration |

Installation Workflow

graph TD
    A[Start Installation] --> B{Installation Method}
    B -->|pip| C[Run pip install]
    B -->|Docker| D[Pull/Build Image]
    B -->|Source| E[Clone Repository]
    C --> F[Run post-install]
    F --> G[Install Playwright Browsers]
    D --> H[Run Container]
    E --> I[Install in Editable Mode]
    I --> F
    G --> J[Verify Installation]
    H --> J
    J --> K{Success?}
    K -->|Yes| L[Installation Complete]
    K -->|No| M[Troubleshoot]
    M --> G

Post-Installation Verification

Verify that crawl4ai is installed correctly by checking the installation status:

python -m crawl4ai --version

Or test the installation programmatically:

import crawl4ai
print(crawl4ai.__version__)
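
When the package may be missing, the installed version can also be read from package metadata without importing the package. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("crawl4ai")
print(f"crawl4ai version: {found}" if found else "crawl4ai is not installed")
```

Because nothing is imported, this check still works when a broken dependency would make `import crawl4ai` fail.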

Browser Verification

Ensure Playwright browsers are properly installed:

python -m playwright install-deps chromium
python -m playwright install chromium

Environment Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| CRAWL4AI_BROWSER_HEADLESS | true | Run browser in headless mode |
| CRAWL4AI_MAX_CONCURRENT | 5 | Maximum concurrent crawls |
| CRAWL4AI_CACHE_DIR | ~/.crawl4ai/cache | Cache directory path |
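
The following sketch shows one way an application could read the variables in the table above and fall back to the listed defaults. The `EnvSettings` class, the `load_env_settings` function, and the set of accepted boolean spellings are assumptions for illustration; the text does not describe how the library itself parses these variables.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class EnvSettings:
    headless: bool
    max_concurrent: int
    cache_dir: Path


def load_env_settings(env: Optional[Mapping[str, str]] = None) -> EnvSettings:
    """Read settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    headless = env.get("CRAWL4AI_BROWSER_HEADLESS", "true").strip().lower() in TRUTHY
    max_concurrent = int(env.get("CRAWL4AI_MAX_CONCURRENT", "5"))
    if max_concurrent < 1:
        raise ValueError("CRAWL4AI_MAX_CONCURRENT must be at least 1")
    cache_dir = Path(env.get("CRAWL4AI_CACHE_DIR", "~/.crawl4ai/cache")).expanduser()
    return EnvSettings(headless, max_concurrent, cache_dir)
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to test without touching the real environment.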

Configuration File

Create ~/.crawl4ai/config.json for persistent configuration:

{
  "browser": {
    "headless": true,
    "viewport": {"width": 1920, "height": 1080}
  },
  "cache": {
    "enabled": true,
    "ttl": 3600
  }
}
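
A partial configuration file is easier to work with when missing keys fall back to defaults. The sketch below loads a JSON file shaped like the example above and merges it over a default dictionary. The `DEFAULTS` values mirror the example, while `deep_merge` and `load_config` are hypothetical helpers; whether the library applies the file in this way is not stated in the text.

```python
import copy
import json
from pathlib import Path

DEFAULTS = {
    "browser": {"headless": True, "viewport": {"width": 1920, "height": 1080}},
    "cache": {"enabled": True, "ttl": 3600},
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def load_config(path: Path) -> dict:
    """Load a JSON config file, falling back to DEFAULTS for missing keys."""
    if not path.is_file():
        return copy.deepcopy(DEFAULTS)
    with path.open(encoding="utf-8") as handle:
        return deep_merge(DEFAULTS, json.load(handle))
```

With this approach a file containing only `{"cache": {"ttl": 60}}` changes the cache lifetime and leaves every other setting at its default.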

Common Installation Issues

Issue: Playwright Installation Fails

Solution: Install system dependencies manually:

# Ubuntu/Debian
apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2

# Then retry browser installation
python -m playwright install chromium

Issue: Permission Denied

Solution: Use virtual environment or --user flag:

python -m venv venv
source venv/bin/activate  # Linux/Mac
pip install crawl4ai

Issue: Import Errors After Installation

Solution: Verify Python path and reinstall:

pip uninstall crawl4ai
pip install crawl4ai --force-reinstall

Docker-Specific Configuration

Volume Mounts

Mount local directories for persistent data:

docker run -v /path/to/data:/data unclecode/crawl4ai

Network Configuration

For web crawling behind proxies:

docker run -e HTTP_PROXY=http://proxy:8080 \
           -e HTTPS_PROXY=https://proxy:8080 \
           unclecode/crawl4ai

Upgrading crawl4ai

pip Upgrade

pip install crawl4ai --upgrade

Docker Upgrade

docker pull unclecode/crawl4ai
docker stop crawl4ai
docker rm crawl4ai
docker run -d --name crawl4ai -p 8000:8000 unclecode/crawl4ai

Source Upgrade

git pull origin main
pip install -e . --force-reinstall

Next Steps

After successful installation, proceed to:

  1. Quick Start - Run your first crawl operation
  2. Configuration Guide - Customize crawl behavior
  3. API Reference - Explore available methods and options
  4. Examples - Review usage patterns and best practices

Sources: Dockerfile

Quick Start Guide

Related topics: Async Web Crawler, Markdown Generation

Overview

The Quick Start Guide provides developers with a rapid introduction to using crawl4ai for web crawling and data extraction tasks. It serves as the entry point for users who want to begin extracting structured data from websites within minutes of installation.

Purpose and Scope

The Quick Start Guide is designed to:

  • Demonstrate the simplest possible usage pattern for crawling web pages
  • Show how to extract and structure content from HTML pages
  • Provide copy-paste-ready code examples for immediate experimentation
  • Bridge the gap between installation and production usage

Installation

Prerequisites

| Requirement | Description |
| --- | --- |
| Python | Version 3.9 or higher (as stated in the Installation Guide) |
| pip | Latest version recommended |
| Browser | Chrome/Chromium (for JavaScript rendering) |

Installation Command

pip install crawl4ai

Basic Usage Pattern

The fundamental workflow in crawl4ai follows a simple pattern: create a crawler, configure it, call arun() with a URL, and process the result.

graph TD
    A[Create AsyncWebCrawler Instance] --> B[Configure Parameters]
    B --> C[Call arun Method with URL]
    C --> D[Process Result Object]
    D --> E[Extract Content/Markdown/HTML]

Hello World Example

The simplest possible usage demonstrates core functionality:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
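        result = await crawler.arun(url="https://example.com")

        if result.success:
            print(f"Content: {result.markdown}")
            # result.links is a dict of internal/external link lists
            link_count = sum(len(group) for group in result.links.values())
            print(f"Links found: {link_count}")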
        else:
            print(f"Crawl failed: {result.error_message}")

asyncio.run(main())

Core Components

AsyncWebCrawler

The primary entry point for all crawling operations:

| Parameter | Type | Description |
| --- | --- | --- |
| verbose | bool | Enable detailed logging output |
| headless | bool | Run browser in headless mode |
| browser_type | str | Specify browser engine |

Sources: crawl4ai/__init__.py

CrawlResult Object

The return value from crawler.arun() contains extracted data:

| Property | Type | Description |
| --- | --- | --- |
| success | bool | Whether crawl completed successfully |
| markdown | str | Extracted content as markdown |
| html | str | Raw HTML content |
| links | dict | Dictionary of internal/external links |
| media | dict | Images, videos, and other media |
| error_message | str | Error details if success is False |

Common Usage Patterns

Pattern 1: Simple Content Extraction

import asyncio
from crawl4ai import AsyncWebCrawler

async def extract_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            word_count_threshold=10
        )
        
        if result.success:
            return result.markdown
        return None

content = asyncio.run(extract_content())

Pattern 2: Batch Crawling

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_multiple(urls):
    async with AsyncWebCrawler() as crawler:
        tasks = [crawler.arun(url=url) for url in urls]
        results = await asyncio.gather(*tasks)
        return [r for r in results if r.success]

urls = ["https://example.com", "https://example.org"]
successful_results = asyncio.run(crawl_multiple(urls))
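
Launching every crawl at once with asyncio.gather can overwhelm the browser or the target site when the URL list is long. The sketch below bounds the number of in-flight crawls with asyncio.Semaphore. To stay runnable without a browser it uses `fake_crawl`, a hypothetical stand-in for a real crawl call; `crawl_bounded` and the `limit` parameter are likewise assumptions for illustration.

```python
import asyncio


async def fake_crawl(url: str) -> dict:
    """Stand-in for a real crawl call; pretends URLs ending in /bad fail."""
    await asyncio.sleep(0.01)
    return {"url": url, "success": not url.endswith("/bad")}


async def crawl_bounded(urls, crawl=fake_crawl, limit: int = 2) -> list:
    """Crawl all URLs with at most `limit` crawls running at the same time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str):
        async with semaphore:  # waits here while `limit` crawls are in flight
            return await crawl(url)

    # return_exceptions=True keeps one failure from cancelling the whole batch
    results = await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)
    return [r for r in results if isinstance(r, dict) and r["success"]]


if __name__ == "__main__":
    demo_urls = ["https://example.com", "https://example.org", "https://example.com/bad"]
    print(asyncio.run(crawl_bounded(demo_urls)))
```

In real use, the `crawl` argument would be replaced by a coroutine that calls the crawler for a single URL.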

Configuration Options

Browser Configuration

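import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Browser-level settings, shared by every crawl the crawler performs
browser_config = BrowserConfig(
    headless=True,
    verbose=False
)

# Per-run settings, passed to each arun() call
run_config = CrawlerRunConfig(
    word_count_threshold=10,
    page_timeout=30000
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.success)

asyncio.run(main())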

Error Handling

Always check the success property before accessing extracted content:

result = await crawler.arun(url="https://example.com")

if result.success:
    process_data(result.markdown)
else:
    log_error(f"Crawl failed: {result.error_message}")
    handle_failure()

Next Steps

After completing the Quick Start Guide, users should explore:

  • Advanced extraction strategies with CSS selectors and XPath
  • JavaScript-heavy page crawling
  • Rate limiting and polite crawling practices
  • Integration with AI/LLM pipelines for content analysis

Sources: crawl4ai/__init__.py

System Architecture

Related topics: Browser Management, Async Web Crawler

Overview

Crawl4AI is a high-performance web crawling framework designed for AI applications. It enables efficient extraction of web content along with metadata, supporting both single-page crawling and large-scale asynchronous crawling operations. The architecture emphasizes separation of concerns between browser management, crawling logic, and result processing.

Core Components

The system is built around three primary modules that work in coordination:

| Component | File | Responsibility |
| --- | --- | --- |
| AsyncWebCrawler | crawl4ai/async_webcrawler.py | Main entry point for crawling operations |
| BrowserManager | crawl4ai/browser_manager.py | Handles browser lifecycle and page interactions |
| AsyncDispatcher | crawl4ai/async_dispatcher.py | Manages concurrent crawling tasks |

Component Architecture

graph TD
    A[User / API Client] --> B[AsyncWebCrawler]
    B --> C[BrowserManager]
    B --> D[AsyncDispatcher]
    C --> E[Browser Instance]
    D --> F[Task Queue]
    F --> E
    E --> G[Content Extraction]
    G --> H[Result Models]
    H --> B

AsyncWebCrawler

The AsyncWebCrawler class serves as the primary interface for initiating crawl operations. It accepts configuration parameters and coordinates the crawling workflow.

Key Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| config | BrowserConfig | Browser configuration for the crawler |
| browser_manager | BrowserManager | Shared browser manager instance |
| dispatcher | AsyncDispatcher | Task dispatcher for async operations |

Sources: crawl4ai/async_webcrawler.py

Workflow

graph LR
    A[Initialize Crawler] --> B[Configure Browser]
    B --> C[Create BrowserContext]
    C --> D[Navigate to URL]
    D --> E[Extract Content]
    E --> F[Return CrawlResult]

BrowserManager

The BrowserManager handles the lifecycle of browser instances, managing Chrome/Chromium processes and providing isolated contexts for crawling sessions.

Sources: crawl4ai/browser_manager.py

Browser Lifecycle

graph TD
    A[Launch Browser] --> B[Create Context]
    B --> C[Create Page]
    C --> D[Execute Crawl]
    D --> E[Close Context]
    E --> F[Repeat or Shutdown]
    F --> A

AsyncDispatcher

The AsyncDispatcher enables concurrent crawling operations, managing task queues and coordinating multiple browser contexts for parallel extraction.

Sources: crawl4ai/async_dispatcher.py

Parallel Execution Model

graph TD
    A[URL List] --> B[Dispatcher Queue]
    B --> C[Worker 1]
    B --> D[Worker 2]
    B --> E[Worker N]
    C --> F[Results Aggregator]
    D --> F
    E --> F
    F --> G[Combined Output]
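
The queue-and-workers model in the diagram above can be sketched with asyncio.Queue. This is a generic illustration of the pattern under stated assumptions, not the AsyncDispatcher implementation: `worker`, `dispatch`, and the `handler` callback are hypothetical names, and a failed URL is recorded as its exception rather than retried.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: list, handler) -> None:
    """Pull URLs from the queue until cancelled, recording results or errors."""
    while True:
        url = await queue.get()
        try:
            results.append(await handler(url))
        except Exception as exc:  # keep the worker alive when one URL fails
            results.append(exc)
        finally:
            queue.task_done()


async def dispatch(urls, handler, workers: int = 3) -> list:
    """Process all URLs with a fixed pool of workers and return combined output."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results: list = []
    tasks = [asyncio.create_task(worker(queue, results, handler)) for _ in range(workers)]

    await queue.join()  # returns once every queued URL has been handled
    for task in tasks:
        task.cancel()   # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

Results arrive in completion order rather than input order, so callers that need the original order should carry the URL inside each result.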

Data Models

Results from crawling operations are structured using Pydantic models defined in models.py.

Sources: crawl4ai/models.py

| Model | Purpose |
| --- | --- |
| CrawlResult | Container for extracted content and metadata |
| CrawlerRunConfig | Configuration parameters for crawl sessions |

Docker Deployment Architecture

The project includes Docker deployment specifications that containerize the crawling infrastructure.

graph TD
    A[Docker Compose] --> B[Crawl4AI Container]
    A --> C[Redis Cache]
    A --> D[Chrome Browser]
    B --> C
    B --> D

Sources: deploy/docker/ARCHITECTURE.md

Technology Stack

| Layer | Technology |
| --- | --- |
| Runtime | Python 3.10+ |
| Browser Engine | Chrome/Chromium via Playwright |
| Async Framework | asyncio |
| Data Validation | Pydantic |
| Containerization | Docker |

Configuration

The system supports extensive configuration options through CrawlerRunConfig, including:

  • JavaScript execution toggles
  • Memory management settings
  • Request throttling parameters
  • Content extraction strategies

Dependency Management

The project maintains a Software Bill of Materials (SBOM) for tracking dependencies and ensuring reproducible builds.

Sources: sbom/README.md

To regenerate the SBOM:

./scripts/gen-sbom.sh

Sources: crawl4ai/async_webcrawler.py

Browser Management

Related topics: Anti-Bot Detection and Proxy Management

Overview

Browser Management in crawl4ai provides a comprehensive abstraction layer for controlling and orchestrating browser instances used during web crawling and scraping operations. The system abstracts the complexity of browser automation, allowing users to focus on data extraction rather than browser lifecycle management.

Architecture Overview

The browser management system follows a modular architecture with distinct components that handle specific responsibilities:

graph TD
    A[BrowserManager] --> B[BrowserAdapter]
    A --> C[BrowserProfiler]
    B --> D[Playwright/Chromium]
    C --> E[JS Snippets]
    F[User Request] --> A
    A --> G[Crawled Result]

Core Components

BrowserManager

The central orchestrator responsible for:

  • Browser instance lifecycle (creation, configuration, teardown)
  • Session management and isolation
  • Resource allocation and cleanup
  • Coordination between adapters and profilers

Key Responsibilities:

| Responsibility | Description |
| --- | --- |
| Instance Creation | Creates and initializes browser contexts |
| Configuration | Applies user-defined browser settings |
| Lifecycle Control | Manages startup and shutdown sequences |
| Pool Management | Handles browser pool for concurrent operations |

Sources: crawl4ai/browser_manager.py

BrowserAdapter

An implementation of the adapter pattern that provides a consistent interface for interacting with different browser engines (Chromium, Firefox, and WebKit, driven through Playwright).

Adapter Features:

| Feature | Description |
| --- | --- |
| Engine Abstraction | Unified API across browser backends |
| Command Translation | Converts high-level commands to browser-specific instructions |
| Response Normalization | Standardizes browser responses |

Sources: crawl4ai/browser_adapter.py

BrowserProfiler

Handles JavaScript injection and performance profiling during browser operations.

Profiler Capabilities:

| Capability | Purpose |
| --- | --- |
| JS Injection | Execute custom JavaScript in page context |
| Performance Tracking | Monitor page load and execution metrics |
| Resource Profiling | Track network requests and responses |

Sources: crawl4ai/browser_profiler.py

JavaScript Integration

The js_snippet module provides pre-built JavaScript utilities for browser automation tasks:

graph LR
    A[Browser Context] --> B[js_snippet Module]
    B --> C[DOM Manipulation]
    B --> D[Data Extraction]
    B --> E[Event Handling]

Common JS Snippet Categories:

  • DOM traversal and manipulation
  • Content extraction
  • Scroll management
  • Wait conditions
  • Network request interception

Sources: crawl4ai/js_snippet

Configuration Options

Browser Launch Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| headless | bool | True | Run browser in headless mode |
| args | list | [] | Additional browser arguments |
| timeout | int | 30000 | Navigation timeout in milliseconds |
| viewport | dict | {"width": 1920, "height": 1080} | Browser viewport dimensions |
| user_agent | str | None | Custom user agent string |
| proxy | dict | None | Proxy configuration |

Context Options

| Option | Type | Description |
| --- | --- | --- |
| java_script_enabled | bool | Enable/disable JavaScript in the browser context |
| ignore_https_errors | bool | Ignore SSL certificate errors |

Sources: docs/codebase/browser.md

Browser Lifecycle

stateDiagram-v2
    [*] --> Initializing: Create BrowserManager
    Initializing --> Launching: Launch Browser
    Launching --> Ready: Browser Context Created
    Ready --> Navigating: Load URL
    Navigating --> Ready: Page Loaded
    Ready --> Executing: Run JS/Commands
    Executing --> Ready: Commands Complete
    Ready --> Closing: Shutdown Request
    Closing --> [*]: Resources Freed
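
The state diagram above translates directly into a transition table. The sketch below encodes the states and allowed transitions and rejects anything else. The `State` enum, the `BrowserLifecycle` class, and the explicit `CLOSED` state (standing in for the diagram's final marker) are assumptions for illustration and are not part of the library.

```python
from enum import Enum, auto


class State(Enum):
    INITIALIZING = auto()
    LAUNCHING = auto()
    READY = auto()
    NAVIGATING = auto()
    EXECUTING = auto()
    CLOSING = auto()
    CLOSED = auto()  # stands in for the diagram's final [*] marker


# Allowed transitions, copied from the arrows in the state diagram
TRANSITIONS = {
    State.INITIALIZING: {State.LAUNCHING},
    State.LAUNCHING: {State.READY},
    State.READY: {State.NAVIGATING, State.EXECUTING, State.CLOSING},
    State.NAVIGATING: {State.READY},
    State.EXECUTING: {State.READY},
    State.CLOSING: {State.CLOSED},
    State.CLOSED: set(),
}


class BrowserLifecycle:
    """Tracks the current state and refuses transitions the diagram does not allow."""

    def __init__(self) -> None:
        self.state = State.INITIALIZING

    def advance(self, target: State) -> None:
        if target not in TRANSITIONS[self.state]:
            raise RuntimeError(f"illegal transition {self.state.name} -> {target.name}")
        self.state = target
```

Making the transitions explicit turns mistakes such as navigating after shutdown into immediate errors instead of hard-to-trace browser failures.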

Usage Patterns

Basic Browser Usage

from crawl4ai import BrowserManager

# Initialize browser manager
browser_mgr = BrowserManager(
    headless=True,
    viewport={"width": 1920, "height": 1080}
)

# Create browser context
context = browser_mgr.new_context()

# Use context for crawling
result = await context.goto("https://example.com")

Advanced Configuration

browser_mgr = BrowserManager(
    headless=False,
    args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-dev-shm-usage"
    ],
    timeout=60000,
    user_agent="Custom User Agent"
)

Session Management

The system supports multiple concurrent sessions through isolated browser contexts:

graph TD
    A[BrowserManager] --> B1[Session 1 Context]
    A --> B2[Session 2 Context]
    A --> B3[Session N Context]
    B1 --> C1[Page 1]
    B2 --> C2[Page 2]
    B3 --> C3[Page N]

Error Handling

The browser management system implements comprehensive error handling:

| Error Type | Handling Strategy |
| --- | --- |
| Navigation Timeout | Retry with exponential backoff |
| Browser Crash | Automatic restart and context recreation |
| Resource Exhaustion | Automatic cleanup of stale contexts |
| Network Errors | Graceful degradation with cached content |
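
Retry with exponential backoff, the strategy listed for navigation timeouts, can be sketched as follows. The function `retry_with_backoff`, its parameters, and the small random jitter are assumptions for illustration; the text does not specify the delays or attempt counts the library uses.

```python
import asyncio
import random


async def retry_with_backoff(
    operation,
    *,
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on=(asyncio.TimeoutError,),
    sleep=asyncio.sleep,
):
    """Await operation(), retrying on the given errors with doubling delays."""
    for attempt in range(attempts):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter avoids many clients retrying in lockstep
            await sleep(delay + random.uniform(0, delay / 10))
```

The `operation` argument is a zero-argument coroutine function, so a call site would wrap the navigation step, for example in a small `async def` or a `functools.partial`.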

Performance Considerations

Optimization Strategies

  1. Context Reuse: Reuse browser contexts for multiple pages when possible
  2. Lazy Loading: Only load resources when explicitly requested
  3. Resource Limits: Configure memory and CPU limits per context
  4. Connection Pooling: Maintain warm browser instances for rapid access

Memory Management

| Strategy | Description |
| --- | --- |
| Context Isolation | Each session runs in isolated context |
| Automatic Cleanup | Temporary files and caches cleared automatically |
| Resource Limits | Configurable memory caps per browser instance |

Sources: crawl4ai/browser_manager.py

Async Web Crawler

Related topics: Markdown Generation, Extraction Strategies

Overview

The Async Web Crawler is the core component of crawl4ai, providing an asynchronous, high-performance web crawling engine built on Python's asyncio framework. It enables concurrent crawling of multiple URLs with built-in caching, configurable extraction strategies, and comprehensive result handling.

The primary purpose of this module is to fetch web pages, extract meaningful content, and return structured results that include HTML, markdown, media assets, metadata, and optional AI-generated summaries. The async design allows for efficient I/O-bound operations, making it suitable for large-scale web scraping projects.

Sources: crawl4ai/async_webcrawler.py:1-50

Architecture

System Components

The async web crawler system consists of several interconnected components that work together to provide a seamless crawling experience.

graph TD
    A[AsyncWebCrawler] --> B[Browser Manager]
    A --> C[Cache Layer]
    A --> D[Extraction Strategy]
    A --> E[Result Processor]
    
    B --> F[Playwright/Chromium]
    C --> G[File System Cache]
    C --> H[Memory Cache]
    
    D --> I[LLM-based Extraction]
    D --> J[CSS/XPath Extraction]
    
    E --> K[CrawlResult]
    E --> L[Raw HTML]
    E --> M[Markdown]

Sources: crawl4ai/async_webcrawler.py:1-30

Core Classes

| Class | File | Purpose |
| --- | --- | --- |
| AsyncWebCrawler | async_webcrawler.py | Main crawler entry point with arun() and arun_many() methods |
| BrowserConfig | async_configs.py | Configuration for headless browser behavior |
| CrawlCache | cache_context.py | Manages caching strategies for crawled content |
| CrawlResult | types.py | Data model for returning crawl results |

Sources: crawl4ai/async_configs.py:1-30

AsyncWebCrawler Class

Initialization

The AsyncWebCrawler class can be initialized with optional configuration parameters:

class AsyncWebCrawler:
    def __init__(
        self,
        config: BrowserConfig | None = None,
        verbose: bool = False
    ) -> None:

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| config | BrowserConfig | None | Browser configuration object |
| verbose | bool | False | Enable verbose logging output |

Sources: crawl4ai/async_webcrawler.py:50-70

Core Methods

arun()

Single URL crawling with comprehensive result extraction:

async def arun(
    self,
    url: str,
    config: BrowserConfig | None = None,
    **kwargs
) -> CrawlResult

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| url | str | Target URL to crawl |
| config | BrowserConfig | Override browser configuration |

Sources: crawl4ai/async_webcrawler.py:100-150

arun_many()

Batch crawling for multiple URLs concurrently:

async def arun_many(
    self,
    urls: list[str],
    config: BrowserConfig | None = None,
    **kwargs
) -> list[CrawlResult]

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| urls | list[str] | List of target URLs |
| config | BrowserConfig | Shared configuration for all URLs |

Sources: crawl4ai/async_webcrawler.py:200-250

Context Manager Support

The AsyncWebCrawler implements the async context manager protocol for proper resource cleanup:

async def __aenter__(self) -> "AsyncWebCrawler":
    await self.start()
    return self

async def __aexit__(
    self,
    exc_type, exc_val, exc_tb
) -> None:
    await self.close()

Sources: crawl4ai/async_webcrawler.py:80-100
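The value of this protocol is that close() runs even when the body of the async with block raises. The following is a minimal sketch of that behavior using only the standard library; ManagedResource and demo are hypothetical names for illustration and are not part of crawl4ai.

```python
import asyncio


class ManagedResource:
    """Hypothetical resource following the start()/close() pattern shown above."""

    def __init__(self) -> None:
        self.events = []

    async def start(self) -> None:
        self.events.append("started")

    async def close(self) -> None:
        self.events.append("closed")

    async def __aenter__(self) -> "ManagedResource":
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb) -> None:
        # Called on normal exit and on exceptions alike, so cleanup is not skipped.
        # Returning None lets any exception propagate to the caller.
        await self.close()


async def demo() -> list:
    resource = ManagedResource()
    try:
        async with resource:
            raise RuntimeError("simulated failure inside the block")
    except RuntimeError:
        pass
    return resource.events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the events list records both the start and the close call even though the block failed, which is the guarantee the context manager form is meant to give.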

CrawlResult Data Model

The CrawlResult class encapsulates all information retrieved from a crawled page:

classDiagram
    class CrawlResult {
        +str url
        +str html
        +str markdown
        +list~MediaItem~ media
        +list~Link~ links
        +dict metadata
        +bool success
        +str|None error
        +dict~str, Any~ extracted_content
        +int status_code
        +datetime created_at
    }

Sources: crawl4ai/types.py:1-80

Properties

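| Property | Type | Description |
| --- | --- | --- |
| url | str | Original request URL |
| html | str | Raw HTML content |
| markdown | str | Converted markdown content |
| media | list[MediaItem] | Extracted images, videos, audio |
| links | list[Link] | Internal and external links |
| metadata | dict | Page metadata (title, description) |
| success | bool | Whether the crawl succeeded |
| error | `str \| None` | Error message if failed |
| extracted_content | dict[str, Any] | Output of the extraction strategy, if one was used |
| status_code | int | HTTP response status code |
| created_at | datetime | Timestamp of crawl operation |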

Sources: crawl4ai/types.py:50-100

Configuration

BrowserConfig

The BrowserConfig class provides fine-grained control over browser behavior:

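from dataclasses import dataclass, field

@dataclass
class BrowserConfig:
    headless: bool = True
    browser_type: str = "chromium"
    # A dataclass rejects mutable defaults, so the dict is built per instance
    viewport_size: dict = field(default_factory=lambda: {"width": 1920, "height": 1080})
    user_agent: str | None = None
    verbose: bool = False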

Sources: crawl4ai/async_configs.py:30-80

Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| headless | bool | True | Run browser in headless mode |
| browser_type | str | "chromium" | Browser engine (chromium, firefox, webkit) |
| viewport_size.width | int | 1920 | Viewport width in pixels |
| viewport_size.height | int | 1080 | Viewport height in pixels |
| user_agent | `str \| None` | None | Custom user agent string |
| verbose | bool | False | Enable debug output |

Sources: crawl4ai/async_configs.py:40-90

Advanced Configuration

Additional crawling parameters can be passed via kwargs:

| Parameter | Type | Description |
| --- | --- | --- |
| word_count_threshold | int | Minimum word count for content extraction |
| extraction_strategy | ExtractionStrategy | Strategy for content extraction |
| cache_mode | CacheMode | Caching behavior (enabled/disabled/bypass) |
| js_enabled | bool | Enable JavaScript execution |
| wait_for | str | CSS selector to wait for before returning |
| delay_before_return_html | float | Delay in seconds before capturing HTML |

Sources: docs/md_v2/api/async-webcrawler.md:1-60

Caching System

Cache Modes

The crawl4ai framework implements a multi-layered caching strategy:

graph LR
    A[Request] --> B{Memory Cache}
    B -->|Hit| C[Return Cached]
    B -->|Miss| D{File System Cache}
    D -->|Hit| E[Return Cached]
    D -->|Miss| F[Fetch Remote]
    F --> G[Store in Both Layers]

Sources: crawl4ai/cache_context.py:1-50
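The lookup order in the diagram (memory first, then file system, then remote fetch) can be sketched as follows. This is an illustration under stated assumptions, not crawl4ai's cache implementation: LayeredCache, its JSON file layout, and the promotion of file hits into memory are all hypothetical choices made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Tuple


class LayeredCache:
    """Hypothetical two-layer cache: a dict in front of JSON files on disk."""

    def __init__(self, directory: Path) -> None:
        self.memory = {}
        self.directory = directory

    def _path(self, key: str) -> Path:
        # A stable digest keeps file names short and filesystem-safe
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get(self, key: str, fetch: Callable[[str], str]) -> Tuple[str, str]:
        """Return (value, source); source is 'memory', 'file', or 'remote'."""
        if key in self.memory:
            return self.memory[key], "memory"
        path = self._path(key)
        if path.exists():
            value = json.loads(path.read_text(encoding="utf-8"))["value"]
            self.memory[key] = value  # assumption: promote file hits to memory
            return value, "file"
        value = fetch(key)
        # Miss in both layers: store the fetched value in both
        self.memory[key] = value
        path.write_text(json.dumps({"value": value}), encoding="utf-8")
        return value, "remote"


if __name__ == "__main__":
    cache = LayeredCache(Path(tempfile.mkdtemp()))
    print(cache.get("https://example.com", lambda url: "<html></html>"))
    print(cache.get("https://example.com", lambda url: "<html></html>"))
```

A second LayeredCache pointed at the same directory starts with an empty memory layer, so its first lookup for a previously stored key is served from the file layer.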

CacheMode Enum

| Mode | Description |
| --- | --- |
| ENABLED | Use cache if available, otherwise fetch and cache |
| DISABLED | Always fetch fresh content, bypass cache |
| BYPASS | Fetch and update cache but don't read from it |
| READ_ONLY | Only read from cache, never fetch |

Sources: crawl4ai/cache_context.py:30-60

Cache Context Manager

async with cache_context(cache_mode=CacheMode.ENABLED):
    result = await crawler.arun(url="https://example.com")

Sources: crawl4ai/cache_context.py:60-90

Usage Examples

Basic Single URL Crawl

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=BrowserConfig(headless=True)
        )
        
        print(f"Success: {result.success}")
        print(f"Markdown content: {result.markdown[:500]}")

asyncio.run(main())

Sources: crawl4ai/async_webcrawler.py:150-200

Batch Crawling

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)
        
        for result in results:
            print(f"URL: {result.url}, Status: {result.status_code}")

asyncio.run(main())

Sources: docs/md_v2/api/async-webcrawler.md:60-100

With Custom Extraction Strategy

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    config = BrowserConfig(
        headless=True,
        verbose=True
    )
    
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token="your-token"
    )
    
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(
            url="https://news-site.com/article",
            extraction_strategy=strategy
        )
        
        print(result.extracted_content)

asyncio.run(main())

Sources: crawl4ai/async_configs.py:90-130

Error Handling

The crawler returns comprehensive error information through the CrawlResult object:

result = await crawler.arun(url="https://invalid-url.xyz")

if not result.success:
    print(f"Error: {result.error}")
    print(f"Status Code: {result.status_code}")

| Error Scenario | success Value | error Field |
| --- | --- | --- |
| Network timeout | False | Connection timeout message |
| Invalid URL | False | URL validation error |
| JavaScript error | False | Browser console error |
| HTTP 404/500 | False | HTTP status message |
| Success | True | None |

Sources: crawl4ai/types.py:80-120
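When many pages are crawled, it is often useful to separate failures from successes and decide which failures are worth retrying. The sketch below uses a hypothetical ResultStub dataclass that carries only the fields discussed above; the retry policy in is_retryable is an assumption for illustration, not behavior documented for crawl4ai.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ResultStub:
    """Hypothetical stand-in with only the fields used in this section."""
    url: str
    success: bool
    status_code: int = 0
    error: Optional[str] = None


def split_results(results: List[ResultStub]) -> Tuple[List[ResultStub], List[ResultStub]]:
    """Separate successful results from failed ones, preserving order."""
    ok = [r for r in results if r.success]
    failed = [r for r in results if not r.success]
    return ok, failed


def is_retryable(result: ResultStub) -> bool:
    """Assumed policy: retry when no status was received, or on 429 and 5xx."""
    if result.success:
        return False
    return result.status_code == 0 or result.status_code == 429 or result.status_code >= 500


if __name__ == "__main__":
    results = [
        ResultStub("https://example.com/a", True, 200),
        ResultStub("https://example.com/b", False, 404, "Not Found"),
        ResultStub("https://example.com/c", False, 503, "Service Unavailable"),
    ]
    ok, failed = split_results(results)
    for r in failed:
        print(r.url, r.error, "retry" if is_retryable(r) else "skip")
```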

Performance Considerations

Async Benefits

The async architecture provides several performance advantages:

  1. Concurrent Requests: Multiple URLs can be crawled simultaneously
  2. Non-blocking I/O: Browser operations don't block other tasks
  3. Resource Efficiency: Single event loop manages all crawling tasks
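
These points can be illustrated without a browser. The following is a minimal sketch using only asyncio; fake_fetch and crawl_all are hypothetical stand-ins rather than crawl4ai functions, and the sleep simulates network latency.

```python
import asyncio
from typing import List


async def fake_fetch(url: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a page fetch; sleeps instead of doing I/O."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return f"<html>{url}</html>"


async def crawl_all(urls: List[str], max_concurrency: int = 5) -> List[str]:
    """Fetch all URLs concurrently on one event loop, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fake_fetch(url)

    # gather() returns results in the same order as its inputs
    return await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    urls = [f"https://example.com/page{i}" for i in range(5)]
    pages = asyncio.run(crawl_all(urls))
    print(len(pages), "pages fetched")
```

The semaphore is a design choice for the sketch: it keeps concurrency bounded so that a long URL list does not open an unbounded number of simultaneous fetches.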

Best Practices

| Practice | Benefit |
| --- | --- |
| Use arun_many() for batch operations | Reduces connection overhead |
| Enable caching for repeated URLs | Avoids redundant network requests |
| Set appropriate word_count_threshold | Reduces unnecessary processing |
| Use headless=True in production | Reduces memory usage |

Sources: crawl4ai/async_webcrawler.py:250-300

| Component | File | Relationship |
| --- | --- | --- |
| ExtractionStrategy | extraction_strategy.py | Defines how content is extracted |
| MediaItem | types.py | Represents extracted media |
| Link | types.py | Represents extracted links |
| CacheBackend | cache_backend.py | Abstract cache implementation |

Sources: crawl4ai/types.py:1-30

Summary

The Async Web Crawler is the foundational building block of crawl4ai, providing:

  • Asynchronous operation for high-performance concurrent crawling
  • Flexible configuration via BrowserConfig dataclass
  • Comprehensive result types through CrawlResult model
  • Multi-layered caching with configurable modes
  • Extensible extraction via pluggable strategies
  • Production-ready error handling with detailed error reporting

This architecture enables developers to build scalable web scraping solutions while maintaining clean, readable code patterns familiar to Python async developers.

Sources: crawl4ai/async_webcrawler.py:1-50

Markdown Generation

Related topics: Extraction Strategies, Async Web Crawler

Markdown Generation is a core feature in crawl4ai that transforms raw HTML content into clean, readable Markdown format. This system provides flexible strategies for content extraction, filtering, and conversion with extensive customization options.

Overview

The Markdown Generation system converts web page HTML into structured Markdown text suitable for LLM consumption, RAG systems, or documentation purposes. It offers multiple generation strategies, content filtering capabilities, and fine-grained control over extraction behavior.

Sources: docs/md_v2/core/markdown-generation.md:1-15

Architecture

graph TD
    A[HTML Input] --> B[Content Filter Strategy]
    B --> C[HTML Processing]
    C --> D[Markdown Generation Strategy]
    D --> E[Markdown Output]
    
    F[Configuration] --> B
    F --> D
    
    G[BestProvider] --> B
    G --> D

Core Components

MarkdownGenerationStrategy

The primary abstraction for generating Markdown from HTML content.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| provider | str | "best" | Content extraction provider |
| configs | dict | {} | Provider-specific configurations |
| strict | bool | False | Raise errors on failure |
| override_system_prompt | str | None | Custom system prompt |
| override_user_prompt | str | None | Custom user prompt |

Sources: crawl4ai/markdown_generation_strategy.py:1-50

ContentFilterStrategy

Abstract base class for filtering and selecting content before Markdown conversion.

| Filter | Description |
| --- | --- |
| PruningContentFilter | Removes low-value content nodes |
| BM25ContentFilter | Uses BM25 ranking for content selection |
| OrgAnnContentFilter | Organic annotation-based filtering |

Sources: crawl4ai/content_filter_strategy.py:1-100

Generation Providers

Available Providers

| Provider | Description |
| --- | --- |
| best | Automatically selects optimal provider |
| playwright | Uses Playwright for JavaScript rendering |
| curl | Lightweight extraction via curl |
| trafilatura | Trafilatura library extraction |
| lxml | LXML-based HTML parsing |
| readability | Mozilla Readability algorithm |

Sources: crawl4ai/markdown_generation_strategy.py:50-150

Best Provider Selection

The BestProvider class intelligently selects the most appropriate extraction method based on content characteristics.

class BestProvider:
    def get_strategy(self, html: str) -> MarkdownGenerationStrategy:
        # Analyzes HTML and selects optimal provider
        pass

Sources: crawl4ai/markdown_generation_strategy.py:150-200

Workflow

graph LR
    A[Fetch HTML] --> B{Content Filter Enabled?}
    B -->|Yes| C[Apply Filter Strategy]
    B -->|No| D[Skip Filtering]
    C --> E[Generate Markdown]
    D --> E
    E --> F{Post-Processing?}
    F -->|Yes| G[Apply Custom Rules]
    F -->|No| H[Return Result]
    G --> H

Configuration Options

Generator Config

{
    "word_threshold": 50,          # Minimum words per chunk
    "language": "en",              # Content language
    "skip_internal_links": True,   # Ignore internal links
    "content_type": "markdown"     # Output format
}

BM25 Filter Config

{
    "query": "relevant keywords",
    "top_n": 5,                    # Number of chunks
    "use_stem": True               # Apply stemming
}

Sources: crawl4ai/content_filter_strategy.py:100-180
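To make the query and top_n options concrete, here is a minimal sketch of Okapi BM25 ranking over text chunks. It is a generic implementation of the ranking function, not the code behind BM25ContentFilter; tokenize and bm25_top_n are hypothetical names, the k1 and b values are common defaults, and stemming (use_stem) is not modelled.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_n(query: str, chunks: List[str], top_n: int = 5,
               k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Rank chunks against the query with Okapi BM25 and return the best top_n."""
    docs = [tokenize(c) for c in chunks]
    if not docs:
        return []
    avg_len = sum(len(d) for d in docs) / len(docs) or 1.0
    # Document frequency: number of chunks that contain each term
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    def score(doc: List[str]) -> float:
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((len(docs) - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    ranked = sorted(range(len(chunks)), key=lambda i: score(docs[i]), reverse=True)
    return [chunks[i] for i in ranked[:top_n]]


if __name__ == "__main__":
    sample = [
        "Install the package with pip",
        "Our team likes coffee",
        "Getting started: installation steps and pip install",
    ]
    for chunk in bm25_top_n("pip install", sample, top_n=2):
        print(chunk)
```

Because no stemming is applied here, "installation" does not match the query term "install"; a stemming step would be needed to treat them as the same term.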

HTML to Markdown Conversion

html2text Module

The html2text submodule handles low-level HTML to Markdown conversion.

| Method | Purpose |
| --- | --- |
| handle_anchor | Convert `<a>` tags to `[text](url)` |
| handle_image | Convert `<img>` to `![alt](src)` |
| handle_heading | Convert `<h1>`-`<h6>` to `#` - `######` |
| handle_table | Convert `<table>` to Markdown tables |
| handle_code | Preserve `<code>` and `<pre>` formatting |

Sources: crawl4ai/html2text:core.py:1-100

Conversion Features

  • Link Preservation: External links converted to Markdown format with titles
  • Image Extraction: Images extracted with alt text and sources
  • Table Conversion: HTML tables converted to GFM tables
  • Code Block Handling: Syntax-aware code block extraction
  • List Recognition: Ordered and unordered lists properly formatted
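
As a rough illustration of what such a conversion involves, the sketch below handles only headings, links, and images using the standard library's html.parser. MiniMarkdown and to_markdown are hypothetical names; the html2text submodule covers far more cases (tables, code blocks, lists) and this is not its implementation.

```python
from html.parser import HTMLParser
from typing import List, Optional

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class MiniMarkdown(HTMLParser):
    """Hypothetical converter for headings, links, and images only."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in HEADINGS:
            # <h2> becomes "## ", <h3> becomes "### ", and so on
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "a":
            self._href = attr.get("href")
            self.parts.append("[")
        elif tag == "img":
            alt = attr.get("alt") or ""
            src = attr.get("src") or ""
            self.parts.append(f"![{alt}]({src})")

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.parts.append("\n")
        elif tag == "a":
            self.parts.append(f"]({self._href or ''})")
            self._href = None

    def handle_data(self, data):
        self.parts.append(data)


def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    print(to_markdown('<h2>Docs</h2><a href="https://example.com">site</a>'))
```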

Usage Examples

Basic Usage

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            markdown_generator={
                "provider": "best",
                "configs": {"word_threshold": 100}
            }
        )
        print(result.markdown)

asyncio.run(main())

With Content Filter

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter

filter_strategy = BM25ContentFilter(
    query="getting started installation",
    top_n=10
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com",
            markdown_generator={
                "provider": "playwright",
                "filter_strategy": filter_strategy
            }
        )
        print(result.markdown)

asyncio.run(main())

Advanced Configuration

Custom Prompts

Override system and user prompts for specialized extraction:

markdown_generator = {
    "override_system_prompt": "Extract only technical documentation...",
    "override_user_prompt": "Focus on API endpoints and code examples..."
}

Strict Mode

Enable strict mode to raise exceptions on extraction failures:

markdown_generator = {
    "strict": True,
    "provider": "playwright"
}

Performance Considerations

| Aspect | Recommendation |
| --- | --- |
| Large Pages | Use BM25ContentFilter to reduce content |
| JavaScript-heavy Sites | Use playwright provider |
| Simple Pages | Use lxml or trafilatura for speed |
| Batch Processing | Set appropriate word_threshold |

Error Handling

The system provides graceful degradation:

  1. Provider Fallback: Falls back to alternative provider on failure
  2. Strict Mode: Raises exceptions when enabled
  3. Partial Results: Returns available content on partial failures

Sources: crawl4ai/extraction_strategy.py:50-120

  • Chunking: Content can be further processed with text chunking strategies
  • Extraction: Works alongside extraction strategies for structured data
  • Cache: Generated Markdown can be cached for repeated access

Sources: docs/md_v2/core/markdown-generation.md:1-15

Extraction Strategies

Related topics: Markdown Generation, Deep Crawling Strategies

Extraction Strategies in crawl4ai define how content is parsed, structured, and extracted from crawled web pages. They form the core abstraction layer that determines whether unstructured HTML becomes meaningful, machine-readable data.

Overview

Extraction Strategies handle the transformation pipeline from raw HTML to structured output. The system supports two primary categories:

| Category | Use Case | Performance |
| --- | --- | --- |
| LLM-based | Complex, semantic extraction | Slower, higher accuracy |
| No-LLM | Fast, pattern-based extraction | Faster, rule-dependent |

Sources: crawl4ai/extraction_strategy.py:1-50

Architecture

graph TD
    A[HTML Content] --> B[Content Scraping Strategy]
    B --> C[Chunking Strategy]
    C --> D{Extraction Strategy}
    D -->|LLM-based| E[LLM Strategy]
    D -->|No-LLM| F[No-LLM Strategy]
    E --> G[Structured JSON/Markdown]
    F --> G
    G --> H[Table Extraction Optional]
    H --> I[Final Output]

The extraction pipeline flows through scraping → chunking → extraction, with table extraction as an optional final step.

Sources: crawl4ai/extraction_strategy.py:50-100

LLM-Based Strategies

LLM strategies leverage large language models for semantic understanding and intelligent content extraction.

Supported Providers

| Provider | Model Support | Configuration |
| --- | --- | --- |
| OpenAI | GPT-4, GPT-3.5 | OPENAI_API_KEY |
| Anthropic | Claude 3, Claude 2 | ANTHROPIC_API_KEY |
| Azure OpenAI | Custom deployments | AZURE_API_KEY, AZURE_API_BASE |
| Ollama | Local models | OLLAMA_BASE_URL |

Configuration Parameters

class LLMExtractionStrategy:
    def __init__(
        self,
        provider: str = "openai",
        model: str = "gpt-4",
        api_token: Optional[str] = None,
        system_prompt: Optional[str] = None,
        user_prompt: Optional[str] = None,
        extraction_type: str = "block",
        input_format: str = "html",
        instruction: Optional[str] = None
    )

Sources: docs/md_v2/extraction/llm-strategies.md

Extraction Types

| Type | Description | Best For |
| --- | --- | --- |
| block | Block-level extraction | Paragraphs, sections |
| schema | Schema-based extraction | Structured data, forms |
| custom | Custom instructions | Specific extraction needs |

Sources: crawl4ai/extraction_strategy.py:100-150

No-LLM Strategies

No-LLM strategies provide fast, deterministic extraction without external API dependencies.

Available Strategies

| Strategy | Purpose |
| --- | --- |
| NoExtractionStrategy | Pass-through, no extraction |
| JsonCssExtractionStrategy | CSS selector-based JSON extraction |
| RegexExtractionStrategy | Regex pattern matching |
| XPathExtractionStrategy | XPath-based extraction |

Sources: docs/md_v2/extraction/no-llm-strategies.md

JsonCssExtractionStrategy

from crawl4ai import JsonCssExtractionStrategy

strategy = JsonCssExtractionStrategy(
    schema={
        "name": "ProductList",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h2.title", "type": "text"},
            {"name": "price", "selector": "span.price", "type": "text"},
            {"name": "image", "selector": "img", "attribute": "src"}
        ]
    }
)

Sources: crawl4ai/extraction_strategy.py:150-200

Chunking Strategies

Chunking strategies split content into manageable pieces before extraction.

Default Chunking Behavior

graph LR
    A[Large Content] --> B[Character Split]
    B --> C[Overlap Application]
    C --> D[Token Count Check]
    D -->|Under limit| E[Chunk Ready]
    D -->|Over limit| F[Recursive Split]
    F --> E

Configuration Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_token_size | int | 1000 | Target tokens per chunk |
| overlap | int | 100 | Overlapping tokens between chunks |
| max_chunk_size | int | 3000 | Hard maximum chunk size |
| splitting_regex | str | `\n\n+` | Regex for splitting points |

Sources: crawl4ai/chunking_strategy.py:1-80
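The interplay of splitting_regex, chunk_token_size, and overlap can be sketched as follows. This is an assumption-laden illustration rather than the library's chunker: whitespace-separated words stand in for tokens, overlap is applied only when an over-long paragraph has to be windowed, and split_words and chunk_text are hypothetical names.

```python
import re
from typing import List


def split_words(words: List[str], size: int, overlap: int) -> List[List[str]]:
    """Sliding windows of `size` words; neighbours share `overlap` words."""
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]


def chunk_text(text: str, chunk_token_size: int = 1000, overlap: int = 100,
               splitting_regex: str = r"\n\n+") -> List[str]:
    """Pack whole paragraphs into chunks; window paragraphs that are too long."""
    if not 0 <= overlap < chunk_token_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_token_size")
    chunks: List[str] = []
    current: List[str] = []
    for paragraph in re.split(splitting_regex, text):
        words = paragraph.split()
        if not words:
            continue
        if len(current) + len(words) <= chunk_token_size:
            current.extend(words)  # paragraph fits, keep it whole
            continue
        if current:
            chunks.append(" ".join(current))
            current = []
        if len(words) <= chunk_token_size:
            current = words
        else:
            # Over-long paragraph: fall back to overlapping windows
            chunks.extend(" ".join(w) for w in split_words(words, chunk_token_size, overlap))
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("a b c\n\nd e\n\nf g h i j k l", chunk_token_size=4, overlap=1):
        print(chunk)
```

A production chunker would count tokens with the target model's tokenizer instead of counting words, since the two can differ substantially.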

Content Scraping Strategy

The content scraping strategy determines initial content extraction from HTML.

graph TD
    A[Raw HTML] --> B{Scraping Strategy}
    B -->|BeautifulSoup| C[Parse DOM]
    B -->|Playwright| D[Dynamic Render]
    B -->|Raw| E[Minimal Processing]
    C --> F[Content Cleaned]
    D --> F
    E --> F

Strategy Selection

| Strategy | JavaScript | Speed | Use Case |
| --- | --- | --- | --- |
| BeautifulSoup | No | Fast | Static pages |
| Playwright | Yes | Medium | SPAs, dynamic content |
| RawContent | No | Fastest | Pre-processed HTML |

Sources: crawl4ai/content_scraping_strategy.py:1-60

Table Extraction

Table extraction handles tabular data structures within web pages.

class TableExtractionStrategy:
    def __init__(
        self,
        table_styles: Optional[List[str]] = None,
        ignore_tables: Optional[List[str]] = None,
        merge_multiple_headers: bool = False
    )

Extraction Configuration

| Parameter | Type | Description |
| --- | --- | --- |
| table_styles | List[str] | CSS classes to include as tables |
| ignore_tables | List[str] | CSS classes to exclude |
| merge_multiple_headers | bool | Merge multi-row headers |
| extract_header | bool | Include header row (default: True) |

Sources: crawl4ai/table_extraction.py:1-100

Complete Pipeline Example

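import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    LLMExtractionStrategy,
    JsonCssExtractionStrategy,
)

# Schema for the CSS-based crawl (same shape as the JsonCssExtractionStrategy example)
product_schema = {
    "name": "ProductList",
    "baseSelector": "div.product",
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
    ],
}

async def main():
    async with AsyncWebCrawler() as crawler:
        # LLM-based extraction
        llm_result = await crawler.arun(
            url="https://example.com/article",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai",
                model="gpt-4",
                instruction="Extract article title, author, and key points"
            )
        )

        # CSS-based extraction
        css_result = await crawler.arun(
            url="https://example.com/products",
            extraction_strategy=JsonCssExtractionStrategy(schema=product_schema)
        )

asyncio.run(main())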

Sources: crawl4ai/extraction_strategy.py:200-250

Strategy Selection Guide

graph TD
    A[Start] --> B{Need semantic understanding?}
    B -->|Yes| C{External API acceptable?}
    B -->|No| D[No-LLM Strategy]
    C -->|Yes| E{Local deployment needed?}
    C -->|No| F[Ollama/Local LLM]
    E -->|Yes| F
    E -->|No| G[OpenAI/Anthropic]
    D --> H{Data has tabular structure?}
    H -->|Yes| I[Add TableExtractionStrategy]
    H -->|No| J[Complete]
    G --> J
    F --> J
    I --> J

Decision Matrix

| Requirement | Recommended Strategy |
| --- | --- |
| Simple CSS extraction | JsonCssExtractionStrategy |
| Complex semantic parsing | LLMExtractionStrategy |
| High-volume, low-latency | No-LLM strategies |
| Schema-agnostic | LLM-based strategies |
| Tabular data focus | TableExtractionStrategy |

Sources: docs/md_v2/extraction/llm-strategies.md, docs/md_v2/extraction/no-llm-strategies.md

Environment Variables

| Variable | Required For | Description |
| --- | --- | --- |
| OPENAI_API_KEY | OpenAI LLM | API key for GPT models |
| ANTHROPIC_API_KEY | Anthropic LLM | API key for Claude models |
| AZURE_API_KEY | Azure OpenAI | Azure OpenAI API key |
| OLLAMA_BASE_URL | Local LLM | Base URL for Ollama server |

Sources: crawl4ai/extraction_strategy.py:250-300

Sources: crawl4ai/extraction_strategy.py:1-50

Deep Crawling Strategies

Related topics: Extraction Strategies, Async Web Crawler

Overview

Deep Crawling Strategies in crawl4ai provide systematic approaches to traverse and extract content from websites beyond a single page. These strategies enable controlled, scalable web crawling by managing URL discovery, prioritization, filtering, and scoring mechanisms. The deep crawling module supports multiple traversal algorithms (BFS, DFS, BFF) with extensible filtering and scoring systems.

Sources: base_strategy.py:1-50

Architecture

graph TD
    A[Seed URLs] --> B[DeepCrawlingStrategy]
    B --> C[URL Filters]
    B --> D[URL Scorers]
    B --> E[Traversal Algorithm]
    C --> F[Valid URLs]
    D --> G[Prioritized URLs]
    E --> H[Crawl Queue]
    G --> H
    H --> I[Crawl4AI Extractor]
    I --> J[Extracted Content]
    J --> K[Links Extracted]
    K --> B

The architecture follows a producer-consumer pattern where the strategy continuously discovers URLs from crawled pages and feeds them back into the crawl queue based on prioritization rules.

Sources: base_strategy.py:50-100

Core Components

Base Strategy

All crawling strategies inherit from DeepCrawlingStrategy, which provides the foundational interface and shared functionality:

| Property/Method | Type | Description |
| --- | --- | --- |
| url_scorer | URLScorer | Scores URLs for prioritization |
| url_filter | URLFilter | Filters URLs for validity |
| keywords | Set[str] | Keywords for relevance matching |
| keywords_or | Set[str] | Alternative keyword matching |
| max_depth | int | Maximum crawl depth |
| included_domains | Set[str] | Allowed domains |
| excluded_domains | Set[str] | Blocked domains |
| crawl_enabled | bool | Enable/disable crawling |
| check_keywords | Callable | Custom keyword validation |

Sources: base_strategy.py:100-150

Traversal Algorithms

BFS (Breadth-First Search) Strategy

The BFS strategy explores pages level by level, ensuring comprehensive coverage before going deeper:

graph LR
    A[Level 0: Seed] --> B[Level 1: depth=1]
    B --> C[Level 2: depth=2]
    C --> D[Level 3: depth=3]
    style A fill:#90EE90
    style B fill:#87CEEB
    style C fill:#DDA0DD
    style D fill:#F0E68C

Characteristics:

  • Systematic exploration of shallow depths first
  • Ideal for site maps and directory-style sites
  • Higher memory usage due to large frontier sets
  • Better for finding all accessible pages at shallower depths

Sources: bfs_strategy.py:1-80

Configuration Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_depth | int | 3 | Maximum crawl depth |
| max_pages | int | 50 | Maximum pages to crawl |
| priority | int | 0 | Base priority score |
| include_external | bool | False | Allow external domain crawling |
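
The level-by-level order and the effect of max_depth and max_pages can be seen in a small sketch. It walks an in-memory link graph instead of fetching pages; bfs_crawl and the links dictionary are hypothetical and only illustrate the traversal, not the strategy class itself.

```python
from collections import deque
from typing import Dict, List, Tuple


def bfs_crawl(seed: str, links: Dict[str, List[str]],
              max_depth: int = 3, max_pages: int = 50) -> List[Tuple[str, int]]:
    """Visit URLs breadth-first; `links` maps a URL to its outgoing links."""
    visited = {seed}
    order: List[Tuple[str, int]] = []
    queue = deque([(seed, 0)])  # FIFO queue gives level-by-level order
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order


if __name__ == "__main__":
    graph = {"/": ["/a", "/b"], "/a": ["/a1"], "/b": ["/b1"], "/a1": ["/deep"]}
    for url, depth in bfs_crawl("/", graph, max_depth=2):
        print(depth, url)
```

The queue holds every discovered but unvisited URL at the current and next level, which is the source of the higher memory usage noted above.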

DFS (Depth-First Search) Strategy

The DFS strategy explores as deep as possible before backtracking:

graph TD
    A[Start] --> B[Depth 1]
    B --> C[Depth 2]
    C --> D[Depth 3]
    D --> E[Backtrack]
    E --> F[Next Branch]
    F --> G[Continue Deep]
    style A fill:#90EE90
    style D fill:#FF6B6B

Characteristics:

  • Deep exploration of specific paths first
  • Lower memory footprint
  • Suitable for following navigation chains
  • Risk of getting stuck in deep site sections

Sources: dfs_strategy.py:1-80

Configuration Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_depth | int | 10 | Maximum crawl depth |
| max_pages | int | 100 | Maximum pages to crawl |
| priority | int | 0 | Base priority score |
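
For comparison with breadth-first traversal, the following sketch follows one branch to its end before backtracking. As with any toy model, it walks an in-memory link graph rather than fetching pages; dfs_crawl and the links dictionary are hypothetical names used only for illustration.

```python
from typing import Dict, List, Tuple


def dfs_crawl(seed: str, links: Dict[str, List[str]],
              max_depth: int = 10, max_pages: int = 100) -> List[Tuple[str, int]]:
    """Visit URLs depth-first; `links` maps a URL to its outgoing links."""
    visited = set()
    order: List[Tuple[str, int]] = []
    stack = [(seed, 0)]  # LIFO stack makes the most recent discovery go first
    while stack and len(order) < max_pages:
        url, depth = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append((url, depth))
        if depth < max_depth:
            # reversed() so that the first listed link is explored first
            for nxt in reversed(links.get(url, [])):
                if nxt not in visited:
                    stack.append((nxt, depth + 1))
    return order


if __name__ == "__main__":
    graph = {"/": ["/a", "/b"], "/a": ["/a1"], "/b": ["/b1"], "/a1": ["/deep"]}
    for url, depth in dfs_crawl("/", graph):
        print(depth, url)
```

The max_depth bound is what guards against the risk listed above of descending indefinitely into one section of a site.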

BFF (Best-First with Filters) Strategy

The BFF strategy combines filtering with score-based prioritization, crawling the most relevant pages first:

graph TD
    A[URL Discovered] --> B{URL Filter}
    B -->|Pass| C{Score URL}
    B -->|Fail| X[Skip]
    C --> D[Priority Queue]
    D --> E[Crawl Next Best]
    E --> F[Extract Links]
    F --> A

Characteristics:

  • Relevance-based crawling using keyword matching
  • Configurable scoring functions
  • Filters out irrelevant content early
  • Most efficient for targeted data extraction

Sources: bff_strategy.py:1-100
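
The filter-then-score loop in the diagram maps naturally onto a priority queue. The sketch below uses heapq over an in-memory link graph; best_first_crawl, the score and allow callables, and the example scoring rule are all hypothetical and are not taken from bff_strategy.py.

```python
import heapq
from typing import Callable, Dict, List


def best_first_crawl(seed: str, links: Dict[str, List[str]],
                     score: Callable[[str], float],
                     allow: Callable[[str], bool],
                     max_pages: int = 100) -> List[str]:
    """Always visit the highest-scoring discovered URL next."""
    visited = {seed}
    order: List[str] = []
    counter = 0  # tie-breaker: equal scores are served in discovery order
    # heapq is a min-heap, so scores are negated to pop the largest first
    heap = [(-score(seed), counter, seed)]
    while heap and len(order) < max_pages:
        _, _, url = heapq.heappop(heap)
        order.append(url)
        for nxt in links.get(url, []):
            if nxt in visited or not allow(nxt):
                continue  # filter before scoring to avoid wasted work
            visited.add(nxt)
            counter += 1
            heapq.heappush(heap, (-score(nxt), counter, nxt))
    return order


if __name__ == "__main__":
    graph = {
        "/": ["/blog", "/docs/api", "/logo.png"],
        "/docs/api": ["/docs/api/auth"],
        "/blog": ["/blog/post"],
    }
    relevance = lambda u: u.count("docs") + u.count("api")
    not_image = lambda u: not u.endswith(".png")
    print(best_first_crawl("/", graph, relevance, not_image))
```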

Filtering System

The filtering system determines which URLs are eligible for crawling:

FilterChain

Multiple filters can be chained together for comprehensive URL validation:

from crawl4ai.deep_crawling.filters import FilterChain, SameDomainFilter, ExtensionFilter

filter_chain = FilterChain([
    SameDomainFilter(allowed_domains=["example.com"]),
    ExtensionFilter(excluded_extensions=[".pdf", ".zip"]),
])

Available Filters

| Filter | Purpose | Key Parameters |
| --- | --- | --- |
| SameDomainFilter | Restrict to same domain | allowed_domains, strict |
| ExtensionFilter | Block by file extension | excluded_extensions |
| robots.txt Filter | Respect robots directives | user_agent |
| RegexFilter | Custom pattern matching | patterns, exclude |

Sources: filters.py:1-120
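
Conceptually, a filter chain accepts a URL only if every filter accepts it. The sketch below expresses that idea with plain functions and urllib.parse; same_domain, exclude_extensions, and chain are hypothetical helpers, and treating subdomains as part of an allowed domain is an assumption of the example.

```python
from typing import Callable, Set
from urllib.parse import urlparse

UrlPredicate = Callable[[str], bool]


def same_domain(allowed: Set[str]) -> UrlPredicate:
    """Accept URLs whose host is an allowed domain or one of its subdomains."""
    def check(url: str) -> bool:
        host = urlparse(url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in allowed)
    return check


def exclude_extensions(extensions: Set[str]) -> UrlPredicate:
    """Reject URLs whose path ends with one of the given extensions."""
    def check(url: str) -> bool:
        path = urlparse(url).path.lower()  # ignores query string and fragment
        return not any(path.endswith(ext) for ext in extensions)
    return check


def chain(*filters: UrlPredicate) -> UrlPredicate:
    """Combine filters; all() stops at the first filter that rejects the URL."""
    return lambda url: all(f(url) for f in filters)


if __name__ == "__main__":
    should_crawl = chain(
        same_domain({"example.com"}),
        exclude_extensions({".pdf", ".zip"}),
    )
    for candidate in ["https://docs.example.com/guide", "https://example.com/file.pdf"]:
        print(candidate, should_crawl(candidate))
```

Ordering cheap filters first is a reasonable design choice, because the short-circuit means later, costlier checks run only for URLs that passed the earlier ones.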

Scoring System

URL scoring determines crawl priority within the queue:

ScorerChain

Scorers can be combined in a chain for multi-factor evaluation:

from crawl4ai.deep_crawling.scorers import ScorerChain, KeywordRelevanceScorer, DepthScorer

scorer = ScorerChain([
    KeywordRelevanceScorer(keywords=["api", "docs"]),
    DepthScorer(max_depth=5, decay=0.5),
])

Available Scorers

| Scorer | Function | Parameters |
| --- | --- | --- |
| KeywordRelevanceScorer | Match keywords in URL/text | keywords, weight |
| DepthScorer | Penalize deep pages | max_depth, decay |
| FreshnessScorer | Prefer recent content | date_field, decay |
| CustomScorer | User-defined scoring | scoring_fn |

Sources: scorers.py:1-150
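
One way to picture multi-factor scoring is as a sum of independent score functions. The sketch below is built on assumptions: keyword_score, depth_score, and combine are hypothetical helpers, the decay ** depth formula is one plausible reading of a depth penalty, and summing the parts is only one possible combination rule. None of this is taken from scorers.py.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, int], float]  # (url, depth) -> score


def keyword_score(keywords: Sequence[str], weight: float = 1.0) -> Scorer:
    """Fraction of keywords found in the URL, scaled by weight."""
    def score(url: str, depth: int) -> float:
        if not keywords:
            return 0.0
        lowered = url.lower()
        hits = sum(1 for k in keywords if k.lower() in lowered)
        return weight * hits / len(keywords)
    return score


def depth_score(max_depth: int = 5, decay: float = 0.5) -> Scorer:
    """Assumed penalty: 1.0 at the seed, multiplied by decay per level."""
    def score(url: str, depth: int) -> float:
        return decay ** depth if depth <= max_depth else 0.0
    return score


def combine(*scorers: Scorer) -> Scorer:
    """Assumed combination rule: add the individual scores."""
    return lambda url, depth: sum(s(url, depth) for s in scorers)


if __name__ == "__main__":
    total = combine(keyword_score(["api", "docs"]), depth_score(max_depth=5, decay=0.5))
    print(total("https://example.com/docs/api", 1))
    print(total("https://example.com/blog", 2))
```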

Usage Examples

Basic BFS Crawling

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.deep_crawling import BFSDistanceStrategy

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            strategy=BFSDistanceStrategy(
                max_depth=3,
                max_pages=50
            )
        )

asyncio.run(main())

Targeted Crawling with BFF

from crawl4ai.deep_crawling import BFFStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling.filters import SameDomainFilter

strategy = BFFStrategy(
    max_depth=5,
    max_pages=100,
    keywords={"documentation", "api", "guide"},
    url_filter=SameDomainFilter(allowed_domains=["example.com"]),
    url_scorer=KeywordRelevanceScorer(keywords={"documentation"}),
)

Sources: docs/md_v2/core/deep-crawling.md:1-100

Workflow States

stateDiagram-v2
    [*] --> Initializing: Start crawl
    Initializing --> Crawling: Load seed URLs
    Crawling --> Processing: Fetch page
    Processing --> Filtering: Extract links
    Filtering --> Scoring: Validate URLs
    Scoring --> Queuing: Rank by priority
    Queuing --> Crawling: Next URL
    Crawling --> [*]: Queue empty or max reached

Strategy Selection Guide

| Use Case | Recommended Strategy | Reason |
| --- | --- | --- |
| Site mapping | BFS | Comprehensive shallow coverage |
| Documentation sites | BFS | Find all pages systematically |
| Article/navigation chains | DFS | Follow deep links naturally |
| Targeted data extraction | BFF | Prioritize relevant pages |
| API documentation | BFF | Filter by keyword relevance |
| Limited resources | DFS | Lower memory footprint |

Configuration Reference

Common Parameters

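from dataclasses import dataclass
from typing import Optional, Set

from crawl4ai.deep_crawling.filters import FilterChain
from crawl4ai.deep_crawling.scorers import ScorerChain

@dataclass
class DeepCrawlConfig:
    # Scope
    included_domains: Optional[Set[str]] = None
    excluded_domains: Optional[Set[str]] = None

    # Limits
    max_depth: int = 3
    max_pages: int = 100
    max_total_pages: int = 1000

    # Filtering
    allow_external: bool = False
    check_keywords: bool = True

    # Scoring
    scoring: Optional[ScorerChain] = None
    filter: Optional[FilterChain] = None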

Keyword Matching

# AND matching (all keywords must match)
keywords = {"documentation", "api"}

# OR matching (any keyword matches)
keywords_or = {"guide", "tutorial", "docs"}

Sources: base_strategy.py:150-200
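
The difference between the two matching modes comes down to all() versus any(). The following sketch shows that logic in isolation; matches_all and matches_any are hypothetical helpers that use case-insensitive substring matching, which is an assumption rather than documented behavior.

```python
from typing import Set


def matches_all(text: str, keywords: Set[str]) -> bool:
    """AND matching: every keyword must occur in the text."""
    lowered = text.lower()
    return all(k.lower() in lowered for k in keywords)


def matches_any(text: str, keywords: Set[str]) -> bool:
    """OR matching: at least one keyword must occur in the text."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)


if __name__ == "__main__":
    url = "https://example.com/docs/api-reference"
    print(matches_all(url, {"documentation", "api"}))
    print(matches_any(url, {"guide", "tutorial", "docs"}))
```

Note the edge case with an empty keyword set: all() over nothing is True while any() over nothing is False, so an empty AND set accepts every URL and an empty OR set accepts none.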

Advanced Features

Custom Filters

from crawl4ai.deep_crawling.filters import URLFilter

class CustomFilter(URLFilter):
    def should_crawl(self, url: str) -> bool:
        # Custom logic
        return "product" in url and not url.endswith(".jpg")

Custom Scorers

from crawl4ai.deep_crawling.scorers import URLScorer

class PriorityScorer(URLScorer):
    def score(self, url: str, context: dict) -> float:
        base_score = 1.0
        if "important" in url:
            base_score *= 2.0
        return base_score

Sources: filters.py:150-200, scorers.py:150-200

Best Practices

  1. Set reasonable limits: Always configure max_pages and max_depth to prevent runaway crawling
  2. Use BFF for targeted extraction: When you know what content you need, BFF reduces noise
  3. Filter early, score late: Apply filters before scoring to reduce unnecessary processing
  4. Respect robots.txt: Configure filters to respect site crawling directives
  5. Monitor memory usage: BFS uses more memory; switch to DFS for resource-constrained environments
  6. Combine keyword strategies: Use both keywords (AND) and keywords_or (OR) for flexible matching

Sources: base_strategy.py:1-50

Anti-Bot Detection and Proxy Management

Related topics: Browser Management

Overview

Crawl4AI provides anti-bot detection and proxy management to keep crawls reliable. The detector identifies responses that indicate bot protection has been triggered, evasion strategies and fallbacks respond to them, and proxy rotation spreads requests across multiple IP addresses.

Architecture Overview

graph TD
    A[Client Request] --> B[AntiBotDetector]
    B --> C{Bot Detection?}
    C -->|Yes| D[Apply Evasion Strategy]
    C -->|No| E[Direct Request]
    D --> F[ProxySelector]
    F --> G[Rotating Proxies]
    G --> H[Target Website]
    H --> I{Response Valid?}
    I -->|No| J[Fallback Mechanism]
    J --> B
    I -->|Yes| K[Return Content]

Anti-Bot Detection System

Purpose and Scope

The anti-bot detection module (antibot_detector.py) analyzes responses from target websites to determine if bot protection mechanisms have been triggered. When detection occurs, the system can automatically apply evasion strategies or fall back to alternative methods.

Sources: crawl4ai/antibot_detector.py

Detection Strategies

| Strategy | Description | Use Case |
|---|---|---|
| Header Analysis | Examines HTTP headers for bot detection signals | Standard bot checks |
| Content Analysis | Scans response content for CAPTCHAs or blocking messages | Challenge pages |
| Status Code Monitoring | Tracks HTTP status codes indicating blocks | 403, 429 responses |
| JavaScript Challenge Detection | Identifies JS-based bot challenges | Cloudflare, PerimeterX |
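
The four strategies can be combined into a single check. The sketch below is an assumption-laden illustration, not the contents of antibot_detector.py: `detect_block`, `DetectionResult`, and all marker strings are hypothetical, and treating a `Retry-After` header as a throttling signal is a simplification.

```python
from dataclasses import dataclass
from typing import Mapping

# Assumed values for illustration only; not taken from antibot_detector.py.
BLOCK_STATUS_CODES = {403, 429}
JS_CHALLENGE_MARKERS = ("just a moment", "checking your browser")
CHALLENGE_MARKERS = ("captcha", "verify you are human", "access denied")

@dataclass
class DetectionResult:
    blocked: bool
    reason: str

def detect_block(status: int, headers: Mapping[str, str], body: str) -> DetectionResult:
    """Run the checks from the table, cheapest first."""
    # Status code monitoring.
    if status in BLOCK_STATUS_CODES:
        return DetectionResult(True, f"status {status}")
    # Header analysis: header names are case-insensitive, so normalize them.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "retry-after" in lowered:
        return DetectionResult(True, "retry-after header")
    text = body.lower()
    # JavaScript challenge detection.
    if any(marker in text for marker in JS_CHALLENGE_MARKERS):
        return DetectionResult(True, "javascript challenge")
    # Content analysis for CAPTCHAs and blocking messages.
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return DetectionResult(True, "challenge page")
    return DetectionResult(False, "ok")
```

Simple substring markers like these produce false positives on pages that merely mention CAPTCHAs, which is one reason a detector might use a score and a threshold instead of a single match.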

Key Components

The anti-bot detector integrates with the async configuration system to provide seamless fallback handling:

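# Pseudocode representation based on async_configs.py integration
from dataclasses import dataclass

@dataclass
class AntiBotConfig:
    enabled: bool = True              # turn detection on or off
    detection_threshold: float = 0.7  # assumed: confidence above which a response counts as blocked
    auto_fallback: bool = True        # apply a fallback automatically when a block is detected
    max_retries: int = 3              # upper bound on retry attempts per request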

Sources: crawl4ai/async_configs.py

Proxy Management System

Purpose and Scope

The proxy strategy module (proxy_strategy.py) manages proxy rotation, selection, and health checking to maintain request anonymity and distribute load across multiple IP addresses.

Sources: crawl4ai/proxy_strategy.py

Proxy Rotation Strategies

| Strategy | Description | Best For |
|---|---|---|
| Round Robin | Sequential proxy selection | Even distribution |
| Random | Random proxy selection | Avoiding pattern detection |
| Weighted | Prioritize faster/reliable proxies | Performance optimization |
| Geographic | Match proxy location to target | Region-specific content |
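
The first three strategies map directly onto the Python standard library. The following is a minimal sketch with hypothetical helper names (`round_robin`, `pick_random`, `pick_weighted`); it does not reflect how proxy_strategy.py is written, and geographic selection is omitted because it needs location metadata the table does not describe.

```python
import itertools
import random
from typing import Dict, Iterator, List

def round_robin(proxies: List[str]) -> Iterator[str]:
    """Yield proxies in a fixed, endlessly repeating order."""
    return itertools.cycle(proxies)

def pick_random(proxies: List[str], rng: random.Random) -> str:
    """Pick one proxy uniformly at random."""
    return rng.choice(proxies)

def pick_weighted(weights: Dict[str, float], rng: random.Random) -> str:
    """Pick one proxy with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

proxies = ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"]
rotation = round_robin(proxies)
first_three = [next(rotation) for _ in range(3)]
```

Passing an explicit `random.Random` instance keeps the random and weighted strategies reproducible in tests; the weights themselves would have to come from measured latency or success rate.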

Configuration Parameters

from dataclasses import dataclass, field
from typing import List

@dataclass
class ProxyConfig:
    proxies: List[str] = field(default_factory=list)  # list of proxy URLs
    rotation_strategy: str = "round_robin"
    health_check_interval: int = 300  # seconds
    timeout: int = 30                 # proxy timeout in seconds
    retry_on_failure: bool = True

Proxy Health Monitoring

The system continuously monitors proxy health through periodic health checks, removing failed proxies from the active pool and re-evaluating them after a cooldown period.
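
The remove-then-re-evaluate cycle can be sketched as a small pool that benches failed proxies for a cooldown period. `ProxyPool` and its methods are hypothetical names, and the injectable `clock` parameter is an assumption added so the behavior can be tested without waiting.

```python
import time
from typing import Callable, Dict, List

class ProxyPool:
    """Hypothetical pool that benches failed proxies for a cooldown period."""

    def __init__(
        self,
        proxies: List[str],
        cooldown: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._proxies = list(proxies)
        self._cooldown = cooldown
        self._clock = clock
        self._failed_at: Dict[str, float] = {}

    def mark_failed(self, proxy: str) -> None:
        """Record the time of a failed health check."""
        self._failed_at[proxy] = self._clock()

    def mark_healthy(self, proxy: str) -> None:
        """Clear a failure after a successful health check."""
        self._failed_at.pop(proxy, None)

    def active(self) -> List[str]:
        """Proxies that never failed or whose cooldown has elapsed."""
        now = self._clock()
        return [
            p for p in self._proxies
            if p not in self._failed_at
            or now - self._failed_at[p] >= self._cooldown
        ]
```

In this sketch a proxy simply becomes eligible again once the cooldown elapses; a fuller version would probe it first and call `mark_healthy` or `mark_failed` based on the result.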

Integration with Async Configuration

Fallback Mechanisms

When anti-bot detection triggers, the system can automatically switch to fallback modes:

  1. Proxy Fallback: Rotate to a different proxy server
  2. Strategy Fallback: Switch to alternative crawling strategies
  3. User-Agent Fallback: Use different browser fingerprints

Sources: docs/md_v2/advanced/anti-bot-and-fallback.md
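
The three fallback modes can be viewed as an ordered chain of attempts. This is a minimal sketch under stated assumptions: `crawl_with_fallbacks` and the stub attempt functions are hypothetical, and each attempt is assumed to return `None` when it is blocked.

```python
from typing import Callable, List, Optional, Tuple

# An attempt returns page content, or None if the request was blocked.
Attempt = Callable[[str], Optional[str]]

def crawl_with_fallbacks(url: str, attempts: List[Tuple[str, Attempt]]) -> Tuple[str, str]:
    """Try each named attempt in order; return (name, content) for the first success."""
    for name, attempt in attempts:
        content = attempt(url)
        if content is not None:
            return name, content
    raise RuntimeError(f"all fallbacks exhausted for {url}")

# Stub attempts standing in for real network calls.
def direct(url: str) -> Optional[str]:
    return None  # simulated block

def via_new_proxy(url: str) -> Optional[str]:
    return None  # simulated block

def with_new_user_agent(url: str) -> Optional[str]:
    return "<html>ok</html>"

result = crawl_with_fallbacks(
    "https://example.com",
    [("direct", direct), ("proxy", via_new_proxy), ("user-agent", with_new_user_agent)],
)
```

Returning the name of the attempt that succeeded makes it easy to log which fallback was needed for a given site.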

Configuration Example

anti_bot:
  enabled: true
  detection_sensitivity: "medium"
  auto_fallback: true
  
proxy:
  enabled: true
  strategy: "weighted"
  proxies:
    - "http://proxy1.example.com:8080"
    - "http://proxy2.example.com:8080"
  health_check:
    enabled: true
    interval: 300

Security Considerations

Proxy Security

When configuring proxies, consider the following security aspects:

  • Proxy Protocol: Use HTTPS proxies to encrypt traffic between the crawler and the proxy
  • Authentication: Implement proxy authentication where supported
  • Provider Reputation: Use trusted proxy providers
  • IP Rotation: Avoid predictable IP patterns

Sources: docs/md_v2/advanced/proxy-security.md

Best Practices

| Practice | Description |
|---|---|
| Rate Limiting | Respect target site limits to avoid IP bans |
| Request Delays | Implement delays between requests |
| Header Randomization | Vary User-Agent and other headers |
| Cookie Management | Handle cookies appropriately per session |
| SSL Verification | Validate SSL certificates for security |

Workflow Diagram

sequenceDiagram
    participant Client
    participant AntiBot as AntiBot Detector
    participant ProxyMgr as Proxy Manager
    participant Target as Target Website
    
    Client->>AntiBot: Send Request
    AntiBot->>ProxyMgr: Request Proxy
    ProxyMgr->>AntiBot: Return Proxy
    AntiBot->>Target: Forward Request via Proxy
    
    alt Bot Detected
        Target-->>AntiBot: Bot Challenge Response
        AntiBot->>ProxyMgr: Request Different Proxy
        ProxyMgr-->>AntiBot: New Proxy
        AntiBot->>Target: Retry with New Proxy
    else Success
        Target-->>Client: Return Content
    end

Error Handling

Common Error Scenarios

| Error | Cause | Resolution |
|---|---|---|
| 403 Forbidden | IP blocked | Rotate proxy |
| 429 Too Many Requests | Rate limited | Backoff and retry |
| CAPTCHA Required | Bot detected | Switch strategy |
| Proxy Timeout | Proxy unavailable | Health check and replace |

Retry Logic

The system implements exponential backoff for retries:

retry_config = {
    "max_attempts": 3,
    "base_delay": 1.0,      # seconds
    "max_delay": 60.0,       # seconds
    "exponential_base": 2
}
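
To make the schedule concrete, the delay before each retry can be computed from these values. This is a sketch, not the library's retry code: `backoff_delays` is a hypothetical helper, and it assumes `max_attempts` counts the initial request, so three attempts mean two waits.

```python
from typing import List

retry_config = {
    "max_attempts": 3,
    "base_delay": 1.0,      # seconds
    "max_delay": 60.0,      # seconds
    "exponential_base": 2,
}

def backoff_delays(config: dict) -> List[float]:
    """Delay before each retry: base_delay * exponential_base**n, capped at max_delay."""
    retries = config["max_attempts"] - 1  # assumed: the first attempt has no preceding delay
    return [
        min(config["base_delay"] * config["exponential_base"] ** n, config["max_delay"])
        for n in range(retries)
    ]

delays = backoff_delays(retry_config)  # [1.0, 2.0]
```

With larger `max_attempts` the delays continue 4, 8, 16, 32 seconds and then stay at the 60-second cap. Adding random jitter to each delay is a common refinement when many workers retry at once.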

Summary

Crawl4AI's anti-bot detection and proxy management system combines response analysis with proxy rotation to keep crawls running against protected websites. The integration between AntiBotDetector and ProxyStrategy lets a detected block trigger an automatic fallback, such as a different proxy, crawling strategy, or browser fingerprint, instead of failing the request outright.

Sources: crawl4ai/antibot_detector.py

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: [Bug]: arun() and arun_many() type hinting needs fixing

  • Severity: high
  • Finding: Installation risk is backed by a source signal: [Bug]: arun() and arun_many() type hinting needs fixing. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1898

2. Configuration risk: [Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked as failed), no error messages or failure reason…

  • Severity: high
  • Finding: Configuration risk is backed by a source signal: [Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked as failed), no error messages or failure reason…. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1949

3. Configuration risk: [Bug]: MCP scrape tools lack wait_until / SPA support that REST API and CLI provide

  • Severity: high
  • Finding: Configuration risk is backed by a source signal: [Bug]: MCP scrape tools lack wait_until / SPA support that REST API and CLI provide. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1963

4. Configuration risk: [Bug]: `remove_empty_elements_fast()` drops trailing text when removing empty elements with non-empty .tail

  • Severity: high
  • Finding: Configuration risk is backed by a source signal: [Bug]: remove_empty_elements_fast() drops trailing text when removing empty elements with non-empty .tail. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1938

5. Security or permission risk: [Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-3x token overhead for CJK content

  • Severity: high
  • Finding: Security or permission risk is backed by a source signal: [Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-3x token overhead for CJK content. Treat it as a review item until the current version is checked.
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1962

6. Installation risk: [Bug] AsyncLogger writes to stdout, breaking MCP stdio transport

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: [Bug] AsyncLogger writes to stdout, breaking MCP stdio transport. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1968

7. Installation risk: [Bug]: The install with pip on just about any system rarely works. It requires an env or it only partial installs

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: [Bug]: The install with pip on just about any system rarely works. It requires an env or it only partial installs. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1950

8. Installation risk: [Bug]: enable_stealth=True is a silent no-op - StealthAdapter imports symbols that don't exist in playwright-stealth 2.x

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: [Bug]: enable_stealth=True is a silent no-op - StealthAdapter imports symbols that don't exist in playwright-stealth 2.x. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1959

9. Installation risk: v0.7.1:Update

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v0.7.1:Update. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/releases/tag/v0.7.1

10. Installation risk: v0.7.2: CI/CD & Dependency Optimization Update

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v0.7.2: CI/CD & Dependency Optimization Update. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/releases/tag/v0.7.2

11. Configuration risk: [Bug]: Markdown export loses heading hierarchy and table structure

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: [Bug]: Markdown export loses heading hierarchy and table structure. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1964

12. Capability assumption: README/documentation is current enough for a first validation pass.

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: capability.assumptions | github_repo:798201435 | https://github.com/unclecode/crawl4ai | README/documentation is current enough for a first validation pass.

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 project-level external discussion links are exposed on this manual page. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using crawl4ai with real data or production workflows.

  • [[Bug] AsyncLogger writes to stdout, breaking MCP stdio transport](https://github.com/unclecode/crawl4ai/issues/1968) - github / github_issue
  • [[Bug]: Markdown text extraction drops text when element contains empty e](https://github.com/unclecode/crawl4ai/issues/1966) - github / github_issue
  • [[Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-](https://github.com/unclecode/crawl4ai/issues/1962) - github / github_issue
  • [[Bug]: MCP scrape tools lack wait_until / SPA support that REST API and](https://github.com/unclecode/crawl4ai/issues/1963) - github / github_issue
  • [[Bug]: Markdown export loses heading hierarchy and table structure](https://github.com/unclecode/crawl4ai/issues/1964) - github / github_issue
  • [[Bug]: enable_stealth=True is a silent no-op - StealthAdapter imports sy](https://github.com/unclecode/crawl4ai/issues/1959) - github / github_issue
  • [[Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked](https://github.com/unclecode/crawl4ai/issues/1949) - github / github_issue
  • [[Bug]: arun() and arun_many() type hinting needs fixing](https://github.com/unclecode/crawl4ai/issues/1898) - github / github_issue
  • [[Bug]: The install with pip on just about any system rarely works. It re](https://github.com/unclecode/crawl4ai/issues/1950) - github / github_issue
  • [[Bug]: remove_empty_elements_fast() drops trailing text when removing](https://github.com/unclecode/crawl4ai/issues/1938) - github / github_issue
  • Release v0.7.7 - github / github_release
  • Release v0.7.5 - github / github_release

Source: Project Pack community evidence and pitfall evidence