
# https://github.com/unclecode/crawl4ai Project Documentation

Generated: 2026-05-15 08:23:04 UTC

## Table of Contents

- [Introduction to Crawl4AI](#introduction)
- [Installation Guide](#installation)
- [Quick Start Guide](#quickstart)
- [System Architecture](#architecture)
- [Browser Management](#browser_management)
- [Async Web Crawler](#async_crawler)
- [Markdown Generation](#markdown_generation)
- [Extraction Strategies](#extraction_strategies)
- [Deep Crawling Strategies](#deep_crawling)
- [Anti-Bot Detection and Proxy Management](#anti_bot_detection)

<a id='introduction'></a>

## Introduction to Crawl4AI

### Related Pages

Related topics: [Installation Guide](#installation), [Quick Start Guide](#quickstart)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/unclecode/crawl4ai/blob/main/README.md)
- [MISSION.md](https://github.com/unclecode/crawl4ai/blob/main/MISSION.md)
- [pyproject.toml](https://github.com/unclecode/crawl4ai/blob/main/pyproject.toml)
- [docs/README.md](https://github.com/unclecode/crawl4ai/blob/main/docs/README.md)
- [CONTRIBUTING.md](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md)
- [sbom/README.md](https://github.com/unclecode/crawl4ai/blob/main/sbom/README.md)
</details>

# Introduction to Crawl4AI

## Overview

Crawl4AI is an open-source AI-powered web crawling framework designed to extract structured data from web pages and deliver clean, LLM-ready output. It serves as a modern alternative to traditional web scraping tools by combining intelligent crawling capabilities with AI-driven content extraction and formatting.

The project emphasizes ease of use, providing both programmatic APIs and command-line interfaces for rapid integration into data pipelines, research workflows, and AI applications.

## Purpose and Scope

Crawl4AI addresses the fundamental challenge of extracting meaningful data from unstructured web content. While conventional web crawlers focus on fetching page content, Crawl4AI goes further by:

- **Semantic Understanding**: Analyzing page content to identify and extract relevant information based on context rather than rigid selectors
- **Structured Output**: Delivering data in formats optimized for large language model consumption, including Markdown, JSON, and structured extractions
- **Performance Optimization**: Enabling high-throughput crawling with configurable browser automation and connection pooling
- **Flexibility**: Supporting various output strategies including simple crawling, chunked extraction, and memory-aware processing

The scope encompasses web crawling, content extraction, link navigation, media handling, and output formatting—providing an end-to-end solution from URL input to structured data output.

## Core Architecture

Crawl4AI follows a modular architecture composed of distinct processing stages:

```mermaid
graph TD
    A[URL Input] --> B[Crawl Strategy]
    B --> C[Browser Automation]
    C --> D[Content Extraction]
    D --> E[AI Processing]
    E --> F[Output Formatter]
    F --> G[Structured Output]
    
    C -->|JS Rendering| H[JavaScript Executor]
    D -->|Media| I[Media Handler]
    E -->|Memory| J[Memory Manager]
```

### Processing Pipeline

| Stage | Component | Description |
|-------|-----------|-------------|
| Input | URL Parser | Validates and normalizes target URLs |
| Crawl | Strategy Engine | Selects crawling approach based on configuration |
| Render | Browser Pool | Manages headless browser instances |
| Extract | AI Extractor | Uses ML models to identify relevant content |
| Format | Output Serializer | Converts to target format (JSON/Markdown/HTML) |
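
The Input stage in the table above can be illustrated with a small standard-library sketch. `normalize_url` is a hypothetical helper written for this page, not a crawl4ai function, and the rules it applies (http/https only, lowercase host, fragment dropped) are assumptions about what "validates and normalizes" might mean in practice.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: str) -> str:
    """Validate and normalize a URL; raise ValueError if it is not crawlable."""
    parts = urlsplit(raw.strip())
    scheme = parts.scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {raw!r}")
    if not parts.hostname:
        raise ValueError(f"missing host: {raw!r}")
    # Lowercase the host, keep an explicit port, default the path to "/",
    # and drop the fragment, which is never sent to the server.
    # Userinfo and IPv6 literals are not handled in this sketch.
    host = parts.hostname.lower()
    netloc = f"{host}:{parts.port}" if parts.port else host
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

Normalizing before crawling means that two spellings of the same address map to one key, which helps when deduplicating a URL list.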

## Key Features

### 1. AI-Powered Extraction

Crawl4AI leverages machine learning models to understand page content semantically. Rather than relying solely on CSS selectors or XPath expressions, the extractor can identify:

- Main article content and metadata
- Structured data elements (tables, lists, forms)
- Semantic sections and their relationships
- Relevant versus boilerplate content

### 2. Multiple Output Strategies

The framework supports various extraction strategies optimized for different use cases:

| Strategy | Use Case | Output |
|----------|----------|--------|
| `default` | General purpose | Clean Markdown with metadata |
| `cosine` | Semantic clustering | Grouped content chunks |
| `no-cache` | Fresh data | Bypass internal caching |
| `passive` | Low resource | Minimal processing |
| `brainless` | Simple fetch | Raw HTML without AI processing |

### 3. Browser Automation

Integrated headless browser support enables:

- JavaScript rendering for single-page applications
- Cookie and session management
- Custom headers and authentication
- Screenshot capture
- PDF generation

### 4. Media Handling

Crawl4AI processes various media types during extraction:

- **Images**: Download, compress, and embed with alt-text preservation
- **Videos**: Extract metadata and embed URLs
- **Audio**: Handle media references for podcasts and audio content
- **Documents**: Process embedded PDFs and downloadable files
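
As a rough illustration of alt-text preservation for images, the sketch below collects `src` and `alt` pairs with Python's built-in `html.parser`. `ImageCollector` and `collect_images` are hypothetical names; crawl4ai's own media handling is not shown here and may work differently.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src and alt text of every <img> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.images: list[dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if src:  # skip images without a usable source
            self.images.append({"src": src, "alt": attr_map.get("alt") or ""})


def collect_images(html: str) -> list[dict[str, str]]:
    """Parse `html` and return one {"src", "alt"} dict per image."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.images
```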

## Installation and Setup

### Prerequisites

- Python 3.9 or higher
- Chrome/Chromium browser (for browser automation features)
- pip or poetry package manager

### Basic Installation

```bash
pip install crawl4ai
```

### Development Installation

```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e ".[dev]"
```

### Verify Installation

```python
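import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # `async with` is only valid inside a coroutine
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())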
```

## Basic Usage Patterns

### Simple Crawl

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(f"Title: {result.metadata.get('title')}")
        print(f"Content: {result.markdown}")

asyncio.run(main())
```

### Configured Extraction

```python
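import asyncio
from crawl4ai import CrawlerRunConfig, AsyncWebCrawler

# Parameter names are reproduced from this page; check them against the
# CrawlerRunConfig signature of the installed version.
config = CrawlerRunConfig(
    mode="aggressive",
    word_count_threshold=10,
    remove_hidden_text=True,
    process_iframes=True,
    scroll_delay=1.0
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )
        print(result.markdown)

asyncio.run(main())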
```

### Batch Crawling

```python
import asyncio
from crawl4ai import AsyncWebCrawler

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(f"URL: {result.url}")
            print(f"Status: {result.status_code}")

asyncio.run(main())
```

## Configuration Reference

### CrawlerRunConfig Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `mode` | str | `"default"` | Extraction strategy |
| `headless` | bool | `True` | Run browser in headless mode |
| `verbose` | bool | `False` | Enable verbose logging |
| `text_threshold` | int | | Minimum text length filter |
| `word_count_threshold` | int | | Words per chunk threshold |
| `skip_download_images` | bool | `False` | Skip image downloads |
| `page_timeout` | int | `30000` | Page load timeout (ms) |
| `scroll_delay` | float | `0` | Delay between scrolls |

### Browser Configuration

| Parameter | Type | Description |
|-----------|------|-------------|
| `browser_type` | str | Chromium/Firefox/WebKit |
| `headless` | bool | Headless mode toggle |
| `proxy` | dict | Proxy configuration |
| `user_agent` | str | Custom user agent string |

## Project Structure

```
crawl4ai/
├── src/crawl4ai/          # Main package source
│   ├── core/              # Core crawling engine
│   ├── extractors/        # Content extraction strategies
│   ├── formatters/        # Output formatters
│   └── utils/             # Utility functions
├── examples/              # Usage examples
├── tests/                 # Test suite
├── docs/                  # Documentation
├── scripts/               # Build and utility scripts
└── sbom/                  # Software Bill of Materials
```

## Dependencies and SBOM

Crawl4AI maintains a comprehensive Software Bill of Materials (SBOM) documenting all direct and transitive dependencies. This SBOM is generated in CycloneDX format and regenerated on a best-effort basis through automated scripts.

### Regenerating SBOM

```bash
./scripts/gen-sbom.sh
```

The SBOM provides visibility into the project's dependency tree, supporting security audits and license compliance verification.

## Extensibility

### Custom Extractors

Extend the extraction framework by implementing the base extractor interface:

```python
from crawl4ai.extractors import BaseExtractor

class CustomExtractor(BaseExtractor):
    async def extract(self, html: str, url: str) -> dict:
        # Custom extraction logic
        return {"content": html, "custom_field": "value"}
```

### Output Formatters

Create custom output formats by implementing the formatter interface:

```python
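from crawl4ai.formatters import BaseFormatter
from crawl4ai.models import CrawlResult

class CustomFormatter(BaseFormatter):
    def format(self, result: CrawlResult) -> str:
        # Custom formatting logic: prefix the Markdown with its source URL
        return f"Source: {result.url}\n\n{result.markdown}"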
```

## Contributing

The project welcomes contributions from the community. Developers interested in contributing should:

1. Fork the repository
2. Create a feature branch
3. Follow the established coding standards
4. Add tests for new functionality
5. Submit a pull request with clear documentation

## Resources and Documentation

| Resource | Location |
|----------|----------|
| Source Code | [GitHub Repository](https://github.com/unclecode/crawl4ai) |
| Documentation | `/docs` directory |
| Examples | `/examples` directory |
| SBOM | `/sbom` directory |
| Issue Tracker | GitHub Issues |

## License

Crawl4AI is released under open-source licensing terms. Refer to the LICENSE file in the repository for specific terms and conditions.

---

<a id='installation'></a>

## Installation Guide

### Related Pages

Related topics: [Quick Start Guide](#quickstart)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [setup.py](https://github.com/unclecode/crawl4ai/blob/main/setup.py)
- [requirements.txt](https://github.com/unclecode/crawl4ai/blob/main/requirements.txt)
- [Dockerfile](https://github.com/unclecode/crawl4ai/blob/main/Dockerfile)
- [crawl4ai/install.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/install.py)
</details>

# Installation Guide

## Overview

This guide covers all supported methods for installing crawl4ai, a powerful web crawling and data extraction framework. The installation process handles automatic setup of core dependencies including Playwright for browser automation, supporting multiple installation scenarios from simple pip installations to Docker containerized deployments.

## System Requirements

### Hardware Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 4 GB | 8 GB |
| Disk Space | 2 GB | 5 GB |
| CPU | 2 cores | 4+ cores |

### Software Prerequisites

| Requirement | Version | Notes |
|-------------|---------|-------|
| Python | >= 3.9 | Tested up to 3.12 |
| pip | Latest | For pip installations |
| Docker | 20.10+ | For Docker installations |
| Chrome/Chromium | Latest | Auto-installed by Playwright |

## Installation Methods

### Method 1: pip Installation (Recommended)

The simplest and most common installation method uses the pip package manager.

```bash
pip install crawl4ai
```

After pip installation, you must run the post-installation setup to configure browser dependencies:

```bash
python -m crawl4ai install
```

This command installs Playwright browsers and configures the necessary system dependencies. Source: [crawl4ai/install.py:1-50]()

### Method 2: Docker Installation

Docker provides an isolated environment with all dependencies pre-configured.

#### Pulling the Official Image

```bash
docker pull unclecode/crawl4ai
```

#### Running with Docker

```bash
docker run -d \
  --name crawl4ai \
  -p 8000:8000 \
  unclecode/crawl4ai
```

#### Building Custom Docker Image

You can build your own image using the provided Dockerfile:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "crawl4ai"]
```

Source: [Dockerfile](https://github.com/unclecode/crawl4ai/blob/main/Dockerfile)

### Method 3: Installation from Source

For development or customization purposes, install from the source repository:

```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
```

## Dependencies Management

### Core Dependencies

The project defines dependencies in `setup.py` for package distribution and `requirements.txt` for development.

**setup.py dependencies:**

| Package | Purpose |
|---------|---------|
| playwright | Browser automation |
| asyncio | Async operations |
| aiohttp | HTTP client |
| beautifulsoup4 | HTML parsing |
| lxml | XML/HTML processing |

Source: [setup.py](https://github.com/unclecode/crawl4ai/blob/main/setup.py)

### Runtime Dependency Installation

The `crawl4ai/install.py` module handles automatic installation of runtime dependencies:

```python
# Key installation steps in install.py
import subprocess
import sys

def install_browsers():
    subprocess.check_call([
        sys.executable, "-m", "playwright", "install", "chromium"
    ])
```

Source: [crawl4ai/install.py:20-30]()

### Installing Optional Dependencies

| Extra | Command | Description |
|-------|---------|-------------|
| All extras | `pip install crawl4ai[all]` | Install all optional packages |
| Dev tools | `pip install crawl4ai[dev]` | Development dependencies |
| LLM support | `pip install crawl4ai[llm]` | Language model integration |

## Installation Workflow

```mermaid
graph TD
    A[Start Installation] --> B{Installation Method}
    B -->|pip| C[Run pip install]
    B -->|Docker| D[Pull/Build Image]
    B -->|Source| E[Clone Repository]
    C --> F[Run post-install]
    F --> G[Install Playwright Browsers]
    D --> H[Run Container]
    E --> I[Install in Editable Mode]
    I --> F
    G --> J[Verify Installation]
    H --> J
    J --> K{Success?}
    K -->|Yes| L[Installation Complete]
    K -->|No| M[Troubleshoot]
    M --> G
```

## Post-Installation Verification

Verify that crawl4ai is installed correctly by checking the installation status:

```bash
python -m crawl4ai --version
```

Or test the installation programmatically:

```python
import crawl4ai
print(crawl4ai.__version__)
```
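
If importing the package fails for unrelated reasons (for example a missing browser dependency), the installed version can also be read from package metadata without importing it. This is a small sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("crawl4ai")
    print(f"crawl4ai {found}" if found else "crawl4ai is not installed")
```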

### Browser Verification

Ensure Playwright browsers are properly installed:

```bash
python -m playwright install-deps chromium
python -m playwright install chromium
```

## Environment Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `CRAWL4AI_BROWSER_HEADLESS` | `true` | Run browser in headless mode |
| `CRAWL4AI_MAX_CONCURRENT` | `5` | Maximum concurrent crawls |
| `CRAWL4AI_CACHE_DIR` | `~/.crawl4ai/cache` | Cache directory path |

### Configuration File

Create `~/.crawl4ai/config.json` for persistent configuration:

```json
{
  "browser": {
    "headless": true,
    "viewport": {"width": 1920, "height": 1080}
  },
  "cache": {
    "enabled": true,
    "ttl": 3600
  }
}
```
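
A file like this is typically combined with built-in defaults so that a partial config still yields a complete settings object. The following is a minimal sketch under that assumption; `DEFAULTS`, `deep_merge`, and `load_config` are hypothetical names, and the default values simply mirror the JSON example above.

```python
import copy
import json
from pathlib import Path

DEFAULTS = {
    "browser": {"headless": True, "viewport": {"width": 1920, "height": 1080}},
    "cache": {"enabled": True, "ttl": 3600},
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: Path) -> dict:
    """Load a JSON config file, falling back to DEFAULTS for missing keys."""
    if not path.exists():
        return copy.deepcopy(DEFAULTS)
    with path.open(encoding="utf-8") as fh:
        return deep_merge(DEFAULTS, json.load(fh))
```

Merging recursively means a file that only sets `cache.ttl` leaves `cache.enabled` and all browser settings at their defaults.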

## Common Installation Issues

### Issue: Playwright Installation Fails

**Solution:** Install system dependencies manually:

```bash
# Ubuntu/Debian
apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2

# Then retry browser installation
python -m playwright install chromium
```

### Issue: Permission Denied

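**Solution:** Use a virtual environment, or pass the `--user` flag to pip:

```bash
python -m venv venv
source venv/bin/activate  # Linux/Mac
pip install crawl4ai

# Alternatively, install for the current user only:
# pip install --user crawl4ai
```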

### Issue: Import Errors After Installation

**Solution:** Verify Python path and reinstall:

```bash
pip uninstall crawl4ai
pip install crawl4ai --force-reinstall
```

## Docker-Specific Configuration

### Volume Mounts

Mount local directories for persistent data:

```bash
docker run -v /path/to/data:/data unclecode/crawl4ai
```

### Network Configuration

For web crawling behind proxies:

```bash
docker run -e HTTP_PROXY=http://proxy:8080 \
           -e HTTPS_PROXY=https://proxy:8080 \
           unclecode/crawl4ai
```

## Upgrading crawl4ai

### pip Upgrade

```bash
pip install crawl4ai --upgrade
```

### Docker Upgrade

```bash
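docker pull unclecode/crawl4ai
# Stop and remove the container started earlier (named "crawl4ai" above)
docker stop crawl4ai
docker rm crawl4ai
docker run -d --name crawl4ai -p 8000:8000 unclecode/crawl4ai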
```

### Source Upgrade

```bash
git pull origin main
pip install -e . --force-reinstall
```

## Next Steps

After successful installation, proceed to:

1. **Quick Start** - Run your first crawl operation
2. **Configuration Guide** - Customize crawl behavior
3. **API Reference** - Explore available methods and options
4. **Examples** - Review usage patterns and best practices

---

<a id='quickstart'></a>

## Quick Start Guide

### Related Pages

Related topics: [Async Web Crawler](#async_crawler), [Markdown Generation](#markdown_generation)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [docs/examples/quickstart.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart.py)
- [docs/examples/hello_world.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hello_world.py)
- [crawl4ai/__init__.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/__init__.py)
- [sbom/README.md](https://github.com/unclecode/crawl4ai/blob/main/sbom/README.md)
</details>

# Quick Start Guide

## Overview

The **Quick Start Guide** provides developers with a rapid introduction to using crawl4ai for web crawling and data extraction tasks. It serves as the entry point for users who want to begin extracting structured data from websites within minutes of installation.

### Purpose and Scope

The Quick Start Guide is designed to:

- Demonstrate the simplest possible usage pattern for crawling web pages
- Show how to extract and structure content from HTML pages
- Provide copy-paste-ready code examples for immediate experimentation
- Bridge the gap between installation and production usage

## Installation

### Prerequisites

| Requirement | Description |
|-------------|-------------|
| Python | Version 3.8 or higher |
| pip | Latest version recommended |
| Browser | Chrome/Chromium (for JavaScript rendering) |

### Installation Command

```bash
pip install crawl4ai
```

## Basic Usage Pattern

The fundamental workflow in crawl4ai follows a simple pattern: create a crawler, configure it, call it with a URL, and process the result:

```mermaid
graph TD
    A[Create AsyncWebCrawler Instance] --> B[Configure Parameters]
    B --> C[Call arun Method with URL]
    C --> D[Process Result Object]
    D --> E[Extract Content/Markdown/HTML]
```

### Hello World Example

The simplest possible usage demonstrates core functionality:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
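        result = await crawler.arun(url="https://example.com")

        if result.success:
            print(f"Content: {result.markdown}")
            # `links` is a dict of internal/external lists, so count the entries
            link_count = sum(len(items) for items in result.links.values())
            print(f"Links found: {link_count}")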
        else:
            print(f"Crawl failed: {result.error_message}")

asyncio.run(main())
```

## Core Components

### AsyncWebCrawler

The primary entry point for all crawling operations:

| Parameter | Type | Description |
|-----------|------|-------------|
| `verbose` | bool | Enable detailed logging output |
| `headless` | bool | Run browser in headless mode |
| `browser_type` | str | Specify browser engine |

Source: [crawl4ai/__init__.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/__init__.py)

### CrawlResult Object

The return value from `crawler.arun()` contains extracted data:

| Property | Type | Description |
|----------|------|-------------|
| `success` | bool | Whether crawl completed successfully |
| `markdown` | str | Extracted content as markdown |
| `html` | str | Raw HTML content |
| `links` | dict | Dictionary of internal/external links |
| `media` | dict | Images, videos, and other media |
| `error_message` | str | Error details if `success` is False |

## Common Usage Patterns

### Pattern 1: Simple Content Extraction

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def extract_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            word_count_threshold=10
        )
        
        if result.success:
            return result.markdown
        return None

content = asyncio.run(extract_content())
```

### Pattern 2: Batch Crawling

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_multiple(urls):
    async with AsyncWebCrawler() as crawler:
        tasks = [crawler.arun(url=url) for url in urls]
        results = await asyncio.gather(*tasks)
        return [r for r in results if r.success]

urls = ["https://example.com", "https://example.org"]
successful_results = asyncio.run(crawl_multiple(urls))
```
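
`asyncio.gather` starts every task at once, which can overwhelm a single browser when the URL list is long. The sketch below caps the number of in-flight calls with an `asyncio.Semaphore`. `gather_bounded` is a hypothetical helper, and `fetch` stands for any coroutine function that takes a URL, such as a small wrapper around the crawler call.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def gather_bounded(
    fetch: Callable[[str], Awaitable],
    urls: Iterable[str],
    limit: int = 5,
) -> list:
    """Run `fetch(url)` for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str):
        async with semaphore:  # wait here while `limit` calls are running
            return await fetch(url)

    # gather preserves input order, so results line up with `urls`
    return await asyncio.gather(*(guarded(u) for u in urls))
```

Inside an `async with AsyncWebCrawler() as crawler:` block this could be called as `await gather_bounded(lambda u: crawler.arun(url=u), urls, limit=3)`.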

## Configuration Options

### Browser Configuration

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    verbose=False
)

run_config = CrawlerRunConfig(
    word_count_threshold=10,
    page_timeout=30000
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.success)

asyncio.run(main())
```

## Error Handling

Always check the `success` property before accessing extracted content:

```python
result = await crawler.arun(url="https://example.com")

if result.success:
    process_data(result.markdown)
else:
    log_error(f"Crawl failed: {result.error_message}")
    handle_failure()
```
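
Transient failures such as timeouts are often worth retrying before giving up. The following sketch wraps any crawl-like coroutine in a retry loop with exponential backoff. `fetch_with_retry` and its parameters are hypothetical, and it assumes only that the returned object exposes a `success` attribute as described above.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_with_retry(
    fetch: Callable[[str], Awaitable],
    url: str,
    attempts: int = 3,
    base_delay: float = 0.5,
):
    """Call `fetch(url)` until the result reports success or attempts run out."""
    result = None
    for attempt in range(attempts):
        result = await fetch(url)
        if result.success:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    return result  # last failed result, so the caller can read error_message
```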

## Next Steps

After completing the Quick Start Guide, users should explore:

- Advanced extraction strategies with CSS selectors and XPath
- JavaScript-heavy page crawling
- Rate limiting and polite crawling practices
- Integration with AI/LLM pipelines for content analysis

---

## Summary

The Quick Start Guide provides the essential foundation for using crawl4ai effectively. By following the patterns shown, developers can immediately begin extracting structured web content with minimal configuration overhead.

---

<a id='architecture'></a>

## System Architecture

### Related Pages

Related topics: [Browser Management](#browser_management), [Async Web Crawler](#async_crawler)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [deploy/docker/ARCHITECTURE.md](https://github.com/unclecode/crawl4ai/blob/main/deploy/docker/ARCHITECTURE.md)
- [crawl4ai/async_webcrawler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)
- [crawl4ai/browser_manager.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/browser_manager.py)
- [crawl4ai/async_dispatcher.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_dispatcher.py)
- [crawl4ai/models.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/models.py)
</details>

# System Architecture

## Overview

Crawl4AI is a high-performance web crawling framework designed for AI applications. It enables efficient extraction of web content along with metadata, supporting both single-page crawling and large-scale asynchronous crawling operations. The architecture emphasizes separation of concerns between browser management, crawling logic, and result processing.

## Core Components

The system is built around three primary modules that work in coordination:

| Component | File | Responsibility |
|-----------|------|----------------|
| AsyncWebCrawler | `crawl4ai/async_webcrawler.py` | Main entry point for crawling operations |
| BrowserManager | `crawl4ai/browser_manager.py` | Handles browser lifecycle and page interactions |
| AsyncDispatcher | `crawl4ai/async_dispatcher.py` | Manages concurrent crawling tasks |

## Component Architecture

```mermaid
graph TD
    A[User / API Client] --> B[AsyncWebCrawler]
    B --> C[BrowserManager]
    B --> D[AsyncDispatcher]
    C --> E[Browser Instance]
    D --> F[Task Queue]
    F --> E
    E --> G[Content Extraction]
    G --> H[Result Models]
    H --> B
```

## AsyncWebCrawler

The `AsyncWebCrawler` class serves as the primary interface for initiating crawl operations. It accepts configuration parameters and coordinates the crawling workflow.

### Key Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| config | CrawlerRunConfig | Configuration for the crawl session |
| browser_manager | BrowserManager | Shared browser manager instance |
| dispatcher | AsyncDispatcher | Task dispatcher for async operations |

Source: [crawl4ai/async_webcrawler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)

### Workflow

```mermaid
graph LR
    A[Initialize Crawler] --> B[Configure Browser]
    B --> C[Create BrowserContext]
    C --> D[Navigate to URL]
    D --> E[Extract Content]
    E --> F[Return CrawlResult]
```

## BrowserManager

The `BrowserManager` handles the lifecycle of browser instances, managing Chrome/Chromium processes and providing isolated contexts for crawling sessions.

Source: [crawl4ai/browser_manager.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/browser_manager.py)

### Browser Lifecycle

```mermaid
graph TD
    A[Launch Browser] --> B[Create Context]
    B --> C[Create Page]
    C --> D[Execute Crawl]
    D --> E[Close Context]
    E --> F[Repeat or Shutdown]
    F --> A
```

## AsyncDispatcher

The `AsyncDispatcher` enables concurrent crawling operations, managing task queues and coordinating multiple browser contexts for parallel extraction.

Source: [crawl4ai/async_dispatcher.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_dispatcher.py)

### Parallel Execution Model

```mermaid
graph TD
    A[URL List] --> B[Dispatcher Queue]
    B --> C[Worker 1]
    B --> D[Worker 2]
    B --> E[Worker N]
    C --> F[Results Aggregator]
    D --> F
    E --> F
    F --> G[Combined Output]
```
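
The queue-and-workers shape in the diagram can be sketched with `asyncio.Queue`. This is an illustration of the general pattern only: `dispatch` is a hypothetical function, not the `AsyncDispatcher` implementation, and `fetch` stands for any coroutine function that processes one URL.

```python
import asyncio
from typing import Awaitable, Callable


async def dispatch(
    fetch: Callable[[str], Awaitable],
    urls: list[str],
    workers: int = 3,
) -> dict:
    """Fan URLs out to a fixed pool of workers and aggregate results by URL."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: dict = {}

    async def worker() -> None:
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # queue drained, this worker is done
            try:
                results[url] = await fetch(url)
            except Exception as exc:  # record the failure, keep the pool alive
                results[url] = exc

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```

Storing exceptions next to successful results lets one bad URL fail without cancelling the rest of the batch.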

## Data Models

Results from crawling operations are structured using Pydantic models defined in `models.py`.

Source: [crawl4ai/models.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/models.py)

| Model | Purpose |
|-------|---------|
| CrawlResult | Container for extracted content and metadata |
| CrawlerRunConfig | Configuration parameters for crawl sessions |

## Docker Deployment Architecture

The project includes Docker deployment specifications that containerize the crawling infrastructure.

```mermaid
graph TD
    A[Docker Compose] --> B[Crawl4AI Container]
    A --> C[Redis Cache]
    A --> D[Chrome Browser]
    B --> C
    B --> D
```

Source: [deploy/docker/ARCHITECTURE.md](https://github.com/unclecode/crawl4ai/blob/main/deploy/docker/ARCHITECTURE.md)

## Technology Stack

| Layer | Technology |
|-------|------------|
| Runtime | Python 3.10+ |
| Browser Engine | Chrome/Chromium via Playwright |
| Async Framework | asyncio |
| Data Validation | Pydantic |
| Containerization | Docker |

## Configuration

The system supports extensive configuration options through `CrawlerRunConfig`, including:

- JavaScript execution toggles
- Memory management settings
- Request throttling parameters
- Content extraction strategies

## Dependency Management

The project maintains a Software Bill of Materials (SBOM) for tracking dependencies and ensuring reproducible builds.

Source: [sbom/README.md](https://github.com/unclecode/crawl4ai/blob/main/sbom/README.md)

To regenerate the SBOM:

```bash
./scripts/gen-sbom.sh
```

---

<a id='browser_management'></a>

## Browser Management

### Related Pages

Related topics: [Anti-Bot Detection and Proxy Management](#anti_bot_detection)

<details>
<summary>Related source files</summary>

The following source files were used to generate this page:

- [crawl4ai/browser_manager.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/browser_manager.py)
- [crawl4ai/browser_adapter.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/browser_adapter.py)
- [crawl4ai/browser_profiler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/browser_profiler.py)
- [crawl4ai/js_snippet](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/js_snippet)
- [docs/codebase/browser.md](https://github.com/unclecode/crawl4ai/blob/main/docs/codebase/browser.md)
</details>

# Browser Management

## Overview

Browser Management in crawl4ai provides a comprehensive abstraction layer for controlling and orchestrating browser instances used during web crawling and scraping operations. The system abstracts the complexity of browser automation, allowing users to focus on data extraction rather than browser lifecycle management.

## Architecture Overview

The browser management system follows a modular architecture with distinct components that handle specific responsibilities:

```mermaid
graph TD
    A[BrowserManager] --> B[BrowserAdapter]
    A --> C[BrowserProfiler]
    B --> D[Playwright/Chromium]
    C --> E[JS Snippets]
    F[User Request] --> A
    A --> G[Crawled Result]
```

## Core Components

### BrowserManager

The central orchestrator responsible for:

- Browser instance lifecycle (creation, configuration, teardown)
- Session management and isolation
- Resource allocation and cleanup
- Coordination between adapters and profilers

**Key Responsibilities:**

| Responsibility | Description |
|----------------|-------------|
| Instance Creation | Creates and initializes browser contexts |
| Configuration | Applies user-defined browser settings |
| Lifecycle Control | Manages startup and shutdown sequences |
| Pool Management | Handles browser pool for concurrent operations |

Source: [crawl4ai/browser_manager.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/browser_manager.py)

### BrowserAdapter

The adapter pattern implementation that provides a consistent interface for interacting with the browser engines available through Playwright (Chromium, Firefox, WebKit).

**Adapter Features:**

| Feature | Description |
|---------|-------------|
| Engine Abstraction | Unified API across browser backends |
| Command Translation | Converts high-level commands to browser-specific instructions |
| Response Normalization | Standardizes browser responses |

Source: [crawl4ai/browser_adapter.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/browser_adapter.py)

### BrowserProfiler

Handles JavaScript injection and performance profiling during browser operations.

**Profiler Capabilities:**

| Capability | Purpose |
|------------|---------|
| JS Injection | Execute custom JavaScript in page context |
| Performance Tracking | Monitor page load and execution metrics |
| Resource Profiling | Track network requests and responses |

Source: [crawl4ai/browser_profiler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/browser_profiler.py)

## JavaScript Integration

The `js_snippet` module provides pre-built JavaScript utilities for browser automation tasks:

```mermaid
graph LR
    A[Browser Context] --> B[js_snippet Module]
    B --> C[DOM Manipulation]
    B --> D[Data Extraction]
    B --> E[Event Handling]
```

**Common JS Snippet Categories:**

- DOM traversal and manipulation
- Content extraction
- Scroll management
- Wait conditions
- Network request interception

Source: [crawl4ai/js_snippet](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/js_snippet)

## Configuration Options

### Browser Launch Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| headless | bool | true | Run browser in headless mode |
| args | list | [] | Additional browser arguments |
| timeout | int | 30000 | Navigation timeout in milliseconds |
| viewport | dict | {"width": 1920, "height": 1080} | Browser viewport dimensions |
| user_agent | str | None | Custom user agent string |
| proxy | dict | None | Proxy configuration |

### Context Options

| Option | Type | Description |
|--------|------|-------------|
| java_script_enabled | bool | Enable or disable JavaScript in the browser context |
| ignore_https_errors | bool | Ignore SSL certificate errors |

Source: [docs/codebase/browser.md](https://github.com/unclecode/crawl4ai/blob/main/docs/codebase/browser.md)

## Browser Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Initializing: Create BrowserManager
    Initializing --> Launching: Launch Browser
    Launching --> Ready: Browser Context Created
    Ready --> Navigating: Load URL
    Navigating --> Ready: Page Loaded
    Ready --> Executing: Run JS/Commands
    Executing --> Ready: Commands Complete
    Ready --> Closing: Shutdown Request
    Closing --> [*]: Resources Freed
```
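The state diagram above can be read as a transition table. The following is a minimal sketch of that table as code; the `State`, `TRANSITIONS`, and `Lifecycle` names are hypothetical and do not correspond to crawl4ai classes.

```python
from enum import Enum, auto


class State(Enum):
    INITIALIZING = auto()
    LAUNCHING = auto()
    READY = auto()
    NAVIGATING = auto()
    EXECUTING = auto()
    CLOSING = auto()
    CLOSED = auto()


# Allowed transitions, copied edge by edge from the state diagram.
TRANSITIONS = {
    State.INITIALIZING: {State.LAUNCHING},
    State.LAUNCHING: {State.READY},
    State.READY: {State.NAVIGATING, State.EXECUTING, State.CLOSING},
    State.NAVIGATING: {State.READY},
    State.EXECUTING: {State.READY},
    State.CLOSING: {State.CLOSED},
    State.CLOSED: set(),
}


class Lifecycle:
    def __init__(self):
        self.state = State.INITIALIZING

    def move(self, target: State) -> None:
        """Advance to `target`, rejecting edges the diagram does not contain."""
        if target not in TRANSITIONS[self.state]:
            raise RuntimeError(
                f"illegal transition {self.state.name} -> {target.name}"
            )
        self.state = target
```

Note that every working state returns to `READY` before anything else can happen, so a shutdown request is only accepted between operations, never in the middle of a navigation.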

## Usage Patterns

### Basic Browser Usage

```python
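import asyncio

from crawl4ai import BrowserManager


async def main():
    # Constructor arguments are shown as described on this page;
    # check them against the installed crawl4ai version.
    browser_mgr = BrowserManager(
        headless=True,
        viewport={"width": 1920, "height": 1080}
    )

    # Create an isolated browser context
    context = browser_mgr.new_context()

    # Navigate within the context; `await` is only valid inside a coroutine
    result = await context.goto("https://example.com")
    return result


asyncio.run(main())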
```

### Advanced Configuration

```python
browser_mgr = BrowserManager(
    headless=False,
    args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-dev-shm-usage"
    ],
    timeout=60000,
    user_agent="Custom User Agent"
)
```

## Session Management

The system supports multiple concurrent sessions through isolated browser contexts:

```mermaid
graph TD
    A[BrowserManager] --> B1[Session 1 Context]
    A --> B2[Session 2 Context]
    A --> B3[Session N Context]
    B1 --> C1[Page 1]
    B2 --> C2[Page 2]
    B3 --> C3[Page N]
```
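As a rough illustration of the diagram above, session isolation can be modeled as a registry that lazily creates one context per session ID. `SessionRegistry` and `context_factory` are hypothetical names, and a plain `dict` stands in for a real browser context.

```python
class SessionRegistry:
    """Maps session IDs to isolated contexts (hypothetical helper)."""

    def __init__(self, context_factory):
        self._factory = context_factory
        self._contexts = {}

    def get(self, session_id):
        """Return the context for a session, creating it on first use."""
        if session_id not in self._contexts:
            self._contexts[session_id] = self._factory()
        return self._contexts[session_id]

    def close(self, session_id):
        """Drop a session; returns True if it existed."""
        return self._contexts.pop(session_id, None) is not None

    def __len__(self):
        return len(self._contexts)
```

Each call to the factory yields a fresh object, so state written in one session (cookies, storage) is not visible from another.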

## Error Handling

The browser management system implements comprehensive error handling:

| Error Type | Handling Strategy |
|------------|-------------------|
| Navigation Timeout | Retry with exponential backoff |
| Browser Crash | Automatic restart and context recreation |
| Resource Exhaustion | Automatic cleanup of stale contexts |
| Network Errors | Graceful degradation with cached content |
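The "retry with exponential backoff" strategy in the table can be sketched as follows. This is a generic pattern rather than crawl4ai's own retry code; `retry_with_backoff` and its parameters are assumptions made for illustration.

```python
import asyncio
import random


async def retry_with_backoff(op, *, attempts=4, base_delay=0.5, max_delay=8.0,
                             retry_on=(asyncio.TimeoutError,),
                             sleep=asyncio.sleep):
    """Await the coroutine function `op` until it succeeds or attempts run out."""
    for attempt in range(attempts):
        try:
            return await op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on every failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries issued by many concurrent tasks.
            await sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter exists only so that the waiting behavior can be replaced in tests; callers would normally leave it at its default.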

## Performance Considerations

### Optimization Strategies

1. **Context Reuse**: Reuse browser contexts for multiple pages when possible
2. **Lazy Loading**: Only load resources when explicitly requested
3. **Resource Limits**: Configure memory and CPU limits per context
4. **Connection Pooling**: Maintain warm browser instances for rapid access

### Memory Management

| Strategy | Description |
|----------|-------------|
| Context Isolation | Each session runs in isolated context |
| Automatic Cleanup | Temporary files and caches cleared automatically |
| Resource Limits | Configurable memory caps per browser instance |

## Related Documentation

- [Browser Adapter API Reference](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/browser_adapter.py)
- [JavaScript Snippets Library](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/js_snippet)
- [Codebase Overview](https://github.com/unclecode/crawl4ai/blob/main/docs/codebase/browser.md)

---

<a id='async_crawler'></a>

## Async Web Crawler

### Related Pages

Related topics: [Markdown Generation](#markdown_generation), [Extraction Strategies](#extraction_strategies)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [crawl4ai/async_webcrawler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)
- [crawl4ai/async_configs.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_configs.py)
- [crawl4ai/cache_context.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/cache_context.py)
- [crawl4ai/types.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/types.py)
- [docs/md_v2/api/async-webcrawler.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/api/async-webcrawler.md)
</details>

# Async Web Crawler

## Overview

The Async Web Crawler is the core component of crawl4ai, providing an asynchronous, high-performance web crawling engine built on Python's `asyncio` framework. It enables concurrent crawling of multiple URLs with built-in caching, configurable extraction strategies, and comprehensive result handling.

The primary purpose of this module is to fetch web pages, extract meaningful content, and return structured results that include HTML, markdown, media assets, metadata, and optional AI-generated summaries. The async design allows for efficient I/O-bound operations, making it suitable for large-scale web scraping projects.

Source: [crawl4ai/async_webcrawler.py:1-50](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)

## Architecture

### System Components

The async web crawler system consists of several interconnected components that work together to provide a seamless crawling experience.

```mermaid
graph TD
    A[AsyncWebCrawler] --> B[Browser Manager]
    A --> C[Cache Layer]
    A --> D[Extraction Strategy]
    A --> E[Result Processor]
    
    B --> F[Playwright/Chromium]
    C --> G[File System Cache]
    C --> H[Memory Cache]
    
    D --> I[LLM-based Extraction]
    D --> J[CSS/XPath Extraction]
    
    E --> K[CrawlResult]
    E --> L[Raw HTML]
    E --> M[Markdown]
```

Source: [crawl4ai/async_webcrawler.py:1-30](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)

### Core Classes

| Class | File | Purpose |
|-------|------|---------|
| `AsyncWebCrawler` | async_webcrawler.py | Main crawler entry point with `arun()` and `arun_many()` methods |
| `BrowserConfig` | async_configs.py | Configuration for headless browser behavior |
| `CacheContext` | cache_context.py | Decides, based on `CacheMode`, whether a crawl reads from or writes to the cache |
| `CrawlResult` | types.py | Data model for returning crawl results |

Source: [crawl4ai/async_configs.py:1-30](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_configs.py)

## AsyncWebCrawler Class

### Initialization

The `AsyncWebCrawler` class can be initialized with optional configuration parameters:

```python
class AsyncWebCrawler:
    def __init__(
        self,
        config: BrowserConfig | None = None,
        verbose: bool = False
    ) -> None:
        ...  # signature only; body omitted
```

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `config` | `BrowserConfig` | `None` | Browser configuration object |
| `verbose` | `bool` | `False` | Enable verbose logging output |

Source: [crawl4ai/async_webcrawler.py:50-70](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)

### Core Methods

#### `arun()`

Single URL crawling with comprehensive result extraction:

```python
async def arun(
    self,
    url: str,
    config: CrawlerRunConfig | None = None,
    **kwargs
) -> CrawlResult: ...
```

**Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `url` | `str` | Target URL to crawl |
| `config` | `CrawlerRunConfig` | Per-run crawl settings; browser-level settings stay on the `BrowserConfig` passed to the constructor |

Source: [crawl4ai/async_webcrawler.py:100-150](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)

#### `arun_many()`

Batch crawling for multiple URLs concurrently:

```python
async def arun_many(
    self,
    urls: list[str],
    config: CrawlerRunConfig | None = None,
    **kwargs
) -> list[CrawlResult]: ...
```

**Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `urls` | `list[str]` | List of target URLs |
| `config` | `CrawlerRunConfig` | Per-run crawl settings shared by all URLs |

Source: [crawl4ai/async_webcrawler.py:200-250](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)

### Context Manager Support

The `AsyncWebCrawler` implements the async context manager protocol for proper resource cleanup:

```python
async def __aenter__(self) -> "AsyncWebCrawler":
    await self.start()
    return self

async def __aexit__(
    self,
    exc_type, exc_val, exc_tb
) -> None:
    await self.close()
```

Source: [crawl4ai/async_webcrawler.py:80-100](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)
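To show why the context manager protocol matters for cleanup, here is a self-contained sketch with a stand-in class. `ManagedResource` and `demo` are hypothetical; only the `__aenter__`/`__aexit__` shape mirrors the snippet above.

```python
import asyncio


class ManagedResource:
    """Stand-in for a crawler-like object; only the protocol matters here."""

    def __init__(self):
        self.events = []

    async def start(self):
        self.events.append("start")

    async def close(self):
        self.events.append("close")

    async def __aenter__(self):
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.close()  # runs even when the body raised
        # Returning None (falsy) lets the exception propagate to the caller.


async def demo():
    resource = ManagedResource()
    try:
        async with resource:
            raise ValueError("simulated crawl failure")
    except ValueError:
        pass
    return resource.events
```

Even though the body of the `async with` block raises, `close()` is still recorded, which is the guarantee that makes the `async with AsyncWebCrawler() as crawler:` form preferable to calling `start()` and `close()` by hand.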

## CrawlResult Data Model

The `CrawlResult` class encapsulates all information retrieved from a crawled page:

```mermaid
classDiagram
    class CrawlResult {
        +str url
        +str html
        +str markdown
        +list~MediaItem~ media
        +list~Link~ links
        +dict metadata
        +bool success
        +str|None error_message
        +dict~str, Any~ extracted_content
        +int status_code
        +datetime created_at
    }
```

Source: [crawl4ai/types.py:1-80](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/types.py)

### Properties

| Property | Type | Description |
|----------|------|-------------|
| `url` | `str` | Original request URL |
| `html` | `str` | Raw HTML content |
| `markdown` | `str` | Converted markdown content |
| `media` | `list[MediaItem]` | Extracted images, videos, audio |
| `links` | `list[Link]` | Internal and external links |
| `metadata` | `dict` | Page metadata (title, description) |
| `success` | `bool` | `True` if the crawl succeeded, `False` otherwise |
| `error_message` | `str \| None` | Error message if the crawl failed |
| `status_code` | `int` | HTTP response status code |
| `created_at` | `datetime` | Timestamp of crawl operation |

Source: [crawl4ai/types.py:50-100](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/types.py)

## Configuration

### BrowserConfig

The `BrowserConfig` class provides fine-grained control over browser behavior:

```python
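from dataclasses import dataclass, field


@dataclass
class BrowserConfig:
    headless: bool = True
    browser_type: str = "chromium"
    # A mutable default must go through default_factory in a dataclass
    viewport_size: dict = field(default_factory=lambda: {"width": 1920, "height": 1080})
    user_agent: str | None = None
    verbose: bool = False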
```

Source: [crawl4ai/async_configs.py:30-80](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_configs.py)

### Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `headless` | `bool` | `True` | Run browser in headless mode |
| `browser_type` | `str` | `"chromium"` | Browser engine (chromium, firefox, webkit) |
| `viewport_size.width` | `int` | `1920` | Viewport width in pixels |
| `viewport_size.height` | `int` | `1080` | Viewport height in pixels |
| `user_agent` | `str \| None` | `None` | Custom user agent string |
| `verbose` | `bool` | `False` | Enable debug output |

Source: [crawl4ai/async_configs.py:40-90](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_configs.py)

### Advanced Configuration

Additional crawling parameters can be passed via kwargs:

| Parameter | Type | Description |
|-----------|------|-------------|
| `word_count_threshold` | `int` | Minimum word count for content extraction |
| `extraction_strategy` | `ExtractionStrategy` | Strategy for content extraction |
| `cache_mode` | `CacheMode` | Caching behavior (enabled/disabled/bypass) |
| `js_enabled` | `bool` | Enable JavaScript execution |
| `wait_for` | `str` | CSS selector to wait for before returning |
| `delay_before_return_html` | `float` | Delay in seconds before capturing HTML |

Source: [docs/md_v2/api/async-webcrawler.md:1-60](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/api/async-webcrawler.md)

## Caching System

### Cache Modes

The crawl4ai framework implements a multi-layered caching strategy:

```mermaid
graph LR
    A[Request] --> B{Memory Cache}
    B -->|Hit| C[Return Cached]
    B -->|Miss| D{File System Cache}
    D -->|Hit| E[Return Cached]
    D -->|Miss| F[Fetch Remote]
    F --> G[Store in Both Layers]
```

Source: [crawl4ai/cache_context.py:1-50](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/cache_context.py)
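The lookup order in the diagram (memory, then file system, then remote fetch) can be sketched with the standard library alone. `TwoLayerCache`, its on-disk layout, and the `fetch` callback are assumptions made for illustration and do not describe crawl4ai's actual storage format.

```python
import hashlib
import json
from pathlib import Path


class TwoLayerCache:
    """Memory-first cache backed by one JSON file per URL (illustrative)."""

    def __init__(self, directory):
        self.memory = {}
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        # Hash the URL so it is always a safe file name
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get(self, url, fetch):
        """Return (value, layer), where layer names the source that answered."""
        if url in self.memory:
            return self.memory[url], "memory"
        path = self._path(url)
        if path.exists():
            value = json.loads(path.read_text(encoding="utf-8"))
            self.memory[url] = value  # promote to the faster layer
            return value, "disk"
        value = fetch(url)
        self.memory[url] = value  # store in both layers
        path.write_text(json.dumps(value), encoding="utf-8")
        return value, "remote"
```

A second process (or a restarted one) starts with an empty memory layer, so its first lookup is answered from disk and then promoted to memory.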

### CacheMode Enum

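| Mode | Description |
|------|-------------|
| `ENABLED` | Read from the cache when an entry exists; otherwise fetch and write the result to the cache |
| `DISABLED` | Do not use the cache at all; always fetch fresh content |
| `READ_ONLY` | Read from the cache but never write to it |
| `WRITE_ONLY` | Never read from the cache, but write fetched results to it |
| `BYPASS` | Skip the cache for this operation and fetch fresh content |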

Source: [crawl4ai/cache_context.py:30-60](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/cache_context.py)

### Cache Context Manager

```python
async with cache_context(cache_mode=CacheMode.ENABLED):
    result = await crawler.arun(url="https://example.com")
```

Source: [crawl4ai/cache_context.py:60-90](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/cache_context.py)

## Usage Examples

### Basic Single URL Crawl

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Browser-level settings belong on the crawler, not on the arun() call
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")

        print(f"Success: {result.success}")
        print(f"Markdown content: {result.markdown[:500]}")

asyncio.run(main())
```

Source: [crawl4ai/async_webcrawler.py:150-200](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)

### Batch Crawling

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)
        
        for result in results:
            print(f"URL: {result.url}, Status: {result.status_code}")

asyncio.run(main())
```

Source: [docs/md_v2/api/async-webcrawler.md:60-100](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/api/async-webcrawler.md)

### With Custom Extraction Strategy

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    config = BrowserConfig(
        headless=True,
        verbose=True
    )
    
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token="your-token"
    )
    
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(
            url="https://news-site.com/article",
            extraction_strategy=strategy
        )
        
        print(result.extracted_content)

asyncio.run(main())
```

Source: [crawl4ai/async_configs.py:90-130](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_configs.py)

## Error Handling

The crawler returns comprehensive error information through the `CrawlResult` object:

```python
result = await crawler.arun(url="https://invalid-url.xyz")

if not result.success:
    print(f"Error: {result.error_message}")
    print(f"Status Code: {result.status_code}")
```

| Error Scenario | `success` Value | `error_message` Field |
|----------------|-----------------|---------------|
| Network timeout | `False` | Connection timeout message |
| Invalid URL | `False` | URL validation error |
| JavaScript error | `False` | Browser console error |
| HTTP 404/500 | `False` | HTTP status message |
| Success | `True` | `None` |

Source: [crawl4ai/types.py:80-120](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/types.py)

## Performance Considerations

### Async Benefits

The async architecture provides several performance advantages:

1. **Concurrent Requests**: Multiple URLs can be crawled simultaneously
2. **Non-blocking I/O**: Browser operations don't block other tasks
3. **Resource Efficiency**: Single event loop manages all crawling tasks
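
The three points above can be demonstrated without any crawler at all. The following sketch bounds concurrency with a semaphore on a single event loop; `crawl_all` and the `fetch` callback are hypothetical stand-ins, not the implementation of `arun_many()`.

```python
import asyncio


async def crawl_all(urls, fetch, max_concurrency=3):
    """Run `fetch(url)` for every URL, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather preserves input order; return_exceptions keeps one failure
    # from discarding the results of the rest of the batch.
    return await asyncio.gather(
        *(bounded(u) for u in urls), return_exceptions=True
    )
```

Failures come back in place as exception objects, so the caller can tell which URL failed by its position in the result list.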

### Best Practices

| Practice | Benefit |
|----------|---------|
| Use `arun_many()` for batch operations | Reduces connection overhead |
| Enable caching for repeated URLs | Avoids redundant network requests |
| Set appropriate `word_count_threshold` | Reduces unnecessary processing |
| Use `headless=True` in production | Reduces memory usage |

Source: [crawl4ai/async_webcrawler.py:250-300](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)

## Related Components

| Component | File | Relationship |
|-----------|------|--------------|
| `ExtractionStrategy` | extraction_strategy.py | Defines how content is extracted |
| `MediaItem` | types.py | Represents extracted media |
| `Link` | types.py | Represents extracted links |
| `CacheBackend` | cache_backend.py | Abstract cache implementation |

Source: [crawl4ai/types.py:1-30](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/types.py)

## Summary

The Async Web Crawler is the foundational building block of crawl4ai, providing:

- **Asynchronous operation** for high-performance concurrent crawling
- **Flexible configuration** via `BrowserConfig` dataclass
- **Comprehensive result types** through `CrawlResult` model
- **Multi-layered caching** with configurable modes
- **Extensible extraction** via pluggable strategies
- **Production-ready error handling** with detailed error reporting

This architecture enables developers to build scalable web scraping solutions while maintaining clean, readable code patterns familiar to Python async developers.

---

<a id='markdown_generation'></a>

## Markdown Generation

### Related Pages

Related topics: [Extraction Strategies](#extraction_strategies), [Async Web Crawler](#async_crawler)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [crawl4ai/markdown_generation_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/markdown_generation_strategy.py)
- [crawl4ai/content_filter_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/content_filter_strategy.py)
- [crawl4ai/html2text](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/html2text)
- [docs/md_v2/core/markdown-generation.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/core/markdown-generation.md)
- [crawl4ai/extraction_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)
</details>

# Markdown Generation

Markdown Generation is a core feature in crawl4ai that transforms raw HTML content into clean, readable Markdown format. This system provides flexible strategies for content extraction, filtering, and conversion with extensive customization options.

## Overview

The Markdown Generation system converts web page HTML into structured Markdown text suitable for LLM consumption, RAG systems, or documentation purposes. It offers multiple generation strategies, content filtering capabilities, and fine-grained control over extraction behavior.

Source: [docs/md_v2/core/markdown-generation.md:1-15](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/core/markdown-generation.md)

## Architecture

```mermaid
graph TD
    A[HTML Input] --> B[Content Filter Strategy]
    B --> C[HTML Processing]
    C --> D[Markdown Generation Strategy]
    D --> E[Markdown Output]
    
    F[Configuration] --> B
    F --> D
    
    G[BestProvider] --> B
    G --> D
```

## Core Components

### MarkdownGenerationStrategy

The primary abstraction for generating Markdown from HTML content.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `provider` | `str` | `"best"` | Content extraction provider |
| `configs` | `dict` | `{}` | Provider-specific configurations |
| `strict` | `bool` | `False` | Raise errors on failure |
| `override_system_prompt` | `str` | `None` | Custom system prompt |
| `override_user_prompt` | `str` | `None` | Custom user prompt |

Source: [crawl4ai/markdown_generation_strategy.py:1-50](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/markdown_generation_strategy.py)

### ContentFilterStrategy

Abstract base class for filtering and selecting content before Markdown conversion.

| Filter | Description |
|--------|-------------|
| `PruningContentFilter` | Removes low-value content nodes |
| `BM25ContentFilter` | Uses BM25 ranking for content selection |
| `OrgAnnContentFilter` | Organic annotation-based filtering |

Source: [crawl4ai/content_filter_strategy.py:1-100](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/content_filter_strategy.py)
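To make "BM25 ranking for content selection" concrete, here is a compact sketch of Okapi BM25 scoring over text chunks. It is a textbook formulation under stated assumptions (simple lowercase tokenization, `k1=1.5`, `b=0.75`) and is not the code of `BM25ContentFilter`; `tokenize`, `bm25_scores`, and `top_chunks` are hypothetical helpers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: in how many chunks each term appears
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes unusually long chunks
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_chunks(query, chunks, top_n=5):
    """Return up to top_n chunks with a positive score, best first."""
    ranked = sorted(
        zip(bm25_scores(query, chunks), range(len(chunks))),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [chunks[i] for score, i in ranked[:top_n] if score > 0]
```

Chunks that share no term with the query score zero and are dropped, which is what lets a query-driven filter shrink a large page before Markdown conversion.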

## Generation Providers

### Available Providers

| Provider | Description |
|----------|-------------|
| `best` | Automatically selects optimal provider |
| `playwright` | Uses Playwright for JavaScript rendering |
| `curl` | Lightweight extraction via curl |
| `trafilatura` | Trafilatura library extraction |
| `lxml` | LXML-based HTML parsing |
| `readability` | Mozilla Readability algorithm |

Source: [crawl4ai/markdown_generation_strategy.py:50-150](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/markdown_generation_strategy.py)

### Best Provider Selection

The `BestProvider` class intelligently selects the most appropriate extraction method based on content characteristics.

```python
class BestProvider:
    def get_strategy(self, html: str) -> MarkdownGenerationStrategy:
        # Analyzes HTML and selects optimal provider
        pass
```

Source: [crawl4ai/markdown_generation_strategy.py:150-200](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/markdown_generation_strategy.py)

## Workflow

```mermaid
graph LR
    A[Fetch HTML] --> B{Content Filter Enabled?}
    B -->|Yes| C[Apply Filter Strategy]
    B -->|No| D[Skip Filtering]
    C --> E[Generate Markdown]
    D --> E
    E --> F{Post-Processing?}
    F -->|Yes| G[Apply Custom Rules]
    F -->|No| H[Return Result]
    G --> H
```

## Configuration Options

### Generator Config

```python
{
    "word_threshold": 50,          # Minimum words per chunk
    "language": "en",              # Content language
    "skip_internal_links": True,   # Ignore internal links
    "content_type": "markdown"     # Output format
}
```

### BM25 Filter Config

```python
{
    "query": "relevant keywords",
    "top_n": 5,                    # Number of chunks
    "use_stem": True               # Apply stemming
}
```

Source: [crawl4ai/content_filter_strategy.py:100-180](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/content_filter_strategy.py)

## HTML to Markdown Conversion

### html2text Module

The `html2text` submodule handles low-level HTML to Markdown conversion.

| Method | Purpose |
|--------|---------|
| `handle_anchor` | Convert `<a>` tags to `[text](url)` |
| `handle_image` | Convert `<img>` to `![alt](src)` |
| `handle_heading` | Convert `<h1>`-`<h6>` to `#` - `######` |
| `handle_table` | Convert `<table>` to Markdown tables |
| `handle_code` | Preserve `<code>` and `<pre>` formatting |

Source: [crawl4ai/html2text:core.py:1-100](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/html2text)
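The tag-to-Markdown mappings in the table can be sketched with Python's built-in `html.parser`. This is a deliberately tiny converter covering headings, links, images, inline code, and paragraphs; `MiniMarkdown` and `to_markdown` are hypothetical names, and the handler methods are those of `HTMLParser`, not the `handle_*` methods listed above.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class MiniMarkdown(HTMLParser):
    """Collects Markdown fragments while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADINGS:
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "a":
            self._href = attrs.get("href", "")
            self.out.append("[")
        elif tag == "img":
            self.out.append(f"![{attrs.get('alt', '')}]({attrs.get('src', '')})")
        elif tag == "code":
            self.out.append("`")
        elif tag == "p":
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag == "a":
            self.out.append(f"]({self._href})")
            self._href = None
        elif tag == "code":
            self.out.append("`")
        elif tag == "p" or tag in HEADINGS:
            self.out.append("\n")

    def handle_data(self, data):
        self.out.append(data)


def to_markdown(html):
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()
```

A production converter also has to deal with nesting, lists, tables, escaping, and whitespace collapsing, which is why a dedicated module is used for this step.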

### Conversion Features

- **Link Preservation**: External links converted to Markdown format with titles
- **Image Extraction**: Images extracted with alt text and sources
- **Table Conversion**: HTML tables converted to GFM tables
- **Code Block Handling**: Syntax-aware code block extraction
- **List Recognition**: Ordered and unordered lists properly formatted

## Usage Examples

### Basic Usage

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # `async with` must run inside a coroutine
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            markdown_generator={
                "provider": "best",
                "configs": {"word_threshold": 100}
            }
        )
        print(result.markdown)

asyncio.run(main())
```

### With Content Filter

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter

filter_strategy = BM25ContentFilter(
    query="getting started installation",
    top_n=10
)

async def main():
    # `async with` must run inside a coroutine
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com",
            markdown_generator={
                "provider": "playwright",
                "filter_strategy": filter_strategy
            }
        )
        print(result.markdown)

asyncio.run(main())
```

## Advanced Configuration

### Custom Prompts

Override system and user prompts for specialized extraction:

```python
markdown_generator = {
    "override_system_prompt": "Extract only technical documentation...",
    "override_user_prompt": "Focus on API endpoints and code examples..."
}
```

### Strict Mode

Enable strict mode to raise exceptions on extraction failures:

```python
markdown_generator = {
    "strict": True,
    "provider": "playwright"
}
```

## Performance Considerations

| Aspect | Recommendation |
|--------|----------------|
| Large Pages | Use `BM25ContentFilter` to reduce content |
| JavaScript-heavy Sites | Use `playwright` provider |
| Simple Pages | Use `lxml` or `trafilatura` for speed |
| Batch Processing | Set appropriate `word_threshold` |

## Error Handling

The system provides graceful degradation:

1. **Provider Fallback**: Falls back to alternative provider on failure
2. **Strict Mode**: Raises exceptions when enabled
3. **Partial Results**: Returns available content on partial failures

Source: [crawl4ai/extraction_strategy.py:50-120](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)
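The interaction between provider fallback and strict mode can be sketched as a simple loop. `generate_with_fallback` and the `(name, convert)` provider pairs are hypothetical; the sketch only illustrates the control flow described in the list above.

```python
def generate_with_fallback(html, providers, strict=False):
    """Try each (name, convert) pair in order and return the first success.

    With strict=False a total failure yields (None, ""); with strict=True
    it raises, carrying every provider's error message.
    """
    errors = []
    for name, convert in providers:
        try:
            return name, convert(html)
        except Exception as exc:  # sketch only: real code should catch narrower errors
            errors.append(f"{name}: {exc}")
    if strict:
        raise RuntimeError("all providers failed: " + "; ".join(errors))
    return None, ""
```

Returning the provider name alongside the output makes it possible to log which fallback actually produced the result.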

## Related Components

- **Chunking**: Content can be further processed with text chunking strategies
- **Extraction**: Works alongside extraction strategies for structured data
- **Cache**: Generated Markdown can be cached for repeated access

---

<a id='extraction_strategies'></a>

## Extraction Strategies

### Related Pages

Related topics: [Markdown Generation](#markdown_generation), [Deep Crawling Strategies](#deep_crawling)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [crawl4ai/extraction_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)
- [crawl4ai/chunking_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/chunking_strategy.py)
- [crawl4ai/content_scraping_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/content_scraping_strategy.py)
- [crawl4ai/table_extraction.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/table_extraction.py)
- [docs/md_v2/extraction/llm-strategies.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/extraction/llm-strategies.md)
- [docs/md_v2/extraction/no-llm-strategies.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/extraction/no-llm-strategies.md)
</details>

# Extraction Strategies

Extraction Strategies in crawl4ai define how content is parsed, structured, and extracted from crawled web pages. They form the core abstraction layer that determines whether unstructured HTML becomes meaningful, machine-readable data.

## Overview

Extraction Strategies handle the transformation pipeline from raw HTML to structured output. The system supports two primary categories:

| Category | Use Case | Performance |
|----------|----------|-------------|
| LLM-based | Complex, semantic extraction | Slower, higher accuracy |
| No-LLM | Fast, pattern-based extraction | Faster, rule-dependent |

Source: [crawl4ai/extraction_strategy.py:1-50](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)

## Architecture

```mermaid
graph TD
    A[HTML Content] --> B[Content Scraping Strategy]
    B --> C[Chunking Strategy]
    C --> D{Extraction Strategy}
    D -->|LLM-based| E[LLM Strategy]
    D -->|No-LLM| F[No-LLM Strategy]
    E --> G[Structured JSON/Markdown]
    F --> G
    G --> H[Table Extraction Optional]
    H --> I[Final Output]
```

The extraction pipeline flows through scraping → chunking → extraction, with table extraction as an optional final step.

Source: [crawl4ai/extraction_strategy.py:50-100](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)

## LLM-Based Strategies

LLM strategies leverage large language models for semantic understanding and intelligent content extraction.

### Supported Providers

| Provider | Model Support | Configuration |
|----------|--------------|---------------|
| OpenAI | GPT-4, GPT-3.5 | `OPENAI_API_KEY` |
| Anthropic | Claude 3, Claude 2 | `ANTHROPIC_API_KEY` |
| Azure OpenAI | Custom deployments | `AZURE_API_KEY`, `AZURE_API_BASE` |
| Ollama | Local models | `OLLAMA_BASE_URL` |

### Configuration Parameters

```python
from typing import Optional


class LLMExtractionStrategy:
    def __init__(
        self,
        provider: str = "openai",
        model: str = "gpt-4",
        api_token: Optional[str] = None,
        system_prompt: Optional[str] = None,
        user_prompt: Optional[str] = None,
        extraction_type: str = "block",
        input_format: str = "html",
        instruction: Optional[str] = None
    ): ...  # signature only; body omitted
```

Source: [docs/md_v2/extraction/llm-strategies.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/extraction/llm-strategies.md)

### Extraction Types

| Type | Description | Best For |
|------|-------------|----------|
| `block` | Block-level extraction | Paragraphs, sections |
| `schema` | Schema-based extraction | Structured data, forms |
| `custom` | Custom instructions | Specific extraction needs |

Source: [crawl4ai/extraction_strategy.py:100-150](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)

## No-LLM Strategies

No-LLM strategies provide fast, deterministic extraction without external API dependencies.

### Available Strategies

| Strategy | Purpose |
|----------|---------|
| `NoExtractionStrategy` | Pass-through, no extraction |
| `JsonCssExtractionStrategy` | CSS selector-based JSON extraction |
| `RegexExtractionStrategy` | Regex pattern matching |
| `XPathExtractionStrategy` | XPath-based extraction |

Source: [docs/md_v2/extraction/no-llm-strategies.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/extraction/no-llm-strategies.md)
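As an illustration of what "fast, deterministic extraction" means for the regex case, the following sketch extracts labeled matches with precompiled patterns. It uses only the standard library; `PATTERNS` and `extract` are hypothetical and the two patterns are intentionally simple, so this is not the behavior of `RegexExtractionStrategy`.

```python
import re

# Deliberately simple patterns; real-world email and URL grammars are broader.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s\"'<>]+"),
}


def extract(text, labels=("email", "url")):
    """Return one record per match, grouped in the order of `labels`."""
    found = []
    for label in labels:
        for match in PATTERNS[label].finditer(text):
            found.append(
                {"label": label, "value": match.group(0), "span": match.span()}
            )
    return found
```

The same input always yields the same records and no network call is involved, which is the trade-off against the semantic flexibility of LLM-based strategies.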

### JsonCssExtractionStrategy

```python
from crawl4ai import JsonCssExtractionStrategy

strategy = JsonCssExtractionStrategy(
    schema={
        "name": "ProductList",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h2.title", "type": "text"},
            {"name": "price", "selector": "span.price", "type": "text"},
            # Attribute fields declare type "attribute" plus the attribute name
            {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"}
        ]
    }
)
```

Source: [crawl4ai/extraction_strategy.py:150-200](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)

## Chunking Strategies

Chunking strategies split content into manageable pieces before extraction.

### Default Chunking Behavior

```mermaid
graph LR
    A[Large Content] --> B[Character Split]
    B --> C[Overlap Application]
    C --> D[Token Count Check]
    D -->|Under limit| E[Chunk Ready]
    D -->|Over limit| F[Recursive Split]
    F --> E
```

### Configuration Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `chunk_token_size` | int | 1000 | Target tokens per chunk |
| `overlap` | int | 100 | Overlapping tokens between chunks |
| `max_chunk_size` | int | 3000 | Hard maximum chunk size |
| `splitting_regex` | str | `\n\n+` | Regex for splitting points |

Source: [crawl4ai/chunking_strategy.py:1-80](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/chunking_strategy.py)
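The effect of `chunk_token_size` and `overlap` is easiest to see on a toy input. The sketch below is a plain sliding window in which whitespace-separated words stand in for model tokens; it ignores `splitting_regex` and `max_chunk_size`, and `chunk_text` is a hypothetical helper rather than a crawl4ai function.

```python
def chunk_text(text, chunk_token_size=1000, overlap=100):
    """Split text into overlapping windows of at most chunk_token_size words."""
    if not 0 <= overlap < chunk_token_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_token_size")
    # Assumption: one whitespace-separated word approximates one token.
    words = text.split()
    step = chunk_token_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_token_size]))
        if start + chunk_token_size >= len(words):
            break  # the window already reached the end of the text
    return chunks
```

With ten words, a window of four, and an overlap of one, each chunk begins with the last word of the previous chunk, so a sentence cut at a boundary still appears intact in one of the two neighbors.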

## Content Scraping Strategy

The content scraping strategy determines initial content extraction from HTML.

```mermaid
graph TD
    A[Raw HTML] --> B{Scraping Strategy}
    B -->|BeautifulSoup| C[Parse DOM]
    B -->|Playwright| D[Dynamic Render]
    B -->|Raw| E[Minimal Processing]
    C --> F[Content Cleaned]
    D --> F
    E --> F
```

### Strategy Selection

| Strategy | JavaScript | Speed | Use Case |
|----------|------------|-------|----------|
| `BeautifulSoup` | No | Fast | Static pages |
| `Playwright` | Yes | Medium | SPAs, dynamic content |
| `RawContent` | No | Fastest | Pre-processed HTML |

Source: [crawl4ai/content_scraping_strategy.py:1-60](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/content_scraping_strategy.py)

## Table Extraction

Table extraction handles tabular data structures within web pages.

```python
from typing import List, Optional


class TableExtractionStrategy:
    def __init__(
        self,
        table_styles: Optional[List[str]] = None,
        ignore_tables: Optional[List[str]] = None,
        merge_multiple_headers: bool = False
    ): ...  # signature only; body omitted
```

### Extraction Configuration

| Parameter | Type | Description |
|-----------|------|-------------|
| `table_styles` | List[str] | CSS classes to include as tables |
| `ignore_tables` | List[str] | CSS classes to exclude |
| `merge_multiple_headers` | bool | Merge multi-row headers |
| `extract_header` | bool | Include header row (default: True) |

Source: [crawl4ai/table_extraction.py:1-100](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/table_extraction.py)
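For intuition about what table extraction produces, here is a minimal standard-library sketch that turns `<table>` markup into headers and rows. `TableParser` and `extract_tables` are hypothetical names; the sketch treats the first row containing `<th>` cells as the header and does not handle nested tables, row spans, or the CSS-class filters from the table above.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects each table as {"headers": [...], "rows": [[...], ...]}."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None
        self._is_header_row = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append({"headers": [], "rows": []})
        elif tag == "tr" and self.tables:
            self._row, self._is_header_row = [], False
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            self._is_header_row = self._is_header_row or tag == "th"

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            table = self.tables[-1]
            if self._is_header_row and not table["headers"]:
                table["headers"] = self._row
            else:
                table["rows"].append(self._row)
            self._row = None


def extract_tables(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Keeping headers separate from rows makes it straightforward to emit either a Markdown table or a list of dictionaries keyed by column name.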

## Complete Pipeline Example

```python
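import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    LLMExtractionStrategy,
    JsonCssExtractionStrategy,
)

# Example schema; the selectors are placeholders for the target page's markup
product_schema = {
    "name": "ProductList",
    "baseSelector": "div.product",
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
    ],
}


async def main():
    async with AsyncWebCrawler() as crawler:
        # LLM-based extraction
        llm_result = await crawler.arun(
            url="https://example.com/article",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai",
                model="gpt-4",
                instruction="Extract article title, author, and key points"
            )
        )

        # CSS-based extraction
        css_result = await crawler.arun(
            url="https://example.com/products",
            extraction_strategy=JsonCssExtractionStrategy(schema=product_schema)
        )

        print(llm_result.extracted_content)
        print(css_result.extracted_content)


asyncio.run(main())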

Source: [crawl4ai/extraction_strategy.py:200-250](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)

## Strategy Selection Guide

```mermaid
graph TD
    A[Start] --> B{Need semantic understanding?}
    B -->|Yes| C{External API acceptable?}
    B -->|No| D[No-LLM Strategy]
    C -->|Yes| E{Local deployment needed?}
    C -->|No| F[Ollama/Local LLM]
    E -->|Yes| F
    E -->|No| G[OpenAI/Anthropic]
    D --> H{Data has tabular structure?}
    H -->|Yes| I[Add TableExtractionStrategy]
    H -->|No| J[Complete]
    G --> J
    F --> J
    I --> J
```

### Decision Matrix

| Requirement | Recommended Strategy |
|-------------|---------------------|
| Simple CSS extraction | `JsonCssExtractionStrategy` |
| Complex semantic parsing | `LLMExtractionStrategy` |
| High-volume, low-latency | No-LLM strategies |
| Schema-agnostic | LLM-based strategies |
| Tabular data focus | `TableExtractionStrategy` |

Source: [docs/md_v2/extraction/llm-strategies.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/extraction/llm-strategies.md), [docs/md_v2/extraction/no-llm-strategies.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/extraction/no-llm-strategies.md)

## Environment Variables

| Variable | Required For | Description |
|----------|--------------|-------------|
| `OPENAI_API_KEY` | OpenAI LLM | API key for GPT models |
| `ANTHROPIC_API_KEY` | Anthropic LLM | API key for Claude models |
| `AZURE_API_KEY` | Azure OpenAI | Azure OpenAI API key |
| `OLLAMA_BASE_URL` | Local LLM | Base URL for Ollama server |

Source: [crawl4ai/extraction_strategy.py:250-300](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)
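A pipeline can check for the relevant variable before it starts an LLM-based crawl. The following is a minimal sketch using only the standard library; the variable names come from the table above, while the provider labels and the `missing_env` helper are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# Variable names from the table above; the provider keys are illustrative labels.
REQUIRED_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "azure": "AZURE_API_KEY",
    "ollama": "OLLAMA_BASE_URL",
}


def missing_env(provider: str, environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset variable for a provider, or None if it is set."""
    env = os.environ if environ is None else environ
    name = REQUIRED_ENV[provider]
    return None if env.get(name) else name


missing = missing_env("openai")
if missing:
    print(f"Set {missing} before using this provider")
```

Failing early with the variable name is usually easier to debug than an authentication error raised in the middle of a crawl.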

---

<a id='deep_crawling'></a>

## Deep Crawling Strategies

### Related Pages

Related topics: [Extraction Strategies](#extraction_strategies), [Async Web Crawler](#async_crawler)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [crawl4ai/deep_crawling/bfs_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/bfs_strategy.py)
- [crawl4ai/deep_crawling/dfs_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/dfs_strategy.py)
- [crawl4ai/deep_crawling/bff_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/bff_strategy.py)
- [crawl4ai/deep_crawling/base_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/base_strategy.py)
- [crawl4ai/deep_crawling/filters.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/filters.py)
- [crawl4ai/deep_crawling/scorers.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/scorers.py)
- [docs/md_v2/core/deep-crawling.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/core/deep-crawling.md)
</details>

# Deep Crawling Strategies

## Overview

Deep crawling strategies in Crawl4AI let a crawl follow links beyond a single page. Each strategy controls how URLs are discovered, filtered, scored, and prioritized, which keeps the crawl bounded as it scales. The module provides three traversal algorithms (BFS, DFS, and BFF) that share an extensible filtering and scoring system.

Source: [base_strategy.py:1-50](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/base_strategy.py)

## Architecture

```mermaid
graph TD
    A[Seed URLs] --> B[DeepCrawlingStrategy]
    B --> C[URL Filters]
    B --> D[URL Scorers]
    B --> E[Traversal Algorithm]
    C --> F[Valid URLs]
    D --> G[Prioritized URLs]
    E --> H[Crawl Queue]
    G --> H
    H --> I[Crawl4AI Extractor]
    I --> J[Extracted Content]
    J --> K[Links Extracted]
    K --> B
```

The architecture follows a producer-consumer pattern where the strategy continuously discovers URLs from crawled pages and feeds them back into the crawl queue based on prioritization rules.

Source: [base_strategy.py:50-100](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/base_strategy.py)

## Core Components

### Base Strategy

All crawling strategies inherit from `DeepCrawlingStrategy`, which provides the foundational interface and shared functionality:

| Property/Method | Type | Description |
|-----------------|------|-------------|
| `url_scorer` | `URLScorer` | Scores URLs for prioritization |
| `url_filter` | `URLFilter` | Filters URLs for validity |
| `keywords` | `Set[str]` | Keywords for relevance matching |
| `keywords_or` | `Set[str]` | Alternative keyword matching |
| `max_depth` | `int` | Maximum crawl depth |
| `included_domains` | `Set[str]` | Allowed domains |
| `excluded_domains` | `Set[str]` | Blocked domains |
| `crawl_enabled` | `bool` | Enable/disable crawling |
| `check_keywords` | `Callable` | Custom keyword validation |

Source: [base_strategy.py:100-150](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/base_strategy.py)

## Traversal Algorithms

### BFS (Breadth-First Search) Strategy

The BFS strategy explores pages level by level, ensuring comprehensive coverage before going deeper:

```mermaid
graph LR
    A[Level 0: Seed] --> B[Level 1: depth=1]
    B --> C[Level 2: depth=2]
    C --> D[Level 3: depth=3]
    style A fill:#90EE90
    style B fill:#87CEEB
    style C fill:#DDA0DD
    style D fill:#F0E68C
```

**Characteristics:**
- Systematic exploration of shallow depths first
- Ideal for site maps and directory-style sites
- Higher memory usage due to large frontier sets
- Better for finding all accessible pages at shallower depths

Source: [bfs_strategy.py:1-80](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/bfs_strategy.py)

**Configuration Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `max_depth` | `int` | `3` | Maximum crawl depth |
| `max_pages` | `int` | `50` | Maximum pages to crawl |
| `priority` | `int` | `0` | Base priority score |
| `include_external` | `bool` | `False` | Allow external domain crawling |
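To make the level-by-level order concrete, here is a minimal breadth-first traversal over an in-memory link graph. It is a sketch of the general technique, not the code in `bfs_strategy.py`; `bfs_crawl_order` and the `SITE` dictionary are hypothetical, and only the `max_depth` and `max_pages` names are borrowed from the table above.

```python
from collections import deque


def bfs_crawl_order(links, seed, max_depth=3, max_pages=50):
    """Return URLs in breadth-first order over a dict-based link graph."""
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])  # FIFO queue of (url, depth)
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Toy site: each key links to the pages in its list.
SITE = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"], "/a/1": ["/a/1/x"]}
print(bfs_crawl_order(SITE, "/", max_depth=2))
```

The queue holds an entire level at once, which is where the higher memory usage noted above comes from.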

### DFS (Depth-First Search) Strategy

The DFS strategy explores as deep as possible before backtracking:

```mermaid
graph TD
    A[Start] --> B[Depth 1]
    B --> C[Depth 2]
    C --> D[Depth 3]
    D --> E[Backtrack]
    E --> F[Next Branch]
    F --> G[Continue Deep]
    style A fill:#90EE90
    style D fill:#FF6B6B
```

**Characteristics:**
- Deep exploration of specific paths first
- Lower memory footprint
- Suitable for following navigation chains
- Risk of getting stuck in deep site sections

Source: [dfs_strategy.py:1-80](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/dfs_strategy.py)

**Configuration Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `max_depth` | `int` | `10` | Maximum crawl depth |
| `max_pages` | `int` | `100` | Maximum pages to crawl |
| `priority` | `int` | `0` | Base priority score |
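The same toy graph shows how depth-first order differs. This is again a generic sketch rather than the code in `dfs_strategy.py`; `dfs_crawl_order` and `SITE` are hypothetical names, and the stack-based approach is one common way to implement the traversal.

```python
def dfs_crawl_order(links, seed, max_depth=10, max_pages=100):
    """Return URLs in depth-first order over a dict-based link graph."""
    visited = set()
    order = []
    stack = [(seed, 0)]  # LIFO stack of (url, depth)
    while stack and len(order) < max_pages:
        url, depth = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        if depth < max_depth:
            # Push in reverse so the first link on a page is explored first.
            for nxt in reversed(links.get(url, [])):
                if nxt not in visited:
                    stack.append((nxt, depth + 1))
    return order


# Toy site: each key links to the pages in its list.
SITE = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"], "/a/1": ["/a/1/x"]}
print(dfs_crawl_order(SITE, "/"))
```

The whole `/a` branch is exhausted before `/b` is visited, so a tight `max_depth` is the main guard against getting stuck in one deep section.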

### BFF (Best-First with Filters) Strategy

The BFF strategy combines filtering with score-based prioritization, crawling the most relevant pages first:

```mermaid
graph TD
    A[URL Discovered] --> B{URL Filter}
    B -->|Pass| C{Score URL}
    B -->|Fail| X[Skip]
    C --> D[Priority Queue]
    D --> E[Crawl Next Best]
    E --> F[Extract Links]
    F --> A
```

**Characteristics:**
- Relevance-based crawling using keyword matching
- Configurable scoring functions
- Filters out irrelevant content early
- Most efficient for targeted data extraction

Source: [bff_strategy.py:1-100](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/bff_strategy.py)
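The filter-then-score loop in the diagram can be sketched with a priority queue. The example below assumes a dict-based link graph and plain callables for scoring and filtering; `best_first_order`, `score`, and `keep` are hypothetical names, and the code illustrates best-first search in general rather than the implementation in `bff_strategy.py`.

```python
import heapq


def best_first_order(links, seed, score, keep=lambda url: True, max_pages=100):
    """Crawl the highest-scoring discovered URL next, skipping filtered URLs."""
    counter = 0  # insertion counter breaks ties between equal scores
    heap = [(-score(seed), counter, seed)]  # negate: heapq is a min-heap
    seen = {seed}
    order = []
    while heap and len(order) < max_pages:
        _, _, url = heapq.heappop(heap)
        order.append(url)
        for nxt in links.get(url, []):
            if nxt in seen or not keep(nxt):
                continue  # filter before scoring to avoid wasted work
            seen.add(nxt)
            counter += 1
            heapq.heappush(heap, (-score(nxt), counter, nxt))
    return order


SITE = {"/": ["/blog", "/docs", "/logo.png"], "/docs": ["/docs/api"], "/blog": []}


def score(url):
    # One point for each keyword found in the URL
    return float("docs" in url) + float("api" in url)


def keep(url):
    return not url.endswith(".png")


print(best_first_order(SITE, "/", score, keep))
```

Here `/docs` and `/docs/api` are crawled before `/blog` even though `/blog` was discovered first, and `/logo.png` never enters the queue.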

## Filtering System

The filtering system determines which URLs are eligible for crawling:

### FilterChain

Multiple filters can be chained together for comprehensive URL validation:

```python
from crawl4ai.deep_crawling.filters import ContentTypeFilter, DomainFilter, FilterChain

filter_chain = FilterChain([
    # Stay on the target site
    DomainFilter(allowed_domains=["example.com"]),
    # Keep HTML pages only, which skips downloads such as .pdf and .zip
    ContentTypeFilter(allowed_types=["text/html"]),
])
```

### Available Filters

| Filter | Purpose | Key Parameters |
|--------|---------|----------------|
| `DomainFilter` | Restrict crawling to allowed domains | `allowed_domains`, `blocked_domains` |
| `ContentTypeFilter` | Allow only selected content types | `allowed_types` |
| robots.txt filter | Respect robots directives | `user_agent` |
| `URLPatternFilter` | Custom pattern matching | `patterns` |

Source: [filters.py:1-120](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/filters.py)
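Conceptually, a filter chain accepts a URL only if every filter accepts it. The standard-library sketch below illustrates that rule with plain predicate functions; `same_domain`, `not_extension`, and `passes_all` are hypothetical helpers and do not reproduce the classes in `filters.py`.

```python
from urllib.parse import urlparse


def same_domain(allowed):
    """Build a predicate that accepts URLs on the allowed domains or their subdomains."""
    allowed = {d.lower() for d in allowed}

    def check(url):
        host = (urlparse(url).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in allowed)

    return check


def not_extension(blocked):
    """Build a predicate that rejects URLs whose path ends with a blocked extension."""
    blocked = tuple(e.lower() for e in blocked)

    def check(url):
        return not urlparse(url).path.lower().endswith(blocked)

    return check


def passes_all(url, checks):
    # all() short-circuits, so later checks are skipped after the first rejection
    return all(check(url) for check in checks)


checks = [same_domain(["example.com"]), not_extension([".pdf", ".zip"])]
print(passes_all("https://docs.example.com/guide", checks))
```

Ordering the cheapest checks first keeps the chain fast, because rejected URLs never reach the later filters.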

## Scoring System

URL scoring determines crawl priority within the queue:

### ScorerChain

Scorers can be combined in a chain for multi-factor evaluation:

```python
from crawl4ai.deep_crawling.scorers import ScorerChain, KeywordRelevanceScorer, DepthScorer

scorer = ScorerChain([
    KeywordRelevanceScorer(keywords=["api", "docs"]),
    DepthScorer(max_depth=5, decay=0.5),
])
```

### Available Scorers

| Scorer | Function | Parameters |
|--------|----------|------------|
| `KeywordRelevanceScorer` | Match keywords in URL/text | `keywords`, `weight` |
| `DepthScorer` | Penalize deep pages | `max_depth`, `decay` |
| `FreshnessScorer` | Prefer recent content | `date_field`, `decay` |
| `CustomScorer` | User-defined scoring | `scoring_fn` |

Source: [scorers.py:1-150](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/scorers.py)
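One way to picture multi-factor scoring is a weighted sum of a keyword score and a depth penalty. The formulas below are assumptions chosen for illustration (keyword hit ratio, and `decay ** depth` as the depth penalty); the functions are hypothetical and are not taken from `scorers.py`.

```python
def keyword_score(url, keywords):
    """Fraction of the keywords that occur in the URL, case-insensitively."""
    lowered = url.lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)


def depth_score(depth, decay=0.5):
    """Depth penalty: 1.0 at the seed, multiplied by `decay` per level."""
    return decay ** depth


def combined_score(url, depth, keywords, weights=(0.7, 0.3)):
    """Weighted sum of keyword relevance and depth penalty."""
    kw_weight, depth_weight = weights
    return kw_weight * keyword_score(url, keywords) + depth_weight * depth_score(depth)


shallow = combined_score("https://example.com/docs/api", 1, ["api", "docs"])
deep = combined_score("https://example.com/a/b/c/docs/api", 4, ["api", "docs"])
print(shallow > deep)
```

With equal keyword matches, the shallower URL scores higher, so it would be crawled first under a best-first strategy.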

## Usage Examples

### Basic BFS Crawling

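```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def main():
    # The deep-crawl strategy is attached to the run configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=3, max_pages=50)
    )
    async with AsyncWebCrawler() as crawler:
        # With a deep-crawl strategy, arun() returns one result per crawled page
        results = await crawler.arun(url="https://example.com", config=config)
        for result in results:
            print(result.url)

asyncio.run(main())
```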

### Targeted Crawling with BFF

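```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import DomainFilter, FilterChain
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

strategy = BestFirstCrawlingStrategy(
    max_depth=5,
    max_pages=100,
    # Filters run first, so off-site URLs are never scored
    filter_chain=FilterChain([DomainFilter(allowed_domains=["example.com"])]),
    # Higher-scoring URLs are crawled first
    url_scorer=KeywordRelevanceScorer(keywords=["documentation", "api", "guide"]),
)

# Pass this config to crawler.arun(url=..., config=config)
config = CrawlerRunConfig(deep_crawl_strategy=strategy)
```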

Source: [docs/md_v2/core/deep-crawling.md:1-100](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/core/deep-crawling.md)

## Workflow States

```mermaid
stateDiagram-v2
    [*] --> Initializing: Start crawl
    Initializing --> Crawling: Load seed URLs
    Crawling --> Processing: Fetch page
    Processing --> Filtering: Extract links
    Filtering --> Scoring: Validate URLs
    Scoring --> Queuing: Rank by priority
    Queuing --> Crawling: Next URL
    Crawling --> [*]: Queue empty or max reached
```

## Strategy Selection Guide

| Use Case | Recommended Strategy | Reason |
|----------|---------------------|--------|
| Site mapping | BFS | Comprehensive shallow coverage |
| Documentation sites | BFS | Find all pages systematically |
| Article/navigation chains | DFS | Follow deep links naturally |
| Targeted data extraction | BFF | Prioritize relevant pages |
| API documentation | BFF | Filter by keyword relevance |
| Limited resources | DFS | Lower memory footprint |

## Configuration Reference

### Common Parameters

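```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class DeepCrawlConfig:
    # Scope
    included_domains: Optional[Set[str]] = None
    excluded_domains: Optional[Set[str]] = None

    # Limits
    max_depth: int = 3
    max_pages: int = 100
    max_total_pages: int = 1000

    # Filtering
    allow_external: bool = False
    check_keywords: bool = True

    # Scoring and filter chains (forward references, resolved by the caller)
    scoring: Optional["ScorerChain"] = None
    filter: Optional["FilterChain"] = None
```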

### Keyword Matching

```python
# AND matching: every keyword in `keywords` must match
keywords = {"documentation", "api"}

# OR matching: at least one keyword in `keywords_or` must match
keywords_or = {"guide", "tutorial", "docs"}
```

Source: [base_strategy.py:150-200](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/base_strategy.py)
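The difference between AND and OR matching can be shown with a few lines of plain Python. The sketch assumes that every entry in `keywords` must appear and that at least one entry in `keywords_or` must appear when that set is non-empty; `matches_keywords` is a hypothetical helper, not a function from `base_strategy.py`.

```python
def matches_keywords(text, keywords=frozenset(), keywords_or=frozenset()):
    """Return True if text contains all of `keywords` and any of `keywords_or`."""
    haystack = text.lower()
    all_ok = all(k.lower() in haystack for k in keywords)
    # An empty OR set places no constraint on the text.
    any_ok = not keywords_or or any(k.lower() in haystack for k in keywords_or)
    return all_ok and any_ok


print(matches_keywords("API documentation", keywords={"documentation", "api"}))
print(matches_keywords("Python tutorial", keywords_or={"guide", "tutorial", "docs"}))
```

Combining both sets narrows the match: the text must mention every required keyword and at least one of the alternatives.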

## Advanced Features

### Custom Filters

```python
from crawl4ai.deep_crawling.filters import URLFilter


class CustomFilter(URLFilter):
    def apply(self, url: str) -> bool:
        # Keep product URLs, but skip direct links to JPEG images
        return "product" in url and not url.endswith(".jpg")
```

### Custom Scorers

```python
from crawl4ai.deep_crawling.scorers import URLScorer

class PriorityScorer(URLScorer):
    def score(self, url: str, context: dict) -> float:
        base_score = 1.0
        if "important" in url:
            base_score *= 2.0
        return base_score
```

Sources: [filters.py:150-200](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/filters.py), [scorers.py:150-200](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/deep_crawling/scorers.py)

## Best Practices

1. **Set reasonable limits**: Always configure `max_pages` and `max_depth` to prevent runaway crawling
2. **Use BFF for targeted extraction**: When you know what content you need, BFF reduces noise
3. **Filter early, score late**: Apply filters before scoring to reduce unnecessary processing
4. **Respect robots.txt**: Configure filters to respect site crawling directives
5. **Monitor memory usage**: BFS uses more memory; switch to DFS for resource-constrained environments
6. **Combine keyword strategies**: Use both `keywords` (AND) and `keywords_or` (OR) for flexible matching

## See Also

- [AsyncWebCrawler API Reference](../api/async-web-crawler.md)
- [Content Extraction](../guides/content-extraction.md)
- [Browser Configuration](../guides/browser-config.md)

---

<a id='anti_bot_detection'></a>

## Anti-Bot Detection and Proxy Management

### Related Pages

Related topics: [Browser Management](#browser_management)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [crawl4ai/antibot_detector.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/antibot_detector.py)
- [crawl4ai/proxy_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/proxy_strategy.py)
- [crawl4ai/async_configs.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_configs.py)
- [docs/md_v2/advanced/anti-bot-and-fallback.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/advanced/anti-bot-and-fallback.md)
- [docs/md_v2/advanced/proxy-security.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/advanced/proxy-security.md)
- [sbom/README.md](https://github.com/unclecode/crawl4ai/blob/main/sbom/README.md)
</details>

# Anti-Bot Detection and Proxy Management

## Overview

Crawl4AI combines anti-bot detection with proxy management to keep crawls reliable. The detector recognizes when a site's bot protection has been triggered, and proxy rotation spreads requests across multiple IP addresses so that a retry does not reuse a blocked address.

## Architecture Overview

```mermaid
graph TD
    A[Client Request] --> B[AntiBotDetector]
    B --> C{Bot Detection?}
    C -->|Yes| D[Apply Evasion Strategy]
    C -->|No| E[Direct Request]
    D --> F[ProxySelector]
    F --> G[Rotating Proxies]
    G --> H[Target Website]
    H --> I{Response Valid?}
    I -->|No| J[Fallback Mechanism]
    J --> B
    I -->|Yes| K[Return Content]
```

## Anti-Bot Detection System

### Purpose and Scope

The anti-bot detection module (`antibot_detector.py`) analyzes responses from target websites to determine if bot protection mechanisms have been triggered. When detection occurs, the system can automatically apply evasion strategies or fall back to alternative methods.

Source: [crawl4ai/antibot_detector.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/antibot_detector.py)

### Detection Strategies

| Strategy | Description | Use Case |
|----------|-------------|----------|
| Header Analysis | Examines HTTP headers for bot detection signals | Standard bot checks |
| Content Analysis | Scans response content for CAPTCHAs or blocking messages | Challenge pages |
| Status Code Monitoring | Tracks HTTP status codes indicating blocks | 403, 429 responses |
| JavaScript Challenge Detection | Identifies JS-based bot challenges | Cloudflare, PerimeterX |
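The status-code, header, and content checks in the table can be combined into a simple heuristic. The sketch below is an assumption about how such checks might look, not the logic in `antibot_detector.py`; `looks_blocked`, the status set, and the text patterns are all illustrative choices.

```python
import re

# Status codes named in the table above
BLOCK_STATUS = {403, 429}

# Illustrative phrases that often appear on challenge or block pages
CHALLENGE_PATTERN = re.compile(
    r"captcha|verify you are human|access denied", re.IGNORECASE
)


def looks_blocked(status_code, headers, body):
    """Return a list of reasons the response looks like a bot block (empty if none)."""
    reasons = []
    if status_code in BLOCK_STATUS:
        reasons.append(f"status {status_code}")
    if "retry-after" in {name.lower() for name in headers}:
        reasons.append("Retry-After header")
    if CHALLENGE_PATTERN.search(body):
        reasons.append("challenge text in body")
    return reasons


print(looks_blocked(429, {"Retry-After": "30"}, ""))
```

Returning the reasons rather than a bare boolean lets the caller pick a matching fallback, for example backing off on a rate limit but rotating the proxy on a 403.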

### Key Components

The anti-bot detector integrates with the async configuration system to provide seamless fallback handling:

```python
from dataclasses import dataclass


# Illustrative sketch of the settings involved, not the literal class in async_configs.py
@dataclass
class AntiBotConfig:
    enabled: bool = True
    detection_threshold: float = 0.7
    auto_fallback: bool = True
    max_retries: int = 3
```

Source: [crawl4ai/async_configs.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_configs.py)

## Proxy Management System

### Purpose and Scope

The proxy strategy module (`proxy_strategy.py`) manages proxy rotation, selection, and health checking to maintain request anonymity and distribute load across multiple IP addresses.

Source: [crawl4ai/proxy_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/proxy_strategy.py)

### Proxy Rotation Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| Round Robin | Sequential proxy selection | Even distribution |
| Random | Random proxy selection | Avoiding pattern detection |
| Weighted | Prioritize faster/reliable proxies | Performance optimization |
| Geographic | Match proxy location to target | Region-specific content |
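The first three rotation strategies map directly onto standard-library tools. The sketch below is generic Python rather than the code in `proxy_strategy.py`; the helper names and the `PROXIES` list are hypothetical, and geographic selection is omitted because it needs location data for each proxy.

```python
import itertools
import random

PROXIES = ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"]


def round_robin(proxies):
    """Cycle through the proxies in order, forever."""
    return itertools.cycle(proxies)


def pick_random(proxies, rng=random):
    """Pick any proxy with equal probability."""
    return rng.choice(proxies)


def pick_weighted(proxies, weights, rng=random):
    """Pick a proxy with probability proportional to its weight."""
    return rng.choices(proxies, weights=weights, k=1)[0]


rotation = round_robin(PROXIES)
print(next(rotation), next(rotation), next(rotation))
```

For the weighted strategy, the weights could be derived from measured latency or success rate so that faster, more reliable proxies are chosen more often.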

### Configuration Parameters

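```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProxyConfig:
    # default_factory avoids sharing one list across instances
    proxies: List[str] = field(default_factory=list)  # list of proxy URLs
    rotation_strategy: str = "round_robin"
    health_check_interval: int = 300  # seconds
    timeout: int = 30                 # proxy timeout in seconds
    retry_on_failure: bool = True
```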

### Proxy Health Monitoring

The system continuously monitors proxy health through periodic health checks, removing failed proxies from the active pool and re-evaluating them after a cooldown period.
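The remove-then-retry behaviour described above can be modelled as a pool that records when each failed proxy becomes eligible again. This is a minimal sketch under that assumption; `ProxyPool` and its methods are hypothetical and are not the health-check code in `proxy_strategy.py`.

```python
import time


class ProxyPool:
    """Track failed proxies and return them to the pool after a cooldown."""

    def __init__(self, proxies, cooldown=300.0, clock=time.monotonic):
        self._clock = clock
        self._cooldown = cooldown
        # Time at which each proxy may be used again; 0.0 means usable now.
        self._retry_at = {proxy: 0.0 for proxy in proxies}

    def mark_failed(self, proxy):
        """Remove a proxy from the active pool until the cooldown has passed."""
        self._retry_at[proxy] = self._clock() + self._cooldown

    def mark_healthy(self, proxy):
        """Return a proxy to the active pool immediately."""
        self._retry_at[proxy] = 0.0

    def active(self):
        """List the proxies that are currently usable."""
        now = self._clock()
        return [p for p, retry_at in self._retry_at.items() if retry_at <= now]


pool = ProxyPool(["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"])
pool.mark_failed("http://proxy1.example.com:8080")
print(pool.active())
```

Injecting the clock keeps the pool testable without waiting for real time to pass.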

## Integration with Async Configuration

### Fallback Mechanisms

When anti-bot detection triggers, the system can automatically switch to fallback modes:

1. **Proxy Fallback**: Rotate to a different proxy server
2. **Strategy Fallback**: Switch to alternative crawling strategies
3. **User-Agent Fallback**: Use different browser fingerprints

Source: [docs/md_v2/advanced/anti-bot-and-fallback.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/advanced/anti-bot-and-fallback.md)

### Configuration Example

```yaml
anti_bot:
  enabled: true
  detection_sensitivity: "medium"
  auto_fallback: true
  
proxy:
  enabled: true
  strategy: "weighted"
  proxies:
    - "http://proxy1.example.com:8080"
    - "http://proxy2.example.com:8080"
  health_check:
    enabled: true
    interval: 300
```

## Security Considerations

### Proxy Security

When configuring proxies, consider the following security aspects:

- **Proxy Protocol**: Use HTTPS proxies to encrypt traffic
- **Authentication**: Implement proxy authentication where supported
- **Provider Reputation**: Use trusted proxy providers
- **IP Rotation**: Avoid predictable IP patterns

Source: [docs/md_v2/advanced/proxy-security.md](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/advanced/proxy-security.md)

### Best Practices

| Practice | Description |
|----------|-------------|
| Rate Limiting | Respect target site limits to avoid IP bans |
| Request Delays | Implement delays between requests |
| Header Randomization | Vary User-Agent and other headers |
| Cookie Management | Handle cookies appropriately per session |
| SSL Verification | Validate SSL certificates for security |

## Workflow Diagram

```mermaid
sequenceDiagram
    participant Client
    participant AntiBot as AntiBot Detector
    participant ProxyMgr as Proxy Manager
    participant Target as Target Website
    
    Client->>AntiBot: Send Request
    AntiBot->>ProxyMgr: Request Proxy
    ProxyMgr->>AntiBot: Return Proxy
    AntiBot->>Target: Forward Request via Proxy
    
    alt Bot Detected
        Target-->>AntiBot: Bot Challenge Response
        AntiBot->>ProxyMgr: Request Different Proxy
        ProxyMgr-->>AntiBot: New Proxy
        AntiBot->>Target: Retry with New Proxy
    else Success
        Target-->>Client: Return Content
    end
```

## Error Handling

### Common Error Scenarios

| Error | Cause | Resolution |
|-------|-------|------------|
| `403 Forbidden` | IP blocked | Rotate proxy |
| `429 Too Many Requests` | Rate limited | Backoff and retry |
| `CAPTCHA Required` | Bot detected | Switch strategy |
| `Proxy Timeout` | Proxy unavailable | Health check and replace |

### Retry Logic

The system implements exponential backoff for retries:

```python
retry_config = {
    "max_attempts": 3,
    "base_delay": 1.0,      # seconds
    "max_delay": 60.0,       # seconds
    "exponential_base": 2
}
```
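With those settings, the delay after each failed attempt follows `base_delay * exponential_base ** n`, capped at `max_delay`. The helper below is a small sketch of that formula; `backoff_delays` is a hypothetical name, and production code often adds random jitter, which is left out here to keep the arithmetic visible.

```python
def backoff_delays(max_attempts=3, base_delay=1.0, max_delay=60.0, exponential_base=2):
    """Return the delay in seconds after each failed attempt, capped at max_delay."""
    return [
        min(max_delay, base_delay * exponential_base ** attempt)
        for attempt in range(max_attempts)
    ]


print(backoff_delays())                 # delays for the default three attempts
print(backoff_delays(max_attempts=8))   # later delays hit the 60-second cap
```

The cap matters for long retry sequences: without it, the eighth delay would be 128 seconds rather than 60.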

## Summary

Crawl4AI's anti-bot detection and proxy management work together to keep crawls running against protected websites. Because `AntiBotDetector` and `ProxyStrategy` are integrated, a detected block can trigger an automatic fallback, such as rotating to another proxy, instead of failing the crawl.

---

## Doramagic Pitfall Log

Project: unclecode/crawl4ai

Summary: 21 potential pitfalls were found, 5 of them high/blocking. Top priority: Installation pitfall - Source evidence: [Bug]: arun() and arun_many() type hinting needs fixing.

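## 1. Installation pitfall · Source evidence: [Bug]: arun() and arun_many() type hinting needs fixing

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: [Bug]: arun() and arun_many() type hinting needs fixing
- User impact: May raise the cost of trial use and production adoption for new users.
- Recommended check: The source issue is still open; the Pack Agent needs to re-verify whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_d3b6cfd3700147f690e0e65875f15424 | https://github.com/unclecode/crawl4ai/issues/1898 | The source discussion mentions Python-related conditions; re-verify before installation or trial.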

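## 2. Configuration pitfall · Source evidence: [Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked as failed), no error messages or failure reason…

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified configuration-related issue in this project: [Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked as failed), no error messages or failure reason is shown
- User impact: May raise the cost of trial use and production adoption for new users.
- Recommended check: The source issue is still open; the Pack Agent needs to re-verify whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_ad61b108bf894cc286ca7966e8c86758 | https://github.com/unclecode/crawl4ai/issues/1949 | The source discussion mentions Python-related conditions; re-verify before installation or trial.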

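## 3. Configuration pitfall · Source evidence: [Bug]: MCP scrape tools lack wait_until / SPA support that REST API and CLI provide

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified configuration-related issue in this project: [Bug]: MCP scrape tools lack wait_until / SPA support that REST API and CLI provide
- User impact: May raise the cost of trial use and production adoption for new users.
- Recommended check: The source issue is still open; the Pack Agent needs to re-verify whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_1ee99f5d72f143f4b064732cc19e0c85 | https://github.com/unclecode/crawl4ai/issues/1963 | The source discussion mentions Python-related conditions; re-verify before installation or trial.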

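## 4. Configuration pitfall · Source evidence: [Bug]: `remove_empty_elements_fast()` drops trailing text when removing empty elements with non-empty .tail

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified configuration-related issue in this project: [Bug]: `remove_empty_elements_fast()` drops trailing text when removing empty elements with non-empty .tail
- User impact: May raise the cost of trial use and production adoption for new users.
- Recommended check: The source issue is still open; the Pack Agent needs to re-verify whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_d7fa967632a948008efbc182d1f2c96b | https://github.com/unclecode/crawl4ai/issues/1938 | The source discussion mentions Python-related conditions; re-verify before installation or trial.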

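## 5. Security/permissions pitfall · Source evidence: [Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-3x token overhead for CJK content

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permissions-related issue in this project: [Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-3x token overhead for CJK content
- User impact: May affect authorization, key configuration, or security boundaries.
- Recommended check: The source issue is still open; the Pack Agent needs to re-verify whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_2e9fbf659fbb40aba437886a87f8e2d7 | https://github.com/unclecode/crawl4ai/issues/1962 | The source discussion mentions Python-related conditions; re-verify before installation or trial.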

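## 6. Installation pitfall · Source evidence: [Bug] AsyncLogger writes to stdout, breaking MCP stdio transport

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: [Bug] AsyncLogger writes to stdout, breaking MCP stdio transport
- User impact: May affect upgrades, migration, or version selection.
- Recommended check: The source issue is still open; the Pack Agent needs to re-verify whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_af29278fd7294d4a8f0f6f37ab987b5c | https://github.com/unclecode/crawl4ai/issues/1968 | The source discussion mentions Python-related conditions; re-verify before installation or trial.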

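## 7. Installation pitfall · Source evidence: [Bug]: The install with pip on just about any system rarely works. It requires an env or it only partial installs

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: [Bug]: The install with pip on just about any system rarely works. It requires an env or it only partial installs
- User impact: May raise the cost of trial use and production adoption for new users.
- Recommended check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_97d44cedb21a4908a7743fde11209954 | https://github.com/unclecode/crawl4ai/issues/1950 | The source discussion mentions Python-related conditions; re-verify before installation or trial.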

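## 8. Installation pitfall · Source evidence: [Bug]: enable_stealth=True is a silent no-op — StealthAdapter imports symbols that don't exist in playwright-stealth 2.x

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: [Bug]: enable_stealth=True is a silent no-op — StealthAdapter imports symbols that don't exist in playwright-stealth 2.x
- User impact: May raise the cost of trial use and production adoption for new users.
- Recommended check: The source issue is still open; the Pack Agent needs to re-verify whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_ae45861377894b99a57d6bbdc06af313 | https://github.com/unclecode/crawl4ai/issues/1959 | The source discussion mentions Python-related conditions; re-verify before installation or trial.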

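## 9. Installation pitfall · Source evidence: v0.7.1:Update

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: v0.7.1:Update
- User impact: May affect upgrades, migration, or version selection.
- Recommended check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_a6ae9133fff54443b712725f51769fa1 | https://github.com/unclecode/crawl4ai/releases/tag/v0.7.1 | The source discussion mentions Python-related conditions; re-verify before installation or trial.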

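## 10. Installation pitfall · Source evidence: v0.7.2: CI/CD & Dependency Optimization Update

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: v0.7.2: CI/CD & Dependency Optimization Update
- User impact: May affect upgrades, migration, or version selection.
- Recommended check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_14954e0431ca426ebeaa4bb31778d4af | https://github.com/unclecode/crawl4ai/releases/tag/v0.7.2 | The source discussion mentions Docker-related conditions; re-verify before installation or trial.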

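## 11. Configuration pitfall · Source evidence: [Bug]: Markdown export loses heading hierarchy and table structure

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified configuration-related issue in this project: [Bug]: Markdown export loses heading hierarchy and table structure
- User impact: May raise the cost of trial use and production adoption for new users.
- Recommended check: The source issue is still open; the Pack Agent needs to re-verify whether it still affects the current version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_c3eac8ab81e34bf3b6cc050f7f8e9826 | https://github.com/unclecode/crawl4ai/issues/1964 | The source discussion mentions Python-related conditions; re-verify before installation or trial.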

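## 12. Capability pitfall · Capability assessment depends on an assumption

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: If the assumption does not hold, users will not get the promised capability.
- Recommended check: Turn the assumption into a downstream verification checklist.
- Safeguard: Assumptions must be turned into verification items; they must not be stated as fact before verification results exist.
- Evidence: capability.assumptions | github_repo:798201435 | https://github.com/unclecode/crawl4ai | README/documentation is current enough for a first validation pass.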

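## 13. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: last_activity_observed was not recorded.
- User impact: New, abandoned, and active projects become indistinguishable, which lowers trust in the recommendation.
- Recommended check: Add recent GitHub commit, release, and issue/PR response signals.
- Safeguard: While maintenance activity is unknown, the recommendation must not be marked as high-trust.
- Evidence: evidence.maintainer_signals | github_repo:798201435 | https://github.com/unclecode/crawl4ai | last_activity_observed missing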

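## 14. Security/permissions pitfall · Downstream validation found risk items

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: Downstream has already requested a review; this must not be played down on the page.
- Recommended check: Add to the security/permissions governance review queue.
- Safeguard: While downstream risk exists, keep the review/recommendation downgraded.
- Evidence: downstream_validation.risk_items | github_repo:798201435 | https://github.com/unclecode/crawl4ai | no_demo; severity=medium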

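## 15. Security/permissions pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: The risk affects whether the project is suitable for ordinary users to install.
- Recommended check: Record the risk in the boundary card and confirm whether manual review is needed.
- Safeguard: Scoring risks must be recorded in the boundary card, not kept only as an internal score.
- Evidence: risks.scoring_risks | github_repo:798201435 | https://github.com/unclecode/crawl4ai | no_demo; severity=medium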

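## 16. Security/permissions pitfall · Source evidence: Release v0.7.3

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permissions-related issue in this project: Release v0.7.3
- User impact: May block installation or the first run.
- Recommended check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_e2b75670cbcc4814a86423818b9f6f48 | https://github.com/unclecode/crawl4ai/releases/tag/v0.7.3 | The source discussion mentions Python-related conditions; re-verify before installation or trial.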

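## 17. Security/permissions pitfall · Source evidence: Release v0.7.5

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permissions-related issue in this project: Release v0.7.5
- User impact: May affect upgrades, migration, or version selection.
- Recommended check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_056d1470d7534cacb39eeb894e054496 | https://github.com/unclecode/crawl4ai/releases/tag/v0.7.5 | The source discussion mentions Python-related conditions; re-verify before installation or trial.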

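## 18. Security/permissions pitfall · Source evidence: Release v0.7.7

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permissions-related issue in this project: Release v0.7.7
- User impact: May affect authorization, key configuration, or security boundaries.
- Recommended check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_e157445f88744795b5c6234783eca692 | https://github.com/unclecode/crawl4ai/releases/tag/v0.7.7 | The source discussion mentions Docker-related conditions; re-verify before installation or trial.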

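## 19. Security/permissions pitfall · Source evidence: [Bug]: Markdown text extraction drops text when element contains empty elements

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permissions-related issue in this project: [Bug]: Markdown text extraction drops text when element contains empty elements
- User impact: May affect authorization, key configuration, or security boundaries.
- Recommended check: The source suggests that a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Safeguard: Do not escalate this into a definitive conclusion detached from the source link; state the applicable version and review status.
- Evidence: community_evidence:github | cevd_dffa926853d147ebb487a03fdfd1818e | https://github.com/unclecode/crawl4ai/issues/1966 | The source discussion mentions Python-related conditions; re-verify before installation or trial.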

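## 20. Maintenance pitfall · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: Users cannot tell whether anyone maintains the project when they run into problems.
- Recommended check: Sample recent issues/PRs to see whether they go unhandled for long periods.
- Safeguard: While issue/PR responsiveness is unknown, the maintenance risk must be flagged.
- Evidence: evidence.maintainer_signals | github_repo:798201435 | https://github.com/unclecode/crawl4ai | issue_or_pr_quality=unknown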

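## 21. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: Installation commands and documentation may lag behind the code, which makes pitfalls more likely.
- Recommended check: Confirm that the latest release/tag matches the installation commands in the README.
- Safeguard: When the release cadence is unknown or stale, the installation instructions must be flagged as possibly out of date.
- Evidence: evidence.maintainer_signals | github_repo:798201435 | https://github.com/unclecode/crawl4ai | release_recency=unknown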

<!-- canonical_name: unclecode/crawl4ai; human_manual_source: deepwiki_human_wiki -->