Doramagic Project Pack · Human Manual
crawl4ai
Related topics: Installation Guide, Quick Start Guide
Introduction to Crawl4AI
Overview
Crawl4AI is an open-source AI-powered web crawling framework designed to extract structured data from web pages and deliver clean, LLM-ready output. It serves as a modern alternative to traditional web scraping tools by combining intelligent crawling capabilities with AI-driven content extraction and formatting.
The project emphasizes ease of use, providing both programmatic APIs and command-line interfaces for rapid integration into data pipelines, research workflows, and AI applications.
Purpose and Scope
Crawl4AI addresses the fundamental challenge of extracting meaningful data from unstructured web content. While conventional web crawlers focus on fetching page content, Crawl4AI goes further by:
- Semantic Understanding: Analyzing page content to identify and extract relevant information based on context rather than rigid selectors
- Structured Output: Delivering data in formats optimized for large language model consumption, including Markdown, JSON, and structured extractions
- Performance Optimization: Enabling high-throughput crawling with configurable browser automation and connection pooling
- Flexibility: Supporting various output strategies including simple crawling, chunked extraction, and memory-aware processing
The scope encompasses web crawling, content extraction, link navigation, media handling, and output formatting, providing an end-to-end solution from URL input to structured data output.
Core Architecture
Crawl4AI follows a modular architecture composed of distinct processing stages:
graph TD
A[URL Input] --> B[Crawl Strategy]
B --> C[Browser Automation]
C --> D[Content Extraction]
D --> E[AI Processing]
E --> F[Output Formatter]
F --> G[Structured Output]
C -->|JS Rendering| H[JavaScript Executor]
D -->|Media| I[Media Handler]
E -->|Memory| J[Memory Manager]
Processing Pipeline
| Stage | Component | Description |
|---|---|---|
| Input | URL Parser | Validates and normalizes target URLs |
| Crawl | Strategy Engine | Selects crawling approach based on configuration |
| Render | Browser Pool | Manages headless browser instances |
| Extract | AI Extractor | Uses ML models to identify relevant content |
| Format | Output Serializer | Converts to target format (JSON/Markdown/HTML) |
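The Input stage in the table above validates and normalizes target URLs. A minimal sketch of that idea using only the standard library follows; `normalize_url` is a hypothetical helper written for illustration and is not taken from the crawl4ai source.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: str) -> str:
    """Validate and normalize a URL (hypothetical helper, not crawl4ai's parser)."""
    parts = urlsplit(raw.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"unsupported or malformed URL: {raw!r}")
    # Scheme and host are case-insensitive; an empty path is equivalent to "/".
    # Assumes the netloc carries no user:password part, which must keep its case.
    path = parts.path or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))
```

Normalizing before crawling means that two spellings of the same address can share one cache entry.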
Key Features
1. AI-Powered Extraction
Crawl4AI leverages machine learning models to understand page content semantically. Rather than relying solely on CSS selectors or XPath expressions, the extractor can identify:
- Main article content and metadata
- Structured data elements (tables, lists, forms)
- Semantic sections and their relationships
- Relevant versus boilerplate content
2. Multiple Output Strategies
The framework supports various extraction strategies optimized for different use cases:
| Strategy | Use Case | Output |
|---|---|---|
| default | General purpose | Clean Markdown with metadata |
| cosine | Semantic clustering | Grouped content chunks |
| no-cache | Fresh data | Bypass internal caching |
| passive | Low resource | Minimal processing |
| brainless | Simple fetch | Raw HTML without AI processing |
3. Browser Automation
Integrated headless browser support enables:
- JavaScript rendering for single-page applications
- Cookie and session management
- Custom headers and authentication
- Screenshot capture
- PDF generation
4. Media Handling
Crawl4AI processes various media types during extraction:
- Images: Download, compress, and embed with alt-text preservation
- Videos: Extract metadata and embed URLs
- Audio: Handle media references for podcasts and audio content
- Documents: Process embedded PDFs and downloadable files
Installation and Setup
Prerequisites
- Python 3.9 or higher
- Chrome/Chromium browser (for browser automation features)
- pip or poetry package manager
Basic Installation
pip install crawl4ai
Development Installation
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e ".[dev]"
Verify Installation
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # Markdown rendering of the page
asyncio.run(main())
Basic Usage Patterns
Simple Crawl
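import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # metadata is a dict; .get() avoids a KeyError when no title was found
        print(f"Title: {result.metadata.get('title')}")
        print(f"Content: {result.markdown}")
asyncio.run(main())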
Configured Extraction
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Per-crawl settings, passed to arun() through the config argument
config = CrawlerRunConfig(
    mode="aggressive",
    word_count_threshold=10,
    remove_hidden_text=True,
    process_iframes=True,
    scroll_delay=1.0
)
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )
        print(result.markdown)
asyncio.run(main())
Batch Crawling
import asyncio
from crawl4ai import AsyncWebCrawler
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]
async def main():
    async with AsyncWebCrawler() as crawler:
        # arun_many() takes the whole list and returns a list of results
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(f"URL: {result.url}")
            print(f"Status: {result.status_code}")
asyncio.run(main())
Configuration Reference
CrawlerRunConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| mode | str | "default" | Extraction strategy |
| headless | bool | True | Run browser in headless mode |
| verbose | bool | False | Enable verbose logging |
| text_threshold | int | | Minimum text length filter |
| word_count_threshold | int | | Words per chunk threshold |
| skip_download_images | bool | False | Skip image downloads |
| page_timeout | int | 30000 | Page load timeout (ms) |
| scroll_delay | float | 0 | Delay between scrolls |
Browser Configuration
| Parameter | Type | Description |
|---|---|---|
| browser_type | str | Chromium/Firefox/WebKit |
| headless | bool | Headless mode toggle |
| proxy | dict | Proxy configuration |
| user_agent | str | Custom user agent string |
Project Structure
crawl4ai/
├── src/crawl4ai/     # Main package source
│   ├── core/         # Core crawling engine
│   ├── extractors/   # Content extraction strategies
│   ├── formatters/   # Output formatters
│   └── utils/        # Utility functions
├── examples/         # Usage examples
├── tests/            # Test suite
├── docs/             # Documentation
├── scripts/          # Build and utility scripts
└── sbom/             # Software Bill of Materials
Dependencies and SBOM
Crawl4AI maintains a comprehensive Software Bill of Materials (SBOM) documenting all direct and transitive dependencies. This SBOM is generated using CycloneDX format and regenerated on a best-effort basis through automated scripts.
Regenerating SBOM
./scripts/gen-sbom.sh
The SBOM provides visibility into the project's dependency tree, supporting security audits and license compliance verification.
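As an illustration of how a CycloneDX SBOM supports an audit, the sketch below lists component names and versions from a CycloneDX JSON document. The file name and location inside the sbom directory are not stated here, so the example parses a small inline sample whose package entries are made up; `list_components` is a hypothetical helper.

```python
import json

# Tiny inline sample in CycloneDX JSON layout; the package entries are invented.
SAMPLE_SBOM = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "other-lib", "version": "0.4.0"},
    {"type": "library", "name": "example-lib", "version": "1.2.3"}
  ]
}
"""


def list_components(sbom_text: str) -> list[tuple[str, str]]:
    """Return sorted (name, version) pairs from a CycloneDX JSON document."""
    bom = json.loads(sbom_text)
    if bom.get("bomFormat") != "CycloneDX":
        raise ValueError("not a CycloneDX document")
    # "components" is optional in the format, so default to an empty list.
    return sorted((c["name"], c.get("version", "")) for c in bom.get("components", []))
```

The same function could be pointed at the generated file by reading its text first, once the actual path is known.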
Extensibility
Custom Extractors
Extend the extraction framework by implementing the base extractor interface:
from crawl4ai.extractors import BaseExtractor
class CustomExtractor(BaseExtractor):
    async def extract(self, html: str, url: str) -> dict:
        # Custom extraction logic
        return {"content": html, "custom_field": "value"}
Output Formatters
Create custom output formats by implementing the formatter interface:
from crawl4ai.formatters import BaseFormatter
from crawl4ai.models import CrawlResult
class CustomFormatter(BaseFormatter):
    def format(self, result: CrawlResult) -> str:
        # Custom formatting logic; here the Markdown is returned unchanged
        return result.markdown
Contributing
The project welcomes contributions from the community. Developers interested in contributing should:
- Fork the repository
- Create a feature branch
- Follow the established coding standards
- Add tests for new functionality
- Submit a pull request with clear documentation
Resources and Documentation
| Resource | Location |
|---|---|
| Source Code | GitHub Repository |
| Documentation | /docs directory |
| Examples | /examples directory |
| SBOM | /sbom directory |
| Issue Tracker | GitHub Issues |
License
Crawl4AI is released under open-source licensing terms. Refer to the LICENSE file in the repository for specific terms and conditions.
Source: https://github.com/unclecode/crawl4ai / Human Manual
Installation Guide
Related topics: Quick Start Guide
Overview
This guide covers all supported methods for installing crawl4ai, a powerful web crawling and data extraction framework. The installation process handles automatic setup of core dependencies including Playwright for browser automation, supporting multiple installation scenarios from simple pip installations to Docker containerized deployments.
System Requirements
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 4 GB | 8 GB |
| Disk Space | 2 GB | 5 GB |
| CPU | 2 cores | 4+ cores |
Software Prerequisites
| Requirement | Version | Notes |
|---|---|---|
| Python | >= 3.9 | Tested up to 3.12 |
| pip | Latest | For pip installations |
| Docker | 20.10+ | For Docker installations |
| Chrome/Chromium | Latest | Auto-installed by Playwright |
Installation Methods
Method 1: pip Installation (Recommended)
The simplest and most common installation method uses the pip package manager.
pip install crawl4ai
After pip installation, you must run the post-installation setup to configure browser dependencies:
python -m crawl4ai install
This command installs Playwright browsers and configures the necessary system dependencies. Sources: crawl4ai/install.py:1-50
Method 2: Docker Installation
Docker provides an isolated environment with all dependencies pre-configured.
#### Pulling the Official Image
docker pull unclecode/crawl4ai
#### Running with Docker
docker run -d \
--name crawl4ai \
-p 8000:8000 \
unclecode/crawl4ai
#### Building Custom Docker Image
You can build your own image using the provided Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "crawl4ai"]
Sources: Dockerfile
Method 3: Installation from Source
For development or customization purposes, install from the source repository:
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
Dependencies Management
Core Dependencies
The project defines dependencies in setup.py for package distribution and requirements.txt for development.
setup.py dependencies:
| Package | Purpose |
|---|---|
| playwright | Browser automation |
| asyncio | Async operations |
| aiohttp | HTTP client |
| beautifulsoup4 | HTML parsing |
| lxml | XML/HTML processing |
Sources: setup.py
Runtime Dependency Installation
The crawl4ai/install.py module handles automatic installation of runtime dependencies:
# Key installation steps in install.py
import subprocess
import sys
def install_browsers():
    # Run Playwright's installer with the current Python interpreter
    subprocess.check_call([
        sys.executable, "-m", "playwright", "install", "chromium"
    ])
Sources: crawl4ai/install.py:20-30
Installing Optional Dependencies
| Extra | Command | Description |
|---|---|---|
| All extras | pip install crawl4ai[all] | Install all optional packages |
| Dev tools | pip install crawl4ai[dev] | Development dependencies |
| LLM support | pip install crawl4ai[llm] | Language model integration |
Installation Workflow
graph TD
A[Start Installation] --> B{Installation Method}
B -->|pip| C[Run pip install]
B -->|Docker| D[Pull/Build Image]
B -->|Source| E[Clone Repository]
C --> F[Run post-install]
F --> G[Install Playwright Browsers]
D --> H[Run Container]
E --> I[Install in Editable Mode]
I --> F
G --> J[Verify Installation]
H --> J
J --> K{Success?}
K -->|Yes| L[Installation Complete]
K -->|No| M[Troubleshoot]
M --> G
Post-Installation Verification
Verify that crawl4ai is installed correctly by checking the installation status:
python -m crawl4ai --version
Or test the installation programmatically:
import crawl4ai
print(crawl4ai.__version__)
Browser Verification
Ensure Playwright browsers are properly installed:
python -m playwright install-deps chromium
python -m playwright install chromium
Environment Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| CRAWL4AI_BROWSER_HEADLESS | true | Run browser in headless mode |
| CRAWL4AI_MAX_CONCURRENT | 5 | Maximum concurrent crawls |
| CRAWL4AI_CACHE_DIR | ~/.crawl4ai/cache | Cache directory path |
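A minimal sketch of reading the variables from the table with their documented defaults is shown below. The helper name `read_settings` and the set of strings treated as true are assumptions made for illustration; the way crawl4ai itself parses these values is not shown in this guide.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the three variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Assumption: "1", "true", and "yes" (any case) count as true.
    raw_headless = env.get("CRAWL4AI_BROWSER_HEADLESS", "true")
    headless = raw_headless.strip().lower() in ("1", "true", "yes")
    max_concurrent = int(env.get("CRAWL4AI_MAX_CONCURRENT", "5"))
    # expanduser() turns the leading "~" into the current user's home directory.
    cache_dir = Path(env.get("CRAWL4AI_CACHE_DIR", "~/.crawl4ai/cache")).expanduser()
    return {"headless": headless, "max_concurrent": max_concurrent, "cache_dir": cache_dir}
```

Passing a plain dict instead of the real environment makes the parsing easy to check in isolation.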
Configuration File
Create ~/.crawl4ai/config.json for persistent configuration:
{
"browser": {
"headless": true,
"viewport": {"width": 1920, "height": 1080}
},
"cache": {
"enabled": true,
"ttl": 3600
}
}
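To show how a file like this can be combined with built-in defaults, here is a small sketch that overlays user settings on a default dictionary. The `DEFAULTS` values simply repeat the example above, and `deep_merge` and `load_config` are hypothetical helpers, not functions from crawl4ai.

```python
import copy
import json
from pathlib import Path

# Defaults copied from the example file above (illustrative only).
DEFAULTS = {
    "browser": {"headless": True, "viewport": {"width": 1920, "height": 1080}},
    "cache": {"enabled": True, "ttl": 3600},
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values in `override` win; nested dicts are merged."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: Path) -> dict:
    """Load a JSON config file, falling back to DEFAULTS when it is absent."""
    user = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Deep-copy so callers can mutate the result without touching DEFAULTS.
    return deep_merge(copy.deepcopy(DEFAULTS), user)
```

With this approach a config file only needs to list the keys that differ from the defaults, for example `{"cache": {"ttl": 60}}`.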
Common Installation Issues
Issue: Playwright Installation Fails
Solution: Install system dependencies manually:
# Ubuntu/Debian
apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2
# Then retry browser installation
python -m playwright install chromium
Issue: Permission Denied
Solution: Use a virtual environment, or pass pip's --user flag:
python -m venv venv
source venv/bin/activate # Linux/Mac
pip install crawl4ai
Issue: Import Errors After Installation
Solution: Verify Python path and reinstall:
pip uninstall crawl4ai
pip install crawl4ai --force-reinstall
Docker-Specific Configuration
Volume Mounts
Mount local directories for persistent data:
docker run -v /path/to/data:/data unclecode/crawl4ai
Network Configuration
For web crawling behind proxies:
docker run -e HTTP_PROXY=http://proxy:8080 \
  -e HTTPS_PROXY=https://proxy:8080 \
  unclecode/crawl4ai
Upgrading crawl4ai
pip Upgrade
pip install crawl4ai --upgrade
Docker Upgrade
docker pull unclecode/crawl4ai
docker stop crawl4ai
docker rm crawl4ai
docker run -d --name crawl4ai -p 8000:8000 unclecode/crawl4ai
Source Upgrade
git pull origin main
pip install -e . --force-reinstall
Next Steps
After successful installation, proceed to:
- Quick Start - Run your first crawl operation
- Configuration Guide - Customize crawl behavior
- API Reference - Explore available methods and options
- Examples - Review usage patterns and best practices
Sources: Dockerfile
Quick Start Guide
Related topics: Async Web Crawler, Markdown Generation
Overview
The Quick Start Guide provides developers with a rapid introduction to using crawl4ai for web crawling and data extraction tasks. It serves as the entry point for users who want to begin extracting structured data from websites within minutes of installation.
Purpose and Scope
The Quick Start Guide is designed to:
- Demonstrate the simplest possible usage pattern for crawling web pages
- Show how to extract and structure content from HTML pages
- Provide copy-paste-ready code examples for immediate experimentation
- Bridge the gap between installation and production usage
Installation
Prerequisites
| Requirement | Description |
|---|---|
| Python | Version 3.9 or higher |
| pip | Latest version recommended |
| Browser | Chrome/Chromium (for JavaScript rendering) |
Installation Command
pip install crawl4ai
Basic Usage Pattern
The fundamental workflow in crawl4ai follows a simple linear pattern:
graph TD
A[Create AsyncWebCrawler Instance] --> B[Configure Parameters]
B --> C[Call arun Method with URL]
C --> D[Process Result Object]
D --> E[Extract Content/Markdown/HTML]
Hello World Example
The simplest possible usage demonstrates core functionality:
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        if result.success:
            print(f"Content: {result.markdown}")
            # links is a dict of link lists, so count the entries in each list
            print(f"Links found: {sum(len(v) for v in result.links.values())}")
        else:
            print(f"Crawl failed: {result.error_message}")
asyncio.run(main())
Core Components
AsyncWebCrawler
The primary entry point for all crawling operations:
| Parameter | Type | Description |
|---|---|---|
| verbose | bool | Enable detailed logging output |
| headless | bool | Run browser in headless mode |
| browser_type | str | Specify browser engine |
Sources: crawl4ai/__init__.py
CrawlResult Object
The return value from crawler.arun() contains extracted data:
| Property | Type | Description |
|---|---|---|
| success | bool | Whether crawl completed successfully |
| markdown | str | Extracted content as markdown |
| html | str | Raw HTML content |
| links | dict | Dictionary of internal/external links |
| media | dict | Images, videos, and other media |
| error_message | str | Error details if success is False |
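The fields in the table can be consumed as in the sketch below. `FakeResult` is a stand-in dataclass carrying a few of the listed properties so the example runs without a browser; it is not the library's class, and the `"internal"` and `"external"` key names are an assumption based on the description of `links`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeResult:
    """Stand-in with fields from the table above; not crawl4ai's CrawlResult."""
    success: bool
    markdown: str = ""
    links: dict = field(default_factory=dict)
    error_message: Optional[str] = None


def summarize(result) -> str:
    """Build a one-line summary, checking success before touching content."""
    if not result.success:
        return f"failed: {result.error_message or 'unknown error'}"
    # Assumed key names; .get() keeps the function safe if a key is missing.
    internal = len(result.links.get("internal", []))
    external = len(result.links.get("external", []))
    words = len(result.markdown.split())
    return f"ok: {words} words, {internal} internal / {external} external links"
```

Because `summarize` only reads attributes, it should also accept a real result object as long as that object exposes the same property names.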
Common Usage Patterns
Pattern 1: Simple Content Extraction
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def extract_content():
    # Per-crawl options travel in a CrawlerRunConfig object
    run_config = CrawlerRunConfig(word_count_threshold=10)
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        if result.success:
            return result.markdown
        return None
content = asyncio.run(extract_content())
Pattern 2: Batch Crawling
import asyncio
from crawl4ai import AsyncWebCrawler
async def crawl_multiple(urls):
    async with AsyncWebCrawler() as crawler:
        # Start one arun() call per URL and wait for all of them together
        tasks = [crawler.arun(url=url) for url in urls]
        results = await asyncio.gather(*tasks)
        return [r for r in results if r.success]
urls = ["https://example.com", "https://example.org"]
successful_results = asyncio.run(crawl_multiple(urls))
Configuration Options
Browser Configuration
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig(
    headless=True,
    verbose=False
)
run_config = CrawlerRunConfig(
    word_count_threshold=10,
    page_timeout=30000
)
async def main():
    # BrowserConfig configures the crawler; CrawlerRunConfig configures one arun() call
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
asyncio.run(main())
Error Handling
Always check the success property before accessing extracted content:
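# Runs inside an async function with an open crawler;
# process_data, log_error, and handle_failure are placeholders.
result = await crawler.arun(url="https://example.com")
if result.success:
    process_data(result.markdown)
else:
    log_error(f"Crawl failed: {result.error_message}")
    handle_failure()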
Next Steps
After completing the Quick Start Guide, users should explore:
- Advanced extraction strategies with CSS selectors and XPath
- JavaScript-heavy page crawling
- Rate limiting and polite crawling practices
- Integration with AI/LLM pipelines for content analysis
Sources: crawl4ai/__init__.py
System Architecture
Related topics: Browser Management, Async Web Crawler
Overview
Crawl4AI is a high-performance web crawling framework designed for AI applications. It enables efficient extraction of web content along with metadata, supporting both single-page crawling and large-scale asynchronous crawling operations. The architecture emphasizes separation of concerns between browser management, crawling logic, and result processing.
Core Components
The system is built around three primary modules that work in coordination:
| Component | File | Responsibility |
|---|---|---|
| AsyncWebCrawler | crawl4ai/async_webcrawler.py | Main entry point for crawling operations |
| BrowserManager | crawl4ai/browser_manager.py | Handles browser lifecycle and page interactions |
| AsyncDispatcher | crawl4ai/async_dispatcher.py | Manages concurrent crawling tasks |
Component Architecture
graph TD
A[User / API Client] --> B[AsyncWebCrawler]
B --> C[BrowserManager]
B --> D[AsyncDispatcher]
C --> E[Browser Instance]
D --> F[Task Queue]
F --> E
E --> G[Content Extraction]
G --> H[Result Models]
H --> B
AsyncWebCrawler
The AsyncWebCrawler class serves as the primary interface for initiating crawl operations. It accepts configuration parameters and coordinates the crawling workflow.
Key Parameters
| Parameter | Type | Description |
|---|---|---|
| config | CrawlerRunConfig | Configuration for the crawl session |
| browser_manager | BrowserManager | Shared browser manager instance |
| dispatcher | AsyncDispatcher | Task dispatcher for async operations |
Sources: crawl4ai/async_webcrawler.py
Workflow
graph LR
A[Initialize Crawler] --> B[Configure Browser]
B --> C[Create BrowserContext]
C --> D[Navigate to URL]
D --> E[Extract Content]
E --> F[Return CrawlResult]
BrowserManager
The BrowserManager handles the lifecycle of browser instances, managing Chrome/Chromium processes and providing isolated contexts for crawling sessions.
Sources: crawl4ai/browser_manager.py
Browser Lifecycle
graph TD
A[Launch Browser] --> B[Create Context]
B --> C[Create Page]
C --> D[Execute Crawl]
D --> E[Close Context]
E --> F[Repeat or Shutdown]
F --> A
AsyncDispatcher
The AsyncDispatcher enables concurrent crawling operations, managing task queues and coordinating multiple browser contexts for parallel extraction.
Sources: crawl4ai/async_dispatcher.py
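The queue-and-workers idea described above can be sketched with plain asyncio. This is an illustration of the general pattern, not the AsyncDispatcher implementation: `fake_fetch` and `dispatch` are hypothetical names, and the fetch is simulated with a short sleep instead of a browser.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Placeholder for a real crawl call; only simulates I/O latency."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


async def dispatch(urls: list[str], workers: int = 3) -> dict[str, str]:
    """Drain a shared queue of URLs with a fixed number of concurrent workers."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: dict[str, str] = {}

    async def worker() -> None:
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do, so this worker exits
            results[url] = await fake_fetch(url)

    # The worker count caps how many fetches are in flight at once.
    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```

Calling `asyncio.run(dispatch(["a", "b", "c"], workers=2))` processes the three items with at most two in flight at a time.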
Parallel Execution Model
graph TD
A[URL List] --> B[Dispatcher Queue]
B --> C[Worker 1]
B --> D[Worker 2]
B --> E[Worker N]
C --> F[Results Aggregator]
D --> F
E --> F
F --> G[Combined Output]
Data Models
Results from crawling operations are structured using Pydantic models defined in models.py.
Sources: crawl4ai/models.py
| Model | Purpose |
|---|---|
| CrawlResult | Container for extracted content and metadata |
| CrawlerRunConfig | Configuration parameters for crawl sessions |
Docker Deployment Architecture
The project includes Docker deployment specifications that containerize the crawling infrastructure.
graph TD
A[Docker Compose] --> B[Crawl4AI Container]
A --> C[Redis Cache]
A --> D[Chrome Browser]
B --> C
B --> D
Sources: deploy/docker/ARCHITECTURE.md
Technology Stack
| Layer | Technology |
|---|---|
| Runtime | Python 3.10+ |
| Browser Engine | Chrome/Chromium via Playwright |
| Async Framework | asyncio |
| Data Validation | Pydantic |
| Containerization | Docker |
Configuration
The system supports extensive configuration options through CrawlerRunConfig, including:
- JavaScript execution toggles
- Memory management settings
- Request throttling parameters
- Content extraction strategies
Dependency Management
The project maintains a Software Bill of Materials (SBOM) for tracking dependencies and ensuring reproducible builds.
Sources: sbom/README.md
To regenerate the SBOM:
./scripts/gen-sbom.sh
Sources: crawl4ai/async_webcrawler.py
Browser Management
Related topics: Anti-Bot Detection and Proxy Management
Overview
Browser Management in crawl4ai provides a comprehensive abstraction layer for controlling and orchestrating browser instances used during web crawling and scraping operations. The system abstracts the complexity of browser automation, allowing users to focus on data extraction rather than browser lifecycle management.
Architecture Overview
The browser management system follows a modular architecture with distinct components that handle specific responsibilities:
graph TD
A[BrowserManager] --> B[BrowserAdapter]
A --> C[BrowserProfiler]
B --> D[Playwright/Chromium]
C --> E[JS Snippets]
F[User Request] --> A
A --> G[Crawled Result]
Core Components
BrowserManager
The central orchestrator responsible for:
- Browser instance lifecycle (creation, configuration, teardown)
- Session management and isolation
- Resource allocation and cleanup
- Coordination between adapters and profilers
Key Responsibilities:
| Responsibility | Description |
|---|---|
| Instance Creation | Creates and initializes browser contexts |
| Configuration | Applies user-defined browser settings |
| Lifecycle Control | Manages startup and shutdown sequences |
| Pool Management | Handles browser pool for concurrent operations |
Sources: crawl4ai/browser_manager.py
BrowserAdapter
The adapter pattern implementation that provides a consistent interface for interacting with the different browser engines that Playwright drives (Chromium, Firefox, WebKit).
Adapter Features:
| Feature | Description |
|---|---|
| Engine Abstraction | Unified API across browser backends |
| Command Translation | Converts high-level commands to browser-specific instructions |
| Response Normalization | Standardizes browser responses |
Sources: crawl4ai/browser_adapter.py
BrowserProfiler
Handles JavaScript injection and performance profiling during browser operations.
Profiler Capabilities:
| Capability | Purpose |
|---|---|
| JS Injection | Execute custom JavaScript in page context |
| Performance Tracking | Monitor page load and execution metrics |
| Resource Profiling | Track network requests and responses |
Sources: crawl4ai/browser_profiler.py
JavaScript Integration
The js_snippet module provides pre-built JavaScript utilities for browser automation tasks:
graph LR
A[Browser Context] --> B[js_snippet Module]
B --> C[DOM Manipulation]
B --> D[Data Extraction]
B --> E[Event Handling]
Common JS Snippet Categories:
- DOM traversal and manipulation
- Content extraction
- Scroll management
- Wait conditions
- Network request interception
Sources: crawl4ai/js_snippet
Configuration Options
Browser Launch Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| headless | bool | true | Run browser in headless mode |
| args | list | [] | Additional browser arguments |
| timeout | int | 30000 | Navigation timeout in milliseconds |
| viewport | dict | {"width": 1920, "height": 1080} | Browser viewport dimensions |
| user_agent | str | None | Custom user agent string |
| proxy | dict | None | Proxy configuration |
Context Options
| Option | Type | Description |
|---|---|---|
| java_script_enabled | bool | Enable or disable JavaScript in the browser context |
| ignore_https_errors | bool | Ignore SSL certificate errors |
Sources: docs/codebase/browser.md
Browser Lifecycle
stateDiagram-v2
[*] --> Initializing: Create BrowserManager
Initializing --> Launching: Launch Browser
Launching --> Ready: Browser Context Created
Ready --> Navigating: Load URL
Navigating --> Ready: Page Loaded
Ready --> Executing: Run JS/Commands
Executing --> Ready: Commands Complete
Ready --> Closing: Shutdown Request
Closing --> [*]: Resources Freed
Usage Patterns
Basic Browser Usage
import asyncio
from crawl4ai import BrowserManager
async def main():
    # Initialize browser manager
    browser_mgr = BrowserManager(
        headless=True,
        viewport={"width": 1920, "height": 1080}
    )
    # Create browser context
    context = browser_mgr.new_context()
    # Use context for crawling
    result = await context.goto("https://example.com")
asyncio.run(main())
Advanced Configuration
browser_mgr = BrowserManager(
    headless=False,
    args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-dev-shm-usage"
    ],
    timeout=60000,
    user_agent="Custom User Agent"
)
Session Management
The system supports multiple concurrent sessions through isolated browser contexts:
graph TD
A[BrowserManager] --> B1[Session 1 Context]
A --> B2[Session 2 Context]
A --> B3[Session N Context]
B1 --> C1[Page 1]
B2 --> C2[Page 2]
B3 --> C3[Page N]
Error Handling
The browser management system implements comprehensive error handling:
| Error Type | Handling Strategy |
|---|---|
| Navigation Timeout | Retry with exponential backoff |
| Browser Crash | Automatic restart and context recreation |
| Resource Exhaustion | Automatic cleanup of stale contexts |
| Network Errors | Graceful degradation with cached content |
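The "retry with exponential backoff" strategy in the table can be sketched as follows. This is a generic illustration under stated assumptions: `retry_with_backoff` is a hypothetical helper, the attempt count and base delay are arbitrary, and the actual retry logic inside the browser manager is not shown here.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    op: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Retry `op` on a timeout, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return await op()
        except asyncio.TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last timeout
            # Waits are base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Only timeouts are retried in this sketch; other exceptions propagate immediately so that real failures are not masked.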
Performance Considerations
Optimization Strategies
- Context Reuse: Reuse browser contexts for multiple pages when possible
- Lazy Loading: Only load resources when explicitly requested
- Resource Limits: Configure memory and CPU limits per context
- Connection Pooling: Maintain warm browser instances for rapid access
Memory Management
| Strategy | Description |
|---|---|
| Context Isolation | Each session runs in isolated context |
| Automatic Cleanup | Temporary files and caches cleared automatically |
| Resource Limits | Configurable memory caps per browser instance |
Sources: crawl4ai/browser_manager.py
Async Web Crawler
Related topics: Markdown Generation, Extraction Strategies
Overview
The Async Web Crawler is the core component of crawl4ai, providing an asynchronous, high-performance web crawling engine built on Python's asyncio framework. It enables concurrent crawling of multiple URLs with built-in caching, configurable extraction strategies, and comprehensive result handling.
The primary purpose of this module is to fetch web pages, extract meaningful content, and return structured results that include HTML, markdown, media assets, metadata, and optional AI-generated summaries. The async design allows for efficient I/O-bound operations, making it suitable for large-scale web scraping projects.
Sources: crawl4ai/async_webcrawler.py:1-50
Architecture
System Components
The async web crawler system consists of several interconnected components that work together to provide a seamless crawling experience.
graph TD
A[AsyncWebCrawler] --> B[Browser Manager]
A --> C[Cache Layer]
A --> D[Extraction Strategy]
A --> E[Result Processor]
B --> F[Playwright/Chromium]
C --> G[File System Cache]
C --> H[Memory Cache]
D --> I[LLM-based Extraction]
D --> J[CSS/XPath Extraction]
E --> K[CrawlResult]
E --> L[Raw HTML]
E --> M[Markdown]
Sources: crawl4ai/async_webcrawler.py:1-30
Core Classes
| Class | File | Purpose |
|---|---|---|
| AsyncWebCrawler | async_webcrawler.py | Main crawler entry point with arun() and arun_many() methods |
| BrowserConfig | async_configs.py | Configuration for headless browser behavior |
| CrawlCache | cache_context.py | Manages caching strategies for crawled content |
| CrawlResult | types.py | Data model for returning crawl results |
Sources: crawl4ai/async_configs.py:1-30
AsyncWebCrawler Class
Initialization
The AsyncWebCrawler class can be initialized with optional configuration parameters:
class AsyncWebCrawler:
    def __init__(
        self,
        config: BrowserConfig | None = None,
        verbose: bool = False
    ) -> None:
        ...
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| config | BrowserConfig | None | Browser configuration object |
| verbose | bool | False | Enable verbose logging output |
Sources: crawl4ai/async_webcrawler.py:50-70
Core Methods
#### arun()
Single URL crawling with comprehensive result extraction:
async def arun(
    self,
    url: str,
    config: CrawlerRunConfig | None = None,
    **kwargs
) -> CrawlResult
Parameters:
| Parameter | Type | Description |
|---|---|---|
| url | str | Target URL to crawl |
| config | CrawlerRunConfig | Per-crawl configuration for this URL |
Sources: crawl4ai/async_webcrawler.py:100-150
#### arun_many()
Batch crawling for multiple URLs concurrently:
async def arun_many(
    self,
    urls: list[str],
    config: BrowserConfig | None = None,
    **kwargs
) -> list[CrawlResult]: ...
Parameters:
| Parameter | Type | Description |
|---|---|---|
| urls | list[str] | List of target URLs |
| config | BrowserConfig | Shared configuration for all URLs |
Sources: crawl4ai/async_webcrawler.py:200-250
Context Manager Support
The AsyncWebCrawler implements the async context manager protocol for proper resource cleanup:
async def __aenter__(self) -> "AsyncWebCrawler":
    await self.start()
    return self

async def __aexit__(
    self,
    exc_type, exc_val, exc_tb
) -> None:
    await self.close()
Sources: crawl4ai/async_webcrawler.py:80-100
CrawlResult Data Model
The CrawlResult class encapsulates all information retrieved from a crawled page:
classDiagram
class CrawlResult {
+str url
+str html
+str markdown
+list~MediaItem~ media
+list~Link~ links
+dict metadata
+bool success
+str|None error
+dict~str, Any~ extracted_content
+int status_code
+datetime created_at
}
Sources: crawl4ai/types.py:1-80
Properties
| Property | Type | Description |
|---|---|---|
| url | str | Original request URL |
| html | str | Raw HTML content |
| markdown | str | Converted markdown content |
| media | list[MediaItem] | Extracted images, videos, audio |
| links | list[Link] | Internal and external links |
| metadata | dict | Page metadata (title, description) |
| success | bool | Whether the crawl succeeded |
| error | `str \| None` | Error message if failed |
| status_code | int | HTTP response status code |
| created_at | datetime | Timestamp of crawl operation |
Sources: crawl4ai/types.py:50-100
Configuration
BrowserConfig
The BrowserConfig class provides fine-grained control over browser behavior:
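from dataclasses import dataclass, field

@dataclass
class BrowserConfig:
    headless: bool = True
    browser_type: str = "chromium"
    # A mutable default must go through default_factory in a dataclass.
    viewport_size: dict = field(
        default_factory=lambda: {"width": 1920, "height": 1080}
    )
    user_agent: str | None = None
    verbose: bool = False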
Sources: crawl4ai/async_configs.py:30-80
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| headless | bool | True | Run browser in headless mode |
| browser_type | str | "chromium" | Browser engine (chromium, firefox, webkit) |
| viewport_size.width | int | 1920 | Viewport width in pixels |
| viewport_size.height | int | 1080 | Viewport height in pixels |
| user_agent | `str \| None` | None | Custom user agent string |
| verbose | bool | False | Enable debug output |
Sources: crawl4ai/async_configs.py:40-90
Advanced Configuration
Additional crawling parameters can be passed via kwargs:
| Parameter | Type | Description |
|---|---|---|
| word_count_threshold | int | Minimum word count for content extraction |
| extraction_strategy | ExtractionStrategy | Strategy for content extraction |
| cache_mode | CacheMode | Caching behavior (enabled/disabled/bypass) |
| js_enabled | bool | Enable JavaScript execution |
| wait_for | str | CSS selector to wait for before returning |
| delay_before_return_html | float | Delay in seconds before capturing HTML |
Sources: docs/md_v2/api/async-webcrawler.md:1-60
Caching System
Cache Modes
The crawl4ai framework implements a multi-layered caching strategy:
graph LR
A[Request] --> B{Memory Cache}
B -->|Hit| C[Return Cached]
B -->|Miss| D{File System Cache}
D -->|Hit| E[Return Cached]
D -->|Miss| F[Fetch Remote]
F --> G[Store in Both Layers]
Sources: crawl4ai/cache_context.py:1-50
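The lookup order in the diagram (memory, then file system, then remote fetch with a write to both layers) can be sketched as follows. This is a minimal illustration, not crawl4ai's cache implementation: `TwoTierCache`, the SHA-256 file naming, and promoting a disk hit into memory are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class TwoTierCache:
    """Hypothetical memory + file-system cache following the diagram's order."""

    def __init__(self, cache_dir: Path) -> None:
        self.memory: dict[str, str] = {}
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        # Hash the URL so any URL maps to a safe file name (an assumption).
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get(self, url: str, fetch: Callable[[str], str]) -> tuple[str, str]:
        """Return (content, source); source is 'memory', 'disk', or 'remote'."""
        if url in self.memory:
            return self.memory[url], "memory"
        path = self._path(url)
        if path.exists():
            content = json.loads(path.read_text(encoding="utf-8"))["content"]
            self.memory[url] = content  # promote to the faster layer
            return content, "disk"
        content = fetch(url)
        self.memory[url] = content  # store in both layers
        path.write_text(json.dumps({"content": content}), encoding="utf-8")
        return content, "remote"


with tempfile.TemporaryDirectory() as tmp:
    cache = TwoTierCache(Path(tmp))
    fetcher = lambda url: f"<html>{url}</html>"  # stand-in for a real fetch
    first = cache.get("https://example.com", fetcher)[1]
    second = cache.get("https://example.com", fetcher)[1]
    cache.memory.clear()  # simulate a process restart
    third = cache.get("https://example.com", fetcher)[1]
```

The three lookups exercise each branch in turn: a miss in both layers, a memory hit, and a file-system hit after the in-memory layer is cleared.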
CacheMode Enum
| Mode | Description |
|---|---|
| ENABLED | Use cache if available, otherwise fetch and cache |
| DISABLED | Always fetch fresh content; the cache is not used |
| BYPASS | Fetch and update cache but don't read from it |
| READ_ONLY | Only read from cache, never fetch |
Sources: crawl4ai/cache_context.py:30-60
Cache Context Manager
async with cache_context(cache_mode=CacheMode.ENABLED):
    result = await crawler.arun(url="https://example.com")
Sources: crawl4ai/cache_context.py:60-90
Usage Examples
Basic Single URL Crawl
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=BrowserConfig(headless=True)
        )
        print(f"Success: {result.success}")
        print(f"Markdown content: {result.markdown[:500]}")

asyncio.run(main())
Sources: crawl4ai/async_webcrawler.py:150-200
Batch Crawling
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(f"URL: {result.url}, Status: {result.status_code}")

asyncio.run(main())
Sources: docs/md_v2/api/async-webcrawler.md:60-100
With Custom Extraction Strategy
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    config = BrowserConfig(
        headless=True,
        verbose=True
    )
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token="your-token"
    )
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(
            url="https://news-site.com/article",
            extraction_strategy=strategy
        )
        print(result.extracted_content)

asyncio.run(main())
Sources: crawl4ai/async_configs.py:90-130
Error Handling
The crawler returns comprehensive error information through the CrawlResult object:
result = await crawler.arun(url="https://invalid-url.xyz")
if not result.success:
    print(f"Error: {result.error}")
    print(f"Status Code: {result.status_code}")
| Error Scenario | success Value | error Field |
|---|---|---|
| Network timeout | False | Connection timeout message |
| Invalid URL | False | URL validation error |
| JavaScript error | False | Browser console error |
| HTTP 404/500 | False | HTTP status message |
| Success | True | None |
Sources: crawl4ai/types.py:80-120
Performance Considerations
Async Benefits
The async architecture provides several performance advantages:
- Concurrent Requests: Multiple URLs can be crawled simultaneously
- Non-blocking I/O: Browser operations don't block other tasks
- Resource Efficiency: Single event loop manages all crawling tasks
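The concurrency benefit can be illustrated without a browser. The sketch below uses a hypothetical `fake_fetch` coroutine that sleeps instead of touching the network, so it shows only the event-loop behaviour and makes no claim about how `arun_many()` schedules work internally.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.2) -> str:
    """Stand-in for an I/O-bound crawl; sleeps instead of fetching."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl_all(urls: list[str]) -> list[str]:
    # gather() runs every coroutine on the same event loop and returns
    # results in the order the URLs were given.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed_crawl(urls: list[str]) -> tuple[list[str], float]:
    start = time.perf_counter()
    pages = asyncio.run(crawl_all(urls))
    return pages, time.perf_counter() - start


pages, elapsed = timed_crawl([f"https://example.com/page{i}" for i in range(5)])
```

Five simulated fetches of 0.2 s each complete in roughly the time of one, because the waits overlap rather than queue behind each other.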
Best Practices
| Practice | Benefit |
|---|---|
| Use arun_many() for batch operations | Reduces connection overhead |
| Enable caching for repeated URLs | Avoids redundant network requests |
| Set appropriate word_count_threshold | Reduces unnecessary processing |
| Use headless=True in production | Reduces memory usage |
Sources: crawl4ai/async_webcrawler.py:250-300
Related Components
| Component | File | Relationship |
|---|---|---|
| ExtractionStrategy | extraction_strategy.py | Defines how content is extracted |
| MediaItem | types.py | Represents extracted media |
| Link | types.py | Represents extracted links |
| CacheBackend | cache_backend.py | Abstract cache implementation |
Sources: crawl4ai/types.py:1-30
Summary
The Async Web Crawler is the foundational building block of crawl4ai, providing:
- Asynchronous operation for high-performance concurrent crawling
- Flexible configuration via the BrowserConfig dataclass
- Comprehensive result types through the CrawlResult model
- Multi-layered caching with configurable modes
- Extensible extraction via pluggable strategies
- Production-ready error handling with detailed error reporting
This architecture enables developers to build scalable web scraping solutions while maintaining clean, readable code patterns familiar to Python async developers.
Sources: crawl4ai/async_webcrawler.py:1-50
Markdown Generation
Related topics: Extraction Strategies, Async Web Crawler
Markdown Generation is a core feature in crawl4ai that transforms raw HTML content into clean, readable Markdown format. This system provides flexible strategies for content extraction, filtering, and conversion with extensive customization options.
Overview
The Markdown Generation system converts web page HTML into structured Markdown text suitable for LLM consumption, RAG systems, or documentation purposes. It offers multiple generation strategies, content filtering capabilities, and fine-grained control over extraction behavior.
Sources: docs/md_v2/core/markdown-generation.md:1-15
Architecture
graph TD
A[HTML Input] --> B[Content Filter Strategy]
B --> C[HTML Processing]
C --> D[Markdown Generation Strategy]
D --> E[Markdown Output]
F[Configuration] --> B
F --> D
G[BestProvider] --> B
G --> D
Core Components
MarkdownGenerationStrategy
The primary abstraction for generating Markdown from HTML content.
| Parameter | Type | Default | Description |
|---|---|---|---|
| provider | str | "best" | Content extraction provider |
| configs | dict | {} | Provider-specific configurations |
| strict | bool | False | Raise errors on failure |
| override_system_prompt | str | None | Custom system prompt |
| override_user_prompt | str | None | Custom user prompt |
Sources: crawl4ai/markdown_generation_strategy.py:1-50
ContentFilterStrategy
Abstract base class for filtering and selecting content before Markdown conversion.
| Filter | Description |
|---|---|
| PruningContentFilter | Removes low-value content nodes |
| BM25ContentFilter | Uses BM25 ranking for content selection |
| OrgAnnContentFilter | Organic annotation-based filtering |
Sources: crawl4ai/content_filter_strategy.py:1-100
Generation Providers
Available Providers
| Provider | Description |
|---|---|
| best | Automatically selects optimal provider |
| playwright | Uses Playwright for JavaScript rendering |
| curl | Lightweight extraction via curl |
| trafilatura | Trafilatura library extraction |
| lxml | LXML-based HTML parsing |
| readability | Mozilla Readability algorithm |
Sources: crawl4ai/markdown_generation_strategy.py:50-150
Best Provider Selection
The BestProvider class intelligently selects the most appropriate extraction method based on content characteristics.
class BestProvider:
    def get_strategy(self, html: str) -> MarkdownGenerationStrategy:
        # Analyzes HTML and selects optimal provider
        pass
Sources: crawl4ai/markdown_generation_strategy.py:150-200
Workflow
graph LR
A[Fetch HTML] --> B{Content Filter Enabled?}
B -->|Yes| C[Apply Filter Strategy]
B -->|No| D[Skip Filtering]
C --> E[Generate Markdown]
D --> E
E --> F{Post-Processing?}
F -->|Yes| G[Apply Custom Rules]
F -->|No| H[Return Result]
G --> H
Configuration Options
Generator Config
{
    "word_threshold": 50,         # Minimum words per chunk
    "language": "en",             # Content language
    "skip_internal_links": True,  # Ignore internal links
    "content_type": "markdown"    # Output format
}
BM25 Filter Config
{
    "query": "relevant keywords",
    "top_n": 5,       # Number of chunks
    "use_stem": True  # Apply stemming
}
Sources: crawl4ai/content_filter_strategy.py:100-180
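To make the `query` and `top_n` options concrete, here is a small, self-contained BM25 ranking sketch. It is an illustration of the standard Okapi BM25 formula only: `bm25_top_n`, the regex tokenizer, and the `k1`/`b` defaults are assumptions, stemming (`use_stem`) is not implemented, and nothing here is taken from `BM25ContentFilter` itself.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_n(query: str, chunks: list[str], top_n: int = 5,
               k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank chunks against the query with Okapi BM25; drop zero-score chunks."""
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / max(n_docs, 1)
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)

    def score(doc: list[str]) -> float:
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    ranked = sorted(zip(chunks, map(score, docs)),
                    key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, s in ranked[:top_n] if s > 0]


chunks = [
    "Install the package with pip install crawl4ai",
    "Our company was founded in 2020",
    "Getting started: installation and setup guide",
]
top = bm25_top_n("installation guide", chunks, top_n=2)
```

Because no stemming is applied, "install" in the first chunk does not match the query term "installation"; only the third chunk scores above zero.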
HTML to Markdown Conversion
html2text Module
The html2text submodule handles low-level HTML to Markdown conversion.
| Method | Purpose |
|---|---|
| handle_anchor | Convert <a> tags to [text](url) |
| handle_image | Convert <img> to ![alt](src) |
| handle_heading | Convert <h1>-<h6> to # - ###### |
| handle_table | Convert <table> to Markdown tables |
| handle_code | Preserve <code> and <pre> formatting |
Sources: crawl4ai/html2text:core.py:1-100
Conversion Features
- Link Preservation: External links converted to Markdown format with titles
- Image Extraction: Images extracted with alt text and sources
- Table Conversion: HTML tables converted to GFM tables
- Code Block Handling: Syntax-aware code block extraction
- List Recognition: Ordered and unordered lists properly formatted
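A few of the conversions listed above (headings, links, images) can be sketched with the standard library's `html.parser`. This is a deliberately tiny stand-in, not the html2text submodule: `MiniMarkdown` and `to_markdown` are hypothetical names, and tables, code blocks, and lists are not handled.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class MiniMarkdown(HTMLParser):
    """Handles only headings, <a>, <img>, and <p>; other tags pass text through."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._href = None  # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in HEADINGS:
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self._href = attr.get("href") or ""
            self.parts.append("[")
        elif tag == "img":
            alt, src = attr.get("alt") or "", attr.get("src") or ""
            self.parts.append(f"![{alt}]({src})")

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.parts.append(f"]({self._href})")
            self._href = None
        elif tag in HEADINGS or tag == "p":
            self.parts.append("\n\n")  # blank line after block elements

    def handle_data(self, data):
        self.parts.append(data)


def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = ('<h2>Docs</h2><p>See <a href="https://example.com">the site</a>.</p>'
          '<img src="a.png" alt="logo">')
markdown = to_markdown(sample)
```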
Usage Examples
Basic Usage
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            markdown_generator={
                "provider": "best",
                "configs": {"word_threshold": 100}
            }
        )
        print(result.markdown)

asyncio.run(main())
With Content Filter
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter

filter_strategy = BM25ContentFilter(
    query="getting started installation",
    top_n=10
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com",
            markdown_generator={
                "provider": "playwright",
                "filter_strategy": filter_strategy
            }
        )

asyncio.run(main())
Advanced Configuration
Custom Prompts
Override system and user prompts for specialized extraction:
markdown_generator = {
    "override_system_prompt": "Extract only technical documentation...",
    "override_user_prompt": "Focus on API endpoints and code examples..."
}
Strict Mode
Enable strict mode to raise exceptions on extraction failures:
markdown_generator = {
    "strict": True,
    "provider": "playwright"
}
Performance Considerations
| Aspect | Recommendation |
|---|---|
| Large Pages | Use BM25ContentFilter to reduce content |
| JavaScript-heavy Sites | Use playwright provider |
| Simple Pages | Use lxml or trafilatura for speed |
| Batch Processing | Set appropriate word_threshold |
Error Handling
The system provides graceful degradation:
- Provider Fallback: Falls back to alternative provider on failure
- Strict Mode: Raises exceptions when enabled
- Partial Results: Returns available content on partial failures
Sources: crawl4ai/extraction_strategy.py:50-120
Related Components
- Chunking: Content can be further processed with text chunking strategies
- Extraction: Works alongside extraction strategies for structured data
- Cache: Generated Markdown can be cached for repeated access
Extraction Strategies
Related topics: Markdown Generation, Deep Crawling Strategies
Extraction Strategies in crawl4ai define how content is parsed, structured, and extracted from crawled web pages. They form the core abstraction layer that determines whether unstructured HTML becomes meaningful, machine-readable data.
Overview
Extraction Strategies handle the transformation pipeline from raw HTML to structured output. The system supports two primary categories:
| Category | Use Case | Performance |
|---|---|---|
| LLM-based | Complex, semantic extraction | Slower, higher accuracy |
| No-LLM | Fast, pattern-based extraction | Faster, rule-dependent |
Sources: crawl4ai/extraction_strategy.py:1-50
Architecture
graph TD
A[HTML Content] --> B[Content Scraping Strategy]
B --> C[Chunking Strategy]
C --> D{Extraction Strategy}
D -->|LLM-based| E[LLM Strategy]
D -->|No-LLM| F[No-LLM Strategy]
E --> G[Structured JSON/Markdown]
F --> G
G --> H[Table Extraction Optional]
H --> I[Final Output]
The extraction pipeline flows through scraping → chunking → extraction, with table extraction as an optional final step.
Sources: crawl4ai/extraction_strategy.py:50-100
LLM-Based Strategies
LLM strategies leverage large language models for semantic understanding and intelligent content extraction.
Supported Providers
| Provider | Model Support | Configuration |
|---|---|---|
| OpenAI | GPT-4, GPT-3.5 | OPENAI_API_KEY |
| Anthropic | Claude 3, Claude 2 | ANTHROPIC_API_KEY |
| Azure OpenAI | Custom deployments | AZURE_API_KEY, AZURE_API_BASE |
| Ollama | Local models | OLLAMA_BASE_URL |
Configuration Parameters
from typing import Optional

class LLMExtractionStrategy:
    def __init__(
        self,
        provider: str = "openai",
        model: str = "gpt-4",
        api_token: Optional[str] = None,
        system_prompt: Optional[str] = None,
        user_prompt: Optional[str] = None,
        extraction_type: str = "block",
        input_format: str = "html",
        instruction: Optional[str] = None
    ) -> None: ...
Sources: docs/md_v2/extraction/llm-strategies.md
Extraction Types
| Type | Description | Best For |
|---|---|---|
| block | Block-level extraction | Paragraphs, sections |
| schema | Schema-based extraction | Structured data, forms |
| custom | Custom instructions | Specific extraction needs |
Sources: crawl4ai/extraction_strategy.py:100-150
No-LLM Strategies
No-LLM strategies provide fast, deterministic extraction without external API dependencies.
Available Strategies
| Strategy | Purpose |
|---|---|
| NoExtractionStrategy | Pass-through, no extraction |
| JsonCssExtractionStrategy | CSS selector-based JSON extraction |
| RegexExtractionStrategy | Regex pattern matching |
| XPathExtractionStrategy | XPath-based extraction |
Sources: docs/md_v2/extraction/no-llm-strategies.md
JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy

strategy = JsonCssExtractionStrategy(
    schema={
        "name": "ProductList",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h2.title", "type": "text"},
            {"name": "price", "selector": "span.price", "type": "text"},
            {"name": "image", "selector": "img", "attribute": "src"}
        ]
    }
)
Sources: crawl4ai/extraction_strategy.py:150-200
Chunking Strategies
Chunking strategies split content into manageable pieces before extraction.
Default Chunking Behavior
graph LR
A[Large Content] --> B[Character Split]
B --> C[Overlap Application]
C --> D[Token Count Check]
D -->|Under limit| E[Chunk Ready]
D -->|Over limit| F[Recursive Split]
F --> E
Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_token_size | int | 1000 | Target tokens per chunk |
| overlap | int | 100 | Overlapping tokens between chunks |
| max_chunk_size | int | 3000 | Hard maximum chunk size |
| splitting_regex | str | \n\n+ | Regex for splitting points |
Sources: crawl4ai/chunking_strategy.py:1-80
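The interaction between `chunk_token_size` and `overlap` is easiest to see on a toy input. The sketch below treats whitespace-separated words as tokens, which is an assumption; `chunk_tokens` is a hypothetical helper and does not reproduce the recursive split or `splitting_regex` behaviour described above.

```python
def chunk_tokens(text: str, chunk_token_size: int = 1000,
                 overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_token_size tokens that share
    `overlap` tokens with the previous window."""
    if not 0 <= overlap < chunk_token_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_token_size")
    tokens = text.split()  # whitespace tokens stand in for model tokens
    step = chunk_token_size - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_token_size]))
        if start + chunk_token_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


text = " ".join(str(i) for i in range(10))
chunks = chunk_tokens(text, chunk_token_size=4, overlap=1)
```

With ten tokens, a window of four, and an overlap of one, each chunk starts on the last token of the previous chunk.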
Content Scraping Strategy
The content scraping strategy determines initial content extraction from HTML.
graph TD
A[Raw HTML] --> B{Scraping Strategy}
B -->|BeautifulSoup| C[Parse DOM]
B -->|Playwright| D[Dynamic Render]
B -->|Raw| E[Minimal Processing]
C --> F[Content Cleaned]
D --> F
E --> F
Strategy Selection
| Strategy | JavaScript | Speed | Use Case |
|---|---|---|---|
| BeautifulSoup | No | Fast | Static pages |
| Playwright | Yes | Medium | SPAs, dynamic content |
| RawContent | No | Fastest | Pre-processed HTML |
Sources: crawl4ai/content_scraping_strategy.py:1-60
Table Extraction
Table extraction handles tabular data structures within web pages.
from typing import List, Optional

class TableExtractionStrategy:
    def __init__(
        self,
        table_styles: Optional[List[str]] = None,
        ignore_tables: Optional[List[str]] = None,
        merge_multiple_headers: bool = False
    ) -> None: ...
Extraction Configuration
| Parameter | Type | Description |
|---|---|---|
| table_styles | List[str] | CSS classes to include as tables |
| ignore_tables | List[str] | CSS classes to exclude |
| merge_multiple_headers | bool | Merge multi-row headers |
| extract_header | bool | Include header row (default: True) |
Sources: crawl4ai/table_extraction.py:1-100
Complete Pipeline Example
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    LLMExtractionStrategy,
    JsonCssExtractionStrategy,
    RegexExtractionStrategy,
    TableExtractionStrategy
)

# Schema in the format shown in the JsonCssExtractionStrategy section
product_schema = {
    "name": "ProductList",
    "baseSelector": "div.product",
    "fields": [{"name": "title", "selector": "h2.title", "type": "text"}]
}

async def main():
    async with AsyncWebCrawler() as crawler:
        # LLM-based extraction
        llm_result = await crawler.arun(
            url="https://example.com/article",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai",
                model="gpt-4",
                instruction="Extract article title, author, and key points"
            )
        )
        # CSS-based extraction
        css_result = await crawler.arun(
            url="https://example.com/products",
            extraction_strategy=JsonCssExtractionStrategy(schema=product_schema)
        )

asyncio.run(main())
Sources: crawl4ai/extraction_strategy.py:200-250
Strategy Selection Guide
graph TD
A[Start] --> B{Need semantic understanding?}
B -->|Yes| C{External API acceptable?}
B -->|No| D[No-LLM Strategy]
C -->|Yes| E{Local deployment needed?}
C -->|No| F[Ollama/Local LLM]
E -->|Yes| F
E -->|No| G[OpenAI/Anthropic]
D --> H{Data has tabular structure?}
H -->|Yes| I[Add TableExtractionStrategy]
H -->|No| J[Complete]
G --> J
F --> J
I --> J
Decision Matrix
| Requirement | Recommended Strategy |
|---|---|
| Simple CSS extraction | JsonCssExtractionStrategy |
| Complex semantic parsing | LLMExtractionStrategy |
| High-volume, low-latency | No-LLM strategies |
| Schema-agnostic | LLM-based strategies |
| Tabular data focus | TableExtractionStrategy |
Sources: docs/md_v2/extraction/llm-strategies.md, docs/md_v2/extraction/no-llm-strategies.md
Environment Variables
| Variable | Required For | Description |
|---|---|---|
| OPENAI_API_KEY | OpenAI LLM | API key for GPT models |
| ANTHROPIC_API_KEY | Anthropic LLM | API key for Claude models |
| AZURE_API_KEY | Azure OpenAI | Azure OpenAI API key |
| OLLAMA_BASE_URL | Local LLM | Base URL for Ollama server |
Sources: crawl4ai/extraction_strategy.py:1-50
Deep Crawling Strategies
Related topics: Extraction Strategies, Async Web Crawler
Overview
Deep Crawling Strategies in crawl4ai provide systematic approaches to traverse and extract content from websites beyond a single page. These strategies enable controlled, scalable web crawling by managing URL discovery, prioritization, filtering, and scoring mechanisms. The deep crawling module supports multiple traversal algorithms (BFS, DFS, BFF) with extensible filtering and scoring systems.
Sources: base_strategy.py:1-50
Architecture
graph TD
A[Seed URLs] --> B[DeepCrawlingStrategy]
B --> C[URL Filters]
B --> D[URL Scorers]
B --> E[Traversal Algorithm]
C --> F[Valid URLs]
D --> G[Prioritized URLs]
E --> H[Crawl Queue]
G --> H
H --> I[Crawl4AI Extractor]
I --> J[Extracted Content]
J --> K[Links Extracted]
K --> B
The architecture follows a producer-consumer pattern where the strategy continuously discovers URLs from crawled pages and feeds them back into the crawl queue based on prioritization rules.
Sources: base_strategy.py:50-100
Core Components
Base Strategy
All crawling strategies inherit from DeepCrawlingStrategy, which provides the foundational interface and shared functionality:
| Property/Method | Type | Description |
|---|---|---|
| url_scorer | URLScorer | Scores URLs for prioritization |
| url_filter | URLFilter | Filters URLs for validity |
| keywords | Set[str] | Keywords for relevance matching |
| keywords_or | Set[str] | Alternative keyword matching |
| max_depth | int | Maximum crawl depth |
| included_domains | Set[str] | Allowed domains |
| excluded_domains | Set[str] | Blocked domains |
| crawl_enabled | bool | Enable/disable crawling |
| check_keywords | Callable | Custom keyword validation |
Sources: base_strategy.py:100-150
Traversal Algorithms
BFS (Breadth-First Search) Strategy
The BFS strategy explores pages level by level, ensuring comprehensive coverage before going deeper:
graph LR
A[Level 0: Seed] --> B[Level 1: depth=1]
B --> C[Level 2: depth=2]
C --> D[Level 3: depth=3]
style A fill:#90EE90
style B fill:#87CEEB
style C fill:#DDA0DD
style D fill:#F0E68C
Characteristics:
- Systematic exploration of shallow depths first
- Ideal for site maps and directory-style sites
- Higher memory usage due to large frontier sets
- Better for finding all accessible pages at shallower depths
Sources: bfs_strategy.py:1-80
Configuration Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_depth | int | 3 | Maximum crawl depth |
| max_pages | int | 50 | Maximum pages to crawl |
| priority | int | 0 | Base priority score |
| include_external | bool | False | Allow external domain crawling |
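How `max_depth` and `max_pages` bound a breadth-first traversal can be shown on an in-memory link graph. This is a sketch of the general algorithm under stated assumptions: `bfs_crawl`, the `SITE` dictionary, and the `get_links` callback are hypothetical and stand in for real page fetching and link extraction.

```python
from collections import deque
from typing import Callable


def bfs_crawl(seed: str, get_links: Callable[[str], list[str]],
              max_depth: int = 3, max_pages: int = 50) -> list[tuple[str, int]]:
    """Visit pages level by level; returns (url, depth) in crawl order."""
    visited = {seed}
    order: list[tuple[str, int]] = []
    queue = deque([(seed, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()  # FIFO: shallow pages come out first
        order.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Toy site: each key lists the links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/b"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
}
bfs_order = bfs_crawl("/", lambda u: SITE.get(u, []), max_depth=2)
```

The page `/a/1/x` sits at depth 3, so a crawl with `max_depth=2` never queues it.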
DFS (Depth-First Search) Strategy
The DFS strategy explores as deep as possible before backtracking:
graph TD
A[Start] --> B[Depth 1]
B --> C[Depth 2]
C --> D[Depth 3]
D --> E[Backtrack]
E --> F[Next Branch]
F --> G[Continue Deep]
style A fill:#90EE90
style D fill:#FF6B6B
Characteristics:
- Deep exploration of specific paths first
- Lower memory footprint
- Suitable for following navigation chains
- Risk of getting stuck in deep site sections
Sources: dfs_strategy.py:1-80
Configuration Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_depth | int | 10 | Maximum crawl depth |
| max_pages | int | 100 | Maximum pages to crawl |
| priority | int | 0 | Base priority score |
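For contrast with level-by-level traversal, a depth-first crawl replaces the queue with a stack. As before, this is a generic sketch rather than the library's strategy class: `dfs_crawl`, `SITE`, and `get_links` are hypothetical names.

```python
from typing import Callable


def dfs_crawl(seed: str, get_links: Callable[[str], list[str]],
              max_depth: int = 10, max_pages: int = 100) -> list[str]:
    """Follow each branch as deep as allowed before backtracking."""
    visited: set[str] = set()
    order: list[str] = []
    stack = [(seed, 0)]
    while stack and len(order) < max_pages:
        url, depth = stack.pop()  # LIFO: the newest discovery is explored next
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        if depth < max_depth:
            # Reverse so the first link on the page is explored first.
            for link in reversed(get_links(url)):
                if link not in visited:
                    stack.append((link, depth + 1))
    return order


SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/b"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
}
dfs_order = dfs_crawl("/", lambda u: SITE.get(u, []))
```

The stack only holds the unexplored siblings along the current path, which is why the frontier tends to stay smaller than in a breadth-first crawl of the same site.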
BFF (Best-First with Filters) Strategy
The BFF strategy combines filtering with score-based prioritization, crawling the most relevant pages first:
graph TD
A[URL Discovered] --> B{URL Filter}
B -->|Pass| C{Score URL}
B -->|Fail| X[Skip]
C --> D[Priority Queue]
D --> E[Crawl Next Best]
E --> F[Extract Links]
F --> A
Characteristics:
- Relevance-based crawling using keyword matching
- Configurable scoring functions
- Filters out irrelevant content early
- Most efficient for targeted data extraction
Sources: bff_strategy.py:1-100
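The filter, score, and priority-queue loop in the diagram can be sketched with `heapq`. The function name `best_first_crawl`, the toy `SITE` graph, and the example filter and scorer are assumptions for illustration; the real strategy's interfaces may differ.

```python
import heapq
from typing import Callable


def best_first_crawl(seed: str,
                     get_links: Callable[[str], list[str]],
                     url_filter: Callable[[str], bool],
                     url_scorer: Callable[[str], float],
                     max_pages: int = 100) -> list[str]:
    """Always crawl the highest-scoring URL discovered so far."""
    counter = 0  # tie-breaker: equal scores are crawled in discovery order
    heap = [(0.0, counter, seed)]
    seen = {seed}
    order: list[str] = []
    while heap and len(order) < max_pages:
        _, _, url = heapq.heappop(heap)
        order.append(url)
        for link in get_links(url):
            if link in seen or not url_filter(link):
                continue  # filter before scoring to skip needless work
            seen.add(link)
            counter += 1
            # heapq is a min-heap, so the score is negated.
            heapq.heappush(heap, (-url_scorer(link), counter, link))
    return order


SITE = {
    "/": ["/blog", "/docs/api", "/logo.png", "/docs"],
    "/docs": ["/docs/guide"],
}
KEYWORDS = ("docs", "api")
crawl_order = best_first_crawl(
    "/",
    get_links=lambda u: SITE.get(u, []),
    url_filter=lambda u: not u.endswith(".png"),
    url_scorer=lambda u: float(sum(k in u for k in KEYWORDS)),
)
```

`/docs/api` matches both keywords and is crawled before `/docs`, while `/blog` scores zero and is left for last; `/logo.png` is dropped by the filter and never scored.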
Filtering System
The filtering system determines which URLs are eligible for crawling:
FilterChain
Multiple filters can be chained together for comprehensive URL validation:
from crawl4ai.deep_crawling.filters import FilterChain, SameDomainFilter, ExtensionFilter

filter_chain = FilterChain([
    SameDomainFilter(allowed_domains=["example.com"]),
    ExtensionFilter(excluded_extensions=[".pdf", ".zip"]),
])
Available Filters
| Filter | Purpose | Key Parameters |
|---|---|---|
| SameDomainFilter | Restrict to same domain | allowed_domains, strict |
| ExtensionFilter | Block by file extension | excluded_extensions |
| robots.txt Filter | Respect robots directives | user_agent |
| RegexFilter | Custom pattern matching | patterns, exclude |
Sources: filters.py:1-120
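The idea behind chaining filters, where a URL is accepted only if every filter accepts it, can be sketched with plain callables. The helpers `same_domain`, `exclude_extensions`, and `chain` are hypothetical stand-ins; treating subdomains as part of an allowed domain is also an assumption.

```python
from typing import Callable, Iterable
from urllib.parse import urlparse

UrlCheck = Callable[[str], bool]


def same_domain(allowed_domains: Iterable[str]) -> UrlCheck:
    allowed = {d.lower() for d in allowed_domains}

    def check(url: str) -> bool:
        host = (urlparse(url).hostname or "").lower()
        # Accept the domain itself and any of its subdomains (an assumption).
        return host in allowed or any(host.endswith("." + d) for d in allowed)

    return check


def exclude_extensions(extensions: Iterable[str]) -> UrlCheck:
    blocked = tuple(e.lower() for e in extensions)

    def check(url: str) -> bool:
        # Inspect only the path so query strings do not hide the extension.
        return not urlparse(url).path.lower().endswith(blocked)

    return check


def chain(*filters: UrlCheck) -> UrlCheck:
    # all() short-circuits, so cheap filters listed first save work.
    return lambda url: all(f(url) for f in filters)


allow = chain(same_domain(["example.com"]),
              exclude_extensions([".pdf", ".zip"]))
```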
Scoring System
URL scoring determines crawl priority within the queue:
ScorerChain
Scorers can be combined in a chain for multi-factor evaluation:
from crawl4ai.deep_crawling.scorers import ScorerChain, KeywordRelevanceScorer, DepthScorer

scorer = ScorerChain([
    KeywordRelevanceScorer(keywords=["api", "docs"]),
    DepthScorer(max_depth=5, decay=0.5),
])
Available Scorers
| Scorer | Function | Parameters |
|---|---|---|
| KeywordRelevanceScorer | Match keywords in URL/text | keywords, weight |
| DepthScorer | Penalize deep pages | max_depth, decay |
| FreshnessScorer | Prefer recent content | date_field, decay |
| CustomScorer | User-defined scoring | scoring_fn |
Sources: scorers.py:1-150
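One plausible way to combine a keyword score with a depth penalty is to add them. The formulas below are assumptions chosen for illustration (fraction of keywords matched, and `decay ** depth` with a hard cutoff past `max_depth`); the source does not specify how the library's scorers compute or combine their values.

```python
from typing import Sequence


def keyword_score(url: str, keywords: Sequence[str], weight: float = 1.0) -> float:
    """Fraction of keywords that appear in the URL, scaled by weight."""
    if not keywords:
        return 0.0
    lowered = url.lower()
    hits = sum(1 for k in keywords if k.lower() in lowered)
    return weight * hits / len(keywords)


def depth_score(depth: int, max_depth: int = 5, decay: float = 0.5) -> float:
    """Exponential penalty: each extra level multiplies the score by decay."""
    if depth > max_depth:
        return 0.0
    return decay ** depth


def combined_score(url: str, depth: int, keywords: Sequence[str]) -> float:
    # Simple additive chain; a weighted sum or product would also be reasonable.
    return keyword_score(url, keywords) + depth_score(depth)


score = combined_score("https://example.com/docs/api", depth=1,
                       keywords=["api", "docs"])
```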
Usage Examples
Basic BFS Crawling
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.deep_crawling import BFSDistanceStrategy

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            strategy=BFSDistanceStrategy(
                max_depth=3,
                max_pages=50
            )
        )

asyncio.run(main())
Targeted Crawling with BFF
from crawl4ai.deep_crawling import BFFStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling.filters import SameDomainFilter

strategy = BFFStrategy(
    max_depth=5,
    max_pages=100,
    keywords={"documentation", "api", "guide"},
    url_filter=SameDomainFilter(allowed_domains=["example.com"]),
    url_scorer=KeywordRelevanceScorer(keywords={"documentation"}),
)
Sources: docs/md_v2/core/deep-crawling.md:1-100
Workflow States
stateDiagram-v2
[*] --> Initializing: Start crawl
Initializing --> Crawling: Load seed URLs
Crawling --> Processing: Fetch page
Processing --> Filtering: Extract links
Filtering --> Scoring: Validate URLs
Scoring --> Queuing: Rank by priority
Queuing --> Crawling: Next URL
Crawling --> [*]: Queue empty or max reached
Strategy Selection Guide
| Use Case | Recommended Strategy | Reason |
|---|---|---|
| Site mapping | BFS | Comprehensive shallow coverage |
| Documentation sites | BFS | Find all pages systematically |
| Article/navigation chains | DFS | Follow deep links naturally |
| Targeted data extraction | BFF | Prioritize relevant pages |
| API documentation | BFF | Filter by keyword relevance |
| Limited resources | DFS | Lower memory footprint |
Configuration Reference
Common Parameters
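from dataclasses import dataclass
from typing import Optional, Set

from crawl4ai.deep_crawling.filters import FilterChain
from crawl4ai.deep_crawling.scorers import ScorerChain

@dataclass
class DeepCrawlConfig:
    # Scope
    included_domains: Optional[Set[str]] = None
    excluded_domains: Optional[Set[str]] = None
    # Limits
    max_depth: int = 3
    max_pages: int = 100
    max_total_pages: int = 1000
    # Filtering
    allow_external: bool = False
    check_keywords: bool = True
    # Scoring
    scoring: Optional[ScorerChain] = None
    filter: Optional[FilterChain] = None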
Keyword Matching
# AND matching (all keywords must match)
keywords = {"documentation", "api"}
keywords_or = False
# OR matching (any keyword matches)
keywords = {"guide", "tutorial", "docs"}
keywords_or = True
Sources: base_strategy.py:150-200
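The AND/OR distinction can be written as a single predicate. This is a minimal sketch: `matches_keywords` is a hypothetical helper, matching is a case-insensitive substring test, and an empty keyword set is assumed to match everything.

```python
from typing import Iterable


def matches_keywords(text: str, keywords: Iterable[str],
                     keywords_or: bool = False) -> bool:
    """AND matching by default; OR matching when keywords_or is True."""
    keywords = list(keywords)
    if not keywords:
        return True  # assumption: no keywords means no restriction
    lowered = text.lower()
    hits = (k.lower() in lowered for k in keywords)
    return any(hits) if keywords_or else all(hits)


and_match = matches_keywords("API documentation index", {"documentation", "api"})
or_match = matches_keywords("A short tutorial", {"guide", "tutorial", "docs"},
                            keywords_or=True)
```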
Advanced Features
Custom Filters
from crawl4ai.deep_crawling.filters import URLFilter

class CustomFilter(URLFilter):
    def should_crawl(self, url: str) -> bool:
        # Custom logic
        return "product" in url and not url.endswith(".jpg")
Custom Scorers
from crawl4ai.deep_crawling.scorers import URLScorer

class PriorityScorer(URLScorer):
    def score(self, url: str, context: dict) -> float:
        base_score = 1.0
        if "important" in url:
            base_score *= 2.0
        return base_score
Sources: filters.py:150-200, scorers.py:150-200
Best Practices
- Set reasonable limits: Always configure max_pages and max_depth to prevent runaway crawling
- Use BFF for targeted extraction: When you know what content you need, BFF reduces noise
- Filter early, score late: Apply filters before scoring to reduce unnecessary processing
- Respect robots.txt: Configure filters to respect site crawling directives
- Monitor memory usage: BFS uses more memory; switch to DFS for resource-constrained environments
- Combine keyword strategies: Use both keywords (AND) and keywords_or (OR) for flexible matching
Sources: base_strategy.py:1-50
Anti-Bot Detection and Proxy Management
Related topics: Browser Management
Overview
Crawl4ai provides sophisticated anti-bot detection evasion and proxy management capabilities to ensure reliable web crawling operations. These features work together to detect and circumvent bot protection mechanisms while maintaining request anonymity through proxy rotation.
Architecture Overview
graph TD
A[Client Request] --> B[AntiBotDetector]
B --> C{Bot Detection?}
C -->|Yes| D[Apply Evasion Strategy]
C -->|No| E[Direct Request]
D --> F[ProxySelector]
F --> G[Rotating Proxies]
G --> H[Target Website]
H --> I{Response Valid?}
I -->|No| J[Fallback Mechanism]
J --> B
I -->|Yes| K[Return Content]
Anti-Bot Detection System
Purpose and Scope
The anti-bot detection module (antibot_detector.py) analyzes responses from target websites to determine if bot protection mechanisms have been triggered. When detection occurs, the system can automatically apply evasion strategies or fall back to alternative methods.
Sources: crawl4ai/antibot_detector.py
Detection Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Header Analysis | Examines HTTP headers for bot detection signals | Standard bot checks |
| Content Analysis | Scans response content for CAPTCHAs or blocking messages | Challenge pages |
| Status Code Monitoring | Tracks HTTP status codes indicating blocks | 403, 429 responses |
| JavaScript Challenge Detection | Identifies JS-based bot challenges | Cloudflare, PerimeterX |
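Three of the signals in the table (status codes, headers, and response content) can be combined into a simple heuristic. The sketch below is illustrative only: `looks_blocked`, the status set, and the marker phrases are assumptions, not the rules used by antibot_detector.py, and JavaScript challenge detection is not attempted.

```python
BLOCK_STATUS = {403, 429}
# Hypothetical phrases that commonly appear on challenge or block pages.
BLOCK_MARKERS = ("captcha", "access denied", "verify you are human")


def looks_blocked(status_code: int, headers: dict[str, str], body: str) -> list[str]:
    """Return the reasons a response looks like a bot block (empty if none)."""
    reasons: list[str] = []
    if status_code in BLOCK_STATUS:
        reasons.append(f"status {status_code}")
    # Header names are case-insensitive, so compare in lower case.
    if "retry-after" in {name.lower() for name in headers}:
        reasons.append("retry-after header")
    lowered = body.lower()
    reasons.extend(f"marker '{m}'" for m in BLOCK_MARKERS if m in lowered)
    return reasons


rate_limited = looks_blocked(429, {"Retry-After": "30"}, "")
```

Returning the list of reasons rather than a bare boolean makes it easier to log why a response was classified as blocked and to pick a matching fallback.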
Key Components
The anti-bot detector integrates with the async configuration system to provide seamless fallback handling:
# Pseudocode representation based on async_configs.py integration
class AntiBotConfig:
    enabled: bool = True
    detection_threshold: float = 0.7
    auto_fallback: bool = True
    max_retries: int = 3
Sources: crawl4ai/async_configs.py
Proxy Management System
Purpose and Scope
The proxy strategy module (proxy_strategy.py) manages proxy rotation, selection, and health checking to maintain request anonymity and distribute load across multiple IP addresses.
Sources: crawl4ai/proxy_strategy.py
Proxy Rotation Strategies
| Strategy | Description | Best For |
|---|---|---|
| Round Robin | Sequential proxy selection | Even distribution |
| Random | Random proxy selection | Avoiding pattern detection |
| Weighted | Prioritize faster/reliable proxies | Performance optimization |
| Geographic | Match proxy location to target | Region-specific content |
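The selection rules in the table can be sketched with the standard library alone. This is an illustrative sketch rather than the API of proxy_strategy.py: the function names, the weight dictionary, and the region mapping are hypothetical.

```python
import itertools
import random
from typing import Dict, Iterator, List, Optional


def round_robin(proxies: List[str]) -> Iterator[str]:
    """Round Robin: yield proxies in order, wrapping around indefinitely."""
    return itertools.cycle(proxies)


def pick_random(proxies: List[str], rng: Optional[random.Random] = None) -> str:
    """Random: choose any proxy with equal probability."""
    return (rng or random).choice(proxies)


def pick_weighted(weights: Dict[str, float],
                  rng: Optional[random.Random] = None) -> str:
    """Weighted: choose a proxy with probability proportional to its weight."""
    urls = list(weights)
    return (rng or random).choices(urls, weights=[weights[u] for u in urls], k=1)[0]


def pick_by_region(by_region: Dict[str, List[str]], region: str,
                   default: List[str]) -> List[str]:
    """Geographic: prefer proxies in the target's region, else use the default list."""
    return by_region.get(region) or default
```

Passing a seeded `random.Random` instance as `rng` makes the random and weighted picks reproducible in tests.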
Configuration Parameters
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProxyConfig:
    proxies: List[str] = field(default_factory=list)  # list of proxy URLs
    rotation_strategy: str = "round_robin"
    health_check_interval: int = 300  # seconds
    timeout: int = 30  # proxy timeout in seconds
    retry_on_failure: bool = True
Proxy Health Monitoring
The system monitors proxy health through periodic health checks. Proxies that fail a check are removed from the active pool and re-evaluated after a cooldown period.
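One way to model the remove-then-re-evaluate cycle is a pool that benches a failed proxy until a cooldown expires. The sketch below is an assumption-laden illustration: the `ProxyPool` class, its method names, and the injectable `clock` parameter are hypothetical and are not taken from the project source.

```python
import time
from typing import Callable, Dict, List


class ProxyPool:
    """Hypothetical pool: failed proxies are benched, then eligible again after a cooldown."""

    def __init__(self, proxies: List[str], cooldown: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._proxies = list(proxies)
        self._cooldown = cooldown
        self._clock = clock  # injectable so tests can control time
        self._benched_until: Dict[str, float] = {}

    def mark_failed(self, proxy: str) -> None:
        """Remove a proxy from the active pool until the cooldown has passed."""
        self._benched_until[proxy] = self._clock() + self._cooldown

    def mark_healthy(self, proxy: str) -> None:
        """Return a proxy to the active pool immediately."""
        self._benched_until.pop(proxy, None)

    def active(self) -> List[str]:
        """List proxies that are not benched at the current time."""
        now = self._clock()
        return [p for p in self._proxies
                if self._benched_until.get(p, float("-inf")) <= now]
```

In this sketch a benched proxy simply becomes selectable again once its cooldown ends; a fuller design would run a health check at that point before trusting it.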
Integration with Async Configuration
Fallback Mechanisms
When anti-bot detection triggers, the system can automatically switch to fallback modes:
- Proxy Fallback: Rotate to a different proxy server
- Strategy Fallback: Switch to alternative crawling strategies
- User-Agent Fallback: Use different browser fingerprints
Sources: docs/md_v2/advanced/anti-bot-and-fallback.md
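The three fallback modes listed above can be viewed as an ordered list of overrides that are tried until one attempt returns content. This is a rough sketch only: `fetch_with_fallbacks`, the `FALLBACK_STEPS` list, the `"http_only"` strategy name, and the convention that a blocked fetch returns `None` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical fallback order: plain request, then proxy, strategy, user agent.
FALLBACK_STEPS: List[Dict[str, str]] = [
    {},                                                 # first try, no overrides
    {"proxy": "http://proxy2.example.com:8080"},        # proxy fallback
    {"strategy": "http_only"},                          # strategy fallback
    {"user_agent": "Mozilla/5.0 (X11; Linux x86_64)"},  # user-agent fallback
]


def fetch_with_fallbacks(url: str,
                         fetch: Callable[..., Optional[str]],
                         steps: List[Dict[str, str]]) -> Optional[str]:
    """Call fetch(url, **overrides) for each step; return the first non-None result."""
    for overrides in steps:
        content = fetch(url, **overrides)
        if content is not None:
            return content
    return None  # every fallback was blocked
```

The caller supplies `fetch`, so the same loop works with any crawler function that accepts keyword overrides and signals a block by returning `None`.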
Configuration Example
anti_bot:
  enabled: true
  detection_sensitivity: "medium"
  auto_fallback: true
proxy:
  enabled: true
  strategy: "weighted"
  proxies:
    - "http://proxy1.example.com:8080"
    - "http://proxy2.example.com:8080"
  health_check:
    enabled: true
    interval: 300
Security Considerations
Proxy Security
When configuring proxies, consider the following security aspects:
- Proxy Protocol: Prefer HTTPS proxies, which encrypt traffic between the client and the proxy
- Authentication: Implement proxy authentication where supported
- Provider Reputation: Use trusted proxy providers
- IP Rotation: Avoid predictable IP patterns
Sources: docs/md_v2/advanced/proxy-security.md
Best Practices
| Practice | Description |
|---|---|
| Rate Limiting | Respect target site limits to avoid IP bans |
| Request Delays | Implement delays between requests |
| Header Randomization | Vary User-Agent and other headers |
| Cookie Management | Handle cookies appropriately per session |
| SSL Verification | Validate SSL certificates for security |
Workflow Diagram
sequenceDiagram
participant Client
participant AntiBot as AntiBot Detector
participant ProxyMgr as Proxy Manager
participant Target as Target Website
Client->>AntiBot: Send Request
AntiBot->>ProxyMgr: Request Proxy
ProxyMgr->>AntiBot: Return Proxy
AntiBot->>Target: Forward Request via Proxy
alt Bot Detected
Target-->>AntiBot: Bot Challenge Response
AntiBot->>ProxyMgr: Request Different Proxy
ProxyMgr-->>AntiBot: New Proxy
AntiBot->>Target: Retry with New Proxy
else Success
Target-->>Client: Return Content
end
Error Handling
Common Error Scenarios
| Error | Cause | Resolution |
|---|---|---|
| 403 Forbidden | IP blocked | Rotate proxy |
| 429 Too Many Requests | Rate limited | Backoff and retry |
| CAPTCHA Required | Bot detected | Switch strategy |
| Proxy Timeout | Proxy unavailable | Health check and replace |
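The table above amounts to a small dispatch from an observed failure to a recovery action. A minimal sketch follows, assuming a hypothetical `Action` enum and `choose_action` function that do not appear in the project source.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ROTATE_PROXY = "rotate proxy"
    BACKOFF_AND_RETRY = "backoff and retry"
    SWITCH_STRATEGY = "switch strategy"
    REPLACE_PROXY = "health check and replace"


def choose_action(status: Optional[int] = None, captcha: bool = False,
                  proxy_timeout: bool = False) -> Optional[Action]:
    """Map an observed failure to the resolution listed in the table."""
    if proxy_timeout:      # proxy unavailable, no response from the target
        return Action.REPLACE_PROXY
    if captcha:            # bot detected by a challenge page
        return Action.SWITCH_STRATEGY
    if status == 403:      # IP blocked
        return Action.ROTATE_PROXY
    if status == 429:      # rate limited
        return Action.BACKOFF_AND_RETRY
    return None            # no known error condition
```

The proxy timeout is checked first in this sketch because no status code is available when the proxy itself does not respond.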
Retry Logic
The system implements exponential backoff for retries:
retry_config = {
    "max_attempts": 3,
    "base_delay": 1.0,  # seconds
    "max_delay": 60.0,  # seconds
    "exponential_base": 2
}
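With these parameters, a common way to compute the wait before each retry is `base_delay * exponential_base ** attempt`, capped at `max_delay`. The source does not state the exact formula, so the sketch below is an assumption; `backoff_delay` and `retry_delays` are hypothetical helper names.

```python
from typing import Dict, List, Union

Number = Union[int, float]

retry_config: Dict[str, Number] = {
    "max_attempts": 3,
    "base_delay": 1.0,   # seconds
    "max_delay": 60.0,   # seconds
    "exponential_base": 2,
}


def backoff_delay(attempt: int, config: Dict[str, Number] = retry_config) -> float:
    """Delay in seconds before retry number `attempt` (0-based), capped at max_delay."""
    delay = config["base_delay"] * config["exponential_base"] ** attempt
    return float(min(delay, config["max_delay"]))


def retry_delays(config: Dict[str, Number] = retry_config) -> List[float]:
    """Waits between attempts: max_attempts tries imply max_attempts - 1 waits."""
    return [backoff_delay(i, config) for i in range(int(config["max_attempts"]) - 1)]
```

Under these assumptions, three attempts wait 1 second and then 2 seconds between tries, and the 60-second cap only takes effect from the seventh retry onward (2 ** 6 = 64).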
Summary
Crawl4AI's anti-bot detection and proxy management system combines block detection with proxy rotation to keep crawls reliable against protected websites. The integration between AntiBotDetector and ProxyStrategy enables automatic fallback, which is intended to improve crawl success rates on such sites.
Sources: crawl4ai/antibot_detector.py
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.
1. Installation risk: [Bug]: arun() and arun_many() type hinting needs fixing
- Severity: high
- Finding: Installation risk is backed by a source signal: [Bug]: arun() and arun_many() type hinting needs fixing. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1898
2. Configuration risk: [Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked as failed), no error messages or failure reason…
- Severity: high
- Finding: Configuration risk is backed by a source signal: [Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked as failed), no error messages or failure reason…. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1949
3. Configuration risk: [Bug]: MCP scrape tools lack wait_until / SPA support that REST API and CLI provide
- Severity: high
- Finding: Configuration risk is backed by a source signal: [Bug]: MCP scrape tools lack wait_until / SPA support that REST API and CLI provide. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1963
4. Configuration risk: [Bug]: `remove_empty_elements_fast()` drops trailing text when removing empty elements with non-empty .tail
- Severity: high
- Finding: Configuration risk is backed by a source signal: [Bug]: `remove_empty_elements_fast()` drops trailing text when removing empty elements with non-empty .tail. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1938
5. Security or permission risk: [Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-3x token overhead for CJK content
- Severity: high
- Finding: Security or permission risk is backed by a source signal: [Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-3x token overhead for CJK content. Treat it as a review item until the current version is checked.
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1962
6. Installation risk: [Bug] AsyncLogger writes to stdout, breaking MCP stdio transport
- Severity: medium
- Finding: Installation risk is backed by a source signal: [Bug] AsyncLogger writes to stdout, breaking MCP stdio transport. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1968
7. Installation risk: [Bug]: The install with pip on just about any system rarely works. It requires an env or it only partial installs
- Severity: medium
- Finding: Installation risk is backed by a source signal: [Bug]: The install with pip on just about any system rarely works. It requires an env or it only partial installs. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1950
8. Installation risk: [Bug]: enable_stealth=True is a silent no-op - StealthAdapter imports symbols that don't exist in playwright-stealth 2.x
- Severity: medium
- Finding: Installation risk is backed by a source signal: [Bug]: enable_stealth=True is a silent no-op - StealthAdapter imports symbols that don't exist in playwright-stealth 2.x. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1959
9. Installation risk: v0.7.1:Update
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.7.1:Update. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/releases/tag/v0.7.1
10. Installation risk: v0.7.2: CI/CD & Dependency Optimization Update
- Severity: medium
- Finding: Installation risk is backed by a source signal: v0.7.2: CI/CD & Dependency Optimization Update. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/releases/tag/v0.7.2
11. Configuration risk: [Bug]: Markdown export loses heading hierarchy and table structure
- Severity: medium
- Finding: Configuration risk is backed by a source signal: [Bug]: Markdown export loses heading hierarchy and table structure. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/unclecode/crawl4ai/issues/1964
12. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | github_repo:798201435 | https://github.com/unclecode/crawl4ai | README/documentation is current enough for a first validation pass.
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using crawl4ai with real data or production workflows.
- [[Bug] AsyncLogger writes to stdout, breaking MCP stdio transport](https://github.com/unclecode/crawl4ai/issues/1968) - github / github_issue
- [[Bug]: Markdown text extraction drops text when element contains empty e](https://github.com/unclecode/crawl4ai/issues/1966) - github / github_issue
- [[Bug] MCP Server json.dumps() escapes non-ASCII characters, causing 2.5-](https://github.com/unclecode/crawl4ai/issues/1962) - github / github_issue
- [[Bug]: MCP scrape tools lack wait_until / SPA support that REST API and](https://github.com/unclecode/crawl4ai/issues/1963) - github / github_issue
- [[Bug]: Markdown export loses heading hierarchy and table structure](https://github.com/unclecode/crawl4ai/issues/1964) - github / github_issue
- [[Bug]: enable_stealth=True is a silent no-op - StealthAdapter imports sy](https://github.com/unclecode/crawl4ai/issues/1959) - github / github_issue
- [[Bug]: After successful FETCH, and failed SCRAPE (COMPLETE being marked](https://github.com/unclecode/crawl4ai/issues/1949) - github / github_issue
- [[Bug]: arun() and arun_many() type hinting needs fixing](https://github.com/unclecode/crawl4ai/issues/1898) - github / github_issue
- [[Bug]: The install with pip on just about any system rarely works. It re](https://github.com/unclecode/crawl4ai/issues/1950) - github / github_issue
- [[Bug]: `remove_empty_elements_fast()` drops trailing text when removing](https://github.com/unclecode/crawl4ai/issues/1938) - github / github_issue
- Release v0.7.7 - github / github_release
- Release v0.7.5 - github / github_release
Source: Project Pack community evidence and pitfall evidence