This page is the dense, scrolling reference. For prose explanations, see the corresponding concept pages.
Core
| Name | Type | Default | Description |
|---|---|---|---|
| max_depth | int | 1 | Max link-hop distance from start URL. See Architecture. |
| max_pages | int | 100 | Total page cap. |
| max_concurrency | int | 10 | Worker coroutines (1..100). |
| user_agent | str | yoink/<ver> (+github) | User-Agent header. |
| timeout | int | 30 | Per-request timeout (seconds). |
| follow_external | bool | False | Follow links to other domains. |
| extract_text | bool | True | Run trafilatura for clean text. |
| save_html | bool | False | Persist raw HTML on each Page. |
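
As a rough illustration of how these defaults and the 1..100 bound on max_concurrency fit together, here is a minimal sketch using a plain dataclass. `CoreSettings` is a hypothetical stand-in written for this page, not the library's CrawlConfig class; only the field names and defaults come from the table above, and raising ValueError on an out-of-range value is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CoreSettings:
    """Hypothetical stand-in mirroring the Core table; not the real CrawlConfig."""

    max_depth: int = 1
    max_pages: int = 100
    max_concurrency: int = 10
    timeout: int = 30
    follow_external: bool = False
    extract_text: bool = True
    save_html: bool = False

    def __post_init__(self) -> None:
        # The table documents max_concurrency as 1..100. How the library
        # reacts to an out-of-range value is assumed here, not documented.
        if not 1 <= self.max_concurrency <= 100:
            raise ValueError("max_concurrency must be between 1 and 100")


# Override two fields and keep every other default from the table.
settings = CoreSettings(max_depth=2, max_pages=500)
```

Fields that are not passed keep the defaults listed in the table, so only the values that differ need to be spelled out.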
Rate limiting
| Name | Type | Default | Description |
|---|---|---|---|
| requests_per_second | float | 2.0 | Per-domain token-bucket fill rate. See Rate limiting. |
| request_delay | float | 0.0 | Minimum seconds between requests to same domain. |
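
To make the interaction between the two fields concrete, the following is a minimal token-bucket sketch, not the library's implementation. `DomainLimiter`, its `reserve` method, and the `capacity` parameter (the burst size, which the table does not specify) are assumptions for illustration. Time is passed in explicitly so the behaviour is easy to follow.

```python
class DomainLimiter:
    """Illustrative token bucket for one domain; names are hypothetical."""

    def __init__(self, requests_per_second=2.0, request_delay=0.0, capacity=1.0):
        self.rate = requests_per_second   # tokens added per second
        self.delay = request_delay        # minimum gap between sends
        self.capacity = capacity          # assumed burst size
        self.tokens = capacity
        self.clock = 0.0                  # time at which self.tokens is valid
        self.last_send = None

    def reserve(self, now):
        """Return the earliest time >= now at which the next request may go."""
        now = max(now, self.clock)
        # Refill for the time elapsed since the bucket was last updated.
        self.tokens = min(self.capacity, self.tokens + (now - self.clock) * self.rate)
        self.clock = now

        send_at = now
        if self.tokens < 1.0:
            # Wait until the bucket holds one full token.
            send_at = now + (1.0 - self.tokens) / self.rate
        if self.last_send is not None:
            # request_delay acts as a floor on the spacing between sends.
            send_at = max(send_at, self.last_send + self.delay)

        # Advance the bucket to the send time and spend one token.
        self.tokens = min(self.capacity, self.tokens + (send_at - self.clock) * self.rate) - 1.0
        self.clock = send_at
        self.last_send = send_at
        return send_at


# Three requests queued at t=0 with the table defaults (2.0 req/s, no delay).
default_schedule = [DomainLimiter().reserve(0.0)]
limiter = DomainLimiter()
default_schedule = [limiter.reserve(0.0) for _ in range(3)]   # [0.0, 0.5, 1.0]

# Same rate, but a 1-second minimum delay dominates the spacing.
slow = DomainLimiter(requests_per_second=2.0, request_delay=1.0)
delayed_schedule = [slow.reserve(0.0) for _ in range(3)]      # [0.0, 1.0, 2.0]
```

Under these assumptions, whichever constraint is stricter wins: the fill rate when request_delay is 0.0, and request_delay when it exceeds 1 / requests_per_second.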
robots.txt
| Name | Type | Default | Description |
|---|---|---|---|
| respect_robots | bool | True | Fetch and apply robots.txt. See robots.txt. |
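
What "apply robots.txt" means in practice can be sketched with Python's standard `urllib.robotparser`. This is an illustration only: whether the library uses this module is not stated here, and the `allowed` helper, the sample rules, and the `yoink` user-agent token are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration; a crawler would fetch these from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="yoink", respect_robots=True):
    """Hypothetical helper: skip the check entirely when respect_robots is False."""
    if not respect_robots:
        return True
    return parser.can_fetch(user_agent, url)


blocked = allowed("https://example.com/private/report")
open_page = allowed("https://example.com/docs/")
forced = allowed("https://example.com/private/report", respect_robots=False)
```

With respect_robots left at True, the `/private/` URL is rejected and the `/docs/` URL is accepted; turning it off bypasses the rules.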
JavaScript rendering (requires [browser])
| Name | Type | Default | Description |
|---|---|---|---|
| render_js | bool | False | Use Playwright. See JS rendering. |
| headless | bool | True | Run browser without a UI window. |
| wait_strategy | WaitStrategy | NETWORKIDLE | One of load, domcontentloaded, networkidle, commit. |
| wait_selector | str \| None | None | CSS selector to wait for. |
| browser_type | Literal | chromium | One of chromium, firefox, webkit. |
| browser_pool_size | int | 3 | Pooled browser contexts (1..10). |
| screenshot_dir | str \| None | None | Debug screenshots directory. |
URL filtering (separate from CrawlConfig)
Pass to Crawler(url_filter=...). See CombinedFilter.from_config.
| Name | Type | Default | Description |
|---|---|---|---|
| include_patterns | list[str] | — | Glob/regex/substring; URL must match at least one. |
| exclude_patterns | list[str] | — | Glob/regex/substring; URL fails on any match. |
| skip_extensions | list[str] | — | File extensions to skip (no leading dot). |
| allowed_domains | list[str] | — | Domain allowlist with subdomain support. |
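
The four rules above can be sketched as a single predicate. This is a simplified illustration, not CombinedFilter itself: `url_passes` is a hypothetical function, it handles glob patterns only (the table also lists regex and substring matching), and treating an empty include list as "allow everything" is an assumption.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def url_passes(url, include_patterns=(), exclude_patterns=(),
               skip_extensions=(), allowed_domains=()):
    """Hypothetical filter: return True if the URL survives every rule."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Domain allowlist: exact match or any subdomain of an allowed domain.
    if allowed_domains and not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    ):
        return False

    # Extension check on the last path segment, compared without a leading dot.
    last_segment = parsed.path.rsplit("/", 1)[-1]
    ext = last_segment.rsplit(".", 1)[-1].lower() if "." in last_segment else ""
    if ext and ext in skip_extensions:
        return False

    # Any exclude match rejects the URL.
    if any(fnmatchcase(url, p) for p in exclude_patterns):
        return False

    # If include patterns are given, at least one must match.
    if include_patterns and not any(fnmatchcase(url, p) for p in include_patterns):
        return False
    return True


kept = url_passes(
    "https://docs.example.com/guide/intro",
    include_patterns=["*/guide/*"],
    allowed_domains=["example.com"],
)
skipped_pdf = url_passes("https://example.com/file.pdf", skip_extensions=["pdf"])
off_domain = url_passes("https://notexample.com/guide/x", allowed_domains=["example.com"])
excluded = url_passes("https://example.com/login", exclude_patterns=["*/login*"])
```

Note the subdomain check compares against `"." + domain`, so `notexample.com` does not slip through an allowlist containing `example.com`.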
Checkpointing (separate from CrawlConfig)
Pass to Crawler(checkpoint_manager=...). See CheckpointManager.
| Name | Type | Default | Description |
|---|---|---|---|
| uri | str | — | Path or s3://bucket/key. |
| flush_interval | int | 10 | Pages between state writes. |
| resume | bool (crawl arg) | False | Set on Crawler.crawl() to load from checkpoint. |
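
One practical consequence of flush_interval is that pages crawled since the last write are not in the checkpoint. The sketch below illustrates that with a local JSON file. `IntervalCheckpointer` and its state format are hypothetical and unrelated to how CheckpointManager actually stores state.

```python
import json
import os
import tempfile


class IntervalCheckpointer:
    """Hypothetical checkpointer that writes state every flush_interval pages."""

    def __init__(self, path, flush_interval=10):
        self.path = path
        self.flush_interval = flush_interval
        self.visited = []
        self.flushes = 0

    def record(self, url):
        self.visited.append(url)
        if len(self.visited) % self.flush_interval == 0:
            self.flush()

    def flush(self):
        # Write to a temporary file and rename it, so an interrupted
        # write cannot leave a half-written checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump({"visited": self.visited}, fh)
        os.replace(tmp, self.path)
        self.flushes += 1

    def load(self):
        with open(self.path, encoding="utf-8") as fh:
            self.visited = json.load(fh)["visited"]


path = os.path.join(tempfile.mkdtemp(), "crawl.json")
checkpointer = IntervalCheckpointer(path, flush_interval=10)
for i in range(25):
    checkpointer.record(f"https://example.com/page/{i}")

# Simulate a restart: only what was flushed can be restored.
resumed = IntervalCheckpointer(path)
resumed.load()
```

After 25 pages with an interval of 10, the sketch has flushed twice, so a resumed run sees 20 pages and would refetch the last 5. A smaller interval narrows that window at the cost of more writes.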
CLI flag mapping
| CLI flag | Config field |
|---|---|
| --depth, -d | max_depth |
| --max-pages, -n | max_pages |
| --concurrency, -c | max_concurrency |
| --user-agent | user_agent |
| --follow-external | follow_external |
| --save-html | save_html |
| --rate-limit, -r | requests_per_second |
| --request-delay | request_delay |
| --no-robots | respect_robots=False |
| --render-js, --browser | render_js |
| --wait-for | wait_strategy |
| --wait-selector | wait_selector |
| --browser-type | browser_type |
| --no-headless | headless=False |
| --include | url_filter.include_patterns |
| --exclude | url_filter.exclude_patterns |
| --skip-extensions | url_filter.skip_extensions |
| --checkpoint | CheckpointManager.from_uri |
| --checkpoint-interval | flush_interval |
| --resume | crawler.crawl(resume=True) |
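
For readers wiring these flags into their own tooling, the mapping can be sketched with `argparse`. This is not the library's CLI code: the parser below covers only a subset of the flags, the `yoink` program name is assumed, and the argument types and defaults are copied from the tables on this page.

```python
import argparse

# Hypothetical parser reproducing part of the flag-to-field mapping.
parser = argparse.ArgumentParser(prog="yoink")
parser.add_argument("--depth", "-d", dest="max_depth", type=int, default=1)
parser.add_argument("--max-pages", "-n", dest="max_pages", type=int, default=100)
parser.add_argument("--concurrency", "-c", dest="max_concurrency", type=int, default=10)
parser.add_argument("--rate-limit", "-r", dest="requests_per_second", type=float, default=2.0)
# Negative flags flip a field whose default is True.
parser.add_argument("--no-robots", dest="respect_robots", action="store_false")
parser.add_argument("--no-headless", dest="headless", action="store_false")
# Two spellings that set the same field.
parser.add_argument("--render-js", "--browser", dest="render_js", action="store_true")

args = parser.parse_args(["-d", "3", "--no-robots", "--browser"])
```

Here `args.max_depth` is 3, `args.respect_robots` is False, `args.render_js` is True, and the untouched fields keep their table defaults.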
See also