Configuration reference

Quick-scan reference for every configuration option, organized by section.

This page is the dense, scrolling reference. For prose explanations, see the corresponding concept pages.

Core

Name	Type	Default	Description
max_depth	int	1	Max link-hop distance from start URL. See Architecture.
max_pages	int	100	Total page cap.
max_concurrency	int	10	Worker coroutines (1..100).
user_agent	str	yoink/<ver> (+github)	User-Agent header.
timeout	int	30	Per-request timeout (seconds).
follow_external	bool	False	Follow links to other domains.
extract_text	bool	True	Run trafilatura for clean text.
save_html	bool	False	Persist raw HTML on each Page.
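
To make the defaults above concrete, the Core options can be mirrored in a plain dataclass. This is a sketch only: `CoreOptions` is a hypothetical stand-in, not the library's own configuration class, and the range check simply restates the 1..100 bound listed for max_concurrency. The user_agent field is left out because its default contains a version placeholder.

```python
from dataclasses import dataclass


@dataclass
class CoreOptions:
    """Hypothetical mirror of the Core table; not the library's config class."""

    max_depth: int = 1
    max_pages: int = 100
    max_concurrency: int = 10
    timeout: int = 30  # seconds, per request
    follow_external: bool = False
    extract_text: bool = True
    save_html: bool = False

    def __post_init__(self) -> None:
        # The table lists 1..100 as the valid range for max_concurrency.
        if not 1 <= self.max_concurrency <= 100:
            raise ValueError("max_concurrency must be between 1 and 100")


# Defaults as documented, plus one override set for a deeper, wider crawl.
defaults = CoreOptions()
wide = CoreOptions(max_depth=3, max_pages=500, max_concurrency=20)
```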

Name	Type	Default	Description
requests_per_second	float	2.0	Per-domain token-bucket fill rate. See Rate limiting.
request_delay	float	0.0	Minimum seconds between requests to same domain.
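
The two options above can be read as a token bucket plus a minimum spacing. The sketch below is one plausible way they combine, not the library's implementation: `TokenBucket`, its `capacity` parameter, and the rule "wait for whichever constraint is stricter" are all assumptions. Time is passed in explicitly so the behavior is easy to check.

```python
class TokenBucket:
    """Illustrative per-domain limiter; an assumption, not the library's code."""

    def __init__(self, requests_per_second=2.0, request_delay=0.0, capacity=1.0):
        self.rate = requests_per_second
        self.delay = request_delay
        self.capacity = capacity   # hypothetical burst size, not a documented option
        self.tokens = capacity
        self.updated = 0.0         # time of the last refill
        self.last_request = None   # time the last request was sent

    def wait_time(self, now):
        """Seconds to wait at time `now` before the next request may go out."""
        # Refill for the elapsed time, capped at capacity.
        tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        bucket_wait = 0.0 if tokens >= 1.0 else (1.0 - tokens) / self.rate
        delay_wait = 0.0
        if self.last_request is not None:
            delay_wait = max(0.0, self.last_request + self.delay - now)
        # The stricter of the two constraints wins.
        return max(bucket_wait, delay_wait)

    def acquire(self, now):
        """Consume one token and return the time at which the request is sent."""
        send_at = now + self.wait_time(now)
        refilled = self.tokens + (send_at - self.updated) * self.rate
        self.tokens = min(self.capacity, refilled) - 1.0
        self.updated = send_at
        self.last_request = send_at
        return send_at


# Three requests queued at t=0 with the default rate of 2.0 per second.
bucket = TokenBucket(requests_per_second=2.0)
send_times = [bucket.acquire(0.0) for _ in range(3)]  # [0.0, 0.5, 1.0]
```

With request_delay=1.0 added, the same three requests would go out at 0.0, 1.0, and 2.0 under this model, because the delay is then the stricter constraint.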

Name	Type	Default	Description
respect_robots	bool	True	Fetch and apply robots.txt. See the robots.txt page.
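
For a feel of what applying robots.txt means, the standard library's `urllib.robotparser` can evaluate rules offline. This is an illustration of the general mechanism, not a statement about how the crawler implements it; the `"yoink"` agent token and the example rules are assumptions.

```python
from urllib.robotparser import RobotFileParser

# A small robots.txt, parsed from memory instead of fetched over the network.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The agent token "yoink" is assumed for illustration.
allowed = parser.can_fetch("yoink", "https://example.com/docs/")
blocked = parser.can_fetch("yoink", "https://example.com/private/data")
```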

Name	Type	Default	Description
render_js	bool	False	Render pages with Playwright. See JS rendering.
headless	bool	True	Run browser without a UI window.
wait_strategy	WaitStrategy	NETWORKIDLE	load | domcontentloaded | networkidle | commit.
wait_selector	str | None	None	CSS selector to wait for.
browser_type	Literal	chromium	chromium | firefox | webkit.
browser_pool_size	int	3	Pooled browser contexts (1..10).
screenshot_dir	str | None	None	Debug screenshots directory.
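
The four wait_strategy values match the `wait_until` values accepted by Playwright's `page.goto()`. The sketch below shows how such an option could be turned into `goto()` keyword arguments. The enum here is a hypothetical re-creation (member names other than NETWORKIDLE are guessed from the table), and `goto_kwargs` is an illustrative helper, not part of the library.

```python
from enum import Enum


class WaitStrategy(str, Enum):
    """Hypothetical re-creation of the enum named in the table."""

    LOAD = "load"
    DOMCONTENTLOADED = "domcontentloaded"
    NETWORKIDLE = "networkidle"
    COMMIT = "commit"


def goto_kwargs(strategy=WaitStrategy.NETWORKIDLE, timeout=30):
    """Build keyword arguments for Playwright's page.goto()."""
    # Playwright takes its timeout in milliseconds; the Core table uses seconds.
    return {"wait_until": strategy.value, "timeout": timeout * 1000}


kwargs = goto_kwargs()
fast = goto_kwargs(WaitStrategy.DOMCONTENTLOADED, timeout=10)
```

With Playwright installed, these would be used as `page.goto(url, **kwargs)`.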

Pass to Crawler(url_filter=...). See CombinedFilter.from_config.

Name	Type	Default	Description
include_patterns	list[str]	—	Glob/regex/substring; URL must match at least one.
exclude_patterns	list[str]	—	Glob/regex/substring; URL fails on any match.
skip_extensions	list[str]	—	File extensions to skip (no leading dot).
allowed_domains	list[str]	—	Domain allowlist with subdomain support.
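
The semantics in the table above can be sketched as a single predicate. This is a simplified illustration: `url_allowed` is a hypothetical function, it handles glob patterns only (the real matcher also accepts regex and substring forms), and it assumes that an empty list means "no restriction".

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def url_allowed(url, include_patterns=(), exclude_patterns=(),
                skip_extensions=(), allowed_domains=()):
    """Illustrative filter using glob patterns only; the real matcher may differ."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Domain allowlist: exact match or any subdomain.
    if allowed_domains and not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    ):
        return False

    # Extensions are listed without a leading dot.
    last_segment = parsed.path.rsplit("/", 1)[-1]
    if "." in last_segment:
        extension = last_segment.rsplit(".", 1)[-1].lower()
        if extension in skip_extensions:
            return False

    # Any exclude match rejects the URL.
    if any(fnmatchcase(url, p) for p in exclude_patterns):
        return False

    # With include patterns set, at least one must match.
    if include_patterns and not any(fnmatchcase(url, p) for p in include_patterns):
        return False
    return True


ok = url_allowed(
    "https://docs.example.com/docs/intro",
    include_patterns=["*/docs/*"],
    allowed_domains=["example.com"],
)
skipped = url_allowed("https://example.com/docs/manual.pdf", skip_extensions=["pdf"])
```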

Pass to Crawler(checkpoint_manager=...). See CheckpointManager.

Name	Type	Default	Description
uri	str	—	Path or s3://bucket/key.
flush_interval	int	10	Pages between state writes.
resume	bool (crawl arg)	False	Set on Crawler.crawl() to load from checkpoint.
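
To show what flush_interval controls, here is a minimal sketch of a writer that persists state every N pages. `FlushingCheckpoint` is a hypothetical stand-in for illustration, not the library's CheckpointManager; it handles local paths only (no s3:// URIs), and the JSON layout is invented.

```python
import json
import os
import tempfile


class FlushingCheckpoint:
    """Hypothetical stand-in for a checkpoint writer; local paths only."""

    def __init__(self, uri, flush_interval=10):
        self.uri = uri
        self.flush_interval = flush_interval
        self.visited = []
        self.writes = 0

    def record(self, url):
        self.visited.append(url)
        # Write state once every `flush_interval` pages.
        if len(self.visited) % self.flush_interval == 0:
            self.flush()

    def flush(self):
        # Write to a temp file, then rename, so readers never see a partial file.
        tmp = self.uri + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump({"visited": self.visited}, fh)
        os.replace(tmp, self.uri)
        self.writes += 1

    def load(self):
        """Roughly what a resume would need: read back the saved state."""
        with open(self.uri, encoding="utf-8") as fh:
            return json.load(fh)["visited"]


path = os.path.join(tempfile.mkdtemp(), "crawl.json")
ckpt = FlushingCheckpoint(path, flush_interval=10)
for i in range(25):
    ckpt.record(f"https://example.com/page/{i}")
```

After 25 pages with the default interval, this sketch has written state twice, and a resume would pick up from the 20 pages saved at the last write. A smaller interval loses less work on a crash at the cost of more frequent writes.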