Configuration reference

Quick-scan reference for every configuration option, organized by section.

This page is the dense, scrolling reference. For prose explanations, see the corresponding concept pages.

Core

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `max_depth` | `int` | `1` | Max link-hop distance from start URL. See Architecture. |
| `max_pages` | `int` | `100` | Total page cap. |
| `max_concurrency` | `int` | `10` | Worker coroutines (1..100). |
| `user_agent` | `str` | `yoink/<ver> (+github)` | User-Agent header. |
| `timeout` | `int` | `30` | Per-request timeout (seconds). |
| `follow_external` | `bool` | `False` | Follow links to other domains. |
| `extract_text` | `bool` | `True` | Run trafilatura for clean text. |
| `save_html` | `bool` | `False` | Persist raw HTML on each Page. |
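
To make the interaction between `max_depth` and `max_pages` concrete, here is a minimal sketch of a breadth-first walk over an in-memory link graph. It is an illustration only, not the crawler's actual scheduler: `bounded_crawl` and `LINKS` are hypothetical names, and the sketch assumes the start URL sits at depth 0 and counts toward the page cap.

```python
from collections import deque


def bounded_crawl(links, start, max_depth=1, max_pages=100):
    """Breadth-first walk over an in-memory link graph (no network access)."""
    seen = {start}
    queue = deque([(start, 0)])  # (url, link hops from the start URL)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # already max_depth hops away: do not follow its links
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical site: "/" links to "/a" and "/b"; "/a" links to "/a/1".
LINKS = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": [], "/a/1": []}

print(bounded_crawl(LINKS, "/", max_depth=1))               # start page plus direct links
print(bounded_crawl(LINKS, "/", max_depth=2, max_pages=3))  # page cap stops the walk early
```

Under these assumptions, raising `max_depth` widens the set of reachable pages, while `max_pages` stops the walk regardless of how much of the frontier remains.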

Rate limiting

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `requests_per_second` | `float` | `2.0` | Per-domain token-bucket fill rate. See Rate limiting. |
| `request_delay` | `float` | `0.0` | Minimum seconds between requests to same domain. |
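
For readers unfamiliar with the term, a token bucket with a fill rate of `requests_per_second` can be sketched as follows. This is a generic illustration, not the library's implementation: the `TokenBucket` class, its `capacity` parameter, and the injectable `clock` are assumptions made for the example. A per-domain limiter would keep one such bucket per host.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity=1.0, clock=time.monotonic):
        self.rate = rate          # plays the role of requests_per_second
        self.capacity = capacity  # burst size (assumed, not a documented option)
        self.clock = clock        # injectable so the sketch can be tested
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self):
        """Return True and spend one token if a request may go out now."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=2.0)
print(bucket.try_acquire())  # first request is allowed immediately
print(bucket.try_acquire())  # second is refused until the bucket refills
```

With a rate of 2.0 and a capacity of 1, the bucket allows roughly one request every half second once the initial token is spent.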

robots.txt

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `respect_robots` | `bool` | `True` | Fetch and apply robots.txt. See robots.txt. |
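
As a rough picture of what "apply robots.txt" means, the sketch below checks URLs against a rule set using Python's standard `urllib.robotparser`. Whether the crawler uses this module internally is not stated here; the rules, the `allowed` helper, and the `yoink` agent token are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules, parsed from a string so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="yoink"):
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/docs/intro"))
print(allowed("https://example.com/private/report"))
```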

JavaScript rendering (requires [browser])

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `render_js` | `bool` | `False` | Use Playwright. See JS rendering. |
| `headless` | `bool` | `True` | Run browser without a UI window. |
| `wait_strategy` | `WaitStrategy` | `NETWORKIDLE` | One of `load`, `domcontentloaded`, `networkidle`, `commit`. |
| `wait_selector` | `str \| None` | `None` | CSS selector to wait for. |
| `browser_type` | `Literal` | `chromium` | One of `chromium`, `firefox`, `webkit`. |
| `browser_pool_size` | `int` | `3` | Pooled browser contexts (1..10). |
| `screenshot_dir` | `str \| None` | `None` | Debug screenshots directory. |

URL filtering (separate from CrawlConfig)

Pass to Crawler(url_filter=...). See CombinedFilter.from_config.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `include_patterns` | `list[str]` |  | Glob/regex/substring; URL must match at least one. |
| `exclude_patterns` | `list[str]` |  | Glob/regex/substring; URL fails on any match. |
| `skip_extensions` | `list[str]` |  | File extensions to skip (no leading dot). |
| `allowed_domains` | `list[str]` |  | Domain allowlist with subdomain support. |
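
The four filter options can be read as a chain of checks. The sketch below is one possible reading, not the behavior of `CombinedFilter`: `url_passes` is a hypothetical helper, it handles glob patterns only (not regex or substring), and it assumes that an empty `include_patterns` or `allowed_domains` list means "no restriction".

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_passes(url, include_patterns=(), exclude_patterns=(),
               skip_extensions=(), allowed_domains=()):
    """Return True if `url` survives every configured check (glob-only sketch)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Domain allowlist: exact match or any subdomain of an allowed domain.
    if allowed_domains and not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    ):
        return False

    # Extensions are listed without a leading dot, so strip it before comparing.
    ext = PurePosixPath(parsed.path).suffix.lstrip(".").lower()
    if ext and ext in skip_extensions:
        return False

    # Any exclude match rejects the URL.
    if any(fnmatchcase(url, p) for p in exclude_patterns):
        return False

    # If include patterns are given, at least one must match.
    if include_patterns and not any(fnmatchcase(url, p) for p in include_patterns):
        return False
    return True


print(url_passes("https://example.com/docs/intro", include_patterns=["*/docs/*"]))
print(url_passes("https://example.com/files/report.PDF", skip_extensions=["pdf"]))
```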

Checkpointing (separate from CrawlConfig)

Pass to Crawler(checkpoint_manager=...). See CheckpointManager.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `uri` | `str` |  | Path or `s3://bucket/key`. |
| `flush_interval` | `int` | `10` | Pages between state writes. |
| `resume` | `bool` (crawl arg) | `False` | Set on `Crawler.crawl()` to load from checkpoint. |
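
To illustrate what "pages between state writes" implies, here is a small stand-in that writes state to a local JSON file every `flush_interval` pages. It is not `CheckpointManager`: the `IntervalCheckpoint` class, its JSON layout, and the write-then-rename strategy are all assumptions made for this sketch, and it covers local paths only.

```python
import json
import os
from pathlib import Path


class IntervalCheckpoint:
    """Hypothetical stand-in: persists state every `flush_interval` pages."""

    def __init__(self, path, flush_interval=10):
        self.path = Path(path)
        self.flush_interval = flush_interval
        self.visited = []
        self.writes = 0  # number of state writes, kept for demonstration

    def record(self, url):
        """Note a crawled page and flush when the interval is reached."""
        self.visited.append(url)
        if len(self.visited) % self.flush_interval == 0:
            self.flush()

    def flush(self):
        # Write to a temporary file, then rename, so a crash mid-write
        # does not leave a truncated checkpoint behind.
        tmp = self.path.with_name(self.path.name + ".tmp")
        tmp.write_text(json.dumps({"visited": self.visited}))
        os.replace(tmp, self.path)
        self.writes += 1

    def load(self):
        """Restore previously flushed state, if any (the 'resume' case)."""
        if self.path.exists():
            self.visited = json.loads(self.path.read_text())["visited"]
        return self.visited
```

In this sketch, pages recorded since the last flush are lost if the process stops, which is the trade-off a larger `flush_interval` makes in exchange for fewer writes.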

CLI flag mapping

| CLI flag | Config field |
| --- | --- |
| `--depth`, `-d` | `max_depth` |
| `--max-pages`, `-n` | `max_pages` |
| `--concurrency`, `-c` | `max_concurrency` |
| `--user-agent` | `user_agent` |
| `--follow-external` | `follow_external` |
| `--save-html` | `save_html` |
| `--rate-limit`, `-r` | `requests_per_second` |
| `--request-delay` | `request_delay` |
| `--no-robots` | `respect_robots=False` |
| `--render-js`, `--browser` | `render_js` |
| `--wait-for` | `wait_strategy` |
| `--wait-selector` | `wait_selector` |
| `--browser-type` | `browser_type` |
| `--no-headless` | `headless=False` |
| `--include` | `url_filter.include_patterns` |
| `--exclude` | `url_filter.exclude_patterns` |
| `--skip-extensions` | `url_filter.skip_extensions` |
| `--checkpoint` | `CheckpointManager.from_uri` |
| `--checkpoint-interval` | `flush_interval` |
| `--resume` | `crawler.crawl(resume=True)` |
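
The mapping above can be pictured with a small `argparse` sketch covering a handful of the flags. This is not the project's actual CLI code: `build_parser` and the `crawl-sketch` program name are hypothetical, and which flags take a value versus act as switches is inferred from the field types listed on this page.

```python
import argparse


def build_parser():
    """Map a subset of the documented flags onto config field names."""
    p = argparse.ArgumentParser(prog="crawl-sketch")  # hypothetical program name
    # Value flags: dest is the config field, default mirrors the tables above.
    p.add_argument("--depth", "-d", dest="max_depth", type=int, default=1)
    p.add_argument("--max-pages", "-n", dest="max_pages", type=int, default=100)
    p.add_argument("--concurrency", "-c", dest="max_concurrency", type=int, default=10)
    p.add_argument("--rate-limit", "-r", dest="requests_per_second", type=float, default=2.0)
    # Negative switches flip a field whose default is True.
    p.add_argument("--no-robots", dest="respect_robots", action="store_false")
    p.add_argument("--no-headless", dest="headless", action="store_false")
    # Two spellings of the same switch share one destination.
    p.add_argument("--render-js", "--browser", dest="render_js", action="store_true")
    return p


args = build_parser().parse_args(["-d", "3", "--no-robots", "--browser"])
print(vars(args))
```

Using `dest` keeps the flag spelling independent of the field name, which is how `--depth` can populate `max_depth` and `--no-robots` can set `respect_robots` to `False`.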

See also