CrawlConfig
Every knob — depth, concurrency, rate limit, robots, JS rendering, browser pool. Pydantic-validated.
CrawlConfig is a Pydantic model that captures every dial you can turn. Validation runs on construction, so out-of-range values (a negative depth, a concurrency above 100) raise a Pydantic ValidationError immediately.
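To make the fail-on-construction behavior concrete, here is a minimal sketch using a stand-in model. `DemoConfig` and `try_build` are hypothetical names invented for this illustration; only the two bounds are taken from the tables on this page, and the real CrawlConfig may declare them differently.

```python
from pydantic import BaseModel, Field, ValidationError


class DemoConfig(BaseModel):
    """Hypothetical stand-in mirroring two documented CrawlConfig bounds."""

    max_depth: int = Field(default=1, ge=0)  # documented: >= 0
    max_concurrency: int = Field(default=10, ge=1, le=100)  # documented: 1..100


def try_build(**kwargs):
    """Return (config, None) on success or (None, failing field names) on error."""
    try:
        return DemoConfig(**kwargs), None
    except ValidationError as exc:
        # Each error entry records the location of the offending field.
        return None, sorted(str(err["loc"][0]) for err in exc.errors())


ok, _ = try_build(max_depth=2)
_, bad_fields = try_build(max_depth=-1, max_concurrency=500)
print(ok.max_depth, bad_fields)
```

Catching ValidationError at the point where the config is built keeps bad input from reaching the crawl loop at all.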
Import
from yoink import CrawlConfig
from yoink.models import WaitStrategy
Core settings
| Name | Type | Default | Description |
|---|---|---|---|
| max_depth | int | 1 | Maximum link-hop distance from the start URL. Depth 0 = start URL only. Validated >= 0. |
| max_pages | int | 100 | Hard cap on pages crawled. Validated >= 1. |
| max_concurrency | int | 10 | Number of concurrent worker coroutines. Validated 1..100. |
| user_agent | str | yoink/<ver> (+github) | User-Agent header sent on every request. |
| timeout | int | 30 | Per-request timeout in seconds. Validated >= 1. |
| follow_external | bool | False | If False, drop links whose domain differs from the start URL's. |
| extract_text | bool | True | Run trafilatura on each page's HTML to populate Page.text. |
| save_html | bool | False | Persist the raw HTML on each Page record. Drastically increases output size. |
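The interaction of max_depth, max_pages, and follow_external is easier to see on a toy link graph. The sketch below is not yoink's crawler: `LINKS` and `crawl_order` are hypothetical, pages come from an in-memory dict instead of HTTP, and "domain" is assumed to mean the URL's host, which may differ from how yoink compares domains.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory link graph standing in for real HTTP fetches.
LINKS = {
    "https://a.example/": ["https://a.example/docs", "https://b.example/"],
    "https://a.example/docs": ["https://a.example/docs/api"],
    "https://a.example/docs/api": ["https://a.example/docs/api/deep"],
    "https://b.example/": [],
}


def crawl_order(start, max_depth=1, max_pages=100, follow_external=False):
    """Breadth-first walk that honors the three limits described above."""
    start_host = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])  # (url, link-hop distance from start)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # at the depth limit: keep the page, do not expand its links
        for link in LINKS.get(url, []):
            external = urlparse(link).netloc != start_host
            if link in seen or (external and not follow_external):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


print(crawl_order("https://a.example/", max_depth=0))  # start URL only
print(crawl_order("https://a.example/", max_depth=2))
print(crawl_order("https://a.example/", max_depth=2, follow_external=True, max_pages=3))
```

With max_depth=0 only the start URL is visited. Raising the depth to 2 reaches the page two hops away but not its children, and max_pages cuts the walk short no matter how many links are still queued.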
robots.txt
| Name | Type | Default | Description |
|---|---|---|---|
| respect_robots | bool | True | If True, fetch and apply robots.txt for every domain crawled. |
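For readers unfamiliar with what "apply robots.txt" involves, the standard library's `urllib.robotparser` shows the basic check. This is an illustration only: the rules in `ROBOTS_TXT`, the `allowed` helper, and the `"yoink"` user-agent token are assumptions, and nothing here claims yoink uses this module internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a crawler would download this once per domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="yoink"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/docs"))            # True
print(allowed("https://example.com/private/report"))  # False
```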
Rate limiting
| Name | Type | Default | Description |
|---|---|---|---|
| requests_per_second | float | 2.0 | Token bucket fill rate per domain. Validated 0.1..100.0. |
| request_delay | float | 0.0 | Minimum seconds between consecutive requests to the same domain. Validated >= 0. |
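A token bucket can be sketched in a few lines. The class below is a simplified model, not yoink's limiter: `TokenBucket`, its `capacity` parameter, and the rule that the stricter of the two settings wins are all assumptions made for illustration. Time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Per-domain token bucket refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity=1.0, min_delay=0.0):
        self.rate = rate            # plays the role of requests_per_second
        self.capacity = capacity    # burst size; hypothetical, not a documented option
        self.min_delay = min_delay  # plays the role of request_delay
        self.tokens = capacity
        self.updated = 0.0
        self.last_request = None

    def next_allowed(self, now):
        """Reserve a slot and return the earliest time the request may start."""
        t = max(now, self.updated)  # never rewind past an already reserved slot
        # Refill for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.updated) * self.rate)
        start = t
        if self.tokens < 1.0:
            start = t + (1.0 - self.tokens) / self.rate  # wait for one full token
        if self.last_request is not None:
            start = max(start, self.last_request + self.min_delay)
        # Account for the refill that happens while waiting, then spend one token.
        self.tokens = min(self.capacity, self.tokens + (start - t) * self.rate) - 1.0
        self.updated = start
        self.last_request = start
        return start


bucket = TokenBucket(rate=2.0)
print([bucket.next_allowed(0.0) for _ in range(3)])  # [0.0, 0.5, 1.0]

spaced = TokenBucket(rate=2.0, min_delay=1.0)
print([spaced.next_allowed(0.0) for _ in range(3)])  # [0.0, 1.0, 2.0]
```

At the default 2.0 requests per second, three requests queued at the same instant are spread half a second apart. Under the assumption used here, a request_delay of 1.0 overrides that spacing because it is the stricter constraint.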
JavaScript rendering
Requires the [browser] extra.
| Name | Type | Default | Description |
|---|---|---|---|
| render_js | bool | False | Use Playwright instead of the HTTP fetcher. |
| headless | bool | True | Run the browser headlessly (no UI window). |
| wait_strategy | WaitStrategy | NETWORKIDLE | Page load completion signal. One of LOAD, DOMCONTENTLOADED, NETWORKIDLE, COMMIT. |
| wait_selector | str \| None | None | Optional CSS selector to wait for after wait_strategy fires. |
| browser_type | Literal['chromium', 'firefox', 'webkit'] | chromium | Playwright browser engine. |
| browser_pool_size | int | 3 | Number of pooled browser contexts. Validated 1..10. |
| screenshot_dir | str \| None | None | If set, write a debug screenshot per page to this directory. |
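One way to picture browser_pool_size is as a cap on how many pages render at once, however many URLs are waiting. The sketch below models that cap with an `asyncio.Semaphore`. `render_all` and the sleep that stands in for page rendering are hypothetical, and yoink's actual pool may be built differently.

```python
import asyncio


async def render_all(urls, pool_size):
    """Render every URL while holding at most pool_size slots at a time."""
    slots = asyncio.Semaphore(pool_size)
    in_use = 0
    peak = 0

    async def render(url):
        nonlocal in_use, peak
        async with slots:  # blocks while all pooled contexts are busy
            in_use += 1
            peak = max(peak, in_use)
            await asyncio.sleep(0.01)  # stand-in for page load plus wait strategy
            in_use -= 1
            return url

    results = await asyncio.gather(*(render(u) for u in urls))
    return results, peak


urls = [f"https://example.com/{i}" for i in range(8)]
results, peak = asyncio.run(render_all(urls, pool_size=3))
print(len(results), peak)  # 8 3
```

All eight URLs complete, but no more than three are ever in flight, which is the trade-off a small pool makes: lower memory use in exchange for more queueing.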
WaitStrategy enum
from yoink.models import WaitStrategy
WaitStrategy.LOAD # "load"
WaitStrategy.DOMCONTENTLOADED # "domcontentloaded"
WaitStrategy.NETWORKIDLE # "networkidle"
WaitStrategy.COMMIT # "commit"
You can pass a string or an enum value:
config = CrawlConfig(wait_strategy="networkidle") # OK
config = CrawlConfig(wait_strategy=WaitStrategy.NETWORKIDLE) # also OK
Examples
Minimal
config = CrawlConfig(max_depth=2)
Aggressive but polite
config = CrawlConfig(
    max_depth=4,
    max_pages=10_000,
    max_concurrency=20,
    requests_per_second=10.0,
    follow_external=False,
)
SPA crawl with debug screenshots
from yoink.models import WaitStrategy
config = CrawlConfig(
    render_js=True,
    browser_type="chromium",
    wait_strategy=WaitStrategy.NETWORKIDLE,
    wait_selector=".app-content",
    headless=True,
    browser_pool_size=5,
    screenshot_dir="./debug",
)
Loading from environment / config file
CrawlConfig is a standard Pydantic model, so you can use model_validate() with a dict from any source:
import json
from yoink import CrawlConfig
with open("crawl.json") as f:
    raw = json.load(f)
config = CrawlConfig.model_validate(raw)
See also
- Crawler: uses this config.
- Configuration reference: quick-scan view of every option.