Python API

CrawlConfig

Every knob in one place: depth, concurrency, rate limits, robots.txt handling, JS rendering, and the browser pool. Pydantic-validated.

CrawlConfig is a Pydantic model that holds every crawl setting. Validation runs on construction, so out-of-range values (a negative max_depth, a max_concurrency above 100) raise a validation error immediately.

Import

from yoink import CrawlConfig
from yoink.models import WaitStrategy

Core settings

Name	Type	Default	Description
max_depth	int	1	Maximum link-hop distance from the start URL. Depth 0 = start URL only. Validated >= 0.
max_pages	int	100	Hard cap on pages crawled. Validated >= 1.
max_concurrency	int	10	Number of concurrent worker coroutines. Validated 1..100.
user_agent	str	yoink/<ver> (+github)	User-Agent header sent on every request.
timeout	int	30	Per-request timeout in seconds. Validated >= 1.
follow_external	bool	False	If False, drop links whose domain differs from the start URL's.
extract_text	bool	True	Run trafilatura on each page's HTML to populate Page.text.
save_html	bool	False	Persist the raw HTML on each Page record. Drastically increases output size.
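How max_depth, max_pages, and follow_external interact is easiest to see on a toy link graph. The sketch below is not yoink's crawler: LINKS and plan_crawl are hypothetical names, the traversal order is assumed to be breadth-first, and "domain" is assumed to mean an exact hostname match.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory link graph standing in for real HTTP fetches.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
}


def plan_crawl(start, max_depth=1, max_pages=100, follow_external=False):
    """Return (url, depth) pairs in the order a breadth-first crawl visits them."""
    start_host = urlsplit(start).hostname
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:  # max_pages is a hard cap
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # links on this page would be one hop too far
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            # Assumption: "domain" is compared as the exact hostname.
            if not follow_external and urlsplit(link).hostname != start_host:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


start = "https://example.com/"
depth_zero = plan_crawl(start, max_depth=0)
depth_one = plan_crawl(start, max_depth=1)
with_external = plan_crawl(start, max_depth=1, follow_external=True)
capped = plan_crawl(start, max_depth=5, max_pages=2)
```

With max_depth=0 only the start URL is visited, and with the defaults the link to other.org is dropped before it is ever queued.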

robots.txt

Name	Type	Default	Description
respect_robots	bool	True	If True, fetch and apply robots.txt for every domain crawled.
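To illustrate what applying robots.txt means in practice, here is a small sketch using the standard library's urllib.robotparser on an inline rules file. Whether yoink uses this module internally is not stated here; ROBOTS_TXT, the allowed helper, and the "yoink" agent token are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real crawl would download /robots.txt from each domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="yoink"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


docs_ok = allowed("https://example.com/docs/")
private_ok = allowed("https://example.com/private/data")
```

Here docs_ok is True and private_ok is False, so a crawler that respects the rules would skip the second URL.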

Rate limiting

Name	Type	Default	Description
requests_per_second	float	2.0	Token bucket fill rate per domain. Validated 0.1..100.0.
request_delay	float	0.0	Minimum seconds between consecutive requests to the same domain. Validated >= 0.
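The two settings can be pictured as a per-domain token bucket combined with a minimum gap between requests. The sketch below is one possible reading, not yoink's implementation: DomainLimiter, reserve, and the burst parameter (bucket capacity) are hypothetical, and it assumes the stricter of the two limits wins.

```python
class DomainLimiter:
    """Token bucket plus a minimum gap between requests, for one domain."""

    def __init__(self, requests_per_second=2.0, request_delay=0.0, burst=1.0):
        self.rate = requests_per_second  # tokens added per second
        self.delay = request_delay       # minimum gap between requests
        self.capacity = burst            # assumed bucket size (not documented)
        self.tokens = burst
        self.updated = 0.0               # time of the last refill
        self.last_sent = None            # time of the last granted request

    def _refill(self, now):
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def reserve(self, now):
        """Return the earliest time >= now at which the next request may go out."""
        now = max(now, self.updated)  # never move the clock backwards
        self._refill(now)
        send_at = now
        if self.tokens < 1.0:
            # Wait until the bucket holds one full token.
            send_at += (1.0 - self.tokens) / self.rate
        if self.last_sent is not None:
            # Enforce the minimum gap on top of the bucket.
            send_at = max(send_at, self.last_sent + self.delay)
        self._refill(send_at)
        self.tokens -= 1.0
        self.last_sent = send_at
        return send_at


bucket_only = DomainLimiter(requests_per_second=2.0)
schedule = [bucket_only.reserve(0.0) for _ in range(3)]

with_delay = DomainLimiter(requests_per_second=2.0, request_delay=1.0)
delayed = [with_delay.reserve(0.0) for _ in range(3)]
```

At 2 requests per second the three requests are spaced 0.5 s apart; adding request_delay=1.0 stretches the spacing to 1 s because the delay is the stricter limit.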

JavaScript rendering

Requires the [browser] extra.

Name	Type	Default	Description
render_js	bool	False	Use Playwright instead of the HTTP fetcher.
headless	bool	True	Run the browser headlessly (no UI window).
wait_strategy	WaitStrategy	NETWORKIDLE	Page load completion signal. One of LOAD, DOMCONTENTLOADED, NETWORKIDLE, COMMIT.
wait_selector	str | None	None	Optional CSS selector to wait for after wait_strategy fires.
browser_type	Literal['chromium', 'firefox', 'webkit']	chromium	Playwright browser engine.
browser_pool_size	int	3	Number of pooled browser contexts. Validated 1..10.
screenshot_dir	str | None	None	If set, write a debug screenshot per page to this directory.
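A pool of size N means at most N pages render at once, however many URLs are queued. The following sketch shows that idea with an asyncio.Queue of placeholder strings instead of real Playwright contexts. crawl_with_pool and its internals are hypothetical and say nothing about how yoink manages its pool.

```python
import asyncio


async def crawl_with_pool(urls, browser_pool_size=3):
    """Process every URL while using at most browser_pool_size contexts at once."""
    pool = asyncio.Queue()
    for i in range(browser_pool_size):
        pool.put_nowait(f"context-{i}")  # placeholder for a browser context

    in_use = 0
    peak = 0

    async def render(url):
        nonlocal in_use, peak
        context = await pool.get()  # waits while every context is busy
        in_use += 1
        peak = max(peak, in_use)
        try:
            await asyncio.sleep(0.01)  # stand-in for navigation and waiting
            return url, context
        finally:
            in_use -= 1
            pool.put_nowait(context)  # hand the context back for reuse

    results = await asyncio.gather(*(render(u) for u in urls))
    return results, peak


urls = [f"https://example.com/{i}" for i in range(10)]
results, peak = asyncio.run(crawl_with_pool(urls, browser_pool_size=3))
```

All ten URLs complete, but peak never exceeds the pool size.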

`WaitStrategy` enum

from yoink.models import WaitStrategy
 
WaitStrategy.LOAD              # "load"
WaitStrategy.DOMCONTENTLOADED  # "domcontentloaded"
WaitStrategy.NETWORKIDLE       # "networkidle"
WaitStrategy.COMMIT            # "commit"

You can pass a string or an enum value:

config = CrawlConfig(wait_strategy="networkidle")  # OK
config = CrawlConfig(wait_strategy=WaitStrategy.NETWORKIDLE)  # also OK
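Strings work because an enum member can be looked up by its value. The stand-in below mirrors the members listed above to show that lookup. It is not the class from yoink.models, and deriving from str is an assumption.

```python
from enum import Enum


class WaitStrategy(str, Enum):
    """Stand-in with the same member names and values as the listing above."""

    LOAD = "load"
    DOMCONTENTLOADED = "domcontentloaded"
    NETWORKIDLE = "networkidle"
    COMMIT = "commit"


def coerce(value):
    """Accept a member or its string value; raise ValueError for anything else."""
    return WaitStrategy(value)


from_string = coerce("networkidle")
from_member = coerce(WaitStrategy.LOAD)
```

The lookup is by value, so in this stand-in the lowercase string "networkidle" resolves to a member while the member name "NETWORKIDLE" raises ValueError.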

Examples

Minimal

config = CrawlConfig(max_depth=2)

Aggressive but polite

config = CrawlConfig(
    max_depth=4,
    max_pages=10_000,
    max_concurrency=20,
    requests_per_second=10.0,
    follow_external=False,
)

SPA crawl with debug screenshots

from yoink import CrawlConfig
from yoink.models import WaitStrategy

config = CrawlConfig(
    render_js=True,
    browser_type="chromium",
    wait_strategy=WaitStrategy.NETWORKIDLE,
    wait_selector=".app-content",
    headless=True,
    browser_pool_size=5,
    screenshot_dir="./debug",
)

Loading from environment / config file

CrawlConfig is a standard Pydantic model, so you can use model_validate() with a dict from any source:

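import json
from yoink import CrawlConfig

# Read the settings as a plain dict; any source that yields a dict works.
with open("crawl.json", encoding="utf-8") as f:
    raw = json.load(f)

# Validate the dict and build the config in one step.
config = CrawlConfig.model_validate(raw)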
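For the environment case, one option is to collect prefixed variables into a dict and pass that to model_validate(). The YOINK_ prefix and the settings_from_env helper below are hypothetical; yoink itself does not define them here. Values stay as strings, on the assumption that the model uses Pydantic's default (non-strict) mode, which converts strings such as "3" and "true" to int and bool.

```python
import os

PREFIX = "YOINK_"  # hypothetical prefix, not defined by yoink


def settings_from_env(environ=None):
    """Collect PREFIX-ed variables into a dict keyed by lowercase field name."""
    if environ is None:
        environ = os.environ
    return {
        key[len(PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }


# A literal mapping is used here so the example does not depend on the shell.
raw = settings_from_env(
    {"YOINK_MAX_DEPTH": "3", "YOINK_RENDER_JS": "true", "HOME": "/root"}
)
# raw can then be passed on: CrawlConfig.model_validate(raw)
```

Unrelated variables such as HOME are ignored, and the remaining keys line up with CrawlConfig field names.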