Python API

CrawlConfig

Every knob — depth, concurrency, rate limit, robots, JS rendering, browser pool. Pydantic-validated.


CrawlConfig is a Pydantic model that captures every dial you can turn. Validation runs on construction, so out-of-range values (a negative max_depth, a max_concurrency above 100) fail immediately, before any request is sent.
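To make the fail-fast behaviour concrete, here is a minimal sketch built on a stand-in model. `DemoConfig` and `invalid_fields` are hypothetical names used only for illustration. The model mirrors two of the documented constraints but is not yoink's actual class, so the errors CrawlConfig reports may be worded differently.

```python
from pydantic import BaseModel, Field, ValidationError


class DemoConfig(BaseModel):
    """Hypothetical stand-in mirroring two documented CrawlConfig constraints."""

    max_depth: int = Field(default=1, ge=0)
    max_concurrency: int = Field(default=10, ge=1, le=100)


def invalid_fields(**kwargs) -> list[str]:
    """Return the names of the fields that fail validation, sorted."""
    try:
        DemoConfig(**kwargs)
    except ValidationError as exc:
        # Each error dict carries the offending field name in its "loc" tuple.
        return sorted(str(err["loc"][0]) for err in exc.errors())
    return []


# Both bad values are reported together, at construction time.
print(invalid_fields(max_depth=-1, max_concurrency=500))
```

Pydantic collects every failing field into a single ValidationError, so one try/except around construction is enough to surface all problems at once.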

Import

from yoink import CrawlConfig
from yoink.models import WaitStrategy

Core settings

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| max_depth | int | 1 | Maximum link-hop distance from the start URL. Depth 0 = start URL only. Validated >= 0. |
| max_pages | int | 100 | Hard cap on pages crawled. Validated >= 1. |
| max_concurrency | int | 10 | Number of concurrent worker coroutines. Validated 1..100. |
| user_agent | str | yoink/<ver> (+github) | User-Agent header sent on every request. |
| timeout | int | 30 | Per-request timeout in seconds. Validated >= 1. |
| follow_external | bool | False | If False, drop links whose domain differs from the start URL's. |
| extract_text | bool | True | Run trafilatura on each page's HTML to populate Page.text. |
| save_html | bool | False | Persist the raw HTML on each Page record. Drastically increases output size. |
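The interaction between max_depth and max_pages can be sketched as a breadth-first walk over an in-memory link graph. This is an illustration of the documented semantics (depth 0 is the start URL only, max_pages is a hard cap), not yoink's crawler; `bounded_crawl` and the `site` graph are hypothetical.

```python
from collections import deque


def bounded_crawl(links, start, max_depth=1, max_pages=100):
    """Breadth-first walk over `links` (URL -> list of linked URLs).

    Returns (url, depth) pairs in the order they were crawled.
    """
    crawled = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        crawled.append((url, depth))
        if depth >= max_depth:
            continue  # at the depth limit: crawl the page but do not follow its links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return crawled


site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/"], "/a/1": []}
print(bounded_crawl(site, "/", max_depth=0))
print(bounded_crawl(site, "/", max_depth=2, max_pages=2))
```

With max_depth=0 only the start page is visited. With a generous depth but max_pages=2, the walk stops after the second page regardless of how many links remain queued.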

robots.txt

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| respect_robots | bool | True | If True, fetch and apply robots.txt for every domain crawled. |
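For a sense of what applying robots.txt means, the standard library's `urllib.robotparser` can evaluate a rule set offline. This is only a sketch of the general mechanism. Whether yoink uses this module internally is not stated here, and `is_allowed` and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network fetch
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "yoink", "https://example.com/docs"))
print(is_allowed(ROBOTS_TXT, "yoink", "https://example.com/private/report"))
```

A crawler that respects robots.txt would skip any URL for which a check like this returns False.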

Rate limiting

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| requests_per_second | float | 2.0 | Token bucket fill rate per domain. Validated 0.1..100.0. |
| request_delay | float | 0.0 | Minimum seconds between consecutive requests to the same domain. Validated >= 0. |
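The token bucket idea behind requests_per_second can be sketched in a few lines. This is a generic, single-domain illustration, not yoink's rate limiter. The `TokenBucket` class, its `capacity` parameter, and the injectable `clock` are assumptions made so the example can run without sleeping.

```python
import time


class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity=1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so the first request goes out at once
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return True on success."""
        now = self.clock()
        # Credit the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Fake clock: at 2 tokens/second, a new request is permitted every 0.5 s.
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, clock=lambda: fake_now[0])
first = bucket.try_acquire()
fake_now[0] = 0.25
too_soon = bucket.try_acquire()
fake_now[0] = 0.5
after_refill = bucket.try_acquire()
print(first, too_soon, after_refill)
```

A per-domain limiter would keep one such bucket per domain, for example in a dict keyed by hostname.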

JavaScript rendering

Requires the [browser] extra.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| render_js | bool | False | Use Playwright instead of the HTTP fetcher. |
| headless | bool | True | Run the browser headlessly (no UI window). |
| wait_strategy | WaitStrategy | NETWORKIDLE | Page load completion signal. One of LOAD, DOMCONTENTLOADED, NETWORKIDLE, COMMIT. |
| wait_selector | str \| None | None | Optional CSS selector to wait for after wait_strategy fires. |
| browser_type | Literal['chromium', 'firefox', 'webkit'] | chromium | Playwright browser engine. |
| browser_pool_size | int | 3 | Number of pooled browser contexts. Validated 1..10. |
| screenshot_dir | str \| None | None | If set, write a debug screenshot per page to this directory. |

WaitStrategy enum

from yoink.models import WaitStrategy
 
WaitStrategy.LOAD              # "load"
WaitStrategy.DOMCONTENTLOADED  # "domcontentloaded"
WaitStrategy.NETWORKIDLE       # "networkidle"
WaitStrategy.COMMIT            # "commit"

You can pass a string or an enum value:

config = CrawlConfig(wait_strategy="networkidle")  # OK
config = CrawlConfig(wait_strategy=WaitStrategy.NETWORKIDLE)  # also OK
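Both forms work because enum members can be looked up by value, which is how Pydantic validates enum-typed fields. A stdlib-only sketch of that lookup follows. `DemoWaitStrategy` is a hypothetical stand-in with the same member values, and `coerce` is an illustrative helper, not part of yoink.

```python
from enum import Enum


class DemoWaitStrategy(Enum):
    """Hypothetical stand-in with the same member values as WaitStrategy."""

    LOAD = "load"
    DOMCONTENTLOADED = "domcontentloaded"
    NETWORKIDLE = "networkidle"
    COMMIT = "commit"


def coerce(value):
    """Accept a member or its string value; raise ValueError for anything else."""
    return DemoWaitStrategy(value)


print(coerce("networkidle") is DemoWaitStrategy.NETWORKIDLE)
print(coerce(DemoWaitStrategy.COMMIT) is DemoWaitStrategy.COMMIT)
```

In this stand-in the lookup is by value and case-sensitive, so the member name "NETWORKIDLE" is rejected while the value "networkidle" is accepted.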

Examples

Minimal

config = CrawlConfig(max_depth=2)

Aggressive but polite

config = CrawlConfig(
    max_depth=4,
    max_pages=10_000,
    max_concurrency=20,
    requests_per_second=10.0,
    follow_external=False,
)

SPA crawl with debug screenshots

from yoink import CrawlConfig
from yoink.models import WaitStrategy

config = CrawlConfig(
    render_js=True,
    browser_type="chromium",
    wait_strategy=WaitStrategy.NETWORKIDLE,
    wait_selector=".app-content",
    headless=True,
    browser_pool_size=5,
    screenshot_dir="./debug",
)

Loading from environment / config file

CrawlConfig is a standard Pydantic model, so you can use model_validate() with a dict from any source:

import json
from yoink import CrawlConfig
 
with open("crawl.json") as f:
    raw = json.load(f)
 
config = CrawlConfig.model_validate(raw)
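The same approach extends to environment variables: collect them into a dict and hand it to model_validate(). The sketch below assumes a `YOINK_` prefix and a `settings_from_env` helper, both hypothetical and not a documented yoink convention. It also relies on Pydantic's default (non-strict) coercion to turn strings such as "3" or "true" into int and bool. If CrawlConfig were configured as strict, the values would need converting first.

```python
import os

ENV_PREFIX = "YOINK_"  # assumed prefix, not a documented yoink convention


def settings_from_env(environ=None, prefix=ENV_PREFIX):
    """Collect PREFIX_* variables into a dict keyed by lower-cased field name."""
    environ = os.environ if environ is None else environ
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


# A plain dict stands in for os.environ so the example is reproducible.
raw = settings_from_env(
    {"YOINK_MAX_DEPTH": "3", "YOINK_RENDER_JS": "true", "PATH": "/usr/bin"}
)
print(raw)

# With yoink installed, the dict can then be validated as usual:
# config = CrawlConfig.model_validate(raw)
```

Because validation still runs inside model_validate(), a bad environment value is rejected the same way a bad keyword argument is.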

See also