CLI

yoink crawl

Complete reference for the yoink crawl command — every flag, every option, with examples.

yoink crawl is the workhorse: it takes a URL, fetches pages, and writes the result.

yoink crawl URL [OPTIONS]

Examples

# The minimum
yoink crawl https://example.com
 
# Reasonable defaults for a small docs crawl
yoink crawl https://docs.example.com -d 2 -n 200 -o docs.jsonl
 
# JS-heavy SPA, output as Parquet
yoink crawl https://spa.com --render-js --format parquet -o data.parquet
 
# Resumable to S3
yoink crawl https://example.com \
  --checkpoint s3://my-bucket/crawl.jsonl \
  --resume \
  --rate-limit 5 \
  --depth 3

Core options

Name	Type	Default	Description
URL*	argument	—	The starting URL of the crawl. Must include scheme (http:// or https://).
--depth, -d	INTEGER	1	Maximum crawl depth from the start URL. Depth 0 = start URL only.
--max-pages, -n	INTEGER	100	Hard cap on total pages crawled across the whole run.
--concurrency, -c	INTEGER	10	Number of concurrent worker coroutines fetching pages.
--format, -f	json \| jsonl \| parquet \| text	jsonl	Output format. parquet requires the [parquet] extra.
--output, -o	PATH	crawl_output.<format>	Output file path. Skipped if --checkpoint is set without --output.
--follow-external	FLAG	false	Follow links to domains other than the start URL's domain.
--save-html	FLAG	false	Persist raw HTML on each Page record (large output).
--user-agent	TEXT	yoink/<ver> (+github)	Custom User-Agent string sent on every request.

URL filtering

Name	Type	Default	Description
--include	TEXT (repeatable)	—	Glob/regex/substring patterns. URL must match at least one to be queued.
--exclude	TEXT (repeatable)	—	Glob/regex/substring patterns. URL is dropped if it matches any.
--skip-extensions	TEXT	—	Comma-separated extensions to skip (e.g., pdf,zip,exe). No leading dot needed.

See URL filtering for pattern semantics.

Checkpointing

Name	Type	Default	Description
--checkpoint	TEXT	—	Local path or s3://bucket/key to enable resumable checkpointing.
--checkpoint-interval	INTEGER	10	Number of pages between state flushes. Higher = less I/O.
--resume	FLAG	false	Pick up where the checkpoint left off. Requires --checkpoint.

See Checkpointing for details.

Rate limiting

Name	Type	Default	Description
--rate-limit, -r	FLOAT	2.0	Maximum requests per second per domain (token bucket fill rate).
--request-delay	FLOAT	0.0	Minimum seconds between consecutive requests to the same domain.

See Rate limiting.

robots.txt

Name	Type	Default	Description
--no-robots	FLAG	false	Skip robots.txt entirely. Use responsibly — see warning in docs.

See robots.txt compliance.

JavaScript rendering

Requires the [browser] extra (pip install "yoink[browser]" and playwright install chromium).

Name	Type	Default	Description
--render-js, --browser	FLAG	false	Use Playwright instead of the HTTP fetcher.
--wait-for	load \| domcontentloaded \| networkidle \| commit	networkidle	Page load wait strategy. See JavaScript rendering docs.
--wait-selector	TEXT	—	CSS selector to wait for after wait_strategy fires.
--browser-type	chromium \| firefox \| webkit	chromium	Browser engine to launch.
--no-headless	FLAG	false	Show the browser window (debugging).

See JavaScript rendering.

Output

By default, yoink prints progress to stderr and a summary to stdout when finished:

Yoinking https://example.com...
Max depth: 2, Max pages: 100, Concurrency: 10
Rate limit: 2.0 req/s, Robots.txt: enabled
Yoinking pages: 100%|████████| 87/100 [00:42<00:00,  2.07page/s]
Yoinked 87 pages to crawl_output.jsonl
Total links found: 1,243
Total text extracted: 412,891 characters

Pipe stderr away if you only want the summary:

yoink crawl https://example.com 2>/dev/null

Exit codes

0 — always. The CLI prints errors to stderr but currently exits 0 on every code path, including bad config (e.g., --resume without --checkpoint) and write errors. If you script around yoink crawl and need to detect failure, scan stderr or check that the output file exists and is non-empty.