yoink crawl
Complete reference for the yoink crawl command — every flag, every option, with examples.
yoink crawl is the workhorse: it takes a URL, fetches pages, and writes the result.
yoink crawl URL [OPTIONS]Examples
# The minimum
yoink crawl https://example.com
# Reasonable defaults for a small docs crawl
yoink crawl https://docs.example.com -d 2 -n 200 -o docs.jsonl
# JS-heavy SPA, output as Parquet
yoink crawl https://spa.com --render-js --format parquet -o data.parquet
# Resumable to S3
yoink crawl https://example.com \
--checkpoint s3://my-bucket/crawl.jsonl \
--resume \
--rate-limit 5 \
--depth 3Core options
| Name | Type | Default | Description |
|---|---|---|---|
| URL* | argument | — | The starting URL of the crawl. Must include scheme (http:// or https://). |
| --depth, -d | INTEGER | 1 | Maximum crawl depth from the start URL. Depth 0 = start URL only. |
| --max-pages, -n | INTEGER | 100 | Hard cap on total pages crawled across the whole run. |
| --concurrency, -c | INTEGER | 10 | Number of concurrent worker coroutines fetching pages. |
| --format, -f | json | jsonl | parquet | text | jsonl | Output format. parquet requires the [parquet] extra. |
| --output, -o | PATH | crawl_output.<format> | Output file path. Skipped if --checkpoint is set without --output. |
| --follow-external | FLAG | false | Follow links to domains other than the start URL's domain. |
| --save-html | FLAG | false | Persist raw HTML on each Page record (large output). |
| --user-agent | TEXT | yoink/<ver> (+github) | Custom User-Agent string sent on every request. |
URL filtering
| Name | Type | Default | Description |
|---|---|---|---|
| --include | TEXT (repeatable) | — | Glob/regex/substring patterns. URL must match at least one to be queued. |
| --exclude | TEXT (repeatable) | — | Glob/regex/substring patterns. URL is dropped if it matches any. |
| --skip-extensions | TEXT | — | Comma-separated extensions to skip (e.g., pdf,zip,exe). No leading dot needed. |
See URL filtering for pattern semantics.
Checkpointing
| Name | Type | Default | Description |
|---|---|---|---|
| --checkpoint | TEXT | — | Local path or s3://bucket/key to enable resumable checkpointing. |
| --checkpoint-interval | INTEGER | 10 | Number of pages between state flushes. Higher = less I/O. |
| --resume | FLAG | false | Pick up where the checkpoint left off. Requires --checkpoint. |
See Checkpointing for details.
Rate limiting
| Name | Type | Default | Description |
|---|---|---|---|
| --rate-limit, -r | FLOAT | 2.0 | Maximum requests per second per domain (token bucket fill rate). |
| --request-delay | FLOAT | 0.0 | Minimum seconds between consecutive requests to the same domain. |
See Rate limiting.
robots.txt
| Name | Type | Default | Description |
|---|---|---|---|
| --no-robots | FLAG | false | Skip robots.txt entirely. Use responsibly — see warning in docs. |
JavaScript rendering
Requires the [browser] extra (pip install "yoink[browser]" and playwright install chromium).
| Name | Type | Default | Description |
|---|---|---|---|
| --render-js, --browser | FLAG | false | Use Playwright instead of the HTTP fetcher. |
| --wait-for | load | domcontentloaded | networkidle | commit | networkidle | Page load wait strategy. See JavaScript rendering docs. |
| --wait-selector | TEXT | — | CSS selector to wait for after wait_strategy fires. |
| --browser-type | chromium | firefox | webkit | chromium | Browser engine to launch. |
| --no-headless | FLAG | false | Show the browser window (debugging). |
See JavaScript rendering.
Output
By default, yoink prints progress to stderr and a summary to stdout when finished:
Yoinking https://example.com...
Max depth: 2, Max pages: 100, Concurrency: 10
Rate limit: 2.0 req/s, Robots.txt: enabled
Yoinking pages: 100%|████████| 87/100 [00:42<00:00, 2.07page/s]
Yoinked 87 pages to crawl_output.jsonl
Total links found: 1,243
Total text extracted: 412,891 characters
Pipe stderr away if you only want the summary:
yoink crawl https://example.com 2>/dev/nullExit codes
0— always. The CLI prints errors to stderr but currently exits0on every code path, including bad config (e.g.,--resumewithout--checkpoint) and write errors. If you script aroundyoink crawland need to detect failure, scan stderr or check that the output file exists and is non-empty.
See also
yoink stats— analyze the output of a crawl.- Quickstart — concrete examples.