CLI

yoink crawl

Complete reference for the yoink crawl command — every flag, every option, with examples.

docs/cli/crawl.mdx·edit on github ↗·

yoink crawl is the workhorse: it takes a URL, fetches pages, and writes the result.

yoink crawl URL [OPTIONS]

Examples

# The minimum
yoink crawl https://example.com
 
# Reasonable defaults for a small docs crawl
yoink crawl https://docs.example.com -d 2 -n 200 -o docs.jsonl
 
# JS-heavy SPA, output as Parquet
yoink crawl https://spa.com --render-js --format parquet -o data.parquet
 
# Resumable to S3
yoink crawl https://example.com \
  --checkpoint s3://my-bucket/crawl.jsonl \
  --resume \
  --rate-limit 5 \
  --depth 3

Core options

NameTypeDefaultDescription
URL*argumentThe starting URL of the crawl. Must include scheme (http:// or https://).
--depth, -dINTEGER1Maximum crawl depth from the start URL. Depth 0 = start URL only.
--max-pages, -nINTEGER100Hard cap on total pages crawled across the whole run.
--concurrency, -cINTEGER10Number of concurrent worker coroutines fetching pages.
--format, -fjson | jsonl | parquet | textjsonlOutput format. parquet requires the [parquet] extra.
--output, -oPATHcrawl_output.<format>Output file path. Skipped if --checkpoint is set without --output.
--follow-externalFLAGfalseFollow links to domains other than the start URL's domain.
--save-htmlFLAGfalsePersist raw HTML on each Page record (large output).
--user-agentTEXTyoink/<ver> (+github)Custom User-Agent string sent on every request.

URL filtering

NameTypeDefaultDescription
--includeTEXT (repeatable)Glob/regex/substring patterns. URL must match at least one to be queued.
--excludeTEXT (repeatable)Glob/regex/substring patterns. URL is dropped if it matches any.
--skip-extensionsTEXTComma-separated extensions to skip (e.g., pdf,zip,exe). No leading dot needed.

See URL filtering for pattern semantics.

Checkpointing

NameTypeDefaultDescription
--checkpointTEXTLocal path or s3://bucket/key to enable resumable checkpointing.
--checkpoint-intervalINTEGER10Number of pages between state flushes. Higher = less I/O.
--resumeFLAGfalsePick up where the checkpoint left off. Requires --checkpoint.

See Checkpointing for details.

Rate limiting

NameTypeDefaultDescription
--rate-limit, -rFLOAT2.0Maximum requests per second per domain (token bucket fill rate).
--request-delayFLOAT0.0Minimum seconds between consecutive requests to the same domain.

See Rate limiting.

robots.txt

NameTypeDefaultDescription
--no-robotsFLAGfalseSkip robots.txt entirely. Use responsibly — see warning in docs.

See robots.txt compliance.

JavaScript rendering

Requires the [browser] extra (pip install "yoink[browser]" and playwright install chromium).

NameTypeDefaultDescription
--render-js, --browserFLAGfalseUse Playwright instead of the HTTP fetcher.
--wait-forload | domcontentloaded | networkidle | commitnetworkidlePage load wait strategy. See JavaScript rendering docs.
--wait-selectorTEXTCSS selector to wait for after wait_strategy fires.
--browser-typechromium | firefox | webkitchromiumBrowser engine to launch.
--no-headlessFLAGfalseShow the browser window (debugging).

See JavaScript rendering.

Output

By default, yoink prints progress to stderr and a summary to stdout when finished:

Yoinking https://example.com...
Max depth: 2, Max pages: 100, Concurrency: 10
Rate limit: 2.0 req/s, Robots.txt: enabled
Yoinking pages: 100%|████████| 87/100 [00:42<00:00,  2.07page/s]
Yoinked 87 pages to crawl_output.jsonl
Total links found: 1,243
Total text extracted: 412,891 characters

Pipe stderr away if you only want the summary:

yoink crawl https://example.com 2>/dev/null

Exit codes

  • 0 — always. The CLI prints errors to stderr but currently exits 0 on every code path, including bad config (e.g., --resume without --checkpoint) and write errors. If you script around yoink crawl and need to detect failure, scan stderr or check that the output file exists and is non-empty.

See also