async by default
aiohttp-based concurrency with configurable workers. Hundreds of pages per minute on a single laptop, polite by default.
A focused Python crawler with rate limiting, robots.txt compliance, JS rendering, and resumable S3 checkpoints. ~3,200 lines, 134 tests, zero ceremony.
# Install
$ pip install yoink
# Crawl & save JSONL
$ yoink crawl https://docs.example.com --depth 2 -o data.jsonl
Yoinking pages: 100% ████████ 87/100
╰─► Yoinked 87 pages → data.jsonl
# Analyze
$ yoink stats data.jsonl --json | jq '.top_domains'
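If you would rather inspect the JSONL output from Python than through `jq`, a few lines of standard library code are enough. This is a minimal sketch, not yoink's own `stats` implementation: it assumes each record carries a `url` field, which is a guess about the output schema, and `top_domains` is a hypothetical helper name.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def top_domains(lines, n=5):
    """Count hostnames across JSONL records.

    Assumes every record has a "url" key (an assumption about the schema).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        host = urlsplit(record["url"]).hostname
        if host:
            counts[host] += 1
    return counts.most_common(n)


# Inline sample records stand in for the lines of data.jsonl.
sample = [
    '{"url": "https://docs.example.com/a"}',
    '{"url": "https://docs.example.com/b"}',
    '{"url": "https://blog.example.com/x"}',
]
print(top_domains(sample))
```

For a real crawl, pass an open file handle instead of `sample`; the function reads line by line, so it never loads the whole file into memory.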
~3,200 lines of focused Python.
The rest is battle-tested libraries.
Trafilatura-powered extraction returns clean prose with the chrome stripped. Pipe straight into your training set.
Per-domain rate limits that honor robots.txt Crawl-delay. Be a good citizen without thinking about it.
Append-only checkpoints to disk or S3. Survives Lambda timeouts, OOM kills, and Ctrl-C — pick right back up.
Drop-in Playwright for SPAs. Chromium / Firefox / WebKit, pooled contexts, smart wait strategies.
JSON, JSONL, Parquet, plain text. Stream millions of pages or load straight into pandas — no bespoke schema.
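The per-domain limits with crawl-delay honoring mentioned above can be pictured roughly as follows. This is a sketch under stated assumptions rather than yoink's actual code: `PerDomainThrottle` and `delay_from_robots` are hypothetical names, the one-second default is arbitrary, and the standard library's `urllib.robotparser` stands in for whatever parser the project uses.

```python
import asyncio
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def delay_from_robots(robots_txt, user_agent, default=1.0):
    """Return the Crawl-delay for user_agent, or `default` if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class PerDomainThrottle:
    """Spaces out requests to the same host; different hosts do not block each other."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self._delays = {}   # host -> seconds between requests
        self._locks = {}    # host -> asyncio.Lock
        self._next_ok = {}  # host -> earliest monotonic time for the next request

    def set_delay(self, host, delay):
        self._delays[host] = delay

    async def wait(self, url):
        host = urlsplit(url).hostname or ""
        # Locks are created lazily here, inside the running event loop.
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:
            now = time.monotonic()
            next_ok = self._next_ok.get(host, 0.0)
            if now < next_ok:
                await asyncio.sleep(next_ok - now)
            delay = self._delays.get(host, self.default_delay)
            self._next_ok[host] = time.monotonic() + delay
```

A worker would call `await throttle.wait(url)` immediately before each request, after seeding `set_delay` from the host's robots.txt.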
Each module does one thing. The crawler is the conductor. Swap the fetcher, storage backend, or extractor without forking — every seam is an interface, not magic.
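One way to picture "every seam is an interface" is with `typing.Protocol`. The sketch below is illustrative only: the protocol names (`Fetcher`, `Extractor`, `Storage`), their methods, and the `crawl` function are assumptions made for this example and are not taken from yoink's source.

```python
import asyncio
from typing import Dict, Iterable, Protocol


class Fetcher(Protocol):
    async def fetch(self, url: str) -> str: ...


class Extractor(Protocol):
    def extract(self, html: str) -> str: ...


class Storage(Protocol):
    def save(self, url: str, text: str) -> None: ...


async def crawl(urls: Iterable[str], fetcher: Fetcher,
                extractor: Extractor, storage: Storage) -> None:
    """The conductor: it only knows the three interfaces, not the implementations."""
    for url in urls:
        html = await fetcher.fetch(url)
        storage.save(url, extractor.extract(html))


# Stand-in implementations: any object with matching methods can be swapped in.
class StubFetcher:
    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    async def fetch(self, url: str) -> str:
        return self.pages[url]


class StripExtractor:
    def extract(self, html: str) -> str:
        return html.strip()


class MemoryStorage:
    def __init__(self) -> None:
        self.items: Dict[str, str] = {}

    def save(self, url: str, text: str) -> None:
        self.items[url] = text


store = MemoryStorage()
pages = {"https://docs.example.com/": "  hello  "}
asyncio.run(crawl(pages, StubFetcher(pages), StripExtractor(), store))
```

Because the protocols are structural, a replacement fetcher or storage backend does not need to inherit from anything; it only has to provide the same method.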
Architecture deep-dive →
Crawl docs sites, mirror knowledge bases, build embedding indexes: clean text out, no boilerplate.
S3 checkpoints plus a 14-minute time budget (just under Lambda's 15-minute limit) let a crawl stop cleanly and resume on the next invocation, indefinitely.
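The checkpoint-plus-budget idea can be sketched in a few lines. This is an illustration under assumptions, not yoink's implementation: a local append-only file stands in for S3, `crawl_with_budget` and `load_done` are hypothetical names, and the injectable `clock` exists only so the demo can simulate time passing.

```python
import json
import os
import tempfile
import time


def load_done(path):
    """Read the append-only checkpoint and return the URLs already crawled."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return {json.loads(line)["url"] for line in fh if line.strip()}


def crawl_with_budget(urls, path, budget_seconds, fetch, clock=time.monotonic):
    """Crawl until done or out of time. Returns True when every URL is finished."""
    done = load_done(path)
    deadline = clock() + budget_seconds
    with open(path, "a", encoding="utf-8") as fh:
        for url in urls:
            if url in done:
                continue  # finished in an earlier invocation
            if clock() >= deadline:
                return False  # stop cleanly; the next invocation resumes here
            fetch(url)
            fh.write(json.dumps({"url": url}) + "\n")
            fh.flush()  # one line per finished URL, so a kill loses at most one page
    return True


# Demo with a fake clock: each simulated fetch "takes" 5 seconds.
now = [0.0]
fetched = []


def fake_fetch(url):
    fetched.append(url)
    now[0] += 5.0


urls = ["a", "b", "c", "d"]
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "checkpoint.jsonl")
    first = crawl_with_budget(urls, checkpoint, 10, fake_fetch, clock=lambda: now[0])
    second = crawl_with_budget(urls, checkpoint, 10, fake_fetch, clock=lambda: now[0])
```

In the demo the first call runs out of budget after two pages and the second call picks up the remaining two, so no page is fetched twice. In a Lambda setting the budget would be about 14 minutes and the checkpoint would live in S3.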
Parquet output drops straight into pandas / DuckDB / Athena. No schema gymnastics.
One pip install. Zero config required. Read the quickstart and have a crawl running before your coffee gets cold.