Introduction
A fast, async Python web crawler for extracting AI-ready data from public websites.
yoink is a focused, well-tested Python crawler that turns public websites into clean, structured data. It's the tool you reach for when you want to build a training set, mirror a documentation site, audit an API surface, or run a research crawl — without hand-rolling the boring parts.
What's in the box
- Async architecture built on
aiohttpwith configurable concurrency - Clean text extraction via trafilatura — no nav chrome, no boilerplate
- Per-domain rate limiting using a token bucket with burst support
- robots.txt compliance out of the box, including
Crawl-delayandSitemapdirectives - JavaScript rendering via Playwright for SPAs (optional extra)
- Resumable crawls with append-only checkpoints to disk or S3
- URL filtering with glob, regex, and extension matching
- First-class output formats — JSON, JSONL, Parquet, plain text
- Built-in stats for inspecting what you yoinked
Design principles
- Polite by default. Respects
robots.txt, identifies itself, rate-limits per domain, stays on the start domain. - Pluggable, not magic. Swap fetchers, storage backends, filters, and extractors without forking the crawler.
- Resumable, always. Long crawls die. Lambda runs time out. yoink should pick up where it left off.
- Output is the product. Clean JSONL/Parquet that drops straight into your pipeline beats a fancy CLI.
When to use yoink
✅ Good fit
- You want a few hundred to a few hundred thousand public pages, fast.
- You're feeding an LLM, building an embedding index, or training a model.
- You're mirroring documentation, doing SEO research, or running content analysis.
- You're shipping a Lambda job that needs to survive restarts.
❌ Not the right tool
- You need to log in, solve CAPTCHAs, or scrape at adversarial sites that explicitly forbid it.
- You want a UI-driven scraping product. yoink is a library + CLI.
- You need millions of pages a day at sustained throughput. Look at distributed systems like Apache Nutch.
Where to next
- Installation —
pip install yoinkand optional extras. - Quickstart — your first crawl in 30 seconds.
- Architecture — how the moving parts fit together.