Introduction

A fast, async Python web crawler for extracting AI-ready data from public websites.

yoink is a focused, well-tested Python crawler that turns public websites into clean, structured data. It's the tool you reach for when you want to build a training set, mirror a documentation site, audit an API surface, or run a research crawl — without hand-rolling the boring parts.

What's in the box

Async architecture built on aiohttp with configurable concurrency
Clean text extraction via trafilatura — no nav chrome, no boilerplate
Per-domain rate limiting using a token bucket with burst support
robots.txt compliance out of the box, including Crawl-delay and Sitemap directives
JavaScript rendering via Playwright for SPAs (optional extra)
Resumable crawls with append-only checkpoints to disk or S3
URL filtering with glob, regex, and extension matching
First-class output formats — JSON, JSONL, Parquet, plain text
Built-in stats for inspecting what you yoinked
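To make the per-domain rate limiting item concrete, here is a minimal sketch of a token bucket with burst support, keyed by host. It is an illustration of the general technique only, not yoink's actual implementation: the names `TokenBucket`, `PerDomainLimiter`, `rate`, `burst`, and `allow` are all hypothetical.

```python
import time
from urllib.parse import urlsplit


class TokenBucket:
    """Illustrative token bucket (hypothetical; not yoink's real class)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate            # tokens refilled per second
        self.capacity = burst       # most tokens that can accumulate (burst size)
        self.tokens = float(burst)  # start full so the first burst goes through
        self.clock = clock
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return False if the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerDomainLimiter:
    """Keeps one independent bucket per host (hypothetical helper)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets = {}

    def allow(self, url):
        host = urlsplit(url).netloc
        if host not in self.buckets:
            self.buckets[host] = TokenBucket(self.rate, self.burst, self.clock)
        return self.buckets[host].try_acquire()
```

With `rate=1.0` and `burst=2`, a host can receive two requests back to back and then roughly one per second after that, while other hosts are unaffected. The clock is injectable so the behavior can be checked without sleeping.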

Design principles

Polite by default. Respects robots.txt, identifies itself, rate-limits per domain, stays on the start domain.
Pluggable, not magic. Swap fetchers, storage backends, filters, and extractors without forking the crawler.
Resumable, always. Long crawls die. Lambda runs time out. yoink should pick up where it left off.
Output is the product. Clean JSONL/Parquet that drops straight into your pipeline beats a fancy CLI.
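The "Resumable, always" principle can be sketched with an append-only checkpoint file. This is a simplified illustration using only the standard library, assuming one JSON line per completed URL; the file layout and the function names `record_done`, `load_done`, and `pending` are hypothetical and not yoink's actual checkpoint format.

```python
import json
import os
from pathlib import Path


def record_done(checkpoint, url):
    """Append one completed URL as a JSON line and force it to disk."""
    with Path(checkpoint).open("a", encoding="utf-8") as f:
        f.write(json.dumps({"url": url}) + "\n")
        f.flush()
        os.fsync(f.fileno())


def load_done(checkpoint):
    """Return the set of URLs recorded so far, ignoring unreadable lines."""
    done = set()
    path = Path(checkpoint)
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as f:
        for line in f:
            try:
                done.add(json.loads(line)["url"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A crash mid-write can leave a truncated line; skip it.
                continue
    return done


def pending(frontier, checkpoint):
    """Filter a list of URLs down to those not yet checkpointed."""
    done = load_done(checkpoint)
    return [url for url in frontier if url not in done]
```

Because the file is only ever appended to, a killed process loses at most the record it was writing. A production version would also need to trim a truncated final line before appending again, since a new record written directly after it would otherwise be unreadable too.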

When to use yoink

✅ Good fit

You want a few hundred to a few hundred thousand public pages, fast.
You're feeding an LLM, building an embedding index, or training a model.
You're mirroring documentation, doing SEO research, or running content analysis.
You're shipping a Lambda job that needs to survive restarts.

❌ Not the right tool

You need to log in, solve CAPTCHAs, or scrape adversarial sites that explicitly forbid it.
You want a UI-driven scraping product. yoink is a library + CLI.
You need millions of pages a day at sustained throughput. Look at a distributed crawler such as Apache Nutch instead.

Where to next

Installation — pip install yoink and optional extras.
Quickstart — your first crawl in 30 seconds.
Architecture — how the moving parts fit together.
