v0.1.0 · 134 / 134 tests passing · 3,164 LOC · MIT licensed · Python 3.11+ · async / await · S3-ready · playwright optional
the public data crawler

yoink. Fast, async, polite: extract clean data from any public website.

A focused Python crawler with rate limiting, robots.txt compliance, JS rendering, and resumable S3 checkpoints. ~3,200 lines, 134 tests, zero ceremony.

# Install
$ pip install yoink

# Crawl & save JSONL
$ yoink crawl https://docs.example.com --depth 2 -o data.jsonl
  Yoinking pages: 100% ████████ 87/100
  ╰─► Yoinked 87 pages → data.jsonl

# Analyze
$ yoink stats data.jsonl --json | jq '.top_domains'

▸ what's in the box

Everything you need. Nothing you don't.

~3,200 lines of focused Python.
The rest is battle-tested libraries.

01

async by default

aiohttp-based concurrency with configurable workers. Hundreds of pages per minute on a single laptop, polite by default.

aiohttp · asyncio
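
To make "configurable workers" concrete, here is a rough stdlib-only sketch of a fixed pool of asyncio workers draining a shared queue. It is not yoink's actual code: `fake_fetch`, `crawl`, and the `workers` parameter are hypothetical names, and the sleep stands in for where a real fetcher would issue an aiohttp request.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request (hypothetical; no network access)."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls: list[str], workers: int = 4) -> dict[str, str]:
    """Fetch every URL using a fixed number of concurrent workers."""
    queue: asyncio.Queue[str] = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results: dict[str, str] = {}

    async def worker() -> None:
        # Each worker pulls URLs until the queue is empty, then exits.
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results[url] = await fake_fetch(url)

    # At most `workers` fetches are in flight at any moment.
    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```

Calling `asyncio.run(crawl(urls, workers=8))` returns a dict keyed by URL. Raising `workers` increases throughput, which is why a per-domain limiter belongs in front of the fetch.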
02

AI-ready text

Trafilatura-powered extraction returns clean prose with the chrome stripped. Pipe straight into your training set.

trafilatura · lxml
03

token-bucket rate limiter

Per-domain limits with crawl-delay honoring. Be a good citizen without thinking about it.

robots.txt · per-domain
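
For readers who have not met a token bucket before, a minimal sketch of the per-domain idea follows. The class names, the `set_crawl_delay` helper, and the injectable `clock` are all assumptions made for illustration; yoink's own limiter may be structured quite differently.

```python
import time
from urllib.parse import urlsplit


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def wait_time(self) -> float:
        """Reserve one token; return the seconds the caller should sleep first."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        self.tokens -= 1  # may go negative: that is a reservation on future tokens
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate


class DomainLimiter:
    """One bucket per domain (hypothetical wrapper, not yoink's API)."""

    def __init__(self, default_rate: float = 1.0, burst: float = 1.0,
                 clock=time.monotonic):
        self._default_rate = default_rate
        self._burst = burst
        self._clock = clock
        self._buckets: dict[str, TokenBucket] = {}

    def set_crawl_delay(self, domain: str, delay_seconds: float) -> None:
        # A Crawl-delay of N seconds maps to a rate of 1/N requests per second.
        self._buckets[domain] = TokenBucket(1.0 / delay_seconds, 1.0, self._clock)

    def wait_time(self, url: str) -> float:
        domain = urlsplit(url).netloc
        if domain not in self._buckets:
            self._buckets[domain] = TokenBucket(
                self._default_rate, self._burst, self._clock)
        return self._buckets[domain].wait_time()
```

In an async crawler the caller would typically run `await asyncio.sleep(limiter.wait_time(url))` before each fetch, so a slow domain never holds up requests to other domains.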
04

resumable crawls

Append-only checkpoints to disk or S3. Survives Lambda timeouts, OOM kills, and Ctrl-C — pick right back up.

aioboto3 · JSONL
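
The append-only idea can be sketched in a few lines of stdlib Python. This is an illustration under stated assumptions, not yoink's implementation: the `JsonlCheckpoint` class, its methods, and the `url` field are hypothetical, and only local disk is shown (an S3 backend would need a client such as aioboto3).

```python
import json
from pathlib import Path


class JsonlCheckpoint:
    """Append-only checkpoint: one JSON record per line, keyed by URL."""

    def __init__(self, path):
        self.path = Path(path)
        self.seen: set[str] = set()
        if self.path.exists():
            raw = self.path.read_text(encoding="utf-8")
            for line in raw.splitlines():
                try:
                    self.seen.add(json.loads(line)["url"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    continue  # e.g. a half-written line from a killed process
            if raw and not raw.endswith("\n"):
                # Terminate a truncated last line so new records start cleanly.
                with self.path.open("a", encoding="utf-8") as fh:
                    fh.write("\n")

    def record(self, url: str, data: dict) -> None:
        """Append a finished page and mark its URL as done."""
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"url": url, **data}) + "\n")
        self.seen.add(url)

    def pending(self, urls: list[str]) -> list[str]:
        """Return the URLs that still need crawling after a restart."""
        return [u for u in urls if u not in self.seen]
```

Because records are only ever appended, a crash can damage at most the final line, and the loader simply skips it on the next start.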
05

JS rendering, optional

Drop-in Playwright for SPAs. Chromium / Firefox / WebKit, pooled contexts, smart wait strategies.

playwright · chromium
06

output you can use

JSON, JSONL, Parquet, plain text. Stream millions of pages or load straight into pandas — no bespoke schema.

jsonl · parquet
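
As a small illustration of why line-delimited output is easy to work with, the sketch below streams JSONL records one at a time and counts pages per domain. The helper names are hypothetical, and it assumes each record carries a `url` field, which is a guess about the output schema rather than something documented here.

```python
import json
from collections import Counter
from typing import Iterable, Iterator
from urllib.parse import urlsplit


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def top_domains(records: Iterable[dict], n: int = 3) -> list[tuple[str, int]]:
    """Count records per domain, assuming each record has a `url` key."""
    counts = Counter(urlsplit(r["url"]).netloc for r in records)
    return counts.most_common(n)
```

With a file on disk, `with open("data.jsonl") as fh: print(top_domains(iter_jsonl(fh)))` processes the crawl in a single pass with constant memory per record.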

▸ how it works

One pipeline. Twelve modules.

Each module does one thing. The crawler is the conductor. Swap the fetcher, storage backend, or extractor without forking — every seam is an interface, not magic.

Architecture deep-dive
crawl pipeline · crawler.py
01 Scheduler: queue / dedup
02 RateLimiter: token bucket
03 Robots: is_allowed
04 Fetcher: aiohttp / playwright
05 Parser: links / metadata
06 Extractor: trafilatura
07 Checkpoint: JSONL / S3
workers: N · output: JSONL/Parquet/...
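
To show what "every seam is an interface" could look like in practice, here is a simplified, synchronous sketch of the stages above. Everything in it is an assumption for illustration: `Fetcher`, `DictFetcher`, `parse_links`, `run_pipeline`, and the `yoink-sketch` user agent are hypothetical names, the rate limiter and extractor are left out for brevity, and robots rules are checked with the standard library's `urllib.robotparser` rather than whatever yoink uses internally.

```python
import re
from collections import deque
from typing import Iterator, Protocol
from urllib.robotparser import RobotFileParser


class Fetcher(Protocol):
    """Any object with a matching fetch() method can be plugged in."""

    def fetch(self, url: str) -> str: ...


class DictFetcher:
    """In-memory fetcher for tests; a real one would perform HTTP requests."""

    def __init__(self, pages: dict[str, str]):
        self.pages = pages

    def fetch(self, url: str) -> str:
        return self.pages.get(url, "")


def parse_links(html: str) -> list[str]:
    """Naive link extraction, good enough for a sketch."""
    return re.findall(r'href="([^"]+)"', html)


def run_pipeline(seeds: list[str], fetcher: Fetcher, robots: RobotFileParser,
                 agent: str = "yoink-sketch") -> Iterator[dict]:
    queue, seen = deque(seeds), set(seeds)       # Scheduler: queue / dedup
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):     # Robots: is_allowed
            continue
        html = fetcher.fetch(url)                # Fetcher
        for link in parse_links(html):           # Parser: links
            if link not in seen:
                seen.add(link)
                queue.append(link)
        yield {"url": url, "html": html}         # Extractor / Checkpoint go here
```

Swapping `DictFetcher` for a browser-backed fetcher would not touch `run_pipeline` at all, which is the point of keeping each stage behind a small interface.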

▸ used for

Public data is public. Treat it that way.

AI / RAG datasets

Crawl docs sites, mirror knowledge bases, build embedding indexes — clean text out, no boilerplate.

Lambda crawlers

S3 checkpoints + 14-min budget = crawls that survive across invocations indefinitely.

Content analysis

Parquet output drops straight into pandas / DuckDB / Athena. No schema gymnastics.

Ready to yoink?

One pip install. Zero config required. Read the quickstart and have a crawl running before your coffee gets cold.