v0.1.0 · 134 / 134 tests passing · 3,164 LOC · MIT licensed · Python 3.11+ · async / await · S3-ready · playwright optional

the public data crawler

yoink. Fast, async, polite: extract clean data from any public website.

A focused Python crawler with rate limiting, robots.txt compliance, JS rendering, and resumable S3 checkpoints. ~3,200 lines, 134 tests, zero ceremony.

Get started · github.com/ErikkJs/yoink

~/yoink (zsh)

# Install
$ pip install yoink

# Crawl & save JSONL
$ yoink crawl https://docs.example.com --depth 2 -o data.jsonl
  Yoinking pages: 100% ████████ 87/100
  ╰─► Yoinked 87 pages → data.jsonl

# Analyze
$ yoink stats data.jsonl --json | jq '.top_domains'

▸ what's in the box

Everything you need. Nothing you don't.

~3,200 lines of focused Python.
The rest is battle-tested libraries.

01 ─

async by default

aiohttp-based concurrency with configurable workers. Hundreds of pages per minute on a single laptop, polite by default.

aiohttp · asyncio
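To make "configurable workers" concrete, here is a minimal sketch of a queue-fed worker pool built on asyncio alone. It is not yoink's code: `fetch`, `worker`, and `crawl` are hypothetical names, and `fetch` only sleeps instead of making an aiohttp request so the example runs offline.

```python
import asyncio


async def fetch(url: str) -> str:
    """Stand-in for a real HTTP request (hypothetical; no network I/O)."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    # Each worker pulls URLs from the shared queue until it is cancelled.
    while True:
        url = await queue.get()
        try:
            results[url] = await fetch(url)
        finally:
            queue.task_done()


async def crawl(urls: list[str], workers: int = 4) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: dict = {}
    tasks = [asyncio.create_task(worker(queue, results)) for _ in range(workers)]
    await queue.join()  # returns once every queued URL has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


pages = asyncio.run(crawl([f"https://example.com/{i}" for i in range(10)], workers=3))
```

Raising `workers` increases how many requests are in flight at once; a rate limiter would sit between the queue and `fetch` to keep that polite.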

02 ─

AI-ready text

Trafilatura-powered extraction returns clean prose with the chrome stripped. Pipe straight into your training set.

trafilatura · lxml

03 ─

token-bucket rate limiter

Per-domain limits with crawl-delay honoring. Be a good citizen without thinking about it.

robots.txt · per-domain
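A per-domain token bucket that honors crawl-delay could look roughly like the sketch below. The class and method names (`TokenBucket`, `DomainLimiter`, `set_crawl_delay`, `allow`) are assumptions made for illustration, not yoink's API; the robots.txt parsing uses the standard library's `urllib.robotparser`.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class TokenBucket:
    """Refills `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self):
        # Top up for the time elapsed since the last call, then spend one token.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class DomainLimiter:
    """Keeps one bucket per domain (hypothetical helper)."""

    def __init__(self, rate=1.0, capacity=1.0, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets = {}

    def set_crawl_delay(self, domain, delay_s):
        # A crawl-delay of N seconds maps to one request every N seconds.
        self.buckets[domain] = TokenBucket(1.0 / delay_s, 1.0, self.clock)

    def allow(self, url):
        domain = urlparse(url).netloc
        if domain not in self.buckets:
            self.buckets[domain] = TokenBucket(self.rate, self.capacity, self.clock)
        return self.buckets[domain].try_acquire()


# Parse an in-memory robots.txt and feed its crawl-delay into the limiter.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Crawl-delay: 2", "Disallow: /private"])
limiter = DomainLimiter(rate=2.0, capacity=2.0)
delay = robots.crawl_delay("*")
if delay:
    limiter.set_crawl_delay("docs.example.com", delay)
```

A caller would check `robots.can_fetch(...)` first, then wait or requeue the URL whenever `limiter.allow(url)` returns `False`. The injectable `clock` is there only to make the behavior easy to test.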

04 ─

resumable crawls

Append-only checkpoints to disk or S3. Survives Lambda timeouts, OOM kills, and Ctrl-C — pick right back up.

aioboto3 · JSONL
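The idea behind append-only checkpoints can be sketched with a local JSONL file. Everything here is illustrative: the function names and the `url` field are assumptions, and the S3 side is left out entirely.

```python
import json
import os
import tempfile
from pathlib import Path


def append_checkpoint(path: Path, record: dict) -> None:
    """Append one record as a single JSON line and force it to disk."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def load_done(path: Path) -> set[str]:
    """Return the URLs already recorded, skipping any unreadable line."""
    done: set[str] = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as f:
        for line in f:
            try:
                done.add(json.loads(line)["url"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # e.g. a final line cut off by an abrupt kill
    return done


def resume(path: Path, urls: list[str]) -> list[str]:
    """Filter a URL list down to what still needs crawling."""
    done = load_done(path)
    return [u for u in urls if u not in done]


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "crawl.jsonl"
    append_checkpoint(ckpt, {"url": "https://example.com/a"})
    with ckpt.open("a", encoding="utf-8") as f:
        f.write('{"url": "https://exa')  # simulate a write cut off mid-line
    remaining = resume(ckpt, ["https://example.com/a", "https://example.com/b"])
```

Because records are only ever appended, a crash can damage at most the last line. A fuller version would also trim that torn tail before appending again.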

05 ─

JS rendering, optional

Drop-in Playwright for SPAs. Chromium / Firefox / WebKit, pooled contexts, smart wait strategies.

playwright · chromium

06 ─

output you can use

JSON, JSONL, Parquet, plain text. Stream millions of pages or load straight into pandas — no bespoke schema.

jsonl · parquet
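JSONL output can be consumed one line at a time, which is what makes streaming large crawls practical. The sketch below assumes each record carries a `url` field (the real schema may differ) and computes a top-domains count in the spirit of `yoink stats`; the helper names are hypothetical.

```python
import io
import json
from collections import Counter
from urllib.parse import urlparse


def iter_records(lines):
    """Yield one parsed record per non-empty line, without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def top_domains(records, n=3):
    """Count records per domain and return the n most common."""
    counts = Counter(urlparse(r["url"]).netloc for r in records)
    return counts.most_common(n)


# In-memory stand-in for a data.jsonl file.
sample = io.StringIO(
    '{"url": "https://docs.example.com/a", "text": "first page"}\n'
    '{"url": "https://docs.example.com/b", "text": "second page"}\n'
    '{"url": "https://blog.example.com/x", "text": "a post"}\n'
)
domains = top_domains(iter_records(sample))
```

Passing an open file handle instead of `sample` works the same way, since both are iterated line by line.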

▸ how it works

One pipeline. Twelve modules.

Each module does one thing. The crawler is the conductor. Swap the fetcher, storage backend, or extractor without forking — every seam is an interface, not magic.

Architecture deep-dive →

crawl pipeline · crawler.py

Scheduler: queue / dedup

↓

RateLimiter: token bucket

↓

Robots: is_allowed

↓

Fetcher: aiohttp / playwright

↓

Parser: links / metadata

↓

Extractor: trafilatura

↓

Checkpoint: JSONL / S3

workers: N · output: JSONL/Parquet/...
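One way to picture "every seam is an interface" is with `typing.Protocol`: the crawl loop depends only on the shape of a fetcher or extractor, so either can be swapped. This is a simplified, synchronous sketch under assumed names (`Fetcher`, `Extractor`, `DictFetcher`, `TagStripper`, `crawl`); it omits rate limiting and checkpointing and is not yoink's actual module layout.

```python
import re
from collections import deque
from typing import Callable, Protocol


class Fetcher(Protocol):
    def fetch(self, url: str) -> str: ...


class Extractor(Protocol):
    def extract(self, html: str) -> str: ...


class DictFetcher:
    """Serves canned pages from memory; a stand-in for an HTTP or browser fetcher."""

    def __init__(self, pages: dict[str, str]):
        self.pages = pages

    def fetch(self, url: str) -> str:
        return self.pages.get(url, "")


class TagStripper:
    """Crude extractor that only drops tags; a real one would also remove boilerplate."""

    def extract(self, html: str) -> str:
        return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def crawl(seeds: list[str], fetcher: Fetcher, extractor: Extractor,
          is_allowed: Callable[[str], bool], max_pages: int = 100) -> list[dict]:
    queue = deque(seeds)          # scheduler: FIFO frontier
    seen = set(seeds)             # dedup
    records = []
    while queue and len(records) < max_pages:
        url = queue.popleft()
        if not is_allowed(url):   # robots check
            continue
        html = fetcher.fetch(url)
        for link in re.findall(r'href="([^"]+)"', html):  # parser: links
            if link not in seen:
                seen.add(link)
                queue.append(link)
        records.append({"url": url, "text": extractor.extract(html)})
    return records


pages = {
    "https://example.com/": '<a href="https://example.com/a">A</a> '
                            '<a href="https://example.com/private">P</a> home',
    "https://example.com/a": '<p>page a</p> <a href="https://example.com/">back</a>',
    "https://example.com/private": "secret",
}
records = crawl(["https://example.com/"], DictFetcher(pages), TagStripper(),
                is_allowed=lambda u: "/private" not in u)
```

Swapping `DictFetcher` for a browser-backed fetcher, or `TagStripper` for a trafilatura-backed extractor, would not require touching `crawl`.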

▸ used for

Public data is public. Treat it that way.

AI / RAG datasets

Crawl docs sites, mirror knowledge bases, build embedding indexes — clean text out, no boilerplate.

Lambda crawlers

S3 checkpoints + 14-min budget (just under Lambda's 15-min cap) = crawls that resume across invocations indefinitely.
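The time-budget pattern can be sketched as a loop that stops before the deadline and hands back whatever is left, so the next invocation can pick it up from a checkpoint. `crawl_with_budget` and its parameters are hypothetical; in a real Lambda the budget would be something like `14 * 60` seconds, and the fake clock here exists only to make the example deterministic.

```python
import time
from collections import deque


def crawl_with_budget(frontier, process, budget_s, clock=time.monotonic):
    """Process URLs until the frontier is empty or the time budget runs out."""
    deadline = clock() + budget_s
    pending = deque(frontier)
    done = []
    while pending and clock() < deadline:
        url = pending.popleft()
        process(url)
        done.append(url)
    # Whatever is still pending would be checkpointed for the next invocation.
    return done, list(pending)


# Simulated run: each page "takes" one second on a fake clock.
fake_now = [0.0]


def slow_process(url):
    fake_now[0] += 1.0


done, pending = crawl_with_budget(
    ["a", "b", "c", "d", "e"], slow_process, budget_s=2.5, clock=lambda: fake_now[0]
)
```

The budget is checked before each page, so a single slow page can still overrun it; leaving a margin below the hard timeout is what absorbs that.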

Content analysis

Parquet output drops straight into pandas / DuckDB / Athena. No schema gymnastics.

Ready to yoink?

One pip install. Zero config required. Read the quickstart and have a crawl running before your coffee gets cold.

Quickstart → · API reference