# yoink — full documentation
> Fast, async Python web crawler with rate limiting, robots.txt compliance, optional JavaScript rendering, and resumable S3-backed checkpoints. This file concatenates every documentation page so you can paste the whole thing into an AI assistant's context.
---
# Getting started
## Introduction
_Source: `docs/introduction.mdx` · https://yoink.goatsquadstudios.com/docs/introduction_
> A fast, async Python web crawler for extracting AI-ready data from public websites.
**yoink** is a focused, well-tested Python crawler that turns public websites into clean, structured data. It's the tool you reach for when you want to build a training set, mirror a documentation site, audit an API surface, or run a research crawl — without hand-rolling the boring parts.
## What's in the box
- **Async architecture** built on `aiohttp` with configurable concurrency
- **Clean text extraction** via [trafilatura](https://github.com/adbar/trafilatura) — no nav chrome, no boilerplate
- **Per-domain rate limiting** using a token bucket with burst support
- **robots.txt compliance** out of the box, including `Crawl-delay` and `Sitemap` directives
- **JavaScript rendering** via Playwright for SPAs (optional extra)
- **Resumable crawls** with append-only checkpoints to disk or S3
- **URL filtering** with glob, regex, and extension matching
- **First-class output formats** — JSON, JSONL, Parquet, plain text
- **Built-in stats** for inspecting what you yoinked
## Design principles
yoink is intentionally small (~3,200 lines of Python, 134 passing tests). The hard parts — HTTP, HTML parsing, text extraction, browser automation — are delegated to libraries that have been battle-tested for years.
1. **Polite by default.** Respects `robots.txt`, identifies itself, rate-limits per domain, stays on the start domain.
2. **Pluggable, not magic.** Swap fetchers, storage backends, filters, and extractors without forking the crawler.
3. **Resumable, always.** Long crawls die. Lambda runs time out. yoink should pick up where it left off.
4. **Output is the product.** Clean JSONL/Parquet that drops straight into your pipeline beats a fancy CLI.
## When to use yoink
✅ **Good fit**
- You want a few hundred to a few hundred thousand public pages, fast.
- You're feeding an LLM, building an embedding index, or training a model.
- You're mirroring documentation, doing SEO research, or running content analysis.
- You're shipping a Lambda job that needs to survive restarts.
❌ **Not the right tool**
- You need to log in, solve CAPTCHAs, or scrape adversarial sites that explicitly forbid it.
- You want a UI-driven scraping product. yoink is a library + CLI.
- You need millions of pages a day at sustained throughput. Look at distributed systems like Apache Nutch.
## Where to next
- [Installation](/docs/installation) — install from source, plus optional extras.
- [Quickstart](/docs/quickstart) — your first crawl in 30 seconds.
- [Architecture](/docs/concepts/architecture) — how the moving parts fit together.
## Installation
_Source: `docs/installation.mdx` · https://yoink.goatsquadstudios.com/docs/installation_
> Install yoink from source, including optional extras for Parquet, S3, and JavaScript rendering.
yoink supports **Python 3.11+** and runs on Linux, macOS, and Windows.
## Standard install
yoink is currently distributed from source on GitHub:
```bash
git clone https://github.com/ErikkJs/yoink
cd yoink
pip install -e .
```
If you use [Poetry](https://python-poetry.org/), the project ships a `pyproject.toml`:
```bash
git clone https://github.com/ErikkJs/yoink
cd yoink
poetry install
```
The bare name `yoink` on PyPI is taken by an unrelated package (a podcast downloader). Until this crawler is published under a distinct distribution name, install from source as shown above. Running `pip install yoink` will fetch the wrong package.
You can also install directly from the GitHub URL without cloning:
```bash
pip install "git+https://github.com/ErikkJs/yoink.git"
```
## Optional extras
yoink keeps heavy dependencies behind extras so the core stays lean.
| Extra | Adds | When you need it |
| ---------- | ----------------------------- | ---------------------------------------------------- |
| `parquet` | `pyarrow` | Writing crawl output as columnar Parquet files |
| `s3` | `aioboto3` | Checkpointing to AWS S3 (Lambda, EC2, ECS workloads) |
| `browser` | `playwright` | Rendering JavaScript-heavy sites and SPAs |
| `all` | All of the above | When you don't want to think about it |
```bash
# Install with one extra (from a local clone)
pip install -e ".[parquet]"
# Multiple extras
pip install -e ".[s3,parquet]"
# Everything
pip install -e ".[all]"
# Or from GitHub
pip install "yoink[all] @ git+https://github.com/ErikkJs/yoink.git"
```
### Playwright browsers
The `browser` extra installs the Playwright Python package, but you also need to download the actual browser binaries (Chromium / Firefox / WebKit):
```bash
pip install -e ".[browser]"
playwright install chromium
```
For containerized environments, use `playwright install --with-deps chromium` to install both the browser and the system libraries it needs.
### S3 credentials
The `s3` extra brings in `aioboto3`, but the SDK still needs credentials. Any of these work:
```bash
# 1. AWS CLI profile (recommended for local development)
aws configure
# 2. Environment variables
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
# 3. IAM role (automatic on EC2 / ECS / Lambda)
# No configuration needed
```
Minimum IAM permissions for the bucket you're checkpointing to. `HeadObject` requests are authorized by `s3:GetObject`; IAM has no separate `s3:HeadObject` action:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}
```
## Verify the install
```bash
yoink version
# yoink version 0.1.0
# The public data crawler.
```
You're ready. Head to the [quickstart](/docs/quickstart).
## Quickstart
_Source: `docs/quickstart.mdx` · https://yoink.goatsquadstudios.com/docs/quickstart_
> Yoink your first website in under a minute — CLI and Python.
This page gets you from zero to a finished crawl twice: once on the CLI, once in Python.
## Your first CLI crawl
```bash
yoink crawl https://example.com
```
That's it. yoink will:
1. Fetch the start URL, parse it, extract text, and follow links.
2. Default to depth `1` and `100` pages (adjust with `--depth` and `--max-pages`).
3. Rate-limit to `2` requests per second per domain and respect `robots.txt`.
4. Write results to `crawl_output.jsonl` in the current directory.
Open the file:
```bash
head -1 crawl_output.jsonl | python -m json.tool
```
## A more useful crawl
```bash
yoink crawl https://docs.python.org \
--depth 2 \
--max-pages 50 \
--include "*/tutorial/*" \
--skip-extensions pdf,zip \
--format jsonl \
-o python_tutorial.jsonl
```
What's happening:
- `--depth 2` follows two link hops from the start URL.
- `--include "*/tutorial/*"` only crawls URLs matching that glob.
- `--skip-extensions pdf,zip` ignores binary file links.
- `--format jsonl -o python_tutorial.jsonl` streams one JSON object per page to disk.
Then inspect what you got:
```bash
yoink stats python_tutorial.jsonl
```
You'll see total pages, link counts, depth distribution, top domains, and content quality metrics.
## Your first Python crawl
```python
import asyncio

from yoink import Crawler, CrawlConfig


async def main():
    config = CrawlConfig(
        max_depth=2,
        max_pages=100,
        max_concurrency=10,
        requests_per_second=2.0,
    )
    crawler = Crawler(config=config)
    pages = await crawler.crawl("https://example.com")
    for page in pages:
        print(f"{page.status_code} {page.url}")
        print(f" title: {page.title}")
        print(f" text: {len(page.text or '')} chars")


asyncio.run(main())
```
## Resumable crawls
Long crawls die. Plan for it from day one with checkpointing:
```python
import asyncio

from yoink import Crawler, CrawlConfig, CheckpointManager


async def main():
    config = CrawlConfig(max_depth=3, max_pages=10_000)
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
    crawler = Crawler(config=config, checkpoint_manager=checkpoint)
    # Pick up where we left off if the file already exists
    pages = await crawler.crawl("https://docs.example.com", resume=True)
    return pages


asyncio.run(main())
```
Same on the CLI:
```bash
# First run — interrupted with Ctrl-C
yoink crawl https://docs.example.com --checkpoint ./crawl.jsonl
# Resume
yoink crawl https://docs.example.com --checkpoint ./crawl.jsonl --resume
```
Replace the path with `s3://my-bucket/crawl.jsonl` and yoink will buffer writes and flush them to S3, so the crawl survives Lambda timeouts and restarts. See [Lambda + S3 checkpoints](/docs/examples/lambda-s3).
## What to read next
- **Concepts** — [architecture](/docs/concepts/architecture), [rate limiting](/docs/concepts/rate-limiting), [JS rendering](/docs/concepts/javascript-rendering).
- **CLI reference** — every `yoink crawl` flag, [explained](/docs/cli/crawl).
- **Python API** — [`Crawler`](/docs/api/crawler), [`CrawlConfig`](/docs/api/config), [`CheckpointManager`](/docs/api/checkpoint).
---
# Concepts
## Architecture
_Source: `docs/concepts/architecture.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/architecture_
> How yoink's modules fit together — fetcher, parser, scheduler, rate limiter, robots checker, checkpoint, storage.
yoink is built as a small graph of single-purpose modules. Each one does one thing, and the `Crawler` is the conductor that wires them together.
## The components
| Module | Responsibility |
| --------------------- | --------------------------------------------------------------------------- |
| `Crawler` | Owns the worker pool, drives the crawl loop, persists results. |
| `Fetcher` | Async HTTP client (aiohttp). Up to 3 attempts with exponential backoff on `ClientError`; immediate retry on `TimeoutError`; HTTP errors (4xx/5xx) returned as-is, not retried. |
| `PlaywrightFetcher` | Browser-based fetcher for JS-heavy sites. |
| `create_fetcher` | Factory function in `fetcher_factory.py`. Returns `PlaywrightFetcher` when `render_js=True` (and Playwright is importable), `Fetcher` otherwise. Emits `UserWarning` and falls back to `Fetcher` if `render_js=True` but Playwright isn't installed. |
| `Parser` | HTML → title, links, metadata (BeautifulSoup + lxml). |
| `Extractor` | HTML → clean text via trafilatura. |
| `Scheduler` | URL queue, depth tracking, deduplication, filter integration. |
| `RateLimiter` | Per-domain token bucket with `Crawl-delay` support. |
| `RobotsChecker` | Fetches & caches `robots.txt`, answers `is_allowed(url)`. |
| `URLFilter` / `DomainFilter` / `CombinedFilter` | Glob/regex/extension/domain matching (`filters.py`). |
| `CheckpointManager` | Append page records and crawl state to a `CheckpointStorage`. |
| `CheckpointStorage` | Pluggable backend (local file, S3). |
| `Writer` | Final-output serialization (JSON, JSONL, Parquet, text). |
| `CrawlStats` | Post-crawl analysis (depth, domains, content quality). |
## Lifecycle of a crawl
## Why this shape?
**Async workers, not threads.** Crawling is I/O bound. `asyncio` lets one process handle hundreds of concurrent requests without the overhead of OS threads.
**A real queue, not recursion.** Depth-limited BFS gives predictable memory usage and clean depth metadata. The scheduler also owns deduplication, so workers can't accidentally re-fetch the same URL.
**Rate limiting at the gate.** Token bucket per domain — workers compete for tokens, so even if you've got 50 concurrent requests, no single domain sees more than `requests_per_second`.
**Checkpoints as an append log.** Pages stream to checkpoint as soon as they're crawled, so a crash never costs you more than the in-flight batch. State (visited set, queue, filters) is written at the end and on every flush interval.
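As a rough illustration of the queue-plus-deduplication idea, here is a depth-limited BFS over a toy link graph. It is a standalone sketch, not yoink's `Scheduler`: the function name `bfs_order` and the in-memory `links` dict are made up for the example.

```python
from collections import deque


def bfs_order(start: str, links: dict[str, list[str]], max_depth: int) -> list[tuple[str, int]]:
    """Visit URLs breadth-first, never enqueueing the same URL twice."""
    visited = {start}                 # deduplication lives next to the queue
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue                  # depth limit: don't follow links further
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Toy site: "/b" is linked twice and "/" is linked back to, but each is visited once.
links = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": ["/d"]}
order = bfs_order("/", links, max_depth=2)
print(order)
```

Every URL comes out tagged with the depth at which it was first discovered, and `/d` is never visited because it sits three hops from the start.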
## Where to extend
- **Custom storage backend** — implement [`CheckpointStorage`](/docs/api/storage) (`write`, `read`, `exists`, `flush`, `close`) for Redis, GCS, Azure Blob, etc.
- **Custom filter** — implement `should_crawl(url) -> bool` and pass it via the `url_filter` argument to `Crawler`.
- **Custom extractor** — replace the default trafilatura-based `Extractor` for domain-specific extraction (PDFs, schema.org parsing, etc.).
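A custom filter can be very small. The sketch below assumes the only requirement is a synchronous `should_crawl(url) -> bool` method, as described in the list above; the class name `MaxPathDepthFilter` and its `max_segments` parameter are hypothetical.

```python
from urllib.parse import urlparse


class MaxPathDepthFilter:
    """Hypothetical filter: reject URLs whose path has too many segments."""

    def __init__(self, max_segments: int) -> None:
        self.max_segments = max_segments

    def should_crawl(self, url: str) -> bool:
        # "/a/b/" and "/a/b" both count as two segments.
        segments = [s for s in urlparse(url).path.split("/") if s]
        return len(segments) <= self.max_segments


shallow_only = MaxPathDepthFilter(max_segments=2)
print(shallow_only.should_crawl("https://example.com/docs/intro"))
print(shallow_only.should_crawl("https://example.com/docs/api/v1/users"))
```

If the interface matches that assumption, an instance would be passed as `Crawler(config=CrawlConfig(), url_filter=shallow_only)`.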
## Code locations
The full source lives at github.com/ErikkJs/yoink/tree/master/src/yoink. ~3,200 lines of Python across 18 files, each module focused, with 134 passing tests under `tests/`.
## Rate limiting
_Source: `docs/concepts/rate-limiting.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/rate-limiting_
> Per-domain token bucket rate limiting with burst support, minimum delays, and Crawl-delay honoring.
yoink rate-limits at the **fetcher gate** — every outbound request has to acquire a token before going out. This protects target servers, keeps you on the right side of `robots.txt` `Crawl-delay`, and avoids tripping basic anti-abuse heuristics.
## The mechanism: token bucket
A token bucket fills at a constant rate (your `requests_per_second`) and holds a fixed maximum number of tokens (its capacity). Each request consumes one token. If the bucket is empty, the request waits until a token regenerates.
This gives you smooth traffic shaping at sustained `requests_per_second`. The default `burst_size=1` means there's no extra burst headroom — the very first request consumes the only token, and subsequent requests pace themselves at exactly your configured rate.
`burst_size` is a knob on the `RateLimiter` class but is **not** currently surfaced on `CrawlConfig` or the CLI. If you need bursts (e.g., 10 RPS sustained but happy to fire 5 in a row when idle), construct the limiter directly:
```python
from yoink.rate_limiter import RateLimiter
limiter = RateLimiter(requests_per_second=10.0, burst_size=5)
# then pass to your fetcher manually if subclassing
```
For most workloads, the `requests_per_second=2.0, burst_size=1` defaults are exactly what you want — polite, predictable, no surprises.
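To make the refill arithmetic concrete, here is a minimal standalone token bucket. It is not yoink's `RateLimiter`: the class `TokenBucket`, its fields, and the explicit `now` argument (used instead of a real clock so the numbers are reproducible) are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Illustrative token bucket; not yoink's RateLimiter."""

    rate: float          # tokens added per second (requests_per_second)
    capacity: float      # maximum tokens held (burst size)
    tokens: float = 0.0
    last: float = 0.0    # timestamp of the previous call

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # the bucket starts full

    def wait_time(self, now: float) -> float:
        """Reserve one token and return how long the caller should sleep first."""
        # Refill for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= 1.0
        if self.tokens >= 0.0:
            return 0.0
        # Negative balance: wait until the deficit has been refilled.
        return -self.tokens / self.rate


# Three requests arriving at the same instant, 2 RPS, no burst headroom.
bucket = TokenBucket(rate=2.0, capacity=1.0)
waits = [bucket.wait_time(now=0.0) for _ in range(3)]
print(waits)
```

With `capacity=1.0` the first caller goes out immediately and the next two are spaced half a second apart, which is the "no burst headroom" behavior described above. Raising `capacity` lets that many requests through before pacing starts.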
## Per-domain isolation
Rate limits are scoped to each domain you crawl. If `--follow-external` is enabled and your crawl visits both `docs.python.org` and `python.org`, they each get an independent bucket. Misbehaving on one domain can't slow another.
```python
config = CrawlConfig(
    requests_per_second=5.0,  # 5 RPS per domain
    max_concurrency=20,       # but only 20 concurrent overall
)
```
## `request_delay` — a wait-time floor
`request_delay` is a hard floor on the wait time computed by `acquire()` for each request to a given domain. With `request_delay=0.5`, every request to that domain (including the first) sleeps at least 500ms before being released, even if the token bucket has tokens available.
```bash
yoink crawl https://example.com --rate-limit 5.0 --request-delay 0.5
# Up to 5 RPS by token bucket, but every release sleeps ≥ 500ms
```
In Python:
```python
config = CrawlConfig(
    requests_per_second=5.0,
    request_delay=0.5,  # seconds; per-acquire floor
)
```
`request_delay` raises the floor on `acquire()`'s wait calculation, so it's effectively a per-request "wait at least this long." With `burst_size=1` and `request_delay=0.5`, you get a steady cadence of one request every 500ms (or slower, if the bucket is empty). It's not literally measured "between consecutive completions" — it's the minimum sleep before each token is handed out.
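One way to picture the floor: the sleep before a request is released is the larger of the token-bucket wait and `request_delay`. The helper below is a sketch of that reading of the text, not code from `rate_limiter.py`; the name `release_wait` is hypothetical.

```python
def release_wait(bucket_wait: float, request_delay: float) -> float:
    """Sleep time before release: the larger of the bucket wait and the floor."""
    return max(bucket_wait, request_delay)


# A token is available immediately, but the floor still applies.
first = release_wait(bucket_wait=0.0, request_delay=0.5)
# The bucket is empty and slower than the floor, so the bucket wait dominates.
starved = release_wait(bucket_wait=1.2, request_delay=0.5)
print(first, starved)
```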
## robots.txt `Crawl-delay`
When `respect_robots=True` (the default), yoink reads each domain's `robots.txt` and applies its `Crawl-delay` directive by reducing the bucket's refill rate to `1 / crawl_delay` requests per second — but only if that's stricter than your configured rate. The stricter limit always wins.
If your config says `requests_per_second=5.0` (1 request every 200ms) and the site's `robots.txt` has `Crawl-delay: 1`, the bucket's effective rate drops to 1 RPS for that domain — yoink will wait at least 1 second between requests there. Your config is the ceiling, not the floor.
Once a `Crawl-delay` reduces the bucket's rate, it stays reduced for the lifetime of that `RateLimiter` even if `robots.txt` is later refreshed with a less-restrictive value. In practice this only matters if you cache an extremely strict `Crawl-delay` and the site loosens it during your crawl — generally a non-issue.
## Picking sane defaults
A non-exhaustive heuristic:
| Target | Suggested `requests_per_second` |
| ------------------- | ------------------------------- |
| Personal blog | 1.0 |
| Documentation site | 2.0 – 5.0 |
| Public API / large news site | 5.0 – 10.0 |
| Your own staging server | Whatever you want |
If the site you're crawling publishes a `Crawl-delay`, honor it — yoink does this for you, but you can also set `request_delay` explicitly to make the constraint visible at the call site.
## Disabling rate limiting
You can't turn it fully off, but you can effectively disable it for testing:
```python
config = CrawlConfig(
    requests_per_second=1000,  # absurdly high
    request_delay=0.0,
)
```
For real workloads: don't.
## See also
- [`CrawlConfig.requests_per_second`](/docs/api/config) and `request_delay` reference.
- [robots.txt compliance](/docs/concepts/robots-txt) — how `Crawl-delay` is parsed and applied.
- The `RateLimiter` module: src/yoink/rate_limiter.py.
## robots.txt compliance
_Source: `docs/concepts/robots-txt.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/robots-txt_
> How yoink parses, caches, and applies robots.txt directives — Allow, Disallow, Crawl-delay, Sitemap.
yoink respects [robots.txt](https://www.robotstxt.org/) by default. The `RobotsChecker` is consulted before every fetch, and disallowed URLs are filtered out before they ever hit the queue.
## What's supported
- ✅ `User-agent` matching — exact, partial substring, and `*` wildcard fallback.
- ✅ `Disallow` rules with wildcard (`*`) and end-anchor (`$`) patterns.
- ✅ `Allow` rules (longer/more-specific paths win).
- ✅ `Crawl-delay` — narrows the rate limiter for that domain.
- ✅ `Sitemap` directives — parsed and stored on each domain's `RobotsDirectives.sitemaps` list.
- ✅ Per-domain caching with a 1-hour default TTL.
## How it fits in
## Pattern matching
yoink approximates [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309):
- `*` matches any sequence of characters (greedy).
- `$` at the end of a pattern anchors the match to the end of the string being tested, which includes any query string.
- Rules are sorted by path length (longest first), and the first match wins. **Tie-breaks** between equal-length `Allow` and `Disallow` rules go to whichever appears first in the file (Python's stable sort), not strictly to `Allow` as the RFC prefers. Author your `Allow`/`Disallow` rules with that in mind, or rely on the longer/more-specific path winning.
Examples:
```text
User-agent: *
Disallow: /private/
Disallow: /*.pdf$
Allow: /private/public-page.html
Crawl-delay: 2
```
| URL | Result | Why |
| -------------------------------- | -------- | ----------------------------------------- |
| `/about` | allowed | No matching rule |
| `/private/secrets` | blocked | `Disallow: /private/` |
| `/private/public-page.html` | allowed | `Allow` is more specific than `Disallow` |
| `/docs/manual.pdf` | blocked | `Disallow: /*.pdf$` |
| `/docs/manual.pdf?download=1` | allowed | The `$` anchor; query strings break the match |
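The table can be reproduced with a few lines of Python. This is a sketch of the longest-pattern-first rule as described above, not the code in `robots.py`: `to_regex` and `is_allowed` are hypothetical helpers, and the input is assumed to be the path plus any query string.

```python
import re

# (kind, pattern) pairs in file order, taken from the example robots.txt.
RULES = [
    ("disallow", "/private/"),
    ("disallow", "/*.pdf$"),
    ("allow", "/private/public-page.html"),
]


def to_regex(pattern: str) -> re.Pattern[str]:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape everything, then turn the escaped '*' back into '.*'.
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(path: str) -> bool:
    # Stable sort: longest pattern first, equal lengths keep file order.
    for kind, pattern in sorted(RULES, key=lambda rule: -len(rule[1])):
        if to_regex(pattern).match(path):
            return kind == "allow"
    return True  # no rule matched


for path in ["/about", "/private/secrets", "/private/public-page.html",
             "/docs/manual.pdf", "/docs/manual.pdf?download=1"]:
    print(path, "allowed" if is_allowed(path) else "blocked")
```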
## User-agent matching
yoink matches your configured `user_agent` against the `robots.txt` `User-agent` blocks in this order:
1. **Exact match** (case-insensitive).
2. **Partial match** — bidirectional substring (`a in b or b in a`). For example, `User-agent: yoink` matches the default UA `yoink/0.3.0 (+...)` because `"yoink"` is a substring of the UA.
3. **Wildcard fallback** (`User-agent: *`).
The substring check runs both directions, so a `robots.txt` block with `User-agent: yo` would also match `yoink/0.3.0`. If you publish or consume terse UAs, this can lead to surprising matches — use a distinctive UA string and you'll be fine.
## Caching
`robots.txt` is fetched once per domain and cached for 1 hour. This keeps yoink polite for long crawls without re-fetching `robots.txt` for every URL.
The cache is in-memory and per-`Crawler` instance — a fresh process or a new `Crawler()` will re-fetch.
## Disabling robots.txt checks
You can disable robots.txt enforcement, but it's the website operator's primary signal that they don't want a crawler. If you opt out, you take on the responsibility of knowing why and being able to defend it.
```bash
# CLI
yoink crawl https://example.com --no-robots
```
```python
# Python
config = CrawlConfig(respect_robots=False)
```
When disabled, yoink doesn't fetch `robots.txt` at all and crawls freely subject only to your other config.
## Inspecting the rules
The cleanest way to inspect what `RobotsChecker` saw is to share the `Crawler`'s instance — it already has the `Fetcher` wired up. Here's a one-shot script that prints what it learned about each domain it visited:
```python
import asyncio

from yoink import Crawler, CrawlConfig


async def main():
    crawler = Crawler(CrawlConfig(max_pages=20))
    await crawler.crawl("https://example.com")
    rc = crawler.robots_checker
    if rc is None:
        return  # respect_robots was disabled
    for domain, cached in rc._cache.items():
        for ua, directives in cached.directives.items():
            print(f"[{domain}] User-agent: {ua}")
            print(f" rules: {len(directives.rules)}")
            print(f" crawl_delay: {directives.crawl_delay}")
            print(f" sitemaps: {directives.sitemaps}")


asyncio.run(main())
```
For ad-hoc `is_allowed()` checks, use the public method (it's `async`):
```python
allowed = await crawler.robots_checker.is_allowed("https://example.com/private/")
```
`RobotsChecker` needs a `Fetcher` to download `robots.txt` from the network. The `Crawler` wires this for you. If you want to use `RobotsChecker` outside a `Crawler`, you have to call `set_fetcher(my_fetcher)` with an open `Fetcher` (`async with Fetcher() as f: ...`) before `is_allowed()` will check anything — otherwise it returns `True` unconditionally.
## See also
- [Rate limiting](/docs/concepts/rate-limiting) — how `Crawl-delay` interacts with your `requests_per_second`.
- The `RobotsChecker` source: src/yoink/robots.py.
## JavaScript rendering
_Source: `docs/concepts/javascript-rendering.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/javascript-rendering_
> Use Playwright to render single-page apps and sites whose content lives behind JS execution.
The default `Fetcher` is a fast async HTTP client. It works perfectly for traditional sites, documentation, blogs, and most APIs. But if a site renders content client-side — React/Vue/Svelte SPAs, infinite-scroll pages, content that requires JS execution — you need a real browser.
That's what the **Playwright fetcher** is for.
## Enable JS rendering
Install the optional `browser` extra and download a browser binary:
```bash
pip install -e ".[browser]"  # from a local clone; the PyPI name `yoink` is a different package
playwright install chromium
```
Then turn it on with `--render-js` (CLI) or `render_js=True` (Python):
```bash
yoink crawl https://spa-site.com --render-js
```
```python
config = CrawlConfig(
    render_js=True,
    browser_type="chromium",      # or firefox, webkit
    wait_strategy="networkidle",  # or load, domcontentloaded, commit
    headless=True,
)
```
## How it works
When `render_js=True`, `create_fetcher()` returns a `PlaywrightFetcher` instead of the standard HTTP `Fetcher`. The crawler is otherwise unchanged — same scheduler, same rate limiter, same robots checker.
If you set `render_js=True` but the `playwright` package isn't importable, `create_fetcher()` emits a `UserWarning` and falls back to the HTTP `Fetcher`. The crawl still runs, but without JS rendering. Install the extra from your local clone with `pip install -e ".[browser]" && playwright install chromium` to actually get the browser.
## Wait strategies
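Playwright's notion of "loaded" is different from a plain HTTP fetch. Pick the strategy that matches what you need:
| Strategy | Navigation is considered done when |
| ------------------ | ------------------------------------------------------------------ |
| `load` | The page's `load` event fires. |
| `domcontentloaded` | The `DOMContentLoaded` event fires; subresources may still be loading. |
| `networkidle` | There have been no network connections for at least 500 ms. |
| `commit` | The response has been received and the document has started loading. |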
For sites that render content after `networkidle` (rare, but it happens), use a CSS selector to wait for a specific element:
```bash
yoink crawl https://spa.com --render-js --wait-selector ".article-content"
```
```python
config = CrawlConfig(
    render_js=True,
    wait_selector=".article-content",
)
```
## Browser pooling
Launching a browser is expensive. yoink reuses a pool of browser **contexts** (isolated cookie/localStorage scopes within a single browser process):
```python
config = CrawlConfig(
    render_js=True,
    browser_pool_size=3,  # default
)
```
Workers borrow a context, render the page, and return it. Three contexts is a good default for `max_concurrency=10` — enough that workers rarely block on the pool, few enough that memory stays reasonable.
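The borrow-and-return pattern can be sketched with an `asyncio.Queue`. This is an illustration of pooling in general, not yoink's `PlaywrightFetcher`: the strings stand in for browser contexts and `asyncio.sleep` stands in for rendering.

```python
import asyncio


async def main() -> int:
    # A pool of three placeholder "contexts".
    pool: asyncio.Queue[str] = asyncio.Queue()
    for i in range(3):
        pool.put_nowait(f"context-{i}")

    in_use = 0
    peak = 0

    async def render(url: str) -> None:
        nonlocal in_use, peak
        context = await pool.get()      # blocks while every context is borrowed
        in_use += 1
        peak = max(peak, in_use)
        try:
            await asyncio.sleep(0.01)   # stand-in for rendering `url` in `context`
        finally:
            in_use -= 1
            pool.put_nowait(context)    # always hand the context back

    # Ten workers compete for three contexts.
    await asyncio.gather(*(render(f"https://example.com/{n}") for n in range(10)))
    return peak


peak = asyncio.run(main())
print(peak)
```

However many workers are running, the number of pages rendering at once never exceeds the pool size, which is what keeps memory bounded.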
## Browser choice
| Browser | When to pick it |
| --------- | ---------------------------------------------------------------- |
| chromium | Default. Best site compatibility, fastest startup. |
| firefox | If you need to test against Firefox-specific behavior. |
| webkit | Closest approximation of Safari rendering. |
For data extraction, **Chromium is almost always the right choice.** The other engines exist for testing/cross-browser validation.
## Debugging
Run with a visible browser to watch what's happening:
```bash
yoink crawl https://spa.com --render-js --no-headless
```
For scripted runs that crash mysteriously, point Playwright at a screenshot directory:
```python
config = CrawlConfig(
    render_js=True,
    screenshot_dir="./debug-screenshots",
)
```
Each fetched page gets a PNG dropped in that directory, named `screenshot_<8-char-md5>.png` (e.g., `screenshot_a1b2c3d4.png`), where the 8 chars are the first 8 hex digits of the MD5 of the URL. Eight hex digits are only 32 bits, so a name collision is unlikely on small crawls but becomes more likely than not somewhere around 77,000 pages.
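To estimate how likely a filename collision is for a given crawl size, the standard birthday approximation works well. The sketch below assumes the name is built from the UTF-8 bytes of the URL; `screenshot_name` and `collision_probability` are hypothetical helpers, not part of yoink.

```python
import hashlib
import math


def screenshot_name(url: str) -> str:
    """Build a name in the documented shape: first 8 hex digits of the URL's MD5."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return f"screenshot_{digest[:8]}.png"


def collision_probability(n_urls: int, bits: int = 32) -> float:
    """Birthday approximation: chance that at least two of n_urls share a prefix."""
    space = 2 ** bits
    return 1.0 - math.exp(-n_urls * (n_urls - 1) / (2 * space))


print(screenshot_name("https://example.com/"))
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} URLs: {collision_probability(n):.2%}")
```

Under this approximation the chance of at least one collision is roughly 1% at 10,000 URLs and roughly two in three at 100,000.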
## Cost & throughput
JS rendering is **10–50× slower** than plain HTTP fetching. A page that takes 200ms over HTTP might take 3–8 seconds with Playwright (network + render + wait). Plan accordingly:
- Lower `max_concurrency` (try 5 instead of 20).
- Use `wait_strategy="domcontentloaded"` if you don't need post-mount data.
- Keep `--render-js` off for the parts of your crawl that don't need it. yoink doesn't (yet) auto-detect; that's a per-target decision.
## When NOT to use it
If `curl https://site.com` returns the content you want, you don't need a browser. The default `Fetcher` is faster, lighter, and infinitely more reliable.
Try the HTTP fetcher first. Switch only when content is missing.
## See also
- [`CrawlConfig`](/docs/api/config) — full list of JS-related options.
- The Playwright fetcher source: src/yoink/playwright_fetcher.py.
## Checkpointing
_Source: `docs/concepts/checkpointing.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/checkpointing_
> Resumable crawls with append-only checkpoints. Survive Lambda timeouts, OOM kills, and Ctrl-C.
Long crawls die. Lambda timeouts hit. SSH connections drop. Servers OOM-kill your process. yoink's checkpointing system makes any crawl resumable with two lines of code.
## What gets checkpointed
A checkpoint file is an append-only log of three kinds of records:
1. **Metadata** — start URL, config snapshot, timestamp. Written once at the start.
2. **Pages** — one record per crawled page. Streamed as they finish.
3. **State** — the visited set, the queue, the filtered set. Written periodically and on shutdown.
The format is JSONL with a `type` discriminator on each line:
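As a rough illustration, a log in this shape can be split by record kind with a few lines of Python. Only the `type` discriminator and the three record kinds come from the description above; the type values (`metadata`, `page`, `state`) and every other field name here are assumptions, so inspect a real checkpoint before relying on them.

```python
import json

# Hypothetical checkpoint lines; field names other than "type" are assumed.
lines = [
    '{"type": "metadata", "start_url": "https://example.com"}',
    '{"type": "page", "url": "https://example.com/", "status_code": 200}',
    '{"type": "page", "url": "https://example.com/about", "status_code": 200}',
    '{"type": "state", "visited": ["https://example.com/", "https://example.com/about"], "queue": []}',
]

# Group records by their discriminator.
by_type: dict[str, list[dict]] = {}
for line in lines:
    record = json.loads(line)
    by_type.setdefault(record["type"], []).append(record)

print({kind: len(records) for kind, records in by_type.items()})
```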
## CLI usage
```bash
# Run a crawl with checkpointing
yoink crawl https://example.com --checkpoint ./crawl.jsonl
# It crashed / you Ctrl-C'd. Resume:
yoink crawl https://example.com --checkpoint ./crawl.jsonl --resume
```
The same flags work with S3 URIs:
```bash
yoink crawl https://example.com --checkpoint s3://my-bucket/crawl.jsonl --resume
```
## Python usage
```python
import asyncio

from yoink import Crawler, CrawlConfig, CheckpointManager


async def main():
    config = CrawlConfig(max_pages=10_000)
    # Local file
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
    # ...or S3
    # checkpoint = CheckpointManager.from_uri("s3://my-bucket/crawl.jsonl")
    crawler = Crawler(config=config, checkpoint_manager=checkpoint)
    # If the file exists, pick up where we left off
    pages = await crawler.crawl("https://example.com", resume=True)
    return pages


asyncio.run(main())
```
## Flush interval
Pages are written immediately. State is flushed every N pages (default `10`) and on shutdown:
```python
checkpoint = CheckpointManager.from_uri(
    "./crawl.jsonl",
    flush_interval=50,  # write state every 50 pages
)
```
```bash
yoink crawl https://example.com --checkpoint ./crawl.jsonl --checkpoint-interval 50
```
Lower values give finer-grained resume but cost more I/O. For S3, every flush is an API call, so you generally want a higher interval (50–100).
## Storage backends
`CheckpointManager.from_uri(...)` picks a backend based on the URI scheme:
| URI | Backend | Implementation |
| ---------------------------------- | ------------------ | -------------------- |
| `./relative/path.jsonl` | `LocalFileStorage` | Async aiofiles append |
| `/absolute/path.jsonl` | `LocalFileStorage` | Async aiofiles append |
| `s3://bucket/key.jsonl` | `S3Storage` | Buffered → put_object |
Want a custom backend (Redis, GCS, Azure)? Implement [`CheckpointStorage`](/docs/api/storage) — five async methods.
## How resume works
When you call `crawler.crawl(url, resume=True)`:
1. The checkpoint file is read line by line.
2. **Pages** are restored into `crawler.pages`.
3. **State** restores `scheduler.visited`, `scheduler.queue`, `scheduler.filtered`.
4. If the start URL doesn't match the checkpoint metadata, you get a warning.
5. The crawl continues from the queue.
Restoring visited URLs means yoink will never re-fetch a page that finished before the crash. The crawl picks up exactly where it left off — same depth, same queue order.
## When to use checkpoints
✅ **Use them**
- Crawls expected to take more than 10 minutes.
- Lambda jobs (any execution > 30s).
- Containers that may be killed (autoscaling, spot instances).
- Anywhere the start URL might be re-invoked.
❌ **Skip them**
- Throwaway crawls (one-shot data pulls in dev).
- Tiny crawls where re-running is cheaper than checkpoint I/O.
## See also
- [Lambda + S3 checkpoints example](/docs/examples/lambda-s3) — a complete resumable Lambda handler.
- [`CheckpointManager` API](/docs/api/checkpoint).
- [Storage backends](/docs/api/storage).
## URL filtering
_Source: `docs/concepts/url-filtering.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/url-filtering_
> Include patterns, exclude patterns, file-extension filters, and domain filters — combine for precise targeting.
Most crawls don't want every URL. URL filters tell yoink which pages to follow and which to skip *before* they hit the queue.
## The filter pipeline
For each candidate URL, `CombinedFilter` checks filters in this order. The first one that says "no" wins:
1. `DomainFilter`: the hostname allowlist.
2. `URLFilter`: extension check, then include patterns, then exclude patterns.

`DomainFilter` runs first because it's a fast hostname check; if you've explicitly allowlisted a domain set, everything else is irrelevant for URLs outside it. Inside `URLFilter`, the order is extension → include → exclude, so the cheap path-suffix check runs before any pattern matching.
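The ordering can be sketched as a single function. This is an illustration of the pipeline, not the code in `filters.py`: `should_crawl` here is a hypothetical free function, patterns are treated as `fnmatch`-style globs only, and "at least one include pattern must match" is an assumption about include semantics.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_crawl(
    url: str,
    allowed_domains: list[str],
    skip_extensions: list[str],
    include: list[str],
    exclude: list[str],
) -> bool:
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # 1. Domain: exact host or a true subdomain of an allowed domain.
    if allowed_domains and not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    ):
        return False

    # 2. Extension: compared against the lowercased path, leading dot optional.
    path = parsed.path.lower()
    if any(path.endswith("." + ext.lower().lstrip(".")) for ext in skip_extensions):
        return False

    # 3. Include: if include patterns exist, at least one must match (assumed).
    if include and not any(fnmatchcase(url, p) for p in include):
        return False

    # 4. Exclude: any match rejects the URL.
    return not any(fnmatchcase(url, p) for p in exclude)


rules = dict(
    allowed_domains=["example.com"],
    skip_extensions=["pdf", ".ZIP"],
    include=["*/blog/*"],
    exclude=["*/private/*"],
)
print(should_crawl("https://blog.example.com/blog/post", **rules))
print(should_crawl("https://evil-example.com/blog/post", **rules))
```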
## CLI usage
```bash
yoink crawl https://example.com \
--include "*/blog/*" \
--include "*/docs/*" \
--exclude "*/private/*" \
--skip-extensions pdf,zip,exe
```
- `--include` and `--exclude` are repeatable.
- `--skip-extensions` is comma-separated.
## Python usage
```python
import asyncio

from yoink import Crawler, CrawlConfig
from yoink.filters import CombinedFilter

url_filter = CombinedFilter.from_config(
    include_patterns=["*/blog/*", "*/docs/*"],
    exclude_patterns=["*/private/*"],
    skip_extensions=["pdf", "zip", "exe"],
    allowed_domains=["example.com", "blog.example.com"],
)


async def main():
    crawler = Crawler(config=CrawlConfig(), url_filter=url_filter)
    return await crawler.crawl("https://example.com")


pages = asyncio.run(main())
```
## Pattern syntax
yoink auto-detects the kind of pattern based on its shape:
| Pattern shape | Treated as | Example |
| --------------------------- | ---------------- | ------------------------------------- |
| Contains `*` or `?` | Glob | `*/blog/*`, `*.html` |
| Starts `^` / ends `$` / has `[` | Regex | `^https://example\.com/v\d+/.*$` |
| Anything else | Substring match | `/api/` |
### Glob examples
```python
# All blog posts
"*/blog/*"
# Anything under /docs/, any depth
"*/docs/*"
# Specific URL prefix; `?` matches exactly one character
"https://example.com/posts/?"
```
### Regex examples
```python
# Versioned API URLs
r"^https://api\.example\.com/v\d+/.*$"
# Posts from 2024 or later
r"/posts/(202[4-9]|20[3-9]\d)/.*"
```
### Substring examples
```python
"/api/" # any URL containing /api/
"draft" # any URL containing 'draft'
```
`"/api"` matches `/api/users` *and* `/apiserver/foo`. Use globs (`"*/api/*"`) when you mean a path segment.
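The difference is easy to check with the standard library. A minimal sketch, using `fnmatchcase` so the result does not depend on the platform; whether yoink itself calls `fnmatch` or `fnmatchcase` is not stated here, and the URLs are made up:

```python
from fnmatch import fnmatchcase

urls = [
    "https://example.com/api/users",
    "https://example.com/apiserver/foo",
]

# Substring match: both URLs contain "/api".
substring_hits = [u for u in urls if "/api" in u]

# Glob match: only URLs with a full "/api/" path segment survive.
glob_hits = [u for u in urls if fnmatchcase(u, "*/api/*")]

print(substring_hits)
print(glob_hits)
```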
## Extension filtering
Inside `URLFilter`, `skip_extensions` is checked before include/exclude patterns because it's cheap. It matches the lowercased URL path:
```python
skip_extensions=["pdf", "zip", "exe", "jpg", "png"]
```
You don't need the leading dot — yoink strips it. `pdf`, `.pdf`, and `PDF` all work.
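A rough sketch of that normalization, assuming the check compares the suffix of the URL path and ignores the query string. The helper names `normalize_extensions` and `has_skipped_extension` are hypothetical and only illustrate the behavior described above:

```python
from posixpath import splitext
from urllib.parse import urlsplit


def normalize_extensions(extensions: list[str]) -> set[str]:
    """Lowercase each entry and drop a leading dot: ".PDF" -> "pdf"."""
    return {ext.lower().lstrip(".") for ext in extensions}


def has_skipped_extension(url: str, skip: set[str]) -> bool:
    """Compare the lowercased path suffix, ignoring query string and fragment."""
    path = urlsplit(url).path.lower()
    _, ext = splitext(path)
    return bool(ext) and ext.lstrip(".") in skip


skip = normalize_extensions(["pdf", ".pdf", "PDF", "zip"])
print(skip == {"pdf", "zip"})
print(has_skipped_extension("https://example.com/Manual.PDF?dl=1", skip))
print(has_skipped_extension("https://example.com/docs/", skip))
```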
## Domain filtering
By default, yoink stays on the start URL's domain. With `--follow-external`, it'll follow links anywhere. To allow specific external domains only:
```python
from yoink import Crawler, CrawlConfig
from yoink.filters import DomainFilter, CombinedFilter

domain_filter = DomainFilter(allowed_domains=["example.com", "docs.example.com"])
url_filter = CombinedFilter(domain_filter=domain_filter)

crawler = Crawler(
    config=CrawlConfig(follow_external=True),
    url_filter=url_filter,
)
Domain matching honors subdomains: `allowed_domains=["example.com"]` matches `example.com`, `www.example.com`, and `blog.example.com` — but not `evil-example.com`.
## Combining filters
Use `CombinedFilter.from_config(...)` for the common case:
```python
from yoink.filters import CombinedFilter
url_filter = CombinedFilter.from_config(
    include_patterns=["*/api/*"],
    exclude_patterns=["*/api/internal/*"],
    skip_extensions=["pdf"],
    allowed_domains=["api.example.com"],
)
```
Or compose lower-level filters explicitly:
```python
from yoink.filters import URLFilter, DomainFilter, CombinedFilter
url_filter = CombinedFilter(
    url_filter=URLFilter(
        include_patterns=["*/api/*"],
        exclude_patterns=["*/internal/*"],
        skip_extensions=["pdf"],
    ),
    domain_filter=DomainFilter(allowed_domains=["api.example.com"]),
)
```
## See also
- [Filters API reference](/docs/api/filters).
- The `Filters` source: src/yoink/filters.py.
---
# CLI
## yoink crawl
_Source: `docs/cli/crawl.mdx` · https://yoink.goatsquadstudios.com/docs/cli/crawl_
> Complete reference for the yoink crawl command — every flag, every option, with examples.
`yoink crawl` is the workhorse: it takes a URL, fetches pages, and writes the result.
```bash
yoink crawl URL [OPTIONS]
```
## Examples
```bash
# The minimum
yoink crawl https://example.com
# Reasonable defaults for a small docs crawl
yoink crawl https://docs.example.com -d 2 -n 200 -o docs.jsonl
# JS-heavy SPA, output as Parquet
yoink crawl https://spa.com --render-js --format parquet -o data.parquet
# Resumable to S3
yoink crawl https://example.com \
  --checkpoint s3://my-bucket/crawl.jsonl \
  --resume \
  --rate-limit 5 \
  --depth 3
```
## Core options
| Option              | Type | Default            | Description |
| ------------------- | ---- | ------------------ | ----------- |
| `--output`          |      |                    | Output file path. Skipped if `--checkpoint` is set without `--output`. |
| `--follow-external` | FLAG | `false`            | Follow links to domains other than the start URL's domain. |
| `--save-html`       | FLAG | `false`            | Persist raw HTML on each `Page` record (large output). |
| `--user-agent`      | TEXT | `yoink/ (+github)` | Custom User-Agent string sent on every request. |
## URL filtering
See [URL filtering](/docs/concepts/url-filtering) for pattern semantics.
## Checkpointing
See [Checkpointing](/docs/concepts/checkpointing) for details.
## Rate limiting
See [Rate limiting](/docs/concepts/rate-limiting).
## robots.txt
See [robots.txt compliance](/docs/concepts/robots-txt).
## JavaScript rendering
Requires the `[browser]` extra (`pip install "yoink[browser]"` and `playwright install chromium`).
See [JavaScript rendering](/docs/concepts/javascript-rendering).
## Output
By default, yoink prints progress to stderr and a summary to stdout when finished:
```
Yoinking https://example.com...
Max depth: 2, Max pages: 100, Concurrency: 10
Rate limit: 2.0 req/s, Robots.txt: enabled
Yoinking pages: 100%|████████| 87/100 [00:42<00:00, 2.07page/s]
Yoinked 87 pages to crawl_output.jsonl
Total links found: 1,243
Total text extracted: 412,891 characters
```
Pipe stderr away if you only want the summary:
```bash
yoink crawl https://example.com 2>/dev/null
```
## Exit codes
- `0` — always. The CLI prints errors to stderr but currently exits `0` on every code path, including bad config (e.g., `--resume` without `--checkpoint`) and write errors. If you script around `yoink crawl` and need to detect failure, scan stderr or check that the output file exists and is non-empty.
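Because the exit code carries no signal, a wrapper script has to inspect the output file itself. A minimal sketch of that check, assuming the crawl was asked to write to a known path; `crawl_succeeded` is a hypothetical helper name, not part of yoink:

```python
from pathlib import Path


def crawl_succeeded(output_path: str) -> bool:
    """Treat a crawl as successful only if its output file exists and is non-empty."""
    path = Path(output_path)
    return path.is_file() and path.stat().st_size > 0
```

Call it after the `yoink crawl` subprocess returns, for example `crawl_succeeded("docs.jsonl")`, and fail the surrounding job yourself when it returns `False`.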
## See also
- [`yoink stats`](/docs/cli/stats) — analyze the output of a crawl.
- [Quickstart](/docs/quickstart) — concrete examples.
## yoink stats
_Source: `docs/cli/stats.mdx` · https://yoink.goatsquadstudios.com/docs/cli/stats_
> Analyze a saved crawl — page counts, depth distribution, top domains, content quality metrics.
`yoink stats` reads a crawl output file (JSON or JSONL) and prints a human-readable summary, with optional CSV / JSON export.
```bash
yoink stats FILE [OPTIONS]
```
## Examples
```bash
# Human-readable summary
yoink stats crawl_output.jsonl
# Export to CSV for spreadsheet work
yoink stats crawl_output.jsonl --export stats.csv
# JSON output
yoink stats crawl_output.jsonl --json
```
`yoink stats --json` currently writes one structlog INFO line to stdout before the JSON payload (the `loaded_pages` event). To pipe cleanly into `jq`, strip the first line:
```bash
yoink stats data.jsonl --json | tail -n +2 | jq '.total_pages'
```
This will be fixed in a future release; until then, the workaround is mechanical.
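The same workaround in Python, for scripts that capture the command's stdout instead of piping through `tail`. This is a sketch: `parse_stats_stdout` is a hypothetical helper, and the log line in `sample` is invented for illustration since the exact structlog format is not shown here.

```python
import json


def parse_stats_stdout(stdout: str) -> dict:
    """Drop the first line (the log event) and parse the remainder as JSON."""
    _, _, payload = stdout.partition("\n")
    return json.loads(payload)


# Hypothetical captured output: one log line, then the JSON payload.
sample = '[info] loaded_pages count=87\n{\n  "total_pages": 87\n}\n'
print(parse_stats_stdout(sample)["total_pages"])  # 87
```

Like `tail -n +2`, this drops exactly one line, so it will need revisiting once the extra log line is removed upstream.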
## Options
## What it computes
For every page in the file:
- **Total pages, total links, average links per page**
- **Total text size and average text size** (bytes)
- **Total HTML size** if `--save-html` was used
- **Depth distribution** — how many pages at each depth
- **Unique domains and top 10 domains** by page count
- **Status code distribution**
- **Content quality** — share of pages with text, title, metadata
- **Text length stats** — min / median / max characters
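Two of these metrics can be reproduced with the standard library, which helps when sanity-checking a report. The sketch below is not yoink's implementation; the page records are made up, and excluding pages without text from the median is an assumption.

```python
from collections import Counter
from statistics import median

# Hypothetical page records, reduced to the two fields used here.
pages = [
    {"depth": 0, "text": "home page"},
    {"depth": 1, "text": "a longer article body"},
    {"depth": 1, "text": None},  # extraction failed
]

# Depth distribution: how many pages at each depth.
pages_by_depth = Counter(page["depth"] for page in pages)

# Text length stats, computed only over pages that have text.
text_lengths = [len(page["text"]) for page in pages if page["text"]]

print(dict(pages_by_depth))  # {0: 1, 1: 2}
print(len(text_lengths))     # 2
print(median(text_lengths))  # 15.0
```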
## Sample output
```
============================================================
YOINK Crawl Statistics
============================================================
Total Pages: 87
Total Links: 1,243
Avg Links/Page: 14.29
Content Size:
Total Text: 412.39 KB
Avg Text/Page: 4.74 KB
Domains:
Unique Domains: 1
Top Domains:
- docs.example.com: 87 pages
Depth Distribution:
Depth 0: 1 #
Depth 1: 24 ########################
Depth 2: 62 ##############################################################
Content Quality:
Pages with Text: 85 (97.7%)
Pages with Title: 87 (100.0%)
Pages with Metadata: 73 (83.9%)
Text Length:
Min: 142 chars
Median: 3,891 chars
Max: 28,442 chars
============================================================
```
## JSON output (`--json`)
```json
{
  "total_pages": 87,
  "total_links": 1243,
  "avg_links_per_page": 14.29,
  "total_text_size": 422291,
  "max_depth": 2,
  "pages_by_depth": { "0": 1, "1": 24, "2": 62 },
  "unique_domains": 1,
  "top_domains": [{ "domain": "docs.example.com", "count": 87 }],
  "status_codes": { "200": 87 },
  "pages_with_text": 85,
  "pages_with_title": 87,
  "pages_with_metadata": 73,
  "text_length_min": 142,
  "text_length_median": 3891,
  "text_length_max": 28442
}
```
## See also
- The `CrawlStats` Python API: [`yoink.stats`](/docs/api/stats).
- Output formats: [reference/output-formats](/docs/reference/output-formats).
## yoink version
_Source: `docs/cli/version.mdx` · https://yoink.goatsquadstudios.com/docs/cli/version_
> Print yoink's version and a one-liner description.
```bash
yoink version
```
Output:
```
yoink version 0.1.0
The public data crawler.
```
That's it. Useful for shell scripts and CI pipelines that want to assert a minimum version.
```bash
yoink version | head -1 | awk '{print $3}'
# 0.1.0
```
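For a minimum-version assertion in Python rather than shell, the same third field can be parsed and compared as a tuple. This is a sketch with hypothetical helper names, and it assumes a purely numeric dotted version; a pre-release suffix would need extra handling.

```python
def parse_version(output: str) -> tuple[int, ...]:
    """Extract the numeric version from the first line of `yoink version` output."""
    first_line = output.splitlines()[0]
    version_text = first_line.split()[2]  # third field, as in the awk example
    return tuple(int(part) for part in version_text.split("."))


def meets_minimum(output: str, minimum: tuple[int, ...]) -> bool:
    """Tuple comparison orders versions numerically, so 0.10.0 > 0.9.0."""
    return parse_version(output) >= minimum


captured = "yoink version 0.1.0\nThe public data crawler.\n"
print(parse_version(captured))             # (0, 1, 0)
print(meets_minimum(captured, (0, 1, 0)))  # True
print(meets_minimum(captured, (0, 2, 0)))  # False
```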
You can also use the standard `--version` flag on the root command:
```bash
yoink --version
```
---
# Python API
## Crawler
_Source: `docs/api/crawler.mdx` · https://yoink.goatsquadstudios.com/docs/api/crawler_
> The main async web crawler — wires together the fetcher, parser, scheduler, and rate limiter.
`yoink.Crawler` is the entry point for programmatic use. It owns the worker pool and orchestrates a crawl from a start URL.
## Import
```python
from yoink import Crawler, CrawlConfig
```
## Constructor
```python
Crawler(
    config: CrawlConfig | None = None,
    url_filter: CombinedFilter | None = None,
    checkpoint_manager: CheckpointManager | None = None,
)
```
## Methods
### `crawl(start_url, resume=False)`
Crawl a website starting from `start_url`.
```python
async def crawl(
    self,
    start_url: str,
    resume: bool = False,
) -> list[Page]
```
**Returns:** `list[Page]` — every page yoinked. Note that pages are also accumulated in `crawler.pages`, which you can read mid-crawl from another coroutine.
### `crawl_with_progress(start_url, resume=False)`
Same as `crawl()` but renders a `tqdm` progress bar to stderr. Used by the CLI.
```python
async def crawl_with_progress(
    self,
    start_url: str,
    resume: bool = False,
) -> list[Page]
```
## Attributes
## Examples
### Minimal crawl
```python
import asyncio
from yoink import Crawler

async def main():
    crawler = Crawler()
    pages = await crawler.crawl("https://example.com")
    return pages

asyncio.run(main())
```
### With config and filter
```python
from yoink import Crawler, CrawlConfig
from yoink.filters import CombinedFilter
config = CrawlConfig(
    max_depth=3,
    max_pages=500,
    requests_per_second=5.0,
    render_js=True,
)
url_filter = CombinedFilter.from_config(
    include_patterns=["*/api/*"],
    skip_extensions=["pdf", "zip"],
)
crawler = Crawler(config=config, url_filter=url_filter)
pages = await crawler.crawl("https://docs.example.com")
```
### Mid-crawl progress (custom)
```python
import asyncio
from yoink import Crawler, CrawlConfig

async def report(crawler: Crawler):
    while True:
        await asyncio.sleep(2)
        print(f"...crawled {len(crawler.pages)} pages")

async def main():
    crawler = Crawler(CrawlConfig(max_pages=1000))
    reporter = asyncio.create_task(report(crawler))
    try:
        return await crawler.crawl("https://example.com")
    finally:
        reporter.cancel()

asyncio.run(main())
```
### With checkpointing
See [Checkpointing](/docs/concepts/checkpointing) and [`CheckpointManager`](/docs/api/checkpoint) for full coverage.
```python
from yoink import Crawler, CrawlConfig, CheckpointManager
checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
crawler = Crawler(config=CrawlConfig(), checkpoint_manager=checkpoint)
pages = await crawler.crawl("https://example.com", resume=True)
```
## See also
- [`CrawlConfig`](/docs/api/config) — every knob.
- [`Page`](/docs/api/page) — the per-page output type.
- [Architecture](/docs/concepts/architecture) — how the components fit.
## CrawlConfig
_Source: `docs/api/config.mdx` · https://yoink.goatsquadstudios.com/docs/api/config_
> Every knob — depth, concurrency, rate limit, robots, JS rendering, browser pool. Pydantic-validated.
`CrawlConfig` is a Pydantic model that captures every dial you can turn. Validation runs on construction, so invalid combinations (negative depth, concurrency > 100) fail immediately.
## Import
```python
from yoink import CrawlConfig
from yoink.models import WaitStrategy
```
## Core settings
| Name              | Type   | Default            | Description |
| ----------------- | ------ | ------------------ | ----------- |
| `max_depth`       |        |                    | Validated >= 0. |
| `max_pages`       | `int`  | `100`              | Hard cap on pages crawled. Validated >= 1. |
| `max_concurrency` | `int`  | `10`               | Number of concurrent worker coroutines. Validated 1..100. |
| `user_agent`      | `str`  | `yoink/ (+github)` | User-Agent header sent on every request. |
| `timeout`         | `int`  | `30`               | Per-request timeout in seconds. Validated >= 1. |
| `follow_external` | `bool` | `False`            | If `False`, drop links whose domain differs from the start URL's. |
| `extract_text`    | `bool` | `True`             | Run trafilatura on each page's HTML to populate `Page.text`. |
| `save_html`       | `bool` | `False`            | Persist the raw HTML on each `Page` record. Drastically increases output size. |
## robots.txt
## Rate limiting
## JavaScript rendering
Requires the `[browser]` extra.
## `WaitStrategy` enum
```python
from yoink.models import WaitStrategy
WaitStrategy.LOAD # "load"
WaitStrategy.DOMCONTENTLOADED # "domcontentloaded"
WaitStrategy.NETWORKIDLE # "networkidle"
WaitStrategy.COMMIT # "commit"
```
You can pass a string or an enum value:
```python
config = CrawlConfig(wait_strategy="networkidle") # OK
config = CrawlConfig(wait_strategy=WaitStrategy.NETWORKIDLE) # also OK
```
## Examples
### Minimal
```python
config = CrawlConfig(max_depth=2)
```
### Aggressive but polite
```python
config = CrawlConfig(
    max_depth=4,
    max_pages=10_000,
    max_concurrency=20,
    requests_per_second=10.0,
    follow_external=False,
)
```
### SPA crawl with debug screenshots
```python
from yoink.models import WaitStrategy
config = CrawlConfig(
    render_js=True,
    browser_type="chromium",
    wait_strategy=WaitStrategy.NETWORKIDLE,
    wait_selector=".app-content",
    headless=True,
    browser_pool_size=5,
    screenshot_dir="./debug",
)
```
### Loading from environment / config file
`CrawlConfig` is a standard Pydantic model, so you can use `model_validate()` with a dict from any source:
```python
import json
from yoink import CrawlConfig

with open("crawl.json") as f:
    raw = json.load(f)

config = CrawlConfig.model_validate(raw)
```
## See also
- [`Crawler`](/docs/api/crawler) — uses this config.
- [Configuration reference](/docs/reference/configuration) — quick-scan view of every option.
## Page
_Source: `docs/api/page.mdx` · https://yoink.goatsquadstudios.com/docs/api/page_
> The per-page output type — URL, title, extracted text, links, metadata, status code, depth.
`Page` is the Pydantic model representing one crawled URL.
## Import
```python
from yoink import Page
# or
from yoink.models import Page
```
## Fields
| Name          | Type             | Default | Description |
| ------------- | ---------------- | ------- | ----------- |
| `title`       |                  |         | `<title>` tag content, if present. |
| `text`        | `str \| None`    |         | Clean extracted text from trafilatura. `None` if `extract_text=False` or extraction failed. |
| `html`        | `str \| None`    |         | Raw HTML. Only populated when `save_html=True`. |
| `links`       | `list[str]`      | `[]`    | Outbound links discovered on the page (absolute URLs). |
| `metadata`    | `dict[str, str]` | `{}`    | OpenGraph / Twitter / standard meta tags. |
| `crawled_at`  | `datetime`       |         | UTC timestamp when the page was fetched. |
| `status_code` | `int`            | `200`   | HTTP response status code. |
| `depth`       | `int`            | `0`     | Link-hop distance from the start URL. |
## Methods
`Page` inherits all standard Pydantic v2 methods:
```python
page.model_dump() # → dict
page.model_dump(mode="json") # → JSON-safe dict (datetimes as strings)
page.model_dump_json() # → str
Page.model_validate(data) # construct from dict
Page.model_validate_json(s) # construct from JSON string
```
## Examples
### Inspecting after a crawl
```python
pages = await crawler.crawl("https://example.com")
for page in pages:
    print(f"[{page.status_code}] depth={page.depth} {page.url}")
    print(f"  title: {page.title or '(none)'}")
    print(f"  text: {len(page.text or '')} chars, {len(page.links)} links")
    if "og:image" in page.metadata:
        print(f"  image: {page.metadata['og:image']}")
```
### Reading pages back from JSONL
```python
import json
from yoink import Page
pages: list[Page] = []
with open("crawl_output.jsonl") as f:
    for line in f:
        pages.append(Page.model_validate_json(line))

print(f"Loaded {len(pages)} pages")
```
### Filtering for content quality
```python
# Keep only pages with at least 500 chars of clean text
substantial = [p for p in pages if p.text and len(p.text) >= 500]
# Group by depth
from collections import defaultdict
by_depth = defaultdict(list)
for p in pages:
    by_depth[p.depth].append(p)
```
## JSON shape
When serialized:
```json
{
  "url": "https://example.com/about",
  "title": "About Example",
  "text": "Example is a domain established for...",
  "html": null,
  "links": ["https://example.com/", "https://example.com/contact"],
  "metadata": {
    "description": "About page",
    "og:title": "About Example",
    "og:type": "website"
  },
  "crawled_at": "2026-05-03T12:34:56.789012",
  "status_code": 200,
  "depth": 1
}
```
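The `crawled_at` value in this example has no UTC offset, although the field is described as a UTC timestamp. When reading records without going through `Page`, a sketch like the following attaches the timezone explicitly. It assumes the serialized value is always naive UTC, which is inferred from the example rather than confirmed:

```python
from datetime import datetime, timezone

record = {"crawled_at": "2026-05-03T12:34:56.789012"}

crawled_at = datetime.fromisoformat(record["crawled_at"])
if crawled_at.tzinfo is None:
    # The serialized value carries no offset; the field is documented as UTC.
    crawled_at = crawled_at.replace(tzinfo=timezone.utc)

print(crawled_at.isoformat())  # 2026-05-03T12:34:56.789012+00:00
```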
## See also
- [Output formats](/docs/reference/output-formats) — how pages are serialized to JSON, JSONL, Parquet, text.
- [`Writer`](/docs/api/writers) — how to write pages to files programmatically.
## CheckpointManager
_Source: `docs/api/checkpoint.mdx` · https://yoink.goatsquadstudios.com/docs/api/checkpoint_
> Persist crawl progress to disk or S3 — automatic resume, configurable flush interval.
`CheckpointManager` writes pages and crawl state to a [`CheckpointStorage`](/docs/api/storage) backend. Pass one to `Crawler` to make any crawl resumable.
## Import
```python
from yoink import CheckpointManager
```
## Constructing
The recommended path is `from_uri()`, which picks the right storage backend:
```python
# Local file
checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
# S3 (requires [s3] extra)
checkpoint = CheckpointManager.from_uri("s3://my-bucket/crawl.jsonl")
# With custom flush cadence
checkpoint = CheckpointManager.from_uri(
    "./crawl.jsonl",
    flush_interval=50,
)
```
For full control, build with an explicit storage backend:
```python
from yoink import CheckpointManager
from yoink.storage import LocalFileStorage, S3Storage
storage = S3Storage("s3://my-bucket/crawl.jsonl")
checkpoint = CheckpointManager(storage=storage, flush_interval=50)
```
## API surface
There's no `CheckpointManager.flush()` method. If you need to force a flush from outside the crawler (e.g., before checking the file from another process), call `await manager.storage.flush()` directly.
## Examples
### Resumable local crawl
```python
import asyncio
from yoink import Crawler, CrawlConfig, CheckpointManager

async def main():
    config = CrawlConfig(max_depth=3, max_pages=10_000)
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
    crawler = Crawler(config=config, checkpoint_manager=checkpoint)
    # Pick up where we left off if the file exists
    pages = await crawler.crawl("https://example.com", resume=True)
    print(f"Total pages: {len(pages)}")

asyncio.run(main())
```
### Inspecting a checkpoint
```python
import asyncio
from yoink import CheckpointManager

async def main():
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
    data = await checkpoint.load()
    print(f"Started: {data['metadata']['started_at']}")
    print(f"Start URL: {data['metadata']['start_url']}")
    print(f"Pages saved: {len(data['pages'])}")
    if state := data.get("state"):
        print(f"Visited: {len(state['visited'])}")
        print(f"Queue: {len(state['queue'])}")
        print(f"Filtered: {len(state['filtered'])}")

asyncio.run(main())
```
### Choosing a flush interval
```python
# Aggressive flushing — every page gets persisted state too
checkpoint = CheckpointManager.from_uri("./crawl.jsonl", flush_interval=1)
# Moderate (default) — state every 10 pages
checkpoint = CheckpointManager.from_uri("./crawl.jsonl", flush_interval=10)
# S3 — minimize API calls for cost
checkpoint = CheckpointManager.from_uri("s3://bucket/crawl.jsonl", flush_interval=100)
```
For S3 the trade-off is real: every flush is a `put_object` call. Don't go below 50 unless you have specific reasons.
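A back-of-the-envelope estimate makes the trade-off concrete. The sketch assumes one flush per `flush_interval` pages; the real count may differ slightly (for example, a final flush at shutdown), and `estimated_flushes` is a hypothetical helper.

```python
import math


def estimated_flushes(total_pages: int, flush_interval: int) -> int:
    """Rough estimate: one flush per `flush_interval` pages, rounded up."""
    return math.ceil(total_pages / flush_interval)


for interval in (1, 10, 100):
    print(interval, estimated_flushes(10_000, interval))
# 1 10000
# 10 1000
# 100 100
```

For a 10,000-page crawl, moving from an interval of 1 to 100 cuts the estimated number of uploads from 10,000 to 100.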
## File format
A checkpoint is a JSONL file with `type` discriminators:
```jsonl
{"type": "metadata", "start_url": "...", "config": {...}, "started_at": "..."}
{"type": "page", "url": "...", "title": "...", ...}
{"type": "page", "url": "...", "title": "...", ...}
{"type": "state", "visited": [...], "queue": [...], "filtered": [...]}
```
This is intentionally readable and `grep`-friendly. You can hand-edit a checkpoint to remove a problematic page or trim the queue.
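Because every line is a standalone JSON object with a `type` field, a checkpoint can also be inspected with a few lines of standard-library Python. The sketch below tallies records by type; the sample content is hypothetical, with the elided fields reduced to a minimum, and `count_record_types` is not part of yoink.

```python
import io
import json
from collections import Counter

# Hypothetical checkpoint contents with the elided fields filled in minimally.
sample = io.StringIO(
    '{"type": "metadata", "start_url": "https://example.com"}\n'
    '{"type": "page", "url": "https://example.com/"}\n'
    '{"type": "page", "url": "https://example.com/about"}\n'
    '{"type": "state", "visited": [], "queue": [], "filtered": []}\n'
)


def count_record_types(lines) -> Counter:
    """Tally records by their `type` discriminator, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line)["type"]] += 1
    return counts


print(count_record_types(sample))
```

Passing an open file handle instead of the `StringIO` object works the same way, since both yield one line at a time.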
## See also
- [Checkpointing concepts](/docs/concepts/checkpointing).
- [`CheckpointStorage`](/docs/api/storage) — the storage backend interface.
- [Lambda + S3 example](/docs/examples/lambda-s3).
## Filters
_Source: `docs/api/filters.mdx` · https://yoink.goatsquadstudios.com/docs/api/filters_
> URLFilter, DomainFilter, and CombinedFilter — pattern matching, extension filtering, domain allowlists.
```python
from yoink.filters import URLFilter, DomainFilter, CombinedFilter
```
## `URLFilter`
Pattern-based URL filtering. Auto-detects glob, regex, or substring patterns.
```python
URLFilter(
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    skip_extensions: list[str] | None = None,
)
```
```python
url_filter = URLFilter(
    include_patterns=["*/blog/*", "*/docs/*"],
    exclude_patterns=["*/private/*", r"^.*\?draft=1$"],
    skip_extensions=["pdf", "zip", "exe"],
)
url_filter.should_crawl("https://example.com/blog/post-1") # True
url_filter.should_crawl("https://example.com/private/x") # False
url_filter.should_crawl("https://example.com/manual.pdf") # False
```
## `DomainFilter`
Domain allowlist with subdomain matching.
```python
DomainFilter(allowed_domains: list[str] | None = None)
```
```python
domain_filter = DomainFilter(allowed_domains=["example.com"])
domain_filter.should_crawl("https://example.com/page") # True
domain_filter.should_crawl("https://blog.example.com/x") # True (subdomain)
domain_filter.should_crawl("https://other.com/page") # False
domain_filter.should_crawl("https://evil-example.com/x") # False
```
Subdomain matching: a URL passes if its hostname **is** an allowed domain or **ends with** `.{allowed_domain}`.
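That rule fits in a few lines. The following is a sketch of the stated rule, not yoink's source; `domain_allowed` is a hypothetical name and the lowercasing is an added assumption.

```python
from urllib.parse import urlsplit


def domain_allowed(url: str, allowed_domains: list[str]) -> bool:
    """True if the hostname equals an allowed domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in (d.lower() for d in allowed_domains)
    )


allowed = ["example.com"]
print(domain_allowed("https://blog.example.com/x", allowed))  # True
print(domain_allowed("https://evil-example.com/x", allowed))  # False
```

The leading dot in the suffix check is what keeps `evil-example.com` out: it ends with `example.com` but not with `.example.com`.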
## `CombinedFilter`
Composes a `URLFilter` and a `DomainFilter`. This is what `Crawler` accepts.
```python
CombinedFilter(
    url_filter: URLFilter | None = None,
    domain_filter: DomainFilter | None = None,
)
```
The most ergonomic constructor is `from_config()`:
```python
CombinedFilter.from_config(
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    skip_extensions: list[str] | None = None,
    allowed_domains: list[str] | None = None,
) -> CombinedFilter
```
```python
url_filter = CombinedFilter.from_config(
    include_patterns=["*/api/*"],
    exclude_patterns=["*/internal/*"],
    skip_extensions=["pdf"],
    allowed_domains=["api.example.com"],
)
crawler = Crawler(config=CrawlConfig(), url_filter=url_filter)
```
## Pattern dispatch
| Pattern shape | Matched as |
| -------------------------------------- | ---------------- |
| Contains `*` or `?` | Glob (fnmatch) |
| Starts `^`, ends `$`, or contains `[` | Regex (re.match) |
| Anything else | Substring (`in`) |
See [URL filtering](/docs/concepts/url-filtering) for examples.
## Custom filters
Anything implementing `should_crawl(url: str) -> bool` works as a filter. To plug it into the crawler, wrap it with a tiny adapter or use it directly:
```python
from datetime import datetime, timezone

from yoink.filters import URLFilter

class WeekendOnlyFilter:
    def should_crawl(self, url: str) -> bool:
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time
        return datetime.now(timezone.utc).weekday() >= 5  # Sat/Sun

# CombinedFilter accepts anything with a url_filter or domain_filter slot
# that has .should_crawl, so subclassing is the cleanest path:
class MyURLFilter(URLFilter):
    def should_crawl(self, url: str) -> bool:
        if "?utm" in url:
            return False
        return super().should_crawl(url)
```
## See also
- [URL filtering concepts](/docs/concepts/url-filtering).
- The `Filters` source: src/yoink/filters.py.
## Storage backends
_Source: `docs/api/storage.mdx` · https://yoink.goatsquadstudios.com/docs/api/storage_
> CheckpointStorage interface, LocalFileStorage, S3Storage, and the StorageFactory.
Storage backends are how `CheckpointManager` persists records. yoink ships two — local files and S3 — and the interface is small enough to add your own (Redis, GCS, Azure Blob, etc.).
```python
from yoink.storage import (
    CheckpointStorage,  # abstract base
    LocalFileStorage,
    S3Storage,
    StorageFactory,
)
```
## `CheckpointStorage` interface
Every backend implements five async methods:
```python
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator

class CheckpointStorage(ABC):
    @abstractmethod
    async def write(self, data: str) -> None: ...

    @abstractmethod
    async def read(self) -> AsyncIterator[str]: ...

    @abstractmethod
    async def exists(self) -> bool: ...

    @abstractmethod
    async def flush(self) -> None: ...

    @abstractmethod
    async def close(self) -> None: ...
```
## `LocalFileStorage`
Async append to a local file via `aiofiles`.
```python
LocalFileStorage(path: str)
```
```python
storage = LocalFileStorage("./crawl.jsonl")
```
- Opens the file in append mode on first `write()`.
- `flush()` calls the underlying `flush()` on the file handle (OS will still buffer to disk; pair with `fsync` if you need durability guarantees beyond the crawl).
- `close()` closes the file handle.
## `S3Storage`
Buffered S3 backend using `aioboto3`. Requires the `[s3]` extra.
```python
S3Storage(uri: str) # s3://bucket/key
```
```python
storage = S3Storage("s3://my-bucket/crawls/site-a.jsonl")
```
**Behavior:**
- `write()` buffers in memory.
- `flush()` downloads existing object (if any), appends the buffer, re-uploads via `put_object`. This is necessary because S3 objects don't support append.
- `read()` does a single `get_object` and yields lines.
- `exists()` does `head_object`.
S3 has no native append. Each flush is a download-mutate-upload. Set `flush_interval` to 50–100+ for production crawls; the sweet spot depends on page size and how upset you'd be losing the most-recent buffer on a crash.
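The download-mutate-upload cycle can be illustrated without touching AWS. The sketch below uses an in-memory stand-in for a bucket; `FakeObjectStore` and `BufferedAppender` are hypothetical classes, not yoink's `S3Storage` or any `aioboto3` API, and skipping the upload when the buffer is empty is an assumption.

```python
class FakeObjectStore:
    """In-memory stand-in for a bucket: whole-object get and put only."""

    def __init__(self) -> None:
        self.objects: dict[str, str] = {}
        self.put_calls = 0

    def get(self, key: str) -> str:
        return self.objects.get(key, "")

    def put(self, key: str, body: str) -> None:
        self.objects[key] = body
        self.put_calls += 1


class BufferedAppender:
    """Buffers lines in memory; flush() rewrites the whole object."""

    def __init__(self, store: FakeObjectStore, key: str) -> None:
        self.store = store
        self.key = key
        self.buffer: list[str] = []

    def write(self, line: str) -> None:
        self.buffer.append(line)

    def flush(self) -> None:
        if not self.buffer:
            return
        existing = self.store.get(self.key)                       # download
        updated = existing + "".join(f"{x}\n" for x in self.buffer)  # append
        self.store.put(self.key, updated)                         # re-upload
        self.buffer.clear()


store = FakeObjectStore()
appender = BufferedAppender(store, "crawl.jsonl")
for record in ("a", "b", "c"):
    appender.write(record)
appender.flush()
appender.write("d")
appender.flush()
print(store.put_calls)                             # 2
print(store.objects["crawl.jsonl"].splitlines())   # ['a', 'b', 'c', 'd']
```

Four records cost two uploads here, and the second upload re-sent the first three records as well, which is why larger flush intervals pay off as the object grows.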
### Required IAM permissions
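```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::your-bucket-name/*"
  }]
}
```
There is no separate `s3:HeadObject` IAM action; `head_object` calls are authorized by `s3:GetObject`. Without `s3:ListBucket` on the bucket, S3 answers a `head_object` for a missing key with 403 rather than 404.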
### Credentials
`aioboto3` uses the standard boto3 credential chain. Locally, run `aws configure`. On Lambda / EC2 / ECS, attach an IAM role to the runtime and `S3Storage` will pick it up automatically.
## `StorageFactory`
Picks a backend based on URI scheme. This is what `CheckpointManager.from_uri()` uses internally.
```python
StorageFactory.from_uri("./checkpoint.jsonl")
# → LocalFileStorage
StorageFactory.from_uri("/abs/path.jsonl")
# → LocalFileStorage
StorageFactory.from_uri("s3://bucket/key.jsonl")
# → S3Storage
```
## Implementing a custom backend
Implementing the interface is roughly 80 lines. Here's a sketch for Redis:
```python
import redis.asyncio as redis
from yoink.storage import CheckpointStorage

class RedisStreamStorage(CheckpointStorage):
    def __init__(self, url: str, key: str):
        self.client = redis.from_url(url)
        self.key = key

    async def write(self, data: str) -> None:
        # Append each record to a Redis list
        await self.client.rpush(self.key, data)

    async def read(self):
        for raw in await self.client.lrange(self.key, 0, -1):
            yield raw.decode("utf-8")

    async def exists(self) -> bool:
        return bool(await self.client.exists(self.key))

    async def flush(self) -> None:
        # RPUSH is applied immediately, so there is nothing to buffer or flush
        pass

    async def close(self) -> None:
        await self.client.aclose()
```
Then plug it in:
```python
from yoink import CheckpointManager, Crawler, CrawlConfig
storage = RedisStreamStorage("redis://localhost", "yoink:crawl-1")
checkpoint = CheckpointManager(storage=storage)
crawler = Crawler(config=CrawlConfig(), checkpoint_manager=checkpoint)
```
## See also
- [`CheckpointManager`](/docs/api/checkpoint).
- [Checkpointing concepts](/docs/concepts/checkpointing).
## CrawlStats
_Source: `docs/api/stats.mdx` · https://yoink.goatsquadstudios.com/docs/api/stats_
> Compute, format, and export statistics from a crawl — depth distribution, top domains, content quality.
`CrawlStats` analyzes a list of `Page` objects (or loads them from a file) and produces summary metrics. It powers the `yoink stats` CLI but is also fine to use programmatically.
## Import
```python
from yoink.stats import CrawlStats
```
## Constructing
```python
from pathlib import Path

# From a list of Page objects (e.g., right after a crawl)
stats = CrawlStats(pages)
# From a saved file (.json or .jsonl)
stats = CrawlStats.from_file(Path("crawl_output.jsonl"))
```
## Methods
## What `compute()` returns
```python
{
    "total_pages": 87,
    "total_links": 1243,
    "total_text_size": 422291,  # bytes
    "total_html_size": 0,       # 0 if save_html=False
    "avg_links_per_page": 14.29,
    "avg_text_size": 4853.92,
    "avg_html_size": 0,
    "max_depth": 2,
    "pages_by_depth": { 0: 1, 1: 24, 2: 62 },
    "unique_domains": 1,
    "top_domains": [{ "domain": "docs.example.com", "count": 87 }],
    "status_codes": { 200: 87 },
    "pages_with_text": 85,
    "pages_with_title": 87,
    "pages_with_metadata": 73,
    "text_length_min": 142,
    "text_length_median": 3891,
    "text_length_max": 28442,
}
```
If you pass an empty list of pages, `compute()` short-circuits and returns just `{"total_pages": 0}` — none of the other keys above will be present. If you read `data["pages_with_text"]` blindly, that'll `KeyError` on an empty crawl. Check `total_pages > 0` first or use `data.get(...)`.
## Examples
### After a crawl
```python
from yoink import Crawler, CrawlConfig
from yoink.stats import CrawlStats

async def main():
    crawler = Crawler(CrawlConfig())
    pages = await crawler.crawl("https://example.com")
    stats = CrawlStats(pages)
    print(stats.format_summary())
```
### From a saved file
```python
from pathlib import Path
from yoink.stats import CrawlStats
stats = CrawlStats.from_file(Path("crawl_output.jsonl"))
data = stats.compute()
print(f"Got {data['total_pages']} pages across {data['unique_domains']} domains")
print(f"Median page text: {data['text_length_median']} chars")
```
### Filtering by content quality
```python
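data = stats.compute()
# An empty crawl returns only {"total_pages": 0}, so guard before dividing
if data["total_pages"] > 0:
    text_share = data["pages_with_text"] / data["total_pages"]
    if text_share < 0.5:
        print("⚠ Less than half the pages had extractable text; site may be JS-heavy")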
```
### Export
```python
stats.export_csv(Path("crawl_stats.csv"))
```
The CSV has two sections:
```
Metric,Value
Total Pages,87
Total Links,1243
...
Top Domains,Count
docs.example.com,87
```
## See also
- The CLI version: [`yoink stats`](/docs/cli/stats).
## Writers
_Source: `docs/api/writers.mdx` · https://yoink.goatsquadstudios.com/docs/api/writers_
> Serialize a list of Page objects to JSON, JSONL, Parquet, or plain text.
`Writer` is a static helper class with one method per output format. Used internally by the CLI; available directly for programmatic use.
## Import
```python
from yoink.writers import Writer
```
## Methods
## Examples
### After a crawl
```python
from pathlib import Path
from yoink import Crawler
from yoink.writers import Writer

async def main():
    crawler = Crawler()
    pages = await crawler.crawl("https://example.com")
    Writer.write_jsonl(pages, Path("data.jsonl"))
    Writer.write_parquet(pages, Path("data.parquet"))
```
### Parquet for analytics
```python
from pathlib import Path

import pandas as pd

from yoink.writers import Writer

Writer.write_parquet(pages, Path("data.parquet"))
df = pd.read_parquet("data.parquet")
print(df["depth"].value_counts())
print(df.groupby("depth")["num_links"].mean())
```
The Parquet schema is **flattened** for analytical queries.
### Text dump
```python
Writer.write_text(pages, Path("dump.txt"))
```
Output:
```
URL: https://example.com
Title: Example Domain
--------------------------------------------------------------------------------
This domain is for use in illustrative examples in documents...
================================================================================
URL: https://example.com/about
Title: About
...
```
When a page has no `title`, the field is rendered as `Title: N/A`.
## Output format choice
| Format | Best for | Streaming? | Compressed? |
| ---------- | ------------------------------------------- | ---------- | ----------- |
| `jsonl` | AI / ML pipelines, large datasets | yes | no |
| `json` | Small datasets, debugging | no | no |
| `parquet` | Analytics, pandas, columnar storage | yes (rows) | yes (snappy)|
| `text` | Quick eyeballing, archival | no | no |
For most use cases, **JSONL is the default answer.**
## See also
- [Output formats reference](/docs/reference/output-formats).
- [`Page`](/docs/api/page) — what gets serialized.
---
# Examples
## Basic crawl
_Source: `docs/examples/basic.mdx` · https://yoink.goatsquadstudios.com/docs/examples/basic_
> A minimal end-to-end example — crawl, save, inspect.
The fewest lines of code that do something useful.
[`examples/basic_crawl.py`](https://github.com/ErikkJs/yoink/blob/master/examples/basic_crawl.py) is a minimal starter — just a `Crawler()` with defaults that prints page summaries. The script below extends that to also save JSONL and print stats. Use whichever fits.
## Script
Save this as `my_crawl.py`:
```python
import asyncio
from pathlib import Path
from yoink import Crawler, CrawlConfig
from yoink.writers import Writer
from yoink.stats import CrawlStats
async def main():
    config = CrawlConfig(
        max_depth=2,
        max_pages=50,
        requests_per_second=2.0,
    )
    crawler = Crawler(config=config)
    pages = await crawler.crawl("https://example.com")
    # Save to JSONL
    output = Path("example.jsonl")
    Writer.write_jsonl(pages, output)
    print(f"Saved {len(pages)} pages to {output}")
    # Print summary
    stats = CrawlStats(pages)
    print(stats.format_summary())
asyncio.run(main())
```
## Run it
```bash
python my_crawl.py
```
## What you get
1. `example.jsonl` — one JSON object per page.
2. A formatted summary printed to stdout (depth distribution, top domains, content quality).
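To sanity-check the first output without going through `CrawlStats`, a stdlib-only sketch like the one below can count pages per depth. It assumes each JSONL line carries a `depth` field, as in the record shape shown in the output formats reference; `depth_histogram` is a hypothetical helper, not part of yoink.
```python
import json
from collections import Counter
from pathlib import Path


def depth_histogram(path: Path) -> Counter:
    """Count pages per crawl depth in a JSONL file, one line at a time."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                counts[json.loads(line)["depth"]] += 1
    return counts
```
Calling `depth_histogram(Path("example.jsonl"))` after the script has run should give a rough picture of how deep the crawl went.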
## Variations
### Save HTML too
```python
config = CrawlConfig(
    max_depth=2,
    max_pages=50,
    save_html=True,  # raw HTML on each Page record
)
```
### Multiple output formats
```python
Writer.write_jsonl(pages, Path("data.jsonl"))
Writer.write_parquet(pages, Path("data.parquet"))
Writer.write_text(pages, Path("data.txt"))
```
### Filter file types
```python
from yoink.filters import CombinedFilter
url_filter = CombinedFilter.from_config(
    skip_extensions=["pdf", "zip", "exe", "jpg", "png"],
)
crawler = Crawler(config=config, url_filter=url_filter)
```
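For intuition about what extension matching involves, here is a minimal stdlib sketch. It is not `CombinedFilter`'s implementation; `has_skipped_extension` is a hypothetical helper, and matching on the URL path only (ignoring the query string) is an assumption about sensible behavior.
```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def has_skipped_extension(url: str, skip_extensions: list[str]) -> bool:
    """Return True when the URL path ends in one of the listed extensions."""
    # Look only at the path, so "?file=a.pdf" in the query string is ignored.
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower().lstrip(".")
    return suffix in {ext.lower().lstrip(".") for ext in skip_extensions}
```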
## Same thing on the CLI
```bash
yoink crawl https://example.com -d 2 -n 50 -o example.jsonl
yoink stats example.jsonl
```
## See also
- [Quickstart](/docs/quickstart) — the same idea, even shorter.
- [`Crawler`](/docs/api/crawler) and [`CrawlConfig`](/docs/api/config) for the full API.
## AI training data
_Source: `docs/examples/ai-training.mdx` · https://yoink.goatsquadstudios.com/docs/examples/ai-training_
> Build a clean, deduplicated text dataset suitable for fine-tuning or RAG.
This is the canonical use case yoink was built for: turn a documentation site (or any structured public source) into a clean JSONL ready to feed into a training pipeline or vector database.
[`examples/ai_training_data.py`](https://github.com/ErikkJs/yoink/blob/master/examples/ai_training_data.py) is a simpler starter (length-filter + JSONL + stats, no dedup or token budgeting). The script below adds hash-based dedup, length clipping, and a token-count estimate — copy whichever matches your needs.
## The pipeline
## Script
```python
import asyncio
import hashlib
import json
from pathlib import Path
from yoink import Crawler, CrawlConfig
from yoink.filters import CombinedFilter
MIN_TEXT_CHARS = 500
MAX_TEXT_CHARS = 50_000
async def build_dataset(start_url: str, output: Path):
    config = CrawlConfig(
        max_depth=3,
        max_pages=10_000,
        max_concurrency=15,
        requests_per_second=5.0,
        extract_text=True,
        save_html=False,      # we don't need it
        respect_robots=True,  # always
    )
    url_filter = CombinedFilter.from_config(
        skip_extensions=["pdf", "zip", "exe", "jpg", "png", "gif", "mp4"],
        exclude_patterns=["*/print/*", "*/edit/*", r".*\?diff=.*"],
    )
    crawler = Crawler(config=config, url_filter=url_filter)
    pages = await crawler.crawl(start_url)
    # Dedup by text hash (different URLs, same content)
    seen_hashes: set[str] = set()
    written = 0
    skipped = 0  # pages with no text or too little text
    deduped = 0  # pages whose text was already written
    with open(output, "w", encoding="utf-8") as f:
        for page in pages:
            text = page.text
            if not text or len(text) < MIN_TEXT_CHARS:
                skipped += 1
                continue
            if len(text) > MAX_TEXT_CHARS:
                text = text[:MAX_TEXT_CHARS]  # clip long pages, don't drop them
            h = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if h in seen_hashes:
                deduped += 1
                continue
            seen_hashes.add(h)
            record = {
                "id": h[:16],
                "source_url": page.url,
                "title": page.title,
                "text": text,
                "tokens_approx": len(text) // 4,
                "depth": page.depth,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return {
        "crawled": len(pages),
        "written": written,
        "skipped": skipped,
        "deduped": deduped,
    }
if __name__ == "__main__":
    result = asyncio.run(build_dataset(
        "https://docs.example.com",
        Path("training_data.jsonl"),
    ))
    print(f"Crawled: {result['crawled']}")
    print(f"Written: {result['written']}")
    print(f"Skipped: {result['skipped']}")
    print(f"Deduped: {result['deduped']}")
```
## What this does
1. **Polite crawl** — 5 RPS, respects robots.txt, stays on the start domain.
2. **Skip binaries** — no PDFs, images, or zips muddying the text dataset.
3. **Skip noise** — `print/`, `edit/`, and `?diff=` URLs typically duplicate canonical content.
4. **Filter on length** — drop pages with too little text (chrome-only) and clip pages with too much (likely concatenated-everything pages) to `MAX_TEXT_CHARS`.
5. **Dedupe by hash** — different URLs with identical extracted text get collapsed.
6. **Token estimate** — a rough `len(text) // 4` works well enough for budgeting.
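Steps 4 and 5 can be tested in isolation, without running a crawl. The sketch below is a standalone restatement of that logic on plain strings; `filter_and_dedupe` and its `stats` keys are hypothetical names chosen for illustration.
```python
import hashlib


def filter_and_dedupe(texts, min_chars=500, max_chars=50_000):
    """Return (kept_texts, stats) after length filtering and hash dedup."""
    seen: set[str] = set()
    kept: list[str] = []
    stats = {"too_short": 0, "duplicate": 0, "clipped": 0}
    for text in texts:
        if not text or len(text) < min_chars:
            stats["too_short"] += 1
            continue
        if len(text) > max_chars:
            text = text[:max_chars]  # clip rather than drop
            stats["clipped"] += 1
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            stats["duplicate"] += 1
            continue
        seen.add(digest)
        kept.append(text)
    return kept, stats
```
Because clipping happens before hashing, two long pages that share their first `max_chars` characters count as duplicates. Hash the full text first if that is not what you want.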
## Loading it back
```python
import json
with open("training_data.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(f"{len(records)} records, {sum(r['tokens_approx'] for r in records):,} approx tokens")
```
## Variations
### For a vector index (chunking)
```python
from textwrap import wrap
def chunks(text: str, size: int = 1000):
    return wrap(text, size, replace_whitespace=False, drop_whitespace=False)
# in the loop:
for i, chunk in enumerate(chunks(text)):
    record = {
        "id": f"{h[:16]}-{i}",
        "source_url": page.url,
        "chunk_index": i,
        "text": chunk,
    }
    ...
```
### Including metadata for filtering
```python
record = {
    "id": h[:16],
    "source_url": page.url,
    "title": page.title,
    "text": text,
    "description": page.metadata.get("description"),
    "og_type": page.metadata.get("og:type"),
    "depth": page.depth,
    "crawled_at": page.crawled_at.isoformat(),
}
```
## See also
- [URL filtering concepts](/docs/concepts/url-filtering).
- [`CrawlConfig`](/docs/api/config) — every knob.
- [`Page`](/docs/api/page) — what's available on each record.
## Checkpoint & resume
_Source: `docs/examples/checkpoint-resume.mdx` · https://yoink.goatsquadstudios.com/docs/examples/checkpoint-resume_
> Three patterns for resumable crawls — local file, S3 across processes, and a Lambda handler that survives 15-minute timeouts.
The repo's [`examples/checkpoint_resume.py`](https://github.com/ErikkJs/yoink/blob/master/examples/checkpoint_resume.py) bundles three runnable scenarios that map cleanly onto real production patterns. This page walks through each.
## 1. Local file checkpoint
The simplest case: a long crawl on your laptop that you want to be able to Ctrl-C and resume later.
```python
import asyncio
from yoink import Crawler, CrawlConfig, CheckpointManager
async def main():
    config = CrawlConfig(
        max_depth=2,
        max_pages=100,
        max_concurrency=10,
    )
    checkpoint = CheckpointManager.from_uri(
        "./crawl.jsonl",
        flush_interval=5,  # state snapshot every 5 pages
    )
    crawler = Crawler(config=config, checkpoint_manager=checkpoint)
    # First run or resume: same call either way
    pages = await crawler.crawl("https://example.com", resume=True)
    print(f"Crawled {len(pages)} pages")
try:
    asyncio.run(main())
except KeyboardInterrupt:
    # Ctrl-C surfaces from asyncio.run(), not inside the coroutine
    print("Interrupted! Checkpoint saved. Re-run to resume.")
```
On a fresh run with no checkpoint, `resume=True` starts from scratch. With an existing checkpoint, it restores `visited`, `queue`, and previously-yoinked pages, then continues from the queue.
## 2. S3 checkpoint (cross-process)
Same code, different URI. Useful when the crawl might run on different hosts (e.g., a fresh container picks up where a killed one left off):
```python
config = CrawlConfig(max_depth=2, max_pages=1000, max_concurrency=20)
checkpoint = CheckpointManager.from_uri(
"s3://my-crawl-bucket/checkpoints/example.jsonl",
flush_interval=10, # higher → fewer S3 API calls
)
crawler = Crawler(config=config, checkpoint_manager=checkpoint)
pages = await crawler.crawl("https://example.com", resume=True)
```
Requires the `s3` extra (`pip install -e ".[s3]"`) and AWS credentials in the environment / IAM role / `~/.aws/credentials`.
Each flush is a download-mutate-upload round-trip on the S3 object (S3 has no native append). For a small/medium crawl, `flush_interval=10` keeps the API call rate sensible without losing more than ~10 pages of state on a crash.
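The trade-off can be put into rough numbers. The sketch below assumes one GET and one PUT per flush, which is how the download-mutate-upload round-trip is described above; `s3_calls_for_crawl` is a hypothetical helper and the real request count may differ (for example, if the first flush finds no object to download).
```python
import math


def s3_calls_for_crawl(pages: int, flush_interval: int) -> dict:
    """Estimate S3 requests, assuming one GET plus one PUT per flush."""
    flushes = math.ceil(pages / flush_interval)
    return {
        "flushes": flushes,
        "gets": flushes,
        "puts": flushes,
        "max_pages_at_risk": flush_interval,  # state not yet flushed on a crash
    }
```
Under these assumptions a 1,000-page crawl with `flush_interval=10` performs about 100 flushes, while `flush_interval=50` cuts that to about 20 at the cost of up to ~50 pages of state on a crash.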
## 3. Lambda handler
This is the pattern that makes long crawls survive Lambda's hard 15-minute timeout. Each invocation crawls for ~14 minutes, checkpoints to S3, and exits. EventBridge re-invokes; the next run resumes from the same checkpoint.
```python
async def lambda_handler():
    event = {"url": "https://example.com"}
    checkpoint = CheckpointManager.from_uri(
        "s3://my-crawl-bucket/lambda-checkpoints/crawl.jsonl",
        flush_interval=10,
    )
    config = CrawlConfig(max_pages=5000, max_concurrency=30, max_depth=3)
    crawler = Crawler(config=config, checkpoint_manager=checkpoint)
    pages = await crawler.crawl(event["url"], resume=True)
    return {
        "statusCode": 200,
        "body": {
            "pages_crawled": len(pages),
            "message": "Crawl completed or checkpoint saved for next invocation",
        },
    }
```
For the full deployment recipe (IAM role, EventBridge schedule, layer build), see [Lambda + S3 checkpoints](/docs/examples/lambda-s3).
## Running the bundled file
```bash
poetry run python examples/checkpoint_resume.py
```
By default this runs the local-file scenario. The S3 and Lambda scenarios are commented out at the bottom of the file — uncomment after you've configured AWS credentials.
## See also
- [Checkpointing concepts](/docs/concepts/checkpointing) — file format, resume semantics, when to use.
- [`CheckpointManager` API](/docs/api/checkpoint).
- [Storage backends](/docs/api/storage).
## Lambda + S3 checkpoints
_Source: `docs/examples/lambda-s3.mdx` · https://yoink.goatsquadstudios.com/docs/examples/lambda-s3_
> A resumable AWS Lambda crawler that survives 15-minute timeouts via S3 checkpoints.
AWS Lambda has a hard 15-minute execution limit. A crawl that wants to survive longer than that needs to checkpoint and resume across invocations. With yoink, that's about 20 lines of code.
## The architecture
## Lambda handler
```python
import asyncio
import json
import os
from yoink import Crawler, CrawlConfig, CheckpointManager
CHECKPOINT_BUCKET = os.environ["CHECKPOINT_BUCKET"]
CHECKPOINT_KEY = os.environ["CHECKPOINT_KEY"]  # e.g. "crawls/example-com.jsonl"
START_URL = os.environ["START_URL"]
MAX_PAGES = int(os.environ.get("MAX_PAGES", "10000"))
# Reserve ~60s of the 900s function timeout for Lambda housekeeping
TIME_BUDGET_SECONDS = 14 * 60
async def crawl_chunk():
    config = CrawlConfig(
        max_depth=4,
        max_pages=MAX_PAGES,
        max_concurrency=20,
        requests_per_second=10.0,
    )
    checkpoint_uri = f"s3://{CHECKPOINT_BUCKET}/{CHECKPOINT_KEY}"
    checkpoint = CheckpointManager.from_uri(checkpoint_uri, flush_interval=50)
    crawler = Crawler(config=config, checkpoint_manager=checkpoint)
    # Resume picks up if checkpoint exists, else starts fresh
    pages = await asyncio.wait_for(
        crawler.crawl(START_URL, resume=True),
        timeout=TIME_BUDGET_SECONDS,
    )
    return pages
def handler(event, context):
    try:
        pages = asyncio.run(crawl_chunk())
        # crawl() returned on its own: the queue is empty or max_pages was reached.
        # (Comparing len(pages) to MAX_PAGES would never finish on a smaller site.)
        done = True
    except asyncio.TimeoutError:
        # Hit the time budget; we'll resume on the next invocation
        done = False
        pages = []
    return {
        "statusCode": 200,
        "body": json.dumps({
            "pages_so_far": len(pages),
            "done": done,
        }),
    }
```
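Instead of a fixed constant, the time budget can be derived from the Lambda context object, which exposes `get_remaining_time_in_millis()`. The sketch below shows the arithmetic; `time_budget_seconds`, the 60-second margin, and `FakeContext` (a local stand-in for testing) are illustrative assumptions, not part of yoink.
```python
SAFETY_MARGIN_SECONDS = 60  # assumed headroom for flushing and returning


def time_budget_seconds(context, margin: float = SAFETY_MARGIN_SECONDS) -> float:
    """Seconds the crawl may run before the handler should stop."""
    remaining = context.get_remaining_time_in_millis() / 1000
    return max(0.0, remaining - margin)


class FakeContext:
    """Stand-in for the Lambda context object, for local testing only."""

    def __init__(self, remaining_ms: int):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self) -> int:
        return self._remaining_ms
```
Passing the result as the `timeout` to `asyncio.wait_for` would keep the budget correct even if the function's configured timeout changes.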
## Deploy
### IAM role
The Lambda execution role needs:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-checkpoint-bucket/*"
    }
  ]
}
```
### Lambda layer
Bundle yoink and its `[s3]` extra into a layer:
```bash
mkdir -p layer/python
pip install --target layer/python "yoink[s3]"
cd layer && zip -r ../yoink-layer.zip python && cd ..
aws lambda publish-layer-version \
--layer-name yoink \
--zip-file fileb://yoink-layer.zip \
--compatible-runtimes python3.11
```
### Function
```bash
aws lambda create-function \
--function-name yoink-crawler \
--runtime python3.11 \
--role arn:aws:iam::ACCOUNT:role/yoink-crawler-role \
--handler handler.handler \
--timeout 900 \
--memory-size 1024 \
--layers arn:aws:lambda:REGION:ACCOUNT:layer:yoink:1 \
--zip-file fileb://handler.zip \
--environment "Variables={CHECKPOINT_BUCKET=...,CHECKPOINT_KEY=crawls/example.jsonl,START_URL=https://example.com}"
```
### Schedule
```bash
aws events put-rule \
--name yoink-crawler-tick \
--schedule-expression "rate(14 minutes)"
aws events put-targets \
--rule yoink-crawler-tick \
--targets "Id=1,Arn=arn:aws:lambda:REGION:ACCOUNT:function:yoink-crawler"
aws lambda add-permission \
--function-name yoink-crawler \
--statement-id allow-eventbridge \
--action lambda:InvokeFunction \
--principal events.amazonaws.com \
--source-arn arn:aws:events:REGION:ACCOUNT:rule/yoink-crawler-tick
```
## Observability
A few things worth logging:
```python
import structlog
log = structlog.get_logger()
# in handler:
log.info("invocation_complete",
pages_so_far=len(pages),
done=done,
checkpoint=checkpoint_uri,
)
```
You can read the checkpoint file from anywhere with read access — `aws s3 cp`, the AWS console, or a small Lambda that loads it via `CheckpointManager.from_uri(...).load()`.
## Stopping the schedule
When `done=True`, disable the EventBridge rule (or have the Lambda do it):
```python
import boto3
if done:
boto3.client("events").disable_rule(Name="yoink-crawler-tick")
```
Don't wait for the whole crawl to finish before doing something with it. The checkpoint file is JSONL — kick off a parallel Lambda or Glue job that tails it and processes new lines.
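A tailing consumer can be sketched as "remember a byte offset, parse only the complete lines after it". The helper below works on a local copy of the file (for S3 you would download the object first) and assumes earlier lines never change between reads; `read_new_records` is a hypothetical name, and the records are treated as opaque JSON objects.
```python
import json
from pathlib import Path


def read_new_records(path: Path, offset: int = 0) -> tuple[list[dict], int]:
    """Parse complete JSONL lines after byte `offset`; return them and the new offset."""
    records: list[dict] = []
    with path.open("rb") as f:
        f.seek(offset)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # partial line still being written; retry next time
            offset += len(raw)
            if raw.strip():
                records.append(json.loads(raw))
    return records, offset
```
Persist the returned offset between runs so each invocation only processes lines it has not seen.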
## See also
- [Checkpointing concepts](/docs/concepts/checkpointing).
- [`CheckpointManager`](/docs/api/checkpoint).
- [Storage backends](/docs/api/storage).
## Custom extraction
_Source: `docs/examples/custom-extraction.mdx` · https://yoink.goatsquadstudios.com/docs/examples/custom-extraction_
> Replace or augment the default text extractor with domain-specific logic.
Trafilatura is great for general-purpose article extraction. But sometimes you need to extract structured data — product specs, schema.org JSON-LD, GitHub READMEs — and want to bypass or augment the default extractor.
[`examples/custom_extraction.py`](https://github.com/ErikkJs/yoink/blob/master/examples/custom_extraction.py) demonstrates lightweight post-processing — link-counting by domain, keyword search, metadata inspection. The recipes on this page go further (subclassing `Extractor`, parsing JSON-LD, handling PDFs).
## Approach 1: post-process `Page.html`
If you set `save_html=True`, every page record carries the raw HTML. You can run any extractor over it after the crawl.
```python
import asyncio
import json
from bs4 import BeautifulSoup
from yoink import Crawler, CrawlConfig
async def main():
    config = CrawlConfig(
        max_depth=2,
        save_html=True,      # we need raw HTML
        extract_text=False,  # skip trafilatura
    )
    crawler = Crawler(config=config)
    pages = await crawler.crawl("https://example.com/products")
    products = []
    for page in pages:
        if not page.html:
            continue
        soup = BeautifulSoup(page.html, "lxml")
        # Pull schema.org JSON-LD
        for tag in soup.find_all("script", type="application/ld+json"):
            try:
                data = json.loads(tag.string or "")
            except json.JSONDecodeError:
                continue
            if isinstance(data, dict) and data.get("@type") == "Product":
                products.append({
                    "url": page.url,
                    "name": data.get("name"),
                    "price": data.get("offers", {}).get("price"),
                    "currency": data.get("offers", {}).get("priceCurrency"),
                })
    return products
products = asyncio.run(main())
print(f"Extracted {len(products)} products")
```
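Real-world JSON-LD is not always a single object: the top level may be a list, nodes may sit under `@graph`, `@type` may be a list, and `offers` may be a list of offers. The script above only handles the simplest shape. A slightly more tolerant sketch, with hypothetical helper names `iter_products` and `first_offer`, could look like this:
```python
def iter_products(data):
    """Yield Product nodes from a parsed JSON-LD value of any common shape."""
    if isinstance(data, list):
        for item in data:
            yield from iter_products(item)
    elif isinstance(data, dict):
        types = data.get("@type")
        if types == "Product" or (isinstance(types, list) and "Product" in types):
            yield data
        yield from iter_products(data.get("@graph", []))


def first_offer(product: dict) -> dict:
    """Return the first offer as a dict; `offers` may be a dict, a list, or absent."""
    offers = product.get("offers") or {}
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    return offers if isinstance(offers, dict) else {}
```
In the loop above, `for product in iter_products(data)` combined with `first_offer(product).get("price")` would replace the `isinstance(data, dict)` check.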
## Approach 2: subclass the `Extractor`
For invasive changes, replace the extractor entirely. `Crawler.__init__` builds its own `Extractor`, so the cleanest path is a small subclass of `Crawler`:
```python
from yoink import Crawler, CrawlConfig
from yoink.extractor import Extractor
class MarkdownExtractor(Extractor):
    def extract(self, html: str, url: str) -> str:
        # Replace the trafilatura call with markdownify, html2text,
        # readability-lxml, or your own logic.
        from markdownify import markdownify
        return markdownify(html, heading_style="ATX")
class MarkdownCrawler(Crawler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.extractor = MarkdownExtractor()
# Use it like any Crawler (run this inside an async function)
crawler = MarkdownCrawler(config=CrawlConfig())
pages = await crawler.crawl("https://docs.example.com")
# page.text is now markdown
```
## Approach 3: extract during the crawl with metadata
The default `Parser` already extracts standard meta tags into `Page.metadata`. If you need extra fields, parse them in a wrapper:
```python
from yoink import Crawler, CrawlConfig
from bs4 import BeautifulSoup
class EnrichedCrawler(Crawler):
    async def _worker(self, fetcher, worker_id):
        # Defer to the parent, then enrich each page after it's added
        await super()._worker(fetcher, worker_id)
# That works in principle, but the most pragmatic approach is to enrich AFTER:
crawler = Crawler(config=CrawlConfig(save_html=True))
pages = await crawler.crawl("https://docs.example.com")
for page in pages:
    if page.html:
        soup = BeautifulSoup(page.html, "lxml")
        # Pull custom metadata
        published = soup.find("meta", attrs={"name": "article:published_time"})
        if published:
            page.metadata["published_at"] = published.get("content", "")
```
## Approach 4: PDF or other non-HTML content
yoink doesn't ship a PDF extractor, but you can post-process easily:
```python
import asyncio
from io import BytesIO
import requests
from pypdf import PdfReader
from yoink import Crawler, CrawlConfig
async def main():
    config = CrawlConfig(extract_text=False)  # we'll do our own
    crawler = Crawler(config=config)
    pages = await crawler.crawl("https://example.com/papers")
    for page in pages:
        if page.url.endswith(".pdf"):
            # Re-fetch as binary (yoink fetched it as a string, which mangled bytes).
            # requests is blocking; that is acceptable here because the crawl has finished.
            content = requests.get(page.url, timeout=30).content
            reader = PdfReader(BytesIO(content))
            page.text = "\n\n".join(p.extract_text() for p in reader.pages)
asyncio.run(main())
```
yoink optimizes for HTML and clean text. For weird formats — PDFs, video transcripts, structured data feeds — pull just the URLs you need with yoink (use [URL filtering](/docs/concepts/url-filtering)) and process them with format-specific tooling afterwards.
## See also
- The default `Extractor`: `src/yoink/extractor.py`.
- The default `Parser`: `src/yoink/parser.py`.
---
# Reference
## Output formats
_Source: `docs/reference/output-formats.mdx` · https://yoink.goatsquadstudios.com/docs/reference/output-formats_
> JSON, JSONL, Parquet, and plain text — exact shapes, when to use each.
yoink writes crawl results in four formats. They all carry the same `Page` data, but differ in shape, streamability, and compression.
## Format comparison
| Format | Best for | Streamable | Compressed | Extras needed |
| --------- | ------------------------- | ---------- | ---------- | ------------- |
| `jsonl` | AI/ML, large datasets | yes (rows) | no | — |
| `json` | Small datasets, debugging | no | no | — |
| `parquet` | Analytics, pandas | yes (rows) | snappy | `[parquet]` |
| `text` | Eyeballing | no | no | — |
## JSON
A single JSON array. Easy to read, easy to break: large arrays must be loaded entirely into memory.
```bash
yoink crawl https://example.com -f json -o data.json
```
```json
[
{
"url": "https://example.com",
"title": "Example Domain",
"text": "...",
"html": null,
"links": ["https://example.com/about"],
"metadata": {},
"crawled_at": "2026-05-03T12:00:00",
"status_code": 200,
"depth": 0
},
{ "url": "https://example.com/about", "...": "..." }
]
```
## JSONL (recommended)
Newline-delimited JSON. One `Page` per line. Streamable and `grep`-friendly.
```bash
yoink crawl https://example.com -f jsonl -o data.jsonl
```
```jsonl
{"url": "https://example.com", "title": "Example Domain", ...}
{"url": "https://example.com/about", "title": "About", ...}
```
Reading back into `Page` objects:
```python
from yoink import Page
with open("data.jsonl", encoding="utf-8") as f:
    pages = [Page.model_validate_json(line) for line in f]
```
Streaming (don't load it all):
```python
from yoink import Page
def iter_pages(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield Page.model_validate_json(line)
for page in iter_pages("data.jsonl"):
    process(page)  # your own per-page function
```
## Parquet
Columnar storage. Smaller files, faster analytical queries. Requires `pip install "yoink[parquet]"`.
```bash
yoink crawl https://example.com -f parquet -o data.parquet
```
The schema is **flattened** — `links` becomes `num_links`, `metadata` becomes a JSON-encoded string. This is intentional: it keeps the file portable and analytical queries fast.
Compression is `snappy` for fast read/write. Read with pandas / pyarrow / DuckDB:
```python
import pandas as pd
df = pd.read_parquet("data.parquet")
# Or DuckDB for SQL
import duckdb
duckdb.sql("SELECT depth, count(*) FROM 'data.parquet' GROUP BY depth").show()
```
Parquet drops the per-page `links` array (only `num_links` is preserved) and never writes `html` even when `save_html=True`. If you need either, use JSONL.
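The flattening described in this section can be illustrated on a plain dict. This is a sketch of the documented behavior, not the `Writer.write_parquet` implementation; `flatten_page` is a hypothetical helper and the output column order is not meant to match the real schema.
```python
import json


def flatten_page(page: dict) -> dict:
    """Flatten one page dict the way the Parquet section describes."""
    # Drop the nested or bulky fields, then add their flattened stand-ins.
    row = {k: v for k, v in page.items() if k not in ("links", "metadata", "html")}
    row["num_links"] = len(page.get("links") or [])
    row["metadata"] = json.dumps(page.get("metadata") or {}, ensure_ascii=False)
    return row
```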
## Text
Plain text dump. Good for archival and quick visual inspection.
```bash
yoink crawl https://example.com -f text -o data.txt
```
Format:
```
URL: https://example.com
Title: Example Domain
--------------------------------------------------------------------------------
This domain is for use in illustrative examples in documents.
================================================================================
URL: https://example.com/about
Title: About
--------------------------------------------------------------------------------
...
```
This format is one-way — you can't reliably load it back into `Page` objects. Use JSONL for round-tripping.
## Choosing
- **AI training / RAG indexing?** JSONL.
- **Pandas / DuckDB / Athena?** Parquet.
- **Throwaway one-shot?** JSON.
- **Quick read?** Text.
## See also
- [`Writer`](/docs/api/writers) — programmatic output.
- [`Page`](/docs/api/page) — the underlying data shape.
## Configuration reference
_Source: `docs/reference/configuration.mdx` · https://yoink.goatsquadstudios.com/docs/reference/configuration_
> Quick-scan reference for every configuration option, organized by section.
This page is the dense, scrolling reference. For prose explanations, see the corresponding concept pages.
## Core
| Option | Type | Default | Description |
| ----------------- | ------ | ------------------ | ------------------------------------------------------- |
| `max_depth` | `int` | | Max link-hop distance from start URL. See Architecture. |
| `max_pages` | `int` | `100` | Total page cap. |
| `max_concurrency` | `int` | `10` | Worker coroutines (1..100). |
| `user_agent` | `str` | `yoink/ (+github)` | User-Agent header. |
| `timeout` | `int` | `30` | Per-request timeout (seconds). |
| `follow_external` | `bool` | `False` | Follow links to other domains. |
| `extract_text` | `bool` | `True` | Run trafilatura for clean text. |
| `save_html` | `bool` | `False` | Persist raw HTML on each Page. |
## Rate limiting
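| Option | Type | Default | Description |
| --------------------- | ------- | ------- | ------------------------------------------------------- |
| `requests_per_second` | `float` | | Per-domain token-bucket fill rate. See Rate limiting. |
| `request_delay` | `float` | `0.0` | Minimum seconds between requests to same domain. |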
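For readers unfamiliar with the term, a token bucket can be sketched as follows. This is a generic illustration of the technique, not yoink's rate limiter: the `TokenBucket` class and its `burst` parameter are hypothetical names, and timestamps are passed in explicitly so the behavior is deterministic (real code would typically use `time.monotonic()`).
```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens/second, holds at most `burst`."""

    def __init__(self, rate: float, burst: float, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full so an initial burst is allowed
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        """Take one token if one is available at time `now` (seconds)."""
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```
A per-domain limiter would keep one such bucket per hostname and wait, rather than fail, when `try_acquire` returns `False`.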
## robots.txt
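| Option | Type | Default | Description |
| ---------------- | ------ | ------- | ------------------------------------------- |
| `respect_robots` | `bool` | | Fetch and apply robots.txt. See robots.txt. |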
## JavaScript rendering (requires `[browser]`)
| Option | Type | Default | Description |
| ------------------- | -------------- | ------------- | ------------------------------------------------------------ |
| `render_js` | `bool` | | Use Playwright. See JS rendering. |
| `headless` | `bool` | `True` | Run browser without a UI window. |
| `wait_strategy` | `WaitStrategy` | `NETWORKIDLE` | `load`, `domcontentloaded`, `networkidle`, or `commit`. |
| `wait_selector` | `str \| None` | `None` | CSS selector to wait for. |
| `browser_type` | `Literal` | `chromium` | `chromium`, `firefox`, or `webkit`. |
| `browser_pool_size` | `int` | `3` | Pooled browser contexts (1..10). |
| `screenshot_dir` | `str \| None` | `None` | Debug screenshots directory. |
## URL filtering (separate from `CrawlConfig`)
Pass to `Crawler(url_filter=...)`. See [`CombinedFilter.from_config`](/docs/api/filters).
## Checkpointing (separate from `CrawlConfig`)
Pass to `Crawler(checkpoint_manager=...)`. See [`CheckpointManager`](/docs/api/checkpoint).
## CLI flag mapping
| CLI flag | Config field |
| ----------------------- | ----------------------------- |
| `--depth, -d` | `max_depth` |
| `--max-pages, -n` | `max_pages` |
| `--concurrency, -c` | `max_concurrency` |
| `--user-agent` | `user_agent` |
| `--follow-external` | `follow_external` |
| `--save-html` | `save_html` |
| `--rate-limit, -r` | `requests_per_second` |
| `--request-delay` | `request_delay` |
| `--no-robots` | `respect_robots=False` |
| `--render-js, --browser`| `render_js` |
| `--wait-for` | `wait_strategy` |
| `--wait-selector` | `wait_selector` |
| `--browser-type` | `browser_type` |
| `--no-headless` | `headless=False` |
| `--include` | `url_filter.include_patterns` |
| `--exclude` | `url_filter.exclude_patterns` |
| `--skip-extensions` | `url_filter.skip_extensions` |
| `--checkpoint` | `CheckpointManager.from_uri` |
| `--checkpoint-interval` | `flush_interval` |
| `--resume` | `crawler.crawl(resume=True)` |
## See also
- [`CrawlConfig`](/docs/api/config) — the Pydantic model itself.
- [CLI: yoink crawl](/docs/cli/crawl) — flag-by-flag.