# yoink — full documentation

> Fast, async Python web crawler with rate limiting, robots.txt compliance, optional JavaScript rendering, and resumable S3-backed checkpoints.

This file concatenates every documentation page so you can paste the whole thing into an AI assistant's context.

---

# Getting started

## Introduction

_Source: `docs/introduction.mdx` · https://yoink.goatsquadstudios.com/docs/introduction_

> A fast, async Python web crawler for extracting AI-ready data from public websites.

**yoink** is a focused, well-tested Python crawler that turns public websites into clean, structured data. It's the tool you reach for when you want to build a training set, mirror a documentation site, audit an API surface, or run a research crawl — without hand-rolling the boring parts.

## What's in the box

- **Async architecture** built on `aiohttp` with configurable concurrency
- **Clean text extraction** via [trafilatura](https://github.com/adbar/trafilatura) — no nav chrome, no boilerplate
- **Per-domain rate limiting** using a token bucket with burst support
- **robots.txt compliance** out of the box, including `Crawl-delay` and `Sitemap` directives
- **JavaScript rendering** via Playwright for SPAs (optional extra)
- **Resumable crawls** with append-only checkpoints to disk or S3
- **URL filtering** with glob, regex, and extension matching
- **First-class output formats** — JSON, JSONL, Parquet, plain text
- **Built-in stats** for inspecting what you yoinked

## Design principles

yoink is intentionally small (~3,200 lines of Python, 134 passing tests). The hard parts — HTTP, HTML parsing, text extraction, browser automation — are delegated to libraries that have been battle-tested for years.

1. **Polite by default.** Respects `robots.txt`, identifies itself, rate-limits per domain, stays on the start domain.
2. **Pluggable, not magic.** Swap fetchers, storage backends, filters, and extractors without forking the crawler.
3. **Resumable, always.** Long crawls die.
Lambda runs time out. yoink should pick up where it left off.
4. **Output is the product.** Clean JSONL/Parquet that drops straight into your pipeline beats a fancy CLI.

## When to use yoink

✅ **Good fit**

- You want a few hundred to a few hundred thousand public pages, fast.
- You're feeding an LLM, building an embedding index, or training a model.
- You're mirroring documentation, doing SEO research, or running content analysis.
- You're shipping a Lambda job that needs to survive restarts.

❌ **Not the right tool**

- You need to log in, solve CAPTCHAs, or scrape adversarial sites that explicitly forbid it.
- You want a UI-driven scraping product. yoink is a library + CLI.
- You need millions of pages a day at sustained throughput. Look at distributed systems like Apache Nutch.

## Where to next

- [Installation](/docs/installation) — install from source, plus optional extras.
- [Quickstart](/docs/quickstart) — your first crawl in 30 seconds.
- [Architecture](/docs/concepts/architecture) — how the moving parts fit together.

## Installation

_Source: `docs/installation.mdx` · https://yoink.goatsquadstudios.com/docs/installation_

> Install yoink from source, including optional extras for Parquet, S3, and JavaScript rendering.

yoink supports **Python 3.11+** and runs on Linux, macOS, and Windows.

## Standard install

yoink is currently distributed from source on GitHub:

```bash
git clone https://github.com/ErikkJs/yoink
cd yoink
pip install -e .
```

If you use [Poetry](https://python-poetry.org/), the project ships a `pyproject.toml`:

```bash
git clone https://github.com/ErikkJs/yoink
cd yoink
poetry install
```

The bare name `yoink` on PyPI is taken by an unrelated package (a podcast downloader). Until this crawler is published under a distinct distribution name, install from source as shown above. Running `pip install yoink` will fetch the wrong package.

You can also install directly from the GitHub URL without cloning:

```bash
pip install "git+https://github.com/ErikkJs/yoink.git"
```

## Optional extras

yoink keeps heavy dependencies behind extras so the core stays lean.

| Extra     | Adds             | When you need it                                      |
| --------- | ---------------- | ----------------------------------------------------- |
| `parquet` | `pyarrow`        | Writing crawl output as columnar Parquet files        |
| `s3`      | `aioboto3`       | Checkpointing to AWS S3 (Lambda, EC2, ECS workloads)  |
| `browser` | `playwright`     | Rendering JavaScript-heavy sites and SPAs             |
| `all`     | All of the above | When you don't want to think about it                 |

```bash
# Install with one extra (from a local clone)
pip install -e ".[parquet]"

# Multiple extras
pip install -e ".[s3,parquet]"

# Everything
pip install -e ".[all]"

# Or from GitHub
pip install "yoink[all] @ git+https://github.com/ErikkJs/yoink.git"
```

### Playwright browsers

The `browser` extra installs the Playwright Python package, but you also need to download the actual browser binaries (Chromium / Firefox / WebKit):

```bash
pip install -e ".[browser]"
playwright install chromium
```

For containerized environments, use `playwright install --with-deps chromium` to install both the browser and the system libraries it needs.

### S3 credentials

The `s3` extra brings in `aioboto3`, but the SDK still needs credentials. Any of these work:

```bash
# 1. AWS CLI profile (recommended for local development)
aws configure

# 2. Environment variables
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1

# 3. IAM role (automatic on EC2 / ECS / Lambda)
# No configuration needed
```

Minimum IAM permissions for the bucket you're checkpointing to:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}
```

`HeadObject` requests are authorized by `s3:GetObject`; IAM has no separate `s3:HeadObject` action.

## Verify the install

```bash
yoink version
# yoink version 0.1.0
# The public data crawler.
```

You're ready. Head to the [quickstart](/docs/quickstart).

## Quickstart

_Source: `docs/quickstart.mdx` · https://yoink.goatsquadstudios.com/docs/quickstart_

> Yoink your first website in under a minute — CLI and Python.

This page gets you from zero to a finished crawl twice: once on the CLI, once in Python.

## Your first CLI crawl

```bash
yoink crawl https://example.com
```

That's it. yoink will:

1. Fetch the start URL, parse it, extract text, and follow links.
2. Default to depth `1` and `100 pages` — adjust with `--depth` and `--max-pages`.
3. Rate-limit to `2` requests per second per domain and respect `robots.txt`.
4. Write results to `crawl_output.jsonl` in the current directory.

Open the file:

```bash
head -1 crawl_output.jsonl | python -m json.tool
```

## A more useful crawl

```bash
yoink crawl https://docs.python.org \
  --depth 2 \
  --max-pages 50 \
  --include "*/tutorial/*" \
  --skip-extensions pdf,zip \
  --format jsonl \
  -o python_tutorial.jsonl
```

What's happening:

- `--depth 2` follows two link hops from the start URL.
- `--include "*/tutorial/*"` only crawls URLs matching that glob.
- `--skip-extensions pdf,zip` ignores binary file links.
- `--format jsonl -o python_tutorial.jsonl` streams one JSON object per page to disk.

Then inspect what you got:

```bash
yoink stats python_tutorial.jsonl
```

You'll see total pages, link counts, depth distribution, top domains, and content quality metrics.

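
For a quick look at the output from Python rather than the shell, the sketch below counts the records in a JSONL file and tallies them by status code. It assumes each line is a JSON object with a `status_code` key, mirroring the `page.status_code` attribute used in the Python API examples; the serialized field names may differ, so check one record with `json.tool` first. `summarize_jsonl` is a hypothetical helper, not part of yoink.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Return (record_count, {status_code: count}) for a JSONL crawl output file."""
    total = 0
    statuses = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            total += 1
            # `status_code` is an assumed key; records without it are tallied under None.
            statuses[record.get("status_code")] += 1
    return total, dict(statuses)
```

Calling `summarize_jsonl("python_tutorial.jsonl")` after the crawl above would give a rough sanity check before handing the file to a larger pipeline.
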
## Your first Python crawl

```python
import asyncio
from yoink import Crawler, CrawlConfig

async def main():
    config = CrawlConfig(
        max_depth=2,
        max_pages=100,
        max_concurrency=10,
        requests_per_second=2.0,
    )
    crawler = Crawler(config=config)
    pages = await crawler.crawl("https://example.com")

    for page in pages:
        print(f"{page.status_code} {page.url}")
        print(f"  title: {page.title}")
        print(f"  text: {len(page.text or '')} chars")

asyncio.run(main())
```

## Resumable crawls

Long crawls die. Plan for it from day one with checkpointing:

```python
import asyncio
from yoink import Crawler, CrawlConfig, CheckpointManager

async def main():
    config = CrawlConfig(max_depth=3, max_pages=10_000)
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
    crawler = Crawler(config=config, checkpoint_manager=checkpoint)

    # Pick up where we left off if the file already exists
    pages = await crawler.crawl("https://docs.example.com", resume=True)
    return pages

asyncio.run(main())
```

Same on the CLI:

```bash
# First run — interrupted with Ctrl-C
yoink crawl https://docs.example.com --checkpoint ./crawl.jsonl

# Resume
yoink crawl https://docs.example.com --checkpoint ./crawl.jsonl --resume
```

Replace the path with `s3://my-bucket/crawl.jsonl` and yoink will buffer writes and flush them to S3. Survives Lambda timeouts and restarts. See [Lambda + S3 checkpoints](/docs/examples/lambda-s3).

## What to read next

- **Concepts** — [architecture](/docs/concepts/architecture), [rate limiting](/docs/concepts/rate-limiting), [JS rendering](/docs/concepts/javascript-rendering).
- **CLI reference** — every `yoink crawl` flag, [explained](/docs/cli/crawl).
- **Python API** — [`Crawler`](/docs/api/crawler), [`CrawlConfig`](/docs/api/config), [`CheckpointManager`](/docs/api/checkpoint).

---

# Concepts

## Architecture

_Source: `docs/concepts/architecture.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/architecture_

> How yoink's modules fit together — fetcher, parser, scheduler, rate limiter, robots checker, checkpoint, storage.

yoink is built as a small graph of single-purpose modules. Each one does one thing, and the `Crawler` is the conductor that wires them together.

## The components

| Module | Responsibility |
| --- | --- |
| `Crawler` | Owns the worker pool, drives the crawl loop, persists results. |
| `Fetcher` | Async HTTP client (aiohttp). Up to 3 attempts with exponential backoff on `ClientError`; immediate retry on `TimeoutError`; HTTP errors (4xx/5xx) returned as-is, not retried. |
| `PlaywrightFetcher` | Browser-based fetcher for JS-heavy sites. |
| `create_fetcher` | Factory function in `fetcher_factory.py`. Returns `PlaywrightFetcher` when `render_js=True` (and Playwright is importable), `Fetcher` otherwise. Emits `UserWarning` and falls back to `Fetcher` if `render_js=True` but Playwright isn't installed. |
| `Parser` | HTML → title, links, metadata (BeautifulSoup + lxml). |
| `Extractor` | HTML → clean text via trafilatura. |
| `Scheduler` | URL queue, depth tracking, deduplication, filter integration. |
| `RateLimiter` | Per-domain token bucket with `Crawl-delay` support. |
| `RobotsChecker` | Fetches & caches `robots.txt`, answers `is_allowed(url)`. |
| `URLFilter` / `DomainFilter` / `CombinedFilter` | Glob/regex/extension/domain matching (`filters.py`). |
| `CheckpointManager` | Append page records and crawl state to a `CheckpointStorage`. |
| `CheckpointStorage` | Pluggable backend (local file, S3). |
| `Writer` | Final-output serialization (JSON, JSONL, Parquet, text). |
| `CrawlStats` | Post-crawl analysis (depth, domains, content quality). |

## Lifecycle of a crawl

## Why this shape?

**Async workers, not threads.** Crawling is I/O bound. `asyncio` lets one process handle hundreds of concurrent requests without the overhead of OS threads.

**A real queue, not recursion.** Depth-limited BFS gives predictable memory usage and clean depth metadata.
The scheduler also owns deduplication, so workers can't accidentally re-fetch the same URL.

**Rate limiting at the gate.** Token bucket per domain — workers compete for tokens, so even if you've got 50 concurrent requests, no single domain sees more than `requests_per_second`.

**Checkpoints as an append log.** Pages stream to checkpoint as soon as they're crawled, so a crash never costs you more than the in-flight batch. State (visited set, queue, filters) is written at the end and on every flush interval.

## Where to extend

- **Custom storage backend** — implement [`CheckpointStorage`](/docs/api/storage) (`write`, `read`, `exists`, `flush`, `close`) for Redis, GCS, Azure Blob, etc.
- **Custom filter** — implement `should_crawl(url) -> bool` and pass it via the `url_filter` argument to `Crawler`.
- **Custom extractor** — replace the default trafilatura-based `Extractor` for domain-specific extraction (PDFs, schema.org parsing, etc.).

## Code locations

The full source lives at github.com/ErikkJs/yoink/tree/master/src/yoink. ~3,200 lines of Python across 18 files, each module focused, with 134 passing tests under `tests/`.

## Rate limiting

_Source: `docs/concepts/rate-limiting.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/rate-limiting_

> Per-domain token bucket rate limiting with burst support, minimum delays, and Crawl-delay honoring.

yoink rate-limits at the **fetcher gate** — every outbound request has to acquire a token before going out. This protects target servers, keeps you on the right side of `robots.txt` `Crawl-delay`, and avoids tripping basic anti-abuse heuristics.

## The mechanism: token bucket

A token bucket fills at a constant rate (your `requests_per_second`) and holds a fixed maximum number of tokens (its capacity). Each request consumes one token. If the bucket is empty, the request waits until a token regenerates.

This gives you smooth traffic shaping at sustained `requests_per_second`.

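
To make the mechanism concrete, here is a minimal token bucket sketch. It is not yoink's `RateLimiter`: the `TokenBucket` class and its `reserve` method are hypothetical, the clock is passed in explicitly so the arithmetic is easy to follow, and the bucket is allowed to go negative so that each waiting caller is told how long to sleep (one common way to implement "wait until a token regenerates").

```python
class TokenBucket:
    """Illustrative token bucket; `rate` is tokens per second, `capacity` is the burst size."""

    def __init__(self, rate, capacity=1, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = now

    def _refill(self, now):
        # Add tokens for the time elapsed since the last update, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def reserve(self, now):
        """Consume one token and return how many seconds the caller should wait."""
        self._refill(now)
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0  # a token was available, send immediately
        # The deficit is repaid at `rate` tokens per second.
        return -self.tokens / self.rate


bucket = TokenBucket(rate=2.0, capacity=1)
waits = [bucket.reserve(now=0.0) for _ in range(3)]  # three requests arriving at once
```

With `rate=2.0` and `capacity=1`, three requests arriving at the same instant are told to wait 0, 0.5, and 1.0 seconds, which spaces them at exactly two per second.
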
The default `burst_size=1` means there's no extra burst headroom — the very first request consumes the only token, and subsequent requests pace themselves at exactly your configured rate.

`burst_size` is a knob on the `RateLimiter` class but is **not** currently surfaced on `CrawlConfig` or the CLI. If you need bursts (e.g., 10 RPS sustained but happy to fire 5 in a row when idle), construct the limiter directly:

```python
from yoink.rate_limiter import RateLimiter

limiter = RateLimiter(requests_per_second=10.0, burst_size=5)
# then pass to your fetcher manually if subclassing
```

For most workloads, the `requests_per_second=2.0, burst_size=1` defaults are exactly what you want — polite, predictable, no surprises.

## Per-domain isolation

Rate limits are scoped to each domain you crawl. If `--follow-external` is enabled and your crawl visits both `docs.python.org` and `python.org`, they each get an independent bucket. Misbehaving on one domain can't slow another.

```python
config = CrawlConfig(
    requests_per_second=5.0,  # 5 RPS per domain
    max_concurrency=20,       # but only 20 concurrent overall
)
```

## `request_delay` — a wait-time floor

`request_delay` is a hard floor on the wait time computed by `acquire()` for each request to a given domain. With `request_delay=0.5`, every request to that domain (including the first) sleeps at least 500ms before being released, even if the token bucket has tokens available.

```bash
yoink crawl https://example.com --rate-limit 5.0 --request-delay 0.5
# Up to 5 RPS by token bucket, but every release sleeps ≥ 500ms
```

In Python:

```python
config = CrawlConfig(
    requests_per_second=5.0,
    request_delay=0.5,  # seconds; per-acquire floor
)
```

`request_delay` raises the floor on `acquire()`'s wait calculation, so it's effectively a per-request "wait at least this long." With `burst_size=1` and `request_delay=0.5`, you get a steady cadence of one request every 500ms (or slower, if the bucket is empty).
It's not literally measured "between consecutive completions" — it's the minimum sleep before each token is handed out.

## robots.txt `Crawl-delay`

When `respect_robots=True` (the default), yoink reads each domain's `robots.txt` and applies its `Crawl-delay` directive by reducing the bucket's refill rate to `1 / crawl_delay` requests per second — but only if that's stricter than your configured rate.

The stricter limit always wins. If your config says `requests_per_second=5.0` (1 request every 200ms) and the site's `robots.txt` has `Crawl-delay: 1`, the bucket's effective rate drops to 1 RPS for that domain — yoink will wait at least 1 second between requests there. Your config is the ceiling, not the floor.

Once a `Crawl-delay` reduces the bucket's rate, it stays reduced for the lifetime of that `RateLimiter` even if `robots.txt` is later refreshed with a less-restrictive value. In practice this only matters if you cache an extremely strict `Crawl-delay` and the site loosens it during your crawl — generally a non-issue.

## Picking sane defaults

A non-exhaustive heuristic:

| Target                       | Suggested `requests_per_second` |
| ---------------------------- | ------------------------------- |
| Personal blog                | 1.0                             |
| Documentation site           | 2.0 – 5.0                       |
| Public API / large news site | 5.0 – 10.0                      |
| Your own staging server      | Whatever you want               |

If the site you're crawling publishes a `Crawl-delay`, honor it — yoink does this for you, but you can also set `request_delay` explicitly to make the constraint visible at the call site.

## Disabling rate limiting

You can't turn it fully off, but you can effectively disable it for testing:

```python
config = CrawlConfig(
    requests_per_second=1000,  # absurdly high
    request_delay=0.0,
)
```

For real workloads: don't.

## See also

- [`CrawlConfig.requests_per_second`](/docs/api/config) and `request_delay` reference.
- [robots.txt compliance](/docs/concepts/robots-txt) — how `Crawl-delay` is parsed and applied.
- The `RateLimiter` module: src/yoink/rate_limiter.py.

## robots.txt compliance

_Source: `docs/concepts/robots-txt.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/robots-txt_

> How yoink parses, caches, and applies robots.txt directives — Allow, Disallow, Crawl-delay, Sitemap.

yoink respects [robots.txt](https://www.robotstxt.org/) by default. The `RobotsChecker` is consulted before every fetch, and disallowed URLs are filtered out before they ever hit the queue.

## What's supported

- ✅ `User-agent` matching — exact, partial substring, and `*` wildcard fallback.
- ✅ `Disallow` rules with wildcard (`*`) and end-anchor (`$`) patterns.
- ✅ `Allow` rules (longer/more-specific paths win).
- ✅ `Crawl-delay` — narrows the rate limiter for that domain.
- ✅ `Sitemap` directives — parsed and stored on each domain's `RobotsDirectives.sitemaps` list.
- ✅ Per-domain caching with a 1-hour default TTL.

## How it fits in

## Pattern matching

yoink approximates [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309):

- `*` matches any sequence of characters (greedy).
- `$` at the end of a pattern anchors the match to the end of the URL path.
- Rules are sorted by path length (longest first), and the first match wins.

**Tie-breaks** between equal-length `Allow` and `Disallow` rules go to whichever appears first in the file (Python's stable sort), not strictly to `Allow` as the RFC prefers. Author your `Allow`/`Disallow` rules with that in mind, or rely on the longer/more-specific path winning.

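
The matching rules above can be sketched in a few lines of Python. This illustrates the described behavior and is not yoink's actual implementation: `pattern_to_regex` and `is_allowed` are hypothetical names, rules are passed as `(kind, pattern)` tuples in file order, and the sketch assumes the length used for sorting includes any trailing `$` and that empty patterns are ignored.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # Escape everything, then turn each escaped `*` back into "any characters".
    body = re.escape(core).replace(r"\*", ".*")
    # Without `$` the pattern is a prefix match; with it, the match must reach the end.
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path, rules):
    """rules: list of ("allow" | "disallow", pattern) tuples in file order."""
    # sorted() is stable, so equal-length rules keep their file order.
    ordered = sorted(rules, key=lambda rule: len(rule[1]), reverse=True)
    for kind, pattern in ordered:
        if not pattern:
            continue  # assumption: an empty pattern restricts nothing
        if pattern_to_regex(pattern).match(path):
            return kind == "allow"  # first (longest) match wins
    return True  # no rule matched


rules = [
    ("disallow", "/private/"),
    ("disallow", "/*.pdf$"),
    ("allow", "/private/public-page.html"),
]
print(is_allowed("/private/secrets", rules))
print(is_allowed("/private/public-page.html", rules))
```

Swapping the order of two equal-length `allow` and `disallow` rules in the input flips the result, which is the tie-break behavior described above.
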
Examples:

```text
User-agent: *
Disallow: /private/
Disallow: /*.pdf$
Allow: /private/public-page.html
Crawl-delay: 2
```

| URL                           | Result  | Why                                            |
| ----------------------------- | ------- | ---------------------------------------------- |
| `/about`                      | allowed | No matching rule                               |
| `/private/secrets`            | blocked | `Disallow: /private/`                          |
| `/private/public-page.html`   | allowed | `Allow` is more specific than `Disallow`       |
| `/docs/manual.pdf`            | blocked | `Disallow: /*.pdf$`                            |
| `/docs/manual.pdf?download=1` | allowed | The `$` anchor; query strings break the match  |

## User-agent matching

yoink matches your configured `user_agent` against the `robots.txt` `User-agent` blocks in this order:

1. **Exact match** (case-insensitive).
2. **Partial match** — bidirectional substring (`a in b or b in a`). For example, `User-agent: yoink` matches the default UA `yoink/0.3.0 (+...)` because `"yoink"` is a substring of the UA.
3. **Wildcard fallback** (`User-agent: *`).

The substring check runs both directions, so a `robots.txt` block with `User-agent: yo` would also match `yoink/0.3.0`. If you publish or consume terse UAs, this can lead to surprising matches — use a distinctive UA string and you'll be fine.

## Caching

`robots.txt` is fetched once per domain and cached for 1 hour. This keeps yoink polite for long crawls without re-fetching `robots.txt` for every URL. The cache is in-memory and per-`Crawler` instance — a fresh process or a new `Crawler()` will re-fetch.

## Disabling robots.txt checks

You can disable robots.txt enforcement, but it's the website operator's primary signal that they don't want a crawler. If you opt out, you take on the responsibility of knowing why and being able to defend it.

```bash
# CLI
yoink crawl https://example.com --no-robots
```

```python
# Python
config = CrawlConfig(respect_robots=False)
```

When disabled, yoink doesn't fetch `robots.txt` at all and crawls freely subject only to your other config.

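
Returning to the user-agent matching order described earlier on this page, the following sketch shows one way the three steps could be expressed. `select_group` is a hypothetical helper, not yoink's API, and it assumes the substring check is case-insensitive like the exact match.

```python
def select_group(user_agent, group_names):
    """Return the robots.txt User-agent name that applies, or None if nothing matches."""
    ua = user_agent.lower()

    # 1. Exact match (case-insensitive).
    for name in group_names:
        if name.lower() == ua:
            return name

    # 2. Partial match: substring in either direction, skipping the wildcard.
    for name in group_names:
        candidate = name.lower()
        if candidate and candidate != "*" and (candidate in ua or ua in candidate):
            return name

    # 3. Wildcard fallback.
    if "*" in group_names:
        return "*"
    return None


print(select_group("yoink/0.3.0", ["*", "yoink"]))
print(select_group("yoink/0.3.0", ["yo", "*"]))
```

The second call picks the `yo` group, which is the surprising-match case that a distinctive UA string avoids.
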
## Inspecting the rules

The cleanest way to inspect what `RobotsChecker` saw is to share the `Crawler`'s instance — it already has the `Fetcher` wired up. Here's a one-shot script that prints what it learned about each domain it visited:

```python
import asyncio
from yoink import Crawler, CrawlConfig

async def main():
    crawler = Crawler(CrawlConfig(max_pages=20))
    await crawler.crawl("https://example.com")

    rc = crawler.robots_checker
    if rc is None:
        return  # respect_robots was disabled

    for domain, cached in rc._cache.items():
        for ua, directives in cached.directives.items():
            print(f"[{domain}] User-agent: {ua}")
            print(f"  rules: {len(directives.rules)}")
            print(f"  crawl_delay: {directives.crawl_delay}")
            print(f"  sitemaps: {directives.sitemaps}")

asyncio.run(main())
```

For ad-hoc `is_allowed()` checks, use the public method (it's `async`):

```python
allowed = await crawler.robots_checker.is_allowed("https://example.com/private/")
```

`RobotsChecker` needs a `Fetcher` to download `robots.txt` from the network. The `Crawler` wires this for you. If you want to use `RobotsChecker` outside a `Crawler`, you have to call `set_fetcher(my_fetcher)` with an open `Fetcher` (`async with Fetcher() as f: ...`) before `is_allowed()` will check anything — otherwise it returns `True` unconditionally.

## See also

- [Rate limiting](/docs/concepts/rate-limiting) — how `Crawl-delay` interacts with your `requests_per_second`.
- The `RobotsChecker` source: src/yoink/robots.py.

## JavaScript rendering

_Source: `docs/concepts/javascript-rendering.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/javascript-rendering_

> Use Playwright to render single-page apps and sites whose content lives behind JS execution.

The default `Fetcher` is a fast async HTTP client. It works perfectly for traditional sites, documentation, blogs, and most APIs. But if a site renders content client-side — React/Vue/Svelte SPAs, infinite-scroll pages, content that requires JS execution — you need a real browser.
That's what the **Playwright fetcher** is for.

## Enable JS rendering

Install the optional `browser` extra from your local clone and download a browser binary:

```bash
pip install -e ".[browser]"
playwright install chromium
```

Then turn it on with `--render-js` (CLI) or `render_js=True` (Python):

```bash
yoink crawl https://spa-site.com --render-js
```

```python
config = CrawlConfig(
    render_js=True,
    browser_type="chromium",       # or firefox, webkit
    wait_strategy="networkidle",   # or load, domcontentloaded, commit
    headless=True,
)
```

## How it works

When `render_js=True`, `create_fetcher()` returns a `PlaywrightFetcher` instead of the standard HTTP `Fetcher`. The crawler is otherwise unchanged — same scheduler, same rate limiter, same robots checker.

If you set `render_js=True` but the `playwright` package isn't importable, `create_fetcher()` emits a `UserWarning` and falls back to the HTTP `Fetcher`. The crawl still runs — you just won't get JS rendering. Install with `pip install -e ".[browser]" && playwright install chromium` to actually get the browser.

## Wait strategies

Playwright's notion of "loaded" is different from a plain HTTP fetch. Pick the `wait_strategy` (`load`, `domcontentloaded`, `networkidle`, or `commit`) that matches what you need.

For sites that render content after `networkidle` (rare, but it happens), use a CSS selector to wait for a specific element:

```bash
yoink crawl https://spa.com --render-js --wait-selector ".article-content"
```

```python
config = CrawlConfig(
    render_js=True,
    wait_selector=".article-content",
)
```

## Browser pooling

Launching a browser is expensive. yoink reuses a pool of browser **contexts** (isolated cookie/localStorage scopes within a single browser process):

```python
config = CrawlConfig(
    render_js=True,
    browser_pool_size=3,  # default
)
```

Workers borrow a context, render the page, and return it. Three contexts is a good default for `max_concurrency=10` — enough that workers rarely block on the pool, few enough that memory stays reasonable.

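
The borrow-and-return pattern can be illustrated without Playwright at all. The sketch below is a generic fixed-size pool built on `asyncio.Queue`, not yoink's implementation: `ContextPool`, `borrow`, and `demo` are hypothetical names, and plain `object()` instances stand in for browser contexts.

```python
import asyncio
from contextlib import asynccontextmanager


class ContextPool:
    """Fixed-size pool; `factory` builds each pooled object up front."""

    def __init__(self, factory, size=3):
        self._queue = asyncio.Queue()
        for _ in range(size):
            self._queue.put_nowait(factory())

    @asynccontextmanager
    async def borrow(self):
        ctx = await self._queue.get()  # waits while every context is in use
        try:
            yield ctx
        finally:
            self._queue.put_nowait(ctx)  # always hand the context back


async def demo(workers=10, pool_size=3):
    """Run `workers` tasks against the pool and report the peak number in use."""
    pool = ContextPool(factory=object, size=pool_size)
    in_use = 0
    peak = 0

    async def worker():
        nonlocal in_use, peak
        async with pool.borrow():
            in_use += 1
            peak = max(peak, in_use)
            await asyncio.sleep(0.01)  # stand-in for rendering a page
            in_use -= 1

    await asyncio.gather(*(worker() for _ in range(workers)))
    return peak


peak_in_use = asyncio.run(demo())
```

Ten workers share three pooled objects here, so no more than three "renders" run at once while the remaining workers wait on the queue.
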
## Browser choice

| Browser  | When to pick it                                         |
| -------- | ------------------------------------------------------- |
| chromium | Default. Best site compatibility, fastest startup.      |
| firefox  | If you need to test against Firefox-specific behavior.  |
| webkit   | Closest approximation of Safari rendering.              |

For data extraction, **Chromium is almost always the right choice.** The other engines exist for testing/cross-browser validation.

## Debugging

Run with a visible browser to watch what's happening:

```bash
yoink crawl https://spa.com --render-js --no-headless
```

For scripted runs that crash mysteriously, point Playwright at a screenshot directory:

```python
config = CrawlConfig(
    render_js=True,
    screenshot_dir="./debug-screenshots",
)
```

Each fetched page gets a PNG dropped in that directory, named `screenshot_<8-char-md5>.png` (e.g., `screenshot_a1b2c3d4.png`) where the 8 chars are the first 8 hex digits of the MD5 of the URL. Eight hex digits are only 32 bits of hash, so collisions are rare on small crawls (roughly a 1% chance at 10,000 pages) but likely by around 100,000 pages.

## Cost & throughput

JS rendering is **10–50× slower** than plain HTTP fetching. A page that takes 200ms over HTTP might take 3–8 seconds with Playwright (network + render + wait).

Plan accordingly:

- Lower `max_concurrency` (try 5 instead of 20).
- Use `wait_strategy="domcontentloaded"` if you don't need post-mount data.
- Keep `--render-js` off for the parts of your crawl that don't need it. yoink doesn't (yet) auto-detect; that's a per-target decision.

## When NOT to use it

If `curl https://site.com` returns the content you want, you don't need a browser. The default `Fetcher` is faster, lighter, and more reliable.

Try the HTTP fetcher first. Switch only when content is missing.

## See also

- [`CrawlConfig`](/docs/api/config) — full list of JS-related options.
- The Playwright fetcher source: src/yoink/playwright_fetcher.py.

## Checkpointing

_Source: `docs/concepts/checkpointing.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/checkpointing_

> Resumable crawls with append-only checkpoints. Survive Lambda timeouts, OOM kills, and Ctrl-C.

Long crawls die. Lambda timeouts hit. SSH connections drop. Servers OOM-kill your process. yoink's checkpointing system makes any crawl resumable with two lines of code.

## What gets checkpointed

A checkpoint file is an append-only log of three kinds of records:

1. **Metadata** — start URL, config snapshot, timestamp. Written once at the start.
2. **Pages** — one record per crawled page. Streamed as they finish.
3. **State** — the visited set, the queue, the filtered set. Written periodically and on shutdown.

The format is JSONL with a `type` discriminator on each line.

## CLI usage

```bash
# Run a crawl with checkpointing
yoink crawl https://example.com --checkpoint ./crawl.jsonl

# It crashed / you Ctrl-C'd. Resume:
yoink crawl https://example.com --checkpoint ./crawl.jsonl --resume
```

The same flags work with S3 URIs:

```bash
yoink crawl https://example.com --checkpoint s3://my-bucket/crawl.jsonl --resume
```

## Python usage

```python
import asyncio
from yoink import Crawler, CrawlConfig, CheckpointManager

async def main():
    config = CrawlConfig(max_pages=10_000)

    # Local file
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")

    # ...or S3
    # checkpoint = CheckpointManager.from_uri("s3://my-bucket/crawl.jsonl")

    crawler = Crawler(config=config, checkpoint_manager=checkpoint)

    # If the file exists, pick up where we left off
    pages = await crawler.crawl("https://example.com", resume=True)
    return pages

asyncio.run(main())
```

## Flush interval

Pages are written immediately.
State is flushed every N pages (default `10`) and on shutdown:

```python
checkpoint = CheckpointManager.from_uri(
    "./crawl.jsonl",
    flush_interval=50,  # write state every 50 pages
)
```

```bash
yoink crawl https://example.com --checkpoint ./crawl.jsonl --checkpoint-interval 50
```

Lower values give finer-grained resume but cost more I/O. For S3, every flush is an API call, so you generally want a higher interval (50–100).

## Storage backends

`CheckpointManager.from_uri(...)` picks a backend based on the URI scheme:

| URI                     | Backend            | Implementation        |
| ----------------------- | ------------------ | --------------------- |
| `./relative/path.jsonl` | `LocalFileStorage` | Async aiofiles append |
| `/absolute/path.jsonl`  | `LocalFileStorage` | Async aiofiles append |
| `s3://bucket/key.jsonl` | `S3Storage`        | Buffered → put_object |

Want a custom backend (Redis, GCS, Azure)? Implement [`CheckpointStorage`](/docs/api/storage) — five async methods.

## How resume works

When you call `crawler.crawl(url, resume=True)`:

1. The checkpoint file is read line by line.
2. **Pages** are restored into `crawler.pages`.
3. **State** restores `scheduler.visited`, `scheduler.queue`, `scheduler.filtered`.
4. If the start URL doesn't match the checkpoint metadata, you get a warning.
5. The crawl continues from the queue.

Restoring visited URLs means yoink will never re-fetch a page that finished before the crash. The crawl picks up exactly where it left off — same depth, same queue order.

## When to use checkpoints

✅ **Use them**

- Crawls expected to take more than 10 minutes.
- Lambda jobs (any execution > 30s).
- Containers that may be killed (autoscaling, spot instances).
- Anywhere the start URL might be re-invoked.

❌ **Skip them**

- Throwaway crawls (one-shot data pulls in dev).
- Tiny crawls where re-running is cheaper than checkpoint I/O.

## See also

- [Lambda + S3 checkpoints example](/docs/examples/lambda-s3) — a complete resumable Lambda handler.
- [`CheckpointManager` API](/docs/api/checkpoint).
- [Storage backends](/docs/api/storage).

## URL filtering

_Source: `docs/concepts/url-filtering.mdx` · https://yoink.goatsquadstudios.com/docs/concepts/url-filtering_

> Include patterns, exclude patterns, file-extension filters, and domain filters — combine for precise targeting.

Most crawls don't want every URL. URL filters tell yoink which pages to follow and which to skip *before* they hit the queue.

## The filter pipeline

For each candidate URL, `CombinedFilter` checks its filters in a fixed order, and the first one that says "no" wins.

`DomainFilter` runs first because it's a fast hostname check; if you've explicitly allowlisted a domain set, everything else is irrelevant for URLs outside it. Inside `URLFilter`, the order is extension → include → exclude — the cheap path-suffix check before any pattern matching.

## CLI usage

```bash
yoink crawl https://example.com \
  --include "*/blog/*" \
  --include "*/docs/*" \
  --exclude "*/private/*" \
  --skip-extensions pdf,zip,exe
```

- `--include` and `--exclude` are repeatable.
- `--skip-extensions` is comma-separated.

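
The pipeline order described above (domain, then extension, include, exclude) can be sketched as a single function. This is an illustration rather than yoink's `CombinedFilter`: `should_crawl` here is a standalone hypothetical function, it handles glob patterns only, and it assumes include patterns act as an allowlist (at least one must match when any are given), that patterns are matched against the full URL, and that an allowed domain also covers its subdomains.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_crawl(url, *, allowed_domains=(), skip_extensions=(), include=(), exclude=()):
    """Return False as soon as one stage rejects the URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # 1. Domain check: exact host or a subdomain of an allowed domain.
    if allowed_domains and not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    ):
        return False

    # 2. Extension check on the lowercased path; a leading dot is optional.
    path = parts.path.lower()
    if any(path.endswith("." + ext.lower().lstrip(".")) for ext in skip_extensions):
        return False

    # 3. Include patterns: when present, at least one must match.
    if include and not any(fnmatchcase(url, pattern) for pattern in include):
        return False

    # 4. Exclude patterns: any match rejects the URL.
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False

    return True


options = dict(
    include=["*/blog/*", "*/docs/*"],
    exclude=["*/private/*"],
    skip_extensions=["pdf", "zip", "exe"],
)
print(should_crawl("https://example.com/blog/post-1", **options))
print(should_crawl("https://example.com/docs/manual.PDF", **options))
```

The options mirror the CLI flags in the example above; the first URL passes every stage and the second is rejected at the extension check.
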
## Python usage

```python
import asyncio

from yoink import Crawler, CrawlConfig
from yoink.filters import CombinedFilter

url_filter = CombinedFilter.from_config(
    include_patterns=["*/blog/*", "*/docs/*"],
    exclude_patterns=["*/private/*"],
    skip_extensions=["pdf", "zip", "exe"],
    allowed_domains=["example.com", "blog.example.com"],
)

async def main():
    crawler = Crawler(config=CrawlConfig(), url_filter=url_filter)
    # `await` is only valid inside a coroutine
    return await crawler.crawl("https://example.com")

pages = asyncio.run(main())
```

## Pattern syntax

yoink auto-detects the kind of pattern based on its shape:

| Pattern shape                   | Treated as      | Example                          |
| ------------------------------- | --------------- | -------------------------------- |
| Contains `*` or `?`             | Glob            | `*/blog/*`, `*.html`             |
| Starts `^` / ends `$` / has `[` | Regex           | `^https://example\.com/v\d+/.*$` |
| Anything else                   | Substring match | `/api/`                          |

### Glob examples

```python
# All blog posts
"*/blog/*"

# Anything under /docs/, any depth
"*/docs/*"

# Specific URL with placeholder
"https://example.com/posts/?"
```

### Regex examples

```python
# Versioned API URLs
r"^https://api\.example\.com/v\d+/.*$"

# Posts from 2024 or later
r"/posts/(202[4-9]|20[3-9]\d)/.*"
```

### Substring examples

```python
"/api/"   # any URL containing /api/
"draft"   # any URL containing 'draft'
```

`"/api"` matches `/api/users` *and* `/apiserver/foo`. Use globs (`"*/api/*"`) when you mean a path segment.

## Extension filtering

Inside `URLFilter`, `skip_extensions` is checked before include/exclude patterns because it's cheap. It matches the lowercased URL path:

```python
skip_extensions=["pdf", "zip", "exe", "jpg", "png"]
```

You don't need the leading dot — yoink strips it. `pdf`, `.pdf`, and `PDF` all work.

## Domain filtering

By default, yoink stays on the start URL's domain. With `--follow-external`, it'll follow links anywhere.
To allow specific external domains only: ```python from yoink.filters import DomainFilter, CombinedFilter domain_filter = DomainFilter(allowed_domains=["example.com", "docs.example.com"]) url_filter = CombinedFilter(domain_filter=domain_filter) crawler = Crawler( config=CrawlConfig(follow_external=True), url_filter=url_filter, ) ``` Domain matching honors subdomains: `allowed_domains=["example.com"]` matches `example.com`, `www.example.com`, and `blog.example.com` — but not `evil-example.com`. ## Combining filters Use `CombinedFilter.from_config(...)` for the common case: ```python from yoink.filters import CombinedFilter url_filter = CombinedFilter.from_config( include_patterns=["*/api/*"], exclude_patterns=["*/api/internal/*"], skip_extensions=["pdf"], allowed_domains=["api.example.com"], ) ``` Or compose lower-level filters explicitly: ```python from yoink.filters import URLFilter, DomainFilter, CombinedFilter url_filter = CombinedFilter( url_filter=URLFilter( include_patterns=["*/api/*"], exclude_patterns=["*/internal/*"], skip_extensions=["pdf"], ), domain_filter=DomainFilter(allowed_domains=["api.example.com"]), ) ``` ## See also - [Filters API reference](/docs/api/filters). - The `Filters` source: src/yoink/filters.py. --- # CLI ## yoink crawl _Source: `docs/cli/crawl.mdx` · https://yoink.goatsquadstudios.com/docs/cli/crawl_ > Complete reference for the yoink crawl command — every flag, every option, with examples. `yoink crawl` is the workhorse: it takes a URL, fetches pages, and writes the result. 
```bash
yoink crawl URL [OPTIONS]
```

## Examples

```bash
# The minimum
yoink crawl https://example.com

# Reasonable defaults for a small docs crawl
yoink crawl https://docs.example.com -d 2 -n 200 -o docs.jsonl

# JS-heavy SPA, output as Parquet
yoink crawl https://spa.com --render-js --format parquet -o data.parquet

# Resumable to S3
yoink crawl https://example.com \
  --checkpoint s3://my-bucket/crawl.jsonl \
  --resume \
  --rate-limit 5 \
  --depth 3
```

## Core options

`--output` sets the output file path. It is skipped if `--checkpoint` is set without `--output`.

| Option              | Type | Default            | Description                                                |
| ------------------- | ---- | ------------------ | ---------------------------------------------------------- |
| `--follow-external` | FLAG | `false`            | Follow links to domains other than the start URL's domain. |
| `--save-html`       | FLAG | `false`            | Persist raw HTML on each Page record (large output).       |
| `--user-agent`      | TEXT | `yoink/ (+github)` | Custom User-Agent string sent on every request.            |

## URL filtering

See [URL filtering](/docs/concepts/url-filtering) for pattern semantics.

## Checkpointing

See [Checkpointing](/docs/concepts/checkpointing) for details.

## Rate limiting

See [Rate limiting](/docs/concepts/rate-limiting).

## robots.txt

See [robots.txt compliance](/docs/concepts/robots-txt).

## JavaScript rendering

Requires the `[browser]` extra (`pip install "yoink[browser]"` and `playwright install chromium`). See [JavaScript rendering](/docs/concepts/javascript-rendering).

## Output

By default, yoink prints progress to stderr and a summary to stdout when finished:

```
Yoinking https://example.com...
Max depth: 2, Max pages: 100, Concurrency: 10
Rate limit: 2.0 req/s, Robots.txt: enabled
Yoinking pages: 100%|████████| 87/100 [00:42<00:00, 2.07page/s]
Yoinked 87 pages to crawl_output.jsonl
Total links found: 1,243
Total text extracted: 412,891 characters
```

Pipe stderr away if you only want the summary:

```bash
yoink crawl https://example.com 2>/dev/null
```

## Exit codes

- `0` — always.
The CLI prints errors to stderr but currently exits `0` on every code path, including bad config (e.g., `--resume` without `--checkpoint`) and write errors. If you script around `yoink crawl` and need to detect failure, scan stderr or check that the output file exists and is non-empty. ## See also - [`yoink stats`](/docs/cli/stats) — analyze the output of a crawl. - [Quickstart](/docs/quickstart) — concrete examples. ## yoink stats _Source: `docs/cli/stats.mdx` · https://yoink.goatsquadstudios.com/docs/cli/stats_ > Analyze a saved crawl — page counts, depth distribution, top domains, content quality metrics. `yoink stats` reads a crawl output file (JSON or JSONL) and prints a human-readable summary, with optional CSV / JSON export. ```bash yoink stats FILE [OPTIONS] ``` ## Examples ```bash # Human-readable summary yoink stats crawl_output.jsonl # Export to CSV for spreadsheet work yoink stats crawl_output.jsonl --export stats.csv # JSON output yoink stats crawl_output.jsonl --json ``` `yoink stats --json` currently writes one structlog INFO line to stdout before the JSON payload (the `loaded_pages` event). To pipe cleanly into `jq`, strip the first line: ```bash yoink stats data.jsonl --json | tail -n +2 | jq '.total_pages' ``` This will be fixed in a future release; until then, the workaround is mechanical. 
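If you consume `yoink stats --json` from Python rather than a shell pipeline, the same workaround can be expressed as a small parser. This is a sketch under stated assumptions: `parse_stats_stdout` is a hypothetical helper, not part of yoink, and it assumes at most one log line precedes the JSON payload.

```python
import json


def parse_stats_stdout(raw: str) -> dict:
    """Parse `yoink stats --json` output, tolerating one leading log line."""
    try:
        # Works as-is once stdout contains only the JSON payload.
        return json.loads(raw)
    except json.JSONDecodeError:
        # Current behavior: drop the first line, then parse the rest.
        _, _, rest = raw.partition("\n")
        return json.loads(rest)
```

Trying the whole string first means the helper keeps working after the extra line is removed upstream. `raw` would typically be the captured stdout of a `subprocess.run(..., capture_output=True, text=True)` call.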
## Options ## What it computes For every page in the file: - **Total pages, total links, average links per page** - **Total text size and average text size** (bytes) - **Total HTML size** if `--save-html` was used - **Depth distribution** — how many pages at each depth - **Unique domains and top 10 domains** by page count - **Status code distribution** - **Content quality** — share of pages with text, title, metadata - **Text length stats** — min / median / max characters ## Sample output ``` ============================================================ YOINK Crawl Statistics ============================================================ Total Pages: 87 Total Links: 1,243 Avg Links/Page: 14.29 Content Size: Total Text: 412.39 KB Avg Text/Page: 4.74 KB Domains: Unique Domains: 1 Top Domains: - docs.example.com: 87 pages Depth Distribution: Depth 0: 1 # Depth 1: 24 ######################## Depth 2: 62 ############################################################## Content Quality: Pages with Text: 85 (97.7%) Pages with Title: 87 (100.0%) Pages with Metadata: 73 (83.9%) Text Length: Min: 142 chars Median: 3,891 chars Max: 28,442 chars ============================================================ ``` ## JSON output (`--json`) ```json { "total_pages": 87, "total_links": 1243, "avg_links_per_page": 14.29, "total_text_size": 422291, "max_depth": 2, "pages_by_depth": { "0": 1, "1": 24, "2": 62 }, "unique_domains": 1, "top_domains": [{ "domain": "docs.example.com", "count": 87 }], "status_codes": { "200": 87 }, "pages_with_text": 85, "pages_with_title": 87, "pages_with_metadata": 73, "text_length_min": 142, "text_length_median": 3891, "text_length_max": 28442 } ``` ## See also - The `CrawlStats` Python API: [`yoink.stats`](/docs/api/stats). - Output formats: [reference/output-formats](/docs/reference/output-formats). ## yoink version _Source: `docs/cli/version.mdx` · https://yoink.goatsquadstudios.com/docs/cli/version_ > Print yoink's version and a one-liner description. 
```bash yoink version ``` Output: ``` yoink version 0.1.0 The public data crawler. ``` That's it. Useful for shell scripts and CI pipelines that want to assert a minimum version. ```bash yoink version | head -1 | awk '{print $3}' # 0.1.0 ``` You can also use the standard `--version` flag on the root command: ```bash yoink --version ``` --- # Python API ## Crawler _Source: `docs/api/crawler.mdx` · https://yoink.goatsquadstudios.com/docs/api/crawler_ > The main async web crawler — wires together the fetcher, parser, scheduler, and rate limiter. `yoink.Crawler` is the entry point for programmatic use. It owns the worker pool and orchestrates a crawl from a start URL. ## Import ```python from yoink import Crawler, CrawlConfig ``` ## Constructor ```python Crawler( config: CrawlConfig | None = None, url_filter: CombinedFilter | None = None, checkpoint_manager: CheckpointManager | None = None, ) ``` ## Methods ### `crawl(start_url, resume=False)` Crawl a website starting from `start_url`. ```python async def crawl( self, start_url: str, resume: bool = False, ) -> list[Page] ``` **Returns:** `list[Page]` — every page yoinked. Note that pages are also accumulated in `crawler.pages`, which you can read mid-crawl from another coroutine. ### `crawl_with_progress(start_url, resume=False)` Same as `crawl()` but renders a `tqdm` progress bar to stderr. Used by the CLI. 
```python async def crawl_with_progress( self, start_url: str, resume: bool = False, ) -> list[Page] ``` ## Attributes ## Examples ### Minimal crawl ```python import asyncio from yoink import Crawler async def main(): crawler = Crawler() pages = await crawler.crawl("https://example.com") return pages asyncio.run(main()) ``` ### With config and filter ```python from yoink import Crawler, CrawlConfig from yoink.filters import CombinedFilter config = CrawlConfig( max_depth=3, max_pages=500, requests_per_second=5.0, render_js=True, ) url_filter = CombinedFilter.from_config( include_patterns=["*/api/*"], skip_extensions=["pdf", "zip"], ) crawler = Crawler(config=config, url_filter=url_filter) pages = await crawler.crawl("https://docs.example.com") ``` ### Mid-crawl progress (custom) ```python import asyncio from yoink import Crawler, CrawlConfig async def report(crawler: Crawler): while True: await asyncio.sleep(2) print(f"...crawled {len(crawler.pages)} pages") async def main(): crawler = Crawler(CrawlConfig(max_pages=1000)) reporter = asyncio.create_task(report(crawler)) try: return await crawler.crawl("https://example.com") finally: reporter.cancel() asyncio.run(main()) ``` ### With checkpointing See [Checkpointing](/docs/concepts/checkpointing) and [`CheckpointManager`](/docs/api/checkpoint) for full coverage. ```python from yoink import Crawler, CrawlConfig, CheckpointManager checkpoint = CheckpointManager.from_uri("./crawl.jsonl") crawler = Crawler(config=CrawlConfig(), checkpoint_manager=checkpoint) pages = await crawler.crawl("https://example.com", resume=True) ``` ## See also - [`CrawlConfig`](/docs/api/config) — every knob. - [`Page`](/docs/api/page) — the per-page output type. - [Architecture](/docs/concepts/architecture) — how the components fit. ## CrawlConfig _Source: `docs/api/config.mdx` · https://yoink.goatsquadstudios.com/docs/api/config_ > Every knob — depth, concurrency, rate limit, robots, JS rendering, browser pool. Pydantic-validated. 
`CrawlConfig` is a Pydantic model that captures every dial you can turn. Validation runs on construction, so invalid values (negative depth, concurrency > 100) fail immediately.

## Import

```python
from yoink import CrawlConfig
from yoink.models import WaitStrategy
```

## Core settings

| Field             | Type   | Default            | Description                                                                   |
| ----------------- | ------ | ------------------ | ----------------------------------------------------------------------------- |
| `max_pages`       | `int`  | `100`              | Hard cap on pages crawled. Validated >= 1.                                    |
| `max_concurrency` | `int`  | `10`               | Number of concurrent worker coroutines. Validated 1..100.                     |
| `user_agent`      | `str`  | `yoink/ (+github)` | User-Agent header sent on every request.                                      |
| `timeout`         | `int`  | `30`               | Per-request timeout in seconds. Validated >= 1.                               |
| `follow_external` | `bool` | `False`            | If False, drop links whose domain differs from the start URL's.               |
| `extract_text`    | `bool` | `True`             | Run trafilatura on each page's HTML to populate Page.text.                    |
| `save_html`       | `bool` | `False`            | Persist the raw HTML on each Page record. Drastically increases output size.  |

## robots.txt

## Rate limiting

## JavaScript rendering

Requires the `[browser]` extra.
## `WaitStrategy` enum

```python
from yoink.models import WaitStrategy

WaitStrategy.LOAD              # "load"
WaitStrategy.DOMCONTENTLOADED  # "domcontentloaded"
WaitStrategy.NETWORKIDLE       # "networkidle"
WaitStrategy.COMMIT            # "commit"
```

You can pass a string or an enum value:

```python
config = CrawlConfig(wait_strategy="networkidle")             # OK
config = CrawlConfig(wait_strategy=WaitStrategy.NETWORKIDLE)  # also OK
```

## Examples

### Minimal

```python
config = CrawlConfig(max_depth=2)
```

### Aggressive but polite

```python
config = CrawlConfig(
    max_depth=4,
    max_pages=10_000,
    max_concurrency=20,
    requests_per_second=10.0,
    follow_external=False,
)
```

### SPA crawl with debug screenshots

```python
from yoink.models import WaitStrategy

config = CrawlConfig(
    render_js=True,
    browser_type="chromium",
    wait_strategy=WaitStrategy.NETWORKIDLE,
    wait_selector=".app-content",
    headless=True,
    browser_pool_size=5,
    screenshot_dir="./debug",
)
```

### Loading from environment / config file

`CrawlConfig` is a standard Pydantic model, so you can use `model_validate()` with a dict from any source:

```python
import json
from yoink import CrawlConfig

with open("crawl.json") as f:
    raw = json.load(f)

config = CrawlConfig.model_validate(raw)
```

## See also

- [`Crawler`](/docs/api/crawler) — uses this config.
- [Configuration reference](/docs/reference/configuration) — quick-scan view of every option.

## Page

_Source: `docs/api/page.mdx` · https://yoink.goatsquadstudios.com/docs/api/page_

> The per-page output type — URL, title, extracted text, links, metadata, status code, depth.

`Page` is the Pydantic model representing one crawled URL.

## Import

```python
from yoink import Page
# or
from yoink.models import Page
```

## Fields

`title` is the `<title>` tag content, if present.

| Field         | Type             | Default | Description                                                                            |
| ------------- | ---------------- | ------- | -------------------------------------------------------------------------------------- |
| `text`        | `str \| None`    |         | Clean extracted text from trafilatura. None if extract_text=False or extraction failed. |
| `html`        | `str \| None`    |         | Raw HTML. Only populated when save_html=True.                                          |
| `links`       | `list[str]`      | `[]`    | Outbound links discovered on the page (absolute URLs).                                 |
| `metadata`    | `dict[str, str]` | `{}`    | OpenGraph / Twitter / standard meta tags.                                              |
| `crawled_at`  | `datetime`       |         | UTC timestamp when the page was fetched.                                               |
| `status_code` | `int`            | `200`   | HTTP response status code.                                                             |
| `depth`       | `int`            | `0`     | Link-hop distance from the start URL.                                                  |

## Methods

`Page` inherits all standard Pydantic v2 methods:

```python
page.model_dump()              # → dict
page.model_dump(mode="json")   # → JSON-safe dict (datetimes as strings)
page.model_dump_json()         # → str
Page.model_validate(data)      # construct from dict
Page.model_validate_json(s)    # construct from JSON string
```

## Examples

### Inspecting after a crawl

```python
pages = await crawler.crawl("https://example.com")

for page in pages:
    print(f"[{page.status_code}] depth={page.depth} {page.url}")
    print(f" title: {page.title or '(none)'}")
    print(f" text: {len(page.text or '')} chars, {len(page.links)} links")
    if "og:image" in page.metadata:
        print(f" image: {page.metadata['og:image']}")
```

### Reading pages back from JSONL

```python
import json
from yoink import Page

pages: list[Page] = []
with open("crawl_output.jsonl") as f:
    for line in f:
        pages.append(Page.model_validate_json(line))

print(f"Loaded {len(pages)} pages")
```

### Filtering for content quality

```python
# Keep only pages with at least 500 chars of clean text
substantial = [p for p in pages if p.text and len(p.text) >= 500]

# Group by depth
from collections import defaultdict

by_depth = defaultdict(list)
for p in pages:
    by_depth[p.depth].append(p)
```

## JSON shape

When serialized:

```json
{
  "url": "https://example.com/about",
  "title": "About Example",
  "text": "Example is a domain established for...",
  "html": null,
  "links": ["https://example.com/", "https://example.com/contact"],
  "metadata": {
    "description": "About page",
    "og:title": "About Example",
    "og:type": "website"
  },
  "crawled_at": "2026-05-03T12:34:56.789012",
  "status_code": 200,
  "depth": 1
}
```

## See also

- [Output formats](/docs/reference/output-formats) — how pages are serialized to JSON, JSONL, Parquet, text.
- [`Writer`](/docs/api/writers) — how to write pages to files programmatically.

## CheckpointManager

_Source: `docs/api/checkpoint.mdx` · https://yoink.goatsquadstudios.com/docs/api/checkpoint_

> Persist crawl progress to disk or S3 — automatic resume, configurable flush interval.

`CheckpointManager` writes pages and crawl state to a [`CheckpointStorage`](/docs/api/storage) backend. Pass one to `Crawler` to make any crawl resumable.

## Import

```python
from yoink import CheckpointManager
```

## Constructing

The recommended path is `from_uri()`, which picks the right storage backend:

```python
# Local file
checkpoint = CheckpointManager.from_uri("./crawl.jsonl")

# S3 (requires [s3] extra)
checkpoint = CheckpointManager.from_uri("s3://my-bucket/crawl.jsonl")

# With custom flush cadence
checkpoint = CheckpointManager.from_uri(
    "./crawl.jsonl",
    flush_interval=50,
)
```

For full control, build with an explicit storage backend:

```python
from yoink import CheckpointManager
from yoink.storage import LocalFileStorage, S3Storage

storage = S3Storage("s3://my-bucket/crawl.jsonl")
checkpoint = CheckpointManager(storage=storage, flush_interval=50)
```

## API surface

There's no `CheckpointManager.flush()` method. If you need to force a flush from outside the crawler (e.g., before checking the file from another process), call `await manager.storage.flush()` directly.
## Examples ### Resumable local crawl ```python import asyncio from yoink import Crawler, CrawlConfig, CheckpointManager async def main(): config = CrawlConfig(max_depth=3, max_pages=10_000) checkpoint = CheckpointManager.from_uri("./crawl.jsonl") crawler = Crawler(config=config, checkpoint_manager=checkpoint) # Pick up where we left off if the file exists pages = await crawler.crawl("https://example.com", resume=True) print(f"Total pages: {len(pages)}") asyncio.run(main()) ``` ### Inspecting a checkpoint ```python import asyncio from yoink import CheckpointManager async def main(): checkpoint = CheckpointManager.from_uri("./crawl.jsonl") data = await checkpoint.load() print(f"Started: {data['metadata']['started_at']}") print(f"Start URL: {data['metadata']['start_url']}") print(f"Pages saved: {len(data['pages'])}") if state := data.get("state"): print(f"Visited: {len(state['visited'])}") print(f"Queue: {len(state['queue'])}") print(f"Filtered:{len(state['filtered'])}") asyncio.run(main()) ``` ### Choosing a flush interval ```python # Aggressive flushing — every page gets persisted state too checkpoint = CheckpointManager.from_uri("./crawl.jsonl", flush_interval=1) # Moderate (default) — state every 10 pages checkpoint = CheckpointManager.from_uri("./crawl.jsonl", flush_interval=10) # S3 — minimize API calls for cost checkpoint = CheckpointManager.from_uri("s3://bucket/crawl.jsonl", flush_interval=100) ``` For S3 the trade-off is real: every flush is a `put_object` call. Don't go below 50 unless you have specific reasons. ## File format A checkpoint is a JSONL file with `type` discriminators: ```jsonl {"type": "metadata", "start_url": "...", "config": {...}, "started_at": "..."} {"type": "page", "url": "...", "title": "...", ...} {"type": "page", "url": "...", "title": "...", ...} {"type": "state", "visited": [...], "queue": [...], "filtered": [...]} ``` This is intentionally readable and `grep`-friendly. 
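Because each record carries a `type` discriminator, a checkpoint can also be read without yoink installed. The sketch below is illustrative only: `split_checkpoint` is a hypothetical helper, and it assumes that when several `state` records exist, the last one in the file is the current one. For real use, `CheckpointManager.load()` is the supported path.

```python
import json
from typing import Iterable


def split_checkpoint(lines: Iterable[str]) -> dict:
    """Group checkpoint records by their `type` discriminator."""
    result = {"metadata": None, "pages": [], "state": None}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines left by hand edits
        record = json.loads(line)
        kind = record.pop("type", None)
        if kind == "metadata":
            result["metadata"] = record
        elif kind == "page":
            result["pages"].append(record)
        elif kind == "state":
            # Assumption: the file is append-only, so the last state wins.
            result["state"] = record
    return result
```

Passing an open file object works, since iterating a text file yields one line at a time.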
You can hand-edit a checkpoint to remove a problematic page or trim the queue. ## See also - [Checkpointing concepts](/docs/concepts/checkpointing). - [`CheckpointStorage`](/docs/api/storage) — the storage backend interface. - [Lambda + S3 example](/docs/examples/lambda-s3). ## Filters _Source: `docs/api/filters.mdx` · https://yoink.goatsquadstudios.com/docs/api/filters_ > URLFilter, DomainFilter, and CombinedFilter — pattern matching, extension filtering, domain allowlists. ```python from yoink.filters import URLFilter, DomainFilter, CombinedFilter ``` ## `URLFilter` Pattern-based URL filtering. Auto-detects glob, regex, or substring patterns. ```python URLFilter( include_patterns: list[str] | None = None, exclude_patterns: list[str] | None = None, skip_extensions: list[str] | None = None, ) ``` ```python url_filter = URLFilter( include_patterns=["*/blog/*", "*/docs/*"], exclude_patterns=["*/private/*", r"^.*\?draft=1$"], skip_extensions=["pdf", "zip", "exe"], ) url_filter.should_crawl("https://example.com/blog/post-1") # True url_filter.should_crawl("https://example.com/private/x") # False url_filter.should_crawl("https://example.com/manual.pdf") # False ``` ## `DomainFilter` Domain allowlist with subdomain matching. ```python DomainFilter(allowed_domains: list[str] | None = None) ``` ```python domain_filter = DomainFilter(allowed_domains=["example.com"]) domain_filter.should_crawl("https://example.com/page") # True domain_filter.should_crawl("https://blog.example.com/x") # True (subdomain) domain_filter.should_crawl("https://other.com/page") # False domain_filter.should_crawl("https://evil-example.com/x") # False ``` Subdomain matching: a URL passes if its hostname **is** an allowed domain or **ends with** `.{allowed_domain}`. ## `CombinedFilter` Composes a `URLFilter` and a `DomainFilter`. This is what `Crawler` accepts. 
```python
CombinedFilter(
    url_filter: URLFilter | None = None,
    domain_filter: DomainFilter | None = None,
)
```

The most ergonomic constructor is `from_config()`:

```python
CombinedFilter.from_config(
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    skip_extensions: list[str] | None = None,
    allowed_domains: list[str] | None = None,
) -> CombinedFilter
```

```python
url_filter = CombinedFilter.from_config(
    include_patterns=["*/api/*"],
    exclude_patterns=["*/internal/*"],
    skip_extensions=["pdf"],
    allowed_domains=["api.example.com"],
)

crawler = Crawler(config=CrawlConfig(), url_filter=url_filter)
```

## Pattern dispatch

| Pattern shape                          | Matched as       |
| -------------------------------------- | ---------------- |
| Contains `*` or `?`                    | Glob (fnmatch)   |
| Starts `^`, ends `$`, or contains `[`  | Regex (re.match) |
| Anything else                          | Substring (`in`) |

See [URL filtering](/docs/concepts/url-filtering) for examples.

## Custom filters

Anything implementing `should_crawl(url: str) -> bool` works as a filter. To plug it into the crawler, wrap it with a tiny adapter or use it directly:

```python
from datetime import datetime, timezone

from yoink.filters import URLFilter


class WeekendOnlyFilter:
    def should_crawl(self, url: str) -> bool:
        # weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday (UTC).
        # datetime.utcnow() is deprecated as of Python 3.12.
        return datetime.now(timezone.utc).weekday() >= 5


# CombinedFilter's url_filter and domain_filter slots accept any object
# that has .should_crawl, so subclassing is the cleanest path:
class MyURLFilter(URLFilter):
    def should_crawl(self, url: str) -> bool:
        if "?utm" in url:
            return False
        return super().should_crawl(url)
```

## See also

- [URL filtering concepts](/docs/concepts/url-filtering).
- The `Filters` source: src/yoink/filters.py.

## Storage backends

_Source: `docs/api/storage.mdx` · https://yoink.goatsquadstudios.com/docs/api/storage_

> CheckpointStorage interface, LocalFileStorage, S3Storage, and the StorageFactory.

Storage backends are how `CheckpointManager` persists records.
yoink ships two — local files and S3 — and the interface is small enough to add your own (Redis, GCS, Azure Blob, etc.). ```python from yoink.storage import ( CheckpointStorage, # abstract base LocalFileStorage, S3Storage, StorageFactory, ) ``` ## `CheckpointStorage` interface Every backend implements five async methods: ```python class CheckpointStorage(ABC): @abstractmethod async def write(self, data: str) -> None: ... @abstractmethod async def read(self) -> AsyncIterator[str]: ... @abstractmethod async def exists(self) -> bool: ... @abstractmethod async def flush(self) -> None: ... @abstractmethod async def close(self) -> None: ... ``` ## `LocalFileStorage` Async append to a local file via `aiofiles`. ```python LocalFileStorage(path: str) ``` ```python storage = LocalFileStorage("./crawl.jsonl") ``` - Opens the file in append mode on first `write()`. - `flush()` calls the underlying `flush()` on the file handle (OS will still buffer to disk; pair with `fsync` if you need durability guarantees beyond the crawl). - `close()` closes the file handle. ## `S3Storage` Buffered S3 backend using `aioboto3`. Requires the `[s3]` extra. ```python S3Storage(uri: str) # s3://bucket/key ``` ```python storage = S3Storage("s3://my-bucket/crawls/site-a.jsonl") ``` **Behavior:** - `write()` buffers in memory. - `flush()` downloads existing object (if any), appends the buffer, re-uploads via `put_object`. This is necessary because S3 objects don't support append. - `read()` does a single `get_object` and yields lines. - `exists()` does `head_object`. S3 has no native append. Each flush is a download-mutate-upload. Set `flush_interval` to 50–100+ for production crawls; the sweet spot depends on page size and how upset you'd be losing the most-recent buffer on a crash. 
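To put numbers on the download-mutate-upload cost, here is a back-of-the-envelope estimator. It is a sketch, not a measurement: `s3_flush_traffic` is a hypothetical helper, and it assumes a fresh object, uniform page size, one flush every `flush_interval` pages, and no overhead from metadata or state records.

```python
import math


def s3_flush_traffic(pages: int, page_bytes: int, flush_interval: int) -> dict:
    """Estimate bytes moved by download-append-upload flushes on a fresh object."""
    flushes = math.ceil(pages / flush_interval)
    downloaded = uploaded = 0
    for i in range(1, flushes + 1):
        # Object size before and after flush i, capped at the total page count.
        before = min((i - 1) * flush_interval, pages) * page_bytes
        after = min(i * flush_interval, pages) * page_bytes
        downloaded += before  # existing object fetched (nothing on the first flush)
        uploaded += after     # whole object re-uploaded via put_object
    return {"flushes": flushes, "downloaded": downloaded, "uploaded": uploaded}
```

Under these assumptions, 10,000 pages of 5,000 bytes each re-upload about 25 GB in total at `flush_interval=10` and about 2.5 GB at `flush_interval=100`, because every flush rewrites the whole object and the object keeps growing.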
### Required IAM permissions

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::your-bucket-name/*"
  }]
}
```

There is no separate `s3:HeadObject` IAM action. The `head_object` call behind `exists()` is authorized by `s3:GetObject`.

### Credentials

`aioboto3` uses the standard boto3 credential chain. Locally, run `aws configure`. On Lambda / EC2 / ECS, attach an IAM role to the runtime and `S3Storage` will pick it up automatically.

## `StorageFactory`

Picks a backend based on URI scheme. This is what `CheckpointManager.from_uri()` uses internally.

```python
StorageFactory.from_uri("./checkpoint.jsonl")       # → LocalFileStorage
StorageFactory.from_uri("/abs/path.jsonl")          # → LocalFileStorage
StorageFactory.from_uri("s3://bucket/key.jsonl")    # → S3Storage
```

## Implementing a custom backend

Implementing the interface is roughly 80 lines. Here's a sketch for Redis:

```python
import redis.asyncio as redis
from yoink.storage import CheckpointStorage


class RedisStreamStorage(CheckpointStorage):
    # Despite the name, records live in a plain Redis list (RPUSH / LRANGE),
    # not a Redis Stream.
    def __init__(self, url: str, key: str):
        self.client = redis.from_url(url)
        self.key = key

    async def write(self, data: str) -> None:
        await self.client.rpush(self.key, data)

    async def read(self):
        # The client returns bytes by default, so decode each record.
        for raw in await self.client.lrange(self.key, 0, -1):
            yield raw.decode("utf-8")

    async def exists(self) -> bool:
        return bool(await self.client.exists(self.key))

    async def flush(self) -> None:
        # Redis is auto-flushed
        pass

    async def close(self) -> None:
        await self.client.aclose()
```

Then plug it in:

```python
from yoink import Crawler, CrawlConfig, CheckpointManager

storage = RedisStreamStorage("redis://localhost", "yoink:crawl-1")
checkpoint = CheckpointManager(storage=storage)
crawler = Crawler(config=CrawlConfig(), checkpoint_manager=checkpoint)
```

## See also

- [`CheckpointManager`](/docs/api/checkpoint).
- [Checkpointing concepts](/docs/concepts/checkpointing).
## CrawlStats _Source: `docs/api/stats.mdx` · https://yoink.goatsquadstudios.com/docs/api/stats_ > Compute, format, and export statistics from a crawl — depth distribution, top domains, content quality. `CrawlStats` analyzes a list of `Page` objects (or loads them from a file) and produces summary metrics. It powers the `yoink stats` CLI but is also fine to use programmatically. ## Import ```python from yoink.stats import CrawlStats ``` ## Constructing ```python # From a list of Page objects (e.g., right after a crawl) stats = CrawlStats(pages) # From a saved file (.json or .jsonl) stats = CrawlStats.from_file(Path("crawl_output.jsonl")) ``` ## Methods ## What `compute()` returns ```python { "total_pages": 87, "total_links": 1243, "total_text_size": 422291, # bytes "total_html_size": 0, # 0 if save_html=False "avg_links_per_page": 14.29, "avg_text_size": 4853.92, "avg_html_size": 0, "max_depth": 2, "pages_by_depth": { 0: 1, 1: 24, 2: 62 }, "unique_domains": 1, "top_domains": [{ "domain": "docs.example.com", "count": 87 }], "status_codes": { 200: 87 }, "pages_with_text": 85, "pages_with_title": 87, "pages_with_metadata": 73, "text_length_min": 142, "text_length_median": 3891, "text_length_max": 28442, } ``` If you pass an empty list of pages, `compute()` short-circuits and returns just `{"total_pages": 0}` — none of the other keys above will be present. If you read `data["pages_with_text"]` blindly, that'll `KeyError` on an empty crawl. Check `total_pages > 0` first or use `data.get(...)`. 
## Examples

### After a crawl

```python
from yoink import Crawler, CrawlConfig
from yoink.stats import CrawlStats

async def main():
    crawler = Crawler(CrawlConfig())
    pages = await crawler.crawl("https://example.com")

    stats = CrawlStats(pages)
    print(stats.format_summary())
```

### From a saved file

```python
from pathlib import Path
from yoink.stats import CrawlStats

stats = CrawlStats.from_file(Path("crawl_output.jsonl"))
data = stats.compute()

print(f"Got {data['total_pages']} pages across {data['unique_domains']} domains")
print(f"Median page text: {data['text_length_median']} chars")
```

### Filtering by content quality

```python
data = stats.compute()

# An empty crawl returns only {"total_pages": 0}, so check the total first.
total = data["total_pages"]
if total and data["pages_with_text"] / total < 0.5:
    print("⚠ Less than half the pages had extractable text — site may be JS-heavy")
```

### Export

```python
stats.export_csv(Path("crawl_stats.csv"))
```

The CSV has two sections:

```
Metric,Value
Total Pages,87
Total Links,1243
...
Top Domains,Count
docs.example.com,87
```

## See also

- The CLI version: [`yoink stats`](/docs/cli/stats).

## Writers

_Source: `docs/api/writers.mdx` · https://yoink.goatsquadstudios.com/docs/api/writers_

> Serialize a list of Page objects to JSON, JSONL, Parquet, or plain text.

`Writer` is a static helper class with one method per output format. Used internally by the CLI; available directly for programmatic use.
## Import ```python from yoink.writers import Writer ``` ## Methods ## Examples ### After a crawl ```python from pathlib import Path from yoink import Crawler from yoink.writers import Writer async def main(): crawler = Crawler() pages = await crawler.crawl("https://example.com") Writer.write_jsonl(pages, Path("data.jsonl")) Writer.write_parquet(pages, Path("data.parquet")) ``` ### Parquet for analytics ```python import pandas as pd Writer.write_parquet(pages, Path("data.parquet")) df = pd.read_parquet("data.parquet") print(df["depth"].value_counts()) print(df.groupby("depth")["num_links"].mean()) ``` The Parquet schema is **flattened** for analytical queries: ### Text dump ```python Writer.write_text(pages, Path("dump.txt")) ``` Output: ``` URL: https://example.com Title: Example Domain -------------------------------------------------------------------------------- This domain is for use in illustrative examples in documents... ================================================================================ URL: https://example.com/about Title: About ... ``` When a page has no `title`, the field is rendered as `Title: N/A`. ## Output format choice | Format | Best for | Streaming? | Compressed? | | ---------- | ------------------------------------------- | ---------- | ----------- | | `jsonl` | AI / ML pipelines, large datasets | yes | no | | `json` | Small datasets, debugging | no | no | | `parquet` | Analytics, pandas, columnar storage | yes (rows) | yes (snappy)| | `text` | Quick eyeballing, archival | no | no | For most use cases, **JSONL is the default answer.** ## See also - [Output formats reference](/docs/reference/output-formats). - [`Page`](/docs/api/page) — what gets serialized. --- # Examples ## Basic crawl _Source: `docs/examples/basic.mdx` · https://yoink.goatsquadstudios.com/docs/examples/basic_ > A minimal end-to-end example — crawl, save, inspect. The fewest lines of code that does something useful. 
[`examples/basic_crawl.py`](https://github.com/ErikkJs/yoink/blob/master/examples/basic_crawl.py) is a minimal starter — just a `Crawler()` with defaults that prints page summaries. The script below extends that to also save JSONL and print stats. Use whichever fits. ## Script Save this as `my_crawl.py`: ```python import asyncio from pathlib import Path from yoink import Crawler, CrawlConfig from yoink.writers import Writer from yoink.stats import CrawlStats async def main(): config = CrawlConfig( max_depth=2, max_pages=50, requests_per_second=2.0, ) crawler = Crawler(config=config) pages = await crawler.crawl("https://example.com") # Save to JSONL output = Path("example.jsonl") Writer.write_jsonl(pages, output) print(f"Saved {len(pages)} pages to {output}") # Print summary stats = CrawlStats(pages) print(stats.format_summary()) asyncio.run(main()) ``` ## Run it ```bash python my_crawl.py ``` ## What you get 1. `example.jsonl` — one JSON object per page. 2. A formatted summary printed to stdout (depth distribution, top domains, content quality). ## Variations ### Save HTML too ```python config = CrawlConfig( max_depth=2, max_pages=50, save_html=True, # raw HTML on each Page record ) ``` ### Multiple output formats ```python Writer.write_jsonl(pages, Path("data.jsonl")) Writer.write_parquet(pages, Path("data.parquet")) Writer.write_text(pages, Path("data.txt")) ``` ### Filter file types ```python from yoink.filters import CombinedFilter url_filter = CombinedFilter.from_config( skip_extensions=["pdf", "zip", "exe", "jpg", "png"], ) crawler = Crawler(config=config, url_filter=url_filter) ``` ## Same thing on the CLI ```bash yoink crawl https://example.com -d 2 -n 50 -o example.jsonl yoink stats example.jsonl ``` ## See also - [Quickstart](/docs/quickstart) — the same idea, even shorter. - [`Crawler`](/docs/api/crawler) and [`CrawlConfig`](/docs/api/config) for the full API. 
## AI training data

_Source: `docs/examples/ai-training.mdx` · https://yoink.goatsquadstudios.com/docs/examples/ai-training_

> Build a clean, deduplicated text dataset suitable for fine-tuning or RAG.

This is the canonical use case yoink was built for: turn a documentation site (or any structured public source) into a clean JSONL ready to feed into a training pipeline or vector database.

[`examples/ai_training_data.py`](https://github.com/ErikkJs/yoink/blob/master/examples/ai_training_data.py) is a simpler starter (length-filter + JSONL + stats, no dedup or token budgeting). The script below adds hash-based dedup, length clipping, and a token-count estimate — copy whichever matches your needs.

## The pipeline

## Script

```python
import asyncio
import hashlib
import json
from pathlib import Path

from yoink import Crawler, CrawlConfig
from yoink.filters import CombinedFilter

MIN_TEXT_CHARS = 500
MAX_TEXT_CHARS = 50_000


async def build_dataset(start_url: str, output: Path):
    config = CrawlConfig(
        max_depth=3,
        max_pages=10_000,
        max_concurrency=15,
        requests_per_second=5.0,
        extract_text=True,
        save_html=False,        # we don't need it
        respect_robots=True,    # always
    )

    url_filter = CombinedFilter.from_config(
        skip_extensions=["pdf", "zip", "exe", "jpg", "png", "gif", "mp4"],
        exclude_patterns=["*/print/*", "*/edit/*", r".*\?diff=.*"],
    )

    crawler = Crawler(config=config, url_filter=url_filter)
    pages = await crawler.crawl(start_url)

    # Dedup by text hash (different URLs, same content)
    seen_hashes: set[str] = set()
    written = 0
    skipped = 0      # empty or too-short pages
    duplicates = 0   # same text already seen under another URL

    with open(output, "w", encoding="utf-8") as f:
        for page in pages:
            text = page.text
            if not text or len(text) < MIN_TEXT_CHARS:
                skipped += 1
                continue
            if len(text) > MAX_TEXT_CHARS:
                text = text[:MAX_TEXT_CHARS]   # clip, don't drop

            h = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if h in seen_hashes:
                duplicates += 1
                continue
            seen_hashes.add(h)

            record = {
                "id": h[:16],
                "source_url": page.url,
                "title": page.title,
                "text": text,
                "tokens_approx": len(text) // 4,
                "depth": page.depth,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1

    return {
        "crawled": len(pages),
        "written": written,
        "skipped": skipped,
        "deduped": duplicates,
    }


if __name__ == "__main__":
    result = asyncio.run(build_dataset(
        "https://docs.example.com",
        Path("training_data.jsonl"),
    ))
    print(f"Crawled: {result['crawled']}")
    print(f"Written: {result['written']}")
    print(f"Skipped: {result['skipped']}")
    print(f"Deduped: {result['deduped']}")
```

## What this does

1. **Polite crawl** — 5 RPS, respects robots.txt, stays on the start domain.
2. **Skip binaries** — no PDFs, images, or zips muddying the text dataset.
3. **Skip noise** — `print/`, `edit/`, and `?diff=` URLs typically duplicate canonical content.
4. **Filter on length** — drop pages with too little text (chrome-only) and clip overly long ones (likely concatenated-everything pages) at `MAX_TEXT_CHARS`.
5. **Dedupe by hash** — different URLs with identical extracted text get collapsed.
6. **Token estimate** — a rough `len(text) // 4` works well enough for budgeting.

## Loading it back

```python
import json

# The file was written as UTF-8 with ensure_ascii=False, so read it the same way
with open("training_data.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} records, {sum(r['tokens_approx'] for r in records):,} approx tokens")
```

## Variations

### For a vector index (chunking)

```python
from textwrap import wrap


def chunks(text: str, size: int = 1000):
    return wrap(text, size, replace_whitespace=False, drop_whitespace=False)


# in the loop:
for i, chunk in enumerate(chunks(text)):
    record = {
        "id": f"{h[:16]}-{i}",
        "source_url": page.url,
        "chunk_index": i,
        "text": chunk,
    }
    ...
```

### Including metadata for filtering

```python
record = {
    "id": h[:16],
    "source_url": page.url,
    "title": page.title,
    "text": text,
    "description": page.metadata.get("description"),
    "og_type": page.metadata.get("og:type"),
    "depth": page.depth,
    "crawled_at": page.crawled_at.isoformat(),
}
```

## See also

- [URL filtering concepts](/docs/concepts/url-filtering).
- [`CrawlConfig`](/docs/api/config) — every knob.
- [`Page`](/docs/api/page) — what's available on each record.
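The length filter and hash dedup from the script can be tried in isolation on plain strings, with no crawl involved. The sketch below is one way to do that. `filter_and_dedup` is a hypothetical helper, and the whitespace and case normalization before hashing is an addition that the script above does not perform; drop it if you only want exact-match dedup.

```python
import hashlib


def filter_and_dedup(texts, min_chars=500, max_chars=50_000):
    """Return (kept_texts, stats) after length filtering, clipping, and hash dedup."""
    seen = set()
    kept = []
    stats = {"too_short": 0, "duplicates": 0}
    for text in texts:
        if not text or len(text) < min_chars:
            stats["too_short"] += 1
            continue
        text = text[:max_chars]  # clip long pages instead of dropping them
        # Assumption: collapse whitespace and case so trivially different
        # copies of the same page produce the same hash.
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            stats["duplicates"] += 1
            continue
        seen.add(key)
        kept.append(text)
    return kept, stats


# Made-up inputs with tiny limits so the behavior is easy to see.
docs = [
    "short",                            # below min_chars
    "Hello   world, this is a page.",   # kept
    "hello world, this is a page.",     # duplicate after normalization
    "x" * 200,                          # kept, clipped to max_chars
]
kept, stats = filter_and_dedup(docs, min_chars=10, max_chars=100)
print(len(kept), stats)
```

Keeping separate counters for short pages and duplicates makes it easier to tell whether a small dataset comes from thin pages or from heavy duplication.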
## Checkpoint & resume

_Source: `docs/examples/checkpoint-resume.mdx` · https://yoink.goatsquadstudios.com/docs/examples/checkpoint-resume_

> Three patterns for resumable crawls — local file, S3 across processes, and a Lambda handler that survives 15-minute timeouts.

The repo's [`examples/checkpoint_resume.py`](https://github.com/ErikkJs/yoink/blob/master/examples/checkpoint_resume.py) bundles three runnable scenarios that map cleanly onto real production patterns. This page walks through each.

## 1. Local file checkpoint

The simplest case: a long crawl on your laptop that you want to be able to Ctrl-C and pick up where you left off.

```python
import asyncio

from yoink import Crawler, CrawlConfig, CheckpointManager


async def main():
    config = CrawlConfig(
        max_depth=2,
        max_pages=100,
        max_concurrency=10,
    )

    checkpoint = CheckpointManager.from_uri(
        "./crawl.jsonl",
        flush_interval=5,   # state snapshot every 5 pages
    )

    crawler = Crawler(config=config, checkpoint_manager=checkpoint)

    # First run, OR resume — same call either way
    pages = await crawler.crawl("https://example.com", resume=True)
    print(f"Crawled {len(pages)} pages")


try:
    asyncio.run(main())
except KeyboardInterrupt:
    # On Ctrl-C, asyncio.run() cancels main() and re-raises KeyboardInterrupt here
    print("Interrupted! Checkpoint saved. Re-run to resume.")
```

On a fresh run with no checkpoint, `resume=True` starts from scratch. With an existing checkpoint, it restores `visited`, `queue`, and previously-yoinked pages, then continues from the queue.

## 2. S3 checkpoint (cross-process)

Same code, different URI. Useful when the crawl might run on different hosts (e.g., a fresh container picks up where a killed one left off):

```python
config = CrawlConfig(max_depth=2, max_pages=1000, max_concurrency=20)

checkpoint = CheckpointManager.from_uri(
    "s3://my-crawl-bucket/checkpoints/example.jsonl",
    flush_interval=10,   # higher → fewer S3 API calls
)

crawler = Crawler(config=config, checkpoint_manager=checkpoint)
# inside an async function, as in the first example
pages = await crawler.crawl("https://example.com", resume=True)
```

Requires the `s3` extra (`pip install -e ".[s3]"`) and AWS credentials in the environment / IAM role / `~/.aws/credentials`.

Each flush is a download-mutate-upload round-trip on the S3 object (S3 has no native append). For a small/medium crawl, `flush_interval=10` keeps the API call rate sensible without losing more than ~10 pages of state on a crash.

## 3. Lambda handler

This is the pattern that makes long crawls survive Lambda's hard 15-minute timeout. Each invocation crawls for ~14 minutes, checkpoints to S3, and exits. EventBridge re-invokes; the next run resumes from the same checkpoint. The snippet below is a simplified version: it simulates the event and has no time budget of its own.

```python
async def lambda_handler():
    # Simulated event; a deployed handler receives it as an argument
    event = {"url": "https://example.com"}

    checkpoint = CheckpointManager.from_uri(
        "s3://my-crawl-bucket/lambda-checkpoints/crawl.jsonl",
        flush_interval=10,
    )

    config = CrawlConfig(max_pages=5000, max_concurrency=30, max_depth=3)
    crawler = Crawler(config=config, checkpoint_manager=checkpoint)

    pages = await crawler.crawl(event["url"], resume=True)

    return {
        "statusCode": 200,
        "body": {
            "pages_crawled": len(pages),
            "message": "Crawl completed or checkpoint saved for next invocation",
        },
    }
```

For the full deployment recipe (time budget, IAM role, EventBridge schedule, layer build), see [Lambda + S3 checkpoints](/docs/examples/lambda-s3).

## Running the bundled file

```bash
poetry run python examples/checkpoint_resume.py
```

By default this runs the local-file scenario. The S3 and Lambda scenarios are commented out at the bottom of the file — uncomment after you've configured AWS credentials.
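To make the resume semantics concrete (restore `visited`, `queue`, and collected pages, then continue from the queue), here is a toy model that runs without yoink or a network. Everything in it is hypothetical: the `LINKS` graph, the `crawl` function, and the single JSON snapshot file are illustration only and do not reflect yoink's actual checkpoint format or internals.

```python
import json
import tempfile
from collections import deque
from pathlib import Path
from typing import Optional

# Hypothetical in-memory "site": each URL maps to the links found on it.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": ["a"]}


def crawl(start: str, state_path: Path, flush_interval: int = 2,
          stop_after: Optional[int] = None) -> list:
    """Breadth-first walk that snapshots visited/queue/pages to state_path."""
    if state_path.exists():   # resume: restore the saved state
        state = json.loads(state_path.read_text(encoding="utf-8"))
    else:                     # fresh run: start from scratch
        state = {"visited": [start], "queue": [start], "pages": []}
    visited = set(state["visited"])
    queue = deque(state["queue"])
    pages = state["pages"]

    def flush() -> None:
        snapshot = {"visited": sorted(visited), "queue": list(queue), "pages": pages}
        state_path.write_text(json.dumps(snapshot), encoding="utf-8")

    processed = 0
    while queue:
        if stop_after is not None and processed >= stop_after:
            return pages      # simulated crash: work since the last flush is lost
        url = queue.popleft()
        pages.append(url)
        for link in LINKS[url]:
            if link not in visited:
                visited.add(link)
                queue.append(link)
        processed += 1
        if processed % flush_interval == 0:
            flush()
    flush()
    return pages


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "state.json"
    first = crawl("a", state_file, flush_interval=2, stop_after=3)   # "crashes" after 3 pages
    resumed = crawl("a", state_file, flush_interval=2)               # picks up from the snapshot

print(first, resumed)
```

In this run the third page is processed after the last flush, so the resumed run fetches it again. That is the trade-off `flush_interval` controls: fewer writes in exchange for redoing up to one interval of work after a crash.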
## See also

- [Checkpointing concepts](/docs/concepts/checkpointing) — file format, resume semantics, when to use.
- [`CheckpointManager` API](/docs/api/checkpoint).
- [Storage backends](/docs/api/storage).

## Lambda + S3 checkpoints

_Source: `docs/examples/lambda-s3.mdx` · https://yoink.goatsquadstudios.com/docs/examples/lambda-s3_

> A resumable AWS Lambda crawler that survives 15-minute timeouts via S3 checkpoints.

AWS Lambda has a hard 15-minute execution limit. A crawl that wants to survive longer than that needs to checkpoint and resume across invocations. With yoink, that fits in a short handler.

## The architecture

## Lambda handler

```python
import asyncio
import json
import os

from yoink import Crawler, CrawlConfig, CheckpointManager

CHECKPOINT_BUCKET = os.environ["CHECKPOINT_BUCKET"]
CHECKPOINT_KEY = os.environ["CHECKPOINT_KEY"]   # e.g. "crawls/example-com.jsonl"
START_URL = os.environ["START_URL"]
MAX_PAGES = int(os.environ.get("MAX_PAGES", "10000"))

# Leave ~60s of the 15-minute limit for Lambda housekeeping
TIME_BUDGET_SECONDS = 14 * 60


async def crawl_chunk():
    config = CrawlConfig(
        max_depth=4,
        max_pages=MAX_PAGES,
        max_concurrency=20,
        requests_per_second=10.0,
    )

    checkpoint_uri = f"s3://{CHECKPOINT_BUCKET}/{CHECKPOINT_KEY}"
    checkpoint = CheckpointManager.from_uri(checkpoint_uri, flush_interval=50)

    crawler = Crawler(config=config, checkpoint_manager=checkpoint)

    # Resume picks up if checkpoint exists, else starts fresh
    pages = await asyncio.wait_for(
        crawler.crawl(START_URL, resume=True),
        timeout=TIME_BUDGET_SECONDS,
    )
    return pages


def handler(event, context):
    try:
        pages = asyncio.run(crawl_chunk())
        # crawl() returned on its own: page cap reached or queue exhausted
        done = True
    except asyncio.TimeoutError:
        # Hit the time budget — we'll resume on the next invocation
        done = False
        pages = []

    return {
        "statusCode": 200,
        "body": json.dumps({
            "pages_so_far": len(pages),
            "done": done,
        }),
    }
```

## Deploy

### IAM role

The Lambda execution role needs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-checkpoint-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-checkpoint-bucket"
    }
  ]
}
```

There is no `s3:HeadObject` IAM action; `HeadObject` requests are authorized by `s3:GetObject`. Granting `s3:ListBucket` on the bucket makes a missing checkpoint return 404 instead of 403.

### Lambda layer

Bundle yoink and its `[s3]` extra into a layer. Install from a source checkout, because the bare `yoink` name on PyPI belongs to an unrelated package:

```bash
git clone https://github.com/ErikkJs/yoink
mkdir -p layer/python
pip install --target layer/python "./yoink[s3]"
cd layer && zip -r ../yoink-layer.zip python && cd ..

aws lambda publish-layer-version \
  --layer-name yoink \
  --zip-file fileb://yoink-layer.zip \
  --compatible-runtimes python3.11
```

### Function

```bash
aws lambda create-function \
  --function-name yoink-crawler \
  --runtime python3.11 \
  --role arn:aws:iam::ACCOUNT:role/yoink-crawler-role \
  --handler handler.handler \
  --timeout 900 \
  --memory-size 1024 \
  --layers arn:aws:lambda:REGION:ACCOUNT:layer:yoink:1 \
  --zip-file fileb://handler.zip \
  --environment "Variables={CHECKPOINT_BUCKET=...,CHECKPOINT_KEY=crawls/example.jsonl,START_URL=https://example.com}"
```

### Schedule

```bash
aws events put-rule \
  --name yoink-crawler-tick \
  --schedule-expression "rate(14 minutes)"

aws events put-targets \
  --rule yoink-crawler-tick \
  --targets "Id=1,Arn=arn:aws:lambda:REGION:ACCOUNT:function:yoink-crawler"

aws lambda add-permission \
  --function-name yoink-crawler \
  --statement-id allow-eventbridge \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:REGION:ACCOUNT:rule/yoink-crawler-tick
```

## Observability

A few things worth logging:

```python
import structlog

log = structlog.get_logger()

# in handler:
log.info(
    "invocation_complete",
    pages_so_far=len(pages),
    done=done,
    checkpoint=f"s3://{CHECKPOINT_BUCKET}/{CHECKPOINT_KEY}",
)
```

You can read the checkpoint file from anywhere with read access — `aws s3 cp`, the AWS console, or a small Lambda that loads it via `CheckpointManager.from_uri(...).load()`.
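The time-budget pattern from the handler can be tried locally without AWS. The sketch below swaps the crawl for a hypothetical `fake_crawl` coroutine that sleeps instead of fetching, and wraps it in `asyncio.wait_for` the same way. `run_with_budget` and the timings are made up for illustration. Progress is kept in a list outside the cancelled coroutine, which is one way to report a non-zero count after a timeout.

```python
import asyncio


async def fake_crawl(pages_to_fetch: int, seconds_per_page: float, collected: list) -> list:
    """Stand-in for a crawl: 'fetches' one page per sleep and records progress."""
    for i in range(pages_to_fetch):
        await asyncio.sleep(seconds_per_page)
        collected.append(f"page-{i}")
    return collected


async def run_with_budget(pages_to_fetch: int, budget_seconds: float) -> dict:
    collected: list = []
    try:
        await asyncio.wait_for(
            fake_crawl(pages_to_fetch, 0.01, collected),
            timeout=budget_seconds,
        )
        done = True    # the coroutine returned on its own
    except asyncio.TimeoutError:
        done = False   # wait_for cancelled it; partial progress is in `collected`
    return {"done": done, "pages_so_far": len(collected)}


finished = asyncio.run(run_with_budget(pages_to_fetch=3, budget_seconds=1.0))
timed_out = asyncio.run(run_with_budget(pages_to_fetch=1000, budget_seconds=0.05))
print(finished)
print(timed_out)
```

`asyncio.wait_for` cancels the wrapped coroutine when the timeout expires, so anything that must survive the cutoff has to live outside it, either in shared state as here or in a checkpoint.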
## Stopping the schedule

When `done=True`, disable the EventBridge rule (or have the Lambda do it):

```python
import boto3

if done:
    boto3.client("events").disable_rule(Name="yoink-crawler-tick")
```

If the Lambda disables the rule itself, its execution role also needs `events:DisableRule`.

Don't wait for the whole crawl to finish before doing something with it. The checkpoint file is JSONL — kick off a parallel Lambda or Glue job that tails it and processes new lines.

## See also

- [Checkpointing concepts](/docs/concepts/checkpointing).
- [`CheckpointManager`](/docs/api/checkpoint).
- [Storage backends](/docs/api/storage).

## Custom extraction

_Source: `docs/examples/custom-extraction.mdx` · https://yoink.goatsquadstudios.com/docs/examples/custom-extraction_

> Replace or augment the default text extractor with domain-specific logic.

Trafilatura is great for general-purpose article extraction. But sometimes you need to extract structured data — product specs, schema.org JSON-LD, GitHub READMEs — and want to bypass or augment the default extractor.

[`examples/custom_extraction.py`](https://github.com/ErikkJs/yoink/blob/master/examples/custom_extraction.py) demonstrates lightweight post-processing — link-counting by domain, keyword search, metadata inspection. The recipes on this page go further (subclassing `Extractor`, parsing JSON-LD, handling PDFs).

## Approach 1: post-process `Page.html`

If you set `save_html=True`, every page record carries the raw HTML. You can run any extractor over it after the crawl.

```python
import asyncio
import json

from bs4 import BeautifulSoup
from yoink import Crawler, CrawlConfig


async def main():
    config = CrawlConfig(
        max_depth=2,
        save_html=True,       # we need raw HTML
        extract_text=False,   # skip trafilatura
    )

    crawler = Crawler(config=config)
    pages = await crawler.crawl("https://example.com/products")

    products = []
    for page in pages:
        if not page.html:
            continue

        soup = BeautifulSoup(page.html, "lxml")

        # Pull schema.org JSON-LD
        for tag in soup.find_all("script", type="application/ld+json"):
            try:
                data = json.loads(tag.string or "")
            except json.JSONDecodeError:
                continue

            if isinstance(data, dict) and data.get("@type") == "Product":
                offers = data.get("offers") or {}
                if isinstance(offers, list):   # `offers` may be a list of offers
                    offers = offers[0] if offers else {}
                products.append({
                    "url": page.url,
                    "name": data.get("name"),
                    "price": offers.get("price"),
                    "currency": offers.get("priceCurrency"),
                })

    return products


products = asyncio.run(main())
print(f"Extracted {len(products)} products")
```

## Approach 2: subclass the `Extractor`

For invasive changes, replace the extractor entirely. The `Crawler.__init__` builds its own `Extractor`, so the cleanest path is a small subclass of `Crawler`:

```python
from yoink import Crawler, CrawlConfig
from yoink.extractor import Extractor


class MarkdownExtractor(Extractor):
    def extract(self, html: str, url: str) -> str:
        # Replace the trafilatura call with markdownify, html2text,
        # readability-lxml, or your own logic.
        from markdownify import markdownify
        return markdownify(html, heading_style="ATX")


class MarkdownCrawler(Crawler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.extractor = MarkdownExtractor()


# Use it like any Crawler (inside an async function)
crawler = MarkdownCrawler(config=CrawlConfig())
pages = await crawler.crawl("https://docs.example.com")
# page.text is now markdown
```

## Approach 3: extract during the crawl with metadata

The default `Parser` already extracts standard meta tags into `Page.metadata`. If you need extra fields, you can override a crawler hook such as `_worker`, but enriching the pages after the crawl is usually simpler:

```python
from bs4 import BeautifulSoup
from yoink import Crawler, CrawlConfig


class EnrichedCrawler(Crawler):
    async def _worker(self, fetcher, worker_id):
        # Defer to the parent; per-page enrichment would hook in here
        await super()._worker(fetcher, worker_id)


# That works in principle, but the most pragmatic approach is to
# enrich AFTER the crawl (inside an async function):
crawler = Crawler(config=CrawlConfig(save_html=True))
pages = await crawler.crawl("https://docs.example.com")

for page in pages:
    if page.html:
        soup = BeautifulSoup(page.html, "lxml")
        # Pull custom metadata; Open Graph article tags use `property`, not `name`
        published = soup.find("meta", attrs={"property": "article:published_time"})
        if published:
            page.metadata["published_at"] = published.get("content", "")
```

## Approach 4: PDF or other non-HTML content

yoink doesn't ship a PDF extractor, but you can post-process easily:

```python
import asyncio
from io import BytesIO

import requests
from pypdf import PdfReader
from yoink import Crawler, CrawlConfig


async def main():
    config = CrawlConfig(extract_text=False)   # we'll do our own
    crawler = Crawler(config=config)
    pages = await crawler.crawl("https://example.com/papers")

    for page in pages:
        if page.url.endswith(".pdf"):
            # Re-fetch as binary (yoink fetched it as a string, which mangled bytes)
            content = requests.get(page.url, timeout=30).content
            reader = PdfReader(BytesIO(content))
            # extract_text() can come back empty for image-only pages
            page.text = "\n\n".join((p.extract_text() or "") for p in reader.pages)


asyncio.run(main())
```

yoink optimizes for HTML and clean text. For weird formats — PDFs, video transcripts, structured data feeds — pull just the URLs you need with yoink (use [URL filtering](/docs/concepts/url-filtering)) and process them with format-specific tooling afterwards.

## See also

- The default `Extractor`: src/yoink/extractor.py.
- The default `Parser`: src/yoink/parser.py.
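If you would rather not depend on BeautifulSoup for the JSON-LD recipe, the same idea can be sketched with the standard library's `html.parser`. This is a minimal sketch under stated assumptions, not yoink code: `JsonLdParser`, `extract_products`, and the sample HTML are hypothetical, and the parser only handles the plain `<script type="application/ld+json">` case.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks
            # A block may hold one object or a list of them.
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_products(html: str) -> list:
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    products = []
    for item in parser.items:
        if not isinstance(item, dict) or item.get("@type") != "Product":
            continue
        offers = item.get("offers") or {}
        if isinstance(offers, list):  # several offers: take the first
            offers = offers[0] if offers else {}
        products.append({
            "name": item.get("name"),
            "price": offers.get("price"),
            "currency": offers.get("priceCurrency"),
        })
    return products


# Made-up page with one Product block and one unrelated block.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": [{"price": "9.99", "priceCurrency": "USD"}]}
</script>
<script type="application/ld+json">{"@type": "Organization", "name": "Acme"}</script>
</head><body></body></html>
"""

products = extract_products(SAMPLE_HTML)
print(products)
```

In a real pipeline you would call `extract_products(page.html)` for each page crawled with `save_html=True`.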
---

# Reference

## Output formats

_Source: `docs/reference/output-formats.mdx` · https://yoink.goatsquadstudios.com/docs/reference/output-formats_

> JSON, JSONL, Parquet, and plain text — exact shapes, when to use each.

yoink writes crawl results in four formats. They all carry the same `Page` data, but differ in shape, streamability, and compression.

## Format comparison

| Format    | Best for                  | Streamable | Compressed | Extras needed |
| --------- | ------------------------- | ---------- | ---------- | ------------- |
| `jsonl`   | AI/ML, large datasets     | yes (rows) | no         | —             |
| `json`    | Small datasets, debugging | no         | no         | —             |
| `parquet` | Analytics, pandas         | yes (rows) | snappy     | `[parquet]`   |
| `text`    | Eyeballing                | no         | no         | —             |

## JSON

A single JSON array. Easy to read, easy to break: large arrays must be loaded entirely into memory.

```bash
yoink crawl https://example.com -f json -o data.json
```

```json
[
  {
    "url": "https://example.com",
    "title": "Example Domain",
    "text": "...",
    "html": null,
    "links": ["https://example.com/about"],
    "metadata": {},
    "crawled_at": "2026-05-03T12:00:00",
    "status_code": 200,
    "depth": 0
  },
  { "url": "https://example.com/about", "...": "..." }
]
```

## JSONL (recommended)

Newline-delimited JSON. One `Page` per line. Streamable and `grep`-friendly.

```bash
yoink crawl https://example.com -f jsonl -o data.jsonl
```

```jsonl
{"url": "https://example.com", "title": "Example Domain", ...}
{"url": "https://example.com/about", "title": "About", ...}
```

Reading back into `Page` objects:

```python
from yoink import Page

with open("data.jsonl") as f:
    pages = [Page.model_validate_json(line) for line in f]
```

Streaming (don't load it all):

```python
from yoink import Page


def iter_pages(path):
    with open(path) as f:
        for line in f:
            yield Page.model_validate_json(line)


for page in iter_pages("data.jsonl"):
    process(page)   # your own per-page handler
```

## Parquet

Columnar storage. Smaller files, faster analytical queries. Requires the `parquet` extra (`pip install -e ".[parquet]"` from a source checkout).

```bash
yoink crawl https://example.com -f parquet -o data.parquet
```

The schema is **flattened** — `links` becomes `num_links`, `metadata` becomes a JSON-encoded string. This is intentional: it keeps the file portable and analytical queries fast. Compression is `snappy` for fast read/write.

Read with pandas / pyarrow / DuckDB:

```python
import pandas as pd

df = pd.read_parquet("data.parquet")

# Or DuckDB for SQL
import duckdb

duckdb.sql("SELECT depth, count(*) FROM 'data.parquet' GROUP BY depth").show()
```

Parquet drops the per-page `links` array (only `num_links` is preserved) and never writes `html` even when `save_html=True`. If you need either, use JSONL.

## Text

Plain text dump. Good for archival and quick visual inspection.

```bash
yoink crawl https://example.com -f text -o data.txt
```

Format:

```
URL: https://example.com
Title: Example Domain
--------------------------------------------------------------------------------
This domain is for use in illustrative examples in documents.
================================================================================
URL: https://example.com/about
Title: About
--------------------------------------------------------------------------------
...
```

This format is one-way — you can't reliably load it back into `Page` objects. Use JSONL for round-tripping.

## Choosing

- **AI training / RAG indexing?** JSONL.
- **Pandas / DuckDB / Athena?** Parquet.
- **Throwaway one-shot?** JSON.
- **Quick read?** Text.

## See also

- [`Writer`](/docs/api/writers) — programmatic output.
- [`Page`](/docs/api/page) — the underlying data shape.

## Configuration reference

_Source: `docs/reference/configuration.mdx` · https://yoink.goatsquadstudios.com/docs/reference/configuration_

> Quick-scan reference for every configuration option, organized by section.

This page is the dense, scrolling reference. For prose explanations, see the corresponding concept pages.

## Core

| Name              | Type   | Default            | Description                                                                                  |
| ----------------- | ------ | ------------------ | -------------------------------------------------------------------------------------------- |
| `max_depth`       | `int`  |                    | Max link-hop distance from start URL. See [Architecture](/docs/concepts/architecture).      |
| `max_pages`       | `int`  | `100`              | Total page cap.                                                                              |
| `max_concurrency` | `int`  | `10`               | Worker coroutines (1..100).                                                                  |
| `user_agent`      | `str`  | `yoink/ (+github)` | User-Agent header.                                                                           |
| `timeout`         | `int`  | `30`               | Per-request timeout (seconds).                                                               |
| `follow_external` | `bool` | `False`            | Follow links to other domains.                                                               |
| `extract_text`    | `bool` | `True`             | Run trafilatura for clean text.                                                              |
| `save_html`       | `bool` | `False`            | Persist raw HTML on each Page.                                                               |

## Rate limiting

| Name                  | Type    | Default | Description                                            |
| --------------------- | ------- | ------- | ------------------------------------------------------ |
| `requests_per_second` | `float` |         | Per-domain token-bucket fill rate. See Rate limiting.  |
| `request_delay`       | `float` | `0.0`   | Minimum seconds between requests to same domain.       |

## robots.txt

| Name             | Type   | Default | Description                                  |
| ---------------- | ------ | ------- | -------------------------------------------- |
| `respect_robots` | `bool` |         | Fetch and apply robots.txt. See robots.txt.  |

## JavaScript rendering (requires `[browser]`)

| Name                | Type           | Default       | Description                                                |
| ------------------- | -------------- | ------------- | ---------------------------------------------------------- |
| `render_js`         | `bool`         |               | Use Playwright. See JS rendering.                          |
| `headless`          | `bool`         | `True`        | Run browser without a UI window.                           |
| `wait_strategy`     | `WaitStrategy` | `NETWORKIDLE` | `load`, `domcontentloaded`, `networkidle`, or `commit`.    |
| `wait_selector`     | `str \| None`  | `None`        | CSS selector to wait for.                                  |
| `browser_type`      | `Literal`      | `chromium`    | `chromium`, `firefox`, or `webkit`.                        |
| `browser_pool_size` | `int`          | `3`           | Pooled browser contexts (1..10).                           |
| `screenshot_dir`    | `str \| None`  | `None`        | Debug screenshots directory.                               |

## URL filtering (separate from `CrawlConfig`)

Pass to `Crawler(url_filter=...)`. See [`CombinedFilter.from_config`](/docs/api/filters).

## Checkpointing (separate from `CrawlConfig`)

Pass to `Crawler(checkpoint_manager=...)`.
See [`CheckpointManager`](/docs/api/checkpoint).

## CLI flag mapping

| CLI flag                | Config field                  |
| ----------------------- | ----------------------------- |
| `--depth, -d`           | `max_depth`                   |
| `--max-pages, -n`       | `max_pages`                   |
| `--concurrency, -c`     | `max_concurrency`             |
| `--user-agent`          | `user_agent`                  |
| `--follow-external`     | `follow_external`             |
| `--save-html`           | `save_html`                   |
| `--rate-limit, -r`      | `requests_per_second`         |
| `--request-delay`       | `request_delay`               |
| `--no-robots`           | `respect_robots=False`        |
| `--render-js, --browser`| `render_js`                   |
| `--wait-for`            | `wait_strategy`               |
| `--wait-selector`       | `wait_selector`               |
| `--browser-type`        | `browser_type`                |
| `--no-headless`         | `headless=False`              |
| `--include`             | `url_filter.include_patterns` |
| `--exclude`             | `url_filter.exclude_patterns` |
| `--skip-extensions`     | `url_filter.skip_extensions`  |
| `--checkpoint`          | `CheckpointManager.from_uri`  |
| `--checkpoint-interval` | `flush_interval`              |
| `--resume`              | `crawler.crawl(resume=True)`  |

## See also

- [`CrawlConfig`](/docs/api/config) — the Pydantic model itself.
- [CLI: yoink crawl](/docs/cli/crawl) — flag-by-flag.
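To see how a flag table like the one above translates into config keyword arguments, here is a small `argparse` sketch covering a handful of the rows. It is a hypothetical stand-in, not yoink's real CLI code: `build_parser`, `config_kwargs`, and the `crawl-sketch` program name are made up, and only flags the user passes explicitly are emitted, so no defaults are assumed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for part of the flag table; NOT yoink's real CLI.
    parser = argparse.ArgumentParser(prog="crawl-sketch")
    parser.add_argument("url")
    parser.add_argument("--depth", "-d", type=int, dest="max_depth")
    parser.add_argument("--max-pages", "-n", type=int, dest="max_pages")
    parser.add_argument("--concurrency", "-c", type=int, dest="max_concurrency")
    parser.add_argument("--rate-limit", "-r", type=float, dest="requests_per_second")
    # store_const leaves the value at None unless the flag is actually given
    parser.add_argument("--save-html", action="store_const", const=True, dest="save_html")
    parser.add_argument("--no-robots", action="store_const", const=False, dest="respect_robots")
    return parser


def config_kwargs(argv: list) -> dict:
    """Return only the config fields the user set explicitly."""
    args = vars(build_parser().parse_args(argv))
    args.pop("url")
    return {field: value for field, value in args.items() if value is not None}


kwargs = config_kwargs(["https://example.com", "-d", "2", "-n", "50", "--no-robots"])
print(kwargs)
```

A dict like this could then be splatted into the config constructor (`CrawlConfig(**kwargs)`), leaving every unspecified field at the library's own default.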