
Architecture

How yoink's modules fit together — fetcher, parser, scheduler, rate limiter, robots checker, checkpoint, storage.


yoink is built as a small graph of single-purpose modules. Each one does one thing, and the Crawler is the conductor that wires them together.

The components

| Module | Responsibility |
| --- | --- |
| Crawler | Owns the worker pool, drives the crawl loop, persists results. |
| Fetcher | Async HTTP client (aiohttp). Up to 3 attempts with exponential backoff on ClientError; immediate retry on TimeoutError; HTTP errors (4xx/5xx) returned as-is, not retried. |
| PlaywrightFetcher | Browser-based fetcher for JS-heavy sites. |
| create_fetcher | Factory function in fetcher_factory.py. Returns PlaywrightFetcher when render_js=True (and Playwright is importable), Fetcher otherwise. Emits UserWarning and falls back to Fetcher if render_js=True but Playwright isn't installed. |
| Parser | HTML → title, links, metadata (BeautifulSoup + lxml). |
| Extractor | HTML → clean text via trafilatura. |
| Scheduler | URL queue, depth tracking, deduplication, filter integration. |
| RateLimiter | Per-domain token bucket with Crawl-delay support. |
| RobotsChecker | Fetches & caches robots.txt, answers is_allowed(url). |
| URLFilter / DomainFilter / CombinedFilter | Glob/regex/extension/domain matching (filters.py). |
| CheckpointManager | Appends page records and crawl state to a CheckpointStorage. |
| CheckpointStorage | Pluggable backend (local file, S3). |
| Writer | Final-output serialization (JSON, JSONL, Parquet, text). |
| CrawlStats | Post-crawl analysis (depth, domains, content quality). |
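
The retry policy described for Fetcher can be sketched roughly as below. This is an illustration under stated assumptions, not yoink's actual code: `fetch_with_retry`, `do_request`, `base_delay`, and the injectable `sleep` are hypothetical names, and the local `ClientError` class is only a stand-in for `aiohttp.ClientError` so the snippet runs without aiohttp installed.

```python
import asyncio


class ClientError(Exception):
    """Stand-in for aiohttp.ClientError so the sketch runs without aiohttp."""


async def fetch_with_retry(do_request, url, max_attempts=3, base_delay=0.5,
                           sleep=asyncio.sleep):
    """Call do_request(url) up to max_attempts times.

    do_request is a hypothetical coroutine function returning (status, body).
    base_delay is an assumed value; the real backoff schedule may differ.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # 4xx/5xx responses arrive as ordinary return values,
            # so they reach the caller without triggering a retry.
            return await do_request(url)
        except ClientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, then 2 * base_delay, ...
            await sleep(base_delay * 2 ** (attempt - 1))
        except asyncio.TimeoutError:
            if attempt == max_attempts:
                raise
            # Timeouts loop straight back into the next attempt, no sleep.
```

Passing `sleep` as a parameter is purely a convenience for testing the backoff schedule without real delays.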

Lifecycle of a crawl

crawler / worker loop (crawler.py)

Setup (run once)

  1. Scheduler.add(start_url, depth=0): seed the queue with the start URL.
  2. Spawn N workers: max_concurrency coroutines, each running the inner loop.

Inner loop (per URL, per worker)

  1. RateLimiter.acquire(domain): wait for a token in the per-domain bucket.
  2. RobotsChecker.is_allowed(url): fetch & cache robots.txt, evaluate rules.
  3. Fetcher.fetch(url) → html, status: aiohttp or Playwright, depending on render_js.
  4. Parser.parse(html) → title, links, metadata: BeautifulSoup + lxml.
  5. Extractor.extract(html) → text: trafilatura, optional.
  6. CheckpointManager.write_page(page): JSONL append to disk or S3.
  7. Scheduler.add(link, depth+1) for link in links: enqueue children, dedup against visited.

Repeat until the queue is empty. All async workers run the same inner loop and exit when the queue is empty and no work is in flight.
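
The loop above can be approximated with a plain `asyncio.Queue`, whose `join()` returns only once the queue is empty and every dequeued item has been marked done, which matches the "queue empty AND no in-flight work" exit condition. The following is a minimal sketch, not the contents of crawler.py: the `SITE` dictionary is a made-up link graph standing in for fetching and parsing, `max_depth` is an assumed parameter, and the rate limiter and robots check are reduced to a single `sleep(0)`.

```python
import asyncio

# Hypothetical in-memory "site": URL -> outgoing links (stands in for fetch + parse).
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}


async def crawl(start_url, max_concurrency=3, max_depth=2):
    queue = asyncio.Queue()
    seen = {start_url}
    pages = []  # stand-in for CheckpointManager.write_page

    queue.put_nowait((start_url, 0))  # setup step 1: seed the queue

    async def worker():
        while True:
            url, depth = await queue.get()
            try:
                # Stand-in for the rate limit wait, robots check, and fetch.
                await asyncio.sleep(0)
                links = SITE.get(url, [])
                pages.append({"url": url, "depth": depth})
                if depth < max_depth:
                    for link in links:
                        if link not in seen:  # dedup before enqueueing
                            seen.add(link)
                            queue.put_nowait((link, depth + 1))
            finally:
                queue.task_done()

    # Setup step 2: spawn the workers.
    workers = [asyncio.create_task(worker()) for _ in range(max_concurrency)]
    await queue.join()  # empty queue AND no item still being processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return pages
```

Running `asyncio.run(crawl("https://example.com/"))` visits each of the four URLs exactly once, even though the graph contains a cycle back to the start page.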

Why this shape?

Async workers, not threads. Crawling is I/O bound. asyncio lets one process handle hundreds of concurrent requests without the overhead of OS threads.

A real queue, not recursion. Depth-limited BFS gives predictable memory usage and clean depth metadata. The scheduler also owns deduplication, so workers can't accidentally re-fetch the same URL.
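
To make the queue-plus-dedup idea concrete, here is a minimal sketch of a scheduler in that style. Only the `add(url, depth)` call shape comes from the lifecycle above; `max_depth`, `next()`, and the internal attributes are assumptions for illustration, and the real Scheduler also integrates filters, which this version omits.

```python
from collections import deque


class Scheduler:
    """FIFO URL queue with depth tracking and deduplication (illustrative only)."""

    def __init__(self, max_depth):
        self.max_depth = max_depth  # assumed configuration knob
        self._queue = deque()
        self._seen = set()

    def add(self, url, depth=0):
        """Enqueue url unless it is too deep or already seen. Returns True if queued."""
        if depth > self.max_depth or url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append((url, depth))
        return True

    def next(self):
        """Pop the oldest (url, depth) entry, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)
```

Because entries leave in the order they arrived, every depth-1 URL is handed out before any depth-2 URL, which is what makes the traversal breadth-first and keeps the depth metadata consistent.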

Rate limiting at the gate. Each domain has its own token bucket and workers compete for its tokens, so even with 50 concurrent requests in flight, no single domain sees more than requests_per_second requests per second.
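
A per-domain token bucket can be sketched as follows. This is a simplified illustration, not yoink's RateLimiter: the `TokenBucket` class, the `capacity` of one token, and the injectable `clock` and `sleep` parameters are assumptions, and Crawl-delay handling is left out entirely.

```python
import asyncio
import time


class TokenBucket:
    def __init__(self, rate, capacity=1.0, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size (assumed to be 1)
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def try_take(self):
        """Take a token if one is available; otherwise return seconds to wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        return (1.0 - self.tokens) / self.rate


class RateLimiter:
    def __init__(self, requests_per_second, clock=time.monotonic, sleep=asyncio.sleep):
        self.rate = requests_per_second
        self.clock = clock
        self.sleep = sleep
        self.buckets = {}  # domain -> TokenBucket

    async def acquire(self, domain):
        if domain not in self.buckets:
            self.buckets[domain] = TokenBucket(self.rate, clock=self.clock)
        bucket = self.buckets[domain]
        while True:
            wait = bucket.try_take()
            if wait == 0.0:
                return
            # Another worker may take the token first, so re-check after sleeping.
            await self.sleep(wait)
```

Since each domain gets its own bucket, a slow or heavily linked domain delays only the workers currently targeting it; requests to other domains proceed without waiting.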

Checkpoints as an append log. Pages stream to the checkpoint as soon as they're crawled, so a crash never costs you more than the in-flight batch. Crawl state (visited set, queue, filters) is written on every flush interval and once more at the end of the crawl.
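
The append-log idea can be illustrated with a few lines of JSONL handling. This is a sketch under assumptions rather than CheckpointManager itself: the function names, the `url` field, and the recovery strategy are hypothetical, and only the JSONL-append format is taken from the lifecycle above.

```python
import json
import os


def append_page(path, page):
    """Append one page record as a JSON line and push it to disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(page) + "\n")
        f.flush()
        os.fsync(f.fileno())


def recover_visited(path):
    """Rebuild the set of crawled URLs from whatever reached the log."""
    visited = set()
    if not os.path.exists(path):
        return visited
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                visited.add(json.loads(line)["url"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A crash mid-write can leave a truncated line; skip it.
                continue
    return visited
```

Each record is self-contained, so a partially written final line costs at most that one page, and everything before it can be replayed on restart.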

Where to extend

  • Custom storage backend — implement CheckpointStorage (write, read, exists, flush, close) for Redis, GCS, Azure Blob, etc.
  • Custom filter — implement should_crawl(url) -> bool and pass it via the url_filter argument to Crawler.
  • Custom extractor — replace the default trafilatura-based Extractor for domain-specific extraction (PDFs, schema.org parsing, etc.).
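
As an example of the custom filter extension point, here is a small hypothetical filter. Only the `should_crawl(url) -> bool` method and the `url_filter` argument name come from the list above; the `DocsOnlyFilter` class, its constructor parameters, and its matching rules are invented for illustration.

```python
from urllib.parse import urlsplit


class DocsOnlyFilter:
    """Hypothetical filter: only crawl HTTPS pages on one host under a path prefix."""

    def __init__(self, host, prefix="/docs/"):
        self.host = host
        self.prefix = prefix

    def should_crawl(self, url: str) -> bool:
        parts = urlsplit(url)
        return (
            parts.scheme == "https"
            and parts.hostname == self.host
            and parts.path.startswith(self.prefix)
        )


# An instance would be handed to Crawler through its url_filter argument.
# The remaining Crawler arguments are omitted because this page does not list them.
docs_filter = DocsOnlyFilter("example.com")
```

Keeping the filter a plain object with a single method means it can be unit-tested on its own, without starting a crawl.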

Code locations

The full source lives at github.com/ErikkJs/yoink/tree/master/src/yoink. It is roughly 3,200 lines of Python across 18 files, with each module kept narrowly focused, and 134 passing tests under tests/.