Checkpointing

Resumable crawls with append-only checkpoints. Survive Lambda timeouts, OOM kills, and Ctrl-C.

Long crawls die. Lambda timeouts hit. SSH connections drop. Servers OOM-kill your process. yoink's checkpointing system makes any crawl resumable with two lines of code.

What gets checkpointed

A checkpoint file is an append-only log of three kinds of records:

  1. Metadata — start URL, config snapshot, timestamp. Written once at the start.
  2. Pages — one record per crawled page. Streamed as they finish.
  3. State — the visited set, the queue, the filtered set. Written periodically and on shutdown.

The format is JSONL with a type discriminator on each line:

Example checkpoint file (crawl.jsonl):

{"type":"metadata","start_url":"https://example.com","config":{...},"started_at":"2026-05-03T12:00:00"}
{"type":"page","url":"https://example.com","title":"Example","text":"…","depth":0}
{"type":"page","url":"https://example.com/about","title":"About","text":"…","depth":1}
{"type":"state","visited":[…87 urls…],"queue":[…12 urls…],"filtered":[…5 urls…]}

The metadata record is written once at the start. Page records are streamed as pages are crawled. A state record is written every flush_interval pages and on shutdown. The file grows append-only as the crawl progresses.

All three record types share one file. Resume reads the whole file once and reconstitutes state.
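Because every line carries a type field, a checkpoint can be inspected with nothing but the standard library. The sketch below tallies records by type. `summarize_checkpoint` is a hypothetical helper written for illustration, not part of yoink, and it assumes every non-empty line is a complete JSON object.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_checkpoint(lines: Iterable[str]) -> Counter:
    """Count checkpoint records by their "type" discriminator (hypothetical helper)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        record = json.loads(line)
        counts[record["type"]] += 1
    return counts


# Hand-written sample records in the same shape as the example file above.
sample = [
    '{"type": "metadata", "start_url": "https://example.com"}',
    '{"type": "page", "url": "https://example.com", "depth": 0}',
    '{"type": "page", "url": "https://example.com/about", "depth": 1}',
    '{"type": "state", "visited": [], "queue": [], "filtered": []}',
]
summary = summarize_checkpoint(sample)
print(dict(summary))
```

Any iterable of lines works, so an open file handle for a local checkpoint can be passed in place of `sample`.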

CLI usage

# Run a crawl with checkpointing
yoink crawl https://example.com --checkpoint ./crawl.jsonl
 
# It crashed / you Ctrl-C'd. Resume:
yoink crawl https://example.com --checkpoint ./crawl.jsonl --resume

The same flags work with S3 URIs:

yoink crawl https://example.com --checkpoint s3://my-bucket/crawl.jsonl --resume

Python usage

import asyncio

from yoink import Crawler, CrawlConfig, CheckpointManager

async def main():
    config = CrawlConfig(max_pages=10_000)

    # Local file
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")

    # ...or S3
    # checkpoint = CheckpointManager.from_uri("s3://my-bucket/crawl.jsonl")

    crawler = Crawler(config=config, checkpoint_manager=checkpoint)

    # If the file exists, pick up where we left off
    pages = await crawler.crawl("https://example.com", resume=True)
    return pages

if __name__ == "__main__":
    # Run the coroutine to completion from a regular script
    asyncio.run(main())

Flush interval

Pages are written immediately. State is flushed every N pages (default 10) and on shutdown:

checkpoint = CheckpointManager.from_uri(
    "./crawl.jsonl",
    flush_interval=50,  # write state every 50 pages
)

The CLI equivalent is --checkpoint-interval:

yoink crawl https://example.com --checkpoint ./crawl.jsonl --checkpoint-interval 50

Lower values give finer-grained resume but cost more I/O. For S3, every flush is an API call, so you generally want a higher interval (50–100).
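The write cadence described in this section can be sketched with an in-memory log. This illustrates only when records are appended, not how yoink implements it: `SketchCheckpointWriter` and its methods are hypothetical names, and a Python list stands in for the append-only file.

```python
import json
from typing import Any


class SketchCheckpointWriter:
    """Hypothetical writer: pages append immediately, state every N pages."""

    def __init__(self, flush_interval: int = 10) -> None:
        self.flush_interval = flush_interval
        self.lines: list[str] = []  # stands in for the append-only file
        self._pages_since_flush = 0

    def _append(self, record: dict[str, Any]) -> None:
        self.lines.append(json.dumps(record))

    def write_page(self, page: dict[str, Any], state: dict[str, Any]) -> None:
        self._append({"type": "page", **page})  # page records are never delayed
        self._pages_since_flush += 1
        if self._pages_since_flush >= self.flush_interval:
            self.write_state(state)  # periodic state flush

    def write_state(self, state: dict[str, Any]) -> None:
        self._append({"type": "state", **state})
        self._pages_since_flush = 0

    def close(self, state: dict[str, Any]) -> None:
        self.write_state(state)  # final flush on shutdown


writer = SketchCheckpointWriter(flush_interval=50)
state = {"visited": [], "queue": [], "filtered": []}
for i in range(120):
    writer.write_page({"url": f"https://example.com/{i}"}, state)
writer.close(state)

kinds = [json.loads(line)["type"] for line in writer.lines]
print(kinds.count("page"), kinds.count("state"))  # 120 3
```

In this sketch, 120 pages at an interval of 50 produce three state records: two periodic flushes plus the one on shutdown. Dropping the interval to 10 would produce 13, which is the extra I/O the trade-off above refers to.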

Storage backends

CheckpointManager.from_uri(...) picks a backend based on the URI scheme:

| URI | Backend | Implementation |
| --- | --- | --- |
| ./relative/path.jsonl | LocalFileStorage | Async aiofiles append |
| /absolute/path.jsonl | LocalFileStorage | Async aiofiles append |
| s3://bucket/key.jsonl | S3Storage | Buffered → put_object |

Want a custom backend (Redis, GCS, Azure)? Implement CheckpointStorage — five async methods.

How resume works

When you call crawler.crawl(url, resume=True):

  1. The checkpoint file is read line by line.
  2. Pages are restored into crawler.pages.
  3. State restores scheduler.visited, scheduler.queue, scheduler.filtered.
  4. If the start URL doesn't match the checkpoint metadata, you get a warning.
  5. The crawl continues from the queue.
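The steps above can be sketched with the standard library. This is a minimal illustration, not yoink's actual code: `restore_from_lines` and `ResumedState` are hypothetical names, and the sketch assumes that when a file holds several state records, the most recent one wins.

```python
import json
import warnings
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class ResumedState:
    """Hypothetical container for what a resume needs to rebuild."""
    pages: list = field(default_factory=list)
    visited: set = field(default_factory=set)
    queue: list = field(default_factory=list)
    filtered: set = field(default_factory=set)


def restore_from_lines(lines: Iterable[str], start_url: str) -> ResumedState:
    restored = ResumedState()
    for line in lines:  # step 1: read line by line
        if not line.strip():
            continue
        record = json.loads(line)
        kind = record["type"]
        if kind == "page":  # step 2: restore pages
            restored.pages.append(record)
        elif kind == "state":  # step 3: later state records replace earlier ones (assumption)
            restored.visited = set(record["visited"])
            restored.queue = list(record["queue"])
            restored.filtered = set(record["filtered"])
        elif kind == "metadata" and record["start_url"] != start_url:
            # step 4: warn on a start URL mismatch instead of failing
            warnings.warn(
                f"checkpoint was started from {record['start_url']}, not {start_url}"
            )
    return restored  # step 5: the caller continues from restored.queue


log = [
    '{"type": "metadata", "start_url": "https://example.com"}',
    '{"type": "page", "url": "https://example.com", "depth": 0}',
    '{"type": "state", "visited": ["https://example.com"],'
    ' "queue": ["https://example.com/about"], "filtered": []}',
]
resumed = restore_from_lines(log, "https://example.com")
print(len(resumed.pages), resumed.queue)
```

Keeping the mismatch as a warning rather than an error mirrors step 4: the crawl still proceeds, but the caller is told the checkpoint came from a different start URL.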

When to use checkpoints

Use them

  • Crawls expected to take more than 10 minutes.
  • Lambda jobs (any execution > 30s).
  • Containers that may be killed (autoscaling, spot instances).
  • Anywhere a crawl of the same start URL might be re-invoked.

Skip them

  • Throwaway crawls (one-shot data pulls in dev).
  • Tiny crawls where re-running is cheaper than checkpoint I/O.

See also