Checkpointing
Resumable crawls with append-only checkpoints. Survive Lambda timeouts, OOM kills, and Ctrl-C.
Long crawls die. Lambda timeouts hit. SSH connections drop. Servers OOM-kill your process. yoink's checkpointing system makes any crawl resumable with two lines of code.
What gets checkpointed
A checkpoint file is an append-only log of three kinds of records:
- Metadata — start URL, config snapshot, timestamp. Written once at the start.
- Pages — one record per crawled page. Streamed as they finish.
- State — the visited set, the queue, the filtered set. Written periodically and on shutdown.
The format is JSONL with a type discriminator on each line:
{"type":"metadata","start_url":"https://example.com","config":{...},"started_at":"2026-05-03T12:00:00"}
{"type":"page","url":"https://example.com","title":"Example","text":"…","depth":0}
{"type":"page","url":"https://example.com/about","title":"About","text":"…","depth":1}
{"type":"state","visited":[…87 urls…],"queue":[…12 urls…],"filtered":[…5 urls…]}
The metadata record is written once at start, page records are streamed as pages are crawled, and a state record is written every flush_interval pages and on shutdown.
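To make the append-only idea concrete, here is a minimal sketch of writing and reading a log in this shape with only the standard library. It is not yoink's writer: `append_record`, `read_records`, and `demo` are hypothetical helpers, and the records carry only a subset of the fields shown above.

```python
import json
import os
import tempfile


def append_record(path: str, record: dict) -> None:
    """Append one record as a single JSON line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path: str) -> list[dict]:
    """Read the log back, one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def demo() -> list[str]:
    """Write one record of each kind to a temp file and return the types in order."""
    path = os.path.join(tempfile.mkdtemp(), "crawl.jsonl")
    append_record(path, {"type": "metadata", "start_url": "https://example.com"})
    append_record(path, {"type": "page", "url": "https://example.com", "depth": 0})
    append_record(
        path,
        {"type": "state", "visited": ["https://example.com"], "queue": [], "filtered": []},
    )
    return [record["type"] for record in read_records(path)]
```

Because each record is a self-contained line, a reader can dispatch on the `type` field without knowing how many records of each kind to expect.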
CLI usage
# Run a crawl with checkpointing
yoink crawl https://example.com --checkpoint ./crawl.jsonl
# It crashed / you Ctrl-C'd. Resume:
yoink crawl https://example.com --checkpoint ./crawl.jsonl --resume
The same flags work with S3 URIs:
yoink crawl https://example.com --checkpoint s3://my-bucket/crawl.jsonl --resume
Python usage
import asyncio

from yoink import Crawler, CrawlConfig, CheckpointManager

async def main():
    config = CrawlConfig(max_pages=10_000)

    # Local file
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
    # ...or S3
    # checkpoint = CheckpointManager.from_uri("s3://my-bucket/crawl.jsonl")

    crawler = Crawler(config=config, checkpoint_manager=checkpoint)

    # If the file exists, pick up where we left off
    pages = await crawler.crawl("https://example.com", resume=True)
    return pages

asyncio.run(main())

Flush interval
Pages are written immediately. State is flushed every N pages (default 10) and on shutdown:
checkpoint = CheckpointManager.from_uri(
    "./crawl.jsonl",
    flush_interval=50,  # write state every 50 pages
)
From the CLI:
yoink crawl https://example.com --checkpoint ./crawl.jsonl --checkpoint-interval 50
Lower values give finer-grained resume but cost more I/O. For S3, every flush is an API call, so you generally want a higher interval (50–100).
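The trade-off can be put in rough numbers. The sketch below is a back-of-the-envelope model, not yoink code: `state_writes` and `max_pages_since_flush` are hypothetical helpers, and the model assumes exactly one state write per interval plus one on a clean shutdown.

```python
def state_writes(pages_crawled: int, flush_interval: int) -> int:
    """Periodic state writes plus one final write on shutdown (assumed model)."""
    if flush_interval < 1:
        raise ValueError("flush_interval must be >= 1")
    return pages_crawled // flush_interval + 1


def max_pages_since_flush(flush_interval: int) -> int:
    """Worst-case number of pages crawled since the last state write after a hard kill."""
    if flush_interval < 1:
        raise ValueError("flush_interval must be >= 1")
    return flush_interval - 1


if __name__ == "__main__":
    for interval in (10, 50, 100):
        print(interval, state_writes(10_000, interval), max_pages_since_flush(interval))
```

Under this model a 10,000-page crawl writes state 1,001 times at the default interval of 10, but only 201 times at 50 and 101 times at 100. In exchange, the last state record can lag the crawl by up to 49 or 99 pages.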
Storage backends
CheckpointManager.from_uri(...) picks a backend based on the URI scheme:
| URI | Backend | Implementation |
|---|---|---|
| ./relative/path.jsonl | LocalFileStorage | Async aiofiles append |
| /absolute/path.jsonl | LocalFileStorage | Async aiofiles append |
| s3://bucket/key.jsonl | S3Storage | Buffered → put_object |
Want a custom backend (Redis, GCS, Azure)? Implement CheckpointStorage — five async methods.
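As a rough illustration of what such a backend could look like, here is an in-memory sketch. The five method names (`exists`, `append`, `read_lines`, `flush`, `close`) are guesses made for this example and are not taken from yoink; check the real CheckpointStorage interface before implementing one. The block does not import yoink and defines its own stand-in protocol, `CheckpointStorageLike`.

```python
import asyncio
from typing import AsyncIterator, Protocol


class CheckpointStorageLike(Protocol):
    """Hypothetical stand-in interface; yoink's real CheckpointStorage may differ."""

    async def exists(self) -> bool: ...
    async def append(self, line: str) -> None: ...
    def read_lines(self) -> AsyncIterator[str]: ...
    async def flush(self) -> None: ...
    async def close(self) -> None: ...


class InMemoryStorage:
    """Keeps checkpoint lines in a list; useful only for tests and illustration."""

    def __init__(self) -> None:
        self._lines: list[str] = []
        self._closed = False

    async def exists(self) -> bool:
        return bool(self._lines)

    async def append(self, line: str) -> None:
        if self._closed:
            raise RuntimeError("storage is closed")
        self._lines.append(line.rstrip("\n"))

    async def read_lines(self) -> AsyncIterator[str]:
        # Iterate over a copy so appends during iteration do not affect the reader.
        for line in list(self._lines):
            yield line

    async def flush(self) -> None:
        # Nothing is buffered in memory, so there is nothing to flush.
        return None

    async def close(self) -> None:
        self._closed = True


async def demo() -> list[str]:
    storage: CheckpointStorageLike = InMemoryStorage()
    await storage.append('{"type":"metadata"}\n')
    await storage.append('{"type":"page"}\n')
    lines = [line async for line in storage.read_lines()]
    await storage.close()
    return lines
```

A Redis or GCS backend would follow the same outline, replacing the list with the remote store's own append and read calls.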
How resume works
When you call crawler.crawl(url, resume=True):
- The checkpoint file is read line by line.
- Pages are restored into crawler.pages.
- State restores scheduler.visited, scheduler.queue, scheduler.filtered.
- If the start URL doesn't match the checkpoint metadata, you get a warning.
- The crawl continues from the queue.
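The steps above can be sketched as a replay over the log. This is an illustrative sketch, not yoink's implementation: `Restored`, `replay`, and `EXAMPLE_LOG` are hypothetical names. Two behaviors are assumptions: the last state record wins, and pages logged after it are treated as visited.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Restored:
    pages: list[dict] = field(default_factory=list)
    visited: set[str] = field(default_factory=set)
    queue: list[str] = field(default_factory=list)
    filtered: set[str] = field(default_factory=set)
    warnings: list[str] = field(default_factory=list)


def replay(lines: list[str], start_url: str) -> Restored:
    """Rebuild crawl state from checkpoint lines, dispatching on the type field."""
    out = Restored()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        kind = record.get("type")
        if kind == "metadata":
            if record.get("start_url") != start_url:
                out.warnings.append(
                    f"start URL {start_url!r} differs from checkpoint "
                    f"{record.get('start_url')!r}"
                )
        elif kind == "page":
            out.pages.append(record)
        elif kind == "state":
            # Assumption: a later state record supersedes any earlier one.
            out.visited = set(record["visited"])
            out.queue = list(record["queue"])
            out.filtered = set(record["filtered"])
    # Assumption: pages logged after the last state record count as visited,
    # so they are dropped from the queue rather than crawled twice.
    crawled = {page["url"] for page in out.pages}
    out.visited |= crawled
    out.queue = [url for url in out.queue if url not in crawled]
    return out


EXAMPLE_LOG = [
    '{"type":"metadata","start_url":"https://example.com"}',
    '{"type":"page","url":"https://example.com","depth":0}',
    '{"type":"state","visited":["https://example.com"],'
    '"queue":["https://example.com/about","https://example.com/blog"],"filtered":[]}',
    '{"type":"page","url":"https://example.com/about","depth":1}',
]
```

Replaying `EXAMPLE_LOG` restores two pages and leaves only `https://example.com/blog` in the queue. The `/about` page was written after the last state record, which is the situation a hard kill between flushes produces.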
When to use checkpoints
✅ Use them
- Crawls expected to take more than 10 minutes.
- Lambda jobs (any execution > 30s).
- Containers that may be killed (autoscaling, spot instances).
- Any job that might be re-invoked with the same start URL.
❌ Skip them
- Throwaway crawls (one-shot data pulls in dev).
- Tiny crawls where re-running is cheaper than checkpoint I/O.
See also
- Lambda + S3 checkpoints example — a complete resumable Lambda handler.
- CheckpointManager API.
- Storage backends.