Checkpoint & resume
Three patterns for resumable crawls — local file, S3 across processes, and a Lambda handler that survives 15-minute timeouts.
The repo's examples/checkpoint_resume.py bundles three runnable scenarios that map cleanly onto real production patterns. This page walks through each.
1. Local file checkpoint
The simplest case: long crawl on your laptop, want to be able to Ctrl-C and pick up where you left off.
import asyncio
from yoink import Crawler, CrawlConfig, CheckpointManager
async def main():
config = CrawlConfig(
max_depth=2,
max_pages=100,
max_concurrency=10,
)
checkpoint = CheckpointManager.from_uri(
"./crawl.jsonl",
flush_interval=5, # state snapshot every 5 pages
)
crawler = Crawler(config=config, checkpoint_manager=checkpoint)
try:
# First run, OR resume — same call either way
pages = await crawler.crawl("https://example.com", resume=True)
print(f"Crawled {len(pages)} pages")
except KeyboardInterrupt:
print("Interrupted! Checkpoint saved. Re-run to resume.")
asyncio.run(main())What resume=True does on a fresh run with no checkpoint: starts from scratch. With an existing checkpoint: restores visited, queue, and previously-yoinked pages, then continues from the queue.
2. S3 checkpoint (cross-process)
Same code, different URI. Useful when the crawl might run on different hosts (e.g., a fresh container picks up where a killed one left off):
config = CrawlConfig(max_depth=2, max_pages=1000, max_concurrency=20)
checkpoint = CheckpointManager.from_uri(
"s3://my-crawl-bucket/checkpoints/example.jsonl",
flush_interval=10, # higher → fewer S3 API calls
)
crawler = Crawler(config=config, checkpoint_manager=checkpoint)
pages = await crawler.crawl("https://example.com", resume=True)Requires the s3 extra (pip install -e ".[s3]") and AWS credentials in the environment / IAM role / ~/.aws/credentials.
3. Lambda handler
This is the pattern that makes long crawls survive Lambda's hard 15-minute timeout. Each invocation crawls for ~14 minutes, checkpoints to S3, and exits. EventBridge re-invokes; the next run resumes from the same checkpoint.
async def lambda_handler():
event = {"url": "https://example.com"}
checkpoint = CheckpointManager.from_uri(
"s3://my-crawl-bucket/lambda-checkpoints/crawl.jsonl",
flush_interval=10,
)
config = CrawlConfig(max_pages=5000, max_concurrency=30, max_depth=3)
crawler = Crawler(config=config, checkpoint_manager=checkpoint)
pages = await crawler.crawl(event["url"], resume=True)
return {
"statusCode": 200,
"body": {
"pages_crawled": len(pages),
"message": "Crawl completed or checkpoint saved for next invocation",
},
}For the full deployment recipe (IAM role, EventBridge schedule, layer build), see Lambda + S3 checkpoints.
Running the bundled file
poetry run python examples/checkpoint_resume.pyBy default this runs the local-file scenario. The S3 and Lambda scenarios are commented out at the bottom of the file — uncomment after you've configured AWS credentials.
See also
- Checkpointing concepts — file format, resume semantics, when to use.
CheckpointManagerAPI.- Storage backends.