CheckpointManager
Persist crawl progress to disk or S3 — automatic resume, configurable flush interval.
CheckpointManager writes pages and crawl state to a CheckpointStorage backend. Pass one to Crawler to make any crawl resumable.
Import
from yoink import CheckpointManagerConstructing
The recommended path is from_uri(), which picks the right storage backend:
# Local file
checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
# S3 (requires [s3] extra)
checkpoint = CheckpointManager.from_uri("s3://my-bucket/crawl.jsonl")
# With custom flush cadence
checkpoint = CheckpointManager.from_uri(
"./crawl.jsonl",
flush_interval=50,
)For full control, build with an explicit storage backend:
from yoink import CheckpointManager
from yoink.storage import LocalFileStorage, S3Storage
storage = S3Storage("s3://my-bucket/crawl.jsonl")
checkpoint = CheckpointManager(storage=storage, flush_interval=50)API surface
| Name | Type | Default | Description |
|---|---|---|---|
| from_uri(uri, flush_interval=10) | classmethod | — | Build a CheckpointManager from a path or s3:// URI. |
| write_metadata(start_url, config) | async | — | Write the run header. Called once by Crawler. |
| write_page(page) | async | — | Append one page record. |
| write_state(visited, queue, filtered) | async | — | Snapshot the scheduler state for resume. |
| load() | async | — | Read the checkpoint and return { metadata, pages, state }. |
| close() | async | — | Flush the storage backend and release resources. Called by Crawler at end of run. |
Examples
Resumable local crawl
import asyncio
from yoink import Crawler, CrawlConfig, CheckpointManager
async def main():
config = CrawlConfig(max_depth=3, max_pages=10_000)
checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
crawler = Crawler(config=config, checkpoint_manager=checkpoint)
# Pick up where we left off if the file exists
pages = await crawler.crawl("https://example.com", resume=True)
print(f"Total pages: {len(pages)}")
asyncio.run(main())Inspecting a checkpoint
import asyncio
from yoink import CheckpointManager
async def main():
checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
data = await checkpoint.load()
print(f"Started: {data['metadata']['started_at']}")
print(f"Start URL: {data['metadata']['start_url']}")
print(f"Pages saved: {len(data['pages'])}")
if state := data.get("state"):
print(f"Visited: {len(state['visited'])}")
print(f"Queue: {len(state['queue'])}")
print(f"Filtered:{len(state['filtered'])}")
asyncio.run(main())Choosing a flush interval
# Aggressive flushing — every page gets persisted state too
checkpoint = CheckpointManager.from_uri("./crawl.jsonl", flush_interval=1)
# Moderate (default) — state every 10 pages
checkpoint = CheckpointManager.from_uri("./crawl.jsonl", flush_interval=10)
# S3 — minimize API calls for cost
checkpoint = CheckpointManager.from_uri("s3://bucket/crawl.jsonl", flush_interval=100)For S3 the trade-off is real: every flush is a put_object call. Don't go below 50 unless you have specific reasons.
File format
A checkpoint is a JSONL file with type discriminators:
{"type": "metadata", "start_url": "...", "config": {...}, "started_at": "..."}
{"type": "page", "url": "...", "title": "...", ...}
{"type": "page", "url": "...", "title": "...", ...}
{"type": "state", "visited": [...], "queue": [...], "filtered": [...]}This is intentionally readable and grep-friendly. You can hand-edit a checkpoint to remove a problematic page or trim the queue.
See also
- Checkpointing concepts.
CheckpointStorage— the storage backend interface.- Lambda + S3 example.