Quickstart
Yoink your first website in under a minute — CLI and Python.
This page gets you from zero to a finished crawl twice: once on the CLI, once in Python.
Your first CLI crawl
yoink crawl https://example.com
That's it. yoink will:
- Fetch the start URL, parse it, extract text, and follow links.
- Default to depth 1 and 100 pages; adjust with --depth and --max-pages.
- Rate-limit to 2 requests per second per domain and respect robots.txt.
- Write results to crawl_output.jsonl in the current directory.
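To make the politeness defaults concrete, here is a minimal standard-library sketch of a per-domain rate limit and a robots.txt check. It is not yoink's internal code: the names DomainRateLimiter, delay_for, and allowed_by_robots are hypothetical and exist only for this illustration.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class DomainRateLimiter:
    """Allow at most `rate` requests per second for each domain (illustrative only)."""

    def __init__(self, rate: float = 2.0, clock=time.monotonic):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.next_allowed = {}  # domain -> earliest time the next request may start

    def delay_for(self, url: str) -> float:
        """Reserve the next slot for this URL's domain and return the wait in seconds."""
        domain = urlparse(url).netloc
        now = self.clock()
        start = max(now, self.next_allowed.get(domain, now))
        self.next_allowed[domain] = start + self.min_interval
        return start - now


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

At 2 requests per second, back-to-back requests to one domain are spaced 0.5 seconds apart, while requests to other domains are not delayed.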
Open the output file and pretty-print its first record:
head -1 crawl_output.jsonl | python -m json.tool
A more useful crawl
yoink crawl https://docs.python.org \
--depth 2 \
--max-pages 50 \
--include "*/tutorial/*" \
--skip-extensions pdf,zip \
--format jsonl \
-o python_tutorial.jsonl
What's happening:
- --depth 2 follows two link hops from the start URL.
- --include "*/tutorial/*" only crawls URLs matching that glob.
- --skip-extensions pdf,zip ignores binary file links.
- --format jsonl -o python_tutorial.jsonl streams one JSON object per page to disk.
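The include glob and the extension skip list can be pictured as a small URL filter. The sketch below is one way such a filter might behave, not yoink's actual matching logic; the function name should_crawl and its parameters are hypothetical, and whether yoink matches the glob against the full URL or only the path is an assumption here.

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence
from urllib.parse import urlparse


def should_crawl(
    url: str,
    include: Optional[str] = None,
    skip_extensions: Sequence[str] = (),
) -> bool:
    """Return True if the URL passes the extension and glob filters."""
    path = urlparse(url).path.lower()
    # Drop links whose path ends in a skipped extension, e.g. ".pdf".
    if any(path.endswith("." + ext.lower()) for ext in skip_extensions):
        return False
    # Assumption: the glob is matched against the full URL, case-sensitively.
    if include is not None and not fnmatchcase(url, include):
        return False
    return True
```

With include="*/tutorial/*" and skip_extensions=("pdf", "zip"), a tutorial HTML page passes, a library reference page is rejected by the glob, and a PDF under /tutorial/ is rejected by its extension.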
Then inspect what you got:
yoink stats python_tutorial.jsonl
You'll see total pages, link counts, depth distribution, top domains, and content quality metrics.
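Because the output is plain JSONL, a few of these numbers are easy to recompute yourself. The following is a rough sketch that assumes each record has a "url" field and a "depth" field; those field names are guesses for illustration, so check them against a real record first. The summarize helper is hypothetical and is not part of yoink.

```python
import json
from collections import Counter
from urllib.parse import urlparse


def summarize(lines):
    """Summarize an iterable of JSONL lines, one crawled page per line."""
    total = 0
    domains = Counter()
    depths = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        # Assumed field names: "url" and "depth".
        domains[urlparse(record["url"]).netloc] += 1
        depths[record.get("depth")] += 1
    return {
        "total_pages": total,
        "top_domains": domains.most_common(3),
        "depth_distribution": dict(depths),
    }
```

To try it on a crawl, open python_tutorial.jsonl and pass the file object to summarize; file objects iterate line by line, so nothing else is needed.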
Your first Python crawl
import asyncio

from yoink import Crawler, CrawlConfig

async def main():
    # Same limits as the CLI flags: depth, page cap, and request rate
    config = CrawlConfig(
        max_depth=2,
        max_pages=100,
        max_concurrency=10,
        requests_per_second=2.0,
    )
    crawler = Crawler(config=config)
    pages = await crawler.crawl("https://example.com")
    # Print a short summary of every crawled page
    for page in pages:
        print(f"{page.status_code} {page.url}")
        print(f" title: {page.title}")
        print(f" text: {len(page.text or '')} chars")

asyncio.run(main())
Resumable crawls
Long crawls die. Plan for it from day one with checkpointing:
import asyncio

from yoink import Crawler, CrawlConfig, CheckpointManager

async def main():
    config = CrawlConfig(max_depth=3, max_pages=10_000)
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
    crawler = Crawler(config=config, checkpoint_manager=checkpoint)
    # Pick up where we left off if the file already exists
    pages = await crawler.crawl("https://docs.example.com", resume=True)
    return pages

asyncio.run(main())
Same on the CLI:
# First run — interrupted with Ctrl-C
yoink crawl https://docs.example.com --checkpoint ./crawl.jsonl
# Resume
yoink crawl https://docs.example.com --checkpoint ./crawl.jsonl --resume
What to read next
- Concepts: architecture, rate limiting, JS rendering.
- CLI reference: every yoink crawl flag, explained.
- Python API: Crawler, CrawlConfig, CheckpointManager.