Basic crawl
A minimal end-to-end example — crawl, save, inspect.
The fewest lines of code that do something useful.
Script
Save this as my_crawl.py:
import asyncio
from pathlib import Path

from yoink import Crawler, CrawlConfig
from yoink.writers import Writer
from yoink.stats import CrawlStats


async def main():
    config = CrawlConfig(
        max_depth=2,
        max_pages=50,
        requests_per_second=2.0,
    )
    crawler = Crawler(config=config)
    pages = await crawler.crawl("https://example.com")

    # Save to JSONL
    output = Path("example.jsonl")
    Writer.write_jsonl(pages, output)
    print(f"Saved {len(pages)} pages to {output}")

    # Print summary
    stats = CrawlStats(pages)
    print(stats.format_summary())


asyncio.run(main())

Run it
python my_crawl.py

What you get
- example.jsonl: one JSON object per page.
- A formatted summary printed to stdout (depth distribution, top domains, content quality).
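To inspect the output file without going through yoink, a minimal sketch using only the standard library follows. It relies on just one fact from above: each line holds one JSON object. `read_jsonl` is a hypothetical helper, not part of yoink, and the record field names are not assumed; the script lists whatever keys the first record has.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list:
    """Parse a JSONL file: one JSON value per non-empty line."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, such as a trailing newline
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    output = Path("example.jsonl")
    if output.exists():
        records = read_jsonl(output)
        print(f"Loaded {len(records)} records")
        if records:
            # Field names are not assumed here; list what the file contains.
            print("Fields:", sorted(records[0]))
    else:
        print(f"{output} not found; run my_crawl.py first")
```

Reading line by line keeps memory use proportional to the parsed records rather than the raw file, and a malformed line raises `json.JSONDecodeError` at the point where it occurs.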
Variations
Save HTML too
config = CrawlConfig(
    max_depth=2,
    max_pages=50,
    save_html=True,  # raw HTML on each Page record
)

Multiple output formats
Writer.write_jsonl(pages, Path("data.jsonl"))
Writer.write_parquet(pages, Path("data.parquet"))
Writer.write_text(pages, Path("data.txt"))

Filter file types
from yoink.filters import CombinedFilter

url_filter = CombinedFilter.from_config(
    skip_extensions=["pdf", "zip", "exe", "jpg", "png"],
)
crawler = Crawler(config=config, url_filter=url_filter)

Same thing on the CLI
yoink crawl https://example.com -d 2 -n 50 -o example.jsonl
yoink stats example.jsonl

See also
- Quickstart — the same idea, even shorter.
- Crawler and CrawlConfig for the full API.