CrawlStats
Compute, format, and export statistics from a crawl: depth distribution, top domains, and content quality.
CrawlStats analyzes a list of Page objects (or loads them from a file) and produces summary metrics. It powers the yoink stats CLI and can also be used programmatically.
Import
from yoink.stats import CrawlStats
Constructing
from pathlib import Path
# From a list of Page objects (e.g., right after a crawl)
stats = CrawlStats(pages)
# From a saved file (.json or .jsonl)
stats = CrawlStats.from_file(Path("crawl_output.jsonl"))
Methods
| Method | Returns | Description |
|---|---|---|
| compute() | dict[str, Any] | Compute and return all metrics as a dict. Cached after first call. |
| format_summary() | str | Return a multi-line human-readable summary. |
| export_csv(output_path) | None | Write summary stats and top-domain breakdown to a CSV file. |
| from_file(path) | CrawlStats | Class method. Load pages from .json or .jsonl and return a CrawlStats. |
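The table notes that compute() is cached after the first call. The sketch below illustrates that pattern with a stand-in class. It is not yoink's implementation: `CachedStats`, `_cache`, `compute_calls`, and the page dicts are hypothetical names used only for illustration.

```python
from typing import Any, Optional


class CachedStats:
    """Stand-in illustrating compute-once caching (not yoink's code)."""

    def __init__(self, pages: list[dict[str, Any]]) -> None:
        self._pages = pages
        self._cache: Optional[dict[str, Any]] = None
        self.compute_calls = 0  # counts real computations, for demonstration

    def compute(self) -> dict[str, Any]:
        # Return the stored result if the metrics were already computed.
        if self._cache is not None:
            return self._cache
        self.compute_calls += 1
        self._cache = {
            "total_pages": len(self._pages),
            "total_links": sum(len(p.get("links", [])) for p in self._pages),
        }
        return self._cache


stats = CachedStats([{"links": ["a", "b"]}, {"links": []}])
first = stats.compute()
second = stats.compute()  # served from the cache, no recomputation
print(first is second, stats.compute_calls)
```

If CrawlStats caches in a similar way, changing the page list after the first compute() call would not change the reported metrics, so building a new CrawlStats is the safer way to refresh them.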
What compute() returns
{
    "total_pages": 87,
    "total_links": 1243,
    "total_text_size": 422291,  # bytes
    "total_html_size": 0,  # 0 if save_html=False
    "avg_links_per_page": 14.29,
    "avg_text_size": 4853.92,
    "avg_html_size": 0,
    "max_depth": 2,
    "pages_by_depth": {0: 1, 1: 24, 2: 62},
    "unique_domains": 1,
    "top_domains": [{"domain": "docs.example.com", "count": 87}],
    "status_codes": {200: 87},
    "pages_with_text": 85,
    "pages_with_title": 87,
    "pages_with_metadata": 73,
    "text_length_min": 142,
    "text_length_median": 3891,
    "text_length_max": 28442,
}
Examples
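Several fields in the compute() result are related to one another, which allows a quick consistency check. The sketch below relies on relationships inferred from the sample values on this page, not on anything yoink guarantees: per-depth and per-status counts sum to total_pages, and averages are totals divided by total_pages, rounded to two decimals. `check_consistency` is a hypothetical helper.

```python
import math
from typing import Any

# The sample result from this page, trimmed to the fields being checked.
sample: dict[str, Any] = {
    "total_pages": 87,
    "total_links": 1243,
    "total_text_size": 422291,
    "avg_links_per_page": 14.29,
    "avg_text_size": 4853.92,
    "max_depth": 2,
    "pages_by_depth": {0: 1, 1: 24, 2: 62},
    "status_codes": {200: 87},
    "pages_with_text": 85,
}


def check_consistency(data: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems: list[str] = []
    total = data["total_pages"]
    by_depth = data["pages_by_depth"]
    if sum(by_depth.values()) != total:
        problems.append("pages_by_depth does not sum to total_pages")
    if sum(data["status_codes"].values()) != total:
        problems.append("status_codes does not sum to total_pages")
    if by_depth and max(by_depth) != data["max_depth"]:
        problems.append("max_depth is not the deepest key in pages_by_depth")
    if data["pages_with_text"] > total:
        problems.append("pages_with_text exceeds total_pages")
    if total:
        # Assumed: averages are totals / total_pages, rounded to two decimals.
        for avg_key, total_key in (
            ("avg_links_per_page", "total_links"),
            ("avg_text_size", "total_text_size"),
        ):
            expected = data[total_key] / total
            if not math.isclose(data[avg_key], expected, abs_tol=0.01):
                problems.append(f"{avg_key} does not match {total_key} / total_pages")
    return problems


problems = check_consistency(sample)
print(problems or "all checks passed")
```

If the dict has been round-tripped through JSON, the integer keys in pages_by_depth and status_codes become strings, so the max_depth comparison would need the keys converted back to int first.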
After a crawl
import asyncio
from yoink import Crawler, CrawlConfig
from yoink.stats import CrawlStats
async def main():
    crawler = Crawler(CrawlConfig())
    pages = await crawler.crawl("https://example.com")
    stats = CrawlStats(pages)
    print(stats.format_summary())
asyncio.run(main())  # run the coroutine from synchronous code
From a saved file
from pathlib import Path
from yoink.stats import CrawlStats
stats = CrawlStats.from_file(Path("crawl_output.jsonl"))
data = stats.compute()
print(f"Got {data['total_pages']} pages across {data['unique_domains']} domains")
print(f"Median page text: {data['text_length_median']} chars")
Filtering by content quality
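One way to package a content-quality check is a small helper that works on the dict returned by compute(). This is a sketch under stated assumptions: `text_ratio`, `is_text_poor`, and the 0.5 default threshold are illustrative choices, and only the `pages_with_text` and `total_pages` keys from the result shown earlier are relied on.

```python
from typing import Any


def text_ratio(data: dict[str, Any]) -> float:
    """Fraction of pages with extractable text; 0.0 for an empty crawl."""
    total = data["total_pages"]
    if total == 0:
        # Avoid ZeroDivisionError when the crawl returned no pages.
        return 0.0
    return data["pages_with_text"] / total


def is_text_poor(data: dict[str, Any], threshold: float = 0.5) -> bool:
    """True when fewer than `threshold` of the pages had extractable text."""
    return text_ratio(data) < threshold


# Values taken from the sample compute() result on this page.
sample = {"total_pages": 87, "pages_with_text": 85}
print(f"{text_ratio(sample):.1%} of pages had text")
print("text-poor:", is_text_poor(sample))
```

Note that this sketch treats an empty crawl as text-poor, because its ratio is defined as 0.0. The inline form of the same check, without the zero-page guard: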
data = stats.compute()
text_share = data["pages_with_text"] / data["total_pages"]
if text_share < 0.5:
    print("⚠ Less than half the pages had extractable text; the site may be JS-heavy")
Export
stats.export_csv(Path("crawl_stats.csv"))
The CSV has two sections:
Metric,Value
Total Pages,87
Total Links,1243
...
Top Domains,Count
docs.example.com,87
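To read the exported file back, the two sections can be separated by watching for their header rows. The sketch below assumes the layout is exactly as shown above (a `Metric,Value` header, then a `Top Domains,Count` header). Whether a blank line separates the sections is not stated, so blank rows are skipped. `read_stats_csv` is a hypothetical helper, not part of yoink.

```python
import csv
import io


def read_stats_csv(text: str) -> tuple[dict[str, str], dict[str, int]]:
    """Split the two-section CSV text into (metrics, top_domains)."""
    metrics: dict[str, str] = {}
    domains: dict[str, int] = {}
    section = None
    for row in csv.reader(io.StringIO(text)):
        if not any(cell.strip() for cell in row):
            continue  # skip blank separator lines, if any
        if row[:2] == ["Metric", "Value"]:
            section = "metrics"
        elif row[:2] == ["Top Domains", "Count"]:
            section = "domains"
        elif section == "metrics" and len(row) >= 2:
            metrics[row[0]] = row[1]  # kept as text; values may be int or float
        elif section == "domains" and len(row) >= 2:
            domains[row[0]] = int(row[1])
    return metrics, domains


# A shortened stand-in for the exported file's contents.
example = (
    "Metric,Value\n"
    "Total Pages,87\n"
    "Total Links,1243\n"
    "\n"
    "Top Domains,Count\n"
    "docs.example.com,87\n"
)
metrics, domains = read_stats_csv(example)
print(metrics["Total Pages"], domains)
```

For a real export, the text could come from `Path("crawl_stats.csv").read_text()`.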
See also
- The CLI version: yoink stats.