CrawlStats

Compute, format, and export statistics from a crawl — depth distribution, top domains, content quality.

CrawlStats analyzes a list of Page objects, either passed in directly or loaded from a file, and produces summary metrics. It powers the yoink stats CLI and can also be used programmatically.

Import

from yoink.stats import CrawlStats

Constructing

from pathlib import Path

# From a list of Page objects (e.g., right after a crawl)
stats = CrawlStats(pages)

# From a saved file (.json or .jsonl)
stats = CrawlStats.from_file(Path("crawl_output.jsonl"))
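
To make the two accepted file types concrete, here is a minimal sketch of a loader that picks a parser from the file suffix. It is not yoink's implementation: `load_page_records` is a hypothetical helper, it returns plain dicts rather than Page objects, and it assumes a .json file holds a single top-level array.

```python
import json
from pathlib import Path
from typing import Any


def load_page_records(path: Path) -> list[dict[str, Any]]:
    """Hypothetical helper: read page records from a .json or .jsonl file."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # JSON Lines: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if path.suffix == ".json":
        # Assumed layout: a single top-level JSON array of page objects.
        records = json.loads(text)
        if not isinstance(records, list):
            raise ValueError(f"expected a JSON array in {path}")
        return records
    raise ValueError(f"unsupported file type: {path.suffix!r}")
```

Rejecting unknown suffixes explicitly keeps a mistyped path from being silently parsed as the wrong format.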

Methods

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| compute() | → dict[str, Any] | | Compute and return all metrics as a dict. Cached after first call. |
| format_summary() | → str | | Return a multi-line human-readable summary. |
| export_csv(output_path) | → None | | Write summary stats and top-domain breakdown to a CSV file. |
| from_file(path) | classmethod → CrawlStats | | Load pages from .json or .jsonl and return a CrawlStats. |
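
The "cached after first call" behaviour of compute() can be pictured with a small memoization sketch. `CachedStats` below is a hypothetical stand-in, not the real class, and the single metric it computes is only there to keep the example short.

```python
from typing import Any, Optional


class CachedStats:
    """Hypothetical stand-in for a stats object whose compute() is memoized."""

    def __init__(self, pages: list[dict[str, Any]]) -> None:
        self._pages = pages
        self._cache: Optional[dict[str, Any]] = None
        self.compute_calls = 0  # counts how often the real work runs

    def compute(self) -> dict[str, Any]:
        if self._cache is None:
            # First call: do the work once and remember the result.
            self.compute_calls += 1
            self._cache = {"total_pages": len(self._pages)}
        return self._cache


stats = CachedStats([{"url": "https://docs.example.com/"}])
first = stats.compute()
second = stats.compute()
print(first is second, stats.compute_calls)  # True 1
```

If CrawlStats caches in a similar way, a result computed before the page list changes would not reflect later changes, so building a fresh CrawlStats is the cautious choice when the pages are modified.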

What compute() returns

{
    "total_pages": 87,
    "total_links": 1243,
    "total_text_size": 422291,         # bytes
    "total_html_size": 0,              # 0 if save_html=False
    "avg_links_per_page": 14.29,
    "avg_text_size": 4853.92,
    "avg_html_size": 0,
    "max_depth": 2,
    "pages_by_depth": { 0: 1, 1: 24, 2: 62 },
    "unique_domains": 1,
    "top_domains": [{ "domain": "docs.example.com", "count": 87 }],
    "status_codes": { 200: 87 },
    "pages_with_text": 85,
    "pages_with_title": 87,
    "pages_with_metadata": 73,
    "text_length_min": 142,
    "text_length_median": 3891,
    "text_length_max": 28442,
}
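
For readers who want to see where figures like these could come from, the sketch below derives a few of them with the standard library. The page records are hypothetical dicts with `url`, `depth`, and `text` fields (the real Page class may differ), and computing the text-length figures only over pages that have text is an assumption.

```python
from collections import Counter
from statistics import median
from urllib.parse import urlparse

# Hypothetical page records; the real Page class may expose different fields.
pages = [
    {"url": "https://docs.example.com/", "depth": 0, "text": "Welcome"},
    {"url": "https://docs.example.com/a", "depth": 1, "text": "Alpha page"},
    {"url": "https://docs.example.com/b", "depth": 1, "text": ""},
    {"url": "https://blog.example.com/x", "depth": 2, "text": "Hello world!"},
]

# Count pages per host name and collect lengths of non-empty texts.
domains = Counter(urlparse(p["url"]).netloc for p in pages)
text_lengths = [len(p["text"]) for p in pages if p["text"]]

summary = {
    "total_pages": len(pages),
    "max_depth": max(p["depth"] for p in pages),
    "pages_by_depth": dict(Counter(p["depth"] for p in pages)),
    "unique_domains": len(domains),
    "top_domains": [{"domain": d, "count": c} for d, c in domains.most_common(10)],
    "pages_with_text": len(text_lengths),
    "text_length_min": min(text_lengths),
    "text_length_median": median(text_lengths),
    "text_length_max": max(text_lengths),
}
print(summary)
```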

Examples

After a crawl

import asyncio

from yoink import Crawler, CrawlConfig
from yoink.stats import CrawlStats

async def main():
    crawler = Crawler(CrawlConfig())
    pages = await crawler.crawl("https://example.com")

    stats = CrawlStats(pages)
    print(stats.format_summary())

# Run the coroutine; without this call main() is defined but never executed
asyncio.run(main())

From a saved file

from pathlib import Path
from yoink.stats import CrawlStats
 
stats = CrawlStats.from_file(Path("crawl_output.jsonl"))
data = stats.compute()
 
print(f"Got {data['total_pages']} pages across {data['unique_domains']} domains")
print(f"Median page text: {data['text_length_median']} chars")
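
The dict returned by compute() is also easy to post-process. As one possible sketch, the snippet below turns a `pages_by_depth` mapping into a small text histogram. `depth_histogram` is a hypothetical helper, and the sample values are copied from the compute() output shown earlier so the snippet runs without a crawl.

```python
# Sample values copied from the compute() output shown earlier.
pages_by_depth = {0: 1, 1: 24, 2: 62}


def depth_histogram(by_depth: dict[int, int], width: int = 30) -> list[str]:
    """Render one text bar per depth level, scaled to the largest bucket."""
    largest = max(by_depth.values(), default=0)
    lines = []
    for depth in sorted(by_depth):
        count = by_depth[depth]
        # Non-empty buckets always get at least one mark so they stay visible.
        bar_len = max(1, round(width * count / largest)) if count else 0
        lines.append(f"depth {depth}: {'#' * bar_len} {count}")
    return lines


print("\n".join(depth_histogram(pages_by_depth)))
```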

Filtering by content quality

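# Continues from the previous example, where `stats` was created
data = stats.compute()

# Guard against an empty crawl to avoid dividing by zero
total = data["total_pages"]
text_share = data["pages_with_text"] / total if total else 0.0

if text_share < 0.5:
    print("⚠ Less than half the pages had extractable text; the site may be JS-heavy")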
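
The same check can be extended to the other coverage counters. The sketch below is one way to do it. `quality_warnings` and its `min_share` threshold are hypothetical, and the sample dict reuses the counts from the compute() output shown earlier.

```python
from typing import Any


def quality_warnings(data: dict[str, Any], min_share: float = 0.5) -> list[str]:
    """Return a warning for each content field present on too few pages."""
    total = data["total_pages"]
    if total == 0:
        return ["crawl returned no pages"]
    warnings = []
    for key in ("pages_with_text", "pages_with_title", "pages_with_metadata"):
        share = data[key] / total
        if share < min_share:
            warnings.append(f"{key}: only {share:.0%} of pages")
    return warnings


sample = {"total_pages": 87, "pages_with_text": 85,
          "pages_with_title": 87, "pages_with_metadata": 73}
print(quality_warnings(sample))  # no field falls below 50%, so the list is empty
```

Returning a list rather than printing keeps the check usable from scripts or tests that want to fail a run when coverage drops.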

Export

stats.export_csv(Path("crawl_stats.csv"))

The CSV has two sections:

Metric,Value
Total Pages,87
Total Links,1243
...

Top Domains,Count
docs.example.com,87
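
Because the file holds two tables, a single pass with a plain CSV reader mixes them together. The sketch below splits the text into sections, assuming a blank line separates them as in the sample above. `read_sections` is a hypothetical helper and the sample values are illustrative.

```python
import csv
import io

# Sample in the two-section shape shown above (values are illustrative).
sample_csv = """Metric,Value
Total Pages,87
Total Links,1243

Top Domains,Count
docs.example.com,87
"""


def read_sections(text: str) -> dict[str, list[list[str]]]:
    """Split CSV text into sections keyed by the first cell of each header row."""
    sections: dict[str, list[list[str]]] = {}
    current = None
    for row in csv.reader(io.StringIO(text)):
        if not row:            # a blank line ends the current section
            current = None
        elif current is None:  # first row after a gap is a section header
            current = sections.setdefault(row[0], [])
        else:
            current.append(row)
    return sections


sections = read_sections(sample_csv)
print(sections["Top Domains"])
```

For a file on disk, the text can be read with `Path("crawl_stats.csv").read_text(encoding="utf-8")` and passed to the same function.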
