Python API

Page

The per-page output type: URL, title, extracted text, links, metadata, status code, and depth.

Page is the Pydantic model representing one crawled URL.

Import

from yoink import Page
# or
from yoink.models import Page

Fields

Name	Type	Default	Description
url*	str	—	The URL that was queued and fetched. yoink does not currently rewrite this to the post-redirect URL — if you need the final URL, inspect the response headers via a custom Fetcher subclass.
title	str | None	—	The <title> tag content, if present.
text	str | None	—	Clean extracted text from trafilatura. None if extract_text=False or extraction failed.
html	str | None	—	Raw HTML. Only populated when save_html=True.
links	list[str]	[]	Outbound links discovered on the page (absolute URLs).
metadata	dict[str, str]	{}	OpenGraph / Twitter / standard meta tags.
crawled_at	datetime	—	UTC timestamp when the page was fetched.
status_code	int	200	HTTP response status code.
depth	int	0	Link-hop distance from the start URL.
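Because links holds absolute URLs, they can be compared against the page's own url without any extra resolution step. The following is a minimal sketch of splitting them into internal and external links. PageStub and split_links are hypothetical names used only for illustration and are not part of yoink; any object with url and links attributes, including a real Page, should work the same way.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class PageStub:
    """Hypothetical stand-in exposing only the `url` and `links` fields of Page."""
    url: str
    links: list[str] = field(default_factory=list)


def split_links(page) -> tuple[list[str], list[str]]:
    """Return (internal, external) links by comparing hostnames with page.url."""
    host = urlsplit(page.url).hostname or ""
    internal, external = [], []
    for link in page.links:
        link_host = urlsplit(link).hostname or ""
        # Same hostname as the page itself counts as internal.
        (internal if link_host == host else external).append(link)
    return internal, external


page = PageStub(
    url="https://example.com/about",
    links=[
        "https://example.com/",
        "https://example.com/contact",
        "https://other.example.org/x",
    ],
)
internal, external = split_links(page)
print(internal)
print(external)
```

This compares exact hostnames, so www.example.com and example.com would be treated as different sites; adjust the comparison if that is not what you want.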

Methods

Page inherits all standard Pydantic v2 methods:

page.model_dump()             # → dict
page.model_dump(mode="json")  # → JSON-safe dict (datetimes as strings)
page.model_dump_json()        # → str
Page.model_validate(data)     # construct from dict
Page.model_validate_json(s)   # construct from JSON string

Examples

Inspecting after a crawl

# Assumes `crawler` is an already-configured crawler and that this runs inside an async function.
pages = await crawler.crawl("https://example.com")

for page in pages:
    print(f"[{page.status_code}] depth={page.depth} {page.url}")
    print(f"  title: {page.title or '(none)'}")
    print(f"  text:  {len(page.text or '')} chars, {len(page.links)} links")
    if "og:image" in page.metadata:
        print(f"  image: {page.metadata['og:image']}")

Reading pages back from JSONL

from yoink import Page

pages: list[Page] = []
with open("crawl_output.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip blank lines, e.g. a trailing newline at end of file
            pages.append(Page.model_validate_json(line))

print(f"Loaded {len(pages)} pages")
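A crawl that was interrupted can leave a truncated last line in the output file. As a rough sketch of a more tolerant reader, the loop below uses only the standard library and parses each line into a plain dict, recording the line numbers that fail instead of aborting. The in-memory io.StringIO sample stands in for the real file and its contents are made up for illustration. With Page.model_validate_json you would catch pydantic.ValidationError in place of json.JSONDecodeError.

```python
import io
import json

# Stand-in for open("crawl_output.jsonl"): two good records, a blank line,
# and one truncated record (illustrative data only).
f = io.StringIO(
    '{"url": "https://example.com/", "depth": 0}\n'
    "\n"
    '{"url": "https://example.com/about", "depth": 1}\n'
    '{"url": "https://example.com/bro\n'
)

records, bad_lines = [], []
for lineno, line in enumerate(f, start=1):
    if not line.strip():
        continue  # ignore blank lines
    try:
        records.append(json.loads(line))
    except json.JSONDecodeError:
        bad_lines.append(lineno)  # remember where parsing failed

print(f"Loaded {len(records)} records, skipped lines: {bad_lines}")
```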

Filtering for content quality

from collections import defaultdict

# Keep only pages with at least 500 chars of clean text
substantial = [p for p in pages if p.text and len(p.text) >= 500]

# Group by depth
by_depth = defaultdict(list)
for p in pages:
    by_depth[p.depth].append(p)
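The same attribute-based approach extends to status_code. The sketch below keeps only 2xx responses and then tallies page counts and extracted-text volume per depth. The SimpleNamespace objects are hypothetical stand-ins that carry only the Page fields used here, so the snippet runs without yoink installed; a list of real Page objects could be substituted directly.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins carrying only the Page fields used below.
pages = [
    SimpleNamespace(url="https://example.com/", status_code=200, depth=0, text="x" * 600),
    SimpleNamespace(url="https://example.com/a", status_code=200, depth=1, text=None),
    SimpleNamespace(url="https://example.com/b", status_code=404, depth=1, text="Not found"),
    SimpleNamespace(url="https://example.com/c", status_code=200, depth=2, text="y" * 900),
]

# Keep successful responses only.
ok = [p for p in pages if 200 <= p.status_code < 300]

pages_per_depth = Counter(p.depth for p in ok)
chars_per_depth = Counter()
for p in ok:
    # text may be None, so fall back to an empty string.
    chars_per_depth[p.depth] += len(p.text or "")

for depth in sorted(pages_per_depth):
    print(f"depth={depth}: {pages_per_depth[depth]} pages, {chars_per_depth[depth]} chars")
```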

JSON shape

When serialized with model_dump_json() (or model_dump(mode="json")), a Page looks like this:

{
  "url": "https://example.com/about",
  "title": "About Example",
  "text": "Example is a domain established for...",
  "html": null,
  "links": ["https://example.com/", "https://example.com/contact"],
  "metadata": {
    "description": "About page",
    "og:title": "About Example",
    "og:type": "website"
  },
  "crawled_at": "2026-05-03T12:34:56.789012",
  "status_code": 200,
  "depth": 1
}
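The crawled_at value above carries no UTC offset, although the field is described as a UTC timestamp. If you read the JSON without going through Page, a minimal sketch for recovering a timezone-aware datetime might look like this. It treats an offset-less timestamp as UTC, which is an assumption based on the field description rather than something yoink is confirmed to guarantee.

```python
import json
from datetime import datetime, timezone

# A trimmed copy of the serialized record shown above.
record = json.loads("""
{
  "url": "https://example.com/about",
  "crawled_at": "2026-05-03T12:34:56.789012",
  "status_code": 200,
  "depth": 1
}
""")

crawled_at = datetime.fromisoformat(record["crawled_at"])
if crawled_at.tzinfo is None:
    # Assumption: a timestamp without an offset is UTC, per the field description.
    crawled_at = crawled_at.replace(tzinfo=timezone.utc)

print(crawled_at.isoformat())
```

Making the value timezone-aware avoids the TypeError Python raises when naive and aware datetimes are compared or subtracted.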