Python API

Writers

Serialize a list of Page objects to JSON, JSONL, Parquet, or plain text.

docs/api/writers.mdx·edit on github ↗·

Writer is a static helper class with one method per output format. Used internally by the CLI; available directly for programmatic use.

Import

from yoink.writers import Writer

Methods

NameTypeDefaultDescription
write_json(pages, path)staticmethodSingle JSON array. Good for small datasets.
write_jsonl(pages, path)staticmethodNewline-delimited JSON. Recommended for AI pipelines and large datasets.
write_parquet(pages, path)staticmethodColumnar Parquet. Requires the [parquet] extra (pyarrow). Snappy-compressed.
write_text(pages, path)staticmethodPlain text dump — URL, title, separator, extracted text.

Examples

After a crawl

from pathlib import Path
from yoink import Crawler
from yoink.writers import Writer
 
async def main():
    crawler = Crawler()
    pages = await crawler.crawl("https://example.com")
 
    Writer.write_jsonl(pages, Path("data.jsonl"))
    Writer.write_parquet(pages, Path("data.parquet"))

Parquet for analytics

import pandas as pd
 
Writer.write_parquet(pages, Path("data.parquet"))
 
df = pd.read_parquet("data.parquet")
print(df["depth"].value_counts())
print(df.groupby("depth")["num_links"].mean())

The Parquet schema is flattened for analytical queries:

parquet / flattened schemawriters.py · Writer.write_parquet
columntypenote
  • 01urlstring
  • 02titlestring?nullable
  • 03textstring?nullable
  • 04crawled_atstringISO 8601 timestamp
  • 05status_codeint64
  • 06depthint64
  • 07num_linksint64links list → count only
  • 08metadatastringJSON-encoded dict
!

links and html are dropped from Parquet output. The link count is preserved as num_links; raw HTML is never written even when save_html=True. Use JSONL if you need either.

snappy-compressed columnar format. portable across pandas / pyarrow / DuckDB / Athena.

Text dump

Writer.write_text(pages, Path("dump.txt"))

Output:

URL: https://example.com
Title: Example Domain
--------------------------------------------------------------------------------
This domain is for use in illustrative examples in documents...

================================================================================

URL: https://example.com/about
Title: About
...

When a page has no title, the field is rendered as Title: N/A.

Output format choice

FormatBest forStreaming?Compressed?
jsonlAI / ML pipelines, large datasetsyesno
jsonSmall datasets, debuggingnono
parquetAnalytics, pandas, columnar storageyes (rows)yes (snappy)
textQuick eyeballing, archivalnono

For most use cases, JSONL is the default answer.

See also