Writers
Serialize a list of Page objects to JSON, JSONL, Parquet, or plain text.
Writer is a static helper class with one method per output format. Used internally by the CLI; available directly for programmatic use.
Import
from yoink.writers import WriterMethods
| Name | Type | Default | Description |
|---|---|---|---|
| write_json(pages, path) | staticmethod | — | Single JSON array. Good for small datasets. |
| write_jsonl(pages, path) | staticmethod | — | Newline-delimited JSON. Recommended for AI pipelines and large datasets. |
| write_parquet(pages, path) | staticmethod | — | Columnar Parquet. Requires the [parquet] extra (pyarrow). Snappy-compressed. |
| write_text(pages, path) | staticmethod | — | Plain text dump — URL, title, separator, extracted text. |
Examples
After a crawl
from pathlib import Path
from yoink import Crawler
from yoink.writers import Writer
async def main():
crawler = Crawler()
pages = await crawler.crawl("https://example.com")
Writer.write_jsonl(pages, Path("data.jsonl"))
Writer.write_parquet(pages, Path("data.parquet"))Parquet for analytics
import pandas as pd
Writer.write_parquet(pages, Path("data.parquet"))
df = pd.read_parquet("data.parquet")
print(df["depth"].value_counts())
print(df.groupby("depth")["num_links"].mean())The Parquet schema is flattened for analytical queries:
columntypenote
- 01urlstring—
- 02titlestring?nullable
- 03textstring?nullable
- 04crawled_atstringISO 8601 timestamp
- 05status_codeint64—
- 06depthint64—
- 07num_linksint64links list → count only
- 08metadatastringJSON-encoded dict
!
links and html are dropped from Parquet output. The link count is preserved as num_links; raw HTML is never written even when save_html=True. Use JSONL if you need either.
Text dump
Writer.write_text(pages, Path("dump.txt"))Output:
URL: https://example.com
Title: Example Domain
--------------------------------------------------------------------------------
This domain is for use in illustrative examples in documents...
================================================================================
URL: https://example.com/about
Title: About
...
When a page has no title, the field is rendered as Title: N/A.
Output format choice
| Format | Best for | Streaming? | Compressed? |
|---|---|---|---|
jsonl | AI / ML pipelines, large datasets | yes | no |
json | Small datasets, debugging | no | no |
parquet | Analytics, pandas, columnar storage | yes (rows) | yes (snappy) |
text | Quick eyeballing, archival | no | no |
For most use cases, JSONL is the default answer.
See also
- Output formats reference.
Page— what gets serialized.