Reference

Output formats

JSON, JSONL, Parquet, and plain text — exact shapes, when to use each.

docs/reference/output-formats.mdx·edit on github ↗·

yoink writes crawl results in four formats. They all carry the same Page data, but differ in shape, streamability, and compression.

Format comparison

FormatBest forStreamableCompressedExtras needed
jsonlAI/ML, large datasetsyes (rows)no
jsonSmall datasets, debuggingnono
parquetAnalytics, pandasyes (rows)snappy[parquet]
textEyeballingnono

JSON

A single JSON array. Easy to read, easy to break: large arrays must be loaded entirely into memory.

yoink crawl https://example.com -f json -o data.json
[
  {
    "url": "https://example.com",
    "title": "Example Domain",
    "text": "...",
    "html": null,
    "links": ["https://example.com/about"],
    "metadata": {},
    "crawled_at": "2026-05-03T12:00:00",
    "status_code": 200,
    "depth": 0
  },
  { "url": "https://example.com/about", "...": "..." }
]

JSONL (recommended)

Newline-delimited JSON. One Page per line. Streamable and grep-friendly.

yoink crawl https://example.com -f jsonl -o data.jsonl
{"url": "https://example.com", "title": "Example Domain", ...}
{"url": "https://example.com/about", "title": "About", ...}

Reading with the standard library:

import json
from yoink import Page
 
with open("data.jsonl") as f:
    pages = [Page.model_validate_json(line) for line in f]

Streaming (don't load it all):

def iter_pages(path):
    with open(path) as f:
        for line in f:
            yield Page.model_validate_json(line)
 
for page in iter_pages("data.jsonl"):
    process(page)

Parquet

Columnar storage. Smaller files, faster analytical queries. Requires pip install "yoink[parquet]".

yoink crawl https://example.com -f parquet -o data.parquet

The schema is flattenedlinks becomes num_links, metadata becomes a JSON-encoded string. This is intentional: it keeps the file portable and analytical queries fast.

parquet / flattened schemawriters.py · Writer.write_parquet
columntypenote
  • 01urlstring
  • 02titlestring?nullable
  • 03textstring?nullable
  • 04crawled_atstringISO 8601 timestamp
  • 05status_codeint64
  • 06depthint64
  • 07num_linksint64links list → count only
  • 08metadatastringJSON-encoded dict
snappy-compressed columnar format. portable across pandas / pyarrow / DuckDB / Athena.

Compression is snappy for fast read/write. Read with pandas / pyarrow / DuckDB:

import pandas as pd
df = pd.read_parquet("data.parquet")
 
# Or DuckDB for SQL
import duckdb
duckdb.sql("SELECT depth, count(*) FROM 'data.parquet' GROUP BY depth").show()

Text

Plain text dump. Good for archival and quick visual inspection.

yoink crawl https://example.com -f text -o data.txt

Format:

URL: https://example.com
Title: Example Domain
--------------------------------------------------------------------------------
This domain is for use in illustrative examples in documents.

================================================================================

URL: https://example.com/about
Title: About
--------------------------------------------------------------------------------
...

This format is one-way — you can't reliably load it back into Page objects. Use JSONL for round-tripping.

Choosing

  • AI training / RAG indexing? JSONL.
  • Pandas / DuckDB / Athena? Parquet.
  • Throwaway one-shot? JSON.
  • Quick read? Text.

See also

  • Writer — programmatic output.
  • Page — the underlying data shape.