Python API

Crawler

The main async web crawler — wires together the fetcher, parser, scheduler, and rate limiter.

yoink.Crawler is the entry point for programmatic use. It owns the worker pool and orchestrates a crawl from a start URL.

Import

from yoink import Crawler, CrawlConfig

Constructor

Crawler(
    config: CrawlConfig | None = None,
    url_filter: CombinedFilter | None = None,
    checkpoint_manager: CheckpointManager | None = None,
)

Name	Type	Default	Description
config	CrawlConfig | None	None	Configuration for crawler behavior. If omitted, a default CrawlConfig() is used. See CrawlConfig reference.
url_filter	CombinedFilter | None	None	Optional filter for include/exclude patterns and domain matching.
checkpoint_manager	CheckpointManager | None	None	Optional checkpoint manager. If set, pages stream to disk/S3 as they're crawled.

Methods

`crawl(start_url, resume=False)`

Crawl a website starting from start_url.

async def crawl(
    self,
    start_url: str,
    resume: bool = False,
) -> list[Page]

Name	Type	Default	Description
start_url	str	(required)	The starting URL. Must include the scheme, for example https://.
resume	bool	False	If True and a checkpoint_manager is set, restore the visited URLs and the queue from the checkpoint before crawling.

Returns: list[Page] — every page yoinked. Note that pages are also accumulated in crawler.pages, which you can read mid-crawl from another coroutine.
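
Since start_url must include a scheme, it can help to normalize user-supplied input before calling crawl(). The helper below is a sketch that uses only the standard library. ensure_scheme is a hypothetical name, not part of yoink, and it assumes that only http and https URLs are of interest.

```python
from urllib.parse import urlsplit


def ensure_scheme(url: str, default_scheme: str = "https") -> str:
    """Return url unchanged if it is an http(s) URL, else prepend default_scheme.

    Hypothetical helper for illustration; not part of yoink.
    """
    parts = urlsplit(url)
    if parts.scheme in ("http", "https") and parts.netloc:
        return url
    if "://" in url:
        # An explicit scheme other than http(s), e.g. ftp://
        raise ValueError(f"unsupported URL scheme: {url!r}")
    # Bare host or protocol-relative URL: strip leading slashes, add the scheme
    return f"{default_scheme}://{url.lstrip('/')}"
```

The result can then be passed straight through, for example await crawler.crawl(ensure_scheme(user_input)).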

`crawl_with_progress(start_url, resume=False)`

Same as crawl() but renders a tqdm progress bar to stderr. Used by the CLI.

async def crawl_with_progress(
    self,
    start_url: str,
    resume: bool = False,
) -> list[Page]
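
Because the progress bar is written to stderr, a script may want it only in interactive sessions. The following is a minimal sketch that relies on nothing beyond the two method signatures above. run_crawl and its show_progress parameter are hypothetical names, not part of yoink.

```python
import asyncio
import sys
from typing import Optional


async def run_crawl(crawler, start_url: str, resume: bool = False,
                    show_progress: Optional[bool] = None):
    """Pick crawl() or crawl_with_progress() for the given crawler.

    Hypothetical helper: when show_progress is None, the bar is shown
    only if stderr is attached to a terminal.
    """
    if show_progress is None:
        show_progress = sys.stderr.isatty()
    if show_progress:
        return await crawler.crawl_with_progress(start_url, resume=resume)
    return await crawler.crawl(start_url, resume=resume)
```

Under this sketch, output redirected to a log file or a CI job gets the plain crawl() path, so no progress-bar control characters end up in the log.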

Attributes

Name	Type	Description
config	CrawlConfig	The active configuration.
pages	list[Page]	Pages accumulated so far. Mutated during the crawl.
scheduler	Scheduler	The URL queue. Holds visited and filtered sets.
rate_limiter	RateLimiter	Per-domain rate limiter.
robots_checker	RobotsChecker | None	Set when respect_robots=True.
checkpoint_manager	CheckpointManager | None	The checkpoint manager passed to the constructor.
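
Since pages is filled in as the crawl proceeds, one possible use is keeping partial results when a crawl exceeds a time budget. The sketch below relies only on crawl() and the pages attribute. crawl_with_deadline is a hypothetical helper, and it assumes that cancelling crawl() leaves crawler.pages intact, which this page does not confirm.

```python
import asyncio


async def crawl_with_deadline(crawler, start_url: str, timeout: float):
    """Run crawler.crawl(), returning partial results if timeout elapses.

    Hypothetical helper for illustration; not part of yoink.
    """
    try:
        return await asyncio.wait_for(crawler.crawl(start_url), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for() cancelled the crawl. Copy the list so later
        # mutation of crawler.pages cannot change the returned snapshot.
        return list(crawler.pages)
```

Copying the list is a deliberate choice here: pages is mutated during the crawl, so handing out the live list would let the caller's view change underneath them.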

Examples

Minimal crawl

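import asyncio
from yoink import Crawler

async def main():
    # No arguments: the crawler runs with the default CrawlConfig()
    crawler = Crawler()
    pages = await crawler.crawl("https://example.com")
    print(f"crawled {len(pages)} pages")
    return pages

asyncio.run(main())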

With config and filter

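import asyncio

from yoink import Crawler, CrawlConfig
from yoink.filters import CombinedFilter

config = CrawlConfig(
    max_depth=3,
    max_pages=500,
    requests_per_second=5.0,
    render_js=True,
)

# Keep URLs matching */api/* and skip links to .pdf and .zip files
url_filter = CombinedFilter.from_config(
    include_patterns=["*/api/*"],
    skip_extensions=["pdf", "zip"],
)

async def main():
    # await is only valid inside a coroutine, so the crawl runs in main()
    crawler = Crawler(config=config, url_filter=url_filter)
    return await crawler.crawl("https://docs.example.com")

pages = asyncio.run(main())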

Mid-crawl progress (custom)

import asyncio
from yoink import Crawler, CrawlConfig
 
async def report(crawler: Crawler):
    while True:
        await asyncio.sleep(2)
        print(f"...crawled {len(crawler.pages)} pages")
 
async def main():
    crawler = Crawler(CrawlConfig(max_pages=1000))
    reporter = asyncio.create_task(report(crawler))
    try:
        return await crawler.crawl("https://example.com")
    finally:
        reporter.cancel()
 
asyncio.run(main())

With checkpointing

See Checkpointing and CheckpointManager for full coverage.

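import asyncio

from yoink import Crawler, CrawlConfig, CheckpointManager

async def main():
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
    crawler = Crawler(config=CrawlConfig(), checkpoint_manager=checkpoint)
    # resume=True restores visited URLs and the queue from the checkpoint
    return await crawler.crawl("https://example.com", resume=True)

pages = asyncio.run(main())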