Python API

Crawler

The main async web crawler. It wires together the fetcher, parser, scheduler, and rate limiter.


yoink.Crawler is the entry point for programmatic use. It owns the worker pool and orchestrates a crawl from a start URL.

Import

from yoink import Crawler, CrawlConfig

Constructor

Crawler(
    config: CrawlConfig | None = None,
    url_filter: CombinedFilter | None = None,
    checkpoint_manager: CheckpointManager | None = None,
)

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| config | CrawlConfig \| None | CrawlConfig() | Configuration for crawler behavior. See CrawlConfig reference. |
| url_filter | CombinedFilter \| None | None | Optional filter for include/exclude patterns and domain matching. |
| checkpoint_manager | CheckpointManager \| None | None | Optional checkpoint manager. If set, pages stream to disk/S3 as they're crawled. |

Methods

crawl(start_url, resume=False)

Crawl a website starting from start_url.

async def crawl(
    self,
    start_url: str,
    resume: bool = False,
) -> list[Page]

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| start_url* | str | | The starting URL. Must include scheme. |
| resume | bool | False | If True and a checkpoint_manager is set, restore visited URLs and queue from checkpoint before crawling. |

Returns: list[Page], containing every page yoinked. Pages are also accumulated in crawler.pages, which you can read mid-crawl from another coroutine.
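
Because start_url must include a scheme, it can help to check user-supplied URLs before handing them to crawl(). The following is a minimal caller-side sketch using only the standard library. The helper names has_scheme and normalize_start_url are hypothetical and not part of yoink, and the restriction to http and https is an assumption.

```python
from urllib.parse import urlparse


def has_scheme(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host part."""
    parts = urlparse(url)
    # Assumption: only http and https are meaningful start URLs for a web crawl.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def normalize_start_url(url: str, default_scheme: str = "https") -> str:
    """Prepend a scheme when it is missing (hypothetical helper, not a yoink API)."""
    return url if has_scheme(url) else f"{default_scheme}://{url}"
```

With this in place, a bare host such as example.com would be turned into https://example.com before being passed as start_url.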

crawl_with_progress(start_url, resume=False)

Same as crawl() but renders a tqdm progress bar to stderr. Used by the CLI.

async def crawl_with_progress(
    self,
    start_url: str,
    resume: bool = False,
) -> list[Page]
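
Since both methods share the same signature, a script can choose between them at runtime, for example showing the progress bar only when stderr is an interactive terminal. The sketch below illustrates that choice with a stand-in class so it runs without yoink installed. StubCrawler and pick_crawl_method are hypothetical names, and the stub returns strings rather than Page objects.

```python
import asyncio
import sys
from typing import Awaitable, Callable, TextIO


class StubCrawler:
    """Stand-in exposing the two documented method names; it does no network I/O."""

    async def crawl(self, start_url: str, resume: bool = False) -> list[str]:
        return [start_url]

    async def crawl_with_progress(self, start_url: str, resume: bool = False) -> list[str]:
        return [start_url]


def pick_crawl_method(crawler, stream: TextIO = sys.stderr) -> Callable[..., Awaitable[list]]:
    """Pick the progress-bar variant only when the stream is a terminal."""
    return crawler.crawl_with_progress if stream.isatty() else crawler.crawl


async def main() -> list[str]:
    crawler = StubCrawler()  # in real code this would be yoink.Crawler()
    run = pick_crawl_method(crawler)
    return await run("https://example.com")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

This keeps log files and CI output free of progress-bar control characters while still showing the bar in an interactive shell.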

Attributes

| Name | Type | Description |
| --- | --- | --- |
| config | CrawlConfig | The active configuration. |
| pages | list[Page] | Pages accumulated so far. Mutated during the crawl. |
| scheduler | Scheduler | The URL queue. Holds visited and filtered sets. |
| rate_limiter | RateLimiter | Per-domain rate limiter. |
| robots_checker | RobotsChecker \| None | Set when respect_robots=True. |
| checkpoint_manager | CheckpointManager \| None | The checkpoint manager passed to the constructor. |

Examples

Minimal crawl

import asyncio
from yoink import Crawler
 
async def main():
    crawler = Crawler()
    pages = await crawler.crawl("https://example.com")
    return pages
 
asyncio.run(main())

With config and filter

import asyncio

from yoink import Crawler, CrawlConfig
from yoink.filters import CombinedFilter

config = CrawlConfig(
    max_depth=3,
    max_pages=500,
    requests_per_second=5.0,
    render_js=True,
)

url_filter = CombinedFilter.from_config(
    include_patterns=["*/api/*"],
    skip_extensions=["pdf", "zip"],
)

async def main():
    # await is only valid inside a coroutine, so the crawl runs in main()
    crawler = Crawler(config=config, url_filter=url_filter)
    return await crawler.crawl("https://docs.example.com")

pages = asyncio.run(main())
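
To build intuition for what the include_patterns and skip_extensions values above might select, here is a rough standalone sketch using glob matching from the standard library. It is not yoink's implementation: the function should_crawl is hypothetical, and the matching rules (glob against the full URL, extensions compared without the dot) are assumptions.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from urllib.parse import urlparse


def should_crawl(url: str, include_patterns: list[str], skip_extensions: list[str]) -> bool:
    """Illustrative filter: glob-match the full URL, then reject skipped extensions."""
    # Assumption: a URL passes if it matches at least one include pattern.
    if include_patterns and not any(fnmatchcase(url, p) for p in include_patterns):
        return False
    # Assumption: extensions are given without the dot and compared case-insensitively.
    suffix = PurePosixPath(urlparse(url).path).suffix.lstrip(".").lower()
    return suffix not in {ext.lower() for ext in skip_extensions}
```

Under these assumptions, https://docs.example.com/api/users would be kept, while https://docs.example.com/api/spec.pdf and https://docs.example.com/blog/post would be dropped.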

Mid-crawl progress (custom)

import asyncio
from yoink import Crawler, CrawlConfig
 
async def report(crawler: Crawler):
    while True:
        await asyncio.sleep(2)
        print(f"...crawled {len(crawler.pages)} pages")
 
async def main():
    crawler = Crawler(CrawlConfig(max_pages=1000))
    reporter = asyncio.create_task(report(crawler))
    try:
        return await crawler.crawl("https://example.com")
    finally:
        reporter.cancel()
 
asyncio.run(main())

With checkpointing

See Checkpointing and CheckpointManager for full coverage.

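import asyncio

from yoink import Crawler, CrawlConfig, CheckpointManager

async def main():
    checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
    crawler = Crawler(config=CrawlConfig(), checkpoint_manager=checkpoint)
    # resume=True restores visited URLs and the queue from the checkpoint
    return await crawler.crawl("https://example.com", resume=True)

pages = asyncio.run(main())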
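
This page does not say how resume=True behaves when no checkpoint exists yet, so a script that runs repeatedly may want to set the flag only when a previous checkpoint is present. The sketch below assumes a local checkpoint file such as ./crawl.jsonl. The helper should_resume is hypothetical, and it would not apply to S3 URIs.

```python
from pathlib import Path


def should_resume(checkpoint_path: str) -> bool:
    """Return True only if a non-empty local checkpoint file already exists."""
    path = Path(checkpoint_path)
    # Assumption: an empty or missing file means there is nothing to restore.
    return path.is_file() and path.stat().st_size > 0
```

The result can then be passed as the resume argument, for example crawl("https://example.com", resume=should_resume("./crawl.jsonl")).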

See also