Crawler
The main async web crawler — wires together the fetcher, parser, scheduler, and rate limiter.
yoink.Crawler is the entry point for programmatic use. It owns the worker pool and orchestrates a crawl from a start URL.
Import
from yoink import Crawler, CrawlConfig
Constructor
Crawler(
    config: CrawlConfig | None = None,
    url_filter: CombinedFilter | None = None,
    checkpoint_manager: CheckpointManager | None = None,
)
| Name | Type | Default | Description |
|---|---|---|---|
| config | CrawlConfig \| None | CrawlConfig() | Configuration for crawler behavior. See CrawlConfig reference. |
| url_filter | CombinedFilter \| None | None | Optional filter for include/exclude patterns and domain matching. |
| checkpoint_manager | CheckpointManager \| None | None | Optional checkpoint manager. If set, pages stream to disk/S3 as they're crawled. |
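To make "include/exclude patterns and domain matching" concrete, here is a minimal standalone sketch of the kind of check a URL filter might perform. It is not yoink's CombinedFilter implementation: the helper name `allow_url`, its parameters, and the use of shell-style glob matching are assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def allow_url(url, include_patterns=(), skip_extensions=(), allowed_domains=()):
    """Return True if `url` passes every configured check (hypothetical helper)."""
    parsed = urlparse(url)

    # Domain matching: accept the exact host or any subdomain of an allowed domain.
    host = (parsed.hostname or "").lower()
    if allowed_domains and not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    ):
        return False

    # Extension check: compare the suffix of the last path segment, ignoring case.
    last_segment = parsed.path.rsplit("/", 1)[-1]
    if "." in last_segment:
        ext = last_segment.rsplit(".", 1)[-1].lower()
        if ext in {e.lower().lstrip(".") for e in skip_extensions}:
            return False

    # Include patterns: when any are given, at least one must match the full URL.
    if include_patterns and not any(fnmatchcase(url, p) for p in include_patterns):
        return False
    return True
```

With `include_patterns=["*/api/*"]` and `skip_extensions=["pdf", "zip"]`, this sketch accepts `https://docs.example.com/api/users` and rejects both `https://docs.example.com/guide/intro` and `https://docs.example.com/api/spec.pdf`.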
Methods
crawl(start_url, resume=False)
Crawl a website starting from start_url.
async def crawl(
    self,
    start_url: str,
    resume: bool = False,
) -> list[Page]
| Name | Type | Default | Description |
|---|---|---|---|
| start_url* | str | — | The starting URL. Must include scheme. |
| resume | bool | False | If True and a checkpoint_manager is set, restore visited URLs and queue from checkpoint before crawling. |
Returns: list[Page] — every page yoinked. Note that pages are also accumulated in crawler.pages, which you can read mid-crawl from another coroutine.
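Because start_url must include a scheme, it can help to validate user input before handing it to crawl(). The following is a small sketch using only the standard library. The helper name `has_http_scheme` is hypothetical, and restricting the check to http and https is an assumption, since this page does not say which schemes are accepted.

```python
from urllib.parse import urlparse


def has_http_scheme(url: str) -> bool:
    """Return True when `url` has an http(s) scheme and a host (hypothetical helper)."""
    parsed = urlparse(url)
    # Assumption: only http and https are treated as crawlable schemes.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A bare host such as `example.com` fails this check, while `https://example.com` passes.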
crawl_with_progress(start_url, resume=False)
Same as crawl() but renders a tqdm progress bar to stderr. Used by the CLI.
async def crawl_with_progress(
    self,
    start_url: str,
    resume: bool = False,
) -> list[Page]
Attributes
| Name | Type | Default | Description |
|---|---|---|---|
| config | CrawlConfig | — | The active configuration. |
| pages | list[Page] | — | Pages accumulated so far. Mutated during the crawl. |
| scheduler | Scheduler | — | The URL queue. Holds visited and filtered sets. |
| rate_limiter | RateLimiter | — | Per-domain rate limiter. |
| robots_checker | RobotsChecker \| None | — | Set when respect_robots=True. |
| checkpoint_manager | CheckpointManager \| None | — | The checkpoint manager passed to the constructor. |
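As an illustration of what "per-domain" rate limiting means, the sketch below spaces out requests to each host independently. It is not yoink's RateLimiter: the class name `PerDomainLimiter`, its methods, and the fixed-interval strategy are assumptions chosen to keep the idea short.

```python
import asyncio
import time
from urllib.parse import urlparse


class PerDomainLimiter:
    """Space requests to each host at least 1 / requests_per_second apart."""

    def __init__(self, requests_per_second: float) -> None:
        self.min_interval = 1.0 / requests_per_second
        self._next_slot: dict[str, float] = {}  # host -> earliest allowed start time

    def reserve(self, url: str, now: float) -> float:
        """Book the next slot for the URL's host and return the delay in seconds."""
        host = urlparse(url).hostname or ""
        start = max(now, self._next_slot.get(host, now))
        self._next_slot[host] = start + self.min_interval
        return start - now

    async def wait(self, url: str) -> None:
        """Sleep until this URL's host may be contacted again."""
        delay = self.reserve(url, time.monotonic())
        if delay > 0:
            await asyncio.sleep(delay)
```

In this sketch a worker would call `await limiter.wait(url)` before each fetch. Because `reserve` never awaits, concurrent coroutines on one event loop cannot book the same slot, and hosts do not delay one another.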
Examples
Minimal crawl
import asyncio
from yoink import Crawler
async def main():
    crawler = Crawler()
    pages = await crawler.crawl("https://example.com")
    return pages
asyncio.run(main())
With config and filter
import asyncio
from yoink import Crawler, CrawlConfig
from yoink.filters import CombinedFilter
config = CrawlConfig(
    max_depth=3,
    max_pages=500,
    requests_per_second=5.0,
    render_js=True,
)
url_filter = CombinedFilter.from_config(
    include_patterns=["*/api/*"],
    skip_extensions=["pdf", "zip"],
)
async def main():
    # await is only valid inside a coroutine, so the crawl runs from main()
    crawler = Crawler(config=config, url_filter=url_filter)
    return await crawler.crawl("https://docs.example.com")
pages = asyncio.run(main())
Mid-crawl progress (custom)
import asyncio
from yoink import Crawler, CrawlConfig
async def report(crawler: Crawler):
    while True:
        await asyncio.sleep(2)
        print(f"...crawled {len(crawler.pages)} pages")
async def main():
    crawler = Crawler(CrawlConfig(max_pages=1000))
    reporter = asyncio.create_task(report(crawler))
    try:
        return await crawler.crawl("https://example.com")
    finally:
        reporter.cancel()
asyncio.run(main())
With checkpointing
See Checkpointing and CheckpointManager for full coverage.
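import asyncio
from yoink import Crawler, CrawlConfig, CheckpointManager
checkpoint = CheckpointManager.from_uri("./crawl.jsonl")
async def main():
    crawler = Crawler(config=CrawlConfig(), checkpoint_manager=checkpoint)
    # resume=True restores visited URLs and the queue from the checkpoint
    return await crawler.crawl("https://example.com", resume=True)
pages = asyncio.run(main())
See also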
- CrawlConfig: every knob.
- Page: the per-page output type.
- Architecture: how the components fit.