
robots.txt compliance

How yoink parses, caches, and applies robots.txt directives — Allow, Disallow, Crawl-delay, Sitemap.


yoink respects robots.txt by default. The RobotsChecker is consulted before every fetch, and disallowed URLs are filtered out before they ever hit the queue.

What's supported

✅ User-agent matching — exact, partial substring, and * wildcard fallback.
✅ Disallow rules with wildcard (*) and end-anchor ($) patterns.
✅ Allow rules (longer/more-specific paths win).
✅ Crawl-delay — narrows the rate limiter for that domain.
✅ Sitemap directives — parsed and stored on each domain's RobotsDirectives.sitemaps list.
✅ Per-domain caching with a 1-hour default TTL.
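To make the list above concrete, here is a minimal sketch of pulling those directives out of a robots.txt body. It is an illustration under stated assumptions, not yoink's parser: `parse_robots` and the `Group` class are hypothetical names, the sitemap URL is a placeholder, and for simplicity the sketch returns sitemaps as one flat list instead of storing them on a RobotsDirectives object.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Group:
    """Hypothetical per-user-agent record (not yoink's RobotsDirectives)."""
    rules: list = field(default_factory=list)   # (kind, path) tuples in file order
    crawl_delay: Optional[float] = None


def parse_robots(text: str):
    """Return ({user_agent: Group}, [sitemap URLs]) for a robots.txt body."""
    groups = {}
    sitemaps = []
    current = []            # user agents the following rules apply to
    reading_agents = False  # True while consecutive User-agent lines stack up

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # strip comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()

        if key == "sitemap":                     # sitemaps are not tied to a group
            sitemaps.append(value)
        elif key == "user-agent":
            if not reading_agents:               # a new group starts here
                current = []
            reading_agents = True
            current.append(value.lower())
            groups.setdefault(value.lower(), Group())
        else:
            reading_agents = False
            # An empty Allow/Disallow value carries no restriction, so skip it.
            if key in ("allow", "disallow") and value:
                for ua in current:
                    groups[ua].rules.append((key, value))
            elif key == "crawl-delay":
                try:
                    delay = float(value)
                except ValueError:
                    continue                     # ignore malformed delays
                for ua in current:
                    groups[ua].crawl_delay = delay
    return groups, sitemaps


SAMPLE = """\
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""

groups, sitemaps = parse_robots(SAMPLE)
print(groups["*"].rules)
print(groups["*"].crawl_delay, sitemaps)
```

The sketch keeps rules in file order and leaves sorting to the matching step, which is where rule precedence is decided.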

How it fits in

Decision flow (robots.py)

Input: a URL, passed to RobotsChecker (async), which runs four steps:

01 Cached for this domain? (TTL: 1h)
02 If not, fetch /robots.txt using the Crawler's Fetcher.
03 Match the user_agent block: exact → partial → *.
04 Apply Allow / Disallow rules; the longest path wins.

Outcome:

Allowed: fetcher.fetch(url) runs and the URL continues to the rate-limit gate.
Blocked: skip, log, and return; the URL never enters the queue.

The check is evaluated before every URL is added to the queue, with a 1-hour cache per domain.

Pattern matching

yoink approximates RFC 9309:

* matches any sequence of characters (greedy).
$ at the end of a pattern anchors the match to the end of the URL as matched (path plus any query string).
Rules are sorted by path length (longest first), and the first match wins. Tie-breaks between equal-length Allow and Disallow rules go to whichever appears first in the file (Python's stable sort), not strictly to Allow as the RFC prefers. Author your Allow/Disallow rules with that in mind, or rely on the longer/more-specific path winning.
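The tie-break behaviour is easy to see in isolation. The following is a minimal sketch, not yoink's code: `order_rules` and `first_match` are hypothetical helpers, and plain prefix matching stands in for the wildcard matcher so that only the ordering is on display.

```python
def order_rules(rules):
    """Longest path first; equal lengths keep file order (sorted() is stable)."""
    return sorted(rules, key=lambda rule: len(rule[1]), reverse=True)


def first_match(rules, path):
    """Return the kind ('allow' or 'disallow') of the first rule that matches."""
    for kind, prefix in order_rules(rules):
        if path.startswith(prefix):
            return kind
    return "allow"  # no rule matched, so the path is allowed


# Two equal-length rules for the same path, in both file orders.
disallow_first = [("disallow", "/page"), ("allow", "/page")]
allow_first = [("allow", "/page"), ("disallow", "/page")]

print(first_match(disallow_first, "/page"))  # disallow
print(first_match(allow_first, "/page"))     # allow

# A longer rule wins no matter where it appears in the file.
mixed = [("disallow", "/docs/"), ("allow", "/docs/public/")]
print(first_match(mixed, "/docs/public/index.html"))  # allow
```

Python's `sorted` stays stable even with `reverse=True`, which is why the rule listed first in the file wins a tie in this sketch.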

Examples:

User-agent: *
Disallow: /private/
Disallow: /*.pdf$
Allow: /private/public-page.html
Crawl-delay: 2

URL	Result	Why
`/about`	allowed	No matching rule
`/private/secrets`	blocked	`Disallow: /private/`
`/private/public-page.html`	allowed	`Allow` is more specific than `Disallow`
`/docs/manual.pdf`	blocked	`Disallow: /*.pdf$`
`/docs/manual.pdf?download=1`	allowed	`$` anchors the pattern to the end of the URL, so the query string breaks the match
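The table can be reproduced with a short standalone sketch. It assumes each pattern is translated into a regular expression (`*` becomes `.*`, a trailing `$` becomes an end anchor) and matched from the start of the path plus query string. `compile_pattern` and `is_allowed` are hypothetical helpers written for this example, not yoink's internals.

```python
import re


def compile_pattern(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and join them with '.*' for each '*' wildcard.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


# The rules from the example file above, in file order.
RULES = [
    ("disallow", "/private/"),
    ("disallow", "/*.pdf$"),
    ("allow", "/private/public-page.html"),
]


def is_allowed(target: str) -> bool:
    """Longest pattern first; the first matching rule decides."""
    ordered = sorted(RULES, key=lambda rule: len(rule[1]), reverse=True)
    for kind, pattern in ordered:
        if compile_pattern(pattern).match(target):  # match() anchors at the start
            return kind == "allow"
    return True  # no rule matched


for target in (
    "/about",
    "/private/secrets",
    "/private/public-page.html",
    "/docs/manual.pdf",
    "/docs/manual.pdf?download=1",
):
    print(target, "allowed" if is_allowed(target) else "blocked")
```

Running it prints the same allowed/blocked results as the table, including the query-string case that slips past the `$` anchor.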

User-agent matching

yoink matches your configured user_agent against the robots.txt User-agent blocks in this order:

Exact match (case-insensitive).
Partial match — bidirectional substring (a in b or b in a). For example, User-agent: yoink matches the default UA yoink/0.3.0 (+...) because "yoink" is a substring of the UA.
Wildcard fallback (User-agent: *).
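The three-step order can be sketched as a small function. This is an illustration of the lookup order described above, not yoink's implementation; `select_group` is a hypothetical name, and skipping empty group names during the partial step is an assumption of the sketch.

```python
from typing import Optional


def select_group(configured_ua: str, group_names: list) -> Optional[str]:
    """Pick the User-agent block to apply: exact, then partial, then '*'."""
    ua = configured_ua.lower()
    names = [name.lower() for name in group_names]

    # 1. Exact match (case-insensitive).
    for name in names:
        if name == ua:
            return name

    # 2. Partial match: substring in either direction. Empty names are skipped
    #    because an empty string is a substring of everything.
    for name in names:
        if name and name != "*" and (name in ua or ua in name):
            return name

    # 3. Wildcard fallback.
    return "*" if "*" in names else None


print(select_group("yoink/0.3.0 (+...)", ["*", "yoink"]))  # yoink
print(select_group("Yoink", ["*", "yoink"]))               # yoink
print(select_group("yoink/0.3.0 (+...)", ["*", "otherbot"]))  # *
```

Because the substring test runs in both directions, a very short configured user_agent can match more blocks than intended: in this sketch a UA of "bot" would select a block named "googlebot".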

Caching

robots.txt is fetched once per domain and cached for 1 hour. This keeps yoink polite for long crawls without re-fetching robots.txt for every URL.

The cache is in-memory and per-Crawler instance — a fresh process or a new Crawler() will re-fetch.
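A per-domain cache with a time-to-live can be sketched in a few lines. This is a generic illustration of the idea, not yoink's cache: `TTLCache` and its methods are hypothetical, and the clock is injectable only so that expiry can be demonstrated without waiting an hour.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # domain -> (stored_at, value)

    def put(self, domain: str, value) -> None:
        self._entries[domain] = (self._clock(), value)

    def get(self, domain: str):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(domain)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[domain]  # expired: the caller should re-fetch
            return None
        return value


# Drive the cache with a fake clock to show expiry after one hour.
now = [0.0]
cache = TTLCache(ttl_seconds=3600.0, clock=lambda: now[0])
cache.put("example.com", "parsed rules")

now[0] = 3599.0
print(cache.get("example.com"))  # parsed rules

now[0] = 3600.0
print(cache.get("example.com"))  # None
```

Since the entries live in a plain dictionary on the instance, nothing survives a process restart, which mirrors the in-memory, per-instance behaviour described above.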

Disabling robots.txt checks

# CLI
yoink crawl https://example.com --no-robots

# Python
config = CrawlConfig(respect_robots=False)

When disabled, yoink doesn't fetch robots.txt at all and crawls freely subject only to your other config.

Inspecting the rules

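The cleanest way to inspect what RobotsChecker saw is to reuse the Crawler's own instance, which already has the Fetcher wired up. Here's a one-shot script that prints what it learned about each domain it visited. It reads the underscore-prefixed _cache attribute, so treat it as a debugging aid rather than a stable API: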

import asyncio
from yoink import Crawler, CrawlConfig
 
async def main():
    crawler = Crawler(CrawlConfig(max_pages=20))
    await crawler.crawl("https://example.com")
 
    rc = crawler.robots_checker
    if rc is None:
        return  # respect_robots was disabled
 
    for domain, cached in rc._cache.items():
        for ua, directives in cached.directives.items():
            print(f"[{domain}] User-agent: {ua}")
            print(f"  rules: {len(directives.rules)}")
            print(f"  crawl_delay: {directives.crawl_delay}")
            print(f"  sitemaps: {directives.sitemaps}")
 
asyncio.run(main())

For ad-hoc checks, call the public is_allowed() method. It's async, so await it from inside a coroutine:

allowed = await crawler.robots_checker.is_allowed("https://example.com/private/")