robots.txt compliance
How yoink parses, caches, and applies robots.txt directives — Allow, Disallow, Crawl-delay, Sitemap.
yoink respects robots.txt by default. The RobotsChecker is consulted before every fetch, and disallowed URLs are filtered out before they ever hit the queue.
What's supported
- ✅ User-agent matching: exact, partial substring, and * wildcard fallback.
- ✅ Disallow rules with wildcard (*) and end-anchor ($) patterns.
- ✅ Allow rules (longer/more-specific paths win).
- ✅ Crawl-delay: narrows the rate limiter for that domain.
- ✅ Sitemap directives: parsed and stored on each domain's RobotsDirectives.sitemaps list.
- ✅ Per-domain caching with a 1-hour default TTL.
How it fits in
Each url passes through these steps:
- 01 cached for this domain?
- 02 if not, fetch /robots.txt
- 03 match user_agent block
- 04 apply Allow / Disallow rules

Allowed: fetcher.fetch(url) continues to rate-limit gate
Blocked: skip + log + return, never enters the queue
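The gate described above can be sketched as a small async wrapper. This is only an illustration of the control flow, not yoink's actual code: `gated_fetch`, `demo_is_allowed`, and `demo_fetch` are hypothetical names, and the two demo callables stand in for the real RobotsChecker and Fetcher.

```python
import asyncio
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("robots_gate")

async def gated_fetch(url, is_allowed, fetch):
    """Run the robots check first; only allowed URLs reach fetch()."""
    if not await is_allowed(url):
        # skip + log + return: the URL never reaches the fetcher
        logger.info("Skipping %s: disallowed by robots.txt", url)
        return None
    return await fetch(url)

# Hypothetical stand-ins for the real checker and fetcher.
async def demo_is_allowed(url):
    return not urlsplit(url).path.startswith("/private/")

async def demo_fetch(url):
    return f"fetched {url}"

print(asyncio.run(gated_fetch("https://example.com/about", demo_is_allowed, demo_fetch)))
print(asyncio.run(gated_fetch("https://example.com/private/x", demo_is_allowed, demo_fetch)))
```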
Pattern matching
yoink approximates RFC 9309:
- * matches any sequence of characters (greedy).
- $ at the end of a pattern anchors the match to the end of the URL path.
- Rules are sorted by path length (longest first), and the first match wins. Tie-breaks between equal-length Allow and Disallow rules go to whichever appears first in the file (Python's stable sort), not strictly to Allow as the RFC prefers. Author your Allow/Disallow rules with that in mind, or rely on the longer/more-specific path winning.
Examples:
User-agent: *
Disallow: /private/
Disallow: /*.pdf$
Allow: /private/public-page.html
Crawl-delay: 2

| URL | Result | Why |
|---|---|---|
| /about | allowed | No matching rule |
| /private/secrets | blocked | Disallow: /private/ |
| /private/public-page.html | allowed | Allow is more specific than Disallow |
| /docs/manual.pdf | blocked | Disallow: /*.pdf$ |
| /docs/manual.pdf?download=1 | allowed | The $ anchor; query strings break the match |
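To make the matching rules concrete, here is a minimal sketch that reproduces the table above. It is an approximation written for this page, not the code in src/yoink/robots.py: `compile_pattern`, `is_allowed`, and `RULES` are hypothetical names, and it assumes the pattern is matched against the path plus any query string.

```python
import re

def compile_pattern(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + (r"\Z" if anchored else ""))

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules holds (kind, pattern) pairs in file order; kind is 'allow' or 'disallow'."""
    # sorted() is stable, so equal-length patterns keep their file order.
    ordered = sorted(rules, key=lambda rule: len(rule[1]), reverse=True)
    for kind, pattern in ordered:
        if not pattern:
            continue  # an empty pattern matches nothing in this sketch
        if compile_pattern(pattern).match(path):
            return kind == "allow"
    return True  # no rule matched

RULES = [
    ("disallow", "/private/"),
    ("disallow", "/*.pdf$"),
    ("allow", "/private/public-page.html"),
]

for path in ["/about", "/private/secrets", "/private/public-page.html",
             "/docs/manual.pdf", "/docs/manual.pdf?download=1"]:
    print(path, "allowed" if is_allowed(path, RULES) else "blocked")
```

Sorting longest-first and stopping at the first match is what lets the Allow line override the shorter Disallow: /private/ rule, regardless of where it appears in the file.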
User-agent matching
yoink matches your configured user_agent against the robots.txt User-agent blocks in this order:
- Exact match (case-insensitive).
- Partial match: bidirectional substring (a in b or b in a). For example, User-agent: yoink matches the default UA yoink/0.3.0 (+...) because "yoink" is a substring of the UA.
- Wildcard fallback (User-agent: *).
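The three-step lookup above might look roughly like the following. This is a sketch under assumptions: `select_block` and the shape of `blocks` (a dict from User-agent token to its rules) are hypothetical, and it assumes case is folded for the partial match as well as the exact one.

```python
def select_block(user_agent: str, blocks: dict[str, list]) -> list:
    """Pick the rules for user_agent from a {agent_token: rules} mapping."""
    ua = user_agent.lower()
    lowered = {token.lower(): rules for token, rules in blocks.items()}

    # 1. Exact match (case-insensitive).
    if ua in lowered:
        return lowered[ua]

    # 2. Partial match: substring in either direction, skipping '*' and empty tokens.
    for token, rules in lowered.items():
        if token and token != "*" and (token in ua or ua in token):
            return rules

    # 3. Wildcard fallback; an empty list means "no rules apply".
    return lowered.get("*", [])

blocks = {"*": ["wildcard rules"], "yoink": ["yoink rules"], "Googlebot": ["googlebot rules"]}
print(select_block("yoink/0.3.0", blocks))
print(select_block("curl/8.0", blocks))
```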
Caching
robots.txt is fetched once per domain and cached for 1 hour. This keeps yoink polite for long crawls without re-fetching robots.txt for every URL.
The cache is in-memory and per-Crawler instance — a fresh process or a new Crawler() will re-fetch.
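As an illustration of per-domain caching with a TTL, here is a minimal in-memory sketch. `RobotsCache` and `CachedRobots` are hypothetical names invented for this example and are not yoink's internal classes; the injectable `clock` parameter is there only so expiry can be demonstrated without waiting an hour.

```python
import time
from dataclasses import dataclass

@dataclass
class CachedRobots:
    directives: dict
    fetched_at: float

class RobotsCache:
    """In-memory, per-instance cache keyed by domain, with a 1-hour default TTL."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries: dict[str, CachedRobots] = {}

    def get(self, domain: str):
        """Return cached directives, or None if missing or expired."""
        entry = self._entries.get(domain)
        if entry is None:
            return None
        if self._clock() - entry.fetched_at >= self.ttl_seconds:
            del self._entries[domain]  # expired: the caller should re-fetch
            return None
        return entry.directives

    def put(self, domain: str, directives: dict) -> None:
        self._entries[domain] = CachedRobots(directives, self._clock())

# Demo with a fake clock so expiry is visible immediately.
now = [0.0]
cache = RobotsCache(clock=lambda: now[0])
cache.put("example.com", {"*": []})
print(cache.get("example.com"))  # still fresh
now[0] = 3600.0
print(cache.get("example.com"))  # expired after one hour
```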
Disabling robots.txt checks
# CLI
yoink crawl https://example.com --no-robots

# Python
config = CrawlConfig(respect_robots=False)

When disabled, yoink doesn't fetch robots.txt at all and crawls freely subject only to your other config.
Inspecting the rules
The cleanest way to inspect what RobotsChecker saw is to share the Crawler's instance — it already has the Fetcher wired up. Here's a one-shot script that prints what it learned about each domain it visited:
import asyncio
from yoink import Crawler, CrawlConfig

async def main():
    crawler = Crawler(CrawlConfig(max_pages=20))
    await crawler.crawl("https://example.com")

    rc = crawler.robots_checker
    if rc is None:
        return  # respect_robots was disabled

    # _cache is underscore-prefixed (private), so treat this as debugging-only access
    for domain, cached in rc._cache.items():
        for ua, directives in cached.directives.items():
            print(f"[{domain}] User-agent: {ua}")
            print(f"  rules: {len(directives.rules)}")
            print(f"  crawl_delay: {directives.crawl_delay}")
            print(f"  sitemaps: {directives.sitemaps}")

asyncio.run(main())

For ad-hoc is_allowed() checks, use the public method (it's async):

allowed = await crawler.robots_checker.is_allowed("https://example.com/private/")

See also
- Rate limiting: how Crawl-delay interacts with your requests_per_second.
- The RobotsChecker source: src/yoink/robots.py.