Rate limiting
Per-domain token bucket rate limiting with burst support, minimum delays, and Crawl-delay honoring.
yoink rate-limits at the fetcher gate — every outbound request has to acquire a token before going out. This protects target servers, keeps you on the right side of robots.txt Crawl-delay, and avoids tripping basic anti-abuse heuristics.
The mechanism: token bucket
A token bucket fills at a constant rate (your requests_per_second) and holds a fixed maximum number of tokens (its capacity). Each request consumes one token. If the bucket is empty, the request waits until a token regenerates.
This shapes traffic smoothly at a sustained rate of requests_per_second. The default burst_size=1 leaves no extra burst headroom: the very first request consumes the only token, and subsequent requests are paced at exactly your configured rate.
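To make the mechanism concrete, here is a minimal sketch of a token bucket. It is an illustration only, not yoink's RateLimiter: the class name TokenBucket, the wait_time method, and the injectable clock argument are all assumptions made for this example.

```python
import time


class TokenBucket:
    """Illustrative token bucket (hypothetical; not yoink's RateLimiter)."""

    def __init__(self, rate, capacity=1, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum tokens held
        self.tokens = float(capacity)    # the bucket starts full
        self.clock = clock               # injectable so tests can fake time
        self.last = clock()

    def _refill(self):
        # Add tokens for the elapsed time, never exceeding capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def wait_time(self):
        """Reserve one token and return the seconds to sleep before sending."""
        self._refill()
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0
        # A negative balance is a debt that is repaid at `rate` tokens per second.
        return -self.tokens / self.rate
```

With rate=2.0 and capacity=1, the first call returns 0.0 and an immediate second call returns 0.5, which is the pacing the defaults describe. Raising capacity to 5 lets five back-to-back calls through before any wait is imposed.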
burst_size is a knob on the RateLimiter class but is not currently surfaced on CrawlConfig or the CLI. If you need bursts (e.g., 10 RPS sustained but happy to fire 5 in a row when idle), construct the limiter directly:
from yoink.rate_limiter import RateLimiter
limiter = RateLimiter(requests_per_second=10.0, burst_size=5)
# then pass to your fetcher manually if subclassing
For most workloads, the requests_per_second=2.0, burst_size=1 defaults are exactly what you want: polite, predictable, no surprises.
Per-domain isolation
Rate limits are scoped to each domain you crawl. If --follow-external is enabled and your crawl visits both docs.python.org and python.org, they each get an independent bucket. Misbehaving on one domain can't slow another.
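One way to picture the isolation is a registry that creates a separate bucket the first time each host is seen. The sketch below is hypothetical: DomainBuckets and for_url are names invented for this example, and keying on the URL's hostname is an assumption about how yoink defines a domain.

```python
from urllib.parse import urlsplit


class DomainBuckets:
    """Hypothetical registry holding one independent bucket per hostname."""

    def __init__(self, make_bucket):
        self._make_bucket = make_bucket  # zero-argument factory for a new bucket
        self._buckets = {}

    def for_url(self, url):
        # Assumption: the key is the hostname, so docs.python.org and
        # python.org map to different buckets.
        host = urlsplit(url).hostname or ""
        if host not in self._buckets:
            self._buckets[host] = self._make_bucket()
        return self._buckets[host]


# Any object works as a bucket here; a plain dict stands in for real limiter state.
buckets = DomainBuckets(make_bucket=dict)
docs = buckets.for_url("https://docs.python.org/3/")
main = buckets.for_url("https://python.org/")
```

Here docs and main are distinct objects, so draining one has no effect on the other, while a second URL on docs.python.org gets the same bucket back.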
config = CrawlConfig(
    requests_per_second=5.0,  # 5 RPS per domain
    max_concurrency=20,       # but only 20 concurrent overall
)
request_delay: a wait-time floor
request_delay is a hard floor on the wait time computed by acquire() for each request to a given domain. With request_delay=0.5, every request to that domain (including the first) sleeps at least 500ms before being released, even if the token bucket has tokens available.
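As a rough sketch of that floor, assuming it is applied as a maximum over the bucket's own wait rather than added on top of it (floored_wait is a hypothetical helper, not part of yoink):

```python
def floored_wait(bucket_wait: float, request_delay: float) -> float:
    """Return the sleep time once the request_delay floor is applied."""
    # Assumption: the floor never shortens a longer wait from the bucket.
    return max(bucket_wait, request_delay)
```

Under this assumption, a bucket with a token available (wait 0.0) still yields a 0.5 second sleep when request_delay=0.5, while a bucket wait of 0.8 seconds is left unchanged.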
yoink crawl https://example.com --rate-limit 5.0 --request-delay 0.5
# Up to 5 RPS by token bucket, but every release sleeps ≥ 500ms
In Python:
config = CrawlConfig(
    requests_per_second=5.0,
    request_delay=0.5,  # seconds; per-acquire floor
)
robots.txt Crawl-delay
When respect_robots=True (the default), yoink reads each domain's robots.txt and applies its Crawl-delay directive by reducing the bucket's refill rate to 1 / crawl_delay requests per second — but only if that's stricter than your configured rate. The stricter limit always wins.
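The "stricter limit wins" rule can be sketched as a small function. This is an illustration under assumptions: effective_rate is a hypothetical helper, and Python's standard-library urllib.robotparser stands in for whatever parser yoink uses internally.

```python
from urllib import robotparser


def effective_rate(configured_rps, crawl_delay):
    """Return the refill rate after Crawl-delay; the lower rate wins."""
    if not crawl_delay or crawl_delay <= 0:
        return configured_rps  # no usable directive: keep the configured rate
    return min(configured_rps, 1.0 / crawl_delay)


# Parse an in-memory robots.txt instead of fetching one over the network.
parser = robotparser.RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 2"])
delay = parser.crawl_delay("*")

fast_config = effective_rate(5.0, delay)   # Crawl-delay is stricter
slow_config = effective_rate(0.25, delay)  # configured rate is stricter
```

With a Crawl-delay of 2 seconds, a configured 5.0 RPS drops to 0.5 RPS, while a configured 0.25 RPS is already stricter and stays as it is.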
Picking sane defaults
A non-exhaustive heuristic:
| Target | Suggested requests_per_second |
|---|---|
| Personal blog | 1.0 |
| Documentation site | 2.0 – 5.0 |
| Public API / large news site | 5.0 – 10.0 |
| Your own staging server | Whatever you want |
If the site you're crawling publishes a Crawl-delay, honor it — yoink does this for you, but you can also set request_delay explicitly to make the constraint visible at the call site.
Disabling rate limiting
You can't turn it fully off, but you can effectively disable it for testing:
config = CrawlConfig(
    requests_per_second=1000,  # absurdly high
    request_delay=0.0,
)
For real workloads: don't.
See also
- CrawlConfig.requests_per_second and request_delay reference.
- robots.txt compliance: how Crawl-delay is parsed and applied.
- The RateLimiter module: src/yoink/rate_limiter.py.