
Rate limiting

Per-domain token bucket rate limiting with burst support, minimum delays, and Crawl-delay honoring.


yoink rate-limits at the fetcher gate — every outbound request has to acquire a token before going out. This protects target servers, keeps you on the right side of robots.txt Crawl-delay, and avoids tripping basic anti-abuse heuristics.

The mechanism: token bucket

A token bucket fills at a constant rate (your requests_per_second) and holds a fixed maximum number of tokens (its capacity). Each request consumes one token. If the bucket is empty, the request waits until a token regenerates.

Figure: token bucket (rate_limiter.py). Refill rate 2/s, capacity 1. Each fetcher.fetch(url) consumes one token; if the bucket is empty, the request awaits (1 - tokens) / rate seconds.

This gives you smooth traffic shaping at sustained requests_per_second. The default burst_size=1 means there's no extra burst headroom — the very first request consumes the only token, and subsequent requests pace themselves at exactly your configured rate.
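The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not yoink's actual RateLimiter: the class name TokenBucket, the reserve() method, and the injectable clock parameter are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Illustrative token bucket; NOT yoink's actual implementation."""

    def __init__(self, rate, capacity=1, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum tokens the bucket can hold
        self.tokens = float(capacity)    # the bucket starts full
        self._clock = clock              # injectable so the sketch is testable
        self._last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def reserve(self):
        """Take one token and return how many seconds the caller should wait."""
        self._refill()
        wait = max(0.0, (1.0 - self.tokens) / self.rate)
        self.tokens -= 1.0  # may go negative; the returned wait covers the debt
        return wait
```

With rate=2 and capacity=1, three back-to-back calls to reserve() return waits of 0.0, 0.5, and 1.0 seconds, which is the steady two-per-second pacing described above. A real limiter would sleep for the returned duration (for example with asyncio.sleep) before releasing the request.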

burst_size is a knob on the RateLimiter class but is not currently surfaced on CrawlConfig or the CLI. If you need bursts (e.g., 10 RPS sustained but happy to fire 5 in a row when idle), construct the limiter directly:

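from yoink.rate_limiter import RateLimiter

# 10 requests/second sustained, with up to 5 tokens banked while idle
limiter = RateLimiter(requests_per_second=10.0, burst_size=5)
# CrawlConfig does not accept a limiter, so pass it to your fetcher yourself
# (for example, from a fetcher subclass)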

For most workloads, the requests_per_second=2.0, burst_size=1 defaults are exactly what you want — polite, predictable, no surprises.
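To make the effect of burst_size concrete, the sketch below computes when each request would be released if a batch arrives all at once. It is plain arithmetic under stated assumptions (the bucket starts full and nothing else is consuming tokens); release_times is a hypothetical helper, not part of yoink.

```python
def release_times(n, rate, burst):
    """Earliest release time (seconds) for n requests that all arrive at t=0.

    Assumes the bucket starts full with `burst` tokens and refills at
    `rate` tokens per second.
    """
    times = []
    for i in range(n):
        if i < burst:
            times.append(0.0)                     # served from banked tokens
        else:
            times.append((i - burst + 1) / rate)  # waits for the next refill
    return times


print(release_times(4, rate=2.0, burst=1))   # [0.0, 0.5, 1.0, 1.5]
print(release_times(8, rate=10.0, burst=5))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.3]
```

With the defaults, only the first request goes out immediately and the rest are spaced 0.5 s apart. With burst_size=5 at 10 RPS, the first five go out together and the remainder follow at 0.1 s intervals.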

Per-domain isolation

Rate limits are scoped to each domain you crawl. If --follow-external is enabled and your crawl visits both docs.python.org and python.org, each gets its own independent bucket. A domain whose bucket is drained, or which is throttled by its own Crawl-delay, cannot delay requests to any other domain.

config = CrawlConfig(
    requests_per_second=5.0,   # 5 RPS per domain
    max_concurrency=20,        # at most 20 requests in flight across all domains
)
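One way to picture per-domain isolation is a dictionary of limiter states keyed by hostname. The sketch below is an assumption about the general shape, not yoink's internals: PerDomainLimiter, bucket_for, and the make_bucket factory are hypothetical names.

```python
from urllib.parse import urlsplit


class PerDomainLimiter:
    """Illustrative only: one independent limiter state per hostname."""

    def __init__(self, make_bucket):
        self._make_bucket = make_bucket  # factory called once per new hostname
        self._buckets = {}

    def bucket_for(self, url):
        # Key on the full hostname, so docs.python.org and python.org differ.
        host = urlsplit(url).hostname or ""
        if host not in self._buckets:
            self._buckets[host] = self._make_bucket()
        return self._buckets[host]


limiter = PerDomainLimiter(make_bucket=lambda: {"tokens": 1})
docs = limiter.bucket_for("https://docs.python.org/3/")
root = limiter.bucket_for("https://python.org/")
docs["tokens"] = 0             # draining one domain's bucket...
print(root["tokens"])          # ...leaves the other untouched: 1
```

Because each hostname owns its state, the only limit shared across domains in this picture is the overall concurrency cap.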

request_delay — a wait-time floor

request_delay is a hard floor on the wait time computed by acquire() for each request to a given domain. With request_delay=0.5, every request to that domain (including the first) sleeps at least 500ms before being released, even if the token bucket has tokens available.

yoink crawl https://example.com --rate-limit 5.0 --request-delay 0.5
# Up to 5 RPS by token bucket, but every release sleeps ≥ 500ms

In Python:

config = CrawlConfig(
    requests_per_second=5.0,
    request_delay=0.5,  # seconds; per-acquire floor
)
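The floor behaviour amounts to taking the larger of two waits. The helper below is a hypothetical illustration of that rule (effective_wait is not a yoink function); it assumes the token bucket has already computed its own wait in seconds.

```python
def effective_wait(bucket_wait, request_delay):
    """Seconds to sleep before release: the bucket's wait, floored at request_delay."""
    return max(bucket_wait, request_delay)


print(effective_wait(0.0, 0.5))  # token available, floor applies: 0.5
print(effective_wait(0.8, 0.5))  # bucket is the stricter constraint: 0.8
```

So request_delay never shortens a wait the bucket asks for; it only lengthens waits that would otherwise be shorter than the floor.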

robots.txt Crawl-delay

When respect_robots=True (the default), yoink reads each domain's robots.txt and applies its Crawl-delay directive by reducing the bucket's refill rate to 1 / crawl_delay requests per second — but only if that's stricter than your configured rate. The stricter limit always wins.
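The "stricter limit wins" rule can be sketched with Python's standard-library robots.txt parser. This is an illustration of the arithmetic only; effective_rate is a hypothetical helper, and how yoink parses robots.txt internally is not specified here. Note that the standard-library parser only recognizes whole-number Crawl-delay values.

```python
from urllib.robotparser import RobotFileParser


def effective_rate(configured_rps, robots_lines, user_agent="*"):
    """Return the refill rate after applying Crawl-delay, if one is stricter."""
    parser = RobotFileParser()
    parser.parse(robots_lines)              # accepts an iterable of lines
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    if not delay:
        return configured_rps
    return min(configured_rps, 1.0 / delay)


robots = ["User-agent: *", "Crawl-delay: 2"]
print(effective_rate(5.0, robots))  # Crawl-delay is stricter: 0.5
print(effective_rate(0.2, robots))  # configured rate is stricter: 0.2
```

A Crawl-delay of 2 seconds corresponds to 0.5 requests per second, so it overrides a configured 5.0 but leaves a configured 0.2 unchanged.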

Picking sane defaults

A non-exhaustive heuristic:

| Target | Suggested requests_per_second |
| --- | --- |
| Personal blog | 1.0 |
| Documentation site | 2.0 – 5.0 |
| Public API / large news site | 5.0 – 10.0 |
| Your own staging server | Whatever you want |

If the site you're crawling publishes a Crawl-delay, honor it — yoink does this for you, but you can also set request_delay explicitly to make the constraint visible at the call site.

Disabling rate limiting

You can't turn it fully off, but you can effectively disable it for testing:

config = CrawlConfig(
    requests_per_second=1000,  # absurdly high
    request_delay=0.0,
)

For real workloads: don't.

See also