URL filtering
Include patterns, exclude patterns, file-extension filters, and domain filters — combine for precise targeting.
Most crawls don't want every URL. URL filters tell yoink which pages to follow and which to skip before they hit the queue.
The filter pipeline
For each candidate URL, CombinedFilter checks filters in this order. The first one that says "no" wins:
1. allowed_domains: honors subdomains
2. skip_extensions: fast path-suffix check
3. include_patterns: must match ≥ 1
4. exclude_patterns: must match 0
If any stage rejects it, the URL is dropped and the filtered set is updated. It never reaches the queue and is never fetched. The stage that rejected it is logged at debug level.
DomainFilter runs first because it's a fast hostname check; if you've explicitly allowlisted a domain set, everything else is irrelevant for URLs outside it. Inside URLFilter, the order is extension → include → exclude — the cheap path-suffix check before any pattern matching.
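The "first rejection wins" behavior can be sketched in a few lines. This is a minimal illustration, not yoink's implementation: `first_rejection` and the lambda predicates are hypothetical stand-ins for the four stages, listed in the documented order.

```python
from typing import Callable, List, Optional, Tuple

# Each stage is a (name, predicate) pair; the predicate returns True to keep the URL.
Stage = Tuple[str, Callable[[str], bool]]


def first_rejection(url: str, stages: List[Stage]) -> Optional[str]:
    """Return the name of the first stage that rejects `url`, or None if all pass."""
    for name, accepts in stages:
        if not accepts(url):
            return name  # short-circuit: later stages never run
    return None


# Hypothetical stand-ins for the four stages (assumed, simplified checks).
stages: List[Stage] = [
    ("allowed_domains", lambda u: "//example.com/" in u),
    ("skip_extensions", lambda u: not u.lower().endswith(".pdf")),
    ("include_patterns", lambda u: "/blog/" in u or "/docs/" in u),
    ("exclude_patterns", lambda u: "/private/" not in u),
]

print(first_rejection("https://example.com/blog/post", stages))      # None
print(first_rejection("https://example.com/blog/file.pdf", stages))  # skip_extensions
print(first_rejection("https://other.org/blog/post", stages))        # allowed_domains
```

Returning the stage name rather than a bare boolean mirrors the debug log described above: the caller knows which stage rejected the URL.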
CLI usage
yoink crawl https://example.com \
--include "*/blog/*" \
--include "*/docs/*" \
--exclude "*/private/*" \
--skip-extensions pdf,zip,exe
--include and --exclude are repeatable. --skip-extensions is comma-separated.
Python usage
from yoink import Crawler, CrawlConfig
from yoink.filters import CombinedFilter
url_filter = CombinedFilter.from_config(
include_patterns=["*/blog/*", "*/docs/*"],
exclude_patterns=["*/private/*"],
skip_extensions=["pdf", "zip", "exe"],
allowed_domains=["example.com", "blog.example.com"],
)
crawler = Crawler(config=CrawlConfig(), url_filter=url_filter)
pages = await crawler.crawl("https://example.com")
Pattern syntax
yoink auto-detects the kind of pattern based on its shape:
| Pattern shape | Treated as | Example |
|---|---|---|
| Contains * or ? | Glob | */blog/*, *.html |
| Starts ^ / ends $ / has [ | Regex | ^https://example\.com/v\d+/.*$ |
| Anything else | Substring match | /api/ |
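A rough sketch of shape-based detection follows. The table does not say which rule takes precedence when a pattern fits more than one shape, so this sketch assumes the regex markers are checked first (regexes such as the one in the table also contain `*`). `classify` and `matches` are hypothetical names, and the standard-library `fnmatch` and `re` modules stand in for whatever yoink uses internally.

```python
import fnmatch
import re


def classify(pattern: str) -> str:
    """Guess the pattern kind from its shape."""
    # Assumption: regex markers win over glob markers, since regexes
    # commonly contain '*' or '?' as well.
    if pattern.startswith("^") or pattern.endswith("$") or "[" in pattern:
        return "regex"
    if "*" in pattern or "?" in pattern:
        return "glob"
    return "substring"


def matches(pattern: str, url: str) -> bool:
    """Match `url` against `pattern` according to its detected kind."""
    kind = classify(pattern)
    if kind == "regex":
        return re.search(pattern, url) is not None
    if kind == "glob":
        # fnmatchcase matches the whole string; '*' also spans '/'.
        return fnmatch.fnmatchcase(url, pattern)
    return pattern in url


for p in ["*/blog/*", r"^https://example\.com/v\d+/.*$", "/api/"]:
    print(p, "->", classify(p))
```

If yoink checks glob markers first instead, patterns that mix `*` with `^` or `[` would be classified differently, so it is worth testing ambiguous patterns against the real filter.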
Glob examples
# All blog posts
"*/blog/*"
# Anything under /docs/, any depth
"*/docs/*"
# Specific URL with placeholder
"https://example.com/posts/?"
Regex examples
# Versioned API URLs
r"^https://api\.example\.com/v\d+/.*$"
# Posts from 2024 through 2099
r"/posts/(202[4-9]|20[3-9]\d)/.*"
Substring examples
"/api/" # any URL containing /api/
"draft" # any URL containing 'draft'
Extension filtering
Inside URLFilter, skip_extensions is checked before include/exclude patterns because it's cheap. It matches the lowercased URL path:
skip_extensions=["pdf", "zip", "exe", "jpg", "png"]
You don't need the leading dot; yoink strips it. pdf, .pdf, and PDF all work.
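The normalization and path-suffix check described above might look roughly like this. It is a sketch under assumptions, not yoink's code: `normalize_extensions` and `has_skipped_extension` are hypothetical helpers, and ignoring the query string is an assumption based on the statement that the check runs against the URL path.

```python
from typing import FrozenSet, Iterable
from urllib.parse import urlsplit


def normalize_extensions(extensions: Iterable[str]) -> FrozenSet[str]:
    """Lowercase each extension and drop any leading dot."""
    return frozenset(ext.lower().lstrip(".") for ext in extensions)


def has_skipped_extension(url: str, skip: FrozenSet[str]) -> bool:
    """Check the lowercased URL path (query string excluded) for a skipped suffix."""
    path = urlsplit(url).path.lower()
    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return False  # no extension on the final path segment
    return last_segment.rsplit(".", 1)[-1] in skip


skip = normalize_extensions(["pdf", ".pdf", "PDF", "zip"])
print(sorted(skip))
print(has_skipped_extension("https://example.com/files/Report.PDF?download=1", skip))
print(has_skipped_extension("https://example.com/pdf", skip))
```

Only the final path segment is inspected, so a dotted directory name such as `/v1.2/page` is not mistaken for a file extension.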
Domain filtering
By default, yoink stays on the start URL's domain. With --follow-external, it'll follow links anywhere. To allow specific external domains only:
from yoink import Crawler, CrawlConfig
from yoink.filters import DomainFilter, CombinedFilter
domain_filter = DomainFilter(allowed_domains=["example.com", "docs.example.com"])
url_filter = CombinedFilter(domain_filter=domain_filter)
crawler = Crawler(
config=CrawlConfig(follow_external=True),
url_filter=url_filter,
)
Domain matching honors subdomains: allowed_domains=["example.com"] matches example.com, www.example.com, and blog.example.com, but not evil-example.com.
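The subdomain rule comes down to an exact match or a match on a dot-prefixed suffix, which is why evil-example.com is rejected. A minimal sketch, assuming a hypothetical helper `host_allowed` (not part of yoink's API):

```python
from typing import Iterable
from urllib.parse import urlsplit


def host_allowed(url: str, allowed_domains: Iterable[str]) -> bool:
    """True if the URL's host equals an allowed domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        # The "." prefix keeps evil-example.com from matching example.com.
        if host == domain or host.endswith("." + domain):
            return True
    return False


allowed = ["example.com"]
print(host_allowed("https://blog.example.com/post", allowed))
print(host_allowed("https://evil-example.com/", allowed))
```

A plain `host.endswith(domain)` would accept evil-example.com, so the dot boundary is the important detail.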
Combining filters
Use CombinedFilter.from_config(...) for the common case:
from yoink.filters import CombinedFilter
url_filter = CombinedFilter.from_config(
include_patterns=["*/api/*"],
exclude_patterns=["*/api/internal/*"],
skip_extensions=["pdf"],
allowed_domains=["api.example.com"],
)
Or compose lower-level filters explicitly:
from yoink.filters import URLFilter, DomainFilter, CombinedFilter
url_filter = CombinedFilter(
url_filter=URLFilter(
include_patterns=["*/api/*"],
exclude_patterns=["*/internal/*"],
skip_extensions=["pdf"],
),
domain_filter=DomainFilter(allowed_domains=["api.example.com"]),
)
See also
- Filters API reference.
- The Filters source: src/yoink/filters.py.