Filters
URLFilter, DomainFilter, and CombinedFilter — pattern matching, extension filtering, domain allowlists.
from yoink.filters import URLFilter, DomainFilter, CombinedFilter

URLFilter
Pattern-based URL filtering. Auto-detects glob, regex, or substring patterns.
URLFilter(
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    skip_extensions: list[str] | None = None,
)

| Name | Type | Default | Description |
|---|---|---|---|
| include_patterns | list[str] \| None | None | URL must match at least one pattern to pass. Empty = no inclusion filter. |
| exclude_patterns | list[str] \| None | None | URL fails if it matches any pattern. Empty = no exclusion filter. |
| skip_extensions | list[str] \| None | None | Path-suffix filter. Leading dots are stripped; matching is case-insensitive. |
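The skip_extensions behaviour described in the table (leading dots stripped, case-insensitive, applied to the path suffix) can be pictured with a small standalone sketch. The helper names normalize_extensions and has_skipped_extension are hypothetical and not part of yoink; the real implementation may differ, for example in how it treats query strings.

```python
from urllib.parse import urlparse


def normalize_extensions(extensions):
    """Strip leading dots and lowercase each entry, e.g. '.PDF' -> 'pdf'."""
    return {ext.lstrip(".").lower() for ext in extensions if ext.lstrip(".")}


def has_skipped_extension(url, skip_extensions):
    """Return True if the URL path ends with one of the skipped extensions.

    Assumption: only the path is inspected, so a query string is ignored.
    """
    path = urlparse(url).path.lower()
    return any(path.endswith("." + ext) for ext in normalize_extensions(skip_extensions))
```

Under these assumptions, "pdf", ".pdf", and ".PDF" all behave the same way, and a path such as /pdf is not treated as having the pdf extension.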
url_filter = URLFilter(
    include_patterns=["*/blog/*", "*/docs/*"],
    exclude_patterns=["*/private/*", r"^.*\?draft=1$"],
    skip_extensions=["pdf", "zip", "exe"],
)
url_filter.should_crawl("https://example.com/blog/post-1") # True
url_filter.should_crawl("https://example.com/private/x") # False
url_filter.should_crawl("https://example.com/manual.pdf") # False

DomainFilter
Domain allowlist with subdomain matching.
DomainFilter(allowed_domains: list[str] | None = None)

domain_filter = DomainFilter(allowed_domains=["example.com"])
domain_filter.should_crawl("https://example.com/page") # True
domain_filter.should_crawl("https://blog.example.com/x") # True (subdomain)
domain_filter.should_crawl("https://other.com/page") # False
domain_filter.should_crawl("https://evil-example.com/x") # False

Subdomain matching: a URL passes if its hostname is an allowed domain or ends with .{allowed_domain}.
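The subdomain rule can be illustrated with a minimal sketch built on the standard library. The function host_allowed is a hypothetical name used only for illustration, and lowercasing the hostname is an assumption rather than documented behaviour.

```python
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Return True if the URL's hostname equals an allowed domain
    or ends with '.' + that domain (the subdomain rule)."""
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False
```

The leading dot in the suffix check is what keeps evil-example.com from passing as a subdomain of example.com.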
CombinedFilter
Composes a URLFilter and a DomainFilter. This is what Crawler accepts.
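How the two filters are combined is not spelled out here. One plausible reading, sketched below purely as an assumption, is a logical AND in which a missing filter imposes no restriction. SketchCombinedFilter, OnlyHttps, and NoPrivate are hypothetical stand-ins, not yoink classes.

```python
class SketchCombinedFilter:
    """Hypothetical stand-in for illustration; not the real CombinedFilter."""

    def __init__(self, url_filter=None, domain_filter=None):
        self.url_filter = url_filter
        self.domain_filter = domain_filter

    def should_crawl(self, url):
        # Assumption: every configured filter must accept the URL,
        # and a filter left as None is simply skipped.
        filters = [f for f in (self.domain_filter, self.url_filter) if f is not None]
        return all(f.should_crawl(url) for f in filters)


class OnlyHttps:
    """Toy filter: accept only https URLs."""

    def should_crawl(self, url):
        return url.startswith("https://")


class NoPrivate:
    """Toy filter: reject URLs with a /private/ path segment."""

    def should_crawl(self, url):
        return "/private/" not in url
```

With both toy filters supplied, a URL has to be https and free of /private/ to pass; with neither supplied, every URL passes.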
CombinedFilter(
    url_filter: URLFilter | None = None,
    domain_filter: DomainFilter | None = None,
)

The most ergonomic constructor is from_config():
CombinedFilter.from_config(
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    skip_extensions: list[str] | None = None,
    allowed_domains: list[str] | None = None,
) -> CombinedFilter

url_filter = CombinedFilter.from_config(
    include_patterns=["*/api/*"],
    exclude_patterns=["*/internal/*"],
    skip_extensions=["pdf"],
    allowed_domains=["api.example.com"],
)
crawler = Crawler(config=CrawlConfig(), url_filter=url_filter)

Pattern dispatch
| Pattern shape | Matched as |
|---|---|
| Contains * or ? | Glob (fnmatch) |
| Starts ^, ends $, or contains [ | Regex (re.match) |
| Anything else | Substring (in) |
See URL filtering for examples.
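The dispatch table can be sketched as a single function. The name matches is hypothetical. The table does not say which rule wins when a pattern fits more than one row, so this sketch assumes the regex markers are checked first, which lets an anchored pattern such as ^.*\?draft=1$ be treated as a regex even though it contains * and ?.

```python
import fnmatch
import re


def matches(pattern, url):
    """Match a URL against a pattern using the three shapes from the table."""
    # Assumption: regex markers take precedence over glob characters.
    if pattern.startswith("^") or pattern.endswith("$") or "[" in pattern:
        return re.match(pattern, url) is not None
    if "*" in pattern or "?" in pattern:
        return fnmatch.fnmatch(url, pattern)
    # Anything else is a plain substring test.
    return pattern in url
```

If the library checks the glob row first instead, patterns that mix anchors with * or ? would be interpreted differently, so it is worth confirming the precedence against src/yoink/filters.py.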
Custom filters
Anything implementing should_crawl(url: str) -> bool works as a filter. To plug it into the crawler, wrap it with a tiny adapter or use it directly:
from datetime import datetime, timezone

class WeekendOnlyFilter:
    def should_crawl(self, url: str) -> bool:
        # weekday() is 0 for Monday, so 5 and 6 are Saturday and Sunday (UTC)
        return datetime.now(timezone.utc).weekday() >= 5

# CombinedFilter accepts any object with .should_crawl in its url_filter or
# domain_filter slot, so subclassing URLFilter is the cleanest path:
class MyURLFilter(URLFilter):
    def should_crawl(self, url: str) -> bool:
        # Reject URLs whose query string starts with a utm parameter
        if "?utm" in url:
            return False
        return super().should_crawl(url)

See also
- URL filtering concepts.
- The Filters source: src/yoink/filters.py.