Per-site CSS selectors
Hand-tuned extractors when the schema is stable.
When to use selectors
If you're scraping a known site with a stable structure, a hand-written CSS selector is ~100× cheaper than LLM extraction and more reliable. Selectors run first during extraction; the LLM is the fallback.
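As a rough illustration of the selector-first, LLM-fallback order, here is a minimal sketch. The names `extract` and `llm_extract` are hypothetical, the registry is a local stand-in, and it assumes the fallback fires only when no selector is registered for the host. It uses an exact-host lookup for brevity.

```python
from typing import Any, Callable
from urllib.parse import urlsplit

Extractor = Callable[[str, str], dict[str, Any]]

# Hypothetical stand-in for the real registry: hostname -> extractor.
SELECTOR_REGISTRY: dict[str, Extractor] = {}


def llm_extract(html: str, url: str) -> dict[str, Any]:
    """Placeholder for the expensive LLM path (assumed name, not the real API)."""
    return {"url": url, "source": "llm"}


def extract(html: str, url: str) -> dict[str, Any]:
    """Use the registered selector if there is one, otherwise fall back to the LLM."""
    host = (urlsplit(url).hostname or "").lower()
    extractor = SELECTOR_REGISTRY.get(host)
    if extractor is not None:
        return extractor(html, url)
    return llm_extract(html, url)
```

With this shape, the cheap path costs one dictionary lookup, and the LLM is only invoked for hosts nobody has written a selector for.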
Defining one
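src/scrape/extractors/selectors/__init__.py
```python
from typing import Any

from selectolax.parser import HTMLParser  # assumed: the calls below match selectolax


def my_extractor(html: str | bytes, url: str) -> dict[str, Any]:
    tree = HTMLParser(html)
    # css_first returns None when nothing matches, so each field is guarded.
    title = tree.css_first("h1.product-title")
    price = tree.css_first(".price-now")
    return {
        "url": url,
        "title": title.text(strip=True) if title else None,
        "price": price.text(strip=True) if price else None,
    }


# SELECTOR_REGISTRY is assumed to be defined earlier in this module.
SELECTOR_REGISTRY["myshop.com"] = my_extractor
```
Lookup rules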
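The orchestrator tries the FQDN first, then falls back to the longest registered suffix of the hostname, stopping at eTLD+1. So registering example.com works for both example.com and products.example.com, while a separate entry for products.example.com takes precedence for that host.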
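One way to read the lookup order is sketched below: try the exact host, then progressively shorter suffixes, and stop at eTLD+1. The function names `resolve` and `naive_etld_plus_one` are hypothetical. Computing eTLD+1 properly requires the Public Suffix List, so the last-two-labels helper here is a deliberate simplification that is wrong for suffixes such as co.uk.

```python
from typing import Any, Callable, Optional

Extractor = Callable[[str, str], dict[str, Any]]


def naive_etld_plus_one(host: str) -> str:
    """Last two labels only. Assumption for illustration; real code needs the Public Suffix List."""
    return ".".join(host.split(".")[-2:])


def resolve(host: str, registry: dict[str, Extractor]) -> Optional[Extractor]:
    """Return the extractor for the longest registered suffix of host, or None."""
    host = host.lower().rstrip(".")
    floor = naive_etld_plus_one(host)
    labels = host.split(".")
    for i in range(len(labels)):
        # i == 0 is the full FQDN; each later step drops the leftmost label.
        candidate = ".".join(labels[i:])
        if candidate in registry:
            return registry[candidate]
        if candidate == floor:
            break  # do not go below eTLD+1 (never match a bare public suffix)
    return None
```

Under this reading, an entry for a parent domain covers its subdomains, and a more specific entry wins whenever one exists, because longer candidates are checked first.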