Per-site CSS selectors

Hand-tuned extractors when the schema is stable.

When to use selectors

If you're scraping a known site with a stable structure, a hand-written CSS selector is roughly 100× cheaper than LLM extraction and more reliable. Selectors are tried first during extraction; the LLM is the fallback.
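To make the selector-first, LLM-fallback order concrete, here is a minimal sketch. It is not the project's actual orchestrator: `REGISTRY`, `llm_extract`, and `extract` are hypothetical names used only for illustration, and the lookup is a plain exact-host match.

```python
from typing import Any, Callable
from urllib.parse import urlsplit

Extractor = Callable[[str, str], dict[str, Any]]

# Hypothetical registry for this sketch; maps a hostname to its extractor.
REGISTRY: dict[str, Extractor] = {}


def llm_extract(html: str, url: str) -> dict[str, Any]:
    """Placeholder for the expensive LLM path (assumed, not the real call)."""
    return {"url": url, "source": "llm"}


def extract(html: str, url: str) -> dict[str, Any]:
    """Use a registered selector extractor if one exists, else fall back to the LLM."""
    host = (urlsplit(url).hostname or "").lower()
    extractor = REGISTRY.get(host)
    if extractor is not None:
        return extractor(html, url)  # cheap, deterministic path
    return llm_extract(html, url)    # fallback for sites with no selector
```

With this shape, adding a selector for a site only means adding a registry entry; callers of `extract` do not change.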

Defining one

filed under · src/scrape/extractors/selectors/__init__.py · python

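from typing import Any

# Assumed import: css_first() and text(strip=True) match selectolax's HTMLParser API.
from selectolax.parser import HTMLParser


def my_extractor(html: str | bytes, url: str) -> dict[str, Any]:
    tree = HTMLParser(html)
    # css_first() returns None when nothing matches, so guard before calling .text().
    title = tree.css_first("h1.product-title")
    price = tree.css_first(".price-now")
    return {
        "url": url,
        "title": title.text(strip=True) if title is not None else None,
        "price": price.text(strip=True) if price is not None else None,
    }

# Register the extractor under the site's domain.
SELECTOR_REGISTRY["myshop.com"] = my_extractor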

Lookup rules

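The orchestrator tries the exact FQDN first, then falls back to the longest registered suffix of the host, down to its eTLD+1. So registering example.com works for both example.com and products.example.com. If products.example.com is also registered, that entry wins for that host because the FQDN is tried first.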
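A minimal sketch of one way such a lookup could work, assuming registry keys are matched as domain suffixes of the requested host. `find_extractor` is a hypothetical helper, not the orchestrator's real function. It also approximates eTLD+1 by never trying a single-label suffix; a real implementation would consult the Public Suffix List so that hosts under suffixes like co.uk stop at the right boundary.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def find_extractor(host: str, registry: dict[str, T]) -> Optional[T]:
    """Return the entry for the exact host, else for its longest registered suffix."""
    host = host.lower().rstrip(".")

    # 1. The exact FQDN wins.
    if host in registry:
        return registry[host]

    # 2. Strip the leftmost label one at a time, so longer suffixes are tried
    #    before shorter ones. Stop before the bare TLD (a rough stand-in for
    #    stopping at eTLD+1).
    labels = host.split(".")
    for i in range(1, len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in registry:
            return registry[candidate]

    return None
```

Under these assumptions, a registry holding both example.com and products.example.com resolves img.products.example.com to the products entry and blog.example.com to the example.com entry.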
