LLM schema extraction
Claude-powered structured extraction with prompt caching.
When to use it
LLM extraction shines when:
- the schema is stable but the source DOM isn't,
- you scrape many sites and don't want to write per-site selectors, or
- you want partial success: Claude returns nulls for missing fields rather than failing the whole row.
How it works
- HTML → Markdown (selectolax, deterministic)
- Markdown + JSON schema → Claude with cacheable system prompt
- Structured JSON parsed and stored
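The first step can be sketched with the stdlib `html.parser` standing in for selectolax (a minimal illustration of the HTML → Markdown pass, not the actual converter):

```python
from html.parser import HTMLParser


class MarkdownishExtractor(HTMLParser):
    """Crude HTML -> Markdown-ish text: headings become '#' lines,
    and everything inside <script>/<style> is dropped."""

    SKIP = {"script", "style", "noscript"}
    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}

    def __init__(self):
        super().__init__()
        self.out = []
        self._skip_depth = 0   # >0 while inside a skipped tag
        self._prefix = ""      # '# ' etc. while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._prefix = self.HEADINGS[tag] + " "

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._prefix = ""

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()
        if text:
            self.out.append(self._prefix + text)


def html_to_markdown(html: str) -> str:
    parser = MarkdownishExtractor()
    parser.feed(html)
    return "\n\n".join(parser.out)
```

The point is determinism: the same DOM always yields the same Markdown, so the only nondeterministic stage in the pipeline is the model call.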
The system prompt and schema are marked with an ephemeral cache_control breakpoint, so every request in the same crawl job reads them from cache instead of resending them — cutting input-token cost by roughly 90%.
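Concretely, the request shape looks like this (a sketch: `build_request` is a hypothetical helper, and the model ID is assumed from the "Claude Haiku 4.5" mentioned below):

```python
import json


def build_request(page_markdown: str, schema: dict,
                  model: str = "claude-haiku-4-5") -> dict:
    """Assemble Messages API kwargs with the system prompt + schema marked
    as an ephemeral cache breakpoint, so every page in a crawl reuses it."""
    system_blocks = [
        {
            "type": "text",
            "text": (
                "Extract the fields defined by this JSON schema from the page. "
                "Return only JSON; use null for any field you cannot find.\n\n"
                + json.dumps(schema)
            ),
            # Everything up to and including this block is cached; only the
            # per-page user message below is billed at the full input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ]
    return {
        "model": model,
        "max_tokens": 1024,
        "system": system_blocks,  # identical across pages -> cache hits
        "messages": [{"role": "user", "content": page_markdown}],  # varies
    }
```

Because the cached prefix must be byte-identical between calls, the schema is serialized once and reused; any change to it invalidates the cache.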
Schema example
schema.yaml
type: object
properties:
  title: { type: string }
  price: { type: number, description: "strip currency symbols" }
  currency: { type: string, description: "ISO 4217" }
  in_stock: { type: boolean }
required: [title, price, in_stock]

Cost
Claude Haiku 4.5 with prompt caching runs about $0.003 / page for typical product pages (~6k input tokens, ~200 output). At 1M pages that's $3,000 — vs $30k+ without caching.
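Tying the schema example to the final pipeline step, a sketch of parsing the model's JSON reply while tolerating missing fields (field names come from the example schema; the `_missing` bookkeeping key is an assumption, not part of the pipeline as described):

```python
import json

# Field lists mirror the schema.yaml example above.
REQUIRED = ["title", "price", "in_stock"]
ALL_FIELDS = ["title", "price", "currency", "in_stock"]


def parse_extraction(raw: str) -> dict:
    """Parse Claude's JSON reply into a row, filling null for any missing
    field instead of discarding the whole page."""
    data = json.loads(raw)
    row = {field: data.get(field) for field in ALL_FIELDS}
    # Record which required fields came back null so rows can be audited
    # or re-crawled later, rather than silently accepted.
    row["_missing"] = [f for f in REQUIRED if row[f] is None]
    return row
```

This is the "partial success" behavior from above: a page missing its stock indicator still yields a usable row, flagged rather than dropped.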