LLM schema extraction
Claude-powered structured extraction with prompt caching.
When to use it
LLM extraction shines when:
- the schema is stable but the source DOM isn't,
- you scrape many sites and don't want to write per-site selectors, or
- you want partial success: Claude returns nulls for missing fields rather than failing the whole row.
How it works
- HTML → Markdown (selectolax, deterministic)
- Markdown + JSON schema → Claude with cacheable system prompt
- Structured JSON parsed and stored
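The first step can be sketched with the stdlib `html.parser` standing in for selectolax (a minimal illustration of the HTML → Markdown pass, not the actual converter):

```python
from html.parser import HTMLParser


class MarkdownishExtractor(HTMLParser):
    """Crude HTML -> Markdown-ish text: headings become '#' lines,
    and everything inside <script>/<style> is dropped."""

    SKIP = {"script", "style", "noscript"}
    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}

    def __init__(self):
        super().__init__()
        self.out = []
        self._skip_depth = 0   # >0 while inside a skipped tag
        self._prefix = ""      # '# ' etc. while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._prefix = self.HEADINGS[tag] + " "

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._prefix = ""

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()
        if text:
            self.out.append(self._prefix + text)


def html_to_markdown(html: str) -> str:
    parser = MarkdownishExtractor()
    parser.feed(html)
    return "\n\n".join(parser.out)
```

The point is determinism: the same DOM always yields the same Markdown, so the only nondeterministic stage in the pipeline is the model call.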
The system prompt and schema are marked with an ephemeral cache_control breakpoint, so every request in the same crawl job reads them from cache instead of resending them — cutting input-token cost by roughly 90%.
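Concretely, the request shape looks like this (a sketch: `build_request` is a hypothetical helper, and the model ID is assumed from the "Claude Haiku 4.5" mentioned below):

```python
import json


def build_request(page_markdown: str, schema: dict,
                  model: str = "claude-haiku-4-5") -> dict:
    """Assemble Messages API kwargs with the system prompt + schema marked
    as an ephemeral cache breakpoint, so every page in a crawl reuses it."""
    system_blocks = [
        {
            "type": "text",
            "text": (
                "Extract the fields defined by this JSON schema from the page. "
                "Return only JSON; use null for any field you cannot find.\n\n"
                + json.dumps(schema)
            ),
            # Everything up to and including this block is cached; only the
            # per-page user message below is billed at the full input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ]
    return {
        "model": model,
        "max_tokens": 1024,
        "system": system_blocks,  # identical across pages -> cache hits
        "messages": [{"role": "user", "content": page_markdown}],  # varies
    }
```

Because the cached prefix must be byte-identical between calls, the schema is serialized once and reused; any change to it invalidates the cache.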
Schema example
schema.yaml
type: object
properties:
  title: { type: string }
  price: { type: number, description: "strip currency symbols" }
  currency: { type: string, description: "ISO 4217" }
  in_stock: { type: boolean }
required: [title, price, in_stock]

Cost
Claude Haiku 4.5 with prompt caching runs about $0.003 / page for typical product pages (~6k input tokens, ~200 output). At 1M pages that's $3,000 — vs $30k+ without caching.
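Tying the schema example to the final pipeline step, a sketch of parsing the model's JSON reply while tolerating missing fields (field names come from the example schema; the `_missing` bookkeeping key is an assumption, not part of the pipeline as described):

```python
import json

# Field lists mirror the schema.yaml example above.
REQUIRED = ["title", "price", "in_stock"]
ALL_FIELDS = ["title", "price", "currency", "in_stock"]


def parse_extraction(raw: str) -> dict:
    """Parse Claude's JSON reply into a row, filling null for any missing
    field instead of discarding the whole page."""
    data = json.loads(raw)
    row = {field: data.get(field) for field in ALL_FIELDS}
    # Record which required fields came back null so rows can be audited
    # or re-crawled later, rather than silently accepted.
    row["_missing"] = [f for f in REQUIRED if row[f] is None]
    return row
```

This is the "partial success" behavior from above: a page missing its stock indicator still yields a usable row, flagged rather than dropped.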