The web,
excavated.
Strip the surface. Read the strata. Extract the signal. Production-grade scraping infrastructure that escalates only when blocked.
No credit card · Self-hostable · Apache-2.0 licensed
# Stratum 0 — Surface
$ scrape crawl https://target.example/p \
    --max-stratum 2 --llm --schema product.yaml
╭─ field report ─────────────────────────────────────╮
│  1,243 / 1,248 succeeded                   ↑ 99.6% │
├─ stratum 0   HTTP        1,012 pages        $1.01  │
├─ stratum 1   Browser       218 pages        $1.09  │
├─ stratum 2   + CAPTCHA      13 pages        $0.26  │
╰─ filed: results.json                 $2.36 ────────╯

Cheap first.
Deep when forced.
Every URL starts at Stratum 0 — plain HTTP. The block detector reads each response. If a wall appears, the router escalates one stratum at a time. You never pay for a depth you didn't need.
Everything in one box.
No glue code. No second vendor. No CSV sync at 3am. The full pipeline ships in a single binary and a single SQLite file.
Real-browser handshakes via curl_cffi. Akamai hash matched, extension ordering preserved.
Camoufox + Nodriver. Coherent UA, screen, timezone, locale, fonts, WebGL — sourced from real devices.
Per-(host, fingerprint) sticky residential sessions, automatic health scoring, 10-minute cooldowns on burned IPs.
Turnstile, reCAPTCHA v3, hCaptcha. Token injected & resubmitted automatically.
Claude with prompt caching. Drop in a JSON schema; get clean rows. Cache hit ratio = your discount.
Per-host rate limiter, async fan-out, content-addressed raw HTML, full Prometheus surface.
From URL to clean row.
Three operations. No tier-selection logic to maintain. The router decides; you receive.
Paste in the dashboard, POST to the API, or pipe from your CSV. Pick a max stratum.
Each URL starts at the surface. The router escalates only on confirmed block. ~80% never leave Stratum 0.
Stream results live, download JSON / CSV, or push to your webhook. Fully typed.
For people who ship.
A typed Python SDK. An OpenAPI 3.1 REST surface. SSE for live progress. HMAC-signed webhooks. Per-job extraction schemas. Use the dashboard for ad-hoc digs; wire the API into your pipeline for everything else.
- Idempotent job submission with content hashing
- Server-Sent Events for live progress
- Webhooks on completion (HMAC-signed)
- Per-job extraction schemas
- JSON, CSV, NDJSON exports
- Per-fetch cost telemetry — proxy bytes & solver $
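Verifying an HMAC-signed webhook on the receiving end takes only the standard library. A minimal sketch, assuming an HMAC-SHA256 hex digest over the raw request body; the exact header name and digest scheme are whatever your deployment documents, not something this snippet defines.

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, payload: bytes, signature_hex: str) -> bool:
    """Constant-time check of an HMAC-SHA256 webhook signature.

    `payload` must be the raw request body bytes, unmodified;
    re-serializing parsed JSON will change the digest.
    """
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

`hmac.compare_digest` avoids the timing side channel that a plain `==` comparison would leak.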
from scrape import Client, Stratum

client = Client(api_key="sk_live_...")

job = client.jobs.create(
    name="product prices · q2",
    urls=[f"https://shop.example.com/p/{s}" for s in skus],
    max_stratum=Stratum.DEEP,
    schema={
        "title": "string",
        "price": "number",
        "currency": "string",
        "in_stock": "boolean",
    },
)

# Stream findings as they're excavated
for row in client.jobs.stream(job.id):
    print(row.url, row.data["price"])

You see every penny.
Every fetch records what it actually cost: residential proxy bytes plus paid CAPTCHA solver USD. Per row in storage, per tier in Prometheus, per job in the dashboard. No surprise invoice.
scrape_proxy_bytes_total · scrape_solver_cost_usd_total
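As a sketch of what per-row cost telemetry enables, here is one way to total cost per stratum from an exported results file. The field names (`stratum`, `proxy_bytes`, `solver_cost_usd`) and the $8/GB proxy price are illustrative assumptions, not the shipped export schema.

```python
import json
from collections import defaultdict

PROXY_USD_PER_GB = 8.0  # illustrative residential bandwidth price

def cost_by_stratum(path: str) -> dict[int, float]:
    """Sum per-fetch cost (priced proxy bytes + solver USD) per stratum,
    from a JSON array of result rows."""
    totals: dict[int, float] = defaultdict(float)
    with open(path) as f:
        for row in json.load(f):
            cost = row["proxy_bytes"] / 1e9 * PROXY_USD_PER_GB
            cost += row.get("solver_cost_usd", 0.0)
            totals[row["stratum"]] += cost
    return dict(totals)
```

The same numbers arrive pre-aggregated per tier via the Prometheus counters above; this is just the offline, per-file view.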
Honors robots.txt by default. Per-host rate limits enforced before egress. Never auth-walled or PII content.
Bright Data, Decodo, IPRoyal, Oxylabs. Providers with documented ethical-sourcing audits, never malware botnets.
Apache-2.0 licensed. Self-host or use the managed cloud — same code. Auditable, forkable, yours.
Common questions, plain answers.
If you're picking a scraping stack and trying to understand whether Scrape fits, start here. Linked sources for everything.
What makes Scrape different from a simple HTTP scraper or a headless browser?
A single-tier scraper either over-pays (every page goes through a $0.02-per-page browser) or under-delivers (gets blocked on protected pages). Scrape routes each URL through four tiers — TLS-impersonated HTTP, stealth browser, CAPTCHA solver, managed unblock — and only pays for the depth a given page actually requires. ~80% of pages clear at Tier 0.
Does Scrape handle Cloudflare, DataDome, and PerimeterX?
Yes. Cloudflare Turnstile and DataDome are typically beaten by Tier 0 (curl_cffi + residential IP) or Tier 1 (Camoufox stealth Firefox). PerimeterX and other behavioral-scoring vendors require Tier 3 — a commercial managed unblocker (Bright Data Web Unlocker or Scrapfly) wired in via UNBLOCK_PROVIDER.
Can I self-host Scrape?
Yes — the entire stack is Apache-2.0 licensed and ships as a Docker Compose file. The free FlareSolverr container can serve as Tier 3, and Ollama can replace Anthropic for LLM extraction. Self-hosting needs zero paid services.
Does Scrape respect robots.txt?
By default, yes. CRAWL_RESPECT_ROBOTS=true is the shipped default and the orchestrator skips disallowed URLs. Operators can opt out per deployment, but the default is ethical-by-design.
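A minimal pre-egress check looks like the standard library's `robotparser`. This sketch assumes robots.txt has already been fetched for the host; the shipped orchestrator's version adds caching and per-deployment configuration on top of the same idea.

```python
from urllib import robotparser

def allowed(robots_txt: str, url: str, user_agent: str = "scrape-bot") -> bool:
    """Check a URL against already-fetched robots.txt rules (no network).

    The user-agent string is a placeholder; use whatever your crawler
    actually identifies itself as.
    """
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Running the check before the fetch is what "enforced before egress" means: a disallowed URL never generates traffic, at any stratum.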
What does it cost to run at scale?
At ~80% Tier-0 success on 1M pages, total spend is ~$280 (proxy bandwidth + LLM extraction with prompt caching). The same workload through a browser-only scraper is ~$17,000 — Scrape's tier router exists specifically to avoid that bill.
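The arithmetic behind that comparison, with loudly illustrative unit prices (real proxy, solver, and LLM rates vary by provider and page size; only the 1M-page volume and ~80% Tier-0 split come from the claim above):

```python
PAGES = 1_000_000
TIER0_SHARE = 0.80

tier0_pages = int(PAGES * TIER0_SHARE)   # 800,000 plain-HTTP fetches
deeper_pages = PAGES - tier0_pages       # 200,000 escalated fetches

# Assumed blended $/page figures, chosen only to make the math concrete:
COST_TIER0 = 0.0002        # proxy bandwidth + cached LLM extraction
COST_DEEPER = 0.0006       # browser time, solver fees, extraction
COST_BROWSER_ONLY = 0.017  # every fetch through a headless browser

tiered = tier0_pages * COST_TIER0 + deeper_pages * COST_DEEPER
browser_only = PAGES * COST_BROWSER_ONLY

print(f"tiered: ${tiered:,.0f}  browser-only: ${browser_only:,.0f}")
```

The gap is structural, not a rounding artifact: most of the bill in a browser-only stack pays for headroom that ~80% of pages never needed.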
Begin your first dig
in under sixty seconds.
The free tier ships with 10,000 monthly fetches at Stratum 0. No credit card. First user becomes the admin.