One pipeline.
Every wall, beaten.
From the TCP handshake to the structured row in your database — every layer of the modern anti-bot stack has a counter built in.
Anti-bot bypass
TLS, fingerprint, behavior
JA3 / JA4+ matched, HTTP/2 frame ordering preserved, Akamai HTTP/2 fingerprint hash matched. curl-impersonate under the hood.
Camoufox (Firefox, C++-level patches) and Nodriver (raw CDP) with coherent fingerprint bundles.
Per (proxy, fingerprint, host) cookie jar + storage_state. Cookies are never shared across IPs; cross-IP cookie reuse is the canonical ban signal.
Bezier mouse paths, jittered scroll with easing, variable typing cadence — sourced from real session recordings.
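As an illustration of the Bezier mouse paths mentioned above, here is a minimal sketch of sampling an eased cubic Bezier curve between two screen points. It is not the product's implementation: the control points here are randomly jittered rather than sourced from session recordings, and the names `bezier_path`, `ease_in_out`, `steps`, and `spread` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def ease_in_out(t: float) -> float:
    """Smoothstep easing: slow start, faster middle, slow finish."""
    return t * t * (3.0 - 2.0 * t)


def bezier_path(start: Point, end: Point, steps: int = 25,
                spread: float = 80.0,
                rng: Optional[random.Random] = None) -> List[Point]:
    """Sample steps + 1 points on a cubic Bezier curve from start to end.

    The two control points sit about 1/3 and 2/3 along the straight line
    and are offset by up to `spread` pixels (an illustrative value), so
    repeated calls give different curves.
    """
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)

    points: List[Point] = []
    for i in range(steps + 1):
        t = ease_in_out(i / steps)  # eased parameter bunches points near both ends
        u = 1.0 - t
        x = u**3 * x0 + 3 * u * u * t * x1 + 3 * u * t * t * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t**3 * y3
        points.append((x, y))
    return points


path = bezier_path((0.0, 0.0), (300.0, 120.0), steps=20, rng=random.Random(7))
```

Each point in `path` would then be fed to whatever mouse-move call the browser driver exposes, with a small delay between points.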
Proxies & geography
Sticky sessions with health scoring
Decodo, IPRoyal, Bright Data, Oxylabs, custom — switch with one env var.
The same logical user keeps the same exit IP for a configurable sticky window. Health is scored per session.
ISO 3166-1 alpha-2 country code per job. WireGuard egress for non-HTTP fingerprinting.
Three consecutive blocks → 10-minute cooldown, fresh sticky id. No ban-list maintenance.
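The cooldown rule above (three consecutive blocks, then a 10-minute cooldown and a fresh sticky id) can be sketched as a small state machine. The class and field names below are hypothetical, and the sticky id format is an assumption; only the threshold and the cooldown length come from the text.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable

BLOCK_THRESHOLD = 3          # consecutive blocks before cooling down
COOLDOWN_SECONDS = 10 * 60   # 10-minute cooldown


def _new_sticky_id() -> str:
    # Assumed format: a short random hex string.
    return uuid.uuid4().hex[:12]


@dataclass
class StickySession:
    # The clock is injectable so the logic can be tested without sleeping.
    clock: Callable[[], float] = time.monotonic
    sticky_id: str = field(default_factory=_new_sticky_id)
    consecutive_blocks: int = 0
    cooldown_until: float = 0.0

    def record(self, blocked: bool) -> None:
        """Update health after one fetch."""
        if not blocked:
            self.consecutive_blocks = 0  # any success resets the streak
            return
        self.consecutive_blocks += 1
        if self.consecutive_blocks >= BLOCK_THRESHOLD:
            # Rotate to a fresh sticky id and rest the session.
            self.cooldown_until = self.clock() + COOLDOWN_SECONDS
            self.sticky_id = _new_sticky_id()
            self.consecutive_blocks = 0

    def available(self) -> bool:
        """True once any cooldown has elapsed."""
        return self.clock() >= self.cooldown_until
```

Because the session rotates its own id after the cooldown is set, no separate ban list has to be kept.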
CAPTCHA solving
Token injection at the browser
Detected on-page, sitekey extracted, token injected via CapSolver in ~5s.
Score-based — clean residential IP + behavioral warm-up earns a passing score.
Image-puzzle solving via vision-AI providers.
Swap CapSolver for any provider via the CaptchaSolver interface.
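The text names a CaptchaSolver interface but does not show its methods, so the following is only a guess at what such a seam might look like. The `solve` signature, the `kind` values, and `StubSolver` are all hypothetical; no real provider API is called here.

```python
from typing import List, Protocol, Tuple


class CaptchaSolver(Protocol):
    """Hypothetical provider seam: anything with a matching solve() fits."""

    def solve(self, kind: str, sitekey: str, page_url: str) -> str:
        """Return a solution token for the given challenge."""
        ...


class StubSolver:
    """Stand-in provider for tests; a real one would call a solving service."""

    def __init__(self, token: str = "stub-token") -> None:
        self.token = token
        self.calls: List[Tuple[str, str, str]] = []

    def solve(self, kind: str, sitekey: str, page_url: str) -> str:
        self.calls.append((kind, sitekey, page_url))
        return self.token


def obtain_token(solver: CaptchaSolver, kind: str, sitekey: str,
                 page_url: str) -> str:
    """Ask the configured solver for a token and reject empty answers."""
    token = solver.solve(kind, sitekey, page_url)
    if not token:
        raise RuntimeError("solver returned an empty token")
    return token


token = obtain_token(StubSolver("abc"), "recaptcha_v2", "site-key",
                     "https://example.com/login")
```

With a structural `Protocol`, swapping providers means passing a different object; nothing has to inherit from a base class.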
Extraction
Selectors & schema-driven AI
Hand-tuned CSS extractors when the schema is stable — 100× cheaper than LLM extraction.
Drop in a JSON schema, get structured rows. System + schema cached for ~90% input-token savings.
Selectolax-based HTML→Markdown, designed for token-efficient LLM consumption.
Each row carries a confidence proxy from the LLM's cache hit ratio + structural agreement.
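To make the HTML→Markdown idea concrete, here is a rough sketch that strips non-content tags and emits compact Markdown. It uses the standard library's `html.parser` as a stand-in, not Selectolax, and the tag sets and class name are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from typing import List

SKIP = {"script", "style", "nav", "footer"}   # dropped entirely
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCKS = {"p", "li", "h1", "h2", "h3"}         # each becomes one output line


class MarkdownExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.lines: List[str] = []
        self._buf: List[str] = []
        self._prefix = ""
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in BLOCKS:
            self._flush()
            self._prefix = HEADINGS.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCKS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def _flush(self) -> None:
        # Collapse whitespace so inline tags do not leave stray gaps.
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(self._prefix + text)
        self._buf = []
        self._prefix = ""


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block tag
    return "\n".join(parser.lines)
```

Dropping scripts, styles, and navigation before the text reaches the model is where most of the token savings in this kind of conversion would come from.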
Orchestration & ops
Concurrency, queues, observability
Tiers: 00=HTTP, 01=Browser, 02=Browser+CAPTCHA, 03=Unblock API. Auto-promotes to the next tier on block detection.
Concurrency cap + min delay enforced before a request leaves the box. Honors robots.txt.
Live progress over Server-Sent Events. HMAC-signed webhooks on completion.
SQLite for single-box, Postgres for scale. Content-addressed raw HTML for replay.
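On the receiving side, an HMAC-signed completion webhook like the one mentioned above is typically checked by recomputing the signature over the raw request body. The sketch below assumes HMAC-SHA256 with a hex-encoded digest and a shared secret; the actual algorithm, encoding, and header name are not stated in the text.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_payload(secret, body)
    # Compare as bytes so non-ASCII input is rejected rather than raising.
    return hmac.compare_digest(expected.encode("utf-8"),
                               received.encode("utf-8"))


secret = b"example-shared-secret"            # placeholder value
body = b'{"job_id": "123", "status": "done"}'
signature = sign_payload(secret, body)
```

Verification should run against the exact bytes received, before any JSON parsing, since re-serializing the payload can change the byte sequence and invalidate the signature.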
Observability you can trust.
Prometheus metrics out of the box. Per-domain success-rate dashboards. Tier mix and cost per 1k pages. Block-rate alerts that page you when something starts failing — not after the data lake is half-empty.
# HELP scrape_fetches_total Total fetches
# TYPE scrape_fetches_total counter
scrape_fetches_total{tier="0",block="none",ok="true"} 12384
scrape_fetches_total{tier="1",block="none",ok="true"} 2401
scrape_fetches_total{tier="0",block="challenge",ok="false"} 312
scrape_fetches_total{tier="2",block="none",ok="true"} 47
# HELP scrape_fetch_latency_seconds Fetch latency in seconds
# TYPE scrape_fetch_latency_seconds histogram
scrape_fetch_latency_seconds_bucket{tier="0",le="0.5"} 11890
scrape_fetch_latency_seconds_bucket{tier="1",le="5.0"} 2390
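
A block-rate alert ultimately rests on a ratio of blocked fetches to all fetches. As a rough sketch, the per-tier block rate can be derived from counter samples shaped like the ones above; in practice this would be a query in the monitoring system rather than hand parsing. The function name and the rule that any `block` value other than `none` counts as blocked are assumptions.

```python
import re
from collections import defaultdict
from typing import Dict

SAMPLE = """\
# TYPE scrape_fetches_total counter
scrape_fetches_total{tier="0",block="none",ok="true"} 12384
scrape_fetches_total{tier="1",block="none",ok="true"} 2401
scrape_fetches_total{tier="0",block="challenge",ok="false"} 312
scrape_fetches_total{tier="2",block="none",ok="true"} 47
"""

LINE_RE = re.compile(
    r'^scrape_fetches_total\{(?P<labels>[^}]*)\}\s+(?P<value>\d+(?:\.\d+)?)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def block_rate_by_tier(text: str) -> Dict[str, float]:
    """Return blocked / total fetches per tier label."""
    total: Dict[str, float] = defaultdict(float)
    blocked: Dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # comment lines and other metrics are ignored
        labels = dict(LABEL_RE.findall(match.group("labels")))
        tier = labels.get("tier", "unknown")
        value = float(match.group("value"))
        total[tier] += value
        # Only the tier and block labels are read; ok is not needed here.
        if labels.get("block", "none") != "none":
            blocked[tier] += value
    return {tier: blocked[tier] / total[tier] for tier in total if total[tier]}


rates = block_rate_by_tier(SAMPLE)
```

For the sample counters, tier 0 comes out at 312 blocked out of 12696 fetches, roughly 2.5%, while tiers 1 and 2 show no blocks.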