The legal landscape for web scraping has shifted twice in the last 18 months — once when the Ninth Circuit re-affirmed hiQ in 2025, and again when the EU AI Act's training-data transparency requirements took effect.
What didn't change
Public data is still public. The CFAA does not criminalize scraping pages that don't require authentication; the Supreme Court's Van Buren decision (2021) narrowed "exceeds authorized access" to bypassing an access gate, and hiQ applied the same logic to public pages. Honoring robots.txt remains industry best practice, but it is not a legal obligation in the US.
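Best practice is easy to implement: Python's standard library ships a robots.txt parser. A minimal sketch, using an inline robots.txt string for illustration (in practice you would fetch `https://<host>/robots.txt` before crawling; the `my-scraper` agent name is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; real crawlers fetch this per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url: str, agent: str = "my-scraper") -> bool:
    """Return True if robots.txt permits this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("https://example.com/public/page"))   # True
print(allowed("https://example.com/private/data"))  # False
print(parser.crawl_delay("my-scraper"))             # 2
```

The `Crawl-delay` value is advisory, but respecting it is the cheapest way to stay on a site operator's good side.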
What did change
GDPR has been wielded against scrapers far more aggressively since the Clearview AI fines. If you collect personal data of EU residents — phone numbers, email addresses, social profiles — you need a documented lawful basis under Article 6 and a way for data subjects to exercise their Article 17 erasure rights. Most scrapers fail this test.
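The simplest way to pass it is to avoid persisting personal data in the first place. A minimal redaction sketch; the regexes below are illustrative, not exhaustive, and real PII detection needs far more care (names, addresses, national IDs) plus human review:

```python
import re

# Illustrative patterns only; production PII detection needs a
# dedicated library and review, not two regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Replace email addresses and phone-like strings before storage."""
    text = EMAIL_RE.sub("[email removed]", text)
    text = PHONE_RE.sub("[phone removed]", text)
    return text

print(scrub("Contact jane.doe@example.com or +44 20 7946 0958 today."))
```

Scrubbing at ingestion time also shrinks your erasure-request surface: you can't be ordered to delete data you never stored.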
The EU AI Act's Article 53 requires providers of general-purpose AI models to publish a "sufficiently detailed summary" of the content used for training. If your scraper feeds a model deployed in the EU, your sourcing has to be auditable end-to-end.
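Auditable sourcing means recording provenance at fetch time, not reconstructing it later. A sketch of a per-fetch audit record; the exact fields a compliant summary will require are still being clarified, so the schema below (URL, retrieval time, content hash, licensing note) is an assumed reasonable minimum, not a legal checklist:

```python
import hashlib
import json
import time

def provenance_record(url: str, body: bytes, license_note: str) -> dict:
    """Build one audit record for a fetched document.

    Field names here are illustrative. A content hash lets you later
    prove which bytes you trained on; the license note records your
    basis for using them.
    """
    return {
        "url": url,
        "fetched_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "sha256": hashlib.sha256(body).hexdigest(),
        "license": license_note,
    }

record = provenance_record(
    "https://example.com/article", b"<html>...</html>", "CC-BY-4.0"
)
print(json.dumps(record))  # append one line per fetch to a JSONL manifest
```

An append-only JSONL manifest written alongside the crawl is cheap insurance: it survives pipeline refactors and answers "where did this document come from?" without re-crawling.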
Defaults that ship the right way
- Honor robots.txt by default; require explicit per-host opt-out
- Per-host rate limiting in the standard library, not optional
- No scraping behind auth or paywall — refuse the request
- Use only proxy providers with audited ethical sourcing
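The first two defaults can be sketched in a few lines. `PoliteFetcher` and its parameters are hypothetical names, not a real library; the point is the shape: throttling is built in, and ignoring robots.txt requires a deliberate per-host call, never a global flag:

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

class PoliteFetcher:
    """Sketch of safe-by-default scraper plumbing (illustrative names):
    per-host rate limiting, plus a robots.txt opt-out that must be
    requested explicitly for each host."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval      # seconds between hits per host
        self.last_hit = defaultdict(float)    # host -> last request time
        self.robots_opt_out = set()           # hosts where robots.txt is ignored

    def ignore_robots(self, host: str) -> None:
        """Explicit, per-host opt-out -- deliberately awkward to use broadly."""
        self.robots_opt_out.add(host)

    def throttle(self, url: str) -> float:
        """Sleep until this host's minimum interval has passed.

        Returns how long we waited, for logging.
        """
        host = urlparse(url).netloc
        wait = self.last_hit[host] + self.min_interval - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.last_hit[host] = time.monotonic()
        return max(wait, 0.0)
```

Keeping the limiter keyed on `netloc` means two URLs on the same host share a budget while different hosts proceed in parallel, which matches how site operators actually experience load.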