Anti-bot / WAF Adoption Across The Top 1,000,000 Websites — Open Dataset (CC BY 4.0, ~1M Rows) [self-promotion]

I scanned the Tranco top 1,000,000 sites (June 2026) and recorded, per domain, which anti-bot/WAF vendor protects it and whether a plain request gets challenged. Releasing it as open data.

– 998,497 probed, 818,614 reachable

– Fields: domain, rank, reachable, protected, vendor(s), kind (waf/captcha/bot_management/…), difficulty band, block reason, enforcement, CAPTCHA type, final URL, status, probed_at — names only, no PII

– Plus a top-50k “deep-page census” (86,792 rows) with a page_type field (homepage vs product/listing/profile)

– License: CC BY 4.0

Headline: 53.5% of reachable sites run a managed anti-bot/WAF (Cloudflare ~45%), but only 9.8% actively challenged the request. Adoption is lowest among the busiest sites and rises down the ranking (top-1k 44% → long tail 54%).
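For anyone who wants to reproduce the rank-band comparison, here is a rough sketch of the calculation. It assumes each record is a dict with `rank`, `reachable` and `protected` keys (names taken from the field list above, types guessed as int/bool). The band edges other than top-1k and the sample rows are made up for illustration, and the real cut-offs may differ.

```python
from collections import defaultdict


def rank_band(rank):
    """Map a Tranco rank to a coarse band (edges are illustrative)."""
    if rank <= 1_000:
        return "top-1k"
    if rank <= 100_000:
        return "1k-100k"
    return "long tail"


def adoption_by_band(records):
    """Share of reachable sites flagged as protected, per rank band."""
    reachable = defaultdict(int)
    protected = defaultdict(int)
    for rec in records:
        if not rec.get("reachable"):
            continue  # unreachable sites stay out of the denominator
        band = rank_band(rec["rank"])
        reachable[band] += 1
        if rec.get("protected"):
            protected[band] += 1
    return {band: protected[band] / reachable[band] for band in reachable}


# Tiny invented sample, only to show the shape of the calculation.
sample = [
    {"domain": "a.example", "rank": 10, "reachable": True, "protected": True},
    {"domain": "b.example", "rank": 500, "reachable": True, "protected": False},
    {"domain": "c.example", "rank": 250_000, "reachable": True, "protected": True},
    {"domain": "d.example", "rank": 900_000, "reachable": False, "protected": False},
]
shares = adoption_by_band(sample)
```

Counting only reachable sites in the denominator matches how the headline figure is phrased ("of reachable sites").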

Dataset (gzipped JSONL + sample + summary.json): https://github.com/Crawlora-org/anti-bot-adoption-index-data
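Since the file is gzipped JSONL, it can be streamed line by line without loading ~1M rows into memory. A minimal sketch using only the Python standard library follows. The `reachable` and `protected` keys are assumed to be booleans as in the field list, and the file name in the usage comment is hypothetical, so check the repo for the actual one.

```python
import gzip
import json


def iter_records(path):
    """Yield one dict per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def protected_share(path):
    """Fraction of reachable domains that are flagged as protected."""
    reachable = 0
    protected = 0
    for rec in iter_records(path):
        if rec.get("reachable"):
            reachable += 1
            if rec.get("protected"):
                protected += 1
    return protected / reachable if reachable else 0.0


# Usage (hypothetical file name):
# print(f"{protected_share('dataset.jsonl.gz'):.1%}")
```

Streaming keeps memory use flat, which matters more for the full file than for the sample.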

Open-source detector CLI: go install github.com/Crawlora-org/crawlora-antibot@latest

submitted by /u/the_bigbang