I scanned the Tranco top 1,000,000 sites (June 2026) and recorded, per domain, which anti-bot/WAF vendor protects it and whether a plain request gets challenged. I'm releasing it as open data.
– 998,497 probed, 818,614 reachable
– Fields: domain, rank, reachable, protected, vendor(s), kind (waf/captcha/bot_management/…), difficulty band, block reason, enforcement, CAPTCHA type, final URL, status, probed_at — names only, no PII
– Plus a top-50k “deep-page census” (86,792 rows) with a page_type field (homepage vs product/listing/profile)
– License: CC BY 4.0
Headline: 53.5% of reachable sites run a managed anti-bot/WAF (Cloudflare ~45%), but only 9.8% actively challenged a plain request. Adoption is lowest among the busiest sites and rises toward the long tail (top-1k 44% → long tail 54%).
Dataset (gzipped JSONL + sample + summary.json): https://github.com/Crawlora-org/anti-bot-adoption-index-data
Open-source detector CLI: go install github.com/Crawlora-org/crawlora-antibot@latest
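If you want to reproduce the adoption-by-rank numbers from the gzipped JSONL yourself, here is a minimal Python sketch. It is not part of the release. It assumes each line is a JSON object with keys named `rank`, `reachable` and `protected`, which is my guess based on the field list above, so check the sample file for the actual key names. The rank bands are also just illustrative cut-offs, not the ones used for the headline.

```python
import gzip
import json
from collections import Counter

# Illustrative rank bands (assumption): top-1k, mid, long tail.
DEFAULT_BANDS = ((1, 1_000), (1_001, 100_000), (100_001, 1_000_000))


def iter_records(path):
    """Yield one dict per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def adoption_by_band(records, bands=DEFAULT_BANDS):
    """Return {(lo, hi): (reachable_count, protected_share)} per rank band.

    Key names "rank", "reachable" and "protected" are assumed, not confirmed.
    Unreachable rows and rows without a rank are skipped.
    Bands with no reachable rows are left out of the result.
    """
    reachable = Counter()
    protected = Counter()
    for rec in records:
        rank = rec.get("rank")
        if rank is None or not rec.get("reachable"):
            continue
        for lo, hi in bands:
            if lo <= rank <= hi:
                reachable[(lo, hi)] += 1
                if rec.get("protected"):
                    protected[(lo, hi)] += 1
                break
    return {
        band: (reachable[band], protected[band] / reachable[band])
        for band in bands
        if reachable[band]
    }
```

Usage would look like `adoption_by_band(iter_records("index.jsonl.gz"))`, where `index.jsonl.gz` is a placeholder for whatever the file in the repo is actually called. Reading line by line keeps memory flat even at a million rows.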
submitted by /u/the_bigbang