If you’re building a web crawler and need a large seed list, this might help.
I extracted ~72M unique domains from the latest Common Crawl snapshot and published them here:
https://github.com/digitalcortex/72m-domains-dataset/
Use it to bootstrap your crawling queue instead of starting from scratch.
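
For anyone wondering what that might look like, here is a minimal sketch of seeding a queue from a domain list. It assumes a plain-text file with one domain per line, which may not match how the repo actually packages the data, so check its README first. The names `normalize_domain`, `seed_urls`, `build_queue` and the file name `domains.txt` are made up for illustration.

```python
from collections import deque
from itertools import islice
from typing import Deque, Iterable, Iterator, Optional


def normalize_domain(line: str) -> Optional[str]:
    """Return a lowercased bare domain, or None for blank or comment lines."""
    domain = line.strip().lower().rstrip(".")
    if not domain or domain.startswith("#"):
        return None
    return domain


def seed_urls(lines: Iterable[str], scheme: str = "https") -> Iterator[str]:
    """Yield one root URL per distinct domain, in input order."""
    # The in-memory set is fine for a sample; for the full list it gets large,
    # and can be dropped if the input is already deduplicated.
    seen = set()
    for line in lines:
        domain = normalize_domain(line)
        if domain is None or domain in seen:
            continue
        seen.add(domain)
        yield f"{scheme}://{domain}/"


def build_queue(lines: Iterable[str], limit: Optional[int] = None) -> Deque[str]:
    """Build a FIFO crawl queue, optionally capped at `limit` seed URLs."""
    return deque(islice(seed_urls(lines), limit))


# Hypothetical usage, assuming a one-domain-per-line file called domains.txt:
#   with open("domains.txt", encoding="utf-8") as f:
#       queue = build_queue(f, limit=100_000)
```

Passing the open file handle streams the list line by line, and `limit` lets you start with a small sample before loading everything. A real crawler would still need robots.txt handling and per-host rate limiting on top of this.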
submitted by /u/venturepulse