I built a large-scale scraping system that can extract data from thousands of sources simultaneously, bypass anti-bot protection, and convert unstructured formats (PDFs, scanned docs, complex HTML) into clean structured datasets.
What public datasets should exist but don’t, because:
• Data is scattered across too many jurisdictions (every state/county has its own portal)
• No one has aggregated it yet
• It’s in PDFs or hard-to-parse formats
• Sites actively block automated access
Not looking to sell—genuinely trying to understand what public data would be valuable if someone aggregated it. If there’s demand, I might build and release it.
submitted by /u/Sufficient-War-4020