I have a clean dataset covering the last 20+ years of NCAA tournament games (round, seeds, result, score), along with ~100 traditional and advanced team stats from multiple public sources, captured as they stood pre-tournament. I’ve done a lot of feature engineering and can add those metrics too (e.g., a team’s 3-pt % against its opponent’s 3-pt defense, both raw and normalized by different SOS-type approaches).
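To give a rough idea of what a matchup feature like that could look like, here is a minimal sketch. The field names, the sample numbers, and the SOS scaling rule are all assumptions made up for illustration; they are not the actual columns or normalization used in the dataset.

```python
from dataclasses import dataclass


@dataclass
class TeamStats:
    # Hypothetical pre-tournament stats; field names are illustrative only.
    three_pt_pct: float      # team's own 3-pt shooting percentage (0-1)
    opp_three_pt_pct: float  # 3-pt percentage allowed to opponents (0-1)
    sos: float               # strength-of-schedule rating, higher = tougher


def raw_three_pt_edge(offense: TeamStats, defense: TeamStats) -> float:
    """Raw matchup edge: offense's 3-pt % minus the 3-pt % the defense allows."""
    return offense.three_pt_pct - defense.opp_three_pt_pct


def sos_adjusted_three_pt_edge(
    offense: TeamStats, defense: TeamStats, avg_sos: float
) -> float:
    """SOS-normalized edge under one assumed (not canonical) scaling rule.

    The offense's shooting is scaled up if it faced a tougher-than-average
    schedule, and the defense's allowed percentage is scaled up if it faced
    an easier-than-average schedule.
    """
    adj_offense = offense.three_pt_pct * (offense.sos / avg_sos)
    adj_defense = defense.opp_three_pt_pct * (avg_sos / defense.sos)
    return adj_offense - adj_defense


# Made-up example teams, not real data.
team_a = TeamStats(three_pt_pct=0.38, opp_three_pt_pct=0.33, sos=1.10)
team_b = TeamStats(three_pt_pct=0.34, opp_three_pt_pct=0.30, sos=0.90)

raw_edge = raw_three_pt_edge(team_a, team_b)
adjusted_edge = sos_adjusted_three_pt_edge(team_a, team_b, avg_sos=1.0)
print(f"raw edge: {raw_edge:.4f}, SOS-adjusted edge: {adjusted_edge:.4f}")
```

Each feature is computed once per game from the perspective of each team, so a single tournament matchup yields a pair of raw and adjusted values.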
It’s nothing crazy extensive (no player stats, injuries, trends) but it’s cleaner and more comprehensive than anything I’ve found available for free download / scraping.
I put the scripts together a few years ago with non-trivial code effort and manual QC (name formatting etc). It wouldn’t be particularly difficult to reproduce for a decent programmer. I’m sure AI has made that type of process more accessible but it’d still take some time for most.
I’ve never sold a dataset. Is there any value here? I’m not expecting much, but the work is already done.
I’ve started adding regular season games (with stats as of game time) if that would help, but I probably won’t finish without understanding the value. Same goes for game lines / betting info, but only if the dataset is useless without them, since they’re messier to pull.
submitted by /u/yourfinepettingduck