I’ve been working on this database for about a year during my sabbatical and released a preview version of it this week: https://baseball.computer/
I have two goals for the project – to facilitate reproducible baseball research and to create the most fun and interesting “toy dataset” possible for educational settings.
From a technical standpoint, the database runs entirely inside your browser, which means that you can write SQL against event-level data and visualize the results directly on the website. All of the tables are also available to download as flat files, and there are instructions for connecting to the data in Python and R.
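To give a rough feel for what "SQL against event-level data" looks like outside the browser, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the inline CSV stands in for a downloaded flat file, the `events` table and its columns (`game_id`, `batter_id`, `hits`, `at_bats`) are hypothetical, and Python's built-in `sqlite3` is only a convenient stand-in engine. The real table names, columns, and connection instructions are on the site.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for a downloaded flat file. The real tables and
# column names on the site will differ.
FLAT_FILE = """game_id,batter_id,hits,at_bats
G1,alice,1,1
G1,bob,0,1
G1,alice,0,1
G2,alice,1,1
G2,bob,1,1
"""


def load_events(conn, text):
    """Load event-level rows from CSV text into an in-memory table."""
    conn.execute(
        "CREATE TABLE events ("
        "game_id TEXT, batter_id TEXT, hits INTEGER, at_bats INTEGER)"
    )
    rows = [
        (r["game_id"], r["batter_id"], int(r["hits"]), int(r["at_bats"]))
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)


def batting_averages(conn):
    """Roll event-level counting stats up to one row per batter."""
    return conn.execute(
        """
        SELECT batter_id,
               SUM(hits) AS hits,
               SUM(at_bats) AS at_bats,
               ROUND(1.0 * SUM(hits) / SUM(at_bats), 3) AS batting_avg
        FROM events
        GROUP BY batter_id
        ORDER BY batter_id
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_events(conn, FLAT_FILE)
for row in batting_averages(conn):
    print(row)
```

The point is only that each row is a single event, so any traditional stat is a `SUM` and a `GROUP BY` away.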
From a baseball standpoint, it contains thousands of individual columns that pre-calculate as many building blocks as possible for statistical analysis. These include:
- Repeatable construction of WAR components like linear weights, win/run expectancy, and park factors
- An example of a Keras deep-and-cross deep learning model that can train using the entire dataset on a laptop
- Tables that correctly merge event-level, box-level, game-level, and season-level raw data
- Taxonomies and additional metadata for outcome types, batted balls, and pitches
- 100+ event-level atomic “counting stats” including granular information on traditional stats, baserunning advances, pitches, and batted-ball location/trajectory
- Detailed event state tables that can be combined with the counting stats for calculating splits
- Inference/deduction for handling missing batted ball data, unknown fielders, and unusual scorekeeper tendencies
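As a rough illustration of the "event states plus counting stats" idea for splits, here is a small Python sketch. The table names (`event_counts`, `event_states`), the `event_key` join column, and the `runners_on` flag are all made up for this example and are not the project's actual schema. It again uses the built-in `sqlite3` module with toy rows rather than real data.

```python
import sqlite3

# Toy, hypothetical schema: one table of per-event counting stats and one
# table describing the game state when each event began.
split_conn = sqlite3.connect(":memory:")
split_conn.executescript(
    """
    CREATE TABLE event_counts (
        event_key INTEGER PRIMARY KEY, at_bats INTEGER, hits INTEGER
    );
    CREATE TABLE event_states (
        event_key INTEGER PRIMARY KEY, runners_on INTEGER
    );
    -- Event 5 is a plate appearance that is not an at-bat (e.g. a walk).
    INSERT INTO event_counts VALUES
        (1, 1, 1), (2, 1, 0), (3, 1, 1), (4, 1, 1), (5, 0, 0), (6, 1, 0);
    INSERT INTO event_states VALUES
        (1, 0), (2, 0), (3, 1), (4, 1), (5, 1), (6, 1);
    """
)


def average_by_runners_on(conn):
    """Batting average split by whether any runner was on base."""
    return conn.execute(
        """
        SELECT s.runners_on,
               SUM(c.at_bats) AS at_bats,
               SUM(c.hits) AS hits,
               ROUND(1.0 * SUM(c.hits) / SUM(c.at_bats), 3) AS batting_avg
        FROM event_counts AS c
        JOIN event_states AS s USING (event_key)
        GROUP BY s.runners_on
        ORDER BY s.runners_on
        """
    ).fetchall()


for row in average_by_runners_on(split_conn):
    print(row)
```

Because the state lives in its own table, swapping the split (outs, inning, count, and so on) would only mean grouping by a different state column, while the counting-stat side of the query stays the same.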
Extensive-but-spotty documentation is available for all tables on the site. This includes all of the source (SQL) code, the upstream and downstream dependencies of each table, and a link to directly download the table as a flat file (here is an example). There are also several hundred tests and data constraints. This is nowhere near enough coverage to guarantee ease of use or data integrity, but it will hopefully serve as a foundation for both as the project evolves.
A couple of requests for anyone interested in playing around with it – please send me any feedback (bugs, feature requests, use cases, etc.) and, if you find it interesting, please share with your other data communities!
submitted by /u/PaginatedSalmon