I’ve compiled a list of datasets that can be used to train LLMs to generate code from text. Let me know if there is any dataset that I’ve missed!
WikiSQL is a large crowd-sourced dataset for developing natural language interfaces for relational databases.
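If you want to poke at it, here's a minimal sketch for loading it with the Hugging Face `datasets` library. The Hub ID (`wikisql`) and the field names are my assumptions based on the Hub-hosted version; older script-based datasets may also need `trust_remote_code=True` on recent `datasets` releases:

```python
# Minimal sketch: loading WikiSQL from the Hugging Face Hub.
# The dataset ID and field names are assumptions, not guaranteed.
from datasets import load_dataset

ds = load_dataset("wikisql", split="train")
example = ds[0]
print(example["question"])               # natural-language question
print(example["sql"]["human_readable"])  # paired target SQL query
```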
The Vault is a comprehensive, large-scale, multilingual parallel dataset of high-quality code-text pairs derived from The Stack, the largest permissively licensed source code dataset.
CodeContests is a competitive programming dataset for machine learning; it was used to train AlphaCode.
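Since the dataset is fairly large, streaming it is probably the easiest way to have a look. A minimal sketch, assuming the Hub ID `deepmind/code_contests` and these field names (both my assumptions):

```python
# Minimal sketch: peeking at one CodeContests problem via streaming,
# so nothing has to be fully downloaded. Hub ID and fields are assumptions.
from datasets import load_dataset

ds = load_dataset("deepmind/code_contests", split="train", streaming=True)
problem = next(iter(ds))
print(problem["description"][:500])           # start of the problem statement
print(len(problem["solutions"]["solution"]))  # number of reference solutions
```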
The Pile is an 825 GiB diverse, open-source language-modelling dataset made up of 22 smaller, high-quality datasets.
The CodeSearchNet corpus is a dataset of 2 million (comment, code) pairs from open-source libraries hosted on GitHub. It contains code and documentation for several programming languages.
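Here's a minimal sketch of iterating over those (comment, code) pairs for the Python subset. The Hub ID (`code_search_net`) and column names are assumptions based on the Hub-hosted version, and the script-based loader may need `trust_remote_code=True` on newer `datasets` releases:

```python
# Minimal sketch: streaming a few (docstring, function) pairs from
# CodeSearchNet's Python subset. ID and column names are assumptions.
from datasets import load_dataset

ds = load_dataset("code_search_net", "python", split="train", streaming=True)
for pair in ds.take(3):
    print(pair["func_documentation_string"])  # the comment/docstring
    print(pair["func_code_string"])           # the paired function body
```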
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages (60 file extensions), totaling 1 TB of data.
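At 1 TB you almost certainly want to stream this rather than download it. A minimal sketch, assuming the Hub ID `codeparrot/github-code` and the `languages` filter and column names described on its dataset card:

```python
# Minimal sketch: streaming the GitHub Code dataset with a language
# filter so nothing is downloaded up front. The Hub ID, the `languages`
# keyword, and the column names follow the dataset card (assumptions here).
from datasets import load_dataset

ds = load_dataset("codeparrot/github-code", split="train",
                  streaming=True, languages=["Python"])
sample = next(iter(ds))
print(sample["repo_name"], sample["path"], sample["license"])
print(sample["code"][:300])  # first few hundred characters of the file
```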
MBPP (Mostly Basic Python Problems) is a benchmark of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on.
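Each problem comes with a natural-language statement, a reference solution, and assert-based tests. A minimal sketch for inspecting one, assuming the Hub ID `mbpp` and these field names:

```python
# Minimal sketch: inspecting one MBPP problem. The Hub ID ("mbpp") and
# field names are assumptions based on the Hub-hosted version.
from datasets import load_dataset

ds = load_dataset("mbpp", split="test")
task = ds[0]
print(task["text"])       # natural-language problem statement
print(task["code"])       # reference solution
print(task["test_list"])  # assert statements used to check a solution
```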
CodeXGLUE is a collection of code intelligence tasks and a platform for model evaluation and comparison.