I’ve compiled a list of datasets that can be used to train LLMs to generate code from text. Let me know if there is any dataset that I’ve missed!
WikiSQL is a large crowd-sourced dataset for developing natural language interfaces for relational databases.
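If you want to poke at it, here's a minimal sketch for loading it with the Hugging Face `datasets` library. The Hub ID (`wikisql`) and the field names are my assumptions based on the Hub-hosted version; older script-based datasets may also need `trust_remote_code=True` on recent `datasets` releases:

```python
# Minimal sketch: loading WikiSQL from the Hugging Face Hub.
# The dataset ID and field names are assumptions, not guaranteed.
from datasets import load_dataset

ds = load_dataset("wikisql", split="train")
example = ds[0]
print(example["question"])               # natural-language question
print(example["sql"]["human_readable"])  # paired target SQL query
```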
The Vault is a comprehensive, large-scale, multilingual parallel dataset of high-quality code-text pairs derived from The Stack, the largest permissively licensed source code dataset.
CodeContests is a competitive programming dataset for machine learning; it was used to train AlphaCode.
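Since the dataset is fairly large, streaming it is probably the easiest way to have a look. A minimal sketch, assuming the Hub ID `deepmind/code_contests` and these field names (both my assumptions):

```python
# Minimal sketch: peeking at one CodeContests problem via streaming,
# so nothing has to be fully downloaded. Hub ID and fields are assumptions.
from datasets import load_dataset

ds = load_dataset("deepmind/code_contests", split="train", streaming=True)
problem = next(iter(ds))
print(problem["description"][:500])           # start of the problem statement
print(len(problem["solutions"]["solution"]))  # number of reference solutions
```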
The Pile is an 825 GiB diverse, open-source language-modelling dataset made up of 22 smaller, high-quality datasets.
The CodeSearchNet corpus is a dataset of 2 million (comment, code) pairs from open-source libraries hosted on GitHub. It contains code and documentation for several programming languages.
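Here's a minimal sketch of iterating over those (comment, code) pairs for the Python subset. The Hub ID (`code_search_net`) and column names are assumptions based on the Hub-hosted version, and the script-based loader may need `trust_remote_code=True` on newer `datasets` releases:

```python
# Minimal sketch: streaming a few (docstring, function) pairs from
# CodeSearchNet's Python subset. ID and column names are assumptions.
from datasets import load_dataset

ds = load_dataset("code_search_net", "python", split="train", streaming=True)
for pair in ds.take(3):
    print(pair["func_documentation_string"])  # the comment/docstring
    print(pair["func_code_string"])           # the paired function body
```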
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages (60 file extensions), totaling 1 TB of data.
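At 1 TB you almost certainly want to stream this rather than download it. A minimal sketch, assuming the Hub ID `codeparrot/github-code` and the `languages` filter and column names described on its dataset card:

```python
# Minimal sketch: streaming the GitHub Code dataset with a language
# filter so nothing is downloaded up front. The Hub ID, the `languages`
# keyword, and the column names follow the dataset card (assumptions here).
from datasets import load_dataset

ds = load_dataset("codeparrot/github-code", split="train",
                  streaming=True, languages=["Python"])
sample = next(iter(ds))
print(sample["repo_name"], sample["path"], sample["license"])
print(sample["code"][:300])  # first few hundred characters of the file
```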
MBPP (Mostly Basic Python Problems) is a benchmark of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on.
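Each problem comes with a natural-language statement, a reference solution, and assert-based tests. A minimal sketch for inspecting one, assuming the Hub ID `mbpp` and these field names:

```python
# Minimal sketch: inspecting one MBPP problem. The Hub ID ("mbpp") and
# field names are assumptions based on the Hub-hosted version.
from datasets import load_dataset

ds = load_dataset("mbpp", split="test")
task = ds[0]
print(task["text"])       # natural-language problem statement
print(task["code"])       # reference solution
print(task["test_list"])  # assert statements used to check a solution
```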
CodeXGLUE is a collection of code intelligence tasks and a platform for model evaluation and comparison.