I want to build a dataset from GitHub source code to fine-tune a code generation LLM, specifically for the data science domain. Since I don’t have the budget to use LLMs to write natural-language descriptions as inputs, I’m designing a dataset where both the input and the output are code (all crawled from GitHub).
Is there a pipeline that can help me create input-output code pairs with consistent context (i.e., the input should provide enough context for the output) and focus on a specific domain?
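To make the question concrete, here is a rough sketch of the kind of pairing I have in mind. It is not an existing pipeline. It assumes Python files only, treats a hand-picked set of imports (`DS_PACKAGES`, a hypothetical list) as the "data science" filter, and uses everything up to a function's signature as the input and the function body as the output. The names `imported_packages` and `make_pairs` are made up for illustration.

```python
import ast

# Assumption: importing any of these packages marks a file as "data science".
DS_PACKAGES = {"numpy", "pandas", "sklearn", "scipy", "matplotlib"}


def imported_packages(tree):
    """Return the top-level package names imported anywhere in the module."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names


def make_pairs(source):
    """Build one {"input", "output"} pair per top-level function.

    input  = all lines before the function body (imports, earlier code, signature)
    output = the function body
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # skip files that do not parse

    # Domain filter: keep only files that import a data science package.
    if not imported_packages(tree) & DS_PACKAGES:
        return []

    lines = source.splitlines()
    pairs = []
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        body_start = node.body[0].lineno
        if body_start == node.lineno:
            continue  # skip one-line functions like "def f(): return 1"
        context = "\n".join(lines[: body_start - 1])
        target = "\n".join(lines[body_start - 1 : node.end_lineno])
        pairs.append({"input": context, "output": target})
    return pairs
```

Calling `make_pairs(open(path).read())` on each crawled file would give a list of dicts that can be written out as JSONL. The split point is the main design choice here: because the input keeps the imports and the signature, the output always has at least that much context to depend on, but it says nothing about cross-file context.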
submitted by /u/Comfortable-Class905