Creating A Dataset For Fine-Tuning A Code Generation LLM In The Data Science Domain

I want to create a dataset using source code from GitHub to fine-tune a code generation LLM, specifically in the data science domain. Since I don’t have the budget to use LLMs to generate descriptions for the input, I’m designing a dataset where both the input and output are code (all crawled from GitHub).

Is there a pipeline that can help me create input-output code pairs with consistent context (i.e., the input should provide enough context for the output) and focus on a specific domain?
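
To make the question concrete, here is a minimal sketch of the kind of pair I have in mind, assuming Python files only. It uses the standard-library `ast` module to keep files that import a data science package, then splits each top-level function into an input (everything above the function plus its signature) and an output (the function body). The `DS_PACKAGES` list and the `make_pairs` and `imported_packages` helpers are hypothetical names for illustration, not part of any existing pipeline.

```python
import ast

# Assumed allow-list used as a crude domain filter; adjust to taste.
DS_PACKAGES = {"numpy", "pandas", "sklearn", "matplotlib", "scipy", "seaborn"}


def imported_packages(tree):
    """Collect the top-level package names imported anywhere in the module."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names


def make_pairs(source):
    """Split one file into input/output pairs, one per top-level function.

    input  = all code above the function body (imports, helpers, signature)
    output = the function body, including any docstring
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # skip files that do not parse
    if not imported_packages(tree) & DS_PACKAGES:
        return []  # skip files outside the target domain
    lines = source.splitlines()
    pairs = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.body:
            # 0-based index of the first line of the function body
            body_start = node.body[0].lineno - 1
            pairs.append({
                "input": "\n".join(lines[:body_start]),
                "output": "\n".join(lines[body_start:node.end_lineno]),
            })
    return pairs
```

Because the input always contains the file's imports and every definition above the target function, the output never refers to a name imported or defined later in the file. This sketch ignores cross-file context, notebooks, classes, and one-line functions whose body sits on the `def` line, so it is a starting point rather than a complete pipeline.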

submitted by /u/Comfortable-Class905