Sharing My Free Tool For Easy Handwritten Fine-tuning Datasets!

Hello everyone! I wanted to share a tool that I created for making hand written fine-tuning datasets, originally I built this for myself when I was unable to find conversational datasets formatted the way I needed when I was fine-tuning for the first time and hand typing JSON files seemed like some sort of torture so I built a little simple UI for myself to auto format everything for me.

I originally built this back when I was a beginner, so it is very easy to use with no prior dataset creation/formatting experience, but also has a bunch of added features I believe more experienced devs would appreciate!

I have expanded it to support :
– many formats; chatml/chatgpt, alpaca, and sharegpt/vicuna
– multi-turn dataset creation, not just pair-based
– token counting from various models
– custom fields (instructions, system messages, custom IDs),
– auto saves and every format type is written at once
– formats like alpaca have no need for additional data besides input and output, as default instructions are auto-applied (customizable)
– goal tracking bar

I know it seems a bit crazy to be manually typing out datasets, but handwritten data is great for customizing your LLMs and keeping them high-quality. I wrote a 1k interaction conversational dataset within a month during my free time, and this made it much more mindless and easy.

I hope you enjoy! I will be adding new formats over time, depending on what becomes popular or is asked for

Get it here

submitted by /u/ella0333
[link] [comments]

Leave a Reply

Your email address will not be published. Required fields are marked *