Hello everyone! I wanted to share a tool that I created for making hand written fine-tuning datasets, originally I built this for myself when I was unable to find conversational datasets formatted the way I needed when I was fine-tuning for the first time and hand typing JSON files seemed like some sort of torture so I built a little simple UI for myself to auto format everything for me.
I originally built this back when I was a beginner, so it is very easy to use with no prior dataset creation/formatting experience, but also has a bunch of added features I believe more experienced devs would appreciate!
I have expanded it to support :
– many formats; chatml/chatgpt, alpaca, and sharegpt/vicuna
– multi-turn dataset creation, not just pair-based
– token counting from various models
– custom fields (instructions, system messages, custom IDs),
– auto saves and every format type is written at once
– formats like alpaca have no need for additional data besides input and output, as default instructions are auto-applied (customizable)
– goal tracking bar
I know it seems a bit crazy to be manually typing out datasets, but handwritten data is great for customizing your LLMs and keeping them high-quality. I wrote a 1k interaction conversational dataset within a month during my free time, and this made it much more mindless and easy.
I hope you enjoy! I will be adding new formats over time, depending on what becomes popular or is asked for
submitted by /u/ella0333
[link] [comments]