I constructed a 70:30 split of translations to instruction prompts for fine-tuning Google's translategemma-4b-it model, which specializes in translation tasks. The project is fully open source.
Given my limited GPU budget I couldn't train on 100% of the Welsh:English translation datasets, so a different data recipe could substantially improve the fine-tuning data and the quality of the resulting translations (especially if the 12B or 27B variant is trained next).
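For anyone curious what that mixing step can look like in practice, here's a minimal sketch using the Hugging Face `datasets` library. The dataset names and column names (`cy`/`en`, `prompt`/`response`) are placeholders, not the actual sources or recipe from my repo:

```python
# A minimal sketch of one way to build a 70:30 translation/instruction mix
# with Hugging Face datasets. Dataset names and column names are placeholders.
from datasets import load_dataset, interleave_datasets

translations = load_dataset("your/welsh-english-corpus", split="train")  # hypothetical
instructions = load_dataset("your/instruction-prompts", split="train")   # hypothetical

# Normalize both sources to a single "text" column so they can be interleaved
translations = translations.map(
    lambda r: {"text": f"Translate to English: {r['cy']}\n{r['en']}"},
    remove_columns=translations.column_names,
)
instructions = instructions.map(
    lambda r: {"text": f"{r['prompt']}\n{r['response']}"},
    remove_columns=instructions.column_names,
)

# Sample ~70% translations / ~30% instructions until one source runs out
mixed = interleave_datasets(
    [translations, instructions],
    probabilities=[0.7, 0.3],
    seed=42,
    stopping_strategy="first_exhausted",
)

mixed = mixed.shuffle(seed=42)
print(mixed[0]["text"])
```

`interleave_datasets` with `probabilities` draws each example from one source at random, so the ratio is approximate rather than an exact 70:30 count, which was good enough for my purposes.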
What language pairs would you like to see fine-tuned into the TranslateGemma models? I was originally thinking of Klingon, but I couldn't easily find datasets for it on Hugging Face or Kaggle, so I went with Welsh since I found several million rows of data for it.