I constructed a 70:30 split of translations to instruction prompts for fine-tuning Google's translategemma-4b-it model, which specializes in translation tasks. The project is fully open source.
Given my limited GPU budget I couldn't train on 100% of the Welsh:English translation datasets, so a different data recipe could substantially improve the fine-tuning data and the quality of the resulting translations (especially if the 12B or 27B variant is trained next).
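For anyone curious what that mixing step can look like in practice, here's a minimal sketch using the Hugging Face `datasets` library. The dataset names and column names (`cy`/`en`, `prompt`/`response`) are placeholders, not the actual sources or recipe from my repo:

```python
# A minimal sketch of one way to build a 70:30 translation/instruction mix
# with Hugging Face datasets. Dataset names and column names are placeholders.
from datasets import load_dataset, interleave_datasets

translations = load_dataset("your/welsh-english-corpus", split="train")  # hypothetical
instructions = load_dataset("your/instruction-prompts", split="train")   # hypothetical

# Normalize both sources to a single "text" column so they can be interleaved
translations = translations.map(
    lambda r: {"text": f"Translate to English: {r['cy']}\n{r['en']}"},
    remove_columns=translations.column_names,
)
instructions = instructions.map(
    lambda r: {"text": f"{r['prompt']}\n{r['response']}"},
    remove_columns=instructions.column_names,
)

# Sample ~70% translations / ~30% instructions until one source runs out
mixed = interleave_datasets(
    [translations, instructions],
    probabilities=[0.7, 0.3],
    seed=42,
    stopping_strategy="first_exhausted",
)

mixed = mixed.shuffle(seed=42)
print(mixed[0]["text"])
```

`interleave_datasets` with `probabilities` draws each example from one source at random, so the ratio is approximate rather than an exact 70:30 count, which was good enough for my purposes.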
What language pairs would you like to see fine-tuned into the TranslateGemma models? I was originally thinking of Klingon, but I couldn't easily find datasets for it on Hugging Face or Kaggle, so I went with Welsh since I found several million rows of data for it.