Working on a few local AI use cases and hitting the same wall: lack of clean, high-quality non-English data.
English datasets are everywhere, but once you move into local languages and dialects, quality drops fast: noisy labels, inconsistent formats, missing cultural context. Fine-tuning models for real-world local use becomes painful.
Curious from others building outside the US/EU bubble:
- Where do you usually source non-English data?
- What’s the biggest issue: quantity, quality, or context?
- Have you paid for custom datasets before?
Feels like models are getting better faster than the data feeding them.
submitted by /u/Kind_Buyer8931