I’m looking for a multilingual SMS dataset for an application, but I can’t find one
Hello, as mentioned in the title, I’m looking for an SMS dataset. I found a few, but these have the following problems:
Critical Issues:
Class Imbalance – Ham: 4,825 (86.59%) | Spam: 747 (13.41%) → 6.46:1
~440 duplicates in each language (7.5-8%)
🟡 Medium-Level Issues:
Weak Hindi translation – Mixed characters, poor transcription
Wide length distribution – Especially in Hindi (max: 1406!)
Very short messages – Especially in Hindi (95 instances)
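
In case anyone wants to run the same kind of checks on a dataset they suggest, here is a rough sketch of how numbers like the ones above can be computed. It assumes the data is already loaded as (label, text) pairs; the `audit` function, the `min_len`/`max_len` thresholds, and the sample rows are all made up for illustration and are not taken from any of the datasets I looked at.

```python
from collections import Counter


def audit(rows, min_len=5, max_len=500):
    """Summarise class balance, duplicates and message lengths.

    rows: iterable of (label, text) pairs.
    min_len / max_len: arbitrary character thresholds (assumptions).
    """
    rows = list(rows)
    labels = Counter(label for label, _ in rows)

    # Lowercase and collapse whitespace so trivially different copies
    # are counted as duplicates of each other.
    normalised = [" ".join(text.lower().split()) for _, text in rows]
    duplicates = sum(c - 1 for c in Counter(normalised).values())

    lengths = [len(text) for _, text in rows]
    return {
        "class_counts": dict(labels),
        # Largest class divided by smallest class, e.g. 4825 / 747 -> 6.46
        "imbalance_ratio": round(max(labels.values()) / min(labels.values()), 2),
        "duplicate_rate": round(duplicates / len(rows), 4),
        "max_length": max(lengths),
        "too_short": sum(n < min_len for n in lengths),
        "too_long": sum(n > max_len for n in lengths),
    }


# Tiny hypothetical sample, only to show the shape of the report.
sample = [
    ("ham", "See you at 5?"),
    ("ham", "see you at 5?"),
    ("ham", "ok"),
    ("spam", "WIN a free prize now, reply YES"),
]
report = audit(sample)
print(report)
```

Running this once per language column would make it easy to compare candidates side by side. The duplicate check only catches exact matches after normalisation, so near-duplicates with small wording changes would need something fuzzier.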
How can I find datasets without these issues?
submitted by /u/Extension-Onion2310