[DATASET][PAID] 1 Million Labeled Hinglish Dataset — Available for Licensing
Hey everyone, I’ve spent months building a large-scale Hinglish dataset and I’m making it available for licensing.
What’s in it:
– 1,000,000 real Hinglish samples from social media
– 6 labels per entry, including intent, emotion, toxicity, sarcasm, and language tag
– Natural conversational Hinglish (not translated; it’s how people actually type)
Why it matters: Hinglish is how 300M+ Indians actually communicate online. Most existing datasets are either pure Hindi or pure English. This fills a real gap for anyone building India-focused NLP models, chatbots, or content moderation systems.
Sample labels include:
– Intent: Appreciation / Request / Question / Neutral
– Emotion: Happy / Sad / Angry / Surprised / Neutral
– Toxicity: Low / Medium / High
– Sarcasm: Yes / No
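If you want to sanity-check records against the label values listed above, here is a minimal sketch. The field names (`text`, `intent`, `emotion`, `toxicity`, `sarcasm`) are assumptions for illustration and may not match the keys in the actual files, and the example record is made up rather than taken from the dataset. Since the list above is described as sample labels, a value outside it is flagged as unexpected, not necessarily wrong.

```python
# Allowed values copied from the label list in this post.
# The field names are assumed; adjust them to the real keys in the data.
ALLOWED_LABELS = {
    "intent": {"Appreciation", "Request", "Question", "Neutral"},
    "emotion": {"Happy", "Sad", "Angry", "Surprised", "Neutral"},
    "toxicity": {"Low", "Medium", "High"},
    "sarcasm": {"Yes", "No"},
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, allowed in ALLOWED_LABELS.items():
        value = record.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif value not in allowed:
            problems.append(f"unexpected {field} value: {value!r}")
    return problems


# Hypothetical record, invented for this example (not from the dataset).
example = {
    "text": "yaar ye movie bahut achhi thi",
    "intent": "Appreciation",
    "emotion": "Happy",
    "toxicity": "Low",
    "sarcasm": "No",
}
print(validate_record(example))
```

The language tag is left out of the check because its possible values are not listed here.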
Licensing:
– Non-exclusive: $20,000 (multiple buyers allowed)
– 5,000-sample teaser available for evaluation before purchase
Who this is for:
– AI startups building for Indian markets
– Researchers working on code-switching or multilingual NLP
– Companies building content moderation for Indian platforms
Check the teaser here: https://github.com/theYugrathee/1-million-hinglish-dataset-sample-of-5k-/blob/main/hinglish_dataset_teaser.json
Drop a comment or DM if interested!
Disclosure: I am the creator and seller of this dataset.
submitted by /u/UniqueProfessional81