[DATASET][PAID] 1 Million Labeled Hinglish Dataset — Available For Licensing

Hey everyone, I’ve spent months building a large-scale Hinglish dataset and I’m making it available for licensing.

What’s in it:
– 1,000,000 real Hinglish samples from social media
– 6 labels per entry, including intent, emotion, toxicity, sarcasm, and language tag
– Natural conversational Hinglish (not translated; this is how people actually type)

Why it matters: Hinglish is how 300M+ Indians actually communicate online. Most existing datasets are either pure Hindi or pure English. This fills a real gap for anyone building India-focused NLP models, chatbots, or content moderation systems.

Sample labels include:
– Intent: Appreciation / Request / Question / Neutral
– Emotion: Happy / Sad / Angry / Surprised / Neutral
– Toxicity: Low / Medium / High
– Sarcasm: Yes / No
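If you want to sanity-check records against the label values listed above, here is a rough Python sketch. The field names (`text`, `intent`, `emotion`, `toxicity`, `sarcasm`) are assumptions for illustration and may not match the keys in the teaser file, and the two sample records are made up, not taken from the dataset. Since the lists above are described as sample labels, a value outside these sets is not necessarily an error.

```python
from collections import Counter

# Allowed values per label, copied from the post. The key names
# ("intent", "emotion", ...) are assumed, not confirmed by the teaser file.
ALLOWED = {
    "intent": {"Appreciation", "Request", "Question", "Neutral"},
    "emotion": {"Happy", "Sad", "Angry", "Surprised", "Neutral"},
    "toxicity": {"Low", "Medium", "High"},
    "sarcasm": {"Yes", "No"},
}


def invalid_labels(record):
    """Return {field: value} for every label that is missing or outside ALLOWED."""
    return {
        field: record.get(field)
        for field, allowed in ALLOWED.items()
        if record.get(field) not in allowed
    }


def label_distribution(records, field):
    """Count how often each value of one label occurs across records."""
    return Counter(r.get(field) for r in records)


# Made-up records for illustration only; not taken from the dataset.
samples = [
    {"text": "yaar ye movie bahut achhi thi", "intent": "Appreciation",
     "emotion": "Happy", "toxicity": "Low", "sarcasm": "No"},
    {"text": "kal meeting kitne baje hai?", "intent": "Question",
     "emotion": "Neutral", "toxicity": "Low", "sarcasm": "Maybe"},
]

for s in samples:
    print(invalid_labels(s))
print(label_distribution(samples, "intent"))
```

To run this on the teaser, load the file with `json.load` and adjust the keys in `ALLOWED` to whatever the file actually uses.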

Licensing:
– Non-exclusive: $20,000 (multiple buyers allowed)
– 5,000-sample teaser available for evaluation before purchase

Who this is for:
– AI startups building for Indian markets
– Researchers working on code-switching or multilingual NLP
– Companies building content moderation for Indian platforms

Check the teaser here: https://github.com/theYugrathee/1-million-hinglish-dataset-sample-of-5k-/blob/main/hinglish_dataset_teaser.json

Drop a comment or DM if interested!

Disclosure: I am the creator and seller of this dataset.

submitted by /u/UniqueProfessional81