We’ve been working on conversational speech datasets recently and wanted to share a sample to get feedback from this community.
The sample focuses on real conversational behaviour rather than clean, scripted dialogue.
What it includes:
- multi-speaker conversations
- natural interruptions and overlapping speech
- code-switching (Hindi + English, Hinglish)
- context-driven interactions (not isolated utterances)
- speaker variability (accent, pace, fluency)
Languages covered in the sample:
- Indian English
- Hindi
- Hinglish
- Punjabi
- Marwari
We’ve tried to keep the structure usable for both training and evaluation, with metadata on speakers, turns, and context.
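To make the structure question concrete, here is a minimal sketch of what a turn-level record and an overlap check could look like. The field names (`speaker_id`, `start`, `end`, `language`, `text`), the language tags, and the example turns are all illustrative assumptions, not the actual schema or contents of the sample.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    # All field names here are hypothetical, chosen for illustration only.
    speaker_id: str
    start: float    # seconds from the start of the conversation
    end: float      # seconds from the start of the conversation
    language: str   # assumed tag, e.g. "hi", "en-IN", "hi-en" for Hinglish
    text: str


def find_overlaps(turns: List[Turn]) -> List[Tuple[int, int, float]]:
    """Return (i, j, seconds) for each pair of turns by different
    speakers whose time spans overlap."""
    overlaps = []
    # Visit turns in order of start time so the inner loop can stop early.
    order = sorted(range(len(turns)), key=lambda k: turns[k].start)
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if turns[j].start >= turns[i].end:
                break  # every later turn starts after turn i has ended
            if turns[i].speaker_id != turns[j].speaker_id:
                shared = min(turns[i].end, turns[j].end) - turns[j].start
                overlaps.append((i, j, round(shared, 3)))
    return overlaps


# Invented example turns, not taken from the dataset.
example = [
    Turn("spk1", 0.0, 3.2, "hi-en", "Kal meeting hai na?"),
    Turn("spk2", 2.8, 5.0, "hi-en", "Haan, ten baje."),
    Turn("spk1", 5.0, 6.5, "en-IN", "Okay, fine."),
]

for i, j, seconds in find_overlaps(example):
    print(f"turns {i} and {j} overlap by {seconds}s")
```

Keeping start and end times on every turn, rather than only a turn order, is what lets interruptions and overlapping speech be recovered downstream.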
It’s still early, and we would genuinely appreciate feedback on:
- dataset structure
- missing edge cases
- what would make this more useful in real pipelines
Happy to share access if anyone wants to take a closer look.
submitted by /u/Cautious-Today1710