Looking for Feedback on a Conversational Speech Dataset (Multilingual, Real Interactions)

We’ve been working on conversational speech datasets recently and wanted to share a sample to get feedback from this community.

The focus is on real conversational behaviour rather than clean, scripted dialogue.

What it includes:

multi-speaker conversations
natural interruptions and overlapping speech
code-switching (Hindi + English, Hinglish)
context-driven interactions (not isolated utterances)
speaker variability (accent, pace, fluency)

Languages covered in the sample:

Indian English
Hindi
Hinglish
Punjabi
Marwadi

We’ve tried to keep the structure usable for training and evaluation, with metadata around speakers, turns, and context.
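For anyone wondering what "metadata around speakers, turns, and context" might look like in practice, here is a minimal Python sketch. It is only an illustration: every class name, field name, and language code below is hypothetical and is not the dataset's actual schema. It assumes each turn carries start and end timestamps, which is what makes overlapping speech and code-switched turns easy to pull out for evaluation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Speaker:
    speaker_id: str
    accent: str  # hypothetical field, e.g. "Punjabi-accented English"


@dataclass
class Turn:
    speaker_id: str
    start: float           # seconds from the start of the recording
    end: float
    text: str
    languages: List[str]   # hypothetical: more than one code marks code-switching


@dataclass
class Conversation:
    conversation_id: str
    context: str           # hypothetical free-text description of the situation
    speakers: List[Speaker]
    turns: List[Turn] = field(default_factory=list)


def overlapping_turns(conv: Conversation) -> List[Tuple[int, int]]:
    """Return index pairs of turns by different speakers whose time spans overlap."""
    order = sorted(range(len(conv.turns)), key=lambda i: conv.turns[i].start)
    pairs = []
    for pos, i in enumerate(order):
        a = conv.turns[i]
        for j in order[pos + 1:]:
            b = conv.turns[j]
            if b.start >= a.end:
                break  # turns are sorted by start, so nothing later can overlap a
            if a.speaker_id != b.speaker_id:
                pairs.append((i, j))
    return pairs


def code_switched_turns(conv: Conversation) -> List[int]:
    """Return indices of turns tagged with more than one language."""
    return [i for i, t in enumerate(conv.turns) if len(set(t.languages)) > 1]


# Made-up sample record, purely for illustration.
sample = Conversation(
    conversation_id="conv-0001",
    context="two friends planning a trip",
    speakers=[Speaker("s1", "Hindi"), Speaker("s2", "Punjabi")],
    turns=[
        Turn("s1", 0.0, 2.5, "Haan, so I was saying", ["hi", "en"]),
        Turn("s2", 2.1, 4.0, "Wait, ek second", ["en", "hi"]),
        Turn("s1", 4.2, 6.0, "Okay", ["en"]),
    ],
)

print(overlapping_turns(sample))
print(code_switched_turns(sample))
```

In this sample the second turn starts before the first one ends, so that pair is reported as an interruption, and the first two turns are flagged as code-switched. A flat per-turn layout like this is just one option; a real pipeline might prefer JSON Lines or a format such as RTTM for the timing information.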

It’s still early, and we would genuinely appreciate feedback on:

dataset structure
missing edge cases
what would make this more useful in real pipelines

Happy to share access if anyone wants to take a closer look.

submitted by /u/Cautious-Today1710