Looking for Feedback on a Conversational Speech Dataset (Multilingual, Real Interactions)

We’ve been working on conversational speech datasets recently and wanted to share a sample to get feedback from this community.

The sample focuses on real conversational behaviour rather than clean, scripted dialogue.

What it includes:

  • multi-speaker conversations
  • natural interruptions and overlapping speech
  • code-switching (Hindi + English, Hinglish)
  • context-driven interactions (not isolated utterances)
  • speaker variability (accent, pace, fluency)

Languages covered in the sample:

  • Indian English
  • Hindi
  • Hinglish
  • Punjabi
  • Marwadi

We’ve tried to keep the structure usable for training and evaluation, with metadata around speakers, turns, and context.
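
To make the structure question concrete, here is a rough sketch of how per-turn metadata like this could be represented and used to find overlapping speech. This is not the dataset's actual schema: every field name (`speaker_id`, `start`, `end`, `language`, `context`) and the sample values are assumptions invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Turn:
    # Hypothetical fields; the real dataset may name or structure these differently.
    speaker_id: str
    start: float   # turn start time in seconds
    end: float     # turn end time in seconds
    language: str  # e.g. "hi", "en", or "hi-en" for a code-switched turn
    text: str


@dataclass
class Conversation:
    conversation_id: str
    context: str  # free-text description of the situation
    turns: List[Turn] = field(default_factory=list)


def overlapping_pairs(conv: Conversation) -> List[Tuple[Turn, Turn]]:
    """Return pairs of turns by different speakers whose time spans overlap."""
    turns = sorted(conv.turns, key=lambda t: t.start)
    pairs = []
    for i, a in enumerate(turns):
        for b in turns[i + 1:]:
            if b.start >= a.end:
                break  # turns are sorted by start, so later ones cannot overlap a
            if a.speaker_id != b.speaker_id:
                pairs.append((a, b))
    return pairs


# Made-up sample record, only to exercise the function.
sample = Conversation(
    conversation_id="demo-001",
    context="two friends planning a trip",
    turns=[
        Turn("spk_a", 0.0, 2.5, "hi-en", "placeholder text 1"),
        Turn("spk_b", 2.1, 4.0, "hi", "placeholder text 2"),
        Turn("spk_a", 4.2, 6.0, "en", "placeholder text 3"),
    ],
)

overlaps = overlapping_pairs(sample)
```

In this sample, only the first two turns overlap (2.1 s to 2.5 s). Having explicit start and end times per turn is what makes a check like this possible, so it would be useful to know whether the metadata carries timestamps at that granularity.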

It’s still early, and we would genuinely appreciate feedback on:

  • dataset structure
  • missing edge cases
  • what would make this more useful in real pipelines

Happy to share access if anyone wants to take a closer look.

submitted by /u/Cautious-Today1710