[Self-Release] 65 Hours Of Kenyan/Filipino English Dialogue | Split-Track WebRTC | VAD-Segmented

Hi all,

I’m the Co-founder of Datai. We are releasing a 65-hour dataset of spontaneous, two-speaker dialogues focused on Kenyan (KE) and Filipino (PH) English accents.

We built this to solve a specific internal problem: standard datasets (like LibriSpeech) are too clean. We needed data that reflects WebRTC/VoIP acoustics and non-Western prosody.

We are releasing this batch on Hugging Face for the community to use for ASR benchmarking, accent robustness testing, or diarization experiments.

The Specs:

  • Total Duration: ~65 hours (the full dataset is 800+ hours)
  • Speakers: >150 (majority Kenyan interviewees, ~15 Filipino interviewers)
  • Topic: Natural, unscripted conversations about day-to-day life.
  • Audio Quality: Recorded via WebRTC as Opus at 48 kHz, transcoded to pcm_s16le.
  • Structure: Split-track (stereo). Each speaker is on a separate track.
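Because each speaker sits on their own channel of the stereo file, the tracks can be pulled apart with the Python standard library alone. A minimal sketch, assuming 16-bit interleaved stereo WAV input (the function name is ours, not part of the release):

```python
import wave

def split_stereo(path: str) -> tuple[bytes, bytes]:
    """Split an interleaved 16-bit stereo WAV into two mono PCM byte streams."""
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 2 and wav.getsampwidth() == 2
        frames = wav.readframes(wav.getnframes())
    # Samples are interleaved L,R,L,R...; each 16-bit sample is 2 bytes.
    left, right = bytearray(), bytearray()
    for i in range(0, len(frames), 4):
        left += frames[i:i + 2]
        right += frames[i + 2:i + 4]
    return bytes(left), bytes(right)
```

Each returned byte string can then be written back out as a mono WAV per speaker.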

Processing & Segmentation: We processed the raw streams using silero-vad to chunk the audio into 1- to 30-second segments.
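Those length bounds imply some post-processing on top of the raw VAD timestamps. A toy illustration of that constraint only (this is not the actual silero-vad pipeline; the thresholds follow the stated 1–30 s bounds, the function name and merge policy are ours): split over-long segments and fold sub-second fragments into their neighbor.

```python
MIN_MS, MAX_MS = 1_000, 30_000  # 1 s minimum, 30 s maximum per chunk

def enforce_bounds(segments: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Post-process (start_ms, end_ms) VAD segments to respect length bounds."""
    out: list[tuple[int, int]] = []
    for start, end in segments:
        # Split over-long speech runs into <= 30 s chunks.
        while end - start > MAX_MS:
            out.append((start, start + MAX_MS))
            start += MAX_MS
        # Fold sub-second fragments into the previous chunk when one exists.
        if end - start < MIN_MS and out:
            out[-1] = (out[-1][0], end)
        else:
            out.append((start, end))
    return out
```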

File/Metadata Structure: We’ve structured the filenames to help with parsing: ROOM-ID_TRACK-ID_START-MS_END-MS

  • ROOM-ID: Unique identifier for the conversation session.
  • TRACK-ID: The specific audio track (usually one speaker per track).
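Given that scheme, segment metadata can be recovered straight from the filename. A sketch, assuming the IDs themselves contain no underscores (the example filename below is made up, not from the dataset):

```python
from pathlib import Path
from typing import NamedTuple

class Segment(NamedTuple):
    room_id: str
    track_id: str
    start_ms: int
    end_ms: int

def parse_segment(filename: str) -> Segment:
    """Parse a ROOM-ID_TRACK-ID_START-MS_END-MS segment filename."""
    stem = Path(filename).stem  # drop any directory and extension
    room, track, start, end = stem.split("_")  # assumes no '_' inside IDs
    return Segment(room, track, int(start), int(end))
```

The START-MS/END-MS fields also give you segment duration for free, without opening the audio.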

Technical Caveat (the edge case): Since this is real-world WebRTC data, we want to be transparent about the dirt in it: if a speaker drops the connection and rejoins, they may appear under a new TRACK-ID within the same ROOM-ID. We are clustering these in v2; for now, treat Track IDs as session-specific rather than global speaker identities.
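In practice that means any pseudo-speaker label should be keyed on the (ROOM-ID, TRACK-ID) pair, never on the track alone. A short sketch of that grouping (filenames here are illustrative, and IDs are again assumed to contain no underscores):

```python
from collections import defaultdict

def group_by_pseudo_speaker(filenames: list[str]) -> dict[tuple[str, str], list[str]]:
    """Group segment files by (room, track): track IDs are only unique within a room."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for name in filenames:
        stem = name.rsplit(".", 1)[0]
        room, track, _start, _end = stem.split("_")
        groups[(room, track)].append(name)
    return dict(groups)
```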

Access: The dataset is hosted on Hugging Face (gated to prevent bots/abuse, but I approve manual requests quickly).

Link is in the comments.

submitted by /u/Downtown_Valuable_44