Hi all,
I’m the co-founder of Datai. We are releasing a 65-hour dataset of spontaneous, two-speaker dialogues focused on Kenyan (KE) and Filipino (PH) English accents.
We built this to solve a specific internal problem: standard datasets (like LibriSpeech) are too clean. We needed data that reflects WebRTC/VoIP acoustics and non-Western prosody.
We are releasing this batch on Hugging Face for the community to use for ASR benchmarking, accent robustness testing, or diarization experiments.
The Specs:
- Total Duration: ~65 hours (Full dataset is 800+ hours)
- Speakers: >150 (majority Kenyan interviewees, ~15 Filipino interviewers)
- Topic: Natural, unscripted conversations about day-to-day life.
- Audio Quality: Recorded via WebRTC in Opus at 48 kHz, transcoded to pcm_s16le.
- Structure: Split-track (stereo). Each speaker is on a separate track (see the channel-splitting sketch after this list).
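If you want per-speaker mono audio, splitting the channels is straightforward. A minimal sketch, assuming the stereo files put one speaker on each channel (our reading of the split-track note; the filename below is hypothetical):

```python
# Split a stereo split-track WAV into per-speaker mono files.
# Assumes one speaker per channel; requires: pip install soundfile
import soundfile as sf

audio, sr = sf.read("ROOMID_TRACKID_0_30000.wav")  # shape: (frames, 2)
speaker_a = audio[:, 0]  # left channel
speaker_b = audio[:, 1]  # right channel

sf.write("speaker_a.wav", speaker_a, sr)
sf.write("speaker_b.wav", speaker_b, sr)
```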
Processing & Segmentation: We processed the raw streams with silero-vad to chunk the audio into 1–30 second segments.
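For anyone reproducing a similar pipeline, here is a rough sketch of VAD chunking with silero-vad. This is not our exact configuration; the duration parameters are illustrative assumptions matching the 1–30 s bounds:

```python
# Sketch of VAD-based chunking with silero-vad (pip install silero-vad).
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio("track.wav", sampling_rate=16000)  # silero-vad expects 16 kHz

timestamps = get_speech_timestamps(
    wav, model,
    sampling_rate=16000,
    min_speech_duration_ms=1000,  # assumption: enforce the 1 s floor
    max_speech_duration_s=30,     # assumption: enforce the 30 s ceiling
    return_seconds=True,
)
print(timestamps)  # [{'start': ..., 'end': ...}, ...]
```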
File/Metadata Structure: We’ve structured the filenames to help with parsing: ROOM-ID_TRACK-ID_START-MS_END-MS
- ROOM-ID: Unique identifier for the conversation session.
- TRACK-ID: The specific audio track (usually one speaker per track).
- START-MS / END-MS: Segment boundaries in milliseconds.
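Parsing those filenames takes a few lines. A minimal sketch (the example filename is hypothetical):

```python
# Parse ROOM-ID_TRACK-ID_START-MS_END-MS filenames.
# rsplit tolerates underscores inside ROOM-ID; assumes TRACK-ID has none.
from pathlib import Path

def parse_segment_name(path: str) -> dict:
    room_id, track_id, start_ms, end_ms = Path(path).stem.rsplit("_", 3)
    return {
        "room_id": room_id,
        "track_id": track_id,
        "start_ms": int(start_ms),
        "end_ms": int(end_ms),
    }

seg = parse_segment_name("a1b2c3_0_15000_32500.wav")  # hypothetical file
print((seg["end_ms"] - seg["start_ms"]) / 1000, "seconds")
```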
Technical Caveat (the edge case): Since this is real-world WebRTC data, we are transparent about the dirt in the data: if a speaker drops the connection and rejoins, they may appear under a new TRACK-ID within the same ROOM-ID. We are clustering these in v2; for now, treat track IDs as session-specific rather than global speaker identities.
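In practice, that means downstream code should key speakers on the (ROOM-ID, TRACK-ID) pair rather than on TRACK-ID alone. A small sketch building on the hypothetical parser above:

```python
# Group segments by a session-local speaker key, per the caveat above.
from collections import defaultdict

segments_by_speaker = defaultdict(list)
for path in ["a1b2c3_0_0_12000.wav", "a1b2c3_1_500_9000.wav"]:  # hypothetical
    seg = parse_segment_name(path)
    key = (seg["room_id"], seg["track_id"])  # session-specific, NOT global
    segments_by_speaker[key].append(seg)
```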
Access: The dataset is hosted on Hugging Face (gated to prevent bots/abuse, but I approve manual requests quickly).
Link is in the comments.
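Once your access request is approved, loading should look something like this. The repo id below is a placeholder, not the real one (use the link in the comments), and you'll need to be logged in via huggingface-cli first:

```python
# Sketch: loading a gated Hugging Face dataset after approval.
# "datai/REPO-NAME" is a placeholder; token=True reuses your stored HF login.
from datasets import load_dataset

ds = load_dataset("datai/REPO-NAME", split="train", token=True)
print(ds[0])
```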