[Self-Release] 65 Hours Of Kenyan/Filipino English Dialogue | Split-Track WebRTC | VAD-Segmented

Hi all,

I’m the Co-founder of Datai. We are releasing a 65-hour dataset of spontaneous, two-speaker dialogues focused on Kenyan (KE) and Filipino (PH) English accents.

We built this to solve a specific internal problem: standard datasets (like LibriSpeech) are too clean. We needed data that reflects WebRTC/VoIP acoustics and non-Western prosody.

We are releasing this batch on Hugging Face for the community to use for ASR benchmarking, accent robustness testing, or diarization experiments.

The Specs:

  • Total Duration: ~65 hours (the full dataset is 800+ hours)
  • Speakers: >150 (majority Kenyan interviewees, ~15 Filipino interviewers)
  • Topic: Natural, unscripted conversations about day-to-day life.
  • Audio Quality: Recorded via WebRTC as Opus at 48 kHz, transcoded to pcm_s16le.
  • Structure: Split-track (stereo). Each speaker is on a separate track.
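Because each speaker sits on their own channel of the stereo file, the tracks can be pulled apart with the Python standard library alone. A minimal sketch, assuming 16-bit interleaved stereo WAV input (the function name is ours, not part of the release):

```python
import wave

def split_stereo(path: str) -> tuple[bytes, bytes]:
    """Split an interleaved 16-bit stereo WAV into two mono PCM byte streams."""
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 2 and wav.getsampwidth() == 2
        frames = wav.readframes(wav.getnframes())
    # Samples are interleaved L,R,L,R...; each 16-bit sample is 2 bytes.
    left, right = bytearray(), bytearray()
    for i in range(0, len(frames), 4):
        left += frames[i:i + 2]
        right += frames[i + 2:i + 4]
    return bytes(left), bytes(right)
```

Each returned byte string can then be written back out as a mono WAV per speaker.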

Processing & Segmentation: We processed the raw streams using silero-vad to chunk the audio into 1- to 30-second segments.
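Those length bounds imply some post-processing on top of the raw VAD timestamps. A toy illustration of that constraint only (this is not the actual silero-vad pipeline; the thresholds follow the stated 1–30 s bounds, the function name and merge policy are ours): split over-long segments and fold sub-second fragments into their neighbor.

```python
MIN_MS, MAX_MS = 1_000, 30_000  # 1 s minimum, 30 s maximum per chunk

def enforce_bounds(segments: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Post-process (start_ms, end_ms) VAD segments to respect length bounds."""
    out: list[tuple[int, int]] = []
    for start, end in segments:
        # Split over-long speech runs into <= 30 s chunks.
        while end - start > MAX_MS:
            out.append((start, start + MAX_MS))
            start += MAX_MS
        # Fold sub-second fragments into the previous chunk when one exists.
        if end - start < MIN_MS and out:
            out[-1] = (out[-1][0], end)
        else:
            out.append((start, end))
    return out
```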

File/Metadata Structure: We’ve structured the filenames to help with parsing: ROOM-ID_TRACK-ID_START-MS_END-MS

  • ROOM-ID: Unique identifier for the conversation session.
  • TRACK-ID: The specific audio track (usually one speaker per track).
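Given that scheme, segment metadata can be recovered straight from the filename. A sketch, assuming the IDs themselves contain no underscores (the example filename below is made up, not from the dataset):

```python
from pathlib import Path
from typing import NamedTuple

class Segment(NamedTuple):
    room_id: str
    track_id: str
    start_ms: int
    end_ms: int

def parse_segment(filename: str) -> Segment:
    """Parse a ROOM-ID_TRACK-ID_START-MS_END-MS segment filename."""
    stem = Path(filename).stem  # drop any directory and extension
    room, track, start, end = stem.split("_")  # assumes no '_' inside IDs
    return Segment(room, track, int(start), int(end))
```

The START-MS/END-MS fields also give you segment duration for free, without opening the audio.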

Technical Caveat (the edge case): Since this is real-world WebRTC data, we want to be transparent about the dirt in it: if a speaker drops the connection and rejoins, they may appear under a new TRACK-ID within the same ROOM-ID. We are clustering these in v2; for now, treat Track IDs as session-specific rather than global speaker identities.
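In practice that means any pseudo-speaker label should be keyed on the (ROOM-ID, TRACK-ID) pair, never on the track alone. A short sketch of that grouping (filenames here are illustrative, and IDs are again assumed to contain no underscores):

```python
from collections import defaultdict

def group_by_pseudo_speaker(filenames: list[str]) -> dict[tuple[str, str], list[str]]:
    """Group segment files by (room, track): track IDs are only unique within a room."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for name in filenames:
        stem = name.rsplit(".", 1)[0]
        room, track, _start, _end = stem.split("_")
        groups[(room, track)].append(name)
    return dict(groups)
```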

Access: The dataset is hosted on Hugging Face (gated to prevent bots/abuse, but I approve manual requests quickly).

Link is in the comments.

submitted by /u/Downtown_Valuable_44