Free-tier Launch Of An Original, Studio-recorded Human Voice Dataset For SaaS & Call Bot NLU Training (LJ Speech + JSON Schemas)

I wanted to share an original speech/audio dataset I’ve been compiling. I operate a technical voice data pipeline and decided to build a studio-mastered dataset explicitly tailored for conversational, automated customer service and phone line (IVR) architectures.

If you search for open-source conversational speech data, almost everything out there is either heavily compressed, web-scraped audio with inconsistent noise floors, or read-speech audiobooks that lack a natural, conversational cadence.

The Content:

– Highly structured, realistic transactional utterances tailored for B2B SaaS, ticketing, routing, and telephony pipelines.

– Completely mapped to the standard LJ Speech layout (filename|transcription|normalized_transcription) for drag-and-drop ingestion into standard model pipelines.

– Every single premium audio file is paired with an independent JSON sidecar detailing precise syntax tagging, phonetic structures, and specific semantic intent mappings.
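
For anyone who wants to test the mapping, here is a minimal Python sketch of reading the pipe-delimited rows and attaching each sidecar. The file name metadata.csv, the wavs folder, the same-stem .json sidecar naming, and the load_utterances helper are all assumptions for illustration, not the confirmed repository layout, so adjust them to what you actually find after downloading.

```python
import json
from pathlib import Path
from typing import NamedTuple, Optional


class Utterance(NamedTuple):
    file_id: str
    transcription: str
    normalized: str
    sidecar: Optional[dict]  # None when no sidecar file was found


def load_utterances(root, metadata_name="metadata.csv", audio_dir="wavs"):
    """Read LJ Speech style rows and pair each with its JSON sidecar.

    Assumed (hypothetical) layout: <root>/<metadata_name> holds rows of
    filename|transcription|normalized_transcription, and each sidecar sits
    at <root>/<audio_dir>/<file stem>.json.
    """
    root = Path(root)
    utterances = []
    with open(root / metadata_name, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate blank lines
            parts = line.split("|")
            if len(parts) != 3:
                raise ValueError(
                    f"line {line_no}: expected 3 fields, got {len(parts)}"
                )
            # Accept both "call_0001" and "call_0001.wav" in the first field.
            file_id = Path(parts[0]).stem
            sidecar_path = root / audio_dir / f"{file_id}.json"
            sidecar = None
            if sidecar_path.is_file():
                with open(sidecar_path, encoding="utf-8") as sc:
                    sidecar = json.load(sc)
            utterances.append(Utterance(file_id, parts[1], parts[2], sidecar))
    return utterances
```

Calling load_utterances("path/to/download") returns a list of named tuples. A row with the wrong number of pipe-separated fields raises a ValueError carrying the line number, which is usually the first parsing issue to surface.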

Acoustic Specs:

– Engineered in an acoustic studio at 24-bit/48kHz PCM WAV. The audio files have been passed through a targeted high-pass filter curve to strip low-end room artifacts and are normalized for uniform gain.
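
If you want to confirm the stated format on your side before training, a quick check with Python's standard-library wave module could look like the sketch below. The check_wav_spec helper is hypothetical, and it assumes the files are plain uncompressed PCM. As far as I know, files written with the extensible WAV header can only be read by this module on Python 3.12 or newer, so a wave.Error on an older interpreter may not mean the file is bad.

```python
import wave


def check_wav_spec(path, expected_rate=48_000, expected_bits=24):
    """Report a WAV file's header properties and whether they match.

    Only the header is inspected; this does not measure noise floor,
    filtering, or loudness normalization.
    """
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate": rate,
        "bit_depth": bits,
        "channels": channels,
        "duration_s": frames / rate if rate else 0.0,
        "matches": rate == expected_rate and bits == expected_bits,
    }
```

Running this over every file and collecting the ones where "matches" is False is a cheap way to catch a file that was resampled or converted somewhere along the way.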

Sourcing & Compliance:

This is 100% human-generated, original acoustic data. Because I am the data creator, it is fully cleared, compliant, and legally indemnified. There is zero scraped web content or automated text-to-speech generation inside this pack.

The baseline sample block of the dataset is completely open and free to download. It includes a Full Commercial Use License, meaning you can integrate it into live client demos, public applications, or commercial pipelines right away without the need for a credit card.

Hugging Face Repository (Free Download): https://huggingface.co/datasets/MarieDeVox/saas-corporate-conversational-voice-sample

GitHub (Free Download): https://github.com/MarieDeVox/saas-corporate-voice-dataset-sample

DISCLAIMER: I am the creator and independent owner of this dataset. The sample block linked above is completely free, with a full commercial license to keep forever, but I also host full enterprise production expansions.

If you download the repository and play around with the mapping this weekend, let me know if you run into any parsing issues or formatting bottlenecks!

submitted by /u/MarieDeVox