Where Do People Get Specialized Datasets For Training Voice AI Models?

Working on a Voice AI model and trying to get my hands on some specialized speech datasets. The open ones are fine for testing, but I need more real-world stuff — think support calls, regional dialects, or professional contexts. Has anyone tackled this before? Any tips on where to source or how to create these datasets efficiently?

submitted by /u/Selmakiley
[link] [comments]

0

Building My First Data Analyst Personal Project | Need A Mentor!!!

So, I am currently looking out for job opportunities as a Data Analyst. Now what I have realized is that talking about the work you have done and showcasing them are far more worth than gaining certificates.
so this is my Day 1 in journey of building projects, also my first project to work on my own.
I work better in a team, so if there are people out there who’d want to join me in my journey and work on projects, join me

submitted by /u/Puzzleheaded_Mud1923
[link] [comments]

0

Looking For Taglish/Filipino TikTok Dataset

Hello! I am currently working on thesis and desperately need more data on taglish/filipino, primarily hate speech content. It would really help if anyone would have lead on where I may find a working dataset. Thank you!

submitted by /u/Icy_Fan5276
[link] [comments]

0

Data Analysis In Excel| Question|Advice

So my question is, after you have done all technical work in excel ( cleaned data, made dashboard and etc). how you do your report? i mean with words ( recommendations, insights and etc) I just want to hear from professionals how to do it in a right format and what to include . Also i have heard in interview recruiters want your ability to look at data and read it, so i want to learn it. Help!

submitted by /u/dollywinnie
[link] [comments]

0

Looking For Free / Very Low-cost Sources Of Financial & Registry Data For Unlisted Private & Proprietorship Companies In India — Any Leads?

Hi, I’m researching several unlisted private companies and proprietorships (need: basic financials, ROC filings where available, import/export traces, and contact info). I’ve tried MCA (can view/download docs for a small fee), and aggregators like Tofler / Zauba — those help but can get expensive at scale. I’ve also checked Udyam/MSME lists for proprietorships.

submitted by /u/Interesting-Chef6209
[link] [comments]

0

Kopari Beauty Has Priced Up In Australia Sephora

Kopari’s adjustments span all five major categories:

Bath & Body (40 SKUs): +7.0% average uplift, max +14%
Skincare (19 SKUs): +7.9% average uplift, max +14%
Fragrance (1 SKU): +22%
Haircare (1 SKU): +22%
Makeup (1 SKU): +9%

I have created a Notion database for above by-SKU changes, completely free to use, link in comment.

submitted by /u/IntelligentHome2342
[link] [comments]

0

Medical Education Curriculum Dataset (Multi Turn Conversation)

https://huggingface.co/datasets/lukehinds/deepfabric-7k-medical-multi-turn-conversation

Note, this is a synthetic dataset , its not based on real events. It was generated with deepfabric open source dataset generation tool.

submitted by /u/DecodeBytes
[link] [comments]

0

Looking For OSINT-related Datasets For A University Project

Hi everyone,

I’m working on a university project on big data and would like to explore something in the area of OSINT (Open Source Intelligence).

I’ve already checked Kaggle but couldn’t find anything relevant.
Does anyone know of websites, repositories, or public datasets that might be useful?

Thanks a lot for your help!

submitted by /u/onesmartco0kie
[link] [comments]

0

A New Interpretable Clinical Model. Tell Me What You Think About The Code

Hello everyone, I wrote an article about how an XGBoost can lead to clinically interpretable models like mine. Shap is used to make statistical and mathematical interpretation viewable

submitted by /u/ksrio64
[link] [comments]

0

Why Is Modern Data Architecture So Confusing? (and What Finally Made Sense For Me – Sharing For Beginners)

I’m a data engineering student who recently decided to shift from a non-tech role into tech, and honestly, it’s been a bit overwhelming at times. This guide I found really helped me bridge the gap between all the “bookish” theory I’m studying and how things actually work in the real world.

For example, earlier this semester I was learning about the classic three-tier architecture (moving data from source systems → staging area → warehouse). Sounds neat in theory, but when you actually start looking into modern setups with data lakes, real-time streaming, and hybrid cloud environments, it gets messy real quick.

I’ve tried YouTube and random online courses before, but the problem is they’re often either too shallow or too scattered. Having a sort of one-stop resource that explains concepts while aligning with what I’m studying and what I see at work makes it so much easier to connect the dots.

Sharing here in case it helps someone else who’s just starting their data journey and wants to understand data architecture in a simpler, practical way.

https://www.exasol.com/hub/data-warehouse/architecture/

submitted by /u/UnusualRuin7916
[link] [comments]

0

Looking For Real‑Time Social Media Data Providers With Geographic Filtering

I’m working on a social listening tool and need access to real‑time (or near real‑time) social media datasets. The key requirement is the ability to filter or segment data by geography (country, region, or city level).

I’m particularly interested in:

Providers with low latency between post creation and data availability
Coverage across multiple platforms (Twitter/X, Instagram, Reddit, YouTube, etc.)
Options for multilingual content, especially for non‑English regions
APIs or data streams that are developer‑friendly

If you’ve worked with any vendors, APIs, or open datasets that fit this, I’d love to hear your recommendations, along with any notes on pricing, reliability, and compliance with platform policies.

submitted by /u/To_Iflal
[link] [comments]

0

[Resource] A Hub To Discover Open Datasets Across Government, Research, And Nonprofit Portals (I Built This)

Hi all, I’ve been working on a project called Opendatabay.com, which aggregates open datasets from multiple sources into a searchable hub.

The goal is to make it easier to find datasets without having to search across dozens of government portals or research archives. You can browse by category, region, or source.

I know r/datasets usually prefers direct dataset links, but I thought this could be useful as a discovery resource for anyone doing research, journalism, or data science.

Happy to hear feedback or suggestions on how it could be more useful to this community.

Disclaimer: I’m the founder of this project.

submitted by /u/Winter-Lake-589
[link] [comments]

0

[self Promotion] Databounties – Post Your Data Requests

I created a site called databounties.com I haven’t even launched it yet but it is for people seeking datasets, you can add your requests and have people apply or email you. Hopefully it helps people find more data and others find more jobs!

submitted by /u/RaccoonSignificant96
[link] [comments]

0

Looking For A Dataset For Project!! (stock Prediction Using Sentiment Analysis)

Any recommendations for datasets even remotely close to below structure plzz recommend

|| || |Comapny ticker|DJIA value of company on Day3(t-2)|DJIA value Day2(t-1)|DJIA value Day1(t)|Twitter Sentiment about company on day3|Twitter Sentiment on day2|Twitter Sentiment on day1|label : prediction (up or down)(t+1)|

where, day 3 is day before yersterday, day 2 is yesterday, day 1 is today and prediction(label) is of tomorrow.

Also, any recommendations for datasets on stock related tweets too!!

submitted by /u/Dull-Assignment-3273
[link] [comments]

0

What’s The Smoothest Way To Share Multi-gigabyte Datasets Across Institutions?

I’ve been collaborating with a colleague on a project that involves some pretty hefty datasets, and moving them back and forth has been a headache. Some of the files are 50–100GB each, and in total we’re looking at hundreds of gigabytes. Standard cloud storage options don’t seem built for this either they throttle speeds, enforce strict limits, or require subscriptions that don’t make sense for one off transfers.

We’ve tried compressing and splitting files, but that just adds more time and confusion when the recipient has to reassemble everything. Mailing drives might be reliable, but it feels outdated and isn’t practical when you need results quickly. Ideally, I’d like something that’s both fast and secure, since we’re dealing with research data.

For those of you who routinely share large datasets across universities, labs, or organizations what’s worked best in your experience? Do you stick with institutional servers and FTP setups, or is there a practical modern tool for big dataset transfers?

submitted by /u/d4rk_diamond
[link] [comments]

0

Waymo Self Driving Cars Crash Data CSVs. Including Crashes With SGO Identifier , Geographic Distribution And Outcomes

submitted by /u/cavedave
[link] [comments]

0

Looking For SQL Study Partners – Data Analyst Transition

submitted by /u/BlockSpirited344
[link] [comments]

0

The Final 50 Days Of R/gbnews: A Collection Of All Posts, Comments And Related Users.

The file is 59 Megabytes, formatted in JSON. If there are any issues with accessing the file please contact me. I would also greatly appreciate any credit for use of this dataset.

r/gbnews was responsible for pushing a large amount of disinformation and radicalization content. I collected this data with the intention of investigating the possibility of some of the accounts on the subreddit being botted.

If you have any further questions about the dataset, do not hesitate to ask!

submitted by /u/Slomas99
[link] [comments]

0

Little Alchemy/infinite Craft Like Dataset

The title might be a bit confusing, but what i am looking for is a dataset with a lot of elements and element combos. I plan on using this to train an AI for making something close to infinite craft, but in the terminal. I am working on making a training dataset for it, but i just need a dataset for it.

submitted by /u/Inyourface3445
[link] [comments]

0

Can Someone Help Me With This Frontiers

So i want the dataset for autism detection using eeg and so i got up to this thing
https://datasetcatalog.nlm.nih.gov/dataset?q=0001446834
this would open the US gov NLM, now there we can see the Dataset uri but when i go there it has nothing in there’s just one docx file that i can download nothing else.

I tried with this diff paper source too
https://datasetcatalog.nlm.nih.gov/dataset?q=0000451693
but it has same outcome the dataset url takes to frontier and there we find just one .docx file.

So is that intended or the dataset is missing as they might not publish it. or do i need to do something else in order to get that.
This is my first time finding dataset from web, Else i would get it from kaggle all the time.

submitted by /u/Available-Fee1691
[link] [comments]

0

MIMIC-IV Data Access Query For Baseline Comparison

Hi everyone,

I have gotten access to the MIMIC-IV dataset for my ML project. I am working on a new model architecture, and want to compare with other baselines that have used MIMIC-IV. All other baselines mention using “lab notes, vitals, and codes”.

However, the original data has 20+ csv files, with different naming conventions. How can I identify which exact files these baselines use, which would make my comparison 100% accurate?

submitted by /u/One-Feeling03
[link] [comments]

0

(OC) Comprehensive Dataset Of Features Extracted From Seizure EEG Recordings

I have been working on a personal project to extract features from seizure EEG recordings that I thought I would share, with the goal to use this data to build a novel seizure detection model I have in mind,

The dataset can be found on Kaggle: Feature Extract – Siena Scalp + CHB MIT EEG Files

The features were extracted from publicly available EEG files in these two databases:

– Siena Scalp: https://physionet.org/content/siena-scalp-eeg/1.0.0/

– CHB MIT: https://physionet.org/content/chbmit/1.0.0/

I have tried to include as much as possible on how the features were calculated in the dataset description, but in general, the features were extracted based on these categories:

Differential Entropy
- Sample, Permutation, and Approximate Entropy
PSD Features
Seizure Propagation Speeds
Wavelet
Time Domain
Connectivity
Phase-Amplitude Coupling (PAC)
Rhythmic

A word of caution, however, is that I have not been able to have these calculations reviewed or verified by another human but I hope to have someone review it soon. It therefore should only be taken with a grain of salt at the moment but hope it is still useful in some way. I have been also going through the data to see if I can essentially prove what has already been proven, which is how I have been iteratively testing and verifying the data up to this point.

submitted by /u/bonesclarke84
[link] [comments]

0

UK News Media Dataset, Archive Or Similar.

Hi everyone! I’m new to this community. We’re currently working on a project proposal and we’re looking for a dataset of UK news media articles or access to an archive of such. It doesn’t have to be free.

Currently, I can only find archives of the media outlets themselves.

Basically, we want to create a corpus on a specific issue across different media outlets to track the debate.

Any help you can provide would be greatly appreciated. Thank you!

submitted by /u/Saltedcamelcookie
[link] [comments]

0

Non Scripted TV Show Transcripts Database

I am looking for a database that holds tv show transcripts of non scripted television. I was wondering if anyone could offer me an inclination as to where I can find some.

submitted by /u/Plus-Yam-3821
[link] [comments]

0

Platforms For Sharing Or Selling Very Large Datasets (like Kaggle, But Paid)?

I was wondering if there are platforms that allow you to share very large datasets (even terabytes of data), not just for free like on Kaggle but also with the possibility to sell them or monetize them (for example through revenue-sharing or by taking a percentage on sales). Are there marketplaces where researchers or companies can upload proprietary datasets (satellite imagery, geospatial data, domain-specific collections, etc.) and make them available on the cloud instead of through physical hard drives?

How does the business model usually work: do you pay for hosting, or does the platform take a cut of the sales?

Does it make sense to think about a market for very specific datasets (e.g. biodiversity, endangered species, anonymized medical data, etc.), or will big tech companies (Google, OpenAI, etc.) mostly keep relying on web scraping and free sources?

In other words: is there room for a “paid Kaggle” focused on large, domain-specific datasets, or is this already a saturated/nonexistent market?

submitted by /u/panspective
[link] [comments]

0

[PAID] Historical Dataset Of Over 100,000 Federal Reserve Series

Hey r/datasets, after a few weeks of working after hours, I put together a dataset that I’m quite proud of.

It contains over 100k unique series from the Federal Reserve (FRED) and it’s updated daily. There’s over 50 million observations last I checked and growing.

For those unaware, FRED contains all the economic data you can think of. Think inflation, prices, housing, growth, and other rates from city to country level. It’s foundational for great ML and data analytics across companies.

Data refreshes are orchestrated using Dagster nightly. I built in asset data quality checks to ensure each step is performing correctly along the way.

FRED Series Observations has a 30 day free trial. Please give it a try (and cancel before the time is up)! 🙂 And let me know how I can improve it!

Let me know if you like to learn more about how I built the job to bring in the data. I would be more than happy to a post about it!

TLDR: I created an economic dataset containing the complete history of every single series from the Federal Reserve. What should I build next?

submitted by /u/fruitstanddev
[link] [comments]

0

[PAID] Blinkist, Shortform, GetAbstract And Instaread Summaries Dataset

Data from blinkist, shortform, getAbstract and instaread websites both text + audio available.

Text is converted to epub + pdf & audio is in mp3 format.

Last update: September, 2025

Price: 25$ (which includes the future updates too)

submitted by /u/waqarHocain
[link] [comments]

0

[self-promotion] Free Company Datasets (millions Of Records, Revenue + Employees + Industry

I work at companydata.com, where we’ve provided company data to organizations like Uber, Booking, and Statista.

We’re now opening up free datasets for the community, covering millions of companies worldwide with details such as:

Revenue
Employee size
Industry classification

Our data is aggregated from trade registries worldwide, making it well-suited for analytics, machine learning projects, and market research.

GitHub: https://github.com/companydatacom/public-datasets
Website: https://companydata.com/free-business-datasets/

We’d love feedback from the r/data community — what type of business data would be most useful for your projects?

We gave the Creative Commons Zero v1.0 Universal license

submitted by /u/tok108
[link] [comments]

0

DeepFashion2: Comprehensive Fashion Dataset Suitable For Instance Segmentation, Object Recognition And Other Clothing Related Computer Vision.

QLike and subscribe, enjoy ☺️

submitted by /u/GO-Away_1234
[link] [comments]

0

[Offer] Free Custom Synthetic Dataset Generation – Seeking Feedback Partners For Open Source Tool

Hi r/datasets community!

I’m the creator of DeepFabric (https://github.com/lukehinds/deepfabric), an open-source tool that generates synthetic datasets using LLMs and novel approaches leveraging graphs (DAG) and Trees. I’m looking for collaborators who need custom datasets and are willing to provide feedback on quality and usefulness.

What DeepFabric does: DeepFabric creates diverse, domain-specific synthetic datasets using a unique graph/tree-based architecture. It generates data in OpenAI chat format with more formats coming, minimizes redundancy through structured topic generation.

What I’m offering: I’ll create custom synthetic datasets tailored to your specific domain or use case, cover all LLM API costs myself, provide technical support and customization, and generate datasets ranging from small proof-of-concepts to larger training sets.

What I’m looking for: I need detailed feedback on dataset quality, diversity, and usefulness, insights into how well the synthetic data performs for your specific use case, suggestions for improvements or missing features, and optionally a brief case study write-up of your experience.

Ideal collaborators: I’m particularly interested in working with researchers or developers working in a professional capacity, doing model distillation or evaluation benchmarks, or anyone needing training data for specialized or niche domains for machine learning / statistical analysis – a good example might be people working with limited real-world data availability. I have so far received really good feedback from a medical professor who needed data around mock scenarios of someone complaining about symptoms that could signal risk of heart attack.

Examples of what I can generate: Think Q&A pairs for specific technical domains, conversational data for chatbot training, domain-specific instruction-following datasets, or evaluation benchmarks for specialized tasks. I am also able to convert to whatever format you need.

If you’re interested, please comment or PM with your domain/use case, approximate dataset size needed, brief description of your intended use, and timeline if you have one.

I’ll prioritize collaborations that offer the most learning opportunities for both of us. Looking forward to working with some of you!

Some examples: medical Q&A: https://huggingface.co/datasets/lukehinds/medical_q_and_a

Programming Challenges: https://huggingface.co/datasets/lukehinds/programming-challenges-one

Repository: https://github.com/lukehinds/deepfabric
Documentation: https://lukehinds.github.io/DeepFabric/synethic data