Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

What Percentage Of Humans End Up Having Children In Their Lifetime?

I can’t find any articles talking about overall human populations. I’ve just had this question while researching ancient human life, natural selection, genetics, stuff like that. Do most people reproduce? Is it more 50/50? I know our population is still increasing, but people are also living longer. From a childfree perspective, it seems like 80% of the population has kids, but I’m probably not very accurate there lol.

submitted by /u/Puzzled_Ice3998

High-Energy UI Vocal Expressions & Speech Tokens [SAMPLE PACK]

I just launched a specialized vocal pack built specifically for indie game devs, gamified UIs, fitness apps, and conversational AI tools. The links below are to the [10-word] sample pack, which is available for download now! The complete pack includes 100 single-word vocal tokens such as Success, Level, Win, Combo, Wow, and Boost.

Specs:

  • Studio-Grade Audio: This audio is completely dry and background-reverb-free.
  • Pro Calibration: Standardized to -23 LUFS with a strict -1.0 dB True Peak ceiling and zero clipping or distortion.
  • Pipeline Ready: It includes a fully aligned mapping file for immediate ingestion.

If you would like to test the vocal quality in your project, check out the evaluation samples here:

I will be releasing a few more of these micro vocal packs, including a bundle item! Let me know if you check it out or if you would like something for your personal task!

submitted by /u/MarieDeVox

[Self-Promotion] HealthBench Multilingual: OpenAI’s Benchmark Translated To 30+ Languages

Hi there,

I wanted to share a multilingual version of OpenAI’s HealthBench dataset. It’s currently available in 32 languages, spoken by 5+ billion people.

Languages:

Amharic, Arabic, Bengali, Brazilian Portuguese, Chinese, Dutch, Estonian, Finnish, French, German, Hausa, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malay, Norwegian, Persian, Polish, Russian, Somali, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.

Dataset link: https://huggingface.co/datasets/projetogabi/healthbench-multilingual

Cheers

submitted by /u/larxel

Need Help Finding Construction Data In US

Hey guys, I’m working on a project and trying to figure out what data sources I’m still missing.

Still looking for good sources for:
State and local contract awards (DOTs, municipalities, utilities, etc.)
Utility interconnection queues (ERCOT, PJM, MISO, CAISO, SPP)
Data center / semiconductor / battery plant / LNG project tracking
Construction wage data by metro
Trade workforce retirement/aging data

Any suggestions or ideas?

submitted by /u/NelsoelBesto

A Website With Sourced Data To Compare Housing And Essential Service Costs Across Cities

Disclosure: I’m the creator of the website.

I have always considered this type of data useful, but I was never satisfied with the available alternatives, mainly for three reasons: 1) lack of transparency regarding the source, or the use of crowdsourced data; 2) missing, incomplete or unclear methodology; and 3) comparisons between data that are not always truly comparable.

That is why I decided to create this website: citycostatlas.com

All data has its source indicated — most of it comes from public institutions — and the methodology used to obtain the data is explained clearly. I try to ensure that the data being compared is actually comparable; when it is not fully comparable, this is indicated — for example, when comparing the sale price per m² of a house in the City of Helsinki with the value for the Greater City Area of Madrid, because they do not represent the same geographical/statistical area.

In this first version, I chose to include the capital cities of the European Union and some key costs: sale price per m² of apartments and houses, monthly rents for different dwelling types, household gas consumption between 20 GJ and 199 GJ, household electricity consumption between 2,500 kWh and 4,999 kWh, and water based on an annual consumption of 120 m³. The gas and electricity bands were chosen because they are intermediate, standardised household consumption categories used for comparison between countries. For water, I used 120 m³/year as a practical benchmark to make tariffs with different structures more comparable.
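To illustrate why a fixed consumption benchmark helps, here is a minimal sketch that prices 120 m³/year under two differently structured tariffs. The function name, the tariff layout (a fixed annual charge plus cumulative volume blocks), and every price below are hypothetical illustrations, not data or methodology from the site.

```python
def annual_water_cost(fixed_per_year, blocks, consumption_m3=120.0):
    """Annual bill for a tariff with a fixed charge plus volume blocks.

    blocks is a list of (upper_limit_m3, price_per_m3) pairs, where the upper
    limit is the cumulative annual volume at which the block ends. The last
    block should use float("inf") as its limit.
    """
    cost = fixed_per_year
    lower = 0.0
    for upper, price in blocks:
        if consumption_m3 <= lower:
            break  # consumption is already fully billed
        billed_volume = min(consumption_m3, upper) - lower
        cost += billed_volume * price
        lower = upper
    return cost


# Hypothetical tariffs, invented for this example only.
flat_city = annual_water_cost(0.0, [(float("inf"), 2.0)])
block_city = annual_water_cost(60.0, [(50.0, 1.0), (float("inf"), 2.5)])

print(flat_city)   # 240.0
print(block_city)  # 285.0
```

Pricing both tariffs at the same annual volume reduces a flat rate and a fixed-plus-block structure to a single comparable number.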

Suggestions, additional information or any errors you notice are welcome. Please contact: [migralept@gmail.com](mailto:migralept@gmail.com)

submitted by /u/miguelsims12

Help Finding A Minimum Wage Dataset For A School Project In Stata

hi all,

i’m having trouble finding a dataset to download that has minimum wage data by US state, along with the federal minimum wage and real vs nominal numbers. I found one that goes up to 2020, but i’m looking to go to 2024. i’ve been looking around on github and google but can’t find anything yet, and i don’t know how to scrape the table off the DOL website. can anyone please help me out? thanks

submitted by /u/IndominusTaco

I Scraped Over 2 Million Job Postings Across 100,000+ Company Career Sites Into A Unified, Daily-updated Dataset.

Over the past few months, I’ve been working on a high-scale scraping pipeline to aggregate listings directly from company job boards and applicant tracking systems. Mapping over 100,000 distinct companies to their career pages turned out to be a massive engineering headache, but it’s finally stable.

The result is a unified database of more than 2 million active job postings, which I’m opening up to everyone for free. I am running daily delta refreshes to keep it current.

Dataset Overview

  • Scale: 2M+ active job listings across 100,000+ unique companies.
  • Format: Parquet (to keep storage costs to a minimum).
  • Core Fields: job_title, company_name, company_website, job_description, location, post_date, and the original tracking URL. For more detailed info check here.
  • Update Cadence: Refreshed daily straight from the source.
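For anyone consuming daily snapshots of a dataset like this, a delta between two days can be sketched as below. This is one common approach and not necessarily how this pipeline works; keying on the tracking URL and the sample rows are assumptions for illustration.

```python
def snapshot_delta(previous, current):
    """Compare two snapshots (dicts keyed by posting URL).

    Returns three dicts: postings added, postings removed, and postings
    whose fields changed between the two snapshots.
    """
    added = {url: row for url, row in current.items() if url not in previous}
    removed = {url: row for url, row in previous.items() if url not in current}
    changed = {
        url: row
        for url, row in current.items()
        if url in previous and previous[url] != row
    }
    return added, removed, changed


# Hypothetical rows; only the field names job_title and location come from the post.
yesterday = {
    "https://example.com/jobs/1": {"job_title": "Data Engineer", "location": "Berlin"},
    "https://example.com/jobs/2": {"job_title": "Analyst", "location": "Remote"},
}
today = {
    "https://example.com/jobs/2": {"job_title": "Senior Analyst", "location": "Remote"},
    "https://example.com/jobs/3": {"job_title": "ML Engineer", "location": "Lisbon"},
}

added, removed, changed = snapshot_delta(yesterday, today)
print(sorted(added), sorted(removed), sorted(changed))
```

Treating a posting that disappears from the source as removed is what keeps an "active listings" table from accumulating stale rows.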

Why I Built This

Finding a clean, scaled, and up-to-date job dataset is surprisingly difficult. Most available options are either heavily gatekept by expensive subscription APIs or restricted to a single job board like LinkedIn. By scraping the actual employer sites directly, this collection sidesteps the noise and captures a much cleaner cross-section of the live market.

How to Access It

I set up a dedicated project space where you can grab the data directly: Open Job data

Let me know what kind of analysis or projects you end up running with it. If you have questions about the engineering architecture behind handling this scale, or ideas for specific fields you’d like to see enriched next, let’s discuss in the comments.

submitted by /u/Invicto_50

Business Profile Data API — Looking For Feedback On Fields, Samples, And Data Quality

Hi r/datasets,

Disclosure first: this is my own project.

I’m building FastBusiness API, a business/company profile data API.

The basic idea is:

Input:

  • business name
  • optional website
  • optional country

Output:

  • business name
  • website
  • business type
  • country
  • industry
  • sector
  • headquarters
  • short description
  • ABN/ACN where available
  • stock ticker / exchange where available
  • confidence score
  • source links

I built it because I kept needing structured company data for different projects, but the data was usually scattered across websites, public registers, directories, search results, and company pages.

The use cases I’m thinking about are:

  • CRM enrichment
  • lead-gen datasets
  • business directories
  • BI dashboards
  • ETL/testing datasets
  • market mapping
  • company research workflows

I’m mainly looking for feedback from people who use datasets/APIs regularly:

  1. Are these fields useful, or is anything obvious missing?
  2. Would CSV/JSON sample downloads be more useful than only API access?
  3. Would source links per field matter, or is one source list per company enough?
  4. Is an overall confidence score enough, or would field-level confidence be better?
  5. Would update/refresh timestamps matter for this kind of dataset?
  6. Would people here care more about bulk exports or real-time lookup?
  7. What sample size would be useful before trying something like this?
  8. Any concerns around using company profile data like this in downstream projects?

I’m happy to add a free sample dataset if that would be more useful for this subreddit.

Link: https://fastbusinessapi.com

submitted by /u/Nacez

Clinical AI Voice Dataset For Medical Terminology Benchmark (Free Preview)

Finding clean, high-fidelity speech data for niche clinical vocabulary is a serious pain point if you’re training transcription pipelines or benchmarking clinical ambient dictation systems. Most open speech datasets lack complex pharmaceutical dosing, specific anatomical paths, or continuous surgical transcription flows.

To help developers who are benchmarking speech-to-text (STT/ASR) or clinical text-to-speech (TTS) models, I’ve released a pristine, studio-isolated preview pack explicitly targeting complex medical terminology.

Dataset Specs:

  • Audio Resolution: 24-bit Signed Linear PCM Mono WAV
  • Acoustic Profile: True studio floor (no room echo/reflections), transparent noise gating, speech-optimized EQ.
  • Target Loudness: Calibrated to -23 LUFS (with an absolute peak ceiling capped at -1.0 dB).
  • Transcription Format: Dual-format out of the box. Includes standard pipe-separated `metadata.csv` (LJ Speech layout compliance) and a developer-grade `metadata.json` sidecar pipeline parser.
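A quick way to sanity-check the stated peak ceiling on decoded samples is sketched below. Note the assumption: this measures sample peak, not true peak. True peak (as defined in ITU-R BS.1770) is measured on an oversampled signal and can be higher, so passing this check is necessary but not sufficient. The function names are hypothetical, and the samples are assumed to be already decoded into signed 24-bit integers.

```python
import math

FULL_SCALE_24BIT = 2 ** 23  # magnitude of the most negative signed 24-bit sample


def sample_peak_dbfs(pcm_samples, full_scale=FULL_SCALE_24BIT):
    """Return the sample peak in dBFS for signed integer PCM samples."""
    peak = max((abs(s) for s in pcm_samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence or empty input
    return 20.0 * math.log10(peak / full_scale)


def within_ceiling(pcm_samples, ceiling_db=-1.0):
    """True if the sample peak does not exceed the ceiling (in dBFS)."""
    return sample_peak_dbfs(pcm_samples) <= ceiling_db


# Synthetic samples for illustration: one at half of full scale, one small.
print(round(sample_peak_dbfs([4194304, -1000]), 2))  # -6.02
print(within_ceiling([4194304, -1000]))              # True
```

Checking the -23 LUFS target needs the K-weighted loudness measurement from the same standard, which is beyond this sketch.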

The Free Preview Includes:

  1. `MED0003` — Complex Pathology Phonetics (*Oligodendroglioma*)

  2. `MED0012` — Pharmacological Dosing/Normalization Test (*Metoprolol succinate intravenous infusion*)

  3. `MED0028` — Continuous Surgical Flow Transcription

  4. `MED0032` — Clinical Dictation with Spoken Punctuation Integration (*Assessment and Plan Number one comma…*)

Data & Compliance:

  • 100% Opt-In Human Data: Completely human-voiced, verified data provenance. Zero scraping, zero synthetic generation fallbacks.
  • HIPAA / GDPR Safe: Scripts are strictly synthetic clinical scenarios containing completely fictional patient records with zero protected health information (PHI).

How to Access the Files Instantly:

Visit the following sites to access and download the sample pack:

Hugging Face: https://huggingface.co/datasets/MarieDeVox/clinical-voice-medical-terminology-mini

GitHub Repository: https://github.com/MarieDeVox/clinical-voice-medical-terminology-mini

Note: The data structures are built to be entirely plug-and-play with modern speech inference environments (Whisper fine-tuning, XTTS, etc.).

Please feel free to clone the preview pack and stress-test your pipelines. If you are tracking any specific word-error-rate (WER) improvements or pipeline constraints with these phonetically dense tracks, let me know! Thanks!

submitted by /u/MarieDeVox

What’s Your Playbook For Replacing A Legacy Access Pipeline With Python?

**What’s the best approach to migrate a legacy Access pipeline to Python when there’s no documentation?**

I’ve got a monthly MS Access data pipeline that processes ~375k rows across 26 European markets. It’s been built up over years with nested queries, correction tables, and lookup logic that nobody fully understands.

It works, but it’s fragile, slow, and entirely dependent on one process. I want to rebuild it in Python but I’m not sure where to start given the complexity.

The main challenges:
– Dozens of lookup tables that map raw data to business classifications (price bands, category codes, sub-categories)
– No primary keys, no version history, cryptic column names
– Queries that reference intermediate tables that reference other queries
– Years of manual corrections baked into the data with no record of what was changed or why

Has anyone successfully migrated something like this? What approach did you take? Particularly interested in how you handled extracting and validating the hidden business logic.

Happy to give more detail if it helps.

submitted by /u/SuperAMario

Looking For Program-level Normal Time / Credit Hours For Certificate Programs

Working on my dissertation using restricted BPS 12/17 data focused on certificate program completion. I need program-level normal time to completion (ideally in credit or clock hours) linkable by award level + CIP code + UNITID.

What I’ve already ruled out:

– IPEDS uses “normal time” for grad rate calculations but doesn’t publish it as a variable

– College Scorecard has program-level labor market outcomes but no program length

– The Gainful Employment disclosure files (from FSA) seemed relevant for this but again don’t include program length information

– NSLDS has what I need but requires restricted access I don’t have

I know this data must exist out there and it doesn’t seem like super sensitive information so I’m frustrated by the fact that I can’t find it.

Has anyone found a public dataset that includes this? Or a workaround that doesn’t involve manually scraping program pages?

submitted by /u/Sukky99

Built A Dataset Of 242 Credit Card Offers.

Hey everyone,

I got fed up with affiliate/referral sites when looking for credit card offers and decided to build my own dataset of credit card offers. I initially built it for myself but decided to release it so others can use it as well.

I hope folks on here will find this useful. I refreshed the dataset on 5/30, and if folks here like this kind of data, I’ll try to set up a weekly job to refresh it automatically.

For full transparency, this does not include any affiliate or referral links.

submitted by /u/_fat_santa

Construction Updated Datasets Requested For The US

Hello, I’m looking for large US data sets related to construction/infrastructure within the US. Ideally data less than a year old but anything up to 5 years would be helpful as well.

Some examples include: public award data at the state and local level, utility capital plans, state economic development plans (especially in California, Texas, and Ohio), and actual wage data. Willing to pay for data that is highly relevant and up to date.

* Not looking for photos of construction builds.

submitted by /u/NelsoelBesto

Free-tier Launch Of An Original, Studio-recorded Human Voice Dataset For SaaS & Call Bot NLU Training (LJ Speech + JSON Schemas)

I wanted to share an original speech/audio dataset I’ve been compiling. I operate a technical voice data pipeline and decided to build a studio-mastered dataset explicitly tailored for conversational, automated customer service and phone line (IVR) architectures.

If you search for open-source conversational speech data, almost everything out there is either heavily compressed web-scraped data with inconsistent noise floors, or read-speech audio books that lack natural, conversational cadence.

The Content:

– Highly structured, realistic transactional human conversational lines tailored for B2B SaaS, ticketing, routing, and telephony pipelines.

– Completely mapped to the standard LJ Speech layout (filename|transcription|normalized_transcription) for drag-and-drop ingestion into standard model pipelines.

– Every single premium audio file is paired with an independent JSON sidecar detailing precise syntax tagging, phonetic structures, and specific semantic intent mappings.
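Reading a pipe-separated metadata file in the three-column layout named above can be sketched as follows. The function name and the sample rows are hypothetical and do not come from the dataset; the only thing taken from the post is the column order.

```python
import csv
import io

FIELDS = ("filename", "transcription", "normalized_transcription")


def read_ljspeech_metadata(text):
    """Parse pipe-separated metadata text into a list of dicts.

    Quote characters are treated as ordinary text (QUOTE_NONE), since
    transcriptions may legitimately contain quotation marks.
    """
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    rows = []
    for line_no, parts in enumerate(reader, start=1):
        if not parts:
            continue  # skip blank lines
        if len(parts) != len(FIELDS):
            raise ValueError(
                f"line {line_no}: expected {len(FIELDS)} fields, got {len(parts)}"
            )
        rows.append(dict(zip(FIELDS, parts)))
    return rows


# Invented sample rows for illustration only.
sample = (
    "call_0001|Your ticket number is 42.|Your ticket number is forty-two.\n"
    "call_0002|Please hold.|Please hold.\n"
)
rows = read_ljspeech_metadata(sample)
print(rows[0]["normalized_transcription"])  # Your ticket number is forty-two.
```

Failing loudly on a wrong field count catches the most common ingestion problem, a stray pipe character inside a transcription.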

Acoustic Specs:

– Engineered in an acoustic studio at 24-bit/48kHz PCM WAV. The audio files have been passed through a targeted high-pass filter curve to strip low-end room artifacts and are normalized for uniform gain.

Sourcing & Compliance:

This is 100% human-generated, original acoustic data. Because I am the data creator, it is fully cleared, compliant, and legally indemnified. There is zero scraped web content or automated text-to-speech generation inside this pack.

The baseline sample block of the dataset is completely open and free to download. It includes a Full Commercial Use License, meaning you can integrate it into live client demos, public applications, or commercial pipelines right away without the need for a credit card.

Hugging Face Repository (Free Download): https://huggingface.co/datasets/MarieDeVox/saas-corporate-conversational-voice-sample

GitHub (Free Download): https://github.com/MarieDeVox/saas-corporate-voice-dataset-sample

DISCLAIMER: I am the creator and independent owner of this dataset. While the sample block linked above is completely free with a full commercial license to keep forever, I do host full enterprise production expansions.

If you download the repository and play around with the mapping this weekend, let me know if you run into any parsing issues or formatting bottlenecks!

submitted by /u/MarieDeVox

I Built An Open-source Dataset Of Every Major US Layoff

The federal WARN Act requires employers with 100+ workers to give 60 days notice before mass layoffs or plant closings (thresholds vary by state, but roughly 50+ jobs lost). That data is scattered across 50 state websites, each with its own format, broken links, and no API.

I think it should be easy-to-access public data, so I built a fully open-source aggregator for it.

Live app: https://layoffs.kadoa.com/

Repo: https://github.com/kadoa-org/layoffs-tracker

submitted by /u/madredditscientist

Disaster History And Live Feeds Upgrades

I’ve been working more on unifying all my datasets and adding live collectors.

So far earthquakes, tsunamis, and volcanoes are the strongest. Hurricanes are pretty solid, but wildfires are taking more work since they’re more cross-sourced and each country has its own agencies that give the best data.

I’ve been working more on the self-hosted lane as well. You can download it from GitHub, and I’m trying to make a better executable that makes setup easier, plus a bit of a pack installer store (store is a relative word; all the packs are free to download for self-hosting).

https://www.daedalmap.com/feeds

submitted by /u/Xyver

I Built An Open, Version-controlled Emission Factor Dataset Aligned To IPCC AR6 GWP-100 — Free To Use And Cite

I was building GreenCalculus (carbon accounting/calculator platform — disclosure: it’s my project) and kept running into the same problem:

There’s no single clean, open, version-controlled emission factor dataset aligned to IPCC AR6 GWP-100.

The data exists, but it’s scattered across:

  • DEFRA
  • EPA
  • IEA
  • IPCC PDFs

…with different units, different GWP vintages, and almost no visibility into what changed between versions.

So I consolidated it into one open repo:

https://github.com/greencalculus/greencalculus-methodology

Everything is free, public, and downloadable. No signup, no API key.

What’s inside:

  • gwp-values.json AR6 + AR5 values side-by-side for 16 greenhouse gases.
  • emission-factors.json + .csv Scope 1 fuel combustion + Scope 2 electricity grid factors across 15 countries.
  • METHODOLOGY.md Full calculation methodology with formulas + source references.
  • CITATION.cff Makes it easy to cite in BibTeX / APA.

One thing I think carbon accounting software gets wrong:

Emission factors should behave like versioned code dependencies.

If a methane GWP changes, you should be able to diff it, trace it, and reproduce historical outputs exactly.

Git is honestly a better audit trail than most ESG software I’ve seen.

Interesting migration issue I noticed while compiling this:

A lot of inventories still use older methane GWPs.

  • AR4 CH4 = 25
  • AR5 CH4 = 28
  • AR6 fossil CH4 = 29.8

So moving from AR4 → AR6 increases fossil methane impact by ~19% using the exact same activity data.

Even AR5 → AR6 is still about +6%.
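The percentages above follow directly from the ratio of the GWP values. A minimal sketch of the arithmetic, using only the three CH4 values quoted in the post (the function names and the 10-tonne activity figure are hypothetical):

```python
# CH4 GWP-100 values as quoted in the post.
CH4_GWP100 = {"AR4": 25.0, "AR5": 28.0, "AR6_fossil": 29.8}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def co2e_tonnes(ch4_tonnes, gwp):
    """Convert a mass of methane to CO2-equivalent using a given GWP."""
    return ch4_tonnes * gwp


ar4_to_ar6 = pct_change(CH4_GWP100["AR4"], CH4_GWP100["AR6_fossil"])
ar5_to_ar6 = pct_change(CH4_GWP100["AR5"], CH4_GWP100["AR6_fossil"])
print(f"AR4 -> AR6: +{ar4_to_ar6:.1f}%")  # AR4 -> AR6: +19.2%
print(f"AR5 -> AR6: +{ar5_to_ar6:.1f}%")  # AR5 -> AR6: +6.4%

# Hypothetical activity data: the same 10 t of fossil CH4 under each vintage.
for vintage, gwp in CH4_GWP100.items():
    print(vintage, co2e_tonnes(10.0, gwp))
```

Because CO2e is a plain product of mass and GWP, the percentage shift in reported methane emissions equals the percentage shift in the GWP itself.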

PRs/corrections are genuinely welcome.

And if you just want to calculate emissions instead of building your own model:

https://greencalculus.com/calculators/

Happy to answer methodology questions or discuss factor provenance/versioning.

submitted by /u/greencalculus

Do You Consider Synthetic Datasets Useful For Real-world Data Work?

I’ve been thinking about the role of synthetic datasets in data projects, especially now that LLMs and generative models make data generation much easier.

On one hand, synthetic data can help with privacy, class imbalance, rare cases, benchmarking, and testing pipelines when real data is limited or sensitive.

On the other hand, I’m not sure how people evaluate whether a synthetic dataset is actually useful rather than just plausible-looking. Distribution shift, hidden bias, leakage from source data, and weak evaluation seem like real risks.

For people who have used synthetic datasets in practice: when did they work well, and when did they fail?

Also, what checks or metrics do you use before trusting a synthetic dataset for training, evaluation, or analysis?

Thanks in advance for any thoughts. This is especially important for me because one of the core directions I’m working on in OpenDCAI/DataFlow is large-scale synthetic data generation, and a recurring challenge is figuring out whether the synthetic data is actually useful.

submitted by /u/Puzzleheaded_Box2842

Extracting Current Company Executives From SEC Filings Is A Trap If You Start With The “obvious” Source. Here Is What Actually Works.

I recently rebuilt how I pull the executive roster (CEO plus named officers, with titles) for US public companies straight from SEC data. Coverage went from about 6,400 officer rows across 3,242 companies to roughly 62,500 rows across 24,358 companies, so close to 10x the rows and 7.5x the company coverage. Here is the journey, because the naive approach fails in interesting ways.

Attempt 1: the proxy statement (DEF 14A)

The intuitive source is the annual proxy. Since fiscal 2022 the SEC standardized the “Pay versus Performance” disclosure as inline XBRL, and there is a tag literally called ecd:PeoName (Principal Executive Officer Name). Perfect, right?

Not really. A lot of large filers tag the compensation numbers but never tag ecd:PeoName. Microsoft and Alphabet both returned exactly 0 officers for me this way. The names, when present at all, hide in a footnote text block (ecd:NamedExecutiveOfficersFnTextBlock) that:

  • usually names only the non-CEO officers, not the CEO,
  • is sometimes first-names-only (“Ruth, Philipp, and Kent”),
  • and for some filers is just an HTML table grid with no names at all.

The 10-K does not tag executive names either. Net coverage from the proxy route was only about 3,242 companies, and the CEO name was frequently the thing missing.

Attempt 2: Section 16 insider filings (Forms 3, 4, 5)

Every officer, director, and 10% owner of a US issuer files these, and they are structured XML with a reporting owner block: name, the owner’s own CIK, isOfficer / isDirector / isTenPercentOwner flags, and an officer title.

This is dramatically better, and one field is the hero: the owner’s CIK. It is a stable per-person identifier and it showed up on 100% of the officer rows in my data (1,753,055 of 1,753,055). Dedup by CIK and you collapse every name-spelling variant automatically, including surname changes that name matching can never catch. Real example: the same person filed as “Tabak Emily N” and later “Epstein Emily T”. Same CIK, one person. No fuzzy string matching survives that.
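A minimal sketch of the dedup step, assuming the Section 16 rows have already been parsed into dicts. The key names (`owner_cik`, `name`, `officer_title`), the CIK values, the titles, and the second person are all made up for illustration; only the two name spellings come from the example above.

```python
from collections import defaultdict


def group_by_owner_cik(rows):
    """Collapse officer rows into one record per reporting-owner CIK.

    Every name spelling and title seen for a CIK is kept, so variants
    and surname changes end up on the same person.
    """
    people = defaultdict(lambda: {"names": set(), "titles": set()})
    for row in rows:
        person = people[row["owner_cik"]]
        person["names"].add(row["name"])
        if row.get("officer_title"):
            person["titles"].add(row["officer_title"])
    return dict(people)


# Placeholder CIKs and titles, not real identifiers.
rows = [
    {"owner_cik": "0000000001", "name": "Tabak Emily N", "officer_title": "General Counsel"},
    {"owner_cik": "0000000001", "name": "Epstein Emily T", "officer_title": "General Counsel"},
    {"owner_cik": "0000000002", "name": "Doe Jane", "officer_title": "CFO"},
]

people = group_by_owner_cik(rows)
print(len(people))                             # 2
print(sorted(people["0000000001"]["names"]))   # ['Epstein Emily T', 'Tabak Emily N']
```

Keeping the full set of spellings per CIK, rather than picking one early, leaves the choice of display name to a later step.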

For dates, you join each filing to its filing date. Form 3, the initial statement, has no transaction date, so the filing date is your only signal for “first seen”. Last-seen doubles as a soft departure signal, since people stop filing once they leave.
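The first-seen and last-seen signals can be sketched as below, together with a recency check that uses the 18-month window mentioned under challenge 5. The function names, the placeholder CIKs, the dates, and the exact definition of "18 months" (fewer than 18 full calendar months elapsed) are assumptions for illustration.

```python
from datetime import date


def first_and_last_seen(filings):
    """Map each owner CIK to (first filing date, last filing date).

    filings is an iterable of (owner_cik, filing_date) pairs.
    """
    seen = {}
    for cik, filed in filings:
        first, last = seen.get(cik, (filed, filed))
        seen[cik] = (min(first, filed), max(last, filed))
    return seen


def full_months_between(start, end):
    """Number of complete calendar months from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


def is_current(last_seen, as_of, window_months=18):
    """Treat a person as current if they filed within the recency window."""
    return full_months_between(last_seen, as_of) < window_months


# Placeholder CIKs and dates.
filings = [
    ("0000000001", date(2021, 3, 1)),
    ("0000000001", date(2024, 11, 15)),
    ("0000000002", date(2022, 5, 20)),
]
seen = first_and_last_seen(filings)
as_of = date(2025, 6, 1)
print({cik: is_current(last, as_of) for cik, (_, last) in seen.items()})
# {'0000000001': True, '0000000002': False}
```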

The full rebuild produced 62,561 officers across 24,358 issuers and runs in about 1.4 seconds against the local DB. As a sanity check, of the 3,372 companies that had proxy compensation data, only 47 ended up with zero officers under the insider approach, and those were mostly tiny or unusual structures where officers genuinely do not file Form 4.

The challenges nobody warns you about

  1. Names are stored “Last First Middle”, often ALL CAPS, sometimes with a leading initial. “Keith R. Alexandra” is really Alexandra Keith. You have to skip leading initials when picking the display first name, without mangling a genuine two-letter name like “Bo”.
  2. The title field is free text and lies about who the CEO is. About 10,000 people across the market carry a CEO-ish title, and they are not all “the CEO”.
    • Some CEOs file as “Chairman” with no “CEO” in the title at all (Coca-Cola’s James Quincey).
    • Title lag happens. A newly promoted CEO can keep filing under the old title (COO) for months.
    • Divisional and subsidiary CEOs flood the data. JPMorgan has six people with “CEO” in their title (Co-CEO CIB, CEO CCB, CEO Asset and Wealth Management, and so on). Amazon has five, including a CEO of AWS and a CEO of Worldwide Stores. None of those is the principal CEO.
    • Genuine co-CEOs exist (Netflix has two), so you cannot just take one.
  3. Telling the principal CEO apart from a divisional one. What worked: survey the actual distribution of titles (the long tail is real, “Chief Executive Officer” covers about 2,700 people, “President and CEO” about 1,200, plain “CEO” about 700, then hundreds of unit-specific variants), then apply a “connector subtraction” rule. Strip the CEO phrase plus a known set of role and connector words (Chairman, President, Director, Founder, Interim, Co, CFO, and friends). If a business unit word is left over (“CCB”, “Amazon Web Services”, “Beauty”), it is divisional. If nothing is left, it is the principal CEO. I surface divisional CEOs as their own category rather than hiding them, since “who runs AWS” is useful.
  4. Foreign private issuers (think ASML, SAP, Shopify) are exempt from Section 16. They file no Forms 3, 4, or 5, so this source gives you nothing for them. Worth knowing before you promise global coverage.
  5. Last-seen is noisy. A CEO who trades rarely (Satya Nadella can go about 6 months between filings) looks stale even though he is very much active. So “current officer” has to be a recency window (I use 18 months), not an exact cutoff.
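Two of the rules above, the leading-initial skip from challenge 1 and the "connector subtraction" from challenge 3, can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the function names and return labels are hypothetical, the connector set extends the words listed in the post with a few assumed extras, and the name helper assumes a single-token surname.

```python
import re

# Role and connector words to subtract. The first seven come from the post;
# "and", "of", "the", and "board" are assumed additions for illustration.
CONNECTORS = {
    "chairman", "president", "director", "founder", "interim", "co", "cfo",
    "and", "of", "the", "board",
}


def classify_ceo_title(title):
    """Label a free-text title as principal, divisional, or no_ceo_phrase."""
    text = re.sub(r"\bchief\s+executive\s+officer\b", "ceo", title.lower())
    tokens = re.findall(r"[a-z0-9]+", text)
    if "ceo" not in tokens:
        return "no_ceo_phrase"
    # Whatever survives the subtraction is treated as a business-unit word.
    leftover = [t for t in tokens if t != "ceo" and t not in CONNECTORS]
    return "divisional" if leftover else "principal"


def display_name(raw):
    """Turn 'Last [Initial] First [Middle]' into 'First Last'."""
    tokens = raw.split()
    if not tokens:
        return ""
    last, rest = tokens[0], tokens[1:]
    # An initial is a single letter, optionally followed by a period,
    # so a genuine two-letter name like "Bo" is kept.
    non_initials = [t for t in rest if len(t.rstrip(".")) > 1]
    first = non_initials[0] if non_initials else (rest[0] if rest else "")
    name = f"{first} {last}".strip()
    return name.title() if raw.isupper() else name


print(classify_ceo_title("President and CEO"))        # principal
print(classify_ceo_title("Co-CEO CIB"))               # divisional
print(classify_ceo_title("CEO Amazon Web Services"))  # divisional
print(display_name("Keith R. Alexandra"))             # Alexandra Keith
```

A title with no CEO phrase at all, such as a plain "Chairman", falls outside this rule and still needs a separate signal, which is the gap challenge 2 describes.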

Takeaway: for US executive rosters, skip the proxy XBRL and build on Section 16 insider filings keyed by the reporting person’s CIK. Treat the title as a hint, not gospel, and handle divisional CEOs explicitly.

Bonus context: I work on a US stock market data API called StockFit API and went through all of this while rebuilding the executives endpoint. Happy to go deeper on any of it: SEC XBRL, Section 16 parsing, dedup strategy, whatever. Ask me anything.

submitted by /u/Either_Door_5500

Best Datasets For Neonatology? Preferably Low Barrier To Entry

MIMIC-III has NICU data and is relatively easy to get access to (i.e. just fill out a form on PhysioNet, take an ethics course). Most others, e.g. Pediatrix, seem to only give you data if you have a formal partnership with them via your university.

I recently graduated med school in the EU and my supervisor is the chair of the neonatology department if that makes any difference as far as obtaining data goes. I don’t mind filling out some forms, but I don’t think trying to have my university make official partnerships is likely to happen (although again, I could name drop my supervisor and have this person sign some forms if necessary). Similarly, I doubt my university would pay thousands of dollars to have access to this type of data.

In general, I want datasets with as many clinical variables as possible (e.g. lab values, outcomes, meds, etc.).

Any suggestions would be greatly appreciated.

submitted by /u/Far_Excitement_4430

Need Large-Scale Indian Audio/Music Dataset (100k+ Hours) For AI/ML Training

Hi Everyone,

I’m looking for large-scale Indian audio/music datasets (100,000+ hours preferred) mainly containing:
– Indian songs/music
– Vocals
– Bollywood music
– Regional language audio
– Speech + music mixed data
– Instrumental/music tracks

Purpose is AI/ML training and audio research.

I’m okay with both:
– Commercial datasets
– Non-commercial/free datasets

Would appreciate suggestions for:
– Indian music datasets
– Open-source audio datasets
– Hugging Face/Kaggle datasets
– Large audio archives
– APIs/platforms with Indian audio
– Any legal bulk audio source

If anyone has worked on similar projects or knows good sources, please share links/suggestions.

Thanks!

submitted by /u/No_Wafer_2023