Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Sources For European Energy / Weather Data?

Around 2018, towards the end of my PhD in math, my university hired me to work on a European project funded under Horizon 2020, whose goal was predicting energy consumption and prices.

I would like to publish some updated predictions from the models we built under a public-domain license. The problem is that I can’t reuse the original data to validate the models, because it was commercially sourced. My question is: where can I find reliable historical data on weather and on energy consumption and production in the European Union?

submitted by /u/servermeta_net
[link] [comments]

Indian Language Speech Datasets Available (Explicit Consent From Contributors)

Hi all,

I’m part of a team collecting speech datasets in several Indian languages. All recordings are collected directly from contributors who provide explicit consent for their audio to be used and licensed.

The datasets can be offered with either exclusive or non-exclusive rights depending on the requirement.

If you’re working on speech recognition, text-to-speech, voice AI, or other audio-related ML projects and are looking for Indian language data, feel free to get in touch. Happy to share more information about availability and languages covered.

— Divyam Bhatia
Founder, DataCatalyst

submitted by /u/Trick-Praline6688
[link] [comments]

[Self Promotion] Feature-Extracted Human And Synthetic Voice Datasets – Free Research Use, Legally Clean, No Audio

tl;dr: Feature-extracted human and synthetic speech datasets, free for research and non-commercial use.

Hello,

I am building a pair of datasets. First, the Human Speech Atlas contains prosody and voice telemetry extracted from Mozilla Data Collective datasets: currently 90+ languages and 500k samples of normalized data, with all PII scrubbed. Current plans are to expand to 200+ languages.

Second, the Synthetic Speech Atlas contains synthetic-voice feature extractions covering a wide variety of vocoders, codecs, deepfake attack types, etc. It passed 1 million samples a little while ago and should top 2 million by completion.

The data dictionary and methods are up on Hugging Face.

https://huggingface.co/moonscape-software

This is my first real foray into dataset construction, so I’d love some feedback.

submitted by /u/Wooden_Leek_7258
[link] [comments]

[self-promotion] Big Tech Quarterly CapEx-to-Revenue Ratio (2015–2026) — 210 Observations From SEC EDGAR, CSV + Python Script

Dataset covering quarterly capital expenditures as a percentage of revenue for AAPL, MSFT, GOOGL, AMZN, META, NVDA.

Extracted from SEC EDGAR XBRL API (10-K/10-Q cash flow statements). The tricky part was decomposing YTD cumulative figures into individual quarters.
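That decomposition is essentially first-differencing of cumulative figures within each fiscal year; a minimal sketch of the idea (not the repo’s actual script, and it ignores fiscal-year boundaries and restatements):

```python
def quarterly_from_ytd(ytd):
    """Turn cumulative year-to-date figures [Q1, Q1+Q2, ...] into individual
    quarterly values by differencing consecutive entries within one year."""
    out, prev = [], 0
    for value in ytd:
        out.append(value - prev)
        prev = value
    return out

# YTD capex of 10, 25, 45, 70 decomposes into 10, 15, 20, 25 per quarter.
print(quarterly_from_ytd([10, 25, 45, 70]))
```

In real filings the Q4 figure usually has to be derived from the 10-K full-year total minus the Q3 YTD figure, which is the fiddly part the post alludes to.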

CSV + Python extraction script: https://github.com/eco3min/sec-capex-tracker

Interactive chart + full methodology: https://eco3min.fr/en/big-tech-capex-revenue-ratio-quarterly-dataset/

Columns: ticker, company, calendar_year, calendar_quarter, quarter_end_date, revenue_usd_millions, capex_usd_millions, capex_to_revenue_pct

License: MIT (code) / CC BY 4.0 (data)

submitted by /u/Low_Ability4450
[link] [comments]

Irish Oireachtas Voting Records — 754k Rows, Every Dáil And Seanad Division [FREE]

Built this because there was no clean bulk download of Irish parliamentary votes anywhere. Pulled from the Oireachtas Open Data API and flattened into one row per member per vote — 754,000+ records going back to 2002.

Columns: date, house, TD/Senator name, party, constituency, subject, outcome, vote (Tá/Níl/Staon)

Free static version on Kaggle: https://www.kaggle.com/datasets/fionnhughes/irish-oireachtas-records-all-td-and-senator-votes

submitted by /u/Cool_Law_8915
[link] [comments]

[self-promotion] 4GB Open Dataset: Congressional Stock Trades, Lobbying Records, Government Contracts, PAC Donations, And Enforcement Actions (40+ Government APIs, AGPL-3.0)

Built a civic transparency platform that aggregates data from 40+ government APIs into a single SQLite database. The dataset covers 2020-present and includes:

  • 4,600+ congressional stock trades (STOCK Act disclosures + House Clerk PDFs)
  • 26,000+ lobbying records across 8 sectors (Senate LDA API)
  • 230,000+ government contracts (USASpending.gov)
  • 14,600+ PAC donations (FEC)
  • 29,000+ enforcement actions (Federal Register)
  • 222,000+ individual congressional vote records
  • 7,300+ state legislators (all 50 states via OpenStates)
  • 4,200+ patents, 60,000+ clinical trials, SEC filings

All sourced from: Congress.gov, Senate LDA, USASpending, FEC, SEC EDGAR, Federal Register, OpenFDA, EPA GHGRP, NHTSA, ClinicalTrials.gov, House Clerk disclosures, and more.

Stack: FastAPI backend, React frontend, SQLite. Code is AGPL-3.0 on GitHub.

submitted by /u/Prestigious-Wrap2341
[link] [comments]

[DATASET][PAID] 1 Million Labeled Hinglish Dataset — Available For Licensing

Hey everyone, I’ve spent months building a large-scale Hinglish dataset and I’m making it available for licensing.

What’s in it:

– 1,000,000 real Hinglish samples from social media

– 6 labels per entry: intent, emotion, toxicity, sarcasm, language tag

– Natural conversational Hinglish (not translated; how people actually type)

Why it matters: Hinglish is how 300M+ Indians actually communicate online. Most existing datasets are either pure Hindi or pure English. This fills a real gap for anyone building India-focused NLP models, chatbots, or content moderation systems.

Sample labels include:

– Intent: Appreciation / Request / Question / Neutral

– Emotion: Happy / Sad / Angry / Surprised / Neutral

– Toxicity: Low / Medium / High

– Sarcasm: Yes / No

Licensing:

– Non-exclusive: $20,000 (multiple buyers allowed)

– 5,000-sample teaser available for evaluation before purchase

Who this is for:

– AI startups building for Indian markets

– Researchers working on code-switching or multilingual NLP

– Companies building content moderation for Indian platforms

Check the teaser here: https://github.com/theYugrathee/1-million-hinglish-dataset-sample-of-5k-/blob/main/hinglish_dataset_teaser.json

Drop a comment or DM if interested!

Disclosure: I am the creator and seller of this dataset.

submitted by /u/UniqueProfessional81
[link] [comments]

Scaling A RAG-based AI For Student Wellness: How To Ethically Scrape & Curate 500+ Academic Papers For A “White Box” Social Science Project?

Hi everyone!

I’m part of an interdisciplinary team (Sociology + Engineering) at Universidad Alberto Hurtado (Chile). We are developing Tuküyen, a non-profit app designed to foster self-regulation and resilience in university students.

Our project is backed by the Science, Technology, and Society (STS) Research Center. We are moving away from “Black Box” commercial AIs because we want to fight Surveillance Capitalism and the “Somatic Gap” (the physiological dysregulation caused by addictive UI/UX).

The Goal: Build a Retrieval-Augmented Generation (RAG) system using a corpus of ~500 high-quality academic papers in Sociology and Psychology (specifically focusing on somatic regulation, identity transition, and critical tech studies).

The Technical Challenge: We need to move from a manually curated set of 50 papers to an automated pipeline of 500+. We’re aiming for a “White Box AI” where every response is traceable to a specific paragraph of a peer-reviewed paper.

I’m looking for feedback on:

  1. Sourcing & Scraping: What’s the most efficient way to programmatically access SciELO, Latindex, and Scopus without hitting paywalls or violating terms? Any specific libraries (Python) you’d recommend for academic PDF harvesting?
  2. PDF-to-Text “Cleaning”: Many older Sociology papers are messy scans. Beyond standard OCR, how do you handle the removal of “noise” (headers, footers, 10-page bibliographies) so they don’t pollute the embeddings?
  3. Semantic Chunking for Social Science: Academic prose is dense. Does anyone have experience with Recursive Character Text Splitting vs. Semantic Chunking for complex theoretical texts? How do you keep the “sociological context” alive in a 500-character chunk?
  4. Vector DB & Costs: We’re on a student/research budget (~$3,500 USD total for the project). We need low latency for real-time “Somatic Interventions.” Pinecone? Milvus? Or just stick to FAISS/ChromaDB locally?
  5. Ethical Data Handling: Since we deal with student well-being data (GAD-7/PHQ-9 scores), we’re implementing Local Differential Privacy. Any advice on keeping the RAG pipeline secure so the LLM doesn’t “leak” user context into the global prompt?
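On question 3, the recursive strategy can be sketched in a few lines; this is a toy version (separators and sizes are illustrative), with the usual trick of preferring coarse boundaries before fine ones:

```python
def recursive_split(text, seps=("\n\n", "\n", ". ", " "), max_len=500):
    """Split text on the coarsest separator first, recursing to finer ones
    only for pieces that still exceed max_len, so paragraph and sentence
    boundaries (and with them some context) survive wherever possible."""
    if len(text) <= max_len:
        return [text]
    if not seps:
        # No separators left: hard-cut as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    chunks, buf = [], ""
    for part in text.split(sep):
        if len(part) > max_len:
            if buf:
                chunks.append(buf)
                buf = ""
            chunks.extend(recursive_split(part, rest, max_len))
            continue
        candidate = buf + sep + part if buf else part
        if len(candidate) <= max_len:
            buf = candidate
        else:
            chunks.append(buf)
            buf = part
    if buf:
        chunks.append(buf)
    return chunks
```

One way to keep the sociological context alive is to prepend the paper title and section heading to every chunk before embedding, so even a 500-character chunk announces which argument it belongs to.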

Background/Theory: We are heavily influenced by Shoshana Zuboff (Surveillance Capitalism) and Jonathan Haidt (The Anxious Generation). We believe AI should be a tool for autonomy, not a new form of “zombification” or behavioral surplus extraction.

Any advice, repo recommendations, or “don’t do this” stories would be gold! Thanks from the South of the world! 🇨🇱

submitted by /u/Spare-Customer-506
[link] [comments]

Building A Dataset Estimating The Real-time Cost Of Global Conflicts — Looking For Feedback On Structure/methodology

I’ve been working on a small project to estimate and standardize the cost of ongoing global conflicts into a usable dataset.

The goal is to take disparate public sources (SIPRI, World Bank, government data, etc.) and normalize them into something consistent, then convert into time-based metrics (per day / hour / minute).

Current structure (simplified):

– conflict / region

– estimated annual cost

– derived daily / hourly / per-minute rates

– last updated timestamp

– source references

A couple of challenges I’m running into:

– separating baseline military spending vs conflict-attributable cost

– inconsistent data quality across regions

– how to represent uncertainty without making the dataset unusable

I’ve put a simple front-end on top of it here:

https://conflictcost.org

Would really appreciate input on:

– how you’d structure this dataset differently

– whether there are better source datasets I should be using

– how you’d handle uncertainty / confidence levels in something like this

Happy to share more detail if helpful.

submitted by /u/eisseseisses
[link] [comments]

1M+ Explainable Linguistic Typos (Traceable JSONL, C-Based Engine)

I’ve managed to make a “Mutation Engine” that can generate (currently) 17 linguistically-inspired errors (metathesis, transposition, fortition, etc.) with a full audit trail.

The Stats:

  • Scale: 1M rows generated in ~15 seconds (written in C; ~0.75 microseconds per operation).
  • Traceability: Every typo includes the logical reasoning and step-by-step logs.
  • Format: JSONL.

Currently, it’s English-only and has a known minor quirk with the duplication operator (it occasionally emits a U+0000 null character).
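For anyone unfamiliar with the operations named above, a metathesis mutation with a step-by-step trace can be sketched in Python (the output fields here are illustrative, not the engine’s actual JSONL schema):

```python
import json
import random

def metathesis(word, rng):
    """Swap two adjacent characters and record exactly what was done,
    mimicking one traceable 'mutation engine' audit entry."""
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return {
        "original": word,
        "typo": "".join(chars),
        "operation": "metathesis",
        "trace": f"swapped positions {i} and {i + 1} ({word[i]!r} <-> {word[i + 1]!r})",
    }

print(json.dumps(metathesis("linguistics", random.Random(42))))  # one JSONL row
```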

Link here.

I’m curious if this is useful for anyone’s training pipelines or something similar, and I can make custom sets if needed.

submitted by /u/Nitro224
[link] [comments]

Failure Data > Success Data

I’ve been thinking about this a lot recently:

Most teams training LLMs for workflows focus heavily on successful traces — clean executions, ideal outputs, perfect tool calls.

But in real systems, that’s not where the useful signal is.

The interesting part is actually:

  • where the model breaks
  • where it calls the wrong tool
  • where it loops or stalls in multi-step flows

That’s where you start seeing patterns.

It almost feels like we’re missing a layer of training data that explicitly captures:
→ failure states
→ retries
→ decision mistakes

Instead of just “what to do,” we need “what not to do.”

Curious if others here are logging and structuring failure traces systematically, or just patching issues ad hoc?

(We’ve been experimenting with datasets around this at dinodsai.com — still early, but the shift in behavior is noticeable)

submitted by /u/JayPatel24_
[link] [comments]

[Dataset] Live Geopolitical Escalation Event Feed – AI-scored, Structured JSON, Updated Every 2h (free Public API)

I built and run a geopolitical signal aggregator that ingests RSS from BBC, Reuters, Al Jazeera, and Sky News every 2 hours, runs each conflict-relevant article through an AI classifier (Gemini 2.5 Flash), and stores the output as structured events. I'm sharing the free public API here in case it's useful for research or ML projects.

**Disclosure:** I'm the builder. There's a paid plan on the site for higher-rate access, but the endpoints below are fully open with no auth required.

**Schema — single event object:**

```json
{
  "zone": "iran_me",
  "event_type": "military_action",
  "direction": "escalatory",
  "weight": 1.5,
  "summary": "US strikes bridge in Karaj, Iran vows retaliation.",
  "why_matters": "Direct US military action against Iran escalates regional conflict.",
  "watch_next": "Iran's retaliatory actions; US response.",
  "source": "Al Jazeera",
  "lat": 35.82,
  "lng": 50.97,
  "ts": 1775188873600
}
```

**Fields:**

- `zone` — conflict region: `iran_me`, `ukraine_ru`, `taiwan`, `korea`, `africa`, `other`
- `event_type` — `military_action`, `rhetorical`, `diplomatic`, `chokepoint`, `mobilisation`, `other`
- `direction` — `escalatory`, `deescalatory`, `neutral`
- `weight` — fixed scale from −2.0 to +3.0 (anchored to reference events: confirmed airstrike = +1.0, major peace deal = −2.0, direct superpower strike on sovereign territory = +2.0)
- `summary`, `why_matters`, `watch_next` — natural language fields from the classifier
- `lat`, `lng` — approximate geolocation of the event
- `ts` — Unix timestamp in milliseconds

**Free endpoints (no auth, no key):**

GET https://ww3chance.com/api/events?limit=500 — 72h event feed
GET https://ww3chance.com/api/zones — zone score breakdown
GET https://ww3chance.com/api/history?days=7 — 7-day composite score time series
GET https://ww3chance.com/api/score — current index snapshot

**Current snapshot (as of today):**

- 53 events in the last 72 hours
- Zones active: Iran/ME (zone score 13.29), Other (0.47), Ukraine/Russia (0.12)
- Event type breakdown in this window: military actions, chokepoint signals, diplomatic moves, rhetorical escalation
- 7-day index range: 13.5% → 15.2%

**Potential uses:**

- Training conflict/event classification models
- NLP benchmarking on structured real-world news events
- Time-series correlation analysis (e.g. against VIX, oil futures, shipping indices)
- Geopolitical sentiment analysis
- Testing event-detection pipelines against live data

Full methodology (weight calibration, decay formula, source credibility rules, comparison to the Caldara-Iacoviello GPR index) is documented at ww3chance.com/methodology

Happy to answer questions about the classification approach, known limitations, or the data structure.
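A consumer-side sketch for filtering the feed (the helper is pure; the commented-out fetch assumes the endpoint returns a JSON array of event objects, which is an assumption, not documented behavior):

```python
def escalatory_events(events, zone=None, min_weight=1.0):
    """Filter event dicts (schema above) down to escalatory events at or
    above a weight threshold, optionally restricted to one zone."""
    return [
        e for e in events
        if e.get("direction") == "escalatory"
        and e.get("weight", 0) >= min_weight
        and (zone is None or e.get("zone") == zone)
    ]

sample = [
    {"zone": "iran_me", "direction": "escalatory", "weight": 1.5},
    {"zone": "ukraine_ru", "direction": "neutral", "weight": 0.2},
]
print(len(escalatory_events(sample, zone="iran_me")))  # 1

# Live fetch (needs network; response shape is an assumption):
# import json, urllib.request
# with urllib.request.urlopen("https://ww3chance.com/api/events?limit=500") as r:
#     feed = json.load(r)
```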

submitted by /u/Ok_Veterinarian446
[link] [comments]

How To Download The How2sign Dataset To My Google Drive?

My team and I are planning to do a project based on ASL. We would like to use the How2Sign dataset, mainly the RGB front videos, the RGB front clips, and the English translations.

We have planned to do the project via Google Colab. I wanted to download the necessary data in my Google Drive folder and make it a shared folder so that everyone can access the dataset but I’m unable to do so.

I tried cloning the repo and running the provided download script, but it just doesn’t seem to work. Is there a better method that I’m missing, or how do I make this work?

submitted by /u/Tanrat23
[link] [comments]

Are There Any Good RP Datasets In English Or Ukrainian?

Title.

I’m currently training my small LLM (a ~192.8M-parameter RWKV v6 model) for edge RP (role playing on phones, tablets, weak laptops, etc.; I’ve already built full inference for Android, with the UI in Java and the core in C and C++ via JNI, for both CPU and GPU), and I want to find new, really good datasets (even small ones). I don’t care whether they’re synthetic, human-made, mixed, or human-with-AI; I only care that they’re good enough. Ideally they’d be available via the datasets Python library (i.e. hosted on huggingface.co).

Thanks !

EDIT: Please mark whether it’s in English, in Ukrainian (there are almost no RP datasets in Ukrainian), or multilingual

submitted by /u/Lines25
[link] [comments]

Are There Efforts To Create Gold/silver Subsets For Open ML Datasets?

We experimented with MNIST and BDD100K and noticed two recurring issues: about 2–4% of samples were noisy or confusing, and there was significant redundancy in the datasets.

We achieved ~87% accuracy on MNIST with only 10 samples (1 per class), and on BDD, we matched baseline performance with less than ~40% of the dataset after removing obvious redundancies and very low-quality samples.

This made us wonder why we don’t see more “dataset goldifying” approaches, where datasets are split into something like:

  • Gold subset (very clean, ~1%)
  • Silver subset (medium, ~5%)
  • Full dataset
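A naive version of that tiering is just ranking samples by some cleanliness score (model confidence, cross-model agreement, redundancy distance) and cutting at the target fractions; a toy sketch:

```python
def tier_samples(scores, gold_frac=0.01, silver_frac=0.05):
    """Assign each sample to 'gold', 'silver', or 'full' by ranking on a
    per-sample cleanliness score, highest first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_gold = max(1, int(len(scores) * gold_frac))
    n_silver = max(1, int(len(scores) * silver_frac))
    tiers = ["full"] * len(scores)
    for rank, i in enumerate(order):
        if rank < n_gold:
            tiers[i] = "gold"
        elif rank < n_gold + n_silver:
            tiers[i] = "silver"
    return tiers

print(tier_samples([0.9, 0.1, 0.5, 0.8], gold_frac=0.25, silver_frac=0.25))
```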

Are there any canonical methods or open-source efforts for creating curated gold/silver subsets of datasets?

submitted by /u/taranpula39
[link] [comments]

Good Snowflake Discussion Groups Links

Hey folks,

I’ve been working with Snowflake for a while now (mostly data engineering stuff), and recently started digging into things like Cortex, governance, and some advanced use cases.

Was looking for active communities links like discord, telegram, WhatsApp group chat out there where people actually discuss Snowflake, share stuff, help each other out, etc.

Basically anything where there’s real discussion happening

If you know any good ones, please drop the links or names. Even smaller or lesser-known communities are totally fine.

Appreciate the help!

submitted by /u/Key_Card7466
[link] [comments]

Data Professionals — How Much Of Your Week Honestly Goes Into Just Cleaning Messy Data?

Hello fellow data enthusiasts,

As a first-year data science student, I was truly taken aback by the level of disorganization I encountered when working with real datasets for the first time.

I’m curious about your experiences:

How much of your workday do you dedicate to data preparation and cleaning versus actual analysis?

What types of issues do you face most often? (Missing values, duplicates, inconsistent formats, encoding problems, or something else?)

How do you manage these challenges? Excel, OpenRefine, pandas scripts, or another tool?

I’m not here to sell anything; I’m simply trying to understand if my experience is common or if I just happened to get stuck with some bad datasets. 😅

I would greatly appreciate honest feedback from professionals in the field.

submitted by /u/Turbulent_Way_0134
[link] [comments]

Private Set Intersection, How Do You Do It?

I work with a company that sells data. As an example, let’s say we are selling email addresses. A frequent request we get is: “Well, we already have a lot of emails; we only want to purchase ones you have that we don’t.”

We need a way that we can figure out what data we have that they don’t, without us giving them all our data or them giving us all their data.

This is a classic case of private set intersection, but I cannot find an easy-to-use solution that isn’t insanely expensive.

Usually we’re dealing with small counts, like 30k-100k. We usually just resort to the company agreeing to send us hashed versions of their data, and we hope nobody brute-forces the hashes. This is obviously unsafe. What do you guys do?
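For context, the core of Diffie-Hellman-style PSI is commutative blinding, which fits in a few lines; this is a toy sketch with an undersized modulus and none of the hardening real PSI needs, so treat it as illustration only:

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime; real deployments use elliptic-curve groups

def hash_to_group(item):
    """Hash an item (e.g. an email address) to a nonzero element mod P."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def blind(items, key):
    return {pow(hash_to_group(x), key, P) for x in items}

def reblind(blinded, key):
    return {pow(v, key, P) for v in blinded}

# Each side blinds its own set with a secret exponent and exchanges only the
# blinded values; double-blinded values match iff the underlying items match,
# because (h^a)^b == (h^b)^a mod P. Neither side ever sees raw foreign data.
a = secrets.randbelow(P - 3) + 2
b = secrets.randbelow(P - 3) + 2
ours = {"x@a.com", "y@b.com"}
theirs = {"y@b.com", "z@c.com"}
shared = reblind(blind(ours, a), b) & reblind(blind(theirs, b), a)
print(len(shared))
```

For 30k-100k records this runs in seconds even in pure Python; for anything production-grade, audited open-source ECDH-PSI implementations exist and are worth evaluating before rolling your own.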

submitted by /u/EducationalTackle819
[link] [comments]

European Regions: Happiness, Kinship & Church Exposure; 353 Regions, 31 Countries (ESS + Schulz 2019)

Novel merged dataset linking European Social Survey life satisfaction (rounds 1–8, 2002–2016) with Schulz et al. (2019, Science) regional kinship data across 353 regions in 31 European countries.

This merge didn’t exist before: Schulz used internal region codes, not the standard NUTS codes that ESS uses. Building the crosswalk required (a) Eurostat classification tables, (b) fuzzy name matching, and (c) manual overrides for NUTS revision changes across countries.
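The fuzzy-matching step can be sketched with the standard library alone; the region names and overrides below are invented for illustration, not the actual crosswalk:

```python
import difflib

def build_crosswalk(source_names, nuts_names, cutoff=0.8, overrides=None):
    """Map each source region name to its closest NUTS name, letting manual
    overrides win where automatic matching fails or is known to be wrong."""
    overrides = overrides or {}
    crosswalk = {}
    for name in source_names:
        if name in overrides:
            crosswalk[name] = overrides[name]
            continue
        match = difflib.get_close_matches(name, nuts_names, n=1, cutoff=cutoff)
        crosswalk[name] = match[0] if match else None  # None = manual review
    return crosswalk

print(build_crosswalk(["Bayern", "Ile de France"],
                      ["Bayern", "Île-de-France"],
                      overrides={"Ile de France": "Île-de-France"}))
```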

Each row/observation is a European region. Columns/variables include weighted mean life satisfaction (0–10), happiness (0–10), centuries of Western Church exposure, first-cousin marriage prevalence (3 countries), standardised trust, fairness, individualism, conformity, latitude, temperature, and precipitation.

CC BY-NC-SA 4.0 (same as ESS license). Companion to the country-level dataset posted yesterday.

Disclosure: this is my own dataset.

submitted by /u/Effective-Aioli1828
[link] [comments]

Suggestions For Regular Data Extract (large Files)

Dear all,

I’ve been asked at work to pull two reports twice a month and join certain columns to make a master spreadsheet. Each pull will be about 150k rows.

With every report pulled, we have to append it onto the previous data set in order to track the changes, so we can report at different stages.

My manager has recommended MS Access; however, I am trying it and having serious issues. We would also want to export the data to Excel when needed.

I am slightly technical and can learn with ChatGPT, but this will have to be accessible for my team. Can anyone please recommend the best and easiest way?
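For what it’s worth, the join-and-append workflow described above is only a few lines in Python (pandas would be the usual tool; the stdlib sketch below just shows the shape of it, with invented column names):

```python
import csv
import io

def join_reports(report_a, report_b, key, pull_date):
    """Left-join two report row-lists on a shared key column, stamping each
    joined row with the pull date so stages can be compared later."""
    b_by_key = {row[key]: row for row in report_b}
    joined = []
    for row in report_a:
        merged = dict(row)
        merged.update(b_by_key.get(row[key], {}))
        merged["pull_date"] = pull_date
        joined.append(merged)
    return joined

master = []  # grows with every twice-monthly pull
a = [{"record_id": "1", "status": "open"}, {"record_id": "2", "status": "closed"}]
b = [{"record_id": "1", "owner": "ann"}, {"record_id": "2", "owner": "bob"}]
master.extend(join_reports(a, b, "record_id", "2024-01-15"))

# Export for the team as CSV (Excel opens it directly):
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=master[0].keys())
writer.writeheader()
writer.writerows(master)
print(len(master))  # 2 rows in the master after the first pull
```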

submitted by /u/SoundDowntown5285
[link] [comments]

Best Data Source For Total Scheduled Departures Per Airport Per Day?

I’m building a forecasting model that needs a simple input: the number of scheduled departures from a given U.S. airport for the current day (only domestic is fine).

I’ve been using AeroDataBox and running into limitations:

  • Their FIDS/departures endpoint caps results at ~295 flights per call. A busy airport like ATL or JFK easily has 500-800+ departures/day, so I need multiple calls with different time windows just to cover one airport for one day. It works but it’s expensive and slow at scale.
  • Their “Airport Daily Routes” endpoint only returns a 7-day trailing average of flights per route — not the actual scheduled count for a specific day.

BTS On-Time Performance data is great for historical domestic flights but it lags by several months so it’s useless for current/future dates.

All I really need is a single number per airport per day — total scheduled departures. I don’t need individual flight details, passenger manifests, or real-time status. Just the count.

Is there an API or dataset that can give me this without having to paginate through hundreds of individual flight records?
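Short of a better source, the multi-window workaround at least collapses cleanly into one number per airport per day once the overlapping windows are deduplicated; a sketch (field names are invented, not AeroDataBox’s actual response schema):

```python
def count_departures(windows):
    """Merge departures fetched over several overlapping time windows and
    count unique flights; a (carrier, number, scheduled) triple is assumed
    to identify one departure for the day."""
    seen = set()
    for window in windows:
        for flight in window:
            seen.add((flight["carrier"], flight["number"], flight["scheduled"]))
    return len(seen)

morning = [{"carrier": "DL", "number": "100", "scheduled": "06:15"}]
evening = [{"carrier": "DL", "number": "100", "scheduled": "06:15"},  # overlap
           {"carrier": "DL", "number": "220", "scheduled": "18:40"}]
print(count_departures([morning, evening]))  # 2 unique departures
```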

Thanks in advance.

submitted by /u/sheeeeshkebabs
[link] [comments]

World Happiness 2017 Merged With Kinship Intensity, Church Exposure, Climate, Environmental Quality & Gender Security — 155 Countries, 34 Variables

Merged the World Happiness Report 2017 with five datasets that haven’t been combined before: Schulz et al. (2019, Science) Kinship Intensity Index, historical Western Church exposure, Yale Environmental Performance Index, Georgetown Women Peace & Security Index, and World Bank climate data. 155 countries, 34 variables, ready to use.

Includes the standard WHR variables (GDP, social support, life expectancy, freedom, trust, generosity) plus kinship sub-indices (polygyny, cousin marriage, clan structure, lineage rules), democracy, latitude, temperature, and precipitation.

10/10 usability score on Kaggle. CC BY 4.0. EIU Democracy Index excluded from the CSV due to proprietary license — shipped as a separate file for local use.

Disclosure: this is my own dataset

submitted by /u/Effective-Aioli1828
[link] [comments]

[SELF-PROMOTION] Share A Scrape On The Scrape Exchange

Anyone doing large-scale data collection from social media platforms knows the pain: rate limits, bot detection, infra costs. I built Scrape.Exchange to share that burden — bulk datasets distributed via torrent, so you only scrape once and everyone benefits. The site is forever free, and you do not need to sign up for downloads, only for uploads. The scrape-python repo on GitHub includes tools to scrape YouTube and upload to the API, so you can scrape and submit data yourself. Worth a look: scrape.exchange

submitted by /u/ScrapeExchange
[link] [comments]

Using YouTube As A Dataset Source For My Coffee Mania

I started working on a small coffee coaching app recently – something that would serve as my brew journal and give me contextual tips to improve each cup I make.

I was looking for good data and realized most written sources are either shallow or scattered. YouTube, on the other hand, has insanely high-quality content (James Hoffmann, Lance Hedrick, etc.), but it’s not usable out of the box for RAG.

Transcripts are messy because YouTubers ramble on about sponsorships and random stuff, which makes chunking inconsistent. Getting everything into a usable format took way more effort than expected.

So I made a small CLI tool that extracts transcripts from all videos of a channel within minutes. And then cleans + chunks them into something usable for embeddings.
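The sponsor/ramble cleanup can be approximated with keyword filtering before chunking; a toy sketch (the marker list is illustrative, not what the repo actually does):

```python
SPONSOR_MARKERS = ("this video is sponsored", "use my code", "link in the description")

def clean_transcript(segments):
    """Drop transcript segments that look like sponsor reads or channel
    housekeeping, keeping only content worth embedding."""
    return [s for s in segments
            if not any(marker in s.lower() for marker in SPONSOR_MARKERS)]

segments = [
    "Today we're dialing in a lighter roast.",
    "This video is sponsored by a coffee subscription service.",
    "Grind finer if the shot runs too fast.",
]
print(clean_transcript(segments))
```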

It basically became the data layer for my app and, funnily enough, ended up getting way more traction than the actual coffee coaching app!

Repo: youtube-rag-scraper

submitted by /u/ravann4
[link] [comments]