Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Data Professionals — How Much Of Your Week Honestly Goes Into Just Cleaning Messy Data?

Hello fellow data enthusiasts,

As a first-year data science student, I was truly taken aback by the level of disorganization I encountered when working with real datasets for the first time.

I’m curious about your experiences:

How much of your workday do you dedicate to data preparation and cleaning versus actual analysis?

What types of issues do you face most often? (Missing values, duplicates, inconsistent formats, encoding problems, or something else?)

How do you manage these challenges? Excel, OpenRefine, pandas scripts, or another tool?

I’m not here to sell anything; I’m simply trying to understand if my experience is common or if I just happened to get stuck with some bad datasets. 😅

I would greatly appreciate honest feedback from professionals in the field.

submitted by /u/Turbulent_Way_0134

Private Set Intersection, How Do You Do It?

I work with a company that sells data. As an example, let’s say we are selling email addresses. A frequent request we get is, “Well, we already have a lot of emails; we only want to purchase the ones you have that we don’t.”

We need a way that we can figure out what data we have that they don’t, without us giving them all our data or them giving us all their data.

This is a classic case of private set intersection, but I cannot find an easy-to-use solution that isn’t insanely expensive.

We usually deal with small counts, around 30k-100k records. Our usual fallback is for the company to send us hashed versions of their data and trust that we won’t brute-force the hashes. This is obviously unsafe. What do you guys do?
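One low-cost direction is a Diffie-Hellman-style private set intersection, where each side blinds its hashed emails with a secret exponent and only doubly blinded values are compared. The sketch below is a toy illustration, not a vetted protocol: the modulus `P`, the helper names, and the single-process simulation of both parties are all assumptions, and a real deployment would rely on an audited library and a standard group.

```python
import hashlib
import math
import secrets

# Toy prime modulus (2**255 - 19); an assumption for illustration only.
P = 2**255 - 19


def hash_to_group(item: str) -> int:
    """Normalize an email and map it to a nonzero residue mod P."""
    digest = hashlib.sha256(item.strip().lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % P or 1


def new_secret() -> int:
    """Pick a random exponent coprime to P - 1, so blinding is a permutation."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def blind(values, key):
    """Raise every value to the secret exponent mod P."""
    return [pow(v, key, P) for v in values]


def seller_only_items(seller_items, buyer_items):
    """Simulate both parties; return seller items the buyer does not hold."""
    a, b = new_secret(), new_secret()  # seller key, buyer key (never shared)

    seller_once = blind([hash_to_group(x) for x in seller_items], a)  # seller -> buyer
    buyer_once = blind([hash_to_group(y) for y in buyer_items], b)    # buyer -> seller

    seller_twice = blind(seller_once, b)     # buyer re-blinds, returns in same order
    buyer_twice = set(blind(buyer_once, a))  # seller re-blinds locally

    # H(x)^(ab) equals H(y)^(ab) when the normalized items match
    # (barring hash collisions).
    return [x for x, tag in zip(seller_items, seller_twice) if tag not in buyer_twice]
```

In this arrangement the seller learns which of its own addresses overlap, while raw addresses and unsalted hashes never cross the wire. How much the blinded values leak depends on the group chosen, which is why the toy modulus here should not be reused.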

submitted by /u/EducationalTackle819

European Regions: Happiness, Kinship & Church Exposure; 353 Regions, 31 Countries (ESS + Schulz 2019)

Novel merged dataset linking European Social Survey life satisfaction (rounds 1–8, 2002–2016) with Schulz et al. (2019, Science) regional kinship data across 353 regions in 31 European countries.

This merge didn’t exist before: Schulz used internal region codes, not the standard NUTS codes that ESS uses. Building the crosswalk required (a) Eurostat classification tables, (b) fuzzy name matching, and (c) manual overrides for NUTS revision changes across countries.
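A minimal sketch of how such a crosswalk could be assembled, assuming a name-to-code lookup taken from the classification tables. The function name, the 0.8 similarity cutoff, and the sample region names and codes are illustrative assumptions, not the actual pipeline behind this dataset.

```python
import difflib


def build_crosswalk(source_names, nuts_by_name, overrides=None, cutoff=0.8):
    """Map source region names to NUTS codes: override, exact, then fuzzy."""
    overrides = overrides or {}
    lookup = {name.casefold(): code for name, code in nuts_by_name.items()}
    crosswalk = {}
    for name in source_names:
        # 1. Manual overrides win (e.g. regions renamed or re-coded in a revision).
        if name in overrides:
            crosswalk[name] = overrides[name]
            continue
        key = name.casefold()
        # 2. Exact, case-insensitive match against the classification table.
        if key in lookup:
            crosswalk[name] = lookup[key]
            continue
        # 3. Fuzzy match; None marks a region that needs manual review.
        match = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        crosswalk[name] = lookup[match[0]] if match else None
    return crosswalk
```

Leaving unmatched regions as `None` rather than guessing keeps the manual-review list explicit, which matters when a wrong match would silently attach one region's survey data to another.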

Each row/observation is a European region. Columns/variables include weighted mean life satisfaction (0–10), happiness (0–10), centuries of Western Church exposure, first-cousin marriage prevalence (3 countries), standardised trust, fairness, individualism, conformity, latitude, temperature, and precipitation.

CC BY-NC-SA 4.0 (same as ESS license). Companion to the country-level dataset posted yesterday.

Disclosure: this is my own dataset.

submitted by /u/Effective-Aioli1828

Suggestions For Regular Data Extract (large Files)

Dear all,

I’ve been asked at work to pull two reports twice a month and join certain columns to make a master spreadsheet. Each pull will be about 150k rows.

With every report pulled, we have to append it to the previous data set to track the changes, so we can report at different stages.

My manager has recommended MS Access; however, I am trying it and having serious issues. We would also want to export the data to Excel when needed.

I am slightly technical and can learn with ChatGPT, but this will have to be accessible for my team. Can anyone please recommend the best and easiest way?
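For illustration, the append-and-compare workflow described above could look like this with Python's built-in `sqlite3` module. The column names (`record_id`, `status`, `amount`) are hypothetical, and the sketch assumes each report has already been read into a list of dicts.

```python
import sqlite3


def append_snapshot(conn, pull_date, report_a, report_b):
    """Join two report pulls on record_id and append them under pull_date."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS master ("
        "pull_date TEXT, record_id TEXT, status TEXT, amount REAL, "
        "PRIMARY KEY (pull_date, record_id))"
    )
    # Take the amount column from the second report, keyed by record_id.
    amounts = {row["record_id"]: row["amount"] for row in report_b}
    rows = [
        (pull_date, row["record_id"], row["status"], amounts.get(row["record_id"]))
        for row in report_a
    ]
    # Re-running the same pull date overwrites rather than duplicates.
    conn.executemany("INSERT OR REPLACE INTO master VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def status_changes(conn, earlier, later):
    """List records whose status differs between two pull dates."""
    query = (
        "SELECT new.record_id, old.status, new.status FROM master AS new "
        "JOIN master AS old ON old.record_id = new.record_id "
        "WHERE old.pull_date = ? AND new.pull_date = ? "
        "AND old.status <> new.status ORDER BY new.record_id"
    )
    return conn.execute(query, (earlier, later)).fetchall()
```

Any query result can be written out with the standard `csv` module when an Excel-readable export is needed.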

submitted by /u/SoundDowntown5285

Best Data Source For Total Scheduled Departures Per Airport Per Day?

I’m building a forecasting model that needs a simple input: the number of scheduled departures from a given U.S. airport for the current day (only domestic is fine).

I’ve been using AeroDataBox and running into limitations:

  • Their FIDS/departures endpoint caps results at ~295 flights per call. A busy airport like ATL or JFK easily has 500-800+ departures/day, so I need multiple calls with different time windows just to cover one airport for one day. It works but it’s expensive and slow at scale.
  • Their “Airport Daily Routes” endpoint only returns a 7-day trailing average of flights per route — not the actual scheduled count for a specific day.
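The time-window workaround can be sketched as a recursive split: whenever a window comes back at the cap, halve it and query both halves. Here `fetch` is a hypothetical callable standing in for whatever client wraps the departures endpoint; its signature and the 15-minute floor are assumptions, and only the ~295 cap comes from the numbers above.

```python
from datetime import datetime, timedelta

CAP = 295  # approximate per-call result cap


def unique_departures(fetch, start, end, cap=CAP, min_window=timedelta(minutes=15)):
    """Collect unique flight IDs in [start, end), halving windows that hit the cap."""
    flights = fetch(start, end)
    # Below the cap the window is complete; at the floor we accept what we got,
    # even though it may be truncated.
    if len(flights) < cap or end - start <= min_window:
        return set(flights)
    mid = start + (end - start) / 2
    return (
        unique_departures(fetch, start, mid, cap, min_window)
        | unique_departures(fetch, mid, end, cap, min_window)
    )
```

The daily count is then `len(unique_departures(fetch, day_start, day_end))`. Splitting adaptively keeps the number of calls close to the minimum for quiet airports while still covering busy ones.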

BTS On-Time Performance data is great for historical domestic flights but it lags by several months so it’s useless for current/future dates.

All I really need is a single number per airport per day — total scheduled departures. I don’t need individual flight details, passenger manifests, or real-time status. Just the count.

Is there an API or dataset that can give me this without having to paginate through hundreds of individual flight records?

Thanks in advance.

submitted by /u/sheeeeshkebabs

World Happiness 2017 Merged With Kinship Intensity, Church Exposure, Climate, Environmental Quality & Gender Security — 155 Countries, 34 Variables

Merged the World Happiness Report 2017 with five datasets that haven’t been combined before: Schulz et al. (2019, Science) Kinship Intensity Index, historical Western Church exposure, Yale Environmental Performance Index, Georgetown Women Peace & Security Index, and World Bank climate data. 155 countries, 34 variables, ready to use.

Includes the standard WHR variables (GDP, social support, life expectancy, freedom, trust, generosity) plus kinship sub-indices (polygyny, cousin marriage, clan structure, lineage rules), democracy, latitude, temperature, and precipitation.

10/10 usability score on Kaggle. CC BY 4.0. EIU Democracy Index excluded from the CSV due to proprietary license — shipped as a separate file for local use.

Disclosure: this is my own dataset

submitted by /u/Effective-Aioli1828

[SELF-PROMOTION] Share A Scrape On The Scrape Exchange

Anyone doing large-scale data collection from social media platforms knows the pain: rate limits, bot detection, infra costs. I built Scrape.Exchange to share that burden: bulk datasets are distributed via torrent, so you only scrape once and everyone benefits. The site is forever-free, and you do not need to sign up for downloads, only for uploads. The scrape-python repo on GitHub includes tools to scrape YouTube and upload to the API, so you can scrape and submit data yourself. Worth a look: scrape.exchange

submitted by /u/ScrapeExchange

Using YouTube As A Dataset Source For My Coffee Mania

I started working on a small coffee coaching app recently – something that would be my brew journal as well as give me contextual tips to improve each cup that I made.

I was looking for good data and realized most written sources are either shallow or scattered. YouTube, on the other hand, has insanely high-quality content (James Hoffmann, Lance Hedrick, etc.), but it’s not usable out of the box for RAG.

Transcripts are messy because YouTubers ramble on about sponsorships and random stuff, which makes chunking inconsistent. Getting everything into a usable format took way more effort than expected.

So I made a small CLI tool that extracts transcripts from all videos of a channel within minutes, then cleans and chunks them into something usable for embeddings.
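A rough sketch of what a clean-and-chunk step might look like, not the actual code in the repo: the sponsor keyword pattern, the word-based chunk size, and the overlap are all assumptions chosen for illustration.

```python
import re

# Hypothetical keyword filter for sponsor reads; real transcripts need tuning.
SPONSOR_PATTERN = re.compile(r"\b(sponsor(ed|s)?|promo code|use code)\b", re.IGNORECASE)


def clean_and_chunk(transcript, max_words=200, overlap=40):
    """Drop sponsor-like sentences, then split into overlapping word windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    kept = [s for s in sentences if s and not SPONSOR_PATTERN.search(s)]
    words = " ".join(kept).split()
    if not words:
        return []
    step = max_words - overlap
    # Stop early enough that the last chunk is not just a repeat of the overlap.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, last_start, step)]
```

The overlap means a tip that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.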

It basically became the data layer for my app, and funnily ended up getting way more traction than my actual coffee coaching app!

Repo: youtube-rag-scraper

submitted by /u/ravann4

Does Anyone Have Access To The Full SHL Dataset?

Hi,

Does anyone here happen to have access to the full SHL dataset, or know how to get it?

I’m using it for my master’s thesis. So far I’ve only been able to find the preview version on IEEE Dataport, while the SHL site points there and mentions server issues. The archived version also does not let me download the actual data.

SHL website: http://www.shl-dataset.org/

IEEE preview: https://ieee-dataport.org/documents/sussex-huawei-locomotion-and-transportation-dataset

It’s only for academic use. If anyone has managed to access the full version, I’d really appreciate it.

submitted by /u/tryllepus

Looking For Bulk Balance Sheet PDFs (for RAG Project)

Hi everyone, I’m working on a retrieval-augmented generation (RAG) project and need a large dataset of balance sheet PDFs (ideally around 1000 files).

Does anyone know a good source where I can download them in bulk — preferably as a zip or via an API? I’m open to public datasets, financial repositories, or any structured sources that make large-scale download easier.

Thanks in advance for any leads!

#RAG #MachineLearning #DataEngineering #NLP #Datasets #FinanceData #AIProjects

submitted by /u/dipk6545

Looking For Channel Separated Speaker Datasets

I am trying to find a dataset where speakers are separated cleanly on different tracks/channels. Ideally a recording of 2 people who are in a phone call, doing a podcast (This would be really nice) or having a normal conversation. The audio quality must be good as well. Fisher dataset is the closest I could find in open source.

If you know anyone who has this kind of data, tell them to reach out with a few samples please. I am open to discussing compensation.

submitted by /u/Louay-AI

Help Needed For My Project – Workout Logs

Hey everyone!

I’m working on a fitness/ML project and I’m looking for workout logs from the past ~60 days. If you track your workouts in apps like Hevy, Strong, Fitbod, notes, spreadsheets, etc., and are willing to share an export or screenshot, that would help a ton.

You can remove your name — I only care about the workouts themselves (exercises, sets, reps, weights, dates, physiology).

Even if your logs aren’t perfect or you missed days, that’s totally fine. Any training style is useful: bodybuilding, powerlifting, general fitness, beginner, advanced, anything.

If you’re interested, comment below or DM me. Thanks so much! 🙏

submitted by /u/xD_aviationgod3105

[Synthetic][Self-Promotion] Sleep Health & Daily Performance Dataset (100K Rows, 32 Features, 3 ML Targets)

I couldn’t find a realistic, ML-ready dataset for sleep analysis, so I built one.

This dataset contains:

  • 100,000 records
  • 32 features covering sleep, lifestyle, psychology, and health
  • 3 prediction targets (regression + classification)

It is synthetic, but designed to reflect real-world patterns using research-backed correlations (e.g., stress vs sleep quality, REM vs cognition).

Some highlights:
• Occupation-based sleep patterns (12 job types)
• Non-linear relationships (optimal sleep duration effects)
• Zero missing values (fully ML-ready)

Use cases:

  • Data analysis & visualization
  • Machine learning (beginner → advanced)
  • Research experiments

Dataset: https://www.kaggle.com/datasets/mohankrishnathalla/sleep-health-and-daily-performance-dataset

Would appreciate any feedback!

submitted by /u/Mohan137

[DATASET] Polymarket Prediction Market: 5.5 Billion Tick-level Orderbook Records, 21 Days, L2 Depth Snapshots, Trade Executions, Resolution Labels (CC-BY-NC-4.0)

Published a large-scale tick-level dataset from Polymarket, the largest prediction market. Useful for microstructure research, market efficiency studies, and ML on event-driven markets.

Scale:

Metric                Count
Orderbook ticks       5,555,777,555
L2 depth snapshots    51,674,425
Trade executions      4,126,076
Markets tracked       123,895
Resolved markets      23,146
ML feature bars       5,587,547
Coverage              21 continuous days
Null values           0

Format: Daily Parquet files (ZSTD compressed), around 40 GB total. Includes pre-built 1-minute bar features with L2 depth imbalance ready for ML training on Kaggle’s free tier.
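For readers unfamiliar with the term, L2 depth imbalance is commonly defined as the normalized difference between resting bid and ask size over the top few book levels. The sketch below assumes that common definition and a hypothetical `(price, size)` tuple layout; the dataset's own feature may be computed differently.

```python
def depth_imbalance(bids, asks, levels=5):
    """Return (bid depth - ask depth) / total depth over the top `levels` levels.

    bids and asks are lists of (price, size) tuples, best price first.
    The result lies in [-1, 1]; positive values mean more resting bid size.
    """
    bid_depth = sum(size for _, size in bids[:levels])
    ask_depth = sum(size for _, size in asks[:levels])
    total = bid_depth + ask_depth
    return 0.0 if total == 0 else (bid_depth - ask_depth) / total
```

For example, 300 contracts resting on the bid against 100 on the ask gives an imbalance of 0.5.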

License: CC-BY-NC-4.0 (non-commercial/academic)

Link: https://www.kaggle.com/datasets/marvingozo/polymarket-tick-level-orderbook-dataset

Use cases: HFT signal detection, market maker strategy research, prediction efficiency studies, order flow toxicity (VPIN), cross-market correlation, event study analysis.

submitted by /u/Upset-Fly-454

Built A Dataset Generation Skill After Spending Way Too Much On OpenAI, Claude, And Gemini APIs

Hey 👋

Quick project showcase. I built a dataset generation skill for Claude, Codex, and Antigravity after spending way too much on the OpenAI, Claude, and Gemini APIs.

At first I was using APIs for the whole workflow. That worked, but it got expensive really fast once the work stopped being just “generate examples” and became:
generate -> inspect -> dedup -> rebalance -> verify -> audit -> re-export -> repeat

So I moved the workflow into a skill and pushed as much as possible into a deterministic local pipeline.

The useful part is that it is not just a synthetic dataset generator.
You can ask it to:
“generate a medical triage dataset”
“turn these URLs into a training dataset”
“use web research to build a fintech FAQ dataset”
“normalize this CSV into OpenAI JSONL”
“audit this dataset and tell me what is wrong with it”

It can generate from a topic, research the topic first, collect from URLs, collect from local files/repos, or normalize an existing dataset into one canonical pipeline.

How it works:
The agent handles planning and reasoning.
The local pipeline handles normalization, verification, generation-time dedup, coverage steering, semantic review hooks, export, and auditing.

What it does:
– Research-first dataset building instead of pure synthetic generation
– Canonical normalization into one internal schema
– Generation-time dedup so duplicates get rejected during the build
– Coverage checks while generating so the next batch targets missing buckets
– Semantic review via review files, not just regex-style heuristics
– Corpus audits for split leakage, context leakage, taxonomy balance, and synthetic fingerprints
– Export to OpenAI, HuggingFace, CSV, or flat JSONL
– Prompt sanitization on export so training-facing fields are safer by default while metadata stays available for analysis
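To make two of these steps concrete, here is a minimal sketch of generation-time dedup and coverage steering. It is an illustration under assumed names (`DedupFilter`, `underfilled_buckets`) and an assumed normalization rule, not the code in the repo.

```python
import hashlib
import re


class DedupFilter:
    """Reject examples whose normalized text has already been accepted."""

    def __init__(self):
        self._seen = set()

    def accept(self, text: str) -> bool:
        # Normalize case and punctuation so trivial variants collide.
        normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


def underfilled_buckets(counts, quota):
    """Return buckets still below quota, largest gap first, for the next batch."""
    gaps = {bucket: quota - n for bucket, n in counts.items() if n < quota}
    return sorted(gaps, key=gaps.get, reverse=True)
```

Checking each example as it is generated means a rejected duplicate costs one hash lookup instead of a later cleanup pass, and the bucket list tells the next generation prompt where to aim.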

How it is built under the hood:

SKILL.md (orchestrator)
├── 12 sub-skills (dataset-strategy, seed-generator, local-collector, llm-judge, dataset-auditor, …)
├── 8 pipeline scripts (generate.py, build_loop.py, verify.py, dedup.py, export.py, …)
├── 9 utility modules (canonical.py, visibility.py, coverage_plan.py, db.py, …)
├── 1 internal canonical schema
├── 3 export presets
└── 50 automated tests

The reason I built it this way is cost.
I did not want to keep paying API prices for orchestration, cleanup, validation, and export logic that can be done locally.

The second reason is control.
I wanted a workflow where I can inspect the data, keep metadata, audit the corpus, and still export a safer training artifact when needed.

It started as a way to stop burning money on dataset iteration, but it ended up becoming a much cleaner dataset engineering workflow overall.

If people want to try it:

git clone https://github.com/Bhanunamikaze/AI-Dataset-Generator.git
cd AI-Dataset-Generator
./install.sh --target all --force

or you can simply run:

curl -sSL https://raw.githubusercontent.com/Bhanunamikaze/ai-dataset-generator/main/install.sh | bash -s -- --online --target all

Then restart the IDE session and ask it to build or audit a dataset.

Repo:
https://github.com/Bhanunamikaze/AI-Dataset-Generator

If anyone here is building fine-tuning or eval datasets, I would genuinely love feedback on the workflow.
⭐ Star it if the skill pattern feels useful
🐛 Open an issue if you find something broken
🔀 PRs are very welcome

submitted by /u/Illustrious-triffle

TTB Certificate Of Label Approval Data: 12,000+ US Spirits Labels With Distillery Cross-references

I’ve been working with the TTB (Alcohol and Tobacco Tax and Trade Bureau) COLA dataset: the public records of every spirits label approved for sale in the US. The raw data is available through TTB’s online search but it’s difficult to work with: session-gated URLs, no stable deep links, and the most useful fields (status, producer names, formula IDs) only exist on individual HTML detail pages, not in the CSV exports.

I built a pipeline that pulls CSV exports, scrapes the HTML detail pages for enrichment fields, and consolidates everything into structured JSON. The vodka subset alone covers 12,127 individual approvals across 9,038 product groups, 6,081 brands, and 2,439 producers.

What makes the data interesting:

Every label includes regulatory statements identifying who distilled, bottled, or imported the product, along with their DSP (Distilled Spirits Plant) permit number. Cross-referencing permits with facility names reveals the contract distilling network: which brands are produced at which facilities. About 1,035 producers in the dataset show up as contract distillers. You can trace the actual production topology behind the retail shelf.
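The cross-referencing idea can be sketched as a simple grouping by permit number. The field names (`dsp_permit`, `brand`) and the two-brand threshold are assumptions for illustration; the real schema and the rule used to identify contract distillers may differ.

```python
from collections import defaultdict


def brands_by_permit(labels):
    """Group brand names under the DSP permit number printed on each label."""
    grouped = defaultdict(set)
    for label in labels:
        grouped[label["dsp_permit"]].add(label["brand"])
    return {permit: sorted(brands) for permit, brands in grouped.items()}


def shared_facilities(labels, min_brands=2):
    """Permits that sit behind several brands: candidate contract distillers."""
    return {
        permit: brands
        for permit, brands in brands_by_permit(labels).items()
        if len(brands) >= min_brands
    }
```

A permit that appears behind many unrelated brands is only a candidate; a single producer can also own several brands, so ownership data would be needed to confirm a contract relationship.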

Other fields include approval status (approved/expired/surrendered/revoked), class and type codes, proof ranges, label images, and formula references.

I’ve published the vodka data as a navigable site at https://buy.vodka: statically generated pages for every product group, brand, and producer, with cross-linking between them. The site is mainly useful for browsing and exploring relationships, but the underlying structured data is the real asset.

If there’s interest, happy to discuss the data schema or extraction approach. The source is entirely public government records.

submitted by /u/hunterleaman

Looking For A Fast Keypoint Annotation Tool

Hey everyone,
I’m currently working on annotating a human pose dataset (specifically of people swimming) and I’m struggling to find a tool that fits my workflow.

I’m looking for a click‑based labeling workflow, where I can define a specific order in which keypoints are placed and then simply click to place each point. Everything I’ve found so far uses drag‑and‑drop, which feels very inefficient for what I need.

Ideally, the tool should support most of the following features:

  • Multiple selections per image with persistent IDs
  • Skipping occluded or hard‑to‑see keypoints
  • (Less important) keypoint state annotations (e.g., occluded, blurry, visible)
  • Bounding box annotations

Does anyone know of a tool that works like this, or any keypoint labeling tool with a faster workflow than drag‑and‑drop? Any recommendations are much appreciated!

submitted by /u/Dizzy-Ad6240

Guys, Is Paying $2199+/m For This Dataset Worth It?

Hey guys, need a reality check.

I came across a dataset that costs around $2k+ per year, and I’m trying to figure out if it’s actually worth it or just sounds good on paper.

It’s not generic marketing advice — it’s a structured set of 100+ psychology-based directives for SaaS growth.

Each one breaks down:

• where to use it (landing page, onboarding, pricing, etc.)
• why it works (human behavior, not surface-level tips)
• when NOT to use it
• real SaaS examples + implementation

Basically feels like a decision system for conversion, not just a list of ideas.

Here’s one example from it:

1 of 102 directives:

{
  "id": "P1-001",
  "pillar": "Attention & Pattern Interrupts",
  "pillar_code": "P1",
  "principle_name": "Zeigarnik Effect",
  "one_liner": "Incomplete tasks hijack the brain until they’re finished.",
  "plain_english": "Your brain hates unfinished business. Once you start something, a little alarm goes off that keeps bugging you until it’s done. Marketers use this by starting a story or a process and NOT finishing it - so your brain stays hooked and comes back.",
  "human_fear_or_desire": "Fear of incompletion; desire for cognitive closure and resolution.",
  "when_to_use": "Hero section headlines, onboarding checklists, email subject lines, multi-step signup flows, progress bars on pricing pages.",
  "when_NOT_to_use": "Late-stage checkout flows where the user needs confidence to commit - open loops here create anxiety and kill purchases. Never use on enterprise demo request pages where trust must be absolute.",
  "saas_example": {
    "scenario": "A B2B project management SaaS wants to increase free-trial signup completions.",
    "before": "‘Sign up for free’ button on a single-step form. 68% of users who clicked never finished the form.",
    "after": "Multi-step onboarding wizard that starts with ‘Step 1 of 3: What’s your team size?’ - visibly showing the incomplete progress bar after the user has already answered question one.",
    "result": "Across 100+ analyzed SaaS onboarding experiments (including data from Intercom, Canva, and LinkedIn’s profile completion studies), surfacing an ‘X% complete’ progress indicator after the first action drives a 20–35% lift in full completion rates. The Zeigarnik loop is already open; users feel compelled to close it."
  },
  "exact_implementation": "If your signup form is a single page, then break it into 3 steps. Display a progress bar that shows ‘Step 1 of 3’ immediately after the user enters their email. The bar must be visually prominent and show incompletion - do not let the bar start at 0%. Start it at 33% so the user feels momentum, not a cold start.",
  "example_copy": "You’re 33% of the way to your free workspace. Don’t leave it unfinished →",
  "power_level": "High",
  "ethical_risk": "Low",
  "combines_well_with": [
    "The Open Loop",
    "Curiosity Gap",
    "Cognitive Ease"
  ]
},

Now I’m stuck thinking:

• Is this actually worth ~$2k/year?
• Or is this something you’d just figure out over time anyway?

If you were running a SaaS or working with clients,

👉 Would you pay for something like this? Or not?

Trying to avoid making a dumb purchase 😅

submitted by /u/soloise