Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for datasets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it all for? World domination?

Does Anyone Have Access To The Full SHL Dataset?

Hi,

Does anyone here happen to have access to the full SHL dataset, or know how to get it?

I’m using it for my master’s thesis. So far I’ve only been able to find the preview version on IEEE Dataport, while the SHL site points there and mentions server issues. The archived version also does not let me download the actual data.

SHL website: http://www.shl-dataset.org/

IEEE preview: https://ieee-dataport.org/documents/sussex-huawei-locomotion-and-transportation-dataset

It’s only for academic use. If anyone has managed to access the full version, I’d really appreciate it.

submitted by /u/tryllepus
[link] [comments]

Looking For Bulk Balance Sheet PDFs (for RAG Project)

Hi everyone, I’m working on a retrieval-augmented generation (RAG) project and need a large dataset of balance sheet PDFs (ideally around 1000 files).

Does anyone know a good source where I can download them in bulk — preferably as a zip or via an API? I’m open to public datasets, financial repositories, or any structured sources that make large-scale download easier.

Thanks in advance for any leads!

#RAG #MachineLearning #DataEngineering #NLP #Datasets #FinanceData #AIProjects

submitted by /u/dipk6545
[link] [comments]

Looking For Channel Separated Speaker Datasets

I am trying to find a dataset where speakers are separated cleanly onto different tracks/channels. Ideally a recording of two people on a phone call, recording a podcast (this would be really nice), or having a normal conversation. The audio quality must also be good. The Fisher corpus is the closest open-source option I could find.
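If you end up with stereo call recordings rather than separate files, deinterleaving the channels is straightforward. A minimal sketch with the standard library, assuming 16-bit PCM frames (for real WAV files you would pull `frames` from the `wave` module):

```python
import struct

def split_stereo(frames: bytes, sampwidth: int = 2):
    """Deinterleave 16-bit PCM stereo frames into (left, right) byte strings."""
    assert sampwidth == 2, "sketch only handles 16-bit samples"
    n = len(frames) // 2  # number of 16-bit samples across both channels
    samples = struct.unpack("<%dh" % n, frames)
    left = struct.pack("<%dh" % (n // 2), *samples[0::2])   # even samples = left
    right = struct.pack("<%dh" % (n // 2), *samples[1::2])  # odd samples = right
    return left, right

# Tiny synthetic example: left channel is 100, right channel is -100
stereo = struct.pack("<6h", 100, -100, 100, -100, 100, -100)
left, right = split_stereo(stereo)
```

Each output can then be written back as a mono WAV per speaker, which is the channel-separated layout described above.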

If you know anyone who has this kind of data, please have them reach out with a few samples. I am open to discussing compensation.

submitted by /u/Louay-AI
[link] [comments]

Help Needed For My Project – Workout Logs

Hey everyone!

I’m working on a fitness/ML project and I’m looking for workout logs from the past ~60 days. If you track your workouts in apps like Hevy, Strong, Fitbod, notes, spreadsheets, etc., and are willing to share an export or screenshot, that would help a ton.

You can remove your name — I only care about the workouts themselves (exercises, sets, reps, weights, dates, physiology).

Even if your logs aren’t perfect or you missed days, that’s totally fine. Any training style is useful: bodybuilding, powerlifting, general fitness, beginner, advanced, anything.

If you’re interested, comment below or DM me. Thanks so much! 🙏

submitted by /u/xD_aviationgod3105
[link] [comments]

[Synthetic][Self-Promotion] Sleep Health & Daily Performance Dataset (100K Rows, 32 Features, 3 ML Targets)

I couldn’t find a realistic, ML-ready dataset for sleep analysis, so I built one.

This dataset contains:

  • 100,000 records
  • 32 features covering sleep, lifestyle, psychology, and health
  • 3 prediction targets (regression + classification)

It is synthetic, but designed to reflect real-world patterns using research-backed correlations (e.g., stress vs sleep quality, REM vs cognition).

Some highlights:
• Occupation-based sleep patterns (12 job types)
• Non-linear relationships (optimal sleep duration effects)
• Zero missing values (fully ML-ready)
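The "research-backed correlations" idea can be sketched in a few lines: draw a driver variable, then derive the dependent one with a signed coefficient plus noise. The coefficients below are hypothetical, not the ones used in the actual dataset:

```python
import random

def make_record(rng: random.Random) -> dict:
    """One synthetic row: higher stress pushes sleep quality down (hypothetical slope)."""
    stress = rng.uniform(0, 10)                       # 0 = relaxed, 10 = very stressed
    noise = rng.gauss(0, 1)
    sleep_quality = max(0.0, min(10.0, 9.0 - 0.6 * stress + noise))  # clamp to [0, 10]
    return {"stress": round(stress, 2), "sleep_quality": round(sleep_quality, 2)}

rng = random.Random(42)  # seeded for reproducibility
rows = [make_record(rng) for _ in range(1000)]

# Sanity check the built-in correlation: high-stress rows should sleep worse
high_stress = [r["sleep_quality"] for r in rows if r["stress"] > 7]
low_stress = [r["sleep_quality"] for r in rows if r["stress"] < 3]
```

The same pattern extends to non-linear effects (e.g., a quadratic penalty around an optimal sleep duration) by changing the formula, not the pipeline.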

Use cases:

  • Data analysis & visualization
  • Machine learning (beginner → advanced)
  • Research experiments

Dataset: https://www.kaggle.com/datasets/mohankrishnathalla/sleep-health-and-daily-performance-dataset

Would appreciate any feedback!

submitted by /u/Mohan137
[link] [comments]

[DATASET] Polymarket Prediction Market: 5.5 Billion Tick-level Orderbook Records, 21 Days, L2 Depth Snapshots, Trade Executions, Resolution Labels (CC-BY-NC-4.0)

Published a large-scale tick-level dataset from Polymarket, the largest prediction market. Useful for microstructure research, market efficiency studies, and ML on event-driven markets.

Scale:

Metric               Count
Orderbook ticks      5,555,777,555
L2 depth snapshots   51,674,425
Trade executions     4,126,076
Markets tracked      123,895
Resolved markets     23,146
ML feature bars      5,587,547
Coverage             21 continuous days
Null values          0

Format: Daily Parquet files (ZSTD compressed), around 40 GB total. Includes pre-built 1-minute bar features with L2 depth imbalance ready for ML training on Kaggle’s free tier.
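The dataset's exact bar schema isn't shown here, but "L2 depth imbalance" usually means the signed share of resting size on the bid vs. ask side. A minimal sketch of the common definition (the column names and level count are assumptions, not the dataset's actual schema):

```python
def depth_imbalance(bid_sizes, ask_sizes, levels: int = 5) -> float:
    """L2 depth imbalance over the top `levels`: +1.0 = all bids, -1.0 = all asks."""
    b = sum(bid_sizes[:levels])
    a = sum(ask_sizes[:levels])
    return 0.0 if b + a == 0 else (b - a) / (b + a)

# Example snapshot: bids heavier than asks near the touch -> positive imbalance
imb = depth_imbalance([120, 80, 60], [50, 40, 30])
```

Applied per snapshot and aggregated into 1-minute bars, this is the kind of feature the pre-built bars described above would contain.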

License: CC-BY-NC-4.0 (non-commercial/academic)

Link: https://www.kaggle.com/datasets/marvingozo/polymarket-tick-level-orderbook-dataset

Use cases: HFT signal detection, market maker strategy research, prediction efficiency studies, order flow toxicity (VPIN), cross-market correlation, event study analysis.

submitted by /u/Upset-Fly-454
[link] [comments]

Built A Dataset Generation Skill After Spending Way Too Much On OpenAI, Claude, And Gemini APIs

Hey 👋

Quick project showcase. I built a dataset generation skill for Claude, Codex, and Antigravity after spending way too much on the OpenAI, Claude, and Gemini APIs.

At first I was using APIs for the whole workflow. That worked, but it got expensive really fast once the work stopped being just “generate examples” and became:
generate -> inspect -> dedup -> rebalance -> verify -> audit -> re-export -> repeat

So I moved the workflow into a skill and pushed as much as possible into a deterministic local pipeline.

The useful part is that it is not just a synthetic dataset generator.
You can ask it to:
“generate a medical triage dataset”
“turn these URLs into a training dataset”
“use web research to build a fintech FAQ dataset”
“normalize this CSV into OpenAI JSONL”
“audit this dataset and tell me what is wrong with it”

It can generate from a topic, research the topic first, collect from URLs, collect from local files/repos, or normalize an existing dataset into one canonical pipeline.

How it works:
The agent handles planning and reasoning.
The local pipeline handles normalization, verification, generation-time dedup, coverage steering, semantic review hooks, export, and auditing.

What it does:
– Research-first dataset building instead of pure synthetic generation
– Canonical normalization into one internal schema
– Generation-time dedup so duplicates get rejected during the build
– Coverage checks while generating so the next batch targets missing buckets
– Semantic review via review files, not just regex-style heuristics
– Corpus audits for split leakage, context leakage, taxonomy balance, and synthetic fingerprints
– Export to OpenAI, HuggingFace, CSV, or flat JSONL
– Prompt sanitization on export so training-facing fields are safer by default while metadata stays available for analysis

How it is built under the hood:

SKILL.md (orchestrator)
├── 12 sub-skills (dataset-strategy, seed-generator, local-collector, llm-judge, dataset-auditor, …)
├── 8 pipeline scripts (generate.py, build_loop.py, verify.py, dedup.py, export.py, …)
├── 9 utility modules (canonical.py, visibility.py, coverage_plan.py, db.py, …)
├── 1 internal canonical schema
├── 3 export presets
└── 50 automated tests

The reason I built it this way is cost.
I did not want to keep paying API prices for orchestration, cleanup, validation, and export logic that can be done locally.

The second reason is control.
I wanted a workflow where I can inspect the data, keep metadata, audit the corpus, and still export a safer training artifact when needed.

It started as a way to stop burning money on dataset iteration, but it ended up becoming a much cleaner dataset engineering workflow overall.

If people want to try it:

git clone https://github.com/Bhanunamikaze/AI-Dataset-Generator.git
cd AI-Dataset-Generator
./install.sh --target all --force

or you can simply run:

curl -sSL https://raw.githubusercontent.com/Bhanunamikaze/ai-dataset-generator/main/install.sh | bash -s -- --online --target all

Then restart the IDE session and ask it to build or audit a dataset.

Repo:
https://github.com/Bhanunamikaze/AI-Dataset-Generator

If anyone here is building fine-tuning or eval datasets, I would genuinely love feedback on the workflow.
⭐ Star it if the skill pattern feels useful
🐛 Open an issue if you find something broken
🔀 PRs are very welcome

submitted by /u/Illustrious-triffle
[link] [comments]

TTB Certificate Of Label Approval Data: 12,000+ US Spirits Labels With Distillery Cross-references

I’ve been working with the TTB (Alcohol and Tobacco Tax and Trade Bureau) COLA dataset: the public records of every spirits label approved for sale in the US. The raw data is available through TTB’s online search but it’s difficult to work with: session-gated URLs, no stable deep links, and the most useful fields (status, producer names, formula IDs) only exist on individual HTML detail pages, not in the CSV exports.

I built a pipeline that pulls CSV exports, scrapes the HTML detail pages for enrichment fields, and consolidates everything into structured JSON. The vodka subset alone covers 12,127 individual approvals across 9,038 product groups, 6,081 brands, and 2,439 producers.

What makes the data interesting:

Every label includes regulatory statements identifying who distilled, bottled, or imported the product, along with their DSP (Distilled Spirits Plant) permit number. Cross-referencing permits with facility names reveals the contract distilling network: which brands are produced at which facilities. About 1,035 producers in the dataset show up as contract distillers. You can trace the actual production topology behind the retail shelf.
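The permit cross-referencing described above amounts to grouping labels by DSP permit and flagging facilities that produce for multiple brands. A simplified sketch with hypothetical records (the real dataset's field names will differ):

```python
from collections import defaultdict

# Hypothetical simplified records: (brand, dsp_permit, facility_name)
labels = [
    ("Brand A", "DSP-KY-101", "Acme Distilling"),
    ("Brand B", "DSP-KY-101", "Acme Distilling"),
    ("Brand C", "DSP-IN-202", "Midwest Spirits"),
]

brands_by_permit = defaultdict(set)
facility_by_permit = {}
for brand, permit, facility in labels:
    brands_by_permit[permit].add(brand)
    facility_by_permit[permit] = facility

# A facility producing labels for more than one brand looks like a contract distiller
contract = {
    facility_by_permit[p]: sorted(b)
    for p, b in brands_by_permit.items()
    if len(b) > 1
}
```

Run over the full dataset, this is how the ~1,035 contract-distilling producers surface.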

Other fields include approval status (approved/expired/surrendered/revoked), class and type codes, proof ranges, label images, and formula references.

I’ve published the vodka data as a navigable site at https://buy.vodka: statically generated pages for every product group, brand, and producer, with cross-linking between them. The site is mainly useful for browsing and exploring relationships, but the underlying structured data is the real asset.

If there’s interest, happy to discuss the data schema or extraction approach. The source is entirely public government records.

submitted by /u/hunterleaman
[link] [comments]

Looking For A Fast Keypoint Annotation Tool

Hey everyone,
I’m currently working on annotating a human pose dataset (specifically of people swimming) and I’m struggling to find a tool that fits my workflow.

I’m looking for a click‑based labeling workflow, where I can define a specific order in which keypoints are placed and then simply click to place each point. Everything I’ve found so far uses drag‑and‑drop, which feels very inefficient for what I need.

Ideally, the tool should support most of the following features:

  • Multiple selections per image with persistent IDs
  • Skipping occluded or hard‑to‑see keypoints
  • (Less important) keypoint state annotations (e.g., occluded, blurry, visible)
  • Bounding box annotations
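The click-based workflow described above is essentially a small state machine: a fixed keypoint order, one click per slot, and a skip action for occluded joints. A minimal sketch (the skeleton names are placeholders, not from any particular tool):

```python
KEYPOINT_ORDER = ["head", "l_shoulder", "r_shoulder", "l_hip", "r_hip"]  # hypothetical skeleton

class ClickAnnotator:
    """Place keypoints in a fixed order: each click fills the next slot; skip() leaves it None."""
    def __init__(self, order=KEYPOINT_ORDER):
        self.order = order
        self.points = {}
        self.i = 0  # index of the next keypoint to place

    def click(self, x: float, y: float):
        if self.i < len(self.order):
            self.points[self.order[self.i]] = (x, y)
            self.i += 1

    def skip(self):
        """Mark the current keypoint as occluded / not visible."""
        if self.i < len(self.order):
            self.points[self.order[self.i]] = None
            self.i += 1

a = ClickAnnotator()
a.click(10, 20)   # head
a.skip()          # l_shoulder occluded
a.click(30, 40)   # r_shoulder
```

Any tool exposing this loop (plus persistent instance IDs across images) would match the workflow requested above; some tools also let you script it via their plugin APIs.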

Does anyone know of a tool that works like this, or any keypoint labeling tool with a faster workflow than drag‑and‑drop? Any recommendations are much appreciated!

submitted by /u/Dizzy-Ad6240
[link] [comments]

Guys, Is Paying $2199+/m For This Dataset Worth It?

Hey guys, need a reality check.

I came across a dataset that costs around $2k+ per year, and I’m trying to figure out if it’s actually worth it or just sounds good on paper.

It’s not generic marketing advice — it’s a structured set of 100+ psychology-based directives for SaaS growth.

Each one breaks down:

  • where to use it (landing page, onboarding, pricing, etc.)
  • why it works (human behavior, not surface-level tips)
  • when NOT to use it
  • real SaaS examples + implementation

Basically feels like a decision system for conversion, not just a list of ideas.

Here’s one example from it:

1 of 102 directives:

{
  "id": "P1-001",
  "pillar": "Attention & Pattern Interrupts",
  "pillar_code": "P1",
  "principle_name": "Zeigarnik Effect",
  "one_liner": "Incomplete tasks hijack the brain until they're finished.",
  "plain_english": "Your brain hates unfinished business. Once you start something, a little alarm goes off that keeps bugging you until it's done. Marketers use this by starting a story or a process and NOT finishing it — so your brain stays hooked and comes back.",
  "human_fear_or_desire": "Fear of incompletion; desire for cognitive closure and resolution.",
  "when_to_use": "Hero section headlines, onboarding checklists, email subject lines, multi-step signup flows, progress bars on pricing pages.",
  "when_NOT_to_use": "Late-stage checkout flows where the user needs confidence to commit — open loops here create anxiety and kill purchases. Never use on enterprise demo request pages where trust must be absolute.",
  "saas_example": {
    "scenario": "A B2B project management SaaS wants to increase free-trial signup completions.",
    "before": "'Sign up for free' button on a single-step form. 68% of users who clicked never finished the form.",
    "after": "Multi-step onboarding wizard that starts with 'Step 1 of 3: What's your team size?' — visibly showing the incomplete progress bar after the user has already answered question one.",
    "result": "Across 100+ analyzed SaaS onboarding experiments (including data from Intercom, Canva, and LinkedIn's profile completion studies), surfacing an 'X% complete' progress indicator after the first action drives a 20–35% lift in full completion rates. The Zeigarnik loop is already open; users feel compelled to close it."
  },
  "exact_implementation": "If your signup form is a single page, then break it into 3 steps. Display a progress bar that shows 'Step 1 of 3' immediately after the user enters their email. The bar must be visually prominent and show incompletion — do not let the bar start at 0%. Start it at 33% so the user feels momentum, not a cold start.",
  "example_copy": "You're 33% of the way to your free workspace. Don't leave it unfinished →",
  "power_level": "High",
  "ethical_risk": "Low",
  "combines_well_with": ["The Open Loop", "Curiosity Gap", "Cognitive Ease"]
}

Now I’m stuck thinking:

  • Is this actually worth ~$2k/year?
  • Or is this something you'd just figure out over time anyway?

If you were running a SaaS or working with clients,

👉 Would you pay for something like this? Or not?

Trying to avoid making a dumb purchase 😅

submitted by /u/soloise
[link] [comments]

Action-oriented LLM Datasets (tool Use + Workflows + Decision Logic)

Most datasets rely on logs or real user data — which makes them messy, inconsistent, and hard to use due to privacy constraints.

What we’re doing differently:

  • fully synthetic, controllable data
  • structured as state → decision → action → outcome
  • built for tool use + multi-step workflows, not just text

So instead of cleaning logs, you can generate clean, privacy-safe datasets aligned to how your systems actually behave.
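The state → decision → action → outcome structure mentioned above maps naturally onto a per-step schema. A hypothetical sketch of what one such record might look like (field names and the example tool call are illustrative, not the poster's actual format):

```python
from dataclasses import dataclass, asdict

@dataclass
class AgentStep:
    state: dict      # what the agent observes before acting
    decision: str    # rationale / chosen branch
    action: dict     # tool call with arguments
    outcome: dict    # tool result / environment change

step = AgentStep(
    state={"ticket": "refund request", "order_status": "shipped"},
    decision="order already shipped, so initiate return before refund",
    action={"tool": "create_return", "args": {"order_id": "A123"}},
    outcome={"return_id": "R987", "status": "created"},
)
record = asdict(step)  # ready to serialize as one JSONL training row
```

Because every field is generated rather than logged, each record is complete and privacy-safe by construction.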

Curious if others are moving toward synthetic + behavior-driven datasets for agents?

submitted by /u/JayPatel24_
[link] [comments]

Almost Made A Dataset But Don’t Know What To Do With It

This weekend I was looking for a dataset on major air crashes (I like planes) containing the text of their final reports. Surprisingly, I couldn't find even a single open-source dataset matching these criteria. Anyway, I started collecting a few reports and was at the stage of extracting text and finalizing the cleaning pipeline when I realized I don't really have a clear idea of what to do with this data. Perhaps build a RAG system, but what benefit would that have? Has anyone worked with such reports?

submitted by /u/AbdullahKhanSherwani
[link] [comments]

10+ Years Of NOAA Hail Data, Geocoded And Queryable Via Free API

Thought this community might find this useful — I’ve built an API that makes NOAA’s hail data queryable by address.

The data:

  • MESH (Multi-Radar Multi-Sensor): Radar-derived hail size estimates from the NEXRAD network, 2020–present, ingested nightly
  • Storm Events Database: NOAA/NWS verified severe weather reports, going back to the 1950s (hail-specific events)

Both datasets are geocoded and spatially indexed, so you can query by any US address and get back every hail event within a configurable radius, with dates, estimated hail sizes (inches), distance from the address, and the data source.
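Under the hood, a radius query like the one described above reduces to geocoding the address and filtering events by great-circle distance. A sketch of the distance filter with hypothetical event tuples (the API's actual fields and index are not shown here):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance in miles between two (lat, lon) points."""
    R = 3958.8  # mean Earth radius, miles
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

# Hypothetical events as (lat, lon, hail_size_inches); keep those within 10 miles
events = [(32.78, -96.80, 1.75), (33.50, -97.50, 1.00)]
addr = (32.77, -96.79)  # geocoded query address
nearby = [e for e in events if haversine_miles(addr[0], addr[1], e[0], e[1]) <= 10]
```

A spatial index (e.g., geohash buckets or an R-tree) just prunes candidates before this exact check, which is what makes address queries fast at NOAA scale.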

Why I built it: NOAA’s raw data is publicly available but genuinely painful to work with at scale — scattered across FTP servers, inconsistent formats, no spatial indexing. I wanted a clean, fast API on top of it.

Access:

If you’re doing any research involving hail frequency, property risk, climate patterns, or severe weather trends, this might save you a bunch of data wrangling time.

Happy to answer questions about the data sources, coverage, or methodology.

submitted by /u/danny_greer
[link] [comments]

I’m Looking For 3D Geometry Datasets Of Bulk Parts

Hi, I’m searching for datasets of bulk parts (small handles, electrical connectors, screws, nuts, bolts, etc.).

I’m doing my bachelor’s thesis on the automatic parametrization of vibration feeders, and I need to categorize the part geometry before I can select the arrangement mechanism.

Does anyone have an idea where I can search for them? 🙂

submitted by /u/HISTeu
[link] [comments]

Anyone Here Need A Very Specific Dataset Built?

Been working on a few dataset projects recently, mostly things like:

  • lead generation lists (by niche + location)
  • business directories (websites, contact info, categories)
  • market research datasets (competitors, pricing, etc.)
  • cleaning up messy CSVs / exports into something usable

Usually pulling from multiple sources (Google Maps, websites, public data, APIs), then deduping and structuring it into a clean dataset (CSV/XLSX).

Trying to figure out what’s actually worth building next.

If you could get one dataset built for you right now, what would it be?

Interested to see what people here actually need.

submitted by /u/jesse_jones_
[link] [comments]