Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Where Could I Find Datasets For Gym Exercising Logs

For my master’s thesis I am searching for gym exercise logs that include which exercise an individual performed, how many reps and sets, and the weight used; potentially some more fields if feasible. I’ve found plenty of datasets of exercises themselves, including primary target muscles and required equipment, but actual logs of users performing these exercises are scarce.

I have searched the internet for some time now but cannot seem to find any usable datasets, besides one that includes logs from only one person. Does anyone know of any datasets, or where I could potentially find these?

Thanks!

submitted by /u/RoaRos

MIMIC IV/ Physionet Datasets For Independent Access

I need access to some PhysioNet datasets as a current high-school student.
PhysioNet requires the following steps:

  1. CITI Training: I’ve completed this through the MIT Affiliate option (as recommended by PhysioNet). However, for the question “We recommend providing an email address issued by Massachusetts Institute of Technology Affiliates or an approved affiliate, rather than a personal one like gmail, hotmail, etc. This will help Massachusetts Institute of Technology Affiliates officials identify your learning records in reports.” I had to enter a Gmail address, because I don’t have an approved affiliate email address.
  2. Credentialed Access: This is the part I’m mainly concerned about. The form lets you register as an independent researcher, but then asks for a reference. Whom can I ask as a reference to complete the form?

I just wanted to know whether it’s possible to access PhysioNet datasets as a high schooler, and if anyone has done it before, could they answer my questions?

submitted by /u/KernelCrypt

I’m Looking For A Code Smells Dataset

I’m writing a thesis on how well LLMs can identify code smells. I’d like to run this analysis on datasets containing classes (preferably Java) whose code smells are already labeled.

I tried using the QScored dataset but couldn’t get it to work; it appears to no longer be maintained.

Can anyone recommend something else?
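If a labeled replacement for QScored turns up, the evaluation itself is simple to wire up: compare the LLM’s predicted smells against the ground-truth labels per (class, smell) pair. A minimal sketch in Python, where the class names, smell labels, and predictions are entirely made up:

```python
# Sketch: scoring LLM code-smell predictions against a labeled dataset.
# All names and labels below are hypothetical examples.

def smell_scores(ground_truth, predictions):
    """Per-corpus precision/recall over (class, smell) pairs."""
    truth = {(cls, s) for cls, smells in ground_truth.items() for s in smells}
    pred = {(cls, s) for cls, smells in predictions.items() for s in smells}
    tp = len(truth & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

ground_truth = {
    "OrderManager.java": {"God Class", "Long Method"},
    "Utils.java": {"Long Method"},
}
predictions = {  # e.g. parsed from the LLM's structured output
    "OrderManager.java": {"God Class"},
    "Utils.java": {"Long Method", "Feature Envy"},
}

p, r = smell_scores(ground_truth, predictions)
print(f"precision={p:.2f} recall={r:.2f}")
```

Macro-averaging per smell type would be a small extension once real labels are available.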

submitted by /u/BothAccount7078

Looking For An Automotive Data Provider In Europe (Vehicle History, Damages, Mileage, OE Data)

Hi everyone,

We’re looking for a reliable automotive data provider (API or database) that covers European markets and can supply vehicle history information.

We need access to structured vehicle data, ideally via API, including:

• Country of first registration
• Export information (re-registration in another country)
• General vehicle details: year, color, fuel type, engine capacity, power, drivetrain, gearbox
• Last known mileage (value + date)
• Mileage timeline (from service / inspection / dealer records)
• Damage history (details, estimated cost, date, mileage, repair cost)
• Total loss / salvage / flood / fire / natural disaster / permanent deregistration
• Vehicle photos (from listings, auctions, or damage documentation)
• Theft records (coverage across Europe)
• Active finance or leasing
• Commercial usage (e.g. taxi or fleet)
• CO₂ emissions
• Safety information
• Market valuation (average market price)
• Manufacturer recalls
• OEM build sheet (factory equipment list)

We’re open to commercial partnerships and can offer a commission for valid introductions or verified data sources.

If you know a provider, broker, or contact who can help, please DM me or comment below.

Thanks in advance!

submitted by /u/cauchyez

Looking For A Labeled Dataset About Fake Or Fraudulent Real Estate Listings (Housing Ads Fraud Detection Project)

I’m trying to work on a machine learning project about detecting fake or scam real estate ads (like fake housing or rental listings), but I can’t seem to find any good datasets for it. Everything I come across is about credit-card or job-posting fraud, which isn’t really the same thing. I’m looking for any dataset of real estate or rental listings, preferably with a “fraud” or “fake” label, or even some advice on how to collect and label this kind of data myself. If anyone’s come across something similar or has any tips, I’d really appreciate it!
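If you end up collecting and labeling the data yourself, one common starting point is weak labeling: score each scraped listing against a few heuristic red flags and send high scorers for manual review. A hedged sketch, where every field name and threshold is a made-up assumption:

```python
# Weak-labeling sketch for scraped rental listings. The fields and thresholds
# are hypothetical; real labels would need manual review on top of these rules.

def suspicion_score(listing, median_price):
    """Count heuristic red flags commonly cited for rental scams."""
    flags = 0
    if listing["price"] < 0.5 * median_price:       # far below market
        flags += 1
    if listing["photos"] == 0:                       # no photos at all
        flags += 1
    if "wire transfer" in listing["description"].lower():
        flags += 1
    if not listing["has_phone"]:                     # contact only by email
        flags += 1
    return flags

listings = [
    {"price": 400, "photos": 0, "description": "Pay by wire transfer only", "has_phone": False},
    {"price": 1200, "photos": 8, "description": "Sunny 2-bedroom near park", "has_phone": True},
]
median_price = 1150
labels = ["suspect" if suspicion_score(l, median_price) >= 2 else "ok" for l in listings]
print(labels)
```

The weak labels bootstrap a training set; the manually reviewed subset then serves as the trustworthy evaluation split.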

submitted by /u/One_Ad_8437

Launching A New Ethical Data-sharing Platform — Anonymised, Consented Demographic + Location Data

We’re building Datalis, a data-sharing platform that collects consent-verified, anonymised demographic and location data directly from users. All raw inputs are stripped and aggregated before storage — no personal identifiers, no resale.
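For readers curious what “stripped and aggregated before storage” can mean mechanically, here is a minimal sketch: drop identifiers, map each record to a coarse demographic bucket, and keep only counts. The field names and bucket sizes are hypothetical, not Datalis’s actual pipeline:

```python
# Minimal strip-and-aggregate sketch: identifiers never reach storage,
# only coarse bucket counts do. Fields and buckets are invented examples.
from collections import Counter

raw = [
    {"email": "a@x.com", "age": 24, "city": "Berlin"},
    {"email": "b@y.com", "age": 27, "city": "Berlin"},
    {"email": "c@z.com", "age": 41, "city": "Madrid"},
]

def bucket(record):
    """Map a raw record to a coarse demographic cell; no identifiers survive."""
    age_band = f"{(record['age'] // 10) * 10}s"
    return (record["city"], age_band)

aggregate = Counter(bucket(r) for r in raw)
print(dict(aggregate))
```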

The goal is to create ground-truth datasets that are ethically sourced and representative enough for AI fairness and model evaluation work.

We’re currently onboarding early users via waitlist: 👉 datalis.app

Would love to connect with anyone building evaluation tools or working on ethical data sourcing.

submitted by /u/Crumbedsausage

Looking For A Rich Arabic Emotion Classification Dataset (Similar To GoEmotions)

I’m looking for a good Arabic dataset for my friend’s graduation project on emotion classification. I already tried Arpanemo, but it requires a Twitter API, which makes it inconvenient. Most of the other Arabic emotion datasets I found are limited to only three emotion labels, which is too simple compared to something like Google’s GoEmotions dataset that has 28 emotion labels. If anyone knows a dataset with richer emotional variety or something closer to GoEmotions but in Arabic, I’d appreciate your help.

submitted by /u/Safe_Shopping5966

Looking For Usage Logs Data Set Of Digital Mental Health Interventions (Mental Health App, Etc.)

Hello!

I’ve tried Kaggle, Awesome Public Datasets (GitHub), Open Data Inception, KDnuggets, etc., but can’t seem to find what I’m looking for. I’m kind of desperate to get my research study underway, so I figured it’s worth a shot to ask here.

Specifically, I’m looking for anonymized usage log data such as timestamps of activity, session duration, and module completion rates, among others. I’m planning to use cluster analysis (using machine learning) to identify patterns of engagement with the intervention.
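In case it helps anyone with a similar export, the feature-engineering step that precedes the cluster analysis might look like this sketch. The event fields and timestamps are invented placeholders for whatever the real log data contains:

```python
# Sketch: deriving per-user engagement features from raw usage-log events,
# the step before any clustering. Field names and values are hypothetical.
from datetime import datetime

events = [
    {"user": "u1", "start": "2024-01-02T08:00", "end": "2024-01-02T08:12", "modules_done": 2},
    {"user": "u1", "start": "2024-01-04T21:00", "end": "2024-01-04T21:05", "modules_done": 1},
    {"user": "u2", "start": "2024-01-03T10:00", "end": "2024-01-03T10:40", "modules_done": 5},
]

def features(events, total_modules=10):
    """Aggregate events into per-user session counts, minutes, and completion."""
    out = {}
    for e in events:
        start = datetime.fromisoformat(e["start"])
        end = datetime.fromisoformat(e["end"])
        f = out.setdefault(e["user"], {"sessions": 0, "minutes": 0.0, "modules": 0})
        f["sessions"] += 1
        f["minutes"] += (end - start).total_seconds() / 60
        f["modules"] += e["modules_done"]
    for f in out.values():
        f["completion_rate"] = f["modules"] / total_modules
    return out

print(features(events))
```

The resulting per-user feature vectors (sessions, minutes, completion rate) are what a k-means or hierarchical clustering pass would then consume.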

No specific sample size required, but the bigger the better. Interventions can be any medium (computer, app, website, etc.) or for any mental health disorder (anxiety, depression, eating disorder, insomnia, etc.).

Would appreciate any help or any leads! Thank you so much!

submitted by /u/psychologisaur

[Resource] Discover Open & Synthetic Datasets For AI Training And Research Via Opendatabay

Hey everyone 👋

I wanted to share a resource we’ve been working on that may help those who spend time hunting for open or synthetic datasets for AI/ML training, benchmarking, or research.

It’s called Opendatabay: a searchable directory that aggregates and organizes datasets from various open data sources, including government portals, research repositories, and public synthetic dataset projects.

What makes it different:

  • Lets you filter datasets by type (real or synthetic), domain, and license
  • Displays metadata like views and downloads to gauge dataset popularity
  • Includes both AI-related and general-purpose open datasets

Everything listed is open source or publicly available, with no paywall or gated access.
We’re also working on indexing synthetic datasets specifically designed for AI model training and evaluation.

Would love feedback from this community, especially around what metadata or filters you’d find most useful when exploring large-scale datasets.

(Disclosure: I’m part of the team building Opendatabay.)

submitted by /u/Winter-Lake-589

How To Improve And Refine Categorization For A Large Dataset With 26,000 Unique Categories

I’ve got a beast of a dataset: about 2M business names with roughly 26,000 categories. Some of the categories are off. For example, Zomato is categorized as a tech startup, which is technically correct, but from a consumer standpoint it should be food and beverages. Some are straight wrong, and a lot of them are confusing too. Many are really subcategories: 26,000 is the raw count, but on the ground there are only a couple hundred top-level categories, which is still a huge number. Is there any way I can fix this mess? Keyword-based cleaning isn’t working. It would be a real help.
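One hedged first pass, before anything fancier: normalise the names and greedily merge near-duplicates by string similarity. This only handles the mechanical merge step; collapsing 26,000 labels into a few hundred real categories would also need embeddings or a manual taxonomy map. All category names below are made up:

```python
# Greedy near-duplicate merging of category names by string similarity.
# A sketch only: the threshold and normalisation rules are assumptions.
from difflib import SequenceMatcher

categories = ["Food & Beverages", "food and beverages", "Tech Startup",
              "Food & Beverage", "Technology Startup"]

def normalise(name):
    """Cheap canonical form: lowercase, expand '&', drop trailing plural 's'."""
    return name.lower().replace("&", "and").rstrip("s").strip()

def merge_near_duplicates(names, threshold=0.75):
    canonical = []   # one representative name per merged group
    mapping = {}
    for name in names:
        key = normalise(name)
        for rep in canonical:
            if SequenceMatcher(None, key, normalise(rep)).ratio() >= threshold:
                mapping[name] = rep
                break
        else:
            canonical.append(name)
            mapping[name] = name
    return mapping

print(merge_near_duplicates(categories))
```

For 26,000 names an all-pairs comparison gets slow, so in practice you would block on a normalised prefix or token set before comparing within blocks.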

submitted by /u/Existing_Pay8831

I’m Looking For Human3.6M, But The Official Site Has Not Responded For 3 Weeks

❓[HELP] 4D-Humans / HMR2.0 Human3.6M eval images missing — can’t find official dataset

I’m trying to reproduce HMR2.0 / 4D-Humans evaluation on Human3.6M, using the official config and h36m_val_p2.npz.

Training runs fine and 3DPW evaluation works correctly, but H36M evaluation completely fails (black crops, sky-high errors).

After digging through the data, it turns out the problem isn’t the code: the h36m_val_p2.npz file expects full-resolution images (~1000×1000) with names like:

```
S9_Directions_1.60457274_000001.jpg
```

But there’s no public dataset that matches both naming and resolution:

| Source | Resolution | Filename pattern | Matches npz? |
| --- | --- | --- | --- |
| HuggingFace “Human3.6M_hf_extracted” | 256×256 | S11_Directions.55011271_000001.jpg | ✅ name, ❌ resolution |
| MKS0601 3DMPPE | 1000×1000 | s_01_act_02_subact_01_ca_01_000001.jpg | ✅ resolution, ❌ name |
| 4D-Humans auto-downloaded h36m-train/*.tar | 1000×1000 | S1_Directions_1_54138969_001076.jpg | close, but _ vs . mismatch |

So the official evaluation .npz points to a Human3.6M image set that doesn’t seem to exist publicly. The repo doesn’t provide a download script for it, and even the HuggingFace or MKS0601 versions don’t match.
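If the only blocker between the 4D-Humans tars and the npz were the “_” vs “.” separator before the camera ID, a rename pass like this would bridge it. This is a hypothetical sketch: it does not resolve the frame-numbering or resolution mismatches described above.

```python
# Rename 4D-Humans-style frames to the dot-separated camera convention
# the npz appears to expect. Hypothetical: only fixes the separator.
import re

# e.g. S1_Directions_1_54138969_001076.jpg -> S1_Directions_1.54138969_001076.jpg
PATTERN = re.compile(r"^(S\d+_.+)_(\d{8})_(\d{6})\.jpg$")

def to_npz_name(filename):
    m = PATTERN.match(filename)
    if m is None:
        return None  # name doesn't follow the 4D-Humans convention
    subject_action, camera, frame = m.groups()
    return f"{subject_action}.{camera}_{frame}.jpg"

print(to_npz_name("S1_Directions_1_54138969_001076.jpg"))
```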


My questions

Has anyone successfully run HMR2.0 or 4D-Humans H36M evaluation recently?

  • Where can we download the official full-resolution images that match h36m_val_p2.npz?
  • Or can someone confirm the exact naming / folder structure used by the authors?

I’ve already registered on the official Human3.6M website and requested dataset access,
but it’s been weeks with no approval or response, and I’m stuck.

Would appreciate any help or confirmation from anyone who managed to get the proper eval set.

submitted by /u/Last_Raise4834

Help To Find A Dataset For Regression

Hi, I’m looking for a dataset that has one continuous response variable, at least six continuous covariates, and one categorical variable with three or more categories. I’ve been searching for a while but haven’t found anything yet. If you know a dataset that fits that, I’d really appreciate it.
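For quickly screening candidates, a rough checker for the stated requirements (one continuous response, at least six continuous covariates, one categorical variable with three or more levels) might look like this. Classifying columns by value type is a crude heuristic, and the synthetic data exists only to exercise the function:

```python
# Screen a candidate dataset against the regression requirements above.
# Type-based column classification is a rough assumption, not a rule.
import random

def meets_requirements(columns):
    """columns: dict mapping column name -> list of values."""
    continuous = [c for c, v in columns.items()
                  if all(isinstance(x, (int, float)) for x in v) and len(set(v)) > 10]
    categorical = [c for c, v in columns.items()
                   if all(isinstance(x, str) for x in v) and len(set(v)) >= 3]
    # one continuous column serves as the response, the remaining six as covariates
    return len(continuous) >= 7 and len(categorical) >= 1

random.seed(0)
candidate = {f"x{i}": [random.random() for _ in range(50)] for i in range(7)}
candidate["group"] = [random.choice(["A", "B", "C"]) for _ in range(50)]
print(meets_requirements(candidate))
```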

submitted by /u/SeaworthinessOk3084

Scientific Datasets For NLP And LLM Generation Models

👋 Hey, I have just uploaded 2 new datasets for code and scientific reasoning models:

  1. ArXiv Papers (4.6TB): a massive scientific corpus of papers and metadata across all domains. Perfect for training models on academic reasoning, literature review, and scientific knowledge mining. 🔗 Link: https://huggingface.co/datasets/nick007x/arxiv-papers

  2. GitHub Code 2025: a comprehensive code dataset for code generation and analysis tasks. It mostly contains GitHub’s top 1 million repos with more than 2 stars. 🔗 Link: https://huggingface.co/datasets/nick007x/github-code-2025

submitted by /u/its_just_me_007x

The Munich-Passau Snore Sound Corpus

I’ve been looking for a labeled snoring dataset, which I need for sleep apnea detection. I found that many research papers have used the MPSSC dataset, and it is basically the largest and best-labeled dataset available. I have looked almost everywhere for it, but I can’t find it. If anyone knows how to access the dataset, or has it downloaded somewhere or a torrent, I’d really appreciate it if you could link it here or in my DMs.

submitted by /u/hydrastrix

Looking For A Dataset That Includes Luggage Information From Airports

I’m working on a final-year project to optimise baggage handling, using AI to better route baggage through the airport and to minimise carousel conflicts and overloads to increase throughput. Unfortunately, there isn’t much data I can find to work with. If anyone knows any dataset that includes conveyor travel times, error rates, carousel capacity, etc., that would be great, thank you.
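If no real dataset surfaces, one fallback is generating a synthetic baggage log to develop and test the routing model against. Every number below (conveyor times, carousel capacities, error rate) is an invented placeholder to be swapped for airport-specific figures if you ever obtain them:

```python
# Generate a synthetic baggage-handling log. All parameters are invented
# placeholders, not real airport figures.
import random

random.seed(42)

CONVEYOR_MINUTES = {"checkin->sorter": 4.0,
                    "sorter->carousel_A": 3.0,
                    "sorter->carousel_B": 5.0}
CAROUSEL_CAPACITY = {"carousel_A": 40, "carousel_B": 60}

def synth_bag(bag_id):
    """One synthetic bag event: route, travel time with noise, tag-read outcome."""
    carousel = random.choice(sorted(CAROUSEL_CAPACITY))
    travel = CONVEYOR_MINUTES["checkin->sorter"] + CONVEYOR_MINUTES[f"sorter->{carousel}"]
    jitter = random.uniform(-0.5, 0.5)   # congestion noise on the conveyor
    misread = random.random() < 0.02     # assume ~2% tag read-error rate
    return {"bag": bag_id, "carousel": carousel,
            "travel_min": round(travel + jitter, 2), "tag_misread": misread}

log = [synth_bag(i) for i in range(1000)]
print(log[0])
```

A routing model tuned on such a log proves the pipeline end to end, even if its numbers only become meaningful once real conveyor data replaces the placeholders.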

submitted by /u/thelordgodj1