submitted by /u/Honest_Wash_9176
Category: Datatards
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
Is the site down? Accessed this morning, but can’t anymore!
submitted by /u/quiyum
Mubert got their dataset of 2.5 million samples from 310 artists. Would it be possible to get enough samples by donation?
submitted by /u/Alternative_Cold_680
Requesting recommendations for subscription-based data platforms, filterable by amount or owner type, or reputable bulk data vendors in the state unclaimed property records space.
Can anyone tell me who the pros (like asset recovery professionals) use?
Any guidance would be most appreciated.
submitted by /u/DBinSJ
Just found this dataset (from the https://www.behindthename.com/ website):
https://github.com/Anwarvic/Behind-The-Name/blob/master/dataset.csv
https://github.com/Anwarvic/Behind-The-Name/blob/master/dataset2.csv
https://github.com/Anwarvic/Behind-The-Name/blob/master/dataset3.csv
It’s 8 years old, so might need updating.
Thanks to the original sharer from this repo:
https://github.com/Anwarvic/Behind-The-Name/tree/master
submitted by /u/Efficient_Fix1026
Hey everyone,
I’ve been working on a project called the Men’s Global Wellbeing Index (MGWI) — a data-driven scoring system that compares men’s wellbeing conditions across different countries. I’ve put a lot into building the core foundation, but I’m shifting my focus to other projects and don’t want this one to sit unused.
I’m looking for someone who wants to take it over, expand it, build something bigger on top of it, or repurpose it for a similar project.
🔧 What MGWI Includes
- 10 fully defined metrics (Suicide, Social Bias, Child Custody, Legal Bias, Homelessness, Workplace Fairness, Freedom of Expression, Mental Health Access, Violence Against Men, Loneliness)
Each metric includes:
- Emoji marker
- Full rationale/explanation
- Consistent scoring system
Additional assets:
- 10 countries scored (100-point total index)
- Airtable backend with all data structured
- Softr dashboard (mock-up style)
- Name: Mensglobalwellbeingindex dot com
- Brand notes, methodology, and all assets included
🔎 SEO Notes
Some MGWI-related pages are already ranking on the first page for keywords like:
- global wellbeing index for men
- men’s wellbeing index
- men’s global index
- global index for men
- index for men’s global wellbeing
(Useful if someone wants to continue the project or build an SEO-focused site.)
🎯 Who This Is Good For
- Researchers
- Activists or NGOs
- University projects
- Startups in wellbeing, mental health, or analytics
- Indie makers looking for a meaningful data project
- Anyone wanting a niche SEO website with long-term potential
📦 What I Can Share If You’re Interested
- Demo video of the dashboard
- Sample of the dataset
- Full scoring methodology
- Asset list + structure
- Notes on future expansion (global rankings, crowdsourced sentiment, etc.)
I’m open to offers — mainly want this to go to someone who will actually build it out.
If you’re interested or want to see more, just comment or DM me.
submitted by /u/Zealousideal-Gap414
Most public datasets treat time as versions – snapshot at T1, T2, T3. But that makes it impossible to query change within the interval. I’m exploring an append-only, bitemporal structure (valid_from / valid_to) — but it’s storage-heavy and tricky for non-SQL users. Has anyone built a temporal model that’s efficient, queryable, and still human-readable?
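As one way to picture the append-only, bitemporal idea from the question, here is a minimal Python sketch. It is not an established library or the poster's design: the `Fact` record, the `recorded_at` field, the `as_of` helper, and the sample values are all hypothetical names and data chosen for illustration. It assumes corrections are appended as new rows rather than overwriting old ones, and that `valid_to` is an exclusive end date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """One immutable row in a hypothetical append-only log."""
    key: str
    value: float
    valid_from: date           # when the value starts being true in the world
    valid_to: Optional[date]   # exclusive end; None means "still valid"
    recorded_at: date          # when this row was appended to the log


def as_of(log: List[Fact], key: str, valid_on: date, known_on: date) -> Optional[float]:
    """Value of `key` valid on `valid_on`, using only rows recorded by `known_on`."""
    candidates = [
        f for f in log
        if f.key == key
        and f.recorded_at <= known_on
        and f.valid_from <= valid_on
        and (f.valid_to is None or valid_on < f.valid_to)
    ]
    if not candidates:
        return None
    # The most recently recorded matching row wins, so a correction
    # supersedes the earlier row without deleting it.
    return max(candidates, key=lambda f: f.recorded_at).value


# Illustrative data only (made-up numbers).
log = [
    Fact("population", 100.0, date(2024, 1, 1), date(2024, 7, 1), date(2024, 1, 5)),
    Fact("population", 110.0, date(2024, 7, 1), None, date(2024, 7, 3)),
    # A later correction to the first interval, appended rather than edited in place.
    Fact("population", 102.0, date(2024, 1, 1), date(2024, 7, 1), date(2024, 9, 1)),
]
```

Calling `as_of(log, "population", date(2024, 3, 1), date(2024, 2, 1))` answers "what did we believe in February about March?", while passing a later `known_on` picks up the appended correction. The storage cost the question mentions shows up here too: every correction adds a row.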
submitted by /u/Vivid_Stock5288
Hi, does anybody know where I might be able to find a dataset of a single speaker in a conversation? So it’s just their side of the conversation? Thanks!
submitted by /u/Flamevein
Does anyone have a dataset that covers students’ performance in school and their social media habits? Preferably one set in the United States, but I’d take any suggestions. Thank you.
submitted by /u/fanaticfan1907
I’m seeking guidance on data quality management (DQ rules and data profiling) in Ataccama and on establishing a robust connection to Snowflake for sample data. What are your go-to strategies for profiling, cleansing, and enriching data in Ataccama? Any blogs or videos you would recommend?
submitted by /u/Substantial_Mix9205
The dataset I need must be at least 1 GB in size, and I will use it later with some ML algorithms. It can be either a regression or a classification task. Thank you for the help!
submitted by /u/Mate0ff
https://cds.climate.copernicus.eu/
Assume that I have downloaded the models, but I am unsure whether I have downloaded the full set of datasets.
I just want a way to get the provenance.json, the provenance.png, and the names of the .nc files.
The rest is just comparing file names to confirm that I have downloaded and placed the data correctly.
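The file-name comparison step could look roughly like the Python sketch below. It only checks a local folder against a list of expected names. It does not query the Copernicus site, and it assumes the expected `.nc` names have already been collected some other way (for example, copied by hand from the download page). The function name `compare_downloads` and its arguments are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def compare_downloads(expected_names: Iterable[str], download_dir: str) -> Dict[str, List[str]]:
    """Compare expected file names with the .nc files found under download_dir."""
    # Collect the names of all .nc files, including those in subfolders.
    found = {p.name for p in Path(download_dir).rglob("*.nc")}
    expected = set(expected_names)
    return {
        "missing": sorted(expected - found),      # expected but not on disk
        "unexpected": sorted(found - expected),   # on disk but not expected
    }
```

An empty `missing` list would suggest every expected file is present somewhere under the folder. Because only names are compared, the sketch does not confirm that files are complete or sit in the right subfolder.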
submitted by /u/__Muhammad_
I am looking for stereo image datasets of crop rows taken from within the field (not aerial) for row identification, especially ones that include depth and segmentation. I came across CRBD and CropDeep, but the latter doesn’t seem to be publicly available yet. Any ideas would be really appreciated 🙂
submitted by /u/Majestic-Age-4636
So the thing is, my GCP account’s free trial expires in 3 days. I was hoping to get some long-term value out of it, something that stays even after the free credits expire, like maybe running a VM 24/7 for a data extraction process, but I’m not sure what kind of data to extract. Anything that can be of value to me later on, after the credits expire, would work; it doesn’t necessarily have to be datasets.
submitted by /u/Mean_Interest8611
We just put out a blog post with TabPFN benchmarks on datasets from 1M to 10M rows.
For context: TabPFN is a transformer pretrained on millions of synthetic datasets that does in-context learning for tabular classification/regression. No hyperparameter tuning needed – you just give it training data at inference and it predicts.
- TabPFNv2 published in Nature this year
- TabPFN-2.5 beats models tuned for 4h (report here), #1 on TabArena leaderboard atm
Compared our Scaling Mode against CatBoost, XGBoost, LightGBM on internal classification datasets. Performance keeps improving with more data and the gap to gradient boosting isn’t shrinking.
Benchmark results show normalized scores across datasets plus individual results showing ROC AUC improvements. You can find them here: https://priorlabs.ai/technical-reports/large-data-model
Would be interesting to keep on benchmarking this on public large tabular datasets. Anyone know good large public tabular datasets?
submitted by /u/Diligent_Inside6746
Hello Reddit! Apologies if this isn’t the right sub, but I’m working on a fun data project exploring how matcha lattes have exploded in popularity over the last year or so.
The thing is, I’m having a hard time finding any datasets that actually include matcha sales. My backup idea is to look for a dataset from a boba or Thai tea shop (since they usually sell matcha) and compare those sales to a cafe over the same time period that may not sell matcha?
This project is just for fun—mainly an excuse for me to play around with Kaggle, SQL, R, etc.—so the dataset doesn’t have to be perfect. If anyone has suggestions, dataset ideas, or guidance on where to look, I’d really appreciate it!
submitted by /u/Pristine-Rhubarb-787
I have an introductory data science class, and my project requires me to do some basic analysis on a data set related to a topic I like. The topic I am genuinely interested in is education in computer science, but I have had trouble finding a data set I can work with. I found the annual Stack Overflow questionnaire, but I don’t think it will work because of how they asked the questions. I also found another one that lists all the schools that offer computer science in the US, but my professor didn’t like that one. I have about two days to do the project, so I need to find the data today. Please, if anyone knows of one, I’d love the help. I’ve decided that it can be something related to just science in general or even education in general. It’s a topic I want to study, but I have struggled so much to find a good data set that I am pretty far from my original question anyway. Please and thanks to anyone who can help!
submitted by /u/papiyou
So I have tried to scrape the current Premier League table; the link is given here.
I will try to update it every week. If you like it, don’t forget to upvote it there and suggest what other datasets you want!
submitted by /u/Mental-Flight8195
I’ve built a large-scale plant dataset from iNaturalist research-grade observations:
96.1 million rows containing:
- species / genus / family names
- GBIF taxonomy IDs
- lat / lon
- event dates
- image URLs (iNat open data)
- license information
- dataset keys / source info
It’s meant for anyone doing:
- image classification (plants, ecology, biodiversity)
- large-scale ViT/ConvNext pretraining
- location-aware species modelling
- weak-supervised learning from image URLs
- training LoRA adapters for regional plant ID
Dataset (parquet, streamable via HF Datasets):
https://huggingface.co/datasets/juppy44/gbif-plants-raw
Let me know what you build with it!
submitted by /u/Lonely-Marzipan-9473
This dataset is synthetically generated and contains a diverse set of HTTP requests, labeled as either ‘benign’ or ‘malicious’. It is designed for training and evaluating AI-based Web Application Firewalls (WAFs).
submitted by /u/muneebdev