submitted by /u/storeLessBits
Category: Datatards
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
I can’t seem to find good databases or datasets for this. There are sporadic compilations, but they are completely inconsistent, and building one with Faker loses consistency very quickly.
I need about 50k rows of hospital -> patient -> procedures -> outcomes with chargebook references.
I understand real data is hard to come by, but are there any synthetic alternatives?
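One way to keep synthetic rows consistent is to generate them top-down, so every child row is created from a parent that already exists. Below is a minimal sketch using only the Python standard library. All IDs, field names, charge codes, prices, and outcome labels are hypothetical placeholders, not a real chargebook.

```python
import random


def generate_records(n_hospitals=5, patients_per_hospital=4, seed=0):
    """Build linked rows top-down so every foreign key points at a real parent."""
    rng = random.Random(seed)  # fixed seed keeps reruns identical
    # Hypothetical charge codes and prices, purely for illustration.
    chargebook = {"CHG-001": 120.0, "CHG-002": 850.0, "CHG-003": 4300.0}
    outcomes = ["recovered", "readmitted", "transferred"]

    hospitals, patients, procedures = [], [], []
    for h in range(n_hospitals):
        hospital_id = f"H{h:03d}"
        hospitals.append({"hospital_id": hospital_id})
        for p in range(patients_per_hospital):
            patient_id = f"{hospital_id}-P{p:04d}"
            patients.append({"patient_id": patient_id, "hospital_id": hospital_id})
            # Each patient gets between one and three procedures.
            for k in range(rng.randint(1, 3)):
                code = rng.choice(sorted(chargebook))
                procedures.append({
                    "procedure_id": f"{patient_id}-X{k}",
                    "patient_id": patient_id,
                    "charge_code": code,
                    "charge_amount": chargebook[code],  # always matches the chargebook
                    "outcome": rng.choice(outcomes),
                })
    return hospitals, patients, procedures, chargebook


hospitals, patients, procedures, chargebook = generate_records()
```

Scaling the two size parameters (for example 50 hospitals with 500 patients each) gives about 50k procedure rows on average, since each patient gets one to three procedures.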
submitted by /u/LibrarianUnlikely180
Looking for an Excel/spreadsheet version of the 2026 Forbes Billionaires list. Does anyone know how to get one?
submitted by /u/shivlor
Most datasets rely on logs or real user data — which makes them messy, inconsistent, and hard to use due to privacy constraints.
What we’re doing differently:
- fully synthetic, controllable data
- structured as state → decision → action → outcome
- built for tool use + multi-step workflows, not just text
So instead of cleaning logs, you can generate clean, privacy-safe datasets aligned to how your systems actually behave.
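For readers wondering what a state → decision → action → outcome record might look like, here is a minimal sketch. The field contents, tool name, and JSON Lines serialization are assumptions for illustration, not the poster's actual schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Step:
    """One synthetic training example; all field contents are hypothetical."""
    state: dict     # what the agent can see before acting
    decision: str   # e.g. "answer", "retrieve", "call_tool"
    action: dict    # the structured call that follows from the decision
    outcome: dict   # what the (simulated) system returned


example = Step(
    state={"user_message": "My invoice is wrong", "open_tickets": 0},
    decision="call_tool",
    action={"tool": "create_ticket", "args": {"topic": "billing"}},
    outcome={"status": "ok", "ticket_id": "T-1"},
)

# One JSON object per line is a common way to store such records.
line = json.dumps(asdict(example))
```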
Curious if others are moving toward synthetic + behavior-driven datasets for agents?
submitted by /u/JayPatel24_
This weekend I was looking for a dataset on major air crashes (I like planes) containing the text of their final reports. Surprisingly, I was unable to find even a single open-source dataset matching these criteria. Anyway, I started collecting a few reports and was extracting text and finalising the cleaning pipeline when I realized that I don’t really have a clear idea what to do with this data. Perhaps build a RAG, but what benefit would that have? Has anyone worked with such reports?
submitted by /u/AbdullahKhanSherwani
I’ve put together a dataset containing tech fingerprints from a web crawl spanning February 6th – February 13th 2026. Check out the preview to see what’s in it:
https://github.com/vdbio/versiondb_samples/tree/main/stats/2026_feb
The actual dataset can be found here:
https://github.com/vdbio/versiondb_samples/releases
Have fun!
submitted by /u/Upper-Character-6743
Hello,
Does anyone have a Twitter dataset that contains the username/ID of accounts along with the domain in their description/URL?
submitted by /u/NebulaEast1757
Thought this community might find this useful — I’ve built an API that makes NOAA’s hail data queryable by address.
The data:
- MESH (Maximum Estimated Size of Hail, a Multi-Radar Multi-Sensor product): Radar-derived hail size estimates from the NEXRAD network, 2020–present, ingested nightly
- Storm Events Database: NOAA/NWS verified severe weather reports, going back to the 1950s (hail-specific events)
Both datasets are geocoded and spatially indexed, so you can query by any US address and get back every hail event within a configurable radius, with dates, estimated hail sizes (inches), distance from the address, and the data source.
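To illustrate the idea of a radius query (this is not the StormPull API, and it uses a brute-force scan rather than a spatial index), here is a minimal sketch using the haversine great-circle distance. The event records, field names, and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def events_within(events, lat, lon, radius_miles):
    """Return events inside the radius, nearest first, with distance attached."""
    hits = []
    for event in events:
        d = haversine_miles(lat, lon, event["lat"], event["lon"])
        if d <= radius_miles:
            hits.append({**event, "distance_miles": round(d, 2)})
    return sorted(hits, key=lambda e: e["distance_miles"])


# Hypothetical, already-geocoded events.
events = [
    {"date": "2023-04-01", "hail_inches": 1.25, "source": "MESH",
     "lat": 32.80, "lon": -96.80},
    {"date": "2021-05-10", "hail_inches": 2.00, "source": "Storm Events",
     "lat": 32.7555, "lon": -97.3308},
]
nearby = events_within(events, lat=32.7767, lon=-96.7970, radius_miles=10)
```

A production service would geocode the address first and use a spatial index instead of scanning every event.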
Why I built it: NOAA’s raw data is publicly available but genuinely painful to work with at scale — scattered across FTP servers, inconsistent formats, no spatial indexing. I wanted a clean, fast API on top of it.
Access:
- Free tier: 100 lookups/month (no credit card)
- Web demo at https://www.stormpull.com (just type an address)
- REST API docs: https://www.stormpull.com/docs
If you’re doing any research involving hail frequency, property risk, climate patterns, or severe weather trends, this might save you a bunch of data wrangling time.
Happy to answer questions about the data sources, coverage, or methodology.
submitted by /u/danny_greer
Hi, I’m searching for datasets of bill parts (small handles, electrical connectors, screws, nuts, bolts, etc.).
I’m writing my bachelor’s thesis on the automatic parametrisation of vibration feeders, and I need to categorize the part geometry before I can select the arrangement mechanism I’ll need.
Does anyone have an idea where I can search for them? 🙂
submitted by /u/HISTeu
I need to know for a training video I’m recording: do you pronounce it “eye-so” code or “eye-ess-oh” code?
Sorry if this isn’t relevant here, but I couldn’t really find a better subreddit to ask on. I figured the dataset people would be familiar with it.
submitted by /u/HobieBrowncloak
Been working on a few dataset projects recently, mostly things like:
- lead generation lists (by niche + location)
- business directories (websites, contact info, categories)
- market research datasets (competitors, pricing, etc.)
- cleaning up messy CSVs / exports into something usable
Usually pulling from multiple sources (Google Maps, websites, public data, APIs), then deduping and structuring it into a clean dataset (CSV/XLSX).
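As a rough illustration of the dedupe-and-structure step, here is a minimal sketch that normalizes a name and website into a key, keeps the first record per key, and writes CSV. The field names, sample records, and normalization rules are assumptions, and `str.removeprefix` needs Python 3.9 or later.

```python
import csv
import io


def normalize(record):
    """Build a comparison key; the rules here are deliberately simple."""
    name = " ".join(record["name"].lower().split())
    website = (record.get("website", "").lower()
               .removeprefix("https://")
               .removeprefix("http://")
               .removeprefix("www.")
               .rstrip("/"))
    return (name, website)


def dedupe(records):
    """Keep the first record seen for each normalized key."""
    seen, unique = set(), []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def to_csv(records, fieldnames):
    """Serialize records to CSV text, ignoring fields not listed."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Hypothetical records pulled from two different sources.
raw = [
    {"name": "Acme  Plumbing", "website": "https://www.acme.example/"},
    {"name": "acme plumbing", "website": "http://acme.example"},
    {"name": "Bolt Electric", "website": "bolt.example"},
]
clean_csv = to_csv(dedupe(raw), fieldnames=["name", "website"])
```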
Trying to figure out what’s actually worth building next.
If you could get one dataset built for you right now, what would it be?
Interested to see what people here actually need.
submitted by /u/jesse_jones_
Disclosure: this is our own dataset.
Our dataset consists of 362 translation segments annotated by 48 professional linguists (not crowdsourced) across 16 language pairs.
MT systems evaluated: EuroLLM-22B, Qwen3-235B, TranslateGemma-12B.
Language pairs (all from English): Arabic (MSA, Egyptian, Moroccan, Saudi), Belarusian, French, German, Hmong, Italian, Japanese, Korean, Polish, Portuguese (Brazilian and European), Russian, Ukrainian.
Each segment includes full MQM error annotations:
- error category (accuracy, fluency, terminology, etc.)
- severity level (minor, major, critical)
- exact error span in the text
- multiple annotators per segment for inter-annotator agreement analysis
Methodology follows WMT guidelines. Kendall’s τ = 0.317 on IAA – roughly 2.6x what typical WMT campaigns report.
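For readers unfamiliar with the statistic, here is a minimal sketch of Kendall’s τ between two annotators’ per-segment scores. It implements the simple tau-a variant, where tied pairs count as neither concordant nor discordant. The dataset authors and WMT campaigns may use a different variant, and the scores below are made up.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in x or y; tau-a leaves it out of the numerator
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical penalty scores from two annotators on the same five segments.
annotator_a = [0, 1, 5, 10, 2]
annotator_b = [1, 0, 5, 12, 6]
tau = kendall_tau_a(annotator_a, annotator_b)
```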
It may be useful for MT evaluation research and benchmarking translation quality.
Dataset: https://huggingface.co/datasets/alconost/mqm-translation-gold
Happy to answer questions about the annotation process!
submitted by /u/ritis88
Around 8 years ago, we had the idea of using geographic data (census, accidents, crimes) in our models — and it ended up being a top 3 predictor.
Since then, I’ve rebuilt that postcode/zip code-level dataset at every company I’ve worked at, with great results across a range of models.
The trouble is that this dataset is difficult to create (in my case, for the UK):
- data is spread across multiple sources (ONS, crime, transport, etc.)
- everything comes at different geographic levels (OA / LSOA / MSOA / coordinates)
- even within a country, sources differ (e.g. England vs Scotland)
- and maintaining it over time is even worse, since formats keep changing
Which probably explains why a lot of teams don’t really invest in this properly, even though the signal is there.
After running into this a few times, a few of us ended up putting together a reusable postcode feature set for Great Britain, to avoid rebuilding it from scratch.
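To make the "different geographic levels" problem concrete, here is a minimal sketch of attaching area-level features to postcodes through a lookup table. The postcodes, area codes, and feature names are hypothetical placeholders, not values from the dataset.

```python
def build_postcode_features(postcode_to_lsoa, lsoa_features, postcode_features):
    """Merge features published at LSOA level and at postcode level per postcode."""
    rows = {}
    for postcode, lsoa in postcode_to_lsoa.items():
        row = {"postcode": postcode, "lsoa": lsoa}
        # Area-level features are shared by every postcode in that area.
        row.update(lsoa_features.get(lsoa, {}))
        # Postcode-level features, where a source provides them directly.
        row.update(postcode_features.get(postcode, {}))
        rows[postcode] = row
    return rows


# Hypothetical inputs.
postcode_to_lsoa = {"AB1 2CD": "LSOA_A", "AB1 2CE": "LSOA_A", "XY9 8ZZ": "LSOA_B"}
lsoa_features = {
    "LSOA_A": {"crime_rate": 12.5, "median_age": 41},
    "LSOA_B": {"crime_rate": 30.1, "median_age": 29},
}
postcode_features = {"AB1 2CD": {"distance_to_station_km": 0.8}}

features = build_postcode_features(postcode_to_lsoa, lsoa_features, postcode_features)
```

Postcodes without a value for some feature simply lack that key here; a real pipeline would need an explicit policy for missing values and for sources that differ between nations.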
If anyone’s interested, happy to share more details (including a sample).
https://www.gb-postcode-dataset.co.uk/
(Note: dataset is Great Britain only)
submitted by /u/Sweaty-Stop6057
One pattern we kept seeing while working with LLM systems:
The assistant sounds correct…
but nothing actually happens.
Example:
“Your issue has been escalated and your ticket has been created.”
But in reality:
- No ticket was created
- No tool was triggered
- No structured action happened
- The user walks away thinking it’s done
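The failure mode above can be checked mechanically if the system logs its tool calls. Here is a minimal sketch that flags replies claiming an action with no matching call. The tool names, regex patterns, and log format are assumptions for illustration.

```python
import re

# Hypothetical mapping from a tool name to phrasing that claims the tool ran.
CLAIM_PATTERNS = {
    "create_ticket": re.compile(r"\bticket has been (created|opened)\b", re.I),
    "escalate_issue": re.compile(r"\b(issue|case) has been escalated\b", re.I),
}


def unbacked_claims(reply, tool_calls):
    """Return tools the reply claims were used but that never appear in the log."""
    called = {call["name"] for call in tool_calls}
    return sorted(
        tool for tool, pattern in CLAIM_PATTERNS.items()
        if pattern.search(reply) and tool not in called
    )


reply = "Your issue has been escalated and your ticket has been created."
missing = unbacked_claims(reply, tool_calls=[])
partly_backed = unbacked_claims(reply, tool_calls=[{"name": "create_ticket"}])
```

Pattern matching like this only catches known phrasings, so it works better as a test-time check than as a guarantee.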
This feels like a core gap in how most datasets are designed.
Most training data focuses on:
→ response quality
→ tone
→ conversational ability
But in real systems, what matters is:
→ deciding what to do
→ routing correctly
→ triggering tools
→ executing workflows reliably
We’ve been exploring this through a dataset approach focused on action-oriented behavior:
- retrieval vs answer decisions
- tool usage + structured outputs
- multi-step workflows
- real-world execution patterns
The goal isn’t to make models sound better, but to make them actually do the right thing inside a system.
Curious how others here are handling this:
- Are you training explicitly for action / tool behavior?
- Or relying on prompting + system design?
- Where do most failures show up for you?
Would love to hear how people are approaching this in production.
submitted by /u/JayPatel24_
When people start learning data analytics, one common problem is they don’t have access to real company datasets.
I recently researched several practical ways beginners can still practice real data skills like SQL, Excel, and dashboards.
Some useful approaches include:
• Using public datasets from Kaggle or government portals
• Creating sample business datasets for practice
• Participating in Kaggle competitions
• Recreating dashboards from sample datasets
These methods help simulate real work scenarios and build a strong portfolio.
I also wrote a detailed guide explaining practical ways to practice data skills even without real company data.
submitted by /u/GrowthUpbeat6355
Are there any datasets about datasets that could tell us the average (mean) size of all known datasets? I know this is a somewhat unrealistic question, but I’m interested to know whether any research has been conducted on it.
submitted by /u/josephricafort
Chart overview — 5 panels of real NVDA data
What it is
ULTRA is a flat CSV dataset that aligns three data layers on the same 15-minute timestamp:
- GDELT (~1,256 cols): The full GCAM emotional spectrum — WordNet Affect, SentiWordNet, Harvard IV, AFINN, Loughran-McDonald financial sentiment, Moral Foundations, plus geopolitical events (GoldsteinScale, QuadClass, CAMEO codes), media mentions, entity extraction, and macro themes.
- AI Analysis (18 cols): Contextual sentiment from Gemini — not word-counting, but actual comprehension of why sentiment is negative (export controls vs earnings miss vs CEO departure). Includes impact, novelty, actionability, narrative codes, and binary flags.
- Price (16 cols): Multi-source OHLCV from Polygon.io + Twelve Data, VWAP, trade count, cross-source mean and spread, 15-min return.
96 timestamps per day. Currently covering the Magnificent Seven (AAPL, AMZN, GOOG, META, MSFT, NVDA, TSLA).
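The figure of 96 follows from 24 hours × 4 quarter-hours. Here is a minimal sketch of such a grid, assuming it starts at midnight and ignoring time zones (the dataset’s actual timestamp convention is not stated here).

```python
from datetime import datetime, timedelta


def quarter_hour_grid(day):
    """Return every 15-minute timestamp in one calendar day, starting at 00:00."""
    start = datetime(day.year, day.month, day.day)
    return [start + timedelta(minutes=15 * i) for i in range(24 * 60 // 15)]


grid = quarter_hour_grid(datetime(2026, 1, 2))
```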
Free sample + data dictionary
Full day of NVDA data (Jan 2, 2026) — all 1,437 columns, 96 rows. No paywall, no signup.
→ Sample CSV: marketsignal.solutions/data/samples/ULTRA_sample_NVDA.csv
→ Data Dictionary: marketsignal.solutions/data/samples/ULTRA_DataDictionary.txt
Quick load:
    import pandas as pd

    df = pd.read_csv("ULTRA_sample_NVDA.csv")
    print(f"{df.shape[1]} columns, {df.shape[0]} timestamps")

    # AI sentiment + price at market open
    cols = ["meta_timestamp", "ai_sentiment_score", "ai_impact_score",
            "ai_narrative_primary_code", "poly_close", "price_return_15m"]
    print(df[df["poly_close"].notna()][cols].head(10).to_string(index=False))
Why I built it
GDELT is incredible — it’s the world’s largest open news database. But it’s raw, unfiltered, and has no ticker mapping. If you want to use it for quant research, you need months of pipeline engineering just to get it into a usable format.
I built the pipeline that:
1. Ingests 3 GDELT streams every 15 minutes (GKG, Events, Mentions)
2. Matches articles to S&P 100 tickers via org-name resolution
3. Parses all 1,256 GCAM dimensions per ticker
4. Runs Gemini AI on every batch for contextual analysis
5. Fuses with multi-source verified price data
The result is a single CSV you can pd.read_csv() and start researching.
What I’m NOT claiming
- This is not “beat the market” data. It’s research-grade alternative data.
- GDELT is open/public — I didn’t create it. I created the pipeline, the AI layer, and the fusion.
- Coverage is currently 7 tickers (Mag 7). S&P 100 expansion is in progress.
- The AI layer depends on Gemini — it’s contextual NLP, not proprietary.
Pricing
$99/month for the Mag 7 live feed. Details at marketsignal.solutions.
Happy to answer any questions about the data, the pipeline, or the methodology.
This dataset is for research purposes. Past patterns do not guarantee future performance.
submitted by /u/SuggestionDry6614
So… for my project, I want to train a CNN, and I need a dataset consisting of the user’s distance (preferably in cm) from the device (e.g. laptop, PC, phone). Please help if you have found a good one!
submitted by /u/Glittering_Rub_8914
I am a figurative artist based in New York with work in the collections of the Metropolitan Museum of Art, MoMA, SFMOMA, and the British Museum. I recently published my catalog raisonne as an open dataset on Hugging Face.
What is in it:
- Roughly 3,000 to 4,000 documented works currently, spanning 1970s to present
- Media includes oil on canvas, works on paper, drawings, etchings, lithographs, and digital works
- Metadata fields: catalog number, title, year, medium, dimensions, collection, copyright holder, license, view type
- Images derived from 4x5 large format transparencies, medium format slides, and high resolution photography
- License: CC-BY-NC-4.0, free for research and non-commercial use
What makes it unusual:
Most fine art image datasets are scraped, aggregated, or institutionally compiled. This one is published directly by the artist, with metadata mapped from original physical archive records accumulated over fifty years. Every work is fully documented and provenance is intact. It is artist-controlled from the ground up.
The dataset currently represents roughly half my total output. I will keep adding works as scanning continues. It is a living dataset, not a static dump.
It has had over 2,500 downloads in its first week on Hugging Face.
Looking for:
Researchers or developers working with art image datasets who want to discuss potential uses or collaborations. Also interested in connecting with anyone building tools for visual archive navigation, as the Hugging Face default viewer is not adequate for this kind of dataset.
Dataset: huggingface.co/datasets/Hafftka/michael-hafftka-catalog-raisonne
submitted by /u/hafftka
Hey. I just launched a platform with cleaned, formatted data ready to pipe straight into model training. I swear I’m not trying to promote.
All 12,500 datasets are CC0 and free to download manually. The API just handles bulk or incremental access so you don’t have to write the data pipeline yourself.
And I’m giving away API access to 5 people who are actively training something. No catch, I just want real feedback.
Drop a comment or DM if you’re building something.
submitted by /u/IndependentRatio2336
Hi, I am David, and I need some advice.
I am currently developing a data monetization platform. Development is still in progress, but mostly everything is on track.
What worries me is this: to prove that the platform, the concept, and the workflow are actually viable, I am running a research study myself, manually doing all the work the platform would do.
The reason is that I previously built a blog-like website for developers and had to abandon the project because nobody visited it, and even the mildly interested people eventually left, so I had to close everything. I didn’t want that to happen again, so I made this decision.
Many weeks have passed. To prove the platform is viable and to have a proper deployment, I need at least 1 dataset buyer and 50 volunteers, whom I am paying to participate. So far I have confirmed 5 volunteers and contacted many possible dataset buyers, from AI researchers to teachers at various universities. I got some curious replies asking about the platform and the project itself. I even got an email from a Stanford professor saying the platform sounds like a really valuable resource and that he would tell his students in case someone is interested, but after that no one replied. I keep looking for possible buyers every day: I email them to reach out, look in forums, and post on Reddit and other platforms, but I am not really finding anyone. The same problem applies to the volunteers, although I could ease it a bit since I am using a survey platform; that is how I got the 5 I mentioned earlier, and I expect to keep getting more.
All of this has been done in parallel with the development of the platform. Since I am working alone, I tried using Antigravity to help with bugs and extra features; it made development more bearable.
That is where I am right now. I don’t want to end the project, but it’s squeezing me.
What should I do?
submitted by /u/Pegamento34
I need a large body of English prose, like a book or blog post, that uses all 26 letters at frequencies consistent with how often they’re used overall (x, z, and q used uncommonly but still included).
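Whether a candidate text qualifies can be checked directly. Here is a minimal sketch that computes relative letter frequencies and lists any of the 26 letters that never appear. The function name and sample sentences are illustrative only.

```python
import string
from collections import Counter


def letter_profile(text):
    """Return (relative frequency per letter, list of letters that never occur)."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    missing = sorted(set(string.ascii_lowercase) - set(counts))
    freqs = {ch: counts[ch] / total for ch in string.ascii_lowercase} if total else {}
    return freqs, missing


freqs, missing = letter_profile("The quick brown fox jumps over the lazy dog")
```

Comparing `freqs` against a published English letter-frequency table would then show how typical the text is.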
submitted by /u/fejiberglibstein
Hi everyone, I am an AI/ML student, and I am currently building a project that detects garbage littered by people in public places, calls people out for violating civic responsibility, and raises a real-time alarm. The catch is that this will be detected through IP cameras, so I need a valid dataset for the model to detect the garbage that people litter.
Please help…
submitted by /u/Xo_xombie