Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Action-oriented LLM Datasets (Tool Use + Workflows + Decision Logic)

Most datasets rely on logs or real user data — which makes them messy, inconsistent, and hard to use due to privacy constraints.

What we’re doing differently:

  • fully synthetic, controllable data
  • structured as state → decision → action → outcome
  • built for tool use + multi-step workflows, not just text
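A minimal sketch of what one state → decision → action → outcome record could look like when serialized as a JSON line. The post does not give a schema, so the class name `ActionRecord`, the field contents, and the `create_ticket` tool are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ActionRecord:
    """One synthetic training example (hypothetical schema, not from the post)."""
    state: dict     # what the agent observes before acting
    decision: str   # the choice made, e.g. "call_tool" or "answer_directly"
    action: dict    # the structured tool call (name plus arguments)
    outcome: dict   # what the tool or environment returned


# Every value below is invented for illustration.
record = ActionRecord(
    state={"user_message": "My invoice is wrong", "open_tickets": 0},
    decision="call_tool",
    action={"tool": "create_ticket", "args": {"category": "billing"}},
    outcome={"status": "ok", "ticket_id": "T-1"},
)

# One JSON object per line is a common layout for training data.
line = json.dumps(asdict(record))
print(line)
```

Keeping the decision separate from the action makes it possible to evaluate "should a tool be called at all" independently of "was the call well formed".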

So instead of cleaning logs, you can generate clean, privacy-safe datasets aligned to how your systems actually behave.

Curious if others are moving toward synthetic + behavior-driven datasets for agents?

submitted by /u/JayPatel24_

Almost Made A Dataset But Don’t Know What To Do With It

This weekend I was looking for a dataset on major air crashes (I like planes) containing the text of their final reports. Surprisingly, I was unable to find even a single open-source dataset matching these criteria. So I started collecting a few reports, and I was extracting the text and finalising the cleaning pipeline when I realized that I don’t really have a clear idea of what to do with this data. Perhaps build a RAG, but what benefit would that have? Has anyone worked with such reports?

submitted by /u/AbdullahKhanSherwani

10+ Years Of NOAA Hail Data, Geocoded And Queryable Via Free API

Thought this community might find this useful — I’ve built an API that makes NOAA’s hail data queryable by address.

The data:

  • MESH (Maximum Estimated Size of Hail): Radar-derived hail size estimates from the Multi-Radar Multi-Sensor (MRMS) system built on the NEXRAD network, 2020–present, ingested nightly
  • Storm Events Database: NOAA/NWS verified severe weather reports, going back to the 1950s (hail-specific events)

Both datasets are geocoded and spatially indexed, so you can query by any US address and get back every hail event within a configurable radius, with dates, estimated hail sizes (inches), distance from the address, and the data source.
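To illustrate the radius query described above, here is a rough sketch that filters already-geocoded events by great-circle distance from a point. It is not the API's implementation: it scans every event instead of using a spatial index, and the field names (`date`, `size_in`, `source`, `lat`, `lon`) and sample events are invented.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def events_within(events, lat, lon, radius_miles):
    """Return events inside the radius, nearest first, with the distance attached."""
    hits = []
    for event in events:
        d = haversine_miles(lat, lon, event["lat"], event["lon"])
        if d <= radius_miles:
            hits.append({**event, "distance_miles": round(d, 2)})
    return sorted(hits, key=lambda e: e["distance_miles"])


# Invented sample events; a real query would first geocode the address to lat/lon.
events = [
    {"date": "2023-05-01", "size_in": 1.75, "source": "MESH", "lat": 32.78, "lon": -96.80},
    {"date": "2021-04-12", "size_in": 1.00, "source": "StormEvents", "lat": 35.47, "lon": -97.52},
]
nearby = events_within(events, 32.7767, -96.7970, radius_miles=5)
for event in nearby:
    print(event["date"], event["size_in"], event["source"], event["distance_miles"])
```

A linear scan like this is fine for a few thousand rows; at national scale a spatial index (as the post mentions) avoids computing the distance to every event.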

Why I built it: NOAA’s raw data is publicly available but genuinely painful to work with at scale — scattered across FTP servers, inconsistent formats, no spatial indexing. I wanted a clean, fast API on top of it.

Access:

If you’re doing any research involving hail frequency, property risk, climate patterns, or severe weather trends, this might save you a bunch of data wrangling time.

Happy to answer questions about the data sources, coverage, or methodology.

submitted by /u/danny_greer

I’m Looking For 3D Geometry Datasets Of Bulk Parts

Hi, I’m searching for datasets of bulk parts (small handles, electrical connectors, screws, nuts, bolts, etc.).

I’m doing my bachelor’s thesis on the automatic parametrisation of vibratory feeders, and I need to categorize the part geometry before I can select the arrangement mechanism that I’ll need.

Does anyone have an idea where I can search for them? 🙂

submitted by /u/HISTeu

Anyone Here Need A Very Specific Dataset Built?

Been working on a few dataset projects recently, mostly things like:

  • lead generation lists (by niche + location)
  • business directories (websites, contact info, categories)
  • market research datasets (competitors, pricing, etc.)
  • cleaning up messy CSVs / exports into something usable

Usually pulling from multiple sources (Google Maps, websites, public data, APIs), then deduping and structuring it into a clean dataset (CSV/XLSX).
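As a rough illustration of the dedupe-and-structure step, the sketch below normalizes a business name and website into a key, keeps the first row per key, and writes CSV. The column names and sample rows are assumptions; real matching across sources usually needs fuzzier rules than this.

```python
import csv
import io
import re


def normalize_key(row):
    """Build a comparison key that ignores case, punctuation, scheme, and 'www.'."""
    name = re.sub(r"[^a-z0-9]+", " ", row["name"].lower()).strip()
    site = re.sub(r"^https?://(www\.)?", "", row["website"].lower()).rstrip("/")
    return (name, site)


def dedupe(rows):
    """Keep the first row seen for each normalized key."""
    seen = {}
    for row in rows:
        seen.setdefault(normalize_key(row), row)
    return list(seen.values())


# Invented rows, as if merged from two different sources.
rows = [
    {"name": "Acme Plumbing", "website": "https://www.acme.example/"},
    {"name": "ACME  Plumbing", "website": "http://acme.example"},
    {"name": "Bolt Electric", "website": "https://bolt.example"},
]
clean = dedupe(rows)

# Write the cleaned rows as CSV (to a string here; a file works the same way).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "website"])
writer.writeheader()
writer.writerows(clean)
print(buf.getvalue())
```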

Trying to figure out what’s actually worth building next.

If you could get one dataset built for you right now, what would it be?

Interested to see what people here actually need.

submitted by /u/jesse_jones_

Professional MQM-annotated Machine Translation Dataset – 16 Lang Pairs, 48 Annotators

Disclosure: this is our own dataset.

Our dataset consists of 362 translation segments annotated by 48 professional linguists (not crowdsourced) across 16 language pairs.

MT systems evaluated: EuroLLM-22B, Qwen3-235B, TranslateGemma-12B.

Language pairs (all from English): Arabic (MSA, Egyptian, Moroccan, Saudi), Belarusian, French, German, Hmong, Italian, Japanese, Korean, Polish, Portuguese (Brazilian and European), Russian, Ukrainian.

Each segment includes full MQM error annotations:

  • error category (accuracy, fluency, terminology, etc.)
  • severity level (minor, major, critical)
  • exact error span in the text
  • multiple annotators per segment for inter-annotator agreement analysis

Methodology follows WMT guidelines. Kendall’s τ = 0.317 on IAA – roughly 2.6x what typical WMT campaigns report.
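For readers unfamiliar with the agreement metric, here is a small sketch of Kendall's τ computed between two annotators' per-segment scores. The post does not say which τ variant was used or exactly what was ranked, so this uses the simple tau-a form, and the scores are invented.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs. Ties count as neither."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented per-segment penalty totals from two annotators on the same 5 segments.
annotator_a = [0, 1, 5, 6, 10]
annotator_b = [1, 0, 5, 10, 6]

tau = kendall_tau_a(annotator_a, annotator_b)
print(f"tau = {tau:.3f}")
```

A value of 1 means the two annotators rank every pair of segments the same way, 0 means no association, and -1 means fully reversed rankings.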

It may be useful for MT evaluation research and benchmarking translation quality.

Dataset: https://huggingface.co/datasets/alconost/mqm-translation-gold

Happy to answer questions about the annotation process!

submitted by /u/ritis88

Postcode/ZIP Code Dataset Is My Modelling Gold

Around 8 years ago, we had the idea of using geographic data (census, accidents, crimes) in our models — and it ended up being a top 3 predictor.

Since then, I’ve rebuilt that postcode/zip code-level dataset at every company I’ve worked at, with great results across a range of models.

The trouble is that this dataset is difficult to create (in my case, for the UK):

  • data is spread across multiple sources (ONS, crime, transport, etc.)
  • everything comes at different geographic levels (OA / LSOA / MSOA / coordinates)
  • even within a country, sources differ (e.g. England vs Scotland)
  • and maintaining it over time is even worse, since formats keep changing
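The "different geographic levels" problem can be sketched as a chain of lookups: postcode to LSOA, LSOA to MSOA, then a merge of whatever features exist at each level. All codes, table names, and values below are invented; real lookups would come from published postcode directories.

```python
# Hypothetical lookup and feature tables (codes and values are made up).
postcode_to_lsoa = {"AB1 2CD": "LSOA_001", "EF3 4GH": "LSOA_002"}
lsoa_to_msoa = {"LSOA_001": "MSOA_A", "LSOA_002": "MSOA_A"}
lsoa_features = {"LSOA_001": {"crime_rate": 12.5}, "LSOA_002": {"crime_rate": 30.1}}
msoa_features = {"MSOA_A": {"median_income": 31000}}


def normalize_postcode(postcode):
    """Uppercase and re-space a UK postcode; the inward part is always the last 3 characters."""
    compact = postcode.replace(" ", "").upper()
    return compact[:-3] + " " + compact[-3:]


def postcode_features(postcode):
    """Merge features from every geographic level onto one postcode row."""
    key = normalize_postcode(postcode)
    lsoa = postcode_to_lsoa.get(key)
    if lsoa is None:
        return None  # unknown postcode: leave it to the caller to impute or drop
    msoa = lsoa_to_msoa[lsoa]
    return {"postcode": key, **lsoa_features.get(lsoa, {}), **msoa_features.get(msoa, {})}


print(postcode_features("ab12cd"))
```

Normalizing the postcode before the lookup matters in practice, because source files disagree on spacing and case.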

Which probably explains why a lot of teams don’t really invest in this properly, even though the signal is there.

After running into this a few times, a few of us ended up putting together a reusable postcode feature set for Great Britain, to avoid rebuilding it from scratch.

If anyone’s interested, happy to share more details (including a sample).

https://www.gb-postcode-dataset.co.uk/

(Note: dataset is Great Britain only)

submitted by /u/Sweaty-Stop6057

Why LLMs Sound Right But Fail To Actually Do Anything (and How We’re Thinking About Datasets Differently)

One pattern we kept seeing while working with LLM systems:

The assistant sounds correct…
but nothing actually happens.

Example:

“Your issue has been escalated and your ticket has been created.”

But in reality:

  • No ticket was created
  • No tool was triggered
  • No structured action happened
  • The user walks away thinking it’s done
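One way to catch this failure in logs or in generated training data is to compare what the reply claims against the tool calls that actually ran. The sketch below is only an illustration: the phrase patterns, the tool names `create_ticket` and `escalate`, and the shape of `tool_calls` are all assumptions.

```python
import re

# Hypothetical mapping: a phrase that claims an action -> the tool that should back it.
CLAIM_PATTERNS = {
    "create_ticket": re.compile(r"ticket has been created", re.IGNORECASE),
    "escalate": re.compile(r"has been escalated", re.IGNORECASE),
}


def unbacked_claims(response_text, tool_calls):
    """Return the tools the reply claims were used but that were never called."""
    called = {call["tool"] for call in tool_calls}
    return sorted(
        tool
        for tool, pattern in CLAIM_PATTERNS.items()
        if pattern.search(response_text) and tool not in called
    )


reply = "Your issue has been escalated and your ticket has been created."

# No tools ran, so both claims are unbacked.
print(unbacked_claims(reply, tool_calls=[]))
# Only the ticket tool ran, so the escalation claim is still unbacked.
print(unbacked_claims(reply, tool_calls=[{"tool": "create_ticket"}]))
```

Phrase matching is crude and will miss paraphrases, but even a check like this separates "sounded right" from "actually did something".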

This feels like a core gap in how most datasets are designed.

Most training data focuses on:

  • response quality
  • tone
  • conversational ability

But in real systems, what matters is:

  • deciding what to do
  • routing correctly
  • triggering tools
  • executing workflows reliably

We’ve been exploring this through a dataset approach focused on action-oriented behavior:

  • retrieval vs answer decisions
  • tool usage + structured outputs
  • multi-step workflows
  • real-world execution patterns

The goal isn’t to make models sound better, but to make them actually do the right thing inside a system.

Curious how others here are handling this:

  • Are you training explicitly for action / tool behavior?
  • Or relying on prompting + system design?
  • Where do most failures show up for you?

Would love to hear how people are approaching this in production.

submitted by /u/JayPatel24_

How Do Beginners Practice Data Analysis Without Company Data?

When people start learning data analytics, one common problem is they don’t have access to real company datasets.

I recently researched several practical ways beginners can still practice real data skills like SQL, Excel, and dashboards.

Some useful approaches include:

• Using public datasets from Kaggle or government portals

• Creating sample business datasets for practice

• Participating in Kaggle competitions

• Recreating dashboards from sample datasets
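As an example of the "create a sample business dataset" approach, this sketch generates a small fake sales table and runs a typical reporting query on it with SQLite. The table layout, region and product names, and value ranges are all made up for practice purposes.

```python
import random
import sqlite3

random.seed(42)  # fixed seed so the practice data is reproducible

regions = ["North", "South", "East", "West"]
products = ["Widget", "Gadget", "Gizmo"]

# 200 invented orders: (order_id, region, product, quantity, unit_price)
rows = [
    (i, random.choice(regions), random.choice(products),
     random.randint(1, 10), round(random.uniform(5, 50), 2))
    for i in range(1, 201)
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, "
    "product TEXT, quantity INTEGER, unit_price REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)

# A typical analyst question: revenue by region, highest first.
revenue_by_region = conn.execute(
    """
    SELECT region, ROUND(SUM(quantity * unit_price), 2) AS revenue
    FROM sales
    GROUP BY region
    ORDER BY revenue DESC
    """
).fetchall()

for region, revenue in revenue_by_region:
    print(region, revenue)
```

The same table can be exported to CSV for Excel or dashboard practice, which covers all three skills mentioned above from one dataset.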

These methods help simulate real work scenarios and build a strong portfolio.

I also wrote a detailed guide explaining practical ways to practice data skills even without real company data.

submitted by /u/GrowthUpbeat6355

[Self-Promotion] [Paid] I Built A 1,437-column Alternative Financial Dataset That Fuses GDELT News Intelligence, AI Sentiment, And Multi-source Price At 15-minute Resolution. Free Sample Inside.

Chart overview — 5 panels of real NVDA data

What it is

ULTRA is a flat CSV dataset that aligns three data layers on the same 15-minute timestamp:

  • GDELT (~1,256 cols): The full GCAM emotional spectrum — WordNet Affect, SentiWordNet, Harvard IV, AFINN, Loughran-McDonald financial sentiment, Moral Foundations, plus geopolitical events (GoldsteinScale, QuadClass, CAMEO codes), media mentions, entity extraction, and macro themes.
  • AI Analysis (18 cols): Contextual sentiment from Gemini — not word-counting, but actual comprehension of why sentiment is negative (export controls vs earnings miss vs CEO departure). Includes impact, novelty, actionability, narrative codes, and binary flags.
  • Price (16 cols): Multi-source OHLCV from Polygon.io + Twelve Data, VWAP, trade count, cross-source mean and spread, 15-min return.

96 timestamps per day. Currently covering the Magnificent Seven (AAPL, AMZN, GOOG, META, MSFT, NVDA, TSLA).
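To make the 15-minute alignment concrete, here is a sketch that floors timestamps to their 15-minute bucket and merges records from three layers that land in the same bucket. Whether the actual dataset floors timestamps or uses another convention is not stated, and the field names and values are invented (they are not the dataset's real columns).

```python
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=15)
BUCKETS_PER_DAY = timedelta(days=1) // BUCKET  # 24 * 60 / 15 = 96


def floor_to_bucket(ts):
    """Round a timestamp down to the start of its 15-minute bucket."""
    return ts - timedelta(
        minutes=ts.minute % 15, seconds=ts.second, microseconds=ts.microsecond
    )


# Invented records from three layers, each with its own raw timestamp.
news = [(datetime(2026, 1, 2, 14, 37, 12), {"news_tone": -1.2})]
ai = [(datetime(2026, 1, 2, 14, 31, 0), {"ai_sentiment": -0.4})]
price = [(datetime(2026, 1, 2, 14, 44, 59), {"close": 100.0})]

# Merge all layers into one wide row per bucket.
aligned = {}
for layer in (news, ai, price):
    for ts, fields in layer:
        aligned.setdefault(floor_to_bucket(ts), {}).update(fields)

for bucket, row in sorted(aligned.items()):
    print(bucket.isoformat(), row)
```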

Free sample + data dictionary

Full day of NVDA data (Jan 2, 2026) — all 1,437 columns, 96 rows. No paywall, no signup.

Sample CSV: marketsignal.solutions/data/samples/ULTRA_sample_NVDA.csv
Data Dictionary: marketsignal.solutions/data/samples/ULTRA_DataDictionary.txt

Quick load:

    import pandas as pd

    df = pd.read_csv("ULTRA_sample_NVDA.csv")
    print(f"{df.shape[1]} columns, {df.shape[0]} timestamps")

    # AI sentiment + price at market open
    cols = ["meta_timestamp", "ai_sentiment_score", "ai_impact_score",
            "ai_narrative_primary_code", "poly_close", "price_return_15m"]
    print(df[df["poly_close"].notna()][cols].head(10).to_string(index=False))

Why I built it

GDELT is incredible — it’s the world’s largest open news database. But it’s raw, unfiltered, and has no ticker mapping. If you want to use it for quant research, you need months of pipeline engineering just to get it into a usable format.

I built the pipeline that:

  1. Ingests 3 GDELT streams every 15 minutes (GKG, Events, Mentions)
  2. Matches articles to S&P 100 tickers via org-name resolution
  3. Parses all 1,256 GCAM dimensions per ticker
  4. Runs Gemini AI on every batch for contextual analysis
  5. Fuses with multi-source verified price data

The result is a single CSV you can pd.read_csv() and start researching.

What I’m NOT claiming

  • This is not “beat the market” data. It’s research-grade alternative data.
  • GDELT is open/public — I didn’t create it. I created the pipeline, the AI layer, and the fusion.
  • Coverage is currently 7 tickers (Mag 7). S&P 100 expansion is in progress.
  • The AI layer depends on Gemini — it’s contextual NLP, not proprietary.

Pricing

$99/month for the Mag 7 live feed. Details at marketsignal.solutions.

Happy to answer any questions about the data, the pipeline, or the methodology.


This dataset is for research purposes. Past patterns do not guarantee future performance.

submitted by /u/SuggestionDry6614

[Dataset] 50-year Single-artist Fine Art Archive With Full Provenance Metadata — CC-BY-NC-4.0

I am a figurative artist based in New York with work in the collections of the Metropolitan Museum of Art, MoMA, SFMOMA, and the British Museum. I recently published my catalog raisonne as an open dataset on Hugging Face.

What is in it:

  • Roughly 3,000 to 4,000 documented works currently, spanning 1970s to present
  • Media includes oil on canvas, works on paper, drawings, etchings, lithographs, and digital works
  • Metadata fields: catalog number, title, year, medium, dimensions, collection, copyright holder, license, view type
  • Images derived from 4x5 large format transparencies, medium format slides, and high resolution photography
  • License: CC-BY-NC-4.0, free for research and non-commercial use

What makes it unusual:

Most fine art image datasets are scraped, aggregated, or institutionally compiled. This one is published directly by the artist, with metadata mapped from original physical archive records accumulated over fifty years. Every work is fully documented and provenance is intact. It is artist-controlled from the ground up.

The dataset currently represents roughly half my total output. I will keep adding works as scanning continues. It is a living dataset, not a static dump.

It has had over 2,500 downloads in its first week on Hugging Face.

Looking for:

Researchers or developers working with art image datasets who want to discuss potential uses or collaborations. Also interested in connecting with anyone building tools for visual archive navigation, as the Hugging Face default viewer is not adequate for this kind of dataset.

Dataset: huggingface.co/datasets/Hafftka/michael-hafftka-catalog-raisonne

submitted by /u/hafftka

12,500 Cleaned CC0 Datasets And Giving Away 5 Spots

Hey. Just launched a platform with cleaned, formatted data ready to pipe straight into model training. I swear I’m not trying to promote.

All 12,500 datasets are CC0 and free to download manually. The API just handles bulk or incremental access so you don’t have to write the data pipeline yourself.

And I’m giving away API access to 5 people who are actively training something. No catch, I just want real feedback.

Drop a comment or DM if you’re building something.

submitted by /u/IndependentRatio2336

I Need Real Advice

Hi, I am David, and I need some advice.

I am currently developing a data monetization platform. I am still working on the development, but mostly everything is on track.

What I am worried about is this: in order to prove that the platform, the concept, and the workflow are actually viable, I am running a research study myself, manually doing all the work the platform would do.

The reason is that in the past I built a blog-like website aimed at developers and had to abandon the project because nobody visited it. Even the people who were mildly interested eventually left, and I had to shut everything down. I didn’t want that to happen again, so I made that decision.

Many weeks have passed. To prove the platform is viable and to have a proper launch, I need at least 1 dataset buyer and 50 volunteers, whom I am paying to participate. So far I have confirmed 5 volunteers and contacted many possible dataset buyers, from AI researchers to teachers at various universities. I got some curious replies asking about the platform and the project itself. I even got an email from a Stanford professor saying the platform sounds like a really valuable resource and that he will tell his students in case someone is interested. But after that, no one replied. Every day I keep looking for possible buyers, emailing them, looking in forums, and posting on Reddit and other platforms, but I am not really finding anyone. The same problem applies to the volunteers, although I could ease it a bit since I am using a survey platform, which is how I got the 5 I mentioned earlier, and I expect it to keep bringing in some more.

All of this has been done in parallel with the development of the platform. Since I am working alone, I tried using Antigravity to help with bugs and extra features.

It made development more bearable.

That is where I am right now. I don’t want to end the project, but it’s squeezing me.

What should i do?

submitted by /u/Pegamento34

In Need Of A Dataset For A Very Important Project

Hi everyone. I am an AI/ML student, and I am currently building a project that detects garbage littered by people in public places, calls them out for violating their civic responsibility, and raises a real-time alarm. The catch is that detection will run on IP camera footage, so I need a suitable dataset for the model to detect the garbage that people litter.

Please help…

submitted by /u/Xo_xombie