Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Synthetic Dataset For Chatbot Intent Detection Tasks

Hi everyone, this is a synthetic dataset, created with the Artifex library, for training and evaluating intent detection models in chatbots.

https://huggingface.co/datasets/tanaos/synthetic-intent-classifier-dataset-v1

It contains pairs of text samples and intent labels, where the intent labels (0 through 11) have the following meanings:

label intent
0 greeting
1 farewell
2 thank_you
3 affirmation
4 negation
5 small_talk
6 bot_capabilities
7 feedback_positive
8 feedback_negative
9 clarification
10 suggestion
11 language_change

The intents were chosen to be general enough to be applicable to most chatbots, regardless of their use.
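For anyone wiring these labels into code, here is a minimal sketch of the mapping in Python. The label IDs and names are copied from the table above; the names `ID2LABEL`, `LABEL2ID` and `decode` are just illustrative choices, not something the dataset itself defines.

```python
# Label IDs and intent names, copied from the table in the post.
ID2LABEL = {
    0: "greeting",
    1: "farewell",
    2: "thank_you",
    3: "affirmation",
    4: "negation",
    5: "small_talk",
    6: "bot_capabilities",
    7: "feedback_positive",
    8: "feedback_negative",
    9: "clarification",
    10: "suggestion",
    11: "language_change",
}

# Reverse lookup: intent name -> label ID.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode(label_id: int) -> str:
    """Turn a numeric label into its intent name (raises KeyError if unknown)."""
    return ID2LABEL[label_id]


print(decode(6))
```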

Hope this is helpful for someone!

submitted by /u/Ok_Hold_5385

Full 2026 World Cup Match Schedule (CSV, SQLite)

Hi everyone! I was working on a small side project around the upcoming FIFA World Cup and put the match schedule data into an easy-to-use format, because I couldn’t find it online. I decided to upload it to Kaggle for anyone to use! Check it out here: FIFA World Cup 2026 Match Data (Unofficial). There are 4 CSVs: teams, host cities, matches and tournament stages. There’s also a SQLite DB with the CSVs loaded in as tables for ease of use. Let me know if you have any questions, and reach out if you end up using it! 🙂
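A quick way to get oriented in a SQLite file like this is to list its tables and row counts before writing any real queries. The sketch below does that with Python's built-in `sqlite3` module. The post does not give the actual table or column names, so the in-memory database here is a made-up stand-in; with the real download you would pass the path of the file to `sqlite3.connect` instead.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    # The names come straight from sqlite_master, so quoting them is enough here.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Stand-in database with hypothetical tables and placeholder rows (assumption,
# not the real schema of the Kaggle dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (name TEXT)")
conn.execute("CREATE TABLE matches (home TEXT, away TEXT)")
conn.executemany("INSERT INTO teams VALUES (?)", [("Team A",), ("Team B",)])
conn.execute("INSERT INTO matches VALUES ('Team A', 'Team B')")

counts = table_row_counts(conn)
print(counts)
```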

submitted by /u/incognitus_24

High Dimensional Dataset: Any Ideas?

For my master’s degree in statistics I’m attending a course on high-dimensional data. We have to do a group project on a high-dimensional dataset, but I’m struggling to choose the right one.

Any suggestions on a dataset we could use? I’ve seen that there are many genomic datasets online, but I think they’re hard to interpret, so I was looking for something different.

Any ideas?

submitted by /u/Otherwise-Jelly-5973

Large-scale Image Dataset Of Perceptual Hashing?

‘Our dataset contains 1 200 original images’, which is not that many.

Do you know of a big dataset of (URL, date first seen, date last seen, pHash or another widely used perceptual hash) records for millions or billions of images?

It seems to be the sort of thing that would be

  1. Useful. ‘This photo was first posted here’ is a useful thing to know.

  2. Fairly small. The fields above would be about a KB per image, so a billion of them is about a terabyte.

  3. A complete pain to make the first time.
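The size estimate in point 2 is simple multiplication, sketched below. The 1 KB per record figure is the post's own rough guess (taken here as 1,000 bytes), not a measured value.

```python
def index_size_bytes(num_images: int, bytes_per_record: int = 1000) -> int:
    """Rough size of a (URL, dates, hash) index, ignoring any storage overhead."""
    return num_images * bytes_per_record


size = index_size_bytes(1_000_000_000)  # a billion images at ~1 KB each
print(size / 1e12, "TB")
```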

It would not find images of the same scene, or images that have been massively modified, but the tiny size of the data means that’s a trade-off.
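To illustrate that trade-off: perceptual hashes are typically compared by counting the bits in which they differ (the Hamming distance), so light edits land close together while heavy edits or different shots of the same scene do not. This is only a sketch; the hashes are treated as plain integers, and the 10-bit threshold is an arbitrary assumption rather than a recommended value.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def probably_same_image(hash_a: int, hash_b: int, max_bits: int = 10) -> bool:
    """Treat two hashes as a match if they differ in at most max_bits bits.

    max_bits=10 is an illustrative threshold, not a tuned one.
    """
    return hamming_distance(hash_a, hash_b) <= max_bits


print(hamming_distance(0b1011, 0b0001))
```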

submitted by /u/cavedave

Football Match Datasets – Specification Of Event Times For Each Match In A Given Competition

Hello,

As stated in the title, I’m looking for a dataset that includes all events in a football match (e.g., goals, fouls, yellow cards, VAR incidents, etc.) with the exact minute at which each event occurs. The datasets I’m familiar with only provide descriptive statistics for certain variables, which doesn’t meet my needs. If anyone knows of a specific dataset, or has any idea how to build or reconstruct one easily, it would help me a lot!
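To make the request concrete, the kind of record being asked for could look something like the sketch below. Every field name here is hypothetical, and the sample events are invented placeholders, not real match data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MatchEvent:
    """One timestamped event in a match (all field names are illustrative)."""
    match_id: str
    minute: int                   # minute of play at which the event occurred
    event_type: str               # e.g. "goal", "foul", "yellow_card", "var"
    team: str
    player: Optional[str] = None  # not every event has a single player


def events_of_type(events: List[MatchEvent], event_type: str) -> List[MatchEvent]:
    """Return the events of one type, ordered by minute."""
    return sorted(
        (e for e in events if e.event_type == event_type),
        key=lambda e: e.minute,
    )


# Invented placeholder events for demonstration only.
sample = [
    MatchEvent("m1", 67, "goal", "Team A", "Player 9"),
    MatchEvent("m1", 12, "foul", "Team B"),
    MatchEvent("m1", 23, "goal", "Team B", "Player 7"),
]
goal_minutes = [e.minute for e in events_of_type(sample, "goal")]
print(goal_minutes)
```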

Thanks in advance for your help, and have a great day.

submitted by /u/Taboulett

Anyone Here Run Human Data / RLHF / Eval / QA Workflows For AI Models And Agents? Looking For Your War Stories.

I’ve been reading a lot of papers and blog posts about RLHF / human data / evaluation / QA for AI models and agents, but they’re usually very high level.

I’m curious how this actually looks day to day for people who work on it. If you’ve been involved in any of:

  • RLHF / human data pipelines
  • labeling / annotation for LLMs or agents
  • human evaluation / QA of model or agent behaviour
  • project ops around human data

…I’d love to hear, at a high level:

  • how you structure the workflows and who’s involved
  • how you choose tools vs building in-house (or any missing tools you’ve had to hack together yourself)
  • what has surprised you compared to the “official” RLHF diagrams

Not looking for anything sensitive or proprietary, just trying to understand how people are actually doing this in the wild.

Thanks to anyone willing to share their experience. 🙏

submitted by /u/bibbletrash

Data-Driven “Men’s Global Wellbeing Index” Project (With Domain + Dashboard + Dataset)

Hey everyone,

I’ve been working on a project called the Men’s Global Wellbeing Index (MGWI) — a data-driven scoring system that compares men’s wellbeing conditions across different countries. I’ve put a lot into building the core foundation, but I’m shifting my focus to other projects and don’t want this one to sit unused.

I’m looking for someone who wants to take it over, expand it, build something bigger on top of it, or repurpose it for a similar project.

🔧 What MGWI Includes

  • 10 fully defined metrics (Suicide, Social Bias, Child Custody, Legal Bias, Homelessness, Workplace Fairness, Freedom of Expression, Mental Health Access, Violence Against Men, Loneliness)

Each metric includes:

  • Emoji marker
  • Full rationale/explanation
  • Consistent scoring system

Additional assets:

  • 10 countries scored (100-point total index)
  • Airtable backend with all data structured
  • Softr dashboard (mock-up style)
  • Domain name: Mensglobalwellbeingindex dot com
  • Brand notes, methodology, and all assets included
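For readers wondering how 10 metrics and a 100-point total might fit together, here is one possible reading, sketched in Python. The metric names come from the list above; the assumption that each metric is scored from 0 to 10 and the scores are simply summed is mine, since the post does not spell out the methodology.

```python
# Metric names as listed in the post.
METRICS = [
    "Suicide", "Social Bias", "Child Custody", "Legal Bias", "Homelessness",
    "Workplace Fairness", "Freedom of Expression", "Mental Health Access",
    "Violence Against Men", "Loneliness",
]

MAX_PER_METRIC = 10  # assumption: 10 metrics x 10 points = 100-point index


def total_index(scores: dict) -> float:
    """Sum per-metric scores into a 0-100 total (hypothetical scoring rule)."""
    missing = set(METRICS) - set(scores)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    for name in METRICS:
        if not 0 <= scores[name] <= MAX_PER_METRIC:
            raise ValueError(f"{name} score out of range: {scores[name]}")
    return sum(scores[name] for name in METRICS)


# Placeholder scores, not real country data.
example_total = total_index({name: 7 for name in METRICS})
print(example_total)
```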

🔎 SEO Notes

Some MGWI-related pages are already ranking on the first page for keywords like:

  • global wellbeing index for men
  • men’s wellbeing index
  • men’s global index
  • global index for men
  • index for men’s global wellbeing

(Useful if someone wants to continue the project or build an SEO-focused site.)

🎯 Who This Is Good For

  • Researchers
  • Activists or NGOs
  • University projects
  • Startups in wellbeing, mental health, or analytics
  • Indie makers looking for a meaningful data project
  • Anyone wanting a niche SEO website with long-term potential

📦 What I Can Share If You’re Interested

  • Demo video of the dashboard
  • Sample of the dataset
  • Full scoring methodology
  • Asset list + structure
  • Notes on future expansion (global rankings, crowdsourced sentiment, etc.)

I’m open to offers — mainly want this to go to someone who will actually build it out.

If you’re interested or want to see more, just comment or DM me.

submitted by /u/Zealousideal-Gap414