Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

Sales Analysis Yearly Report - Help A Newbie

Hello all, hope everyone is doing well.

I just started a new job and have a sales report coming up. Is there anyone into sales data who can tell me what metrics and visuals I could add to get more out of this kind of data? (I have done some analysis and want some input from experts.) The data is transaction-level, with one year's worth of records.
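For a starting point, the metrics that usually come up first with a year of transaction-level data are the monthly revenue trend, average order value, and top products by revenue. A minimal pandas sketch, where the file path and column names (date, order_id, product, amount) are assumptions rather than anything from the post:

    import pandas as pd

    # Assumed schema: one row per transaction line item.
    tx = pd.read_csv("sales.csv", parse_dates=["date"])

    # Monthly revenue trend (period grouping works across pandas versions):
    monthly_revenue = tx.groupby(tx["date"].dt.to_period("M"))["amount"].sum()

    # Average order value and top products by revenue:
    aov = tx.groupby("order_id")["amount"].sum().mean()
    top_products = tx.groupby("product")["amount"].sum().nlargest(10)

    print(monthly_revenue)
    print(f"average order value: {aov:.2f}")
    print(top_products)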

Thank you in advance

submitted by /u/Afraid-Sound5502

[Dataset] Multi-Asset Market Signals Dataset For ML (Leakage-Safe, Research-Grade)

I’ve released a research-grade financial dataset designed for machine learning and quantitative research, with a strong focus on preventing lookahead bias.

The dataset includes:

– Multi-asset daily price data
– Technical indicators (momentum, volatility, trend, volume)
– Macroeconomic features aligned by release dates
– Risk metrics (drawdowns, VaR, beta, tail risk)
– Strictly forward-looking targets at multiple horizons

All features are computed using only information available at the time, and macro data is aligned using publication dates to ensure temporal integrity.

The dataset follows a layered structure (raw → processed → aggregated), with full traceability and reproducible pipelines. A baseline, leakage-safe modeling notebook is included to demonstrate correct usage.
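As a rough illustration of the publication-date alignment the post describes (a sketch only, not the dataset's actual pipeline; all names and values are invented):

    import pandas as pd

    prices = pd.DataFrame({
        "date": pd.to_datetime(["2024-01-10", "2024-01-15", "2024-02-05"]),
        "close": [100.0, 101.5, 99.8],
    })
    macro = pd.DataFrame({
        # December CPI is *released* on 2024-01-11, so rows dated
        # earlier must not see it.
        "release_date": pd.to_datetime(["2024-01-11", "2024-02-13"]),
        "cpi_yoy": [3.4, 3.1],
    })

    # direction="backward" attaches the latest release at or before
    # each price date, never a future one.
    aligned = pd.merge_asof(
        prices.sort_values("date"),
        macro.sort_values("release_date"),
        left_on="date",
        right_on="release_date",
        direction="backward",
    )
    print(aligned)  # the 2024-01-10 row gets NaN for cpi_yoy, as it should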

The dataset is publicly available on Kaggle: https://www.kaggle.com/datasets/DIKKAT_LINKI_BURAYA_YAPISTIR

Feedback and suggestions are very welcome.

submitted by /u/subcomandante_65

GitHub Top Projects From 2013 To 2025 (423,098 Entries)

Introducing the github-top-projects dataset: 423,098 GitHub trending repository entries spanning 12+ years (August 2013 – November 2025).

This dataset tracks the evolution of GitHub’s trending repositories over time, offering insight into software development trends across programming languages and domains.
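A hypothetical exploration sketch for a dataset like this; the file name and column names ("date", "language") are guesses at the schema, not confirmed by the post:

    import pandas as pd

    df = pd.read_csv("github-top-projects.csv", parse_dates=["date"])

    # Count of trending entries per language, per year:
    lang_by_year = (
        df.assign(year=df["date"].dt.year)
          .groupby(["year", "language"])
          .size()
          .rename("entries")
          .reset_index()
    )
    print(lang_by_year.sort_values(["year", "entries"],
                                   ascending=[True, False]).head(20))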

submitted by /u/Ok_Employee_6418

KashRock API Is In Public Beta — Normalized Player Props + DFS + Esports + Odds (Looking For Testers)

Disclosure: I’m the developer of KashRock (this is my project).

I’m sharing a normalized sports betting markets dataset/API that unifies player props, main markets, esports props, and traditional odds across multiple books (DFS + sportsbooks). The core value is canonicalization: one stat key, one player name, consistent IDs across books (so merges/joining across sources is straightforward). Some records also include bet links.
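To illustrate what that canonicalization buys, here is a hypothetical sketch; the field names are invented for the example and are not KashRock's actual schema:

    import pandas as pd

    # With one player_id and one stat_key shared across books,
    # cross-book comparison reduces to a plain merge.
    book_a = pd.DataFrame({
        "player_id": ["nba_luka_doncic"],
        "stat_key": ["points"],
        "line": [31.5],
        "book": ["A"],
    })
    book_b = pd.DataFrame({
        "player_id": ["nba_luka_doncic"],
        "stat_key": ["points"],
        "line": [30.5],
        "book": ["B"],
    })

    merged = book_a.merge(book_b, on=["player_id", "stat_key"],
                          suffixes=("_a", "_b"))
    merged["line_gap"] = (merged["line_a"] - merged["line_b"]).abs()
    print(merged[["player_id", "stat_key", "line_a", "line_b", "line_gap"]])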

What’s included

• Player props + main markets
• Esports props
• Traditional odds
• DFS books (PrizePicks, Underdog, ParlayPlay, etc.)
• Sportsbooks (bet365, Pinnacle, Hard Rock, Bovada, and more)

What I want feedback on (from dataset users)

• Schema/field naming (what you’d change to make it easier to use)
• Missing identifiers you need for joins (event/team/player IDs)
• Any normalization edge cases you want covered

Docs / access: https://api.kashrock.com/docs#/

submitted by /u/Apprehensive_Ice8314

How Do I Scrape Data From A Subreddit?

Hey everyone, I am new to this subreddit, but I have looked and looked and cannot find a straight answer.

I am a master's student who needs the data from a particular subreddit (r/antiwork). Part of it is available on Kaggle, but I need the latest posts as well. I know there have been changes to the Reddit API rules, and Pushshift is no longer available… Is there a way I can get more data?

I am using R and have tried using the RedditExtractoR package but that only gives me about 250 posts at once. Any tips would be really helpful. Thank you!
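For what the post is asking, the official API via PRAW is the usual route; a minimal Python sketch, assuming you register a "script" app at reddit.com/prefs/apps for credentials (note Reddit listings cap out around 1,000 items per listing, so historical coverage stays limited):

    import praw  # pip install praw

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder
        client_secret="YOUR_CLIENT_SECRET",  # placeholder
        user_agent="research-scraper by u/YOUR_USERNAME",
    )

    rows = []
    # Combining .new(), .top(), and .hot() and deduplicating by ID
    # widens coverage somewhat beyond a single listing.
    for submission in reddit.subreddit("antiwork").new(limit=None):
        rows.append({
            "id": submission.id,
            "created_utc": submission.created_utc,
            "title": submission.title,
            "selftext": submission.selftext,
            "score": submission.score,
            "num_comments": submission.num_comments,
        })

    print(f"fetched {len(rows)} posts")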

submitted by /u/Legitimate-Bite4801

Any Recs For Solid Data Analysis Tools That Don’t Leak My Info?

I’m hunting for tools to help crunch data without the manual headache. What are you guys actually using for deep analysis, especially for mixing messy Excel sheets with PDFs?

Edit: I’ve messed around with a few—ChatGPT is decent for basic formulas, and Infinisynapse has been a game changer. It’s pretty sick because it handles cross-source analysis locally on my machine, so I can scrape web data straight into my DB without worrying about privacy leaks.

submitted by /u/MongWonP

How Do You Decide When A Messy Dataset Is “Good Enough” To Start Modeling?

Lately I’ve been jumping between different public datasets for a side project, and I keep running into the same question: at what point do you stop cleaning and start analyzing?

Some datasets are obviously noisy – duplicated IDs, half-missing columns, weird timestamp formats, etc. My usual workflow is pretty standard: Pandas profiling → a few sanity checks in a notebook → light exploratory visualizations → then I try to build a baseline model or summary. But I’ve noticed a pattern: I often spend way too long chasing “perfect structure” before I actually begin the real work.
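For the sanity-check step, a minimal sketch of the kind of quick triage described above; the path, ID column, and 50% threshold are placeholders to adapt per dataset:

    import pandas as pd

    df = pd.read_csv("dataset.csv")  # placeholder path

    dup_ids = (df.duplicated(subset=["id"]).sum()
               if "id" in df.columns else None)
    missing_frac = df.isna().mean().sort_values(ascending=False)
    too_sparse = missing_frac[missing_frac > 0.5].index.tolist()

    print("rows:", len(df))
    print("duplicate ids:", dup_ids)
    print("columns >50% missing:", too_sparse)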

I tried changing the process a bit. I started treating the early phase more like a rehearsal. I’d talk through my reasoning out loud, use GPT or Claude to sanity-check assumptions, and occasionally run mock explanations with the Beyz coding assistant to see if my logic held up when spoken. This helped me catch weak spots in my cleaning decisions much faster. But I’m still unsure where other people draw the line.
How do you decide:

  • when the cleaning is “good enough”?
  • when to switch from preprocessing to actual modeling?
  • what level of missingness/noise is acceptable before you discard or rebuild a dataset?

Would love to hear how others approach this, especially for messy real-world datasets where there’s no official schema to lean on. TIA!

submitted by /u/jinxxx6-6

Seeking Tips For A Paid Dataset Of Twitter (X) High-follower Count Contact Info / Emails

I operate the Unofficial Twitter (X) Discord with 3,400 members, and in 2026 we plan to begin hosting guest speakers with large followings to share their content strategy, the tools they use, etc.

I’m looking for a paid index or database of verified emails and Twitter profiles to automate the invitation process. Tweetscraper yields contact emails for about 10% of profiles, which is a start. Bright Data has profile data and PII such as real names, but no contact information.

Any tips for other paid or free solutions are greatly appreciated!

submitted by /u/Alan-Foster

I Did My First Project: Spotify Trends And Popularity Analysis

This is my first data analysis project, and I know it’s far from perfect.

I’m still learning, so there are definitely mistakes, gaps, or things that could have been done better — whether it’s in data cleaning, SQL queries, insights, or the dashboard design.

I’d genuinely appreciate it if you could take a look and point out anything that’s wrong or can be improved.
Even small feedback helps a lot at this stage.

I’m sharing this to learn, not to show off — so please feel free to be honest and direct.
Thanks in advance to anyone who takes the time to review it 🙏

GitHub: https://github.com/1prinnce/Spotify-Trends-Popularity-Analysis

submitted by /u/1prinnce

Synthetic Dataset For Chatbot Intent Detection Tasks

Hi everyone, this is a synthetic dataset created with the Artifex library for training and evaluating intent detection in chatbots.

https://huggingface.co/datasets/tanaos/synthetic-intent-classifier-dataset-v1

It contains pairs of text samples and intent labels, where the intent labels (0 through 11) have the following meanings:

label  intent
0      greeting
1      farewell
2      thank_you
3      affirmation
4      negation
5      small_talk
6      bot_capabilities
7      feedback_positive
8      feedback_negative
9      clarification
10     suggestion
11     language_change

The intents were chosen to be general enough to be applicable to most chatbots, regardless of their use.
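A minimal loading sketch for the dataset; the split and column names are my assumptions, so check the dataset card for the actual schema:

    from datasets import load_dataset  # pip install datasets

    ds = load_dataset("tanaos/synthetic-intent-classifier-dataset-v1")

    INTENT_NAMES = [
        "greeting", "farewell", "thank_you", "affirmation", "negation",
        "small_talk", "bot_capabilities", "feedback_positive",
        "feedback_negative", "clarification", "suggestion",
        "language_change",
    ]

    print(ds)             # shows the available splits and columns
    row = ds["train"][0]  # assumes a "train" split exists
    print(row)
    print("intent:", INTENT_NAMES[row["label"]])  # assumes a "label" column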

Hope this is helpful for someone!

submitted by /u/Ok_Hold_5385

Full 2026 World Cup Match Schedule (CSV, SQLite)

Hi everyone! I was working on a small side project around the upcoming FIFA World Cup and put the match schedule data together in an easy-to-use form, because I couldn’t find it online. I decided to upload it to Kaggle for anyone to use! Check it out here: FIFA World Cup 2026 Match Data (Unofficial). There are four CSVs: teams, host cities, matches, and tournament stages. There’s also a SQLite DB with the CSVs loaded in as tables for ease of use. Let me know if you have any questions, and reach out if you end up using it! 🙂
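A hypothetical first query against the bundled SQLite file; the file and table names ("worldcup2026.db", "matches") are guesses, so check the Kaggle page for the real ones:

    import sqlite3

    con = sqlite3.connect("worldcup2026.db")
    try:
        # List the tables first, since the exact names may differ:
        tables = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
        print("tables:", tables)
        for row in con.execute("SELECT * FROM matches LIMIT 5"):
            print(row)
    finally:
        con.close()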

submitted by /u/incognitus_24

High Dimensional Dataset: Any Ideas?

For my master’s degree in statistics I’m attending a course on high-dimensional data. We have to do a group project on a high-dimensional dataset, but I’m struggling to choose the right one.

Any suggestions on a dataset we could use? I’ve seen that there are many genomic datasets online, but I think they’re hard to interpret, so I was looking for something different.

Any ideas?

submitted by /u/Otherwise-Jelly-5973

Large-Scale Image Dataset Of Perceptual Hashes?

“Our dataset contains 1,200 original images”, which is not that many.

Do you know of a big dataset of URL, date first seen, date last seen, and phash (or another well-used perceptual hash), covering millions or billions of images?

It seems to be the sort of thing that would be:

  1. Useful: “this photo was first posted here” is a useful thing to know.

  2. Fairly small: the fields above would be about a KB per image, and a billion of those is a terabyte.

  3. A complete pain to make the first time.

It would not get you images of the same scene or heavily modified copies, but the tiny size of the data makes that a reasonable trade-off.
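For concreteness, a sketch of the record shape the post describes, using the imagehash library; the file names, URL, and dates are placeholders:

    from PIL import Image  # pip install pillow
    import imagehash       # pip install imagehash

    img = Image.open("photo.jpg")
    phash = imagehash.phash(img)  # 64-bit DCT-based perceptual hash

    record = {
        "url": "https://example.com/photo.jpg",
        "date_first": "2024-01-10",
        "date_last": "2025-06-02",
        "phash": str(phash),  # 16 hex chars for 64 bits
    }
    print(record)

    # Near-duplicate lookup reduces to Hamming distance between hashes;
    # imagehash overloads "-" to compute it.
    other = imagehash.phash(Image.open("photo_resized.jpg"))
    print(phash - other)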

submitted by /u/cavedave

Football Match Datasets – Specification Of Event Times For Each Match In A Given Competition

Hello,

As stated in the title, I’m looking for a dataset that includes all events in a football match (e.g., goals, fouls, yellow cards, VAR incidents) with the exact minute at which each event occurs. The datasets I’m familiar with only provide descriptive statistics for certain variables, which doesn’t meet my needs. If anyone knows of a specific dataset, or has any idea of how to build or reconstruct one easily, it would help me a lot!

Thanks in advance for your help, and have a great day.

submitted by /u/Taboulett

Anyone Here Run Human Data / RLHF / Eval / QA Workflows For AI Models And Agents? Looking For Your War Stories.

I’ve been reading a lot of papers and blog posts about RLHF / human data / evaluation / QA for AI models and agents, but they’re usually very high level.

I’m curious how this actually looks day to day for people who work on it. If you’ve been involved in any of:

RLHF / human data pipelines / labeling / annotation for LLMs or agents / human evaluation / QA of model or agent behaviour / project ops around human data

…I’d love to hear, at a high level:

• how you structure the workflows and who’s involved
• how you choose tools vs. building in-house (or any missing tools you’ve had to hack together yourself)
• what has surprised you compared to the “official” RLHF diagrams

Not looking for anything sensitive or proprietary, just trying to understand how people are actually doing this in the wild.

Thanks to anyone willing to share their experience. 🙏

submitted by /u/bibbletrash