Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, i’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Looking For A Greenhouse Dataset For A University Project 🌱

Hi everyone! πŸ‘‹

I’m currently working on a university project related to greenhouse crop production and I’m in need of a dataset. Specifically, I’m looking for data that includes:

  • Crop yield (kg/ha) β€” for crops like tomato, cucumber, capsicum, or similar
  • Environmental and input parameters such as temperature, humidity, light, COβ‚‚, fertilizer usage, electricity consumption, and water usage

If anyone already has access to such a dataset or knows a reliable source where I could find one, I’d be incredibly grateful for your help. πŸ™

Thank you in advance for any leads or suggestions! 🌿

submitted by /u/BobcatNo8108
[link] [comments]

ITI Student Dropout Dataset For ML & Education Analytics

Hey everyone! πŸ‘‹

– Ever wondered which factors push students to drop out? πŸ€”

I built a synthetic dataset that lets you explore exactly that – combining academic, social, and personal variables to model dropout risk.

πŸ”— Check it out on Kaggle:

ITI Student Dropout Synthetic Dataset

πŸ“Š About the Dataset

The dataset contains 22 features covering:

  • 🎯 Demographics: age, gender, location, income, etc.
  • πŸ“˜ Academics: marks, attendance, backlogs, program type.
  • πŸ’¬ Personal & Social: motivation, family support, ragging, stress.
  • 🌐 Digital & Environmental: internet issues, distance from institute.

Target variable: dropout (Yes/No)

🧠 What You Can Do With It

  • Build and compare classification models (Logistic Regression, XGBoost, Random Forest, etc.)
  • Perform EDA and correlation analysis on academic + social factors.
  • Explore feature importance for understanding dropout causes.
  • Use it for education, ML portfolio, or student analytics dashboards.

πŸ“š Dataset Provenance:
Inspired by research like MDPI Data Journal’s dropout prediction study and India’s ITI Tracer Study (CENPAP), this dataset was programmatically generated in Python using probabilistic, rule-based logic to mimic real dropout patterns – fully synthetic and privacy-safe.

– ITI (Industrial Training Institute) offers vocational and technical education programs in India, helping students gain hands-on skills for industrial and technical careers.
These institutes mainly train students after 10th grade in trades like electrical, mechanical, civil, and computer IT.

If you like the dataset, please upvote, drop a comment, or try building models/code using it – so more learners and researchers can discover it and build something impactful!

submitted by /u/Grouchy-Peak-605
[link] [comments]

Made A 200 Dataset Save 50+ Hours Of Data Cleaning

I spent months cleaning and organizing 200+ datasets (CSV, Excel, JSON) for my own machine-learning and analytics projects.

They cover finance, retail, text, IoT, weather, and more β€” all structured, ready to use, and properly labeled.

It started as a side project but turned into something I use daily for modeling and dashboards.

If anyone’s interested in using them too, the link is in the comments πŸ‘‡

submitted by /u/Smurgen6000
[link] [comments]

Welcome To R/learndataa. Let’s Make Learning Data Actually Practical.

Hey everyone!

This subreddit is for anyone learning data science, analytics, and AI. From beginners trying to understand Python to pros sharpening their machine learning skills.

The goal is simple: learn data by doing data.

Here’s what you can expect:

  • Weekly practice challenges
  • Honest discussions about learning paths and projects
  • Tips, tools, and code snippets that actually help
  • Community-led learning projects

I’d love to hear from you. What’s your biggest struggle right now with learning data? Let’s build this space around your needs.

β€” u/Responsible-Gas-1474
Let’s learndataa, together.

submitted by /u/Responsible-Gas-1474
[link] [comments]

Sharing My Free Tool For Easy Handwritten Fine-tuning Datasets!

Hello everyone! I wanted to share a tool that I created for making hand written fine-tuning datasets, originally I built this for myself when I was unable to find conversational datasets formatted the way I needed when I was fine-tuning for the first time and hand typing JSON files seemed like some sort of torture so I built a little simple UI for myself to auto format everything for me.

I originally built this back when I was a beginner, so it is very easy to use with no prior dataset creation/formatting experience, but also has a bunch of added features I believe more experienced devs would appreciate!

I have expanded it to support :
– many formats; chatml/chatgpt, alpaca, and sharegpt/vicuna
– multi-turn dataset creation, not just pair-based
– token counting from various models
– custom fields (instructions, system messages, custom IDs),
– auto saves and every format type is written at once
– formats like alpaca have no need for additional data besides input and output, as default instructions are auto-applied (customizable)
– goal tracking bar

I know it seems a bit crazy to be manually typing out datasets, but handwritten data is great for customizing your LLMs and keeping them high-quality. I wrote a 1k interaction conversational dataset within a month during my free time, and this made it much more mindless and easy.

I hope you enjoy! I will be adding new formats over time, depending on what becomes popular or is asked for

Get it here

submitted by /u/ella0333
[link] [comments]

[WIP] ChatGPT Forecasting Dataset β€” Tracking LLM Predictions Vs Reality

Hey everyone,

I know LLMs aren’t typical predictors, but I’m curious about their forecasting ability. Since I can’t access the state of, say, yesterday’s ChatGPT to compare it with today’s values, I built a tool to track LLM predictions against actual stock prices.

Each record stores the prompt, model prediction, actual value, and optional context like related news. Example schema:

class ForecastCheckpoint: date: str predicted_value: str prompt: str actual_value: str = “” state: str = “Upcoming”

Users can choose what to track, and once real data is available, the system updates results automatically. The dataset will be open via API for LLM evaluation etc.

MVP is live: https://glassballai.com

Looking for feedback β€” would you use or contribute to something like this?

submitted by /u/aufgeblobt
[link] [comments]

Should My Business Focus On Creating Training Datasets Instead?

I run a YouTube business built on high-quality, screen-recorded software tutorials. We’ve produced 75k videos (2–5 min each) in a couple of months using a trained team of 20 operators. The business is profitable, and the production pipeline is consistent, cheap and scalable.

However, I’m considering whether what we’ve built is more valuable as AI agent training/evaluation data. Beyond videos, we can reliably produce:
– Human demonstrations of web tasks
– Event logs, (click/type/url/timing, JSONL) and replay scripts (e.g Playwright)
– Evaluation runs, (pass/fail, action scoring, error taxonomy) – Preference labels with rationales (RLAIF/RLHF)
– PII-safe/redacted outputs with QA metrics

I’m looking for some validation from anyone in the industry:
1. Is large-scale human web-task data (video + structured logs) actually useful for training or benchmarking browser/agent systems?
2. What formats/metadata are most useful (schemas, DOM cues, screenshots, replays, rationales)?
3. Do teams prefer custom task generation on demand or curated non-exclusive corpora?
4. Is there any demand for this? If so any recommendations of where to start? (I think i have a decent idea about this)

Im trying to decide whether to formalise this into a structured data/eval offering. Technical, candid feedback is much appreciated! Apologies if this isnt the right place to ask!

submitted by /u/cardDecline
[link] [comments]

I Analyzed 300+ Beauty Ads From 6 Major Brands. Here’s What Actually Worked.

1.Glossier & Rare Beauty: Emotion-led authenticity wins. Ads featuring real voices, personal moments, and self-expression hooks outperformed studio visuals by 42% in watch-through.

“This is how I wear it every day” outperformed polished tagline intros 3:1.
Lo-fi camera, warmth, and vulnerability = higher trust + saves.

2.Fenty Beauty & Dior Beauty: Identity & luxury storytelling rule. These brands drove results with bold openings + inclusivity or opulence.

Fenty’s shade range flex and Dior’s cinematic luxury scenes both delivered 38% higher brand recall and stronger engagement when paired with clear product hero shots.

Emotional tone + clear visual brand world = scroll-stopping authority.

3.The Ordinary & EstΓ©e Lauder: Ingredient authority converts. Proof-first ads highlighting hero actives (“Niacinamide 10% + Zinc”) or clinical claims delivered 52% higher CTR than emotion-only ads.

EstΓ©e Lauder’s “derm-tested” visuals with scientific overlays maintained completion rates above 70% impressive for long-form content.

Ingredient + measurable benefit = high-intent traffic.

Actionable Checklist

– Lead with a problem/solution moment, not a logo.

– Name one hero ingredient or one emotional hookβ€”not both.

– Match tone to brand: authentic (Glossier), confident (Fenty), expert (The Ordinary).

– Show proof before the CTA: testimonials, texture close-ups, or visible transformation.

– Keep the benefit visual (glow, smoothness, tone) front and center.

Want me to analyze your beauty niche next? Drop a comment.

This analysis was compiled as part of a project I’m working on. If you’re interested in this type of creative and strategic analysis, they’re still looking for alpha testers to help build and improve the product.

submitted by /u/RedBunnyJumping
[link] [comments]

[Release] I Built A Dataset Of Truth Social Posts/comments

I’m releasing a limited open dataset of Truth Social activity focused on Donald Trump’s account.
This dataset includes:

  • 31.8 million comments
  • 18,000 posts (Trump’s Truths and Retruths)
  • 1.5 million unique users

Media and URLs were removed during collection, but all text data and metadata (IDs, authors, reply links, etc.) are preserved.

The dataset is licensed under CC BY 4.0, meaning anyone can use, analyze, or build upon it with attribution.
A future version will include full media and expanded user coverage.

Heres the link πŸ™‚ https://huggingface.co/datasets/notmooodoo9/TrumpsTruthSocialPosts

submitted by /u/Ok-Analysis-6589
[link] [comments]

Exploring A Tool For Legally Cleared Driving Data Looking For Honest Feedback

Hi, I’m doing some research into how AI, robotics, and perception teams source real-world data (like driving or mobility footage) for training and testing models.

I’m especially interested in understanding how much demand there really is for high-quality, region-specific, or legally-cleared datasets β€” and whether smaller teams find it difficult to access or manage this kind of data.

If you’ve worked with visual or sensor data, I’d love your insight:

  • Where do you usually get your real-world data?
  • What’s hardest to find or most time-consuming to prepare?
  • Would having access to specific regional or compliant data be valuable to your work?
  • Is cost or licensing a major barrier?

Not promoting anything β€” just trying to gauge demand and understand the pain points in this space before I commit serious time to a project.
Any thoughts or examples would be massively helpful

submitted by /u/Warm_Sail_7908
[link] [comments]

Publish Data Snapshots As Versioned Datasets On The Hugging Face Hub

We just added a Hugging Face Datasets integration to fenic

You can now publish any fenic snapshot as a versioned, shareable dataset on the Hub and read it directly using hf:// URLs.

Example

“`python

Read a CSV file from a public dataset

df = session.read.csv(“hf://datasets/datasets-examples/doc-formats-csv-1/data.csv”)

Read Parquet files using glob patterns

df = session.read.parquet(“hf://datasets/cais/mmlu/astronomy/*.parquet”)

Read from a specific dataset revision

df = session.read.parquet(“hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/*/.parquet”) “` This makes it easy to version and share agent contexts, evaluation data, or any reproducible dataset across environments.

Docs: https://huggingface.co/docs/hub/datasets-fenic Repo: https://github.com/typedef-ai/fenic

submitted by /u/cpardl
[link] [comments]

Looking For A Dataset Of Threads.net Posts With Engagement Metrics (likes, Comments, Reposts)

Hi everyone,

I’m working on an automation + machine-learning project focused on content performance in the niche of AI automation (using n8n, workflow automations, etc). Specifically, I’m looking for a dataset of public posts from Instagram Threads (threads.net) that includes for each post:

– Post text/content

– Timestamp of publication

– Engagement metrics (likes, comments/replies, reposts/shares)

– Author’s follower count (or at least an indicator of their reach)

– Ideally, hashtags or keywords used

If you know of any publicly available dataset like this (free or open-source) or have scraped something similar yourself, I’d be extremely grateful. If not I’ll scrape it myself

Thanks in advance for any pointers, links, or repos!

submitted by /u/CauliflowerDry8400
[link] [comments]

Built A Glovo Product Data Scraper You Can Try For Free On Apify

I needed a glovo scraper on apify but the one that exists already has been broken for a few months. So I built one myself and uploaded it to apify for people to use it.

If you need to use the scraper for big data feel free to contact me and we can arrange a wayyyy cheaper option.

The current pricing is mainly for hobbyists and people to try it out with the free apify plan.

https://apify.com/blagoysimandoff/glovo-product-scraper

submitted by /u/Avatar111222333
[link] [comments]

Looking For A Way To Track Map Performance And Pick/ban Trends For Our Esports Teams

Hello guys,
I’m sure what I’m trying to do already exists. Has anyone else done something similar?

I manage an esports team, with four teams from four divisions competing each week in BO3 matches to reach Division 1. Before each BO, each team flips a coin, and the winner gets to pick two maps and ban two maps, while the loser gets to pick one map and ban two maps. I enjoy entering all this data into an Excel spreadsheet every week, but I would like to be able to generate a dashboard that shows me each team’s pick and ban trends for each map, as well as their win rate and loss rate on each map. This will allow me to anticipate a team’s favorite maps so I can ban them and, conversely, pick the maps they like the least. I’m terrible at pivot tables in Excel and can’t get what I want, and I don’t know what other tools could help me do that.

submitted by /u/cgx3577
[link] [comments]

Can I Legally Sell Data I’ve Scraped Myself?

A while back, someone needed a dataset that wasn’t available in its latest version. I ended up scraping it myself probably around 20k rows with 20–25 columns from some business related site and sent it over.

Now I’m wondering can I actually sell this kind of data on any platform, legally? If I collected it myself, am I allowed to list it somewhere for sale?

I’ve built tons of scrapers and have sent data directly to clients before, but I’m looking for a proper platform if there is one to sell on.

Yeah, I know I can Google it but Reddit bros are more brutally honest than search engines hehe…
Anyone know how this actually works?

submitted by /u/Gojo_dev
[link] [comments]

Looking For Early ChatGPT Responses – From Pineapple On Pizza To Global Unrest

Hi everyone, Im trying to track down historical ChatGPT question and response pairs, basically what ChatGPT was saying in its early days, to compare to responses now.

I’m mostly interested in culturally sensitive questions that require deeper thinking for example (but not exclusively these) -Is pineapple on pizza unhinged? -When will the Ukraine war end? -Who is the cause of biggest unrest in the world? -Should I vote Kamala or Trump? -Gay and civil right questions

Would be nice to have a few business orientated questions like what is the best ev to buy in 2022?

Does anyone know if there are public archives, scraped datasets, I will even take screen shots, or research projects that preserve these older Q&A interactions? I’ve seen things like OASST1, ShareGPT, both of which have been a good start to digging in.

English QA pairs at this stage. But will gladly take leads on other language sets if you have them.

Any leads from fellow hoarders, researchers, or time traveling prompt engineers would be amazing.

Any help greatly appreciated.

Stu

submitted by /u/Datavisualisation
[link] [comments]

Looking For Campaign Speech Datasets (ENG)

Good Day People of Reddit! Please help me graduate :))) by helping me find a suitable dataset that has the following:
1. US or any other English Speaking Country Electorial Campaign Dataset. (Debate, Speech, etc)
2. Either CSV or JSON. (Would also appreciate if you can help me find some links where i could data scrape)
3. Not limited to Presidents, Vice Presidents. Any Politician would do
4. Must be more than 10K.

For those that will recommend or comment. I thank you all!!!

submitted by /u/Actual_Quarter8447
[link] [comments]

Looking For The Most Comprehensive API Or Dataset For Upcoming Live Music Events By City And Date (including Indie Artists)

I’m trying to find the most complete source of live music event data β€” ideally accessible through an API.

For example, when I search Austin, TX or Portland, OR, I’ve noticed that Bandsintown seems to have a much more extensive dataset compared to Songkick or Jambase. However, it looks like Bandsintown doesn’t provide public API access for querying all artists or events by city/date.

Does anyone know of: – Any public (or affordable) APIs that provide event listings by city and date? – Any open datasets or scraping-friendly sources for live music events?

I’m building a project to build playlists based on upcoming live music events in a given city.

Thanks in advance for any leads!

submitted by /u/surely_normal
[link] [comments]

Datasets Into Managed APIs [self-promotion]

Hi datasets!

We have been working on https://tapintodata.com/, which lets you turn raw data files into managed, production-ready APIs in seconds. You upload your data, shape it with SQL transformations as needed, and then expose it via documented, secured endpoints.

We originally built it when we needed an API from the Scottish Energy Performance Certificate dataset, which is shared as a zip of 18 CSV files totalling 7.17 GB, which you can now access freely here: https://epcdata.scot/

It currently supports CSV, JSONL (optionally gzipped), JSON (array), Parquet, XLSX & ODS file formats for files of any size. The SQL transformations allow you to join across datasets, transform, aggregate and even geospatial indexing via H3.

It’s free to sign up with no credit card required and has generous free tier (1 GB or storage and 500 requests/month). We are still early and are looking for users that can help shape the product or any datasets you require as APIs that we can generate for you!

submitted by /u/hedgehogsinus
[link] [comments]

Social Media Hook Mastery: A Data-Driven Framework For Platform Optimization

We analyzed over 1,000 high-performing social media hooks across Instagram, YouTube, and LinkedIn using Adology’s systematic data collection and categorization.

By studying only top-performing content with our proprietary labeling methodology, we identified distinct psychological patterns that drive engagement on each platform.

What We Discovered: Each platform has fundamentally different hook preferences that reflect unique user behaviors and consumption patterns.

The Platform Truth:
> Instagram: Heavy focus on identity-driven content
> YouTube: Balanced distribution across multiple approaches
> LinkedIn: Professional complexity requiring specialized approaches

Why This Matters: Understanding these platform-specific psychological triggers allows marketers to optimize content strategy with precision, not guesswork. Our large-scale analysis reveals patterns that smaller studies or individual observation cannot capture.

Want my 1,000 hooks full list for free? Chat in the comment

submitted by /u/RedBunnyJumping
[link] [comments]