Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it all for? World domination?

Master’s Project Ideas To Build Quantitative/Data Skills?

Hey everyone,

I’m a master’s student in sociology starting my research project. My main goal is to get better at quantitative analysis, stats, working with real datasets, and python.

I was initially interested in Central Asian migration to France, but I’m realizing it’s hard to find big or open data on that. So I’m open to other sociological topics that will let me really practice data analysis.

I’d greatly appreciate suggestions for topics, datasets, or directions that would help me build those skills.

Thanks!

submitted by /u/NebooCHADnezzar
[link] [comments]

I Built A Small AI That Reads Spreadsheets And Tells You The Story Inside — Want To Help Test It?

Hey everyone,
I’m testing a small experiment under Aptorie Labs, an AI that looks at your CSV or Excel files and writes a short, plain-English story about what’s really happening in the data.

It’s called Data-to-Narrative, and it’s built around a simple idea:
Instead of dashboards full of numbers, you get a short paragraph that sounds like a human analyst: no jargon, no buzzwords, just what matters.
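As a toy illustration of the idea (my own sketch, not the actual product): compute a couple of summary statistics and render them as one plain-English sentence.

```python
import pandas as pd

# Toy "data to narrative" sketch: pick the dominant group for a metric and
# phrase it as a sentence a non-analyst could read.
def narrate(df: pd.DataFrame, value_col: str, group_col: str) -> str:
    totals = df.groupby(group_col)[value_col].sum()
    top = totals.idxmax()
    share = totals[top] / totals.sum() * 100
    return (f"{top} leads with {share:.0f}% of total {value_col} "
            f"({totals[top]:.0f} of {totals.sum():.0f}).")

sales = pd.DataFrame({
    "region": ["North", "South", "North", "West"],
    "revenue": [120, 80, 100, 50],
})
print(narrate(sales, "revenue", "region"))
# → North leads with 63% of total revenue (220 of 350).
```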

I’m looking for a few early testers to try it out this week. You upload a dataset (sales, support tickets, survey results, etc.), and I’ll send back a written summary you can actually read and share with your team.

If you’re interested, DM me and I’ll send you the invite link to the beta upload form.
It’s part of a closed test, so I’m keeping the first batch small to make sure the summaries feel right.

Thanks in advance to anyone who wants to kick the tires. I’ll post a few anonymized examples once we’ve run the first round of tests.

Len

submitted by /u/lenbuilds
[link] [comments]

European Auto Data Startup: Partners & Providers Wanted

We are about to launch a new automotive data project, offering a highly detailed vehicle report for car checks. We will operate exclusively in the European market. Most of the data is already in place through our providers, but we are still exploring the market and are open to new collaborations.

We are looking for people who can help with the project: data providers, industry professionals, etc. Specifically, we are interested in providers for:

  • Commercial use status (taxi, rental, etc.)
  • Recalls
  • Damage information / Mileage information
  • Any other relevant data that could be integrated into our reports

We expect high volumes from launch, as we already have a large affiliate network and strong industry connections.

Thank you!

submitted by /u/cauchyez
[link] [comments]

You, Too, Can Now Leverage “Artificial Indian”

There was a joke for a while that “AI” actually stood for “Artificial Indian”, after multiple companies’ touted “AI” turned out to be outsourced workers in low cost-of-living countries, operating remotely behind the scenes.

I just found out that AWS’s assorted SageMaker AI offerings now offer direct, non-hidden Artificial Indian for anyone to hire, through a convenient interface they call “Amazon Mechanical Turk”.

https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-management-public.html

I’m posting here because its primary purpose is to give people a standardized way to pay for HUMAN INPUT on labelling datasets, so I figured the more people on the research side who knew about this, the better.

Get your dataset captioned by the latest in AI technology! 🙂

(disclaimer: I’m not being paid by AWS for posting this, etc., etc.)

submitted by /u/lostinspaz
[link] [comments]

Will Using Synthetic Data Affect My ML Model Accuracy Or My Resume?

Hey everyone 👋 I’m currently working on my final year engineering project based on disease prediction using Machine Learning.

Since real medical datasets are hard to find, I decided to generate synthetic data for training and testing my model. Some people told me it’s not a good idea — that it might affect my model accuracy or even look bad on my resume.

But my main goal is to learn the entire ML workflow — from preprocessing to model building and evaluation.

So I wanted to ask:

  • Will using synthetic data affect my model’s performance or generalization?
  • Does it look bad on a resume or during interviews if I mention that I used synthetic data?
  • Any suggestions to make my project more authentic or practical despite using synthetic data?

Would really appreciate honest opinions or experiences from others who’ve been in the same situation 🙌
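FWIW, one honest middle ground is rule-based synthetic data, where the label genuinely depends on the features, so the full workflow (EDA, preprocessing, modelling, evaluation) still has signal to work with. A stdlib-only sketch; the feature names and risk weights below are invented for illustration, not clinical:

```python
import random

# Minimal rule-based synthetic data generator (a sketch, not a clinical model):
# features come from plausible ranges, and the label follows a simple risk
# rule plus noise, so a downstream model has real signal to learn.
def generate_patients(n, seed=42):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(20, 80)
        bmi = round(rng.uniform(18, 40), 1)
        smoker = rng.random() < 0.3
        # Invented risk rule: older age, higher BMI, and smoking raise risk.
        risk = 0.02 * (age - 20) + 0.03 * (bmi - 18) + (0.4 if smoker else 0.0)
        label = 1 if rng.random() < min(risk / 2.5, 0.95) else 0
        rows.append({"age": age, "bmi": bmi, "smoker": smoker, "disease": label})
    return rows

data = generate_patients(1000)
positive_rate = sum(r["disease"] for r in data) / len(data)
print(f"{len(data)} synthetic patients, {positive_rate:.0%} positive")
```

Being explicit about the generation rules also gives you something concrete to discuss in interviews, which tends to read better than leaving the data’s origin vague.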

submitted by /u/shrinivas-2003
[link] [comments]

Finance-Instruct-500k-Japanese Dataset

Introducing the Finance-Instruct-500k-Japanese dataset 🎉

This is a Japanese dataset that includes complex questions and answers related to finance and economics.

This dataset is useful for training, evaluating, and instruction-tuning LLMs on Japanese financial and economic reasoning tasks.

submitted by /u/Ok_Employee_6418
[link] [comments]

[Self-Promotion] VC And Funded Startups Databases

After 5 years of curating VC contacts and funded startup data, I’m moving on to a new project. Instead of letting all this data disappear, I’m offering one last chance to grab it at 60% off.

What’s included:

VC Contact Lists (13 databases):

  • Complete VC contact database (1,300+ firms)
  • Specialized lists: AI, Biotech, Fintech, HealthTech, SaaS VCs
  • Stage-focused: Pre-Seed VCs, Seed VCs
  • Geography-focused: Silicon Valley, New York, Europe, USA
  • Bonus: AI Investors list

Funded Startup Databases (10 databases):

  • Full database: 6,000+ verified funded startups
  • By sector: AI/ML, SaaS, Fintech, Biotech/Pharma, Digital Health, Climate Tech
  • By region: USA, Europe, Silicon Valley

Everything is in Excel format, ready to download and use immediately.

Link: https://projectstartups.com

Happy to answer questions!

submitted by /u/project_startups
[link] [comments]

We Have A 60M Influencer Database And We’re Ready To Share It With You

Hey everyone! We’re the Crossnetics team, and we specialize in large-scale web data extraction. We handle any type of request and build custom databases with 30, 50, 100+ million records in just a few days (yes, we really have that kind of power).

We’ve already collected a ready-to-use database of 60M influencers worldwide, and we’re happy to share it with you. We can export it in any format and with any parameters you need.

If you’re interested, drop a comment or DM us — we’ll send details and what we can build for you.

submitted by /u/unicornsz03
[link] [comments]

Looking For Reliable Live Ocean Data Sources – Australia

Hey everyone! I’m a Master’s student based in Melbourne working on a project called FLOAT WITH IT, an interactive installation that raises awareness about rip currents and beach safety, aiming to reduce drowning among locals and tourists who often visit Australian beaches without knowing the risks. The installation uses real-time ocean data to project dynamic visuals of waves and rip currents onto the ground. Participants can literally step into the projection, interact with motion-tracked currents, and learn how rip currents behave and, more importantly, how to respond safely.

For this project, I’m looking for access to a live ocean data API that provides:

  • Wave height / direction / period
  • Tidal data
  • Current speed and direction

for Australian coastal areas (especially Jan Juc Beach, Victoria). I’ve already looked into sources like Surfline and some open marine data APIs, but most are limited or don’t offer live updates for Australian waters. Does anyone know of a public, educational, or low-cost API I could use for this? Even tips on where to find reliable live ocean datasets would be super helpful! This is a non-commercial, university research project, and I’ll be crediting any data sources used in the final installation and exhibition. Thanks so much for your help, I’d love to hear from anyone working with ocean data, marine monitoring, or interactive visualisation!

TL;DR: I’m a Master’s student creating an interactive installation about rip currents and beach safety in Australia. Looking for live ocean data APIs (wave, tide, current info, especially for Jan Juc Beach, VIC). Need something public, affordable, or educational-access friendly. Any leads appreciated!
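One possible lead, sketched below: the free Open-Meteo Marine API exposes hourly wave height / direction / period with no API key (verify the endpoint and variable names against its current docs; tides and rip-relevant current data will likely need another source, such as Australia’s Bureau of Meteorology). The coordinates are roughly Jan Juc Beach:

```python
from urllib.parse import urlencode

# Build a request URL for the Open-Meteo Marine API (parameter names per its
# public docs at the time of writing; double-check before relying on them).
BASE = "https://marine-api.open-meteo.com/v1/marine"
params = {
    "latitude": -38.35,          # ~Jan Juc Beach, Victoria
    "longitude": 144.31,
    "hourly": "wave_height,wave_direction,wave_period",
    "timezone": "Australia/Melbourne",
}
url = f"{BASE}?{urlencode(params)}"
print(url)

# To actually fetch (requires network):
# import urllib.request, json
# data = json.load(urllib.request.urlopen(url))
# print(data["hourly"]["wave_height"][:3])
```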

submitted by /u/pranavron
[link] [comments]

Looking For Official E-ZPass / Toll Transaction APIs Or Vendor Contacts (Building Driver Platform)

Hi all — I’m building a platform for drivers that consolidates toll activity and alerts drivers to unpaid or missed E-ZPass transactions (cases where the transponder didn’t register at a toll booth, or missed/failed toll posts). This can save drivers and fleet owners thousands in fines and plate suspensions — but I’m hitting a roadblock: finding a lawful, reliable data source / API that provides toll transaction records (or near-real-time missed/toll event feeds).

What I’m looking for:

  • Official APIs or data feeds (state toll agencies, E-ZPass Group members, DOTs) that provide: account/plate/toll-event, timestamp, toll location, amount, status (paid/unpaid), and reconciliation IDs.
  • Vendor/portal contacts at toll system vendors or third-party integrators who expose APIs.
  • Advice on legal/contractual path: who to contact to get read-only access for fleets, or how others built partnerships with toll agencies.
  • Pointers to public datasets or FOIA requests that returned usable toll transaction data.

If you’ve done something similar, worked at a toll authority, or can introduce me to the right dev/ops/partnership contact, please DM or reply here. Happy to share high-level architecture and the compliance steps we’ll follow. Thanks!

submitted by /u/CustomerAway5611
[link] [comments]

Open Maritime Dataset: Ship-tracking + Registry + Ownership Data (Equasis + GESIS + Transponder Signals) — Seeking Ideas For Impactful Analysis

I’m developing an open dataset that links ship-tracking signals (automatic transponder data) with registry and ownership information from Equasis and GESIS. Each record ties an IMO number to:

  • broadcast identity data (position, heading, speed, draught, timestamps)
  • registry metadata (flag, owner, operator, class society, insurance)
  • derived events such as port calls, anchorage dwell times, and rendezvous proximity

The purpose is to make publicly available data more usable for policy analysis, compliance, and shipping-risk research — not to commercialize it.

I’m looking for input from data professionals on what analytical directions would yield the most meaningful insights. Examples under consideration:

  • detecting anomalous ownership or flag changes relative to voyage history
  • clustering vessels by movement similarity or recurring rendezvous
  • correlating inspection frequency (Equasis PSC data) with movement patterns
  • temporal analysis of flag-change “bursts” following new sanctions or insurance shifts

If you’ve worked on large-scale movement or registry datasets, I’d love suggestions on:

  1. variables worth normalizing early (timestamps, coordinates, ownership chains, etc.)

  2. methods or models that have worked well for multi-source identity correlation

  3. what kinds of aggregate outputs (tables, visualizations, or APIs) make such datasets most useful to researchers

Happy to share schema details or sample subsets if that helps focus feedback.
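One way to prototype the rendezvous-proximity events mentioned above, as a toy sketch (the record layout and thresholds are illustrative, not the dataset’s actual schema):

```python
from math import radians, sin, cos, asin, sqrt
from datetime import datetime, timedelta

# Great-circle distance between two lat/lon points, in kilometres.
def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Flag pairs of position reports from different vessels that are close in
# both space and time (naive O(n^2) scan; real pipelines would spatially index).
def rendezvous(reports, max_km=1.0, max_minutes=30):
    hits = []
    for i, a in enumerate(reports):
        for b in reports[i + 1:]:
            if a["imo"] == b["imo"]:
                continue
            close_in_time = abs(a["ts"] - b["ts"]) <= timedelta(minutes=max_minutes)
            if close_in_time and haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km:
                hits.append((a["imo"], b["imo"]))
    return hits

t0 = datetime(2024, 1, 1, 12, 0)
reports = [
    {"imo": 9000001, "lat": 1.2500, "lon": 103.8000, "ts": t0},
    {"imo": 9000002, "lat": 1.2520, "lon": 103.8010, "ts": t0 + timedelta(minutes=10)},
    {"imo": 9000003, "lat": 5.0000, "lon": 100.0000, "ts": t0},
]
print(rendezvous(reports))
# → [(9000001, 9000002)]
```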

submitted by /u/captain_boh
[link] [comments]

Dataset Streaming For Distributed SOTA Model Training

“Streaming datasets: 100x More Efficient” is a new blog post sharing improvements to dataset streaming for training AI models.

link: https://huggingface.co/blog/streaming-datasets

Summary of the blog post:

We boosted load_dataset('dataset', streaming=True), streaming datasets without downloading them, with one line of code! Start training on multi-TB datasets immediately, without complex setups, downloading, “disk out of space” errors, or 429 “stop requesting!” errors.
It’s super fast, outrunning our local SSDs when training on 64×H100 with 256 workers downloading data. We’ve improved streaming to make 100× fewer requests, with 10× faster data resolution, 2× samples/sec, and 0 worker crashes at 256 concurrent workers.
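For context, the entry point really is the one-liner `load_dataset("name", streaming=True)`; conceptually, streaming iterates over remote shards lazily instead of materialising the whole dataset on disk first. A stdlib-only toy of that lazy-iteration idea (not the library’s actual implementation):

```python
from itertools import islice

# Toy model of dataset streaming: a shard is only "fetched" (here: generated)
# when the consumer actually reaches it, so iteration can start immediately
# and the full dataset never needs to exist on disk.
def stream_shards(num_shards, shard_size):
    for shard_id in range(num_shards):
        # In the real library this would be an HTTP range request per shard.
        shard = [{"shard": shard_id, "idx": i} for i in range(shard_size)]
        yield from shard

stream = stream_shards(num_shards=1000, shard_size=1000)  # "1M rows", lazily
first_three = list(islice(stream, 3))
print(first_three)
# → [{'shard': 0, 'idx': 0}, {'shard': 0, 'idx': 1}, {'shard': 0, 'idx': 2}]
```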

There is also a 1-minute video explaining the impact of this: https://x.com/andimarafioti/status/1982829207471419879

submitted by /u/qlhoest
[link] [comments]

How To Get The Latest Earthquake Data From The Japan Meteorological Agency

HELLO!

Working on a project at the moment that has to do with earthquakes, and the agency only provides archived data (in txt format) up to 2023. Although their site shows updated earthquake information, they haven’t updated the archives, so I can’t get the recent data in the same txt format. Is there anything I can do to aggregate the latest data without having to use other sites like USGS? Thank you so much.
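If a stopgap is acceptable while the JMA archives lag, the USGS FDSN event service can at least be filtered to a Japan bounding box. The sketch below only builds the query URL (parameter names follow the FDSN event spec; double-check them against the current USGS docs, and note the bounding box is approximate):

```python
from urllib.parse import urlencode

# Build a USGS FDSN event query restricted to an approximate Japan bounding
# box, returning CSV so it can be merged with existing txt-based archives.
BASE = "https://earthquake.usgs.gov/fdsnws/event/1/query"
params = {
    "format": "csv",
    "starttime": "2024-01-01",   # pick up where the 2023 archives end
    "minlatitude": 24, "maxlatitude": 46,
    "minlongitude": 122, "maxlongitude": 146,
    "minmagnitude": 3,
}
url = f"{BASE}?{urlencode(params)}"
print(url)

# To actually fetch (requires network):
# import urllib.request
# csv_text = urllib.request.urlopen(url).read().decode("utf-8")
```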

submitted by /u/takoyaki_elle
[link] [comments]

Complete NBA Dataset, Box Scores From 1949 To Today

Hi everyone. Last year I created a dataset containing comprehensive player and team box scores for the NBA. It contains all the NBA box scores at team and player level since 1949, kept up to date daily. It was pretty popular, so I decided to keep it going for the 25-26 season. You can find it here: https://www.kaggle.com/datasets/eoinamoore/historical-nba-data-and-player-box-scores

Specifically, here’s what it offers:

  • Player Box Scores: Statistics for every player in every game since 1949.
  • Team Box Scores: Complete team performance stats for every game.
  • Game Details: Information like home/away teams, winners, and even attendance and arena data (where available).
  • Player Biographies: Heights, weights, and positions for all players in NBA history.
  • Team Histories: Franchise movements, name changes, and more.
  • Current Schedule: Up-to-date game times and locations for the 2025-2026 season.

I was inspired by Wyatt Walsh’s basketball dataset, which focuses on play-by-play data, but I wanted to create something focused on player-level box scores. This makes it perfect for:

  • Fantasy Basketball Enthusiasts: Analyze player trends and performance for better drafting and team-building strategies.
  • Sports Analysts: Gain insights into long-term player or team trends.
  • Data Scientists & ML Enthusiasts: Use it for machine learning models, predictions, and visualizations.
  • Casual NBA Fans: Dive deep into the stats of your favorite players and teams.

The dataset is packaged as .csv files for ease of access. It’s updated daily with the latest game results to keep everything current.
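As a rough sketch of what a per-game analysis could look like once a file is loaded into pandas (the column names below are hypothetical placeholders, not confirmed from the actual Kaggle files; swap the inline sample for pd.read_csv on the real CSV):

```python
import pandas as pd

# Hypothetical box-score sample; playerName/gameDate/points are illustrative
# column names, not the dataset's confirmed schema.
box = pd.DataFrame({
    "playerName": ["A. Example", "A. Example", "B. Sample"],
    "gameDate": ["2025-10-21", "2025-10-23", "2025-10-21"],
    "points": [31, 27, 18],
})

# Career games played and points per game, sorted by scoring average.
per_game = (box.groupby("playerName")["points"]
               .agg(games="count", ppg="mean")
               .sort_values("ppg", ascending=False))
print(per_game)
```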

If you’re interested, check it out. Again, you can find it here: https://www.kaggle.com/datasets/eoinamoore/historical-nba-data-and-player-box-scores/

I’d love to hear your feedback, suggestions, or see any cool insights you derive from it! Let me know what you think, and feel free to share this with anyone who might find it useful.

Cheers.

submitted by /u/Low-Assistance-325
[link] [comments]

Looking For A Greenhouse Dataset For A University Project 🌱

Hi everyone! 👋

I’m currently working on a university project related to greenhouse crop production and I’m in need of a dataset. Specifically, I’m looking for data that includes:

  • Crop yield (kg/ha) — for crops like tomato, cucumber, capsicum, or similar
  • Environmental and input parameters such as temperature, humidity, light, CO₂, fertilizer usage, electricity consumption, and water usage

If anyone already has access to such a dataset or knows a reliable source where I could find one, I’d be incredibly grateful for your help. 🙏

Thank you in advance for any leads or suggestions! 🌿

submitted by /u/BobcatNo8108
[link] [comments]

ITI Student Dropout Dataset For ML & Education Analytics

Hey everyone! 👋

– Ever wondered which factors push students to drop out? 🤔

I built a synthetic dataset that lets you explore exactly that – combining academic, social, and personal variables to model dropout risk.

🔗 Check it out on Kaggle:

ITI Student Dropout Synthetic Dataset

📊 About the Dataset

The dataset contains 22 features covering:

  • 🎯 Demographics: age, gender, location, income, etc.
  • 📘 Academics: marks, attendance, backlogs, program type.
  • 💬 Personal & Social: motivation, family support, ragging, stress.
  • 🌐 Digital & Environmental: internet issues, distance from institute.

Target variable: dropout (Yes/No)

🧠 What You Can Do With It

  • Build and compare classification models (Logistic Regression, XGBoost, Random Forest, etc.)
  • Perform EDA and correlation analysis on academic + social factors.
  • Explore feature importance for understanding dropout causes.
  • Use it for education, ML portfolio, or student analytics dashboards.
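A minimal sketch of the EDA idea from the list above, stdlib only (the column names mirror the described features but are assumptions, and the inline rows are made up): dropout rate by attendance band.

```python
import csv, io

# Tiny inline stand-in for the real CSV; replace with open("dataset.csv").
raw = """attendance,dropout
45,Yes
90,No
60,Yes
85,No
70,No
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Bucket attendance, then count (total, dropouts) per bucket.
def band(att):
    return "low (<=60%)" if int(att) <= 60 else "high (>60%)"

rates = {}
for r in rows:
    b = band(r["attendance"])
    tot, drop = rates.get(b, (0, 0))
    rates[b] = (tot + 1, drop + (r["dropout"] == "Yes"))

for b, (tot, drop) in sorted(rates.items()):
    print(f"{b}: {drop}/{tot} dropped out")
```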

📚 Dataset Provenance:
Inspired by research like MDPI Data Journal’s dropout prediction study and India’s ITI Tracer Study (CENPAP), this dataset was programmatically generated in Python using probabilistic, rule-based logic to mimic real dropout patterns – fully synthetic and privacy-safe.

– ITI (Industrial Training Institute) offers vocational and technical education programs in India, helping students gain hands-on skills for industrial and technical careers.
These institutes mainly train students after 10th grade in trades like electrical, mechanical, civil, and computer IT.

If you like the dataset, please upvote, drop a comment, or try building models/code using it – so more learners and researchers can discover it and build something impactful!

submitted by /u/Grouchy-Peak-605
[link] [comments]

Made A 200-Dataset Collection To Save 50+ Hours Of Data Cleaning

I spent months cleaning and organizing 200+ datasets (CSV, Excel, JSON) for my own machine-learning and analytics projects.

They cover finance, retail, text, IoT, weather, and more — all structured, ready to use, and properly labeled.

It started as a side project but turned into something I use daily for modeling and dashboards.

If anyone’s interested in using them too, the link is in the comments 👇

submitted by /u/Smurgen6000
[link] [comments]

Welcome To R/learndataa. Let’s Make Learning Data Actually Practical.

Hey everyone!

This subreddit is for anyone learning data science, analytics, and AI. From beginners trying to understand Python to pros sharpening their machine learning skills.

The goal is simple: learn data by doing data.

Here’s what you can expect:

  • Weekly practice challenges
  • Honest discussions about learning paths and projects
  • Tips, tools, and code snippets that actually help
  • Community-led learning projects

I’d love to hear from you. What’s your biggest struggle right now with learning data? Let’s build this space around your needs.

u/Responsible-Gas-1474
Let’s learndataa, together.

submitted by /u/Responsible-Gas-1474
[link] [comments]