Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

The Biggest Free & Open Football Results & Stats Dataset

Hello!

I want to point out a dataset that I created, containing data on tens of thousands of historical football (soccer) matches that can be used to better understand the game or to train machine learning models. I am putting this up for free as an open resource. As of now, it is the biggest openly and freely available football match result, stats, and odds dataset in the world, with most of the data derived from Football-Data.co.uk:

https://github.com/xgabora/Club-Football-Match-Data-2000-2025

submitted by /u/AdkoSokdA

Swedish Conversation/dialog Datasets

I’ve been looking for datasets consisting of chats, conversations, or dialogues in Swedish, but it has been tough finding Swedish datasets. The closest solutions I have come up with are:

Building a program to record and transcribe conversations from my daily life at home.

Scraping Reddit comments or Discord chats.

Downloading subtitles from movies.

The issue with movie subtitles is that, without the context of the movie, the lines often feel disconnected or lack a proper flow. Anyone have better ideas or resources for Swedish conversational datasets?

I am trying to build an intent/text classification model. Do you have any ideas about what I could or should do, or where to search?

For those wondering, I am trying to build a simple Swedish NLP model as a hobby project.

Happy New Year!!

submitted by /u/Wallido17

NBA Historical Dataset: Box Scores, Player Stats, And Game Data (1949–Present) 🚀

Hi everyone,

I’m excited to share a dataset I’ve been working on for a while, now available for free on Kaggle! This comprehensive dataset includes detailed historical NBA data, meticulously collected and updated daily. Here’s what it offers:

Player Box Scores: Statistics for every player in every game since 1949.

Team Box Scores: Complete team performance stats for every game.

Game Details: Information like home/away teams, winners, and even attendance and arena data (where available).

Player Biographies: Heights, weights, and positions for all players in NBA history.

Team Histories: Franchise movements, name changes, and more.

Current Schedule: Up-to-date game times and locations for the 2024-2025 season.

I was inspired by Wyatt Walsh’s basketball dataset, which focuses on play-by-play data, but I wanted to create something focused on player-level box scores. This makes it perfect for:

Fantasy Basketball Enthusiasts: Analyze player trends and performance for better drafting and team-building strategies.

Sports Analysts: Gain insights into long-term player or team trends.

Data Scientists & ML Enthusiasts: Use it for machine learning models, predictions, and visualizations.

Casual NBA Fans: Dive deep into the stats of your favorite players and teams.

The dataset is packaged as a .sql file for database users, and .csv files for ease of access. It’s updated daily with the latest game results to keep everything current.

If you’re interested, check it out here: https://www.kaggle.com/datasets/eoinamoore/historical-nba-data-and-player-box-scores/

I’d love to hear your feedback, suggestions, or see any cool insights you derive from it! Let me know what you think, and feel free to share this with anyone who might find it useful.

Cheers.

submitted by /u/Low-Assistance-325

Normalized Database Dataset For Data Modeling

I’m interested in doing some data modeling on normalized database datasets. E-commerce, financial, really anything would probably be fine. I would like some sort of referential integrity, so that foreign keys match up to primary keys.

Looking for recommendations.

I’ve already played with TPC-H and am looking for other suggestions.
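For anyone evaluating a candidate dataset against the referential integrity requirement, here is a minimal sketch of an orphan check using Python's built-in sqlite3 module. The two-table schema, its column names, and the `find_orphans` helper are all made up for illustration and do not come from any dataset mentioned here.

```python
import sqlite3

def find_orphans(conn, child, fk, parent, pk):
    """Return child rows whose foreign key has no matching parent key."""
    # Table and column names are interpolated directly, so only pass trusted names.
    query = (
        f"SELECT c.* FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk} = p.{pk} "
        f"WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL"
    )
    return conn.execute(query).fetchall()

# Hypothetical e-commerce schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Enforcement is switched off to mimic a dump loaded without constraint checks.
    PRAGMA foreign_keys = OFF;
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 3, 5.0);
""")

# Order 12 points at customer 3, which does not exist.
orphans = find_orphans(conn, "orders", "customer_id", "customers", "customer_id")
print(orphans)
```

An empty result for every foreign key in a dataset suggests the keys line up. Any rows returned are the ones that would break joins.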

submitted by /u/drunk_goat

Seeking Dataset: Private Company Valuations & Exit Multiples (Deal-Level & Industry Benchmarks)

Hi everyone,

I’m on the hunt for datasets or sources that offer insights into private company valuations, particularly exit multiples and benchmark data.

Here’s what I’m ideally looking for:

Exit multiples (e.g., revenue multiples, EBITDA multiples) on a deal-by-deal basis as well as industry-wide benchmarks.

Data on geography-specific valuation metrics or benchmarks.

Industry breakdowns to identify trends in specific sectors.

Datasets or reports that cover private equity exits or M&A activity trends.

If you’re aware of any resources that provide a solid level of granularity, I’d be incredibly grateful for the help!

So far, I’ve explored platforms like PitchBook and CB Insights, but I’m curious if anyone knows of more detailed alternatives or supplementary datasets.

Likewise, if there are any public datasets, or even specific reports (e.g., whitepapers, academic studies, or proprietary research) that can provide similar insights, please send them my way.

Thank you in advance for any suggestions or pointers!

submitted by /u/Global-Departure3046

How To Generate Text Dataset Using Llama 3.1? [Synthetic]

So I am working on my semester mini-project. It’s titled “Indianism Detection in Texts Using Machine Learning” (yeah, I just randomly made it up during idea submissions). Now the problem is, as far as I can tell, no dataset for this exists anywhere. To get around this, I came up with a pipeline that converts a normal (correct) English phrase into English with Indianisms using my local Llama 3.1, and then saves both the correct and the converted sentences into a dataset with their respective labels.

I also created a simple pipeline for it (a kind of constitutional AI) but can’t seem to get any good responses. Could anyone suggest something better? (I’m 6 days away from the project submission deadline.)

I explained the current pipeline in this GitHub repo’s README. Check it out:
https://github.com/iamDyeus/Synthetica
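The general shape of such a labeled-pair pipeline can be sketched as follows. This is not the code from the repo above, just a rough outline under assumptions: the prompt wording, the `build_rows` and `write_dataset` helpers, and the 0/1 label scheme are invented for illustration, and `fake_generate` is a stand-in that would have to be replaced with a real call to the local model.

```python
import csv
import io

# Hypothetical prompt; the wording is an assumption, not taken from the repo.
PROMPT_TEMPLATE = (
    "Rewrite the sentence below so that it contains a typical Indianism, "
    "keeping the meaning unchanged. Reply with the rewritten sentence only.\n"
    "Sentence: {sentence}"
)

def build_rows(sentences, generate):
    """Yield (text, label) rows: 0 = standard English, 1 = rewritten with an Indianism."""
    for sentence in sentences:
        rewritten = generate(PROMPT_TEMPLATE.format(sentence=sentence)).strip()
        # Skip empty or unchanged outputs so the two classes stay distinguishable.
        if not rewritten or rewritten.lower() == sentence.lower():
            continue
        yield sentence, 0
        yield rewritten, 1

def write_dataset(rows, handle):
    """Write the labeled rows as CSV with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["text", "label"])
    writer.writerows(rows)

# Canned replies so the sketch runs without a model.
CANNED = {
    "Please do it now.": "Please do the needful.",
    "I will call you.": "I will call you.",
}

def fake_generate(prompt):
    """Stand-in for the local model call; looks up a canned reply."""
    sentence = prompt.rsplit("Sentence: ", 1)[1]
    return CANNED.get(sentence, "")

rows = list(build_rows(["Please do it now.", "I will call you."], fake_generate))
buffer = io.StringIO()
write_dataset(rows, buffer)
print(buffer.getvalue())
```

Filtering out unchanged or empty generations before writing is one cheap quality gate. A second pass that asks the model (or a person) to verify each rewritten sentence would be a natural next step.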

submitted by /u/dyeusyt

I’m Working On A Tool That Allows Anyone To Create Any Dataset They Want With Just Titles

I work full-time at a startup where I collect structured data with LLMs, and wanted to create a tool that does this for everyone. The idea is to eventually create a luxury system that can create any dataset you want with unique data points, no matter how large, and hallucination-free. If you’re interested in a tool like this, check out the website I just made to collect signups.

batchdata.ai

submitted by /u/D4isyy

Looking For Annual Datasets Of Any Kind For African Cities

Hi guys,

I am writing a paper on the changes in vulnerability of African cities and I’ve had a problem with finding data. I am looking for annual indicators of any kind (going back at least 30 years), although economic or environmental ones are needed most. While it is not difficult to find such data for African countries, finding it for African cities is borderline impossible. The only resource I found was Global Data Lab, which is kind of the perfect example of what I am looking for:

example

Again, any data in this form is appreciated though I’m aware how hard it is to find.

submitted by /u/Used-Ad1876

Our 3D Traffic Light And Sign Dataset Is Available On Kaggle

If you have a lot of free time during the holiday season and want to play with 3D traffic light and sign detection, our new Kaggle dataset is what you need!

The dataset consists of accurate and temporally consistent 3D bounding box annotations for traffic lights and signs, effective up to a range of 200 meters.

https://www.kaggle.com/datasets/tamasmatuszka/aimotive-3d-traffic-light-and-sign-dataset

submitted by /u/MatuszkaT

Does Anyone Know Where To Find A Dataset With Website Traffic Data?

Hi everyone,

I’m looking for some data to practice analyzing website performance. Specifically, I’d like information on metrics like time spent on page, number of pages viewed, and similar stats. My goal is to do some basic analysis—nothing too advanced.

Ideally, I’d love to work with e-commerce website data, but if that’s not available, data from any type of website would be great!

Does anyone know where I can find datasets like this?

submitted by /u/Pedro17f

🚗 Open-Source Car Dataset For Price Prediction! 📊

Hi everyone! 👋

We’re excited to share a dataset we’ve been working on that could be helpful for anyone interested in exploring machine learning and data analysis.

🔍 Why Use This Dataset?

Perfect for beginner-friendly ML projects.

Ideal for experimenting with algorithms like linear regression, decision trees, or neural networks.

Great for data visualization to identify trends in car pricing.

🚀 How to Get the Dataset

The dataset is hosted on https://www.kaggle.com/datasets/qubdidata/auto-market-dataset/data.

🛠️ Example Use Cases

Building a car price prediction model.

Analyzing the relationship between features like mileage and price.

Comparing the performance of ML models on this dataset.
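As a rough idea of what the mileage-versus-price use case looks like in code, here is a minimal least-squares line fit in plain Python. The numbers are made up for illustration and are not taken from the dataset, and `fit_line` is a hypothetical helper, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented toy values: mileage in thousands of km, price in currency units.
mileage = [20, 60, 100, 140]
price = [18000, 15000, 12000, 9000]

slope, intercept = fit_line(mileage, price)
print(f"price is roughly {intercept:.0f} + ({slope:.1f} * mileage)")
```

A negative slope here would mean that price drops as mileage goes up. With real data, the fit will be noisier and more features will be needed.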

🤝 Community Collaboration

This is an open-source project, so feel free to:

Contribute additional data points or clean the dataset.

Share your analysis or models built using the data.

Provide feedback to improve the dataset.

Let’s make this a valuable resource for the community! 🚗✨

Looking forward to seeing what you create. If you have any questions or suggestions, drop them in the comments below. 👇

submitted by /u/Qubdi

I’ve Collected A Dataset Of 1M+ App Store And Play Store Entries – Anyone Interested?

Hey everyone,

For my personal research, I’ve compiled a dataset containing over a million entries from both the App Store and Play Store. It includes details about apps, and I thought it might be useful for others working in related fields like app development, market analysis, or tech trends.

If anyone here is interested in using it for your own research or projects, let me know! Happy to discuss the details.

Cheers!

submitted by /u/26th_Official

Guidance Needed For Creating A Supervised Fine-Tuning Dataset Using PDFs

Hi Everyone,
I have a collection of about 15,000 pages of documents in PDF format authored by the same writer, covering topics like economics, linguistics, anthropology, history, religion, sociology, political science, and arts. These are spread across 17 different volumes.

I aim to create a supervised fine-tuning dataset from this corpus but lack access to human annotators. I am exploring the possibility of using LLMs for this purpose.

Could anyone guide me on how to:

Extract and preprocess the text efficiently?

Use LLMs for generating labels or annotations?

Handle diverse topics while ensuring the dataset’s quality and relevance?
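One possible starting point for the preprocessing step, assuming the PDF text has already been extracted to plain strings by some other tool, is sketched below. The function names, the chunk sizes, and the record fields (`volume`, `chunk_id`, `instruction`, `response`) are all assumptions chosen for illustration.

```python
import json
import re

def clean_text(raw):
    """Undo common PDF extraction damage: hyphenated line breaks and ragged whitespace."""
    # Joins "eco-\nnomics" into "economics"; note it also merges genuine
    # hyphenated compounds that happen to break across lines.
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()

def chunk_words(text, size=200, overlap=40):
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def to_jsonl(chunks, volume):
    """One JSON record per chunk, with empty fields for an LLM or a reviewer to fill."""
    return "\n".join(
        json.dumps({"volume": volume, "chunk_id": i, "text": chunk,
                    "instruction": "", "response": ""})
        for i, chunk in enumerate(chunks)
    )

sample = clean_text("Markets and lan-\nguage  shape\nsociety over time.")
print(to_jsonl(chunk_words(sample, size=4, overlap=1), volume=1))
```

Keeping the volume number and chunk index on every record makes it possible to trace a generated instruction/response pair back to its source passage when spot-checking quality.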

I would greatly appreciate any tools, libraries, or workflows you recommend. 🙏🏻

Thank you!

submitted by /u/Famous-Airline571

Looking For Historical Domain Sales Data (Willing To Buy)

I’m currently working on expanding my database of historical domain sales. Right now, I’ve got a solid collection of 1.1M sales records, but I’m looking to take it to the next level by increasing it to 1.5M (similar to NameBio) or more, like DnPrices.

If anyone here has access to such data and is willing to share or sell it, please let me know. I’m ready to purchase if the dataset aligns with what I’m looking for. Feel free to drop me a message or comment below if you’re interested.

submitted by /u/ilyasKerbal

Seeking Medical Dataset For Virtual Staining (Unstained & H&E-Stained Images)

Hello everyone,

I am a final-year student working on my project involving virtual staining using AI and deep learning techniques. Specifically, I am looking for a medical dataset that includes paired images of unstained cells and their corresponding stained counterparts (preferably H&E stained).

If anyone knows of publicly available datasets or resources where I can find such data, I would greatly appreciate your help.

Thank you in advance for your suggestions!

submitted by /u/its_codenova

Public Datasets Of fMRI Or sMRI Scans Of Mental Disorders

I am currently doing a research project at my college that I will have to present in July of next year. The project is still in its infancy and the foundations are only now being laid, since I have yet to start gathering the data for training the model, but the basic idea is pretty much set. I have some experience with this type of research, as I have already trained a deep learning model, a Vision Transformer, that could differentiate signs of the ASL alphabet in real time.

However, based on the research I have done so far (I still have to do tons more), it seems that some of these datasets use a special file format (.nii) that requires special preprocessing. The scope of the project is very malleable because I can define the labels based on the type of data that is publicly available on the internet. Since I am still relatively new to this area, I would like to know whether any of you have already worked on this subject and trained a related model. If you have, I would highly appreciate any guidance you could offer, including whether the data in the currently available datasets, like ADHD-200 or the one in SchizoConnect, is good. Thank you.
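To give a feel for what is inside a .nii file, here is a small sketch that reads a few fields from the fixed 348-byte NIfTI-1 header using only the Python standard library. The `read_nifti1_header` helper and the keys of the dictionary it returns are made up for this example. In practice a dedicated library such as nibabel would normally be used, and compressed .nii.gz files would need to be decompressed first (for example with the gzip module).

```python
import struct

def read_nifti1_header(raw):
    """Parse a few fields from the 348-byte header of an uncompressed NIfTI-1 file."""
    if len(raw) < 348:
        raise ValueError("a NIfTI-1 header is 348 bytes long")
    # sizeof_hdr (offset 0) must equal 348; try both byte orders to find the right one.
    for order in ("<", ">"):
        if struct.unpack_from(order + "i", raw, 0)[0] == 348:
            break
    else:
        raise ValueError("not a NIfTI-1 header")
    dim = struct.unpack_from(order + "8h", raw, 40)      # dim[0] = number of dimensions
    datatype, bitpix = struct.unpack_from(order + "2h", raw, 70)
    pixdim = struct.unpack_from(order + "8f", raw, 76)   # grid spacing per dimension
    vox_offset = struct.unpack_from(order + "f", raw, 108)[0]
    ndim = dim[0]
    return {
        "shape": dim[1:1 + ndim],          # e.g. (x, y, z, time) for an fMRI run
        "spacing": pixdim[1:1 + ndim],     # voxel sizes, then the time step if 4D
        "datatype_code": datatype,
        "bits_per_voxel": bitpix,
        "data_offset": int(vox_offset),    # byte position where the voxel data starts
        "single_file": raw[344:348] == b"n+1\x00",
    }

# Usage with a hypothetical file name:
# with open("scan.nii", "rb") as f:
#     print(read_nifti1_header(f.read(348)))
```

Checking the shape and spacing of every scan up front is a quick way to spot files that need resampling before they can be batched together for training.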

submitted by /u/MessierKatr

Please Help! Request For ADNI Dataset

Hi all,

I’m a master’s student currently conducting research on MCI conversion to Alzheimer’s disease using neuroimages. So far, I’ve found that the ADNI dataset is the only relevant resource for MCI-related data. However, I’m wondering if there are other datasets or sources of relevant data that you’d recommend for MCI-related research?

Regarding the ADNI dataset, I submitted a request for access a few days ago. For those with experience, is the approval rate generally high and the process straightforward? How long does it usually take to get access?

I’m asking because if the process is too difficult, I may need to consider changing my topic or exploring alternative data sources (which I hope to avoid).

Please help and thank you!

submitted by /u/ccss0103