Tag The Picture Activity. Netherlands Museum Of The World

submitted by /u/cavedave

Need A Messy Dataset For A Class I’m In, Where Can I Go To Get One?

I’m in college right now and I need an “unclean/untidy” dataset: one that has a bunch of missing values, poor formatting, duplicate entries, etc. Is there a website I can go to that gives data like this? I hope to get into the renewable energy field, so data covering that topic would be exactly what I’m looking for, but any website that has this sort of thing would help me.

Thanks in advance

submitted by /u/timedoesnotwait

Uncleaned Dataset With At Least 20k Entries

Hi guys, for a project I need a large dataset that’s uncleaned so that I can show I can clean it, make visualizations, and draw analysis from it. If anyone can help, please reach out. Thank you so much.

submitted by /u/bubblbubbles

Does Anyone Have An Extensive Case Study (data Based) That I Can Use To Practice Some Analytics And Analysis?

Can anyone help with a resource that has a full case study I can work on, ideally with a solution I can compare against? The solution part is not a must. I’m just looking for a case study to try my hand at. Thanks

submitted by /u/NegotiationAnnual977

Just Came Across A New List Of Open-access Databases.

No logins, no paywalls—just links to stuff that’s (supposed to be) freely available. Some are solid, some not so much. Still interesting to see how scattered this space is.

Here’s the link: Free and Open Databases Directory

submitted by /u/opendatahunter

To Everyone In The Datasets Community, I Would Like To Give An Update

My name is Jason Baumgartner and I am the founder of Pushshift. I have been dealing with some health issues, but hopefully my eye surgery will be coming up soon. I developed PSCs (posterior subcapular cataracts) from late-onset diabetes.

I have been working lately to bring more amazing APIs and tools to the research community, including making available a large number of datasets containing YouTube data and many other social media datasets.

Currently I have collected around 15 billion YouTube comments and billions of YouTube channel and video metadata records.

My goal, once my surgery is completed and my eyes heal, is to get back into the community and invite others who love data to work with all this data.

I greatly appreciate everyone who donates or spreads the word about my gofundme.

I will be providing updates over time, but if you want to reach out to me, please use the email in my Reddit profile (the gmail one).

I want to thank all of the datasets moderators for assisting me during this challenging period in my life.

I am very excited to get back into the saddle and pursue my biggest passion – data science and datasets.

I no longer control the Pushshift domain, but I will be sharing a new name soon and letting everyone know what’s been happening over the past 2 years.

Thanks again and I will try to respond to as many emails as possible.

You can find the link to my gofundme in my Reddit profile or my post in /r/pushshift.

Feel free to ask questions in this post and I will try to answer as soon as possible. Also, if you have any questions about specific social media data that you are interested in, I would be happy to clarify what data I currently have and what is on the roadmap in the future. It would be very helpful to see what data sources people are interested in!

submitted by /u/Stuck_In_the_Matrix

Like Will Smith Said In His Apology Video, “It’s Been A Minute” (Although I Didn’t Slap Anyone)

submitted by /u/hypd09

What Is That One Problem Which Is Taking Too Much Time Or Effort?

Hey there, I’m currently trying to start my first SaaS and I’m searching for a genuinely painful problem to build a solution for. I need your help. Got a quick minute?
I’m specifically interested in things that are taking your time, money, or effort. It would be great if you could tell me the story.

submitted by /u/HectorAlcazar11

Tideon AI Makes Analyzing Excel Datasets 5x Faster — Try It Free

If you work with Excel files regularly, I wanted to share something that’s been a game-changer for me: Tideon AI — an AI-powered platform that lets you chat with your datasets instantly.

Instead of manually digging through spreadsheets, you can:

Upload Excel files and ask questions in plain English
Get instant insights without writing formulas

Would love to hear if this helps anyone here streamline their workflow!

Link: https://tideon.ai

submitted by /u/Narrow_Ground1495

Looking For Fraud Detection Dataset And SOTA Model For This Task

Hi community, I have a task to fine-tune a Llama 3.1 model on a fraud detection dataset. The ask is simple: does anyone here know the best datasets for this task? And what is the best-known SOTA model for fraud detection so far?

submitted by /u/i_wont_converge

VC Contact And Funded Startups Datasets

Paid: 60% off everything before Nov-10 shutdown.

submitted by /u/project_startups

Looking For Disease Prevalence Per Country Dataset

I’m working on an ML project related to medicine access inequality. For that, I need a dataset showing disease prevalence (or incidence rates) by country. I’ve already looked into WHO, healthdata.org, etc., but most of what I’ve found is either aggregated by disease type or missing country-level granularity. Please give some guidance on how to get this data. Thank you.

submitted by /u/nude_makise

Made My First Dataset! Ca. 100 Scanned Pages Of Books From 1910-1920, Serbian Cyrillic. Kaggle And HF

Hi everyone, first time building a dataset. This is a v0.1, about 100 scans of book pages (both single and double-page per scan). The books are in the public domain. The intended use is for anyone looking to do image-to-text software work.

The scans are in a .jpg format, with a PDF with the whole collection.

I have also included 2 .txt files:

1) A “raw” (i.e., not corrected for hallucinations, artifacts, etc.) .txt file for anyone looking to do a check. The file is in Markdown.

2) A “corrected” .txt file, where the hallucinations, artifacts, errors, etc. were manually corrected. This file is in .txt, not Markdown.
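For anyone wanting to run the "check" mentioned above, here is a minimal sketch of comparing a raw OCR string against its corrected version with Python's standard `difflib`. The sample strings and the function names `similarity` and `changed_spans` are made up for illustration; they are not taken from the dataset, and in practice you would read the two .txt files yourself.

```python
import difflib


def similarity(raw: str, corrected: str) -> float:
    """Return a 0..1 similarity ratio between raw OCR text and its corrected version."""
    return difflib.SequenceMatcher(None, raw, corrected, autojunk=False).ratio()


def changed_spans(raw: str, corrected: str) -> list[tuple[str, str, str]]:
    """List (operation, raw_fragment, corrected_fragment) for every non-matching span."""
    matcher = difflib.SequenceMatcher(None, raw, corrected, autojunk=False)
    return [
        (tag, raw[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sample: the OCR read the digit 0 as the letter O.
raw_line = "Београд 191O"
corrected_line = "Београд 1910"

print(f"similarity: {similarity(raw_line, corrected_line):.3f}")
for op, before, after in changed_spans(raw_line, corrected_line):
    print(op, repr(before), "->", repr(after))
```

Character-level comparison is slow on whole books, so it is probably best run page by page or line by line.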

Looking for feedback if this is useful, how to make a dataset like this better, etc.

Kaggle: https://www.kaggle.com/datasets/booksofjeremiah/serbian-cyrillic-script-printed

Huggingface: https://huggingface.co/datasets/Books-of-Jeremiah/raw-OCR-serbian-cyrillic

Any feedback on whether the set is useful for other use cases or how it can be made better is appreciated!

submitted by /u/Books_Of_Jeremiah

[REQUEST] Dataset Of Firefighting Radio Traffic Transcripts.

Looking for a dataset containing text from radio messages generated by firefighters at incidents. I can’t find anything, and my next step is to feed audio databases into a transcriber and create my own.

submitted by /u/TieConnect3072

[P] Training Better LLMs With 30% Less Data – Entropy-Based Data Distillation

I’ve been experimenting with data-efficient LLM training as part of a project I’m calling Oren, focused on entropy-based dataset filtering.

The philosophy behind this emerged from knowledge distillation pipelines, where student models basically inherit the same limitations of intelligence as the teacher models have. Thus, the goal of Oren is to change LLM training completely – from the current frontier approach of rapidly upscaling in compute and GPU hours to a new strategy: optimizing training datasets for smaller, smarter models.

The experimentation setup: two identical 100M-parameter language models.

Model A: trained on 700M raw tokens
Model B: trained on the top 70% of samples (500M tokens) selected via entropy-based filtering

Result: Model B matched Model A in performance, while using 30% less data, time, and compute. No architecture or hyperparameter changes.
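To make the filtering step concrete, here is a rough sketch of what "keep the top 70% of samples by entropy" could look like. It is not Oren's actual scoring: it assumes Shannon entropy over whitespace tokens within each sample, and it assumes the higher-entropy samples are the ones kept, neither of which the post specifies. `token_entropy`, `filter_top_fraction`, and the toy corpus are illustrative names and data.

```python
import math
from collections import Counter


def token_entropy(text: str) -> float:
    """Shannon entropy (bits per token) of the whitespace-token distribution in one sample."""
    tokens = text.split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_top_fraction(samples: list[str], keep: float = 0.7) -> list[str]:
    """Keep the highest-entropy `keep` fraction of samples, preserving original order."""
    n_keep = max(1, round(len(samples) * keep))
    # Rank sample indices from highest to lowest entropy (assumption: higher is better).
    ranked = sorted(range(len(samples)), key=lambda i: token_entropy(samples[i]), reverse=True)
    kept = sorted(ranked[:n_keep])
    return [samples[i] for i in kept]


# Toy corpus standing in for real training samples.
corpus = [
    "the the the the the the",
    "entropy measures how unpredictable a token stream is",
    "spam spam spam eggs",
    "a b c d e f g h",
    "buy now buy now buy now",
    "filtering keeps diverse informative text samples",
    "ok",
    "small models benefit from cleaner training data",
    "la la la la",
    "tokens tokens words words",
]
filtered = filter_top_fraction(corpus, keep=0.7)
print(len(filtered), "of", len(corpus), "samples kept")
```

A real pipeline would more likely score samples with a model's per-token loss or predictive entropy rather than raw token counts; the ranking-and-cutoff structure stays the same.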

Open-source models:

🤗 Model A – Raw (700M tokens)

🤗 Model B – Filtered (500M tokens)

Full documentation:

👾GitHub Repository

I’d love feedback, especially on how to generalize this into a reusable pipeline that can be applied directly to datasets before LLM training and/or fine-tuning. I’m currently thinking of a multi-agent system, with each agent being an SLM trained on a subdomain (e.g., coding, math, science), each with its own scoring metrics. I would also love to hear from anyone here who has tried entropy-based or loss-based filtering and possibly even scaled it.

submitted by /u/Jolly-Act9349

Dataset Scraped From Football Manager 23

I have scraped the FM23 data and got information on 90k+ players. I hope it’s helpful for you; if you like it, upvote it on Kaggle and here too.

More information is on the Kaggle website.

Thanks for reading this.

submitted by /u/Mental-Flight8195

New EV And Petrol Car Price Dataset. Visualization Beginner

Hello, for a personal learning project in data visualization I am looking for the most up-to-date database possible containing all the models of new vehicles sold in France and Europe, with car characteristics and official recommended prices. Ideally, this database would cover the last 2 to 5 years. I want to be able to plot EV price per kilometer, purchase price vs. range (autonomy), etc. Thank you in advance; it is my first Reddit post.

submitted by /u/Accomplished-Cat5112

Building A Synthetic Dataset From A 200MB Documented C#/YAML Codebase For LoRA Fine-Tuning

Hello everyone.

I’m building a synthetic dataset from our ~200MB private codebase to fine-tune a 120B parameter GPT-OSS LLM using QLoRA. The model will be used for bug fixing and for generating new code and configs.

Codebase specifics:

Primarily C# with extensive JSON/YAML configs (with common patterns)
Good documentation & comments exist throughout
Total size: ~200MB of code/config files

My plan:

Use tree-sitter to parse C# and extract methods/functions with their docstrings
Parse JSON/YAML files to identify configuration patterns
Generate synthetic prompts using existing docstrings + maybe light LLM augmentation
Format as JSONL with prompt-completion pairs
Train using QLoRA for efficiency
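The JSONL formatting step in the plan above might look roughly like this. It is only a sketch: it assumes the parsing step has already produced (name, doc comment, method body) records, and the `ExtractedMethod` class, the prompt wording, and the sample C# snippets are all hypothetical rather than taken from the codebase.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractedMethod:
    """Hypothetical record produced by an upstream parser such as tree-sitter."""
    name: str
    doc: str   # doc comment text, already stripped of comment markup
    body: str  # full C# source of the method


def to_jsonl_line(method: ExtractedMethod) -> str:
    """Turn one documented method into a prompt-completion pair serialized as one JSON line."""
    prompt = f"Write a C# method named {method.name}.\n{method.doc.strip()}"
    record = {"prompt": prompt, "completion": method.body.strip()}
    return json.dumps(record, ensure_ascii=False)


def build_dataset(methods: list[ExtractedMethod]) -> str:
    """Skip undocumented methods and join the rest into JSONL text (one record per line)."""
    lines = [to_jsonl_line(m) for m in methods if m.doc.strip()]
    return "".join(line + "\n" for line in lines)


# Made-up sample input; real records would come from the parsing step.
methods = [
    ExtractedMethod(
        name="Clamp",
        doc="Restricts a value to the inclusive range [min, max].",
        body="public static int Clamp(int v, int min, int max)\n{\n    return Math.Max(min, Math.Min(max, v));\n}",
    ),
    ExtractedMethod(name="Undocumented", doc="", body="void Undocumented() { }"),
]

jsonl_text = build_dataset(methods)
print(jsonl_text, end="")
```

Because `json.dumps` escapes newlines inside the strings, each record stays on a single line, which is what most fine-tuning loaders expect from JSONL.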

Specific questions:

Parsing with existing docs: Since I have good comments/docstrings, should I primarily use those as prompts rather than generating synthetic ones? Or combine both?
Bug-fixing specific data: How would you structure training examples for bug fixing? Should I create “broken code -> fixed code” pairs, or “bug report -> fix” pairs?
Configuration generation: For JSON/YAML, what’s the best way to create training examples? Show partial configs and train to complete them?
Scale considerations: For a 200MB codebase targeting a 120B model with LoRA – what’s a realistic expected dataset size? Thousands or tens of thousands of examples?
Tooling recommendations: Are there any code-specific dataset tools that work particularly well with documented codebases?

Any experience with similar code-to-dataset pipelines would be incredibly valuable, especially from those who’ve worked with C# codebases or configuration generation.

submitted by /u/gagarinten

Dataset Search Help Required Urgently!!!

Hi guys, I want help finding diseased plant images with their metadata, specifically geolocation and timestamps, for a research-based project. Please help me out.

submitted by /u/Plane_Race_840

Fine Tuning Scene Classification

I am building a scene classification AI, and I was wondering where I could find a dataset that contains a bunch of different images from a certain room. For example, I would want a lot of images of different kitchens.

submitted by /u/Such_Photograph_5757

Appreciation And Continued Contribution Of Tech Datasets

👋 Hey everyone!

The response to my first datasets has been insane – thank you! 🚀

Your support made these go viral, and they’re still trending on the Hugging Face datasets homepage:

🏆 Proven Performers:
GitHub Code 2025 (12k+ downloads, 83+ likes) – Top 10 on HF Datasets
ArXiv Papers (8k+ downloads, 51+ likes) – Top 20 on HF Datasets

Now I’m expanding from scientific papers and code into hardware, maker culture, and engineering wisdom with three new domain-specific datasets:

🔥 New Datasets Dropped

Phoronix Articles
What is Phoronix? The definitive source for Linux, open-source, and hardware performance journalism since 2004. For more info visit: https://www.phoronix.com/
Dataset contains: articles with full text, metadata, and comment counts
Want a Linux & hardware news AI? Train models on 50K+ articles tracking 20 years of tech evolution

🔗 Link: https://huggingface.co/datasets/nick007x/phoronix-articles

Hackaday Posts
What is Hackaday? The epicenter of maker culture – DIY projects, hardware hacks, and engineering creativity. For more info visit: https://hackaday.com/
Dataset contains: articles with nested comment threads and engagement metrics
Want a maker community AI? Build assistants that understand electronics projects, 3D printing, and hardware innovation

🔗 Link: https://huggingface.co/datasets/nick007x/hackaday-posts

EEVblog Posts
What is EEVblog? The largest electronics engineering forum – a popular online platform and YouTube channel for electronics enthusiasts, hobbyists, and engineers. For more info visit: https://www.eevblog.com/forum/
Dataset contains: forum posts with author expertise levels and technical discussions
Want an electronics expert? Train AI mentors that explain circuits, troubleshoot designs, and guide hardware projects

🔗 Link: https://huggingface.co/datasets/nick007x/eevblog-posts

submitted by /u/its_just_me_007x

I’m Looking For A Dataset Of Meme Gifs.

I’m working on an app and I’d like to be able to search for GIFs locally. I understand there are many services for this already, but I’m looking for a dataset I can host myself.

It would be good if the dataset was also labeled in a way that could make it searchable; if not, I’ll try to figure that part out.

submitted by /u/Accurate-Screen8774

Master’s Project Ideas To Build Quantitative/data Skills?

Hey everyone,

I’m a master’s student in sociology starting my research project. My main goal is to get better at quantitative analysis, stats, working with real datasets, and Python.

I was initially interested in Central Asian migration to France, but I’m realizing it’s hard to find big or open data on that. So I’m open to other sociological topics that will let me really practice data analysis.

I would greatly appreciate suggestions for topics, datasets, or directions that would help me build those skills.

Thanks!

submitted by /u/NebooCHADnezzar

I Built A Small AI That Reads Spreadsheets And Tells You The Story Inside — Want To Help Test It?

Hey everyone,
I’m testing a small experiment under Aptorie Labs, an AI that looks at your CSV or Excel files and writes a short, plain-English story about what’s really happening in the data.

It’s called Data-to-Narrative, and it’s built around a simple idea:
Instead of dashboards full of numbers, you get a short paragraph that sounds like a human analyst, no jargon, no buzzwords, just what matters.

I’m looking for a few early testers to try it out this week. You upload a dataset (sales, support tickets, survey results, etc.), and I’ll send back a written summary you can actually read and share with your team.

If you’re interested, DM me and I’ll send you the invite link to the beta upload form.
It’s part of a closed test, so I’m keeping the first batch small to make sure the summaries feel right.

Thanks in advance to anyone who wants to kick the tires. I’ll post a few anonymized examples once we’ve run the first round of tests.

Len

submitted by /u/lenbuilds

I Want To Use The Pushshift Dataset For My Academic Project

I am currently doing a university project in which I want to fine-tune an LLM, and I want to use data from Reddit. I’m not a Reddit mod, so I can’t access https://pushshift.io
Does anyone know where I could find the database?

submitted by /u/Wild-Direction484

Announcement: Definitely Less Complex Data Analysis Solution, EasyAIBridge

Gap-Filling Intelligence, Smart Ask, Instant Reports, and support for multiple sources. Powered by Fusion Intelligence. Delivers faster and more detail-oriented AI-based data analysis, visualization, reporting, scheduling, and exporting. Launching on Product Hunt today: https://www.producthunt.com/products/easy-ai-bridge

submitted by /u/Infamous-Win834

Looking For Guidance On Open-sourcing A Hierarchical Recommendation Dataset (user–chapter–series Interactions)

submitted by /u/Just_Plantain142

Is AI Going To Replace Data Analyst Jobs Soon?

submitted by /u/Infamous_Chapter9623

Is There Any Subreddit/place On The Internet That Works As A Datasets Repository? Like Not Well Known But Credible Ones?

Or is this subreddit the right place for that?

submitted by /u/Wrong_Talk781

“All I Want For Christmas Is You” By Mariah Carey Streams For Spotify And AppleMusic Daily Since Their Start?

Hi y’all, it would be super cool to have a dataset of daily streams of “All I Want For Christmas Is You” by Mariah Carey for Spotify and AppleMusic since these each started recording that data (prob 2013?). Would anyone be able to provide something like that? Would be much appreciated.

submitted by /u/GeoMicroSoares
