Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Dataset Explorer – Tool To Search Any Public Datasets (Free Forever)

Dataset Explorer is now LIVE, and will stay free forever.

Finding the right dataset shouldn’t be this painful.

There are millions of quality datasets on Kaggle, data.gov, and elsewhere – but actually locating the one you need is still like hunting for a needle in a haystack.

From seasonality trends, weather data, holiday calendars, and currency rates to political datasets, tech layoffs, and geo info – the right dataset is out there.

That’s why we created Dataset Explorer. Just describe what you want to analyze, and it uses Perplexity, scraping (Firecrawl), and other sources to surface relevant datasets.

Quick example: I analyzed tech layoffs from 2020–2025 and found:

  • 📊 2023 was the worst year: 264K layoffs
  • 🏢 Post-IPO companies made 58% of the cuts
  • 💻 Hardware firms were hit hardest, with Intel topping the list
  • 📅 Jan 2023 was the worst month ever: 89K people lost jobs in 30 days

Once you find your dataset, you can run a full analysis for free on Hunch, an AI data analytics platform.

Dataset Explorer – https://hunch.dev/data-explorer
Demo – https://screen.studio/share/bLnYXAvZ

Give it a try and let us know what you think.

submitted by /u/matkley12

[self-promotion] WildChat-4.8M: 4.8M Real User–Chatbot Conversations (Public + Gated Versions)

We are releasing WildChat-4.8M, a dataset of 4.8 million real user-chatbot conversations collected from our public chatbots.

  • Total collected: 4,804,190 conversations from Apr 9, 2023 to Jul 31, 2025.
  • After removing conversations flagged with “sexual/minors” by OpenAI Moderations, 4,743,336 conversations remain.
  • From this, the non-toxic public release contains 3,199,860 conversations (all toxic conversations removed from this version).
  • The remaining 1,543,476 toxic conversations are available in a gated full version for approved research use cases.
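
As a quick sanity check on the figures above, here is a minimal sketch that confirms the public and gated releases add up to the filtered total. The variable names are illustrative and not part of the dataset; only the four counts come from the post.

```python
# Counts quoted in the post above; variable names are illustrative only.
total_collected = 4_804_190
after_minors_filter = 4_743_336
public_non_toxic = 3_199_860
gated_toxic = 1_543_476

# Conversations dropped by the "sexual/minors" moderation flag.
removed_by_minors_filter = total_collected - after_minors_filter

# Fraction of the remaining conversations that ended up in the gated release.
toxic_share = gated_toxic / after_minors_filter

# The two releases should partition the filtered set exactly.
assert public_non_toxic + gated_toxic == after_minors_filter

print(f"Removed by the sexual/minors filter: {removed_by_minors_filter:,}")
print(f"Toxic share of remaining conversations: {toxic_share:.1%}")
```

The counts are consistent: 60,854 conversations were removed by the first filter, and the two releases together account for every remaining conversation.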

Why we built this dataset:

  • Real user prompts are rare in open datasets. Large LLM companies have them, but they are rarely shared with the open-source communities.
  • Includes 122K conversations from reasoning models (o1-preview, o1-mini), which are real-world reasoning use cases (instead of synthetic ones) that often involve complex problem solving and are very costly to collect.

Access:

Original Source:

submitted by /u/yuntiandeng

Fundamentals Of Deep Learning: Building Practical Deep Learning Projects

Deep learning is revolutionizing industries by enabling computers to learn from complex data with remarkable accuracy. From training your first CNN to leveraging pre-trained LLMs, the fundamentals covered in this article provide a solid foundation for building AI solutions. By mastering tools like PyTorch, techniques like transfer learning, and applications in computer vision and NLP, you’re well-equipped to tackle real-world challenges. Whether creating a personalized doggy door or classifying fruit, deep learning opens a world of possibilities. Start experimenting, set up your AI environment, and join the global community driving innovation through deep learning.

https://open.substack.com/pub/ahmedgamalmohamed/p/fundamentals-of-deep-learning?r=58fr2v&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

submitted by /u/ahmed4929

911 Calls Analysis For A Research Project

Hello, I have a research project about 911 calls. I need a dataset of 911 call audio so that we can listen to the calls, analyze them, and answer our research questions.

If you know of an AI model that can listen to calls and analyze them, please share it with me.

Also, if there are publications about the analysis of 911 call audio, please share them with me.

submitted by /u/AhmedUSMLE

Looking For Some Kind Of Data Correlated With BT Corn Adoption

I have a resource showing BT, HT, and hybrid GMO corn adoption in the years since 2000 and I want data that correlates with it somehow.

Examples:

-European Corn Borer Populations (By State)

-European Corn Borer Diversity/Species Richness (By State)

-European Corn Borer Larvae In Non-BT Corn (By State)

-European Corn Borer Larvae In (Crop other than BT Corn) By State

-Non-BT Corn Deaths Due to Insects

-(Crop other than BT corn) Deaths due to Insects

If anyone knows how to get data related to anything above, it would be a lot of help. It can be a species other than European Corn Borers and a crop other than corn. It can also be about weeds instead of insects.

submitted by /u/Empty-Wing7678

Built An IDE For Web Scraping — Introducing Crawbots

We’ve been working on a desktop app called Crawbots — an all-in-one IDE for web data extraction. It’s designed to simplify the scraping process, especially for developers working with Puppeteer, Playwright, or Selenium.

We’re aiming to make Crawbots powerful yet beginner-friendly, so junior devs can jump in without fighting boilerplate or complex setups.

Would appreciate any thoughts, questions, or brutal feedback.

submitted by /u/varvolta

Looking For Support Dataset With Issue Title, Root Cause, And Clarifying Questions

I’m building a student project: an AI-powered assistant that helps support agents resolve product issues faster.

For this, I’m looking for any dataset (even a small one) with structured entries that include:

  • Issue Title
  • Root Cause (or suspected cause)
  • Clarifying Questions (asked to narrow down the issue)
  • (Optional) Symptoms or issue description
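
To make the requested structure concrete, here is a rough sketch of how one such entry could be represented and loaded from JSON Lines. The class name, field names, file format, and the sample record are all hypothetical placeholders, not taken from any existing dataset.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SupportIssue:
    """One structured support entry with the fields listed above (names are assumed)."""
    issue_title: str
    root_cause: str
    clarifying_questions: List[str] = field(default_factory=list)
    symptoms: Optional[str] = None  # optional free-text description


def load_issues(jsonl_text: str) -> List[SupportIssue]:
    """Parse one JSON object per non-empty line into SupportIssue records."""
    return [
        SupportIssue(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


# Made-up record, included only to show the shape of an entry.
sample = json.dumps({
    "issue_title": "App crashes on login",
    "root_cause": "Expired auth token",
    "clarifying_questions": [
        "Which app version are you on?",
        "Does it happen on both Wi-Fi and mobile data?",
    ],
})

issues = load_issues(sample)
print(issues[0].issue_title, len(issues[0].clarifying_questions))
```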

I’ve explored Bitext and open support corpora but couldn’t find datasets with structured clarifying questions or diagnostic trails.

If anyone has access to such a dataset (even a partial or synthetic one, or an export from an internal knowledge base), I’d deeply appreciate your help.
Thanks in advance!

submitted by /u/AlbertEinsteinTG

10 SQL Masterstrokes The Data Elite Guard Jealously

Elite data engineers don’t just write SQL; they wield it like a secret weapon. While most of us struggle with basic SELECT statements or grapple with sluggish joins, these pros quietly deploy advanced tactics that save hours, avoid disasters, and impress stakeholders. I’ve spent years in the trenches, reverse-engineering their tricks, and now I’m spilling the beans. Here are 10 advanced SQL strategies the top 1% use but rarely talk about, until today.

To continue reading, please open the article (it is free, not paid):
https://open.substack.com/pub/ahmedgamalmohamed/p/10-sql-masterstrokes-the-data-elite?r=58fr2v&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

submitted by /u/ahmed4929

Dataset On HT Corn And Weed Species Diversity

For a paper, I am trying to answer the following research question:

“To what extent does the adoption of HT corn (Zea mays) (% of planted acres in region, 0-100%) impact the diversity of weed species (measured via the Shannon index) in [region] corn fields?”

Does anyone know of any good datasets on this, or on something similar enough that the RQ could easily be altered to fit it (for example, by using a measure other than the Shannon index)?
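
For reference, the Shannon index mentioned in the RQ is H' = -sum(p_i * ln(p_i)), where p_i is the proportion of individuals belonging to species i. A minimal sketch of computing it from per-species counts follows; the function name and the example counts are placeholders, not real survey data.

```python
import math
from typing import Iterable


def shannon_index(counts: Iterable[float]) -> float:
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over species with non-zero counts."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    if total == 0:
        return 0.0  # no individuals observed, so no diversity to measure
    return -sum((c / total) * math.log(c / total) for c in present)


# Placeholder abundances for four weed species in one field (not real data).
example_counts = [10, 10, 10, 10]
print(round(shannon_index(example_counts), 4))
```

With equal abundances the index reaches its maximum, ln(S) for S species, and it drops toward 0 as a single species dominates. That is why it could serve as the response variable against the HT adoption percentage.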

submitted by /u/Empty-Wing7678

[self-promotion] Spanish Hotel Reviews Dataset (2019–2024) — Sentiment-labeled, 1,500 Reviews In Spanish

Hi everyone,

I’ve compiled a dataset of 1,500 real hotel reviews from Spain, covering the years 2019 to 2024. Each review includes:

  • ⭐ Star rating (1–5)
  • 😃 Sentiment label (positive/negative)
  • 📍 City
  • 🗓️ Date
  • 📝 Full review text (in Spanish)

🧪 This dataset may be useful for:

  • Sentiment analysis in Spanish
  • Training or benchmarking NLP models
  • AI apps in tourism/hospitality

Sample on Hugging Face (original source):
https://huggingface.co/datasets/Karpacious/hotel-reviews-es

Feedback, questions, or suggestions are welcome! Thanks!

submitted by /u/negrobayor

[self-promotion] Map The Global Electrical Grid With This 100% Open Source Toolchain

We built a 100% open-source toolchain to map the global electrical grid using:

  1. OpenStreetMap as a database
  2. JOSM as an OpenStreetMap editor
  3. Osmose for validation
  4. Material for MkDocs for the website
  5. Leaflet for the interactive map

You will find details of all the smaller tools and repositories that we have integrated on the README page of the website repository: https://github.com/open-energy-transition/MapYourGrid

Read more about how you can support mapping the electrical grid at https://mapyourgrid.org/

submitted by /u/augspurger

Trying To Find Pancreatic Cancer Datasets With HBV/HCV Status Running Into A Wall, I NEED HELP.

Hey everyone,
This is my first time ever on Reddit. I’m in a mini-crisis.
I’m a second-year medical student working on a research project focused on how chronic Hepatitis B and C infections (HBV and HCV) might influence both the risk and prognosis of pancreatic cancer. I’m especially interested in looking at this from a transcriptomic standpoint, ideally through differential gene expression and immune pathway analysis in HBV/HCV-positive vs negative patients.

The problem I’m facing is that I can’t find any pancreatic cancer RNA-seq datasets that include HBV or HCV status in the metadata. I’ve scoured GEO, ArrayExpress, dbGaP, and a couple of other repositories. Some of the most cited pancreatic cancer datasets (like GSE15471, GSE28735, and GSE71729) don’t seem to include viral infection status.

One dataset that does stand out is GSE183795, which comes from a paper that looked into the HNF1B/Clusterin axis in a highly aggressive subset of pancreatic cancer patients. The corresponding author is Dr. Parwez Hussain (NCI/NIH), and I’ve emailed him to ask if the HBV/HCV status for that cohort is available.

That said, I wanted to post here in case anyone has:

  • Come across any pancreatic cancer RNA-seq dataset with viral status (even private or controlled-access would help).
  • Worked on a similar question and found a workaround (like inferred infection status, use of liver cancer datasets as a proxy, etc.).
  • Tips on filtering patients from large multi-cancer cohorts (e.g. TCGA) based on co-morbidities or ICD codes, if possible.
  • MOST IMPORTANTLY: ideas for a different workflow for my hypothesis, since the data I need isn’t available.

Basically, anything that might help me move forward. If not pancreatic cancer, I’m open to suggestions on related cancers or models where HBV/HCV co-infection is better documented but still biologically relevant. I have a tight deadline.

submitted by /u/RingEnvironmental580

Looking For A Reference To Access MIMIC-IV On PhysioNet (independent Researcher)

Hi everyone,
I’m an independent researcher working on a non-commercial machine learning + NLP project related to patient safety. My main goal is to explore how clinical notes can be used to identify or even predict preventable medical errors, such as medication issues, documentation mistakes, or delayed interventions.

I’m especially interested in the early detection of patterns that could support error prevention, not just classification.

I’ve already completed the required CITI “Data or Specimens Only Research” ethics training.

Now I just need a reference to confirm my identity and seriousness. This only involves responding to a short email from PhysioNet.

If you’re a researcher or professional familiar with MIMIC or medical NLP and would be willing to vouch for me, I’d be incredibly grateful. I’d be happy to explain my approach in more detail via DM.

Thanks in advance 🙂

submitted by /u/cansu_28

Looking For Mental Health Datasets For AI Project On Predicting Mental Health Disorders

Hi all,

I’m currently working on an AI project aimed at predicting mental health disorders, and I’m in need of a reliable dataset to help train and test my model. Ideally, I’m looking for datasets that include information on various mental health conditions (e.g., depression, anxiety, schizophrenia, etc.), symptoms, demographics, or treatment history.

If anyone knows of any publicly available mental health datasets or resources that might be helpful for my project, I would greatly appreciate your recommendations or links.

Thank you!

submitted by /u/Either_Sentence_5280

Golf Course Datasets – Tees, Location, Rating, Etc.

Hey there, I’ve been looking for a golf course dataset for a personal project of mine. I’m trying to build something similar to the other golf scorekeeping apps that are out there, but I’m having a hard time finding a good dataset to use. I’ve made my own for a couple of my local courses, but it’s extremely time-consuming, and not all the courses around me have their scorecards posted. Some of the free ones I’ve found have been good but are missing data for Canadian courses, which is what I’m more focused on. Others have been absurdly priced for a personal project, so I’m wondering if anyone knows where I could find something. Any help would be appreciated!

submitted by /u/AdCreative205

Released Bhagavad Gita Dataset – 500+ Downloads In 30 Days! Fine-tune, Analyze, Build 🙌

Hey everyone,

I recently released a dataset on Hugging Face containing the Bhagavad Gita (translated by Edwin Arnold) aligned verse-by-verse with Sanskrit and English. In the last 20–30 days, it has received 500+ downloads, and I’d love to see more people experiment with it!

👉 Dataset: Bhagavad-Gita-Vyasa-Edwin-Arnold

Whether you want to fine-tune language models, explore translation patterns, build search tools, or create something entirely new—please feel free to use it and add value to it. Contributions, feedback, or forks are all welcome 🙏

Let me know what you think or if you create something cool with it!

submitted by /u/Competitive-Fact-313

I’m Searching For A Dataset Similar To This One But I Can’t Find Anything: Multiphase Manufacturing Machine With Cycle Time For Every Phase

Hi everyone, I’m currently working with a dataset to analyse the cycle time of an industrial machine for a project, but the data I have is too small.

I need to find a dataset with a similar structure, especially with the following columns:

Lot/ID   | Product ID | Good | Scraps | Cycle time OP 1 [s] | Cycle time OP 2 [s] | Cycle time OP 13 [s]
CA424920 | VBSBN      | 50   | 4      | 3.2                 | 2.7                 | 5.4
CA243253 | BMDSD      | 64   | 2      | 3.0                 | 0                   | 5.0
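
To illustrate the kind of analysis this structure supports, here is a rough sketch that parses the two sample rows above and derives a total cycle time and a scrap rate per lot. The class and function names are hypothetical. It also assumes that the phases run one after another (so the per-phase times can be summed) and that scrap rate means scraps / (good + scraps).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LotRecord:
    lot_id: str
    product_id: str
    good: int
    scraps: int
    cycle_times_s: List[float]  # one entry per operation, in seconds

    @property
    def total_cycle_time_s(self) -> float:
        # Assumes the operations are sequential, so their times add up.
        return sum(self.cycle_times_s)

    @property
    def scrap_rate(self) -> float:
        # Assumed definition: scrapped pieces over all pieces produced.
        produced = self.good + self.scraps
        return self.scraps / produced if produced else 0.0


def parse_row(line: str) -> LotRecord:
    """Parse a whitespace-separated row: lot, product, good, scraps, then phase times."""
    lot_id, product_id, good, scraps, *times = line.split()
    return LotRecord(lot_id, product_id, int(good), int(scraps), [float(t) for t in times])


# The two sample rows from the post above.
rows = [
    "CA424920 VBSBN 50 4 3.2 2.7 5.4",
    "CA243253 BMDSD 64 2 3.0 0 5.0",
]
records = [parse_row(r) for r in rows]
for rec in records:
    print(rec.lot_id, round(rec.total_cycle_time_s, 2), round(rec.scrap_rate, 3))
```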

Does anyone know where or how to find a similar dataset? I’ve searched through paper reviews and online repositories, but haven’t found anything. Thanks in advance!

submitted by /u/Reffa_