I’m looking for a database of dental radiographs to evaluate whether an algorithm can make a diagnosis from them. Does anyone have suggestions for where to look?
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
submitted by /u/Cute-Inflation-7944
Hey everyone,
I am a contractor who uses Statista data from time to time to get figures on specific sectors, which makes my consulting work a bit easier.
Well, I decided to have a baby! Yay!
This meant I was going to be out of work for a while. I also noticed that Statista had started charging me simply to see data sets they used to offer for free, so I decided it was time to cancel, with the intention of re-subscribing when my baby goes into daycare.
So I went to the portal to see where I could cancel and you can’t. But you do have a customer service rep.
Ok great so I email them.
Nothing.
I wait a week and email again. Nothing.
So then I started emailing everyone I could – nothing.
Then, after weeks of hearing nothing back, I paid to call long distance to Europe and ended up reaching a human in Germany, who said they could not help and that I had to contact the rep.
So by this point I had been charged two months’ worth of fees just going through the cancellation process.
I then let the guy know that this was against consumer protection laws in my country (which it is). At that point, after a lot of hassle, he emailed someone internally to allow me to cancel.
They credited ONE of the two months I was charged even though I had been requesting cancellation. Either way, I was just glad my credit card was no longer held hostage.
I just wanted to share this information in case anyone else has been using Statista. This is a really sketchy practice and I wanted to call it out.
submitted by /u/Stunning-Radio-9104
As the title suggests, I’m looking for funny datasets, like one containing only puns.
I’m also interested in character-trait-specific humor, such as a dataset filled with funny and outrageous conspiracy theories or self-deprecating, dark humor.
Any humorous datasets that could turn an LLM into a joke machine are welcome!
submitted by /u/Omega0736
Hi everyone,
I’m looking for DISCO-10M: A Large-Scale Music Dataset. It was previously available through Hugging Face, but it is not there anymore. Can anyone share a copy?
submitted by /u/jellek03
Hi everyone! I wanted to share a benchmark we developed for testing our LLM-based symptom checker app. We built this because existing static benchmarks (like MedQA, PubMedQA) didn’t fully capture the real-world utility of our app. With no suitable benchmark available, we created our own and are open-sourcing it in the spirit of transparency.
GitHub: https://github.com/medaks/symptomcheck-bench
Quick Summary:
We call it SymptomCheck Bench because it tests the core functionality of symptom checker apps—extracting symptoms through text-based conversations and generating possible diagnoses. It’s designed to evaluate how well an LLM-based agent can perform this task in a simulated setting.
The benchmark has three main components:
– Patient Simulator: Responds to agent questions based on clinical vignettes.
– Symptom Checker Agent: Gathers information (limited to 12 questions) to form a diagnosis.
– Evaluator Agent: Compares symptom checker diagnoses against the ground truth diagnosis.
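The three components above can be sketched as a simple loop. This is only an illustration of the flow described in the post, not the code from the SymptomCheck Bench repository: `Vignette`, `run_case`, and the `toy_*` stand-ins are hypothetical names, and a real setup would call an LLM where the toy functions return fixed strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_QUESTIONS = 12  # question limit stated in the post

History = List[Tuple[str, str]]  # (question, reply) pairs


@dataclass
class Vignette:
    description: str   # clinical vignette, visible only to the patient simulator
    ground_truth: str  # reference diagnosis used by the evaluator


def run_case(
    vignette: Vignette,
    ask: Callable[[History], str],
    answer: Callable[[Vignette, str], str],
    diagnose: Callable[[History], List[str]],
    judge: Callable[[List[str], str], bool],
) -> bool:
    """Run one simulated consultation and return the evaluator's verdict."""
    history: History = []
    for _ in range(MAX_QUESTIONS):
        question = ask(history)             # symptom checker agent asks
        reply = answer(vignette, question)  # patient simulator responds
        history.append((question, reply))
    # Evaluator agent compares the proposed diagnoses with the ground truth.
    return judge(diagnose(history), vignette.ground_truth)


# Hypothetical toy stand-ins so the sketch runs without any LLM.
def toy_ask(history: History) -> str:
    return f"Question {len(history) + 1}: any other symptoms?"


def toy_answer(vignette: Vignette, question: str) -> str:
    return vignette.description


def toy_diagnose(history: History) -> List[str]:
    return ["migraine"] if "headache" in history[0][1] else ["unknown"]


def toy_judge(candidates: List[str], truth: str) -> bool:
    return truth in candidates


case = Vignette("Throbbing headache with nausea", "migraine")
result = run_case(case, toy_ask, toy_answer, toy_diagnose, toy_judge)
```

Passing the agents in as plain callables keeps the loop independent of any particular model provider, which is one way to support several LLMs behind the same benchmark.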
Key Features:
– 400 clinical vignettes from a study comparing commercial symptom checkers
– Multiple LLM support (GPT series, Mistral, Claude, DeepSeek)
– Auto-evaluation system validated against human medical experts
We know it’s not perfect, but we believe it’s a step in the right direction for more realistic medical AI evaluation. Would love to hear your thoughts and suggestions for improvement!
submitted by /u/Significant-Pair-275
Hello everyone,
I’m working on an image segmentation project aimed at aiding rescue missions by detecting human bodies in underwater crash site images. Specifically, the goal is to identify and segment human figures from underwater images, which could be instrumental in emergency response and recovery operations.
I’m reaching out to see if anyone has, or knows of, a dataset that includes underwater human imagery, especially from crash sites or similar scenarios. Ideally, the dataset would contain varied conditions like different lighting, depths, and visibility to better simulate real-world underwater environments.
If such a dataset isn’t readily available, any resources, advice on data collection, or possible collaboration opportunities to create one would be greatly appreciated! I’m open to any suggestions, as I understand this is a unique and challenging request.
Thank you in advance for any help you can provide!
submitted by /u/GDSAI4903
This is for a university project. Thus far I’ve tried Guidestar, the American Hospital Directory, CMS, and more, to no avail. I am really struggling to obtain any data but am passionate about this topic (and unfamiliar with datasets lol). Looking for financials and/or patient outcomes. I would really appreciate anything!
submitted by /u/Able_Delivery9912
Hi everyone,
I’m working on a project that requires datasets related to two areas:
1. Soil characteristics: I need data on soil and whether the soil is suitable for farming or not.
2. Water consumption: Datasets that track water usage, ideally in agriculture, industrial settings, or residential homes. Information on seasonal or regional usage trends would be especially helpful.
If anyone knows where I could find reliable datasets for these, or if you’ve come across anything similar in your own work, I’d really appreciate your guidance. Thanks in advance for any recommendations or resources!
submitted by /u/Fridge-Fridge
Detect PII and PHI with Gretel’s latest synthetic dataset and fine-tuned NER models 🚀:
– 50k train / 5k validation / 5k test examples
– 40 PII/PHI types
– Diverse real world industry contexts
– Apache 2.0
Dataset: https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1
Fine-tuned GliNER PII/PHI models: https://huggingface.co/gretelai/gretel-gliner-bi-large-v1.0
Blog / docs: https://gretel.ai/blog/gliner-models-for-pii-detection
submitted by /u/meowterspace42
Hey r/Datasets! We’re excited to announce K2Q, a newly curated dataset collection for anyone working with visually rich documents and large language models (LLMs) in document understanding. If you want to push the boundaries on how models handle complex, natural prompt-response queries, K2Q could be the dataset you’ve been looking for! The paper can be found here and has been accepted to the Empirical Methods in Natural Language Processing (EMNLP) Conference.
What’s K2Q All About?
As LLMs continue to expand into document understanding, the need for prompt-based datasets is growing fast. Most existing datasets rely on basic templates like “What is the value for {key}?”, which don’t fully reflect the varied, nuanced questions encountered in real-world use. K2Q steps in to fill this gap by:
– Converting five Key Information Extraction (KIE) datasets into a diverse, prompt-response format with multi-entity, extractive, and boolean questions.
– Using bespoke templates that better capture the types of prompts LLMs face in real applications.
Why Use K2Q?
Our empirical studies on generative models show that K2Q’s diversity significantly boosts model robustness and performance compared to simpler, template-based datasets.
Who Can Benefit from K2Q?
Researchers and practitioners can use K2Q to:
– Test zero-shot or fine-tuned models with realistic, challenging questions.
– Improve model performance on KIE tasks through diverse prompt-response training.
– Contribute to future studies on data quality for generative model training.
📄 Dataset & Paper: K2Q will be presented at the Findings of EMNLP, so feel free to dive into our paper for in-depth analyses and results! We’d love to see K2Q inspire your own projects and findings in Document AI.
submitted by /u/blisferatu
Hi all,
As the title describes, I am looking for a time series sales dataset covering at least 3 years, with a minimum of 10 different products. The data should be monthly, weekly, or daily.
Can someone recommend one? I am really struggling to find one on Kaggle.
Hope you guys can help me out!!
submitted by /u/embraceitt
Hi,
Has anyone used the Mushroom Observer dataset for image classification? Unless I’m getting something badly wrong, they all reference image IDs but do not supply the images.
I think the images can be gathered through the API using the image ID, but they do not want you to scrape them this way.
Does anyone have any experience working with it? It’s for an image classification application.
submitted by /u/Gostinker
Data on ads published in Vanity Fair magazine from 1913 to November 2024.
Data Format:
{
  [year]: {
    year: “1913”,
    issues: [{
      id: “issue’s month”,
      ads: [
        articleKey: “articleKey”,
        issueKye: “issueKey”,
        title: “Ad title”,
        slug: “ad-slug”,
        coverDate: “coverDate”,
        pageRange: “page number on which ad was published”,
        wordCount: “word count”
      ]
    }]
  }
}
Link: Google Drive
NOTE: VF was shut down in 1936 and relaunched in 1983, so data for the years in between isn’t available.
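A minimal sketch of walking the structure above in Python, for example to count ads per year. It assumes the file is JSON and that each entry in `ads` is an object with the listed fields; that is a guess from the format description, not something confirmed by the post. The `sample` values are invented purely for illustration, and `ads_per_year` is a hypothetical helper name.

```python
from collections import Counter
from typing import Any, Dict


def ads_per_year(data: Dict[str, Any]) -> Dict[str, int]:
    """Count ads per year in a {year: {"issues": [{"ads": [...]}]}} mapping."""
    counts: Counter = Counter()
    for year, entry in data.items():
        for issue in entry.get("issues", []):
            counts[year] += len(issue.get("ads", []))
    return dict(counts)


# Invented sample in the described shape (not real data from the dataset).
sample = {
    "1913": {
        "year": "1913",
        "issues": [
            {"id": "september", "ads": [{"title": "Ad A"}, {"title": "Ad B"}]},
            {"id": "october", "ads": [{"title": "Ad C"}]},
        ],
    },
    "1983": {"year": "1983", "issues": [{"id": "march", "ads": []}]},
}

counts = ads_per_year(sample)
```

With the real file, `sample` would be replaced by the result of `json.load` on the downloaded data. Years between 1936 and 1983 would simply be absent from the result, so downstream code should not assume a continuous range of keys.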
submitted by /u/waqarHocain
Hello everyone!
I’m currently working on a research project aimed at improving early-stage detection of ovarian cancer using deep learning applied to ultrasound images. Right now, I’m in the dataset collection phase and have encountered some challenges in finding accessible datasets.
I’ve come across the PLCO and MMOTU datasets:
– PLCO requires a project proposal to gain access, which I’m considering but may take some time.
– MMOTU offers segmentation data but doesn’t include the full range of diagnostic images needed for my work.
After reviewing literature, I’ve noticed that many researchers use clinical study datasets that are private, hospital-specific patient data, or other datasets that aren’t publicly available.
If anyone here has worked on similar projects or faced these challenges, I’d be very grateful for any pointers! Specifically, I’m looking for:
– Publicly accessible ultrasound datasets focused on ovarian or gynecological cancers
– Datasets that may be available through author requests or by contacting relevant organizations
Thanks in advance for any guidance or resources you can share!
submitted by /u/Swimming-Car-6055
Doesn’t have to be up to date necessarily, but I’d prefer it, obviously.
Preferably formatted like this
Blinding Lights | 21 | 45 | 13
Heat Waves | 89 | 56 | 34
submitted by /u/Vault_8166
Hi everyone,
I’m currently working on a project that involves detecting changepoints in time series data, and I’m looking for benchmark datasets that are commonly used for evaluating changepoint detection algorithms.
Thanks in advance!
submitted by /u/garikdza
Hello all! I hope you are well. I just found out about this dataset and would love to use it for a medical research project. Unfortunately in Pakistan, my institution does not subscribe to it and there’s no way I could ask them. Hence, reaching out to everyone here. Would really appreciate any and all help!
submitted by /u/Ok_Weird_833
I wanted to do a quick analysis of a subreddit. Can someone show me how to use this, please? https://github.com/pushshift/api
submitted by /u/Anxiousbutter_
I was using the GDC cancer portal, but it doesn’t have annotations. Is there any resource for them? Please help me out.
submitted by /u/Careful-Economy-3571
Hi everyone!
I’m part of a team working on a capstone project focused on crime scene reconstruction and analysis using machine learning and 3D simulations (Blender/Unity).
What We’re Doing:
– 3D Crime Scene Reconstruction: Creating an interactive model that lets investigators explore and “rewind” scenes to see potential sequences of events (e.g., weapon use, bullet trajectories).
– Simulated Evidence Analysis: Replaying crime scenes based on data to visualize how evidence like blood spatter patterns or object placements might have occurred.
We’re specifically looking for datasets that contain information related to crime scenes, including data on:
– Crime types (especially homicide)
– Evidence details (e.g., weapon type, trajectory info, blood spatter)
If anyone has worked on a similar project before or knows where we can find reliable and detailed crime scene datasets, we’d greatly appreciate any guidance! We’re especially curious if there’s any open-source or academic dataset available, or if there are any other resources that might be useful for this type of project.
Any other help with any aspect of this project is also needed and would be appreciated.
Thanks in advance for any help, suggestions, or shared experiences!
submitted by /u/AdSquare9152
Greetings! I am currently conducting research on the US. To start the analysis, I need BEA data going back to the 1990s (specifically 1997, when NAICS was introduced). I am pretty new to the BEA website, so I may be lost. The data I need is county-level. When I go to the archive for GDP by county and metro area, the only data available goes back to 2017. Maybe I am doing something wrong? Where can I find older data for counties and metro areas? I may need other county-level data from other categories on the website. Maybe there is a website like NHGIS but for BEA data?
submitted by /u/tasyaaaaa
Hi!
I struggled a lot to find inflation data for France from an official source. I found either articles from INSEE (National Institute for Statistics and Economic Studies) on each month’s inflation, which linked to the underlying data (and even that was only a subset of the data for that month), or auxiliary websites that didn’t cite the source of their data.
I also looked for official APIs but didn’t find anything that directly provided the consumer price index (inflation index) or a preprocessed version of it (year-over-year variation, for example). But I stumbled by chance on this https://www.insee.fr/fr/statistiques/series/102342213 (it’s an official source, it’s the INSEE), whose title might be confusing: it suggests that the data there is grouped by products and detailed products (a special nomenclature named COICOP).
I preprocessed it here https://github.com/ReinforcedKnowledge/france-inflation-data-cleaned (includes raw data, preprocessing scripts and preprocessed data). The README is in French but it explains the data a bit and explains how I got granular datasets from that big raw data. I found it a bit messy and confusing at the beginning when I started looking at it, but I was able to extract every unique combination of the modalities (region/department, index type, index variation, if product is under the COICOP nomenclature, household type).
I hope it helps anyone who is looking for that data or trying to understand it, because it really took me some time and effort to find it and make sense of it.
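For anyone who only needs the year-over-year variation mentioned above, here is a minimal sketch of how it can be derived from a monthly index. It is not taken from the linked repository: `year_over_year` is a hypothetical helper, the `"YYYY-MM"` key format is an assumption, and the `toy_index` values are made up for illustration.

```python
from typing import Dict


def year_over_year(index: Dict[str, float]) -> Dict[str, float]:
    """Map 'YYYY-MM' to the percent change versus the same month a year earlier."""
    changes: Dict[str, float] = {}
    for month, value in index.items():
        year, mm = month.split("-")
        previous = index.get(f"{int(year) - 1}-{mm}")
        if previous:  # skip months with no (or zero) value one year earlier
            changes[month] = round((value / previous - 1) * 100, 2)
    return changes


# Made-up index values, not INSEE figures.
toy_index = {
    "2023-01": 100.0,
    "2023-02": 102.0,
    "2024-01": 105.0,
    "2024-02": 104.04,
}

yoy = year_over_year(toy_index)
```

Months without a counterpart twelve months earlier are left out rather than filled in, so the first year of any series produces no year-over-year values.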
submitted by /u/ReinforcedKnowledge
I want to make a local dataset, but I don’t know how or where to start. I need help.
submitted by /u/m_rain_bow
Hello everyone, I need a spam messages dataset to train an LLM-based spam message detection bot for Telegram. Any help is appreciated. (Data from Discord would also be enough.)
submitted by /u/Fun-Refrigerator6526
Hello everyone, I am currently in a class that requires me to use a classification dataset and a regression dataset that are not from the UCI ML repository. I want to do my project on something in the social sciences (I have a poli sci background), but I’ve been struggling to find datasets that align with what I’m looking for. Does anyone have good recs for places to look for the kind of datasets I want?
submitted by /u/jeanxette
I’m looking for a dataset/database of good quality (NO AI) food recipes with PICTURES that go alongside the instruction steps, for commercial use. I would like to use it in an app I’m creating.
I don’t mind paying for it, preferably a one-time payment rather than a subscription.
I would have to translate the instructions anyway, so what I’m really worried about are the pictures because of the copyright issues.
And NO APIs, I want to store the database locally.
Thank you
submitted by /u/3prisms
Feel free to request datasets on the platform, and take a look to see if there are any datasets you could source or produce.
These are paid (non-free) dataset requests that pay generously for your work.
With community help, we can connect data suppliers with data consumers.
submitted by /u/Opendatabay
I’m looking for a dataset/database of good quality (NO AI) food recipes with PICTURES that go alongside the instruction steps, for commercial use. I would like to use it in an app I’m creating.
I don’t mind paying for it, preferably a one-time payment rather than a subscription type of thing.
I would have to translate the instructions anyway, so what I’m really worried about are the pictures because of the copyright issues.
And NO APIs, I want to store the database locally.
Thank you
submitted by /u/AdministrativePie300