submitted by /u/MainPuzzleheaded8880
Category: Datatards
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
What we need is a dataset that contains disease images, labels, descriptions of each disease, and remedies. If possible, please provide some resources. Thanks in advance.
submitted by /u/Total-Narwhal-3263
Hand-verified investors backing AI and machine learning companies.
submitted by /u/project_startups
I’ve been building an independent weekly cannabis price index focused on consumer retail prices, not revenue or licensing data.
Most cannabis market reporting tracks sales, licenses, or company performance. I couldn’t find a public dataset that consistently tracks what consumers actually pay week to week, so I started aggregating prices from public online retail listings and publishing a fixed-baseline index.
High-level approach:
- Weekly index with a fixed baseline
- Category-level aggregation (CBD, THC, etc.)
- No merchant or product promotion
- Transparent, public methodology
- Intended as a complementary signal to macro market reports
Methodology and latest index are public here:
https://cannabisdealsus.com/cannabis-price-index/
https://cannabisdealsus.com/cannabis-price-index/methodology/
I’m mainly posting to get methodological feedback:
- Does this approach seem sound for tracking consumer price movement?
- Any obvious biases or gaps you’d expect from this type of data source?
- Anything you’d want clarified if you were citing something like this?
Not selling anything and not looking for promotion; genuinely interested in critique.
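For readers unfamiliar with the term, a fixed-baseline index can be sketched roughly as below. This is a generic illustration, not the methodology published at the links above: the median aggregation, the function name `fixed_baseline_index`, and the sample prices are all assumptions made for the example.

```python
from statistics import median


def fixed_baseline_index(weekly_prices, baseline_week):
    """Return {week: index value}, where the baseline week is pinned to 100."""
    # Assumption: each week's listings are summarised by their median,
    # which damps the effect of a few extreme listings.
    weekly_level = {week: median(prices) for week, prices in weekly_prices.items()}
    base = weekly_level[baseline_week]
    # Every week is expressed relative to the same, never-changing baseline.
    return {week: round(100 * level / base, 2) for week, level in weekly_level.items()}


# Hypothetical per-unit prices for a single category.
prices = {
    "2025-W01": [10.0, 12.0, 11.0],
    "2025-W02": [9.0, 11.0, 10.0],
    "2025-W03": [12.0, 13.0, 11.0],
}
index = fixed_baseline_index(prices, "2025-W01")
print(index)
```

Because the baseline never moves, a value of 90.91 reads directly as "about 9% cheaper than the baseline week", which is what makes this kind of index easy to cite.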
submitted by /u/theov666
Hi everyone,
I’m working on an academic project focused on Subjective Cognitive Decline (SCD) in menopausal women, using machine learning and explainable AI techniques.
While reviewing prior work, I found the paper “Clinical-Grade Hybrid Machine Learning Framework for Post-Menopausal subjective cognitive decline” particularly helpful. The hybrid ML approach and the focus on post-menopausal sleep-related health conditions closely align with the direction of my research.
Project overview (brief):
- Machine learning–based risk prediction for cognitive issues in menopausal women
- Use of Explainable AI (e.g., SHAP) to interpret contributing factors
- Intended strictly for academic and educational purposes
- Fully anonymous: no personally identifiable information is collected or stored
- Goal is awareness and early screening support, not clinical diagnosis
submitted by /u/Small-Day-8755
About Dataset –
https://www.kaggle.com/datasets/prashanthan24/synthetic-emotions-dataset-14k-texts-7-emotions
Overview
High-quality synthetic dataset with 13,970 text samples labeled across 7 emotions (Anger, Happiness, Sad, Surprise, Hate, Love and Fun). Generated using Mistral-7B for diverse, realistic emotion expressions in short-to-medium texts. Ideal for benchmarking NLP models like RNNs, BERT, or LLMs in multi-class emotion detection.
Sample
Text: “John clenched his fists, his face turning red as he paced back and forth in the room. His eyes flashed with frustration as he muttered under his breath about the latest setback at work.”
Emotion: Anger
Key Stats
- Rows: 13970
- Columns: text, emotion
- Emotions: 7 balanced classes
- Generator: Mistral-7B (synthetic, no PII/privacy risks)
- Format: CSV (easy import to Kaggle notebooks)
Use Cases
- Train/fine-tune emotion classifiers (e.g., DistilBERT, LSTM)
- Compare traditional ML vs. LLMs (zero-shot/few-shot)
- Augment real datasets for imbalanced classes
- Educational projects in NLP/sentiment analysis
Notes
Fully synthetic: labels auto-generated via LLM prompting for consistency. Check for duplicates/biases before heavy use. Pairs well with emotion notebooks!
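A quick way to act on the "check for duplicates/biases" advice is sketched below, assuming only the two columns listed above (`text`, `emotion`). The in-memory sample rows and the `audit` helper are made up for illustration; with the real file you would pass an opened CSV to `csv.DictReader` instead.

```python
import csv
import io
from collections import Counter


def audit(rows):
    """Count duplicate texts (after light normalisation) and label frequencies."""
    normalized = [row["text"].strip().lower() for row in rows]
    text_counts = Counter(normalized)
    # Each text seen n times contributes n - 1 redundant rows.
    duplicates = sum(n - 1 for n in text_counts.values() if n > 1)
    label_counts = Counter(row["emotion"] for row in rows)
    return duplicates, label_counts


# Hypothetical stand-in for the downloaded CSV.
sample_csv = io.StringIO(
    "text,emotion\n"
    "I can't believe this happened!,Surprise\n"
    "I can't believe this happened!,Surprise\n"
    "He slammed the door.,Anger\n"
)
rows = list(csv.DictReader(sample_csv))
duplicates, label_counts = audit(rows)
print(duplicates, dict(label_counts))
```

This only catches exact repeats after lowercasing and trimming; near-duplicates from an LLM generator would need a fuzzier comparison.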
submitted by /u/prashanthpavi
Hi all,
I’m trying to track down free / open datasets that contain real human open ends (open-ended survey responses) for testing and research. I have tried using AI-generated responses, but they just don’t capture the nuance of a real market research project.
If anyone knows of good public sources, I’d really appreciate being pointed in the right direction.
Thanks!
submitted by /u/472826
Been trying to build a dataset of tweets for a research project (analyzing discourse patterns around specific topics) and the official X API is basically unusable unless you want to drop $5k+/month for reasonable limits.
I’ve tried a few different approaches:
- Official API → rate limits killed me immediately
- Manual scraping → got my IP banned within a day
- Some random npm packages → half of them are broken now
Found a breakdown comparing different methods and it actually explained why most DIY scrapers fail (anti-bot stuff has gotten way more aggressive lately). Makes sense why so many tools just stopped working after Elon’s changes.
Anyone here working with Twitter data regularly? What’s actually reliable right now? Need something that can pull ~50k tweets/day without constant babysitting.
Not trying to do anything shady – just need public tweet text, timestamps, and basic engagement metrics for academic analysis.
submitted by /u/Technical_Fee4829
Are you sitting on a compiled Instagram creator database with depth beyond just handles?
I’m looking to buy a dataset outright that includes:
- Instagram handle
- District / city
- State
- Phone number
Creator range: nano / micro influencers
Geo focus: South India
This is a clean purchase, not rev-share, not scraping on demand, not ongoing work.
If you already have the data, we can close quickly.
If interested, DM with:
- Approx record count
- Fields available
- Price expectation
Only reaching out to people with ready data at this depth.
submitted by /u/mined_it
I was working on a deepfake research paper and trying to get access to the DFDC dataset, but for some reason the official DFDC website isn’t working. Is it because I didn’t acquire access to it? Is there any other way I can get my hands on the dataset?
submitted by /u/Ok_Concert6723
I’m interested in datasets around engagement, posts, comments, or interaction graphs.
A lot of older datasets seem outdated or incomplete now.
Are there still good public sources, or is most social data now API/scrape-only?
submitted by /u/crowpng
Recently, I posted here on Reddit asking for ideas on what I could build with a dataset of ~2 million pairs of messy/clean Brazilian addresses. A few kind folks shared some great suggestions, and one idea that really stood out was building an address parser.
That pushed me into the world of LLM fine-tuning for the first time.
I decided to partially fine-tune LLaMA 3.2 1B, focusing specifically on address normalization and field extraction (address, complement, neighborhood, city, state, country, coordinates, etc.). Surprisingly, the early results look quite promising.
To properly evaluate it, I also built a small API to:
- Run inference tests
- Perform post-inference validation
- Compute a confidence score based on consistency checks (postal code, city/state match, field presence, etc.)
Below is an example request body and the corresponding response.
Request
{
  "inputs": [
    "quadra -42.93386179 quadra arse 102 alameda 12 a, 5045 77023-582 brasil -21.26567258 palmas",
    "torre -43.02525939 bela vista 5 brasil minas gerais são joão do paraíso beco do pôr do sol, 4289 -19.14142529"
  ]
}
Response
[
  {
    "address": "Quadra Arse 102 Alameda 12 A, 5045",
    "complement": "quadra",
    "city": "Palmas",
    "country": "Brasil",
    "postal_code": "77023-582",
    "latitude": "-21.26567258",
    "longitude": "-42.93386179",
    "confidence": 1.0,
    "validation": {
      "postal_code_validation": {
        "is_valid": true,
        "found_in_input": true,
        "city_match": true
      },
      "field_validation": {
        "address_found": true,
        "complement_found": true,
        "neighborhood_found": false,
        "city_found": true,
        "state_found": false,
        "country_found": true
      }
    }
  },
  {
    "address": "Beco Do Pôr Do Sol, 4289",
    "complement": "torre",
    "neighborhood": "Bela Vista 5",
    "city": "São João Do Paraíso",
    "state": "Minas Gerais",
    "country": "Brasil",
    "latitude": "-19.14142529",
    "longitude": "-43.02525939",
    "confidence": 0.92,
    "validation": {
      "postal_code_validation": {
        "is_valid": false
      },
      "field_validation": {
        "address_found": true,
        "complement_found": true,
        "neighborhood_found": true,
        "city_found": true,
        "state_found": true,
        "country_found": true,
        "city_in_state": false,
        "neighborhood_in_city": false
      }
    }
  }
]
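For anyone who wants to compare scoring approaches, here is a minimal sketch of one common baseline: the weighted share of consistency checks that pass. It is not the formula behind the `confidence` values above (it gives 0.67 for the second response, not 0.92), so the API evidently weighs its checks differently. The function name `confidence` and the `weights` parameter are assumptions; the flag names are copied from the response.

```python
def confidence(validation, weights=None):
    """Weighted fraction of validation flags that are true (default weight 1.0)."""
    weights = weights or {}
    # Flatten the nested groups into (check name, passed) pairs.
    checks = [
        (name, bool(ok))
        for group in validation.values()
        for name, ok in group.items()
    ]
    total = sum(weights.get(name, 1.0) for name, _ in checks)
    passed = sum(weights.get(name, 1.0) for name, ok in checks if ok)
    return round(passed / total, 2) if total else 0.0


# "validation" object of the second response above.
validation = {
    "postal_code_validation": {"is_valid": False},
    "field_validation": {
        "address_found": True,
        "complement_found": True,
        "neighborhood_found": True,
        "city_found": True,
        "state_found": True,
        "country_found": True,
        "city_in_state": False,
        "neighborhood_in_city": False,
    },
}
score = confidence(validation)
strict_score = confidence(validation, weights={"is_valid": 3.0})
print(score, strict_score)
```

Making the weights explicit has one practical benefit: you can tune them against a hand-labelled sample and see which checks actually predict a wrong parse.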
I’d really appreciate honest feedback from people more experienced with:
- Fine-tuning small LLMs
- Address parsing / entity extraction
- Post-inference validation strategies
- Confidence scoring approaches
Does this look like a reasonable direction for a 1B model?
Anything you’d improve architecturally or evaluation-wise?
Thanks in advance — this project has been a great learning experience so far 🙏
submitted by /u/Hour-Dirt-8505
This dataset contains information on what technologies were found on domains that were crawled in December 2025.
A few common use cases for this type of data
- You’re a developer who had built a particular solution for a client, and you want to replicate your success by finding more leads based on that client’s profile. For example, find me all electrical wholesalers using WordPress that have a `.com.au` domain.
- You’re performing market research and you want to see who is already paying for your competitors. For example, find me all companies using my competitors product who are also paying for enterprise technologies (indicates high technology expenditure).
- You’re a security researcher who is evaluating the impact of your findings. For example, give me all sites running a particular version of a WordPress plugin.
The 67K domain dataset can be found here: https://www.dropbox.com/scl/fi/d4l0gby5b5wqxn52k556z/sample_dec_2025.zip?rlkey=zfqwxtyh4j0ki2acxv014ibnr&e=1&st=xdcahaqm&dl=0
The full 5M+ domains can be purchased for 99 USD at: https://versiondb.io/
VersionDB’s WordPress catalogue can be found here: https://versiondb.io/technologies/wordpress/
Enjoy!
submitted by /u/Upper-Character-6743
I’m experimenting with using prompt-based object detection (open-vocabulary / vision-language models) as a way to auto-generate training datasets for downstream models like YOLO.
Instead of fixed classes, the detector takes any text prompt (e.g. “white Toyota Corolla”, “people wearing safety helmets”, “parked cars near sidewalks”) and outputs bounding boxes. Those detections are then exported as YOLO-format annotations to train a specialized model.
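The export step can be sketched as follows, assuming the detector returns pixel-space boxes as `(x_min, y_min, x_max, y_max)`; the post does not say what the detector's actual output format is. YOLO label files hold one line per object: a class index followed by the box centre, width, and height, all normalised to the image size. The helper name `to_yolo_line` is hypothetical.

```python
def to_yolo_line(class_id, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to one YOLO label line."""
    x_min, y_min, x_max, y_max = box
    # YOLO stores the box centre and size as fractions of the image dimensions.
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical detection on a 640x480 image; class 0 stands for the prompt.
line = to_yolo_line(0, (100, 200, 300, 400), 640, 480)
print(line)
```

Each text prompt would map to one class index, and the lines for an image go into a `.txt` file sharing the image's base name.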
Observations so far:
- Detection quality is surprisingly high for many niche or fine-grained prompts
- Works well as a bootstrapping or data expansion step
- Inference is expensive and not suitable for real-time use. This is strictly a dataset creation / offline pipeline idea
I’m trying to evaluate:
- How usable these auto-generated labels are in practice
- Where they fail compared to human-labeled data
- Whether people would trust this for pretraining or rapid prototyping
Demo / tool I’m using for the experiment (don’t abuse it; it will crash if bombarded with requests):
I’m mainly looking for feedback, edge cases, and similar projects. If you’ve tried similar approaches before, I’d be very interested to hear what worked (or didn’t).
submitted by /u/eyasu6464
I was a heavy automeris.io (WebPlotDigitizer) user until the v5 version. Somewhat inspired by it, I’ve been working on a combined chart snipper and OCR text+table sampler. Desktop rather than web-based and built using Python, tesseract, and openCV. MIT licensed. Some instructions to get started in the readme.
Chart snipping should be somewhat familiar to automeris.io users but it starts with a screengrab. The tool is currently interactive but I’m thinking about more automated workflows. IMO the line detection is a bit easier to manage than it is in automeris with just a sequence of clicks but you can also drag individual points around. Still adding features and support for more chart types, better x-axis date handling etc. The Tkinter GUI has some limitations (e.g., hi-res screen support is a bit flaky) but is cross-platform and a Python built-in. Requests welcome.
submitted by /u/foldedcard
I am looking for platforms with listings of commercial/proprietary datasets. Any recommendations where to find them?
submitted by /u/Latter-Gift630
I compiled and structured a global automotive specifications dataset covering more than 12,000 vehicle variants from over 100 brands, model years 1990-2025.
Each record includes:
- Brand, model, year, trim
- Engine specifications (fuel type, cylinders, power, torque, displacement)
- Dimensions (length, width, height, wheelbase, weight)
- Performance data (0-100 km/h, top speed, CO2 emissions, fuel consumption)
- Price, warranty, maintenance, total cost per km
- Feature list (safety, comfort, convenience)
Available in CSV, JSON, and SQL formats. Useful for developers, researchers, and AI or data analysis projects.
GitHub (sample, details and structure):
submitted by /u/Ok_Cucumber_131
Clean, analysis-ready Harris County (TX) parcel-level real estate dataset.
Fully documented, GIS-ready, delivered in Parquet format.
Perfect for analytics, GIS, and data science workflows.
#realestate #HarrisCounty #Texas #GIS #parceldata #dataset #Parquet #opendata #HCAD #propertyrecords #datascience #analytics #geospatial
submitted by /u/ThorImagery
Hi guys,
We’re a young company based in Europe and collect a significant amount of telemetry data from smart home devices in residential houses (e.g. temperature, energy consumption, usage patterns).
We believe this data could be valuable for companies across multiple industries (energy, proptech, insurance, analytics, etc.). However, we’re still quite new to the data monetization topic and are trying to better understand:
- How to price such data (typical models, benchmarks, CPMs, subscriptions, etc.)
- Who the realistic buyers might be
- What transaction volumes or market sizes to expect
- Where data like this is usually sold (marketplaces, direct sales, partnerships)
Where would you recommend starting to learn about this? Are there resources, communities, marketplaces, or frameworks you’ve found useful? First-hand experiences are especially welcome.
Thanks a lot for any help!
submitted by /u/Intelligent_Offer954
Hi all,
I’m the Co-founder of Datai. We are releasing a 65-hour dataset of spontaneous, two-speaker dialogues focused on Kenyan (KE) and Filipino (PH) English accents.
We built this to solve a specific internal problem: standard datasets (like LibriSpeech) are too clean. We needed data that reflects WebRTC/VoIP acoustics and non-Western prosody.
We are releasing this batch on Hugging Face for the community to use for ASR benchmarking, accent robustness testing, or diarization experiments.
The Specs:
- Total Duration: ~65 hours (Full dataset is 800+ hours)
- Speakers: >150 (Majority Kenyan interviewees, ~15 Filipino interviewers)
- Topic: Natural, unscripted day-to-day life conversations.
- Audio Quality: Recorded via WebRTC in Opus 48kHz, transcoded to pcm_s16le.
- Structure: Split-track (Stereo). Each speaker is on a separate track.
Processing & Segmentation: We processed the raw streams using silero-vad to chunk audio into 1 to 30-second segments.
File/Metadata Structure: We’ve structured the filenames to help with parsing: ROOM-ID_TRACK-ID_START-MS_END-MS
- ROOM-ID: Unique identifier for the conversation session.
- TRACK-ID: The specific audio track (usually one speaker per track).
Technical Caveat (the edge case): Since this is real-world WebRTC data, we are transparent about the dirt in the data: If a speaker drops connection and rejoins, they may appear on a new TRACK-ID within the same ROOM-ID. We are clustering these in v2, but for now, treat Track IDs as session-specific rather than global speaker identities.
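A rough sketch of parsing those filenames is below. The `.wav` extension, the example name, and the assumption that TRACK-ID contains no underscores are mine, not stated in the post; splitting from the right at least tolerates underscores inside ROOM-ID.

```python
from pathlib import Path
from typing import NamedTuple


class Segment(NamedTuple):
    room_id: str
    track_id: str
    start_ms: int
    end_ms: int

    @property
    def duration_s(self):
        """Segment length in seconds."""
        return (self.end_ms - self.start_ms) / 1000


def parse_segment_name(filename):
    """Split ROOM-ID_TRACK-ID_START-MS_END-MS into typed fields."""
    stem = Path(filename).stem  # drop the directory and the extension
    # Split from the right so underscores inside ROOM-ID are preserved.
    room_id, track_id, start_ms, end_ms = stem.rsplit("_", 3)
    return Segment(room_id, track_id, int(start_ms), int(end_ms))


# Hypothetical filename following the documented pattern.
seg = parse_segment_name("room42_trackA_15000_27500.wav")
speaker_key = (seg.room_id, seg.track_id)
print(seg, seg.duration_s, speaker_key)
```

Given the caveat above, `(room_id, track_id)` is the safest grouping key for now: it identifies a speaker within one session, not across the whole dataset.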
Access: The dataset is hosted on Hugging Face (gated to prevent bots/abuse, but I approve manual requests quickly).
Link is in the comments.
submitted by /u/Downtown_Valuable_44
I’ve been working on a side project and ended up compiling a dataset that may be useful beyond what I originally needed it for, so I’m considering releasing it publicly.
At a high level, the dataset contains:
- structured records collected over a multi-year period
- consistent timestamps and identifiers
- minimal preprocessing (basic cleaning + deduplication only)
It’s not tied to a specific paper or product, more something that could support exploratory analysis, modeling, or benchmarking, depending on the use case.
Before publishing, I wanted to sanity-check with this community:
- what details do you usually look for to judge dataset quality?
- is light preprocessing preferred, or raw + processed versions?
- anything that would immediately make this more usable for research?
Happy to share more specifics if there’s interest, and open to feedback before release.
submitted by /u/crowpng
Does anyone have access to current lists of CPAs in the US? Or ideas on the best way to scrape this information?
submitted by /u/jeremydy
Hello, I am currently studying data analytics and data science. I generally want to focus on one of these two fields and learn. But due to the high competition in the market and the negative impact of artificial intelligence on the field, should I start or choose another field? What exactly do I need to know and learn to stand out in the market competition in the DA DS fields and find a job more easily? There is a lot of information on the Internet, so I can’t find the exact required learning path. Recommendations from professionals in this field are very important to me. Is it worth studying this field and how? Thank you very much
submitted by /u/No_Staff_7246
Hi everyone,
I’m a master’s student working on academic research and I’m looking for a compiled dataset for S&P 500 companies that includes:
– Revenue
– Net Income (profit)
– R&D expenses (I know some companies don’t report them)
Ideally:
– Annual data
– Multiple years (e.g. 2010–2024, but flexible)
– Excel or CSV format
This is strictly for non-commercial, academic use (master’s thesis).
If anyone already has this dataset (e.g. from Compustat / Capital IQ / Bloomberg) and is willing to share or point me in the right direction, I’d really appreciate it.
Thanks a lot!
submitted by /u/SuddenBookkeeper6351
Please support this project with a contribution ❤️ Donations are welcome. Thank you!
submitted by /u/Appropriate_West_879
Sharing a new open-source (Apache 2.0) image-prompt dataset. Lunara Aesthetic is an image dataset generated using our sub-10B diffusion mixture architecture, then curated, verified, and refined by humans to emphasize aesthetic and stylistic quality.
submitted by /u/paper-crow