submitted by /u/tonypaul009
Category: Datatards
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
Hi guys, I’m thinking of presenting the movies dataset for my data visualization course and explaining the exploratory analysis I did on the data.
But the lecturer has said it should be storytelling, not simply stating obvious points such as “top 20 movies of all time”.
Can anyone suggest how I can steer this dataset toward a good storytelling angle and explore the data further for the audience?
So far I’m only seeing the generic movie datasets on Kaggle.
Any other pointers, including suggestions for a different dataset, would be helpful. I’d like to hear your thoughts.
I only have to present what I’m plotting visually, not the complete project, so the professor can check where I am and give feedback.
submitted by /u/dishdash-paradox
Hey y’all, fresher here. I am working on an academic project (enterprise analytics pipelines and BI systems) and exploring whether my company would even remotely consider providing the data, and whether it can be anonymized. Does anyone here have experience anonymizing data? If so, what are the ways to do it?
E.g.
- Masking identifiers
- Generating synthetic datasets from real distributions
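To make the two options above concrete, here is a minimal Python sketch using only the standard library. The column names, the sample rows, and the secret key are all made up for illustration. Keyed hashing is pseudonymization rather than full anonymization: combinations of the remaining columns can still identify people, so treat this as a starting point only.

```python
import hashlib
import hmac
import random

# Hypothetical secret; keep the real one outside the dataset and the code repository.
SECRET_KEY = b"replace-with-a-real-secret"


def mask_identifier(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash (pseudonymization).

    The same input always maps to the same token, so joins still work,
    but the original value cannot be read off the token.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def synthesize_column(real_values: list, n: int, seed: int = 0) -> list:
    """Draw n synthetic values from the empirical distribution of one column.

    Each column is resampled independently, so per-column frequencies are
    preserved but correlations between columns are not.
    """
    rng = random.Random(seed)
    return rng.choices(real_values, k=n)


# Made-up example rows (not real data).
rows = [
    {"employee_id": "E1001", "department": "Sales", "salary": 52000},
    {"employee_id": "E1002", "department": "Sales", "salary": 58000},
    {"employee_id": "E1003", "department": "IT", "salary": 61000},
]

masked = [{**row, "employee_id": mask_identifier(row["employee_id"])} for row in rows]
synthetic_salaries = synthesize_column([row["salary"] for row in rows], n=5)
```

If the analysis needs relationships between columns (for example salary by department), resampling whole rows or fitting a joint model would be needed instead of the per-column draw shown here.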
submitted by /u/IamThat_Guy_
Posting a dataset I’ve been building for a while:
What it is: The USDA Dr. Duke’s Phytochemical and Ethnobotanical Databases, restructured into a single flat table and enriched with four external data sources.
Schema (8 columns):
- chemical: compound name (USDA nomenclature)
- plant_species: binomial species name
- application: traditional medicinal use (where recorded)
- dosage: reported effective dose or concentration
- pubmed_mentions_2026: total PubMed publication count
- clinical_trials_count_2026: ClinicalTrials.gov study count
- chembl_bioactivity_count: ChEMBL bioassay data points
- patent_count_since_2020: USPTO patents since Jan 2020
Stats: 104,388 records, 24,771 unique compounds, 2,315 species.
Formats: JSON (~18 MB) and Parquet (~900 KB).
Free sample (400 rows, CC BY-NC 4.0): https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON
There’s also a quickstart Jupyter notebook in the repo if you want to run some DuckDB queries against the sample.
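For readers who want a feel for the schema before opening the notebook, here is a rough standard-library Python sketch of the kind of query the flat table allows. It does not use DuckDB or the real sample file; the records below are invented placeholders that only mirror the eight column names from the schema.

```python
import json
from collections import defaultdict

# Placeholder records following the 8-column schema; all values are invented.
SAMPLE_JSON = """
[
  {"chemical": "compound_a", "plant_species": "Species one", "application": null,
   "dosage": null, "pubmed_mentions_2026": 120, "clinical_trials_count_2026": 3,
   "chembl_bioactivity_count": 40, "patent_count_since_2020": 2},
  {"chemical": "compound_a", "plant_species": "Species two", "application": null,
   "dosage": null, "pubmed_mentions_2026": 120, "clinical_trials_count_2026": 3,
   "chembl_bioactivity_count": 40, "patent_count_since_2020": 2},
  {"chemical": "compound_b", "plant_species": "Species one", "application": null,
   "dosage": null, "pubmed_mentions_2026": 300, "clinical_trials_count_2026": 9,
   "chembl_bioactivity_count": 75, "patent_count_since_2020": 5}
]
"""

records = json.loads(SAMPLE_JSON)

# Count how many distinct species each compound appears in.
species_sets = defaultdict(set)
mentions = {}
for rec in records:
    species_sets[rec["chemical"]].add(rec["plant_species"])
    mentions[rec["chemical"]] = rec["pubmed_mentions_2026"]

species_per_compound = {chem: len(species) for chem, species in species_sets.items()}

# Rank compounds by PubMed mention count, highest first.
ranked = sorted(mentions.items(), key=lambda item: item[1], reverse=True)
```

With the real sample, the same grouping could be expressed as a single SQL `GROUP BY` over the Parquet or JSON file.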
The full dataset is commercial (one-time license). The base USDA data is public domain; the enrichment work is what you’re paying for.
I built the dataset solo in Germany; the server is a Hetzner VPS running PostgreSQL 15 and Python 3.12. Happy to answer methodology questions.
submitted by /u/DoubleReception2962
Hi everyone,
I’m currently involved in a project where we are collecting large volumes of two-speaker conversational call audio intended for AI training purposes (speech recognition, conversational AI, etc.).
We’re trying to understand the best ways to distribute or license this kind of dataset to companies or research teams that need training data.
The recordings are:
• Natural phone-style conversations
• Two participants per recording
• Collected with consent
• PII removed
• Optional transcription and metadata available
I’m curious if anyone here has experience with:
- selling or licensing speech datasets
- platforms/marketplaces for AI training data
- typical pricing per hour of conversational audio
Most information online is very vague, so hearing real experiences from people in the space would be really helpful.
Thanks!
submitted by /u/FaithlessnessWeak199
Hi people!
I’d like to share a personal project I’ve been working on, an Edible Plant Database:
Mods, I interpreted your rule as “Self-promotion(of a website/domain you work for or own) without disclosure will be removed” – So I believe this is fine to share, as I am disclosing I made it? Apologies if I misunderstood that rule. Just want to clarify, I make no money from this project, and it’s a small hobby/self-hosted database I never intend to commercialise or monetise in any way, it will always be free.
Recently, I was searching for some kind of database of edible plants around the world to add to my “prepper” library, and I came across this old post: https://old.reddit.com/r/preppers/comments/iedq94/catalogue_of_all_the_worlds_edible_plants/
Basically, it seemed to be exactly what I was looking for, but it’s a 5-year-old post, and unfortunately, none of the download links worked for me.
The original source is a guy named Bruce French: https://www.abc.net.au/news/2020-08-22/food-plant-solutions-malnutrition-farming-edible-plants/12580732
He still maintains his edible plant database here: https://foodplantsinternational.com/. It’s a fantastic resource; I encourage you to check it out.
The actual searchable database is here: https://fms.cmsvr.com/fmi/webd/Food_Plants_World – however, I was unable to find a bulk download, and the search interface is quite clunky/hard to navigate (I’m sure it was created a long time ago).
So, I decided to create a bit of an ADHD passion project for myself in my spare time. However, it’s got to the point where I thought I should give back to the community.
I decided to take Bruce’s amazing collection and package it in a modern Web UI and a Modern Search interface, so I created this website, The Edible Plant DB: https://edibleplantdb.org/. I’m a bit of an amateur web developer and like playing around with stuff like this in my spare time.
I did, however, decide to make some improvements along the way. Most of Bruce’s collection does have images of the plants; however, they were quite small (basically just thumbnail-sized), and I thought, well, if I’m making a prepper edible plant database, there should be clearer images for people trying to identify the plants. So I updated all the plant images in the database with images sourced from https://www.inaturalist.org/ and Wikipedia. I was able to find images for about 80% of the plants in the DB. But I still need to find images/better descriptions for the niche/uncommon species in the database.
I also went a bit over the top and turned it into a really basic form of a “Wiki”, each plant page has an edit button at the top, so anyone can make an edit, as well as contribute images for each plant (especially for the ones with no images): https://edibleplantdb.org/contribute
Then, in terms of packaging, I am a huge supporter of .ZIM files and the organisation Kiwix: it’s basically everything in one file and much more useful for offline browsing, instead of me just providing a DB file and a bunch of directories/files with images, etc.
You can download the torrent here: https://edibleplantdb.org/downloads – however, just a disclaimer, I literally just started seeding this torrent, so it’s going to be a bit slow, unless I get some support from the community to get the seeding going 🙂
Anyway! Let me know what you think!
PS: Still a work in progress, and I am sure my amateur code has some bugs waiting to be discovered!
Also Magnet link (for ZIM file): magnet:?xt=urn:btih:86cb9bd89b458e75dae4be6281ad5522561f6a8b&dn=edibleplantdb.zim&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce
submitted by /u/tmosh
I am interested in a dataset, preferably LinkedIn data, with the following fields:
job title, job description, company name, start and end date
No personal information needed. Any ideas? Paid is fine too, at a reasonable price. I am poor af.
I need a large set, like millions of records. Thanks.
submitted by /u/BakulkouPoGulkach
Hi everyone!
I’ve been working on a project to clean and normalize US equity fundamentals and filings. One thing that always frustrated me was how messy the raw SEC filings are.
The underlying data (10-K, 10-Q, 13F, Form 4, etc.) is all publicly available through EDGAR, but the structure can be pretty inconsistent:
- company-specific XBRL tags
- missing or restated periods
- inconsistent naming across filings
- insider transaction data that’s difficult to parse at scale
- 13F holdings spread across XML tables with varying structures
I ended up building a small pipeline to normalize some of this data into a consistent format. The dataset currently includes:
- normalized income statements, balance sheets and cashflow statements
- institutional holdings from 13F filings
- insider transactions (Form 4)
All sourced from SEC filings but cleaned so that fields are consistent across companies and periods.
The goal was to make it easier to pull structured data for feature engineering without spending a lot of time wrangling the raw filings.
For example, querying profitability ratios across multiple years:
/profitability-ratios?ticker=AAPL&start=2020&end=2025
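As a rough illustration of how such a request could be assembled from Python, here is a small sketch. Only the path and the three query parameters come from the example above; the base URL is a placeholder, since the post does not give the real host, and the year-order check is my own assumption about sensible input.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the post does not state the real host.
BASE_URL = "https://api.example.com"


def profitability_ratios_url(ticker: str, start: int, end: int) -> str:
    """Build the request URL for the /profitability-ratios endpoint."""
    if start > end:
        raise ValueError("start year must not be after end year")
    query = urlencode({"ticker": ticker.upper(), "start": start, "end": end})
    return f"{BASE_URL}/profitability-ratios?{query}"


url = profitability_ratios_url("AAPL", 2020, 2025)
```

The resulting string can be passed to any HTTP client; authentication, if the API requires it, is not covered here.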
I wrapped it in a small API so it can be used directly in research pipelines or for quick exploration:
Hopefully people find this useful in their research and signal finding!
Disclaimer: This is a project I built. Sharing it here in case it’s useful for others looking for financial data
submitted by /u/myztaki
I’ve been working on a mixed-methods research platform, and one thing that kept coming up from users was the pain of cleaning datasets before they could even start analysing them.
Most people were either writing Python/R scripts or doing it manually in Excel, both of which break the workflow when you just want to get to the analysis.
So I built a data cleaning module directly into the analysis tool. It handles the usual stuff:
- Duplicate removal (exact match or by specific columns)
- Missing value handling (drop rows, fill with mean/median/mode/custom value, forward/backward fill)
- Outlier detection (IQR and Z-score methods)
- String cleaning (trim, case conversion)
- Type conversion
- Find & replace (with regex)
- Row filtering by conditions
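For anyone unfamiliar with these operations, three of them (duplicate removal, median fill, and IQR outlier detection) can be sketched in plain Python as below. This is a generic illustration with made-up function names and values, not the module's actual implementation.

```python
import statistics


def drop_duplicates(rows, keys=None):
    """Keep the first occurrence of each row, comparing all columns or only `keys`."""
    seen = set()
    unique = []
    for row in rows:
        marker = tuple(row[k] for k in keys) if keys else tuple(sorted(row.items()))
        if marker not in seen:
            seen.add(marker)
            unique.append(row)
    return unique


def fill_missing_with_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up example data.
rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "a"}, {"id": 2, "v": "a"}]
exact_unique = drop_duplicates(rows)
unique_by_v = drop_duplicates(rows, keys=["v"])

raw = [10, 12, None, 11, 13, 12, 95]
filled = fill_missing_with_median(raw)
outliers = iqr_outliers(filled)
```

Note that the order of operations matters: filling missing values before outlier detection, as done here, lets an extreme value influence the fill statistic, which is one reason a median is often preferred over a mean.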
Each operation shows a preview with before/after diffs so you can review changes row by row before applying. There’s also inline cell editing for quick manual fixes and one-click undo.
Curious how others approach this:
- Do you clean data in a separate tool or prefer it integrated into your analysis workflow?
- What operations do you find yourself doing most often?
- Anything obvious I’m missing?
Happy to share a link if anyone wants to try it out. Works with CSV, Excel, and SPSS files.
submitted by /u/Sensitive-Corgi-379
Hi! I am starting my Master’s thesis in Business Intelligence and I am looking for large datasets to perform either annual budget forecasting or churn prevention. Thanks!
submitted by /u/Equivalent_Ad_1566
We are currently sourcing large-scale programming code datasets to support enterprise clients developing AI and large language models (LLMs).
We are looking for high-quality datasets containing raw source code or structured code repositories across multiple programming languages.
Examples of relevant datasets include:
• Raw source code collections
• Curated open-source repositories
• Code with documentation or comments
• Code paired with explanations or Q&A
• Version-controlled project snapshots
Preferred characteristics
• Multi-language coverage (e.g. Python, JavaScript, Java, Solidity, C++, Go, Rust)
• Large-scale datasets suitable for AI/LLM training
• Clear licensing and commercial usage rights
• Structured formats such as JSON, CSV, Parquet, or repository archives
If you are a data provider, research group, or organisation holding code datasets, we would be interested in discussing potential collaboration and licensing terms.
Please reach out
submitted by /u/Winter-Lake-589
I am looking for a Data set that shows Medicaid population growth by zip code in the State of Missouri.
submitted by /u/Vlosuriello
Hello!
I was wondering if there are any big Twitter datasets? I’m thinking of something like the big dataset that exists for Reddit (I don’t remember the name, but I think it is pretty well known), just for tweets instead.
submitted by /u/AffectWizard0909
I compiled 200k+ human-written code reviews from top OSS projects including React, TensorFlow, VSCode, and more.
This dataset helped me finetune a version of Qwen2.5-Coder-32B-Instruct specialized in code reviews.
The finetuned model generated better code fixes and review comments, scoring 4x higher on BLEU-4, ROUGE-L, and SBERT than the base model.
Feel free to integrate this dataset into your LLM training and see improvements in coding skills!
submitted by /u/Ok_Employee_6418
Model architectures keep improving, but a lot of teams I talk to struggle more with training data than models.
Things like:
- noisy datasets
- inconsistent labeling
- missing metadata
- lack of domain coverage
Do people here feel the same, or is data not the biggest bottleneck in your experience?
submitted by /u/JayPatel24_
Hi guys,
I’m building a real-time aviation monitoring dashboard in Python, and right now I’m using the OpenSky API to get live aircraft positions.
The issue is that OpenSky only provides aircraft state data (lat, lon, altitude, callsign, etc.), but it doesn’t include the flight’s origin and destination airports.
I’m looking for a free API that provides:
• real-time flight positions
• origin airport
• destination airport
• preferably no strict monthly request limits (or at least generous ones)
I’ve looked at a few options like aviation and airlabs, but their free tiers are very limited in the number of requests.
Does anyone know of:
- A free API that provides route info with live flight data?
- A workaround people use to infer origin/destination from ADS-B data?
- Any open datasets or community feeds that include this info?
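On the second question, one heuristic that comes to mind is matching the first and last points of a recorded track to the nearest known airport. The sketch below is only an assumption about how that could look: the airport coordinates are approximate, the 10 km threshold is arbitrary, and the track is invented.

```python
import math

# Approximate airport coordinates (lat, lon), for illustration only;
# a real lookup would load a complete airport table.
AIRPORTS = {
    "ZRH": (47.46, 8.55),
    "GVA": (46.24, 6.11),
    "FRA": (50.04, 8.56),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS-84 points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_airport(lat, lon, max_km=10.0):
    """Return the closest airport code, or None if nothing is within max_km."""
    code, dist = min(
        ((c, haversine_km(lat, lon, a_lat, a_lon)) for c, (a_lat, a_lon) in AIRPORTS.items()),
        key=lambda pair: pair[1],
    )
    return code if dist <= max_km else None


def infer_route(track):
    """Guess (origin, destination) from a time-ordered list of (lat, lon) points."""
    if not track:
        return (None, None)
    return nearest_airport(*track[0]), nearest_airport(*track[-1])


# Invented track: starts near ZRH, ends near GVA.
route = infer_route([(47.45, 8.56), (47.0, 7.5), (46.25, 6.12)])
```

This only works when the track covers the whole flight. If reception starts mid-air, the first point is far from any airport and the function returns None for the origin, so a low-altitude or on-ground check on the endpoints would make the guess more trustworthy.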
Thanks!
submitted by /u/Appropriate-Tip935
Hi!! I have an assignment on MLR and need a dataset for it, but I want something fairly unique, and I’m panicking because the deadline is approaching.
submitted by /u/Big-Pirate-1184
Hi everyone,
I’m a computer science student at EPFL (Switzerland), and I’m currently working on a side project: an automated database analyzer that detects toxic/expensive SQL queries and uses AI to actively rewrite them into optimized code.
I’ve built the local MVP in Python, but testing it against my own “fake” mock data isn’t enough anymore. I need real-world chaos.
Would anyone be willing to share an anonymized export of their pg_stat_statements (CSV) and the basic DDL schema of their database?
- No PII or customer data needed.
- I just need the query structure, execution time, calls, and I/O blocks.
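To show what such an export contains and how little is needed, here is a small Python sketch that ranks queries from a CSV by total execution time. The sample rows are invented; the column names follow the pg_stat_statements view in PostgreSQL 13 and later, and the ranking function is a hypothetical helper, not part of my engine.

```python
import csv
import io

# Invented sample export; column names follow pg_stat_statements (PostgreSQL 13+).
SAMPLE_CSV = """query,calls,total_exec_time,shared_blks_hit,shared_blks_read
SELECT * FROM orders WHERE customer_id = $1,1200,54000.0,90000,15000
SELECT id FROM users WHERE email = $1,50000,2500.0,150000,10
UPDATE carts SET updated_at = now() WHERE id = $1,300,900.0,1200,40
"""


def rank_by_total_time(csv_text):
    """Parse the export and sort queries by total execution time (ms), slowest first."""
    ranked = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        calls = int(row["calls"])
        total_ms = float(row["total_exec_time"])
        hit = int(row["shared_blks_hit"])
        read = int(row["shared_blks_read"])
        ranked.append({
            "query": row["query"],
            "calls": calls,
            "total_ms": total_ms,
            "mean_ms": total_ms / calls if calls else 0.0,
            # Share of block accesses served from shared buffers rather than read in.
            "cache_hit_ratio": hit / (hit + read) if hit + read else None,
        })
    ranked.sort(key=lambda r: r["total_ms"], reverse=True)
    return ranked


ranked = rank_by_total_time(SAMPLE_CSV)
```

Because pg_stat_statements stores normalized query text with `$1`-style placeholders, literal parameter values do not appear in the export, though table and column names still do.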
In exchange, I will run your data through my engine and send you the generated “Optimization & Cost-Saving Audit” report for free. It might actually help you spot a bottleneck!
Let me know if you are open to helping a student out, send me a DM! Thanks!
submitted by /u/Foreign-Bison-7826
I’m working on a system that processes large medical record packets and generates a chronological timeline with evidence citations (think: turning hundreds or thousands of pages of medical records into a structured chronology).
Right now I’m trying to find datasets that resemble real world medical record packets so I can test robustness. Most of the datasets I’ve found so far are either:
• purely structured EHR tables (diagnoses, labs, etc.)
• small sets of individual clinical notes
• synthetic datasets
What I’m ideally looking for:
• Long clinical documents (discharge summaries, physician notes, operative reports)
• Multi-document patient records
• Collections of clinical PDFs or reports
• Narrative-heavy hospital documentation
• Anything resembling actual chart records rather than isolated notes
Datasets I already know about:
• MIMIC-IV / MIMIC-IV-Note (waiting for credentials, anyone have a mirror?)
• i2b2 / n2c2 clinical NLP datasets (registration to download it is closed?)
• MTSamples medical transcription dataset
submitted by /u/deputy1389
Hi everyone! We just released a large European (e-)bike-sharing dataset and thought people here might find it useful.
What’s inside:
- ~25M bike trips
- ~38M station status snapshots
- ~13k stations
- 267 cities across Europe
- bike type information (e-bike vs classic)
- geographic coordinates (WGS-84)
- timestamps in UTC Unix seconds
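Since the timestamps are UTC Unix seconds, they need converting before any time-of-day analysis. A minimal Python sketch, assuming hypothetical field names `start_time` and `end_time` (check the dataset card for the real column names):

```python
from datetime import datetime, timezone


def to_utc(ts: int) -> datetime:
    """Convert Unix seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Hypothetical trip record; field names and values are invented.
trip = {"start_time": 1700000000, "end_time": 1700000900}

start = to_utc(trip["start_time"])
duration_min = (trip["end_time"] - trip["start_time"]) / 60
hour_of_day_utc = start.hour
```

For demand analysis per city, the UTC hour usually has to be shifted to local time, since the 267 cities span several time zones.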
The dataset combines trip-level data and high-frequency station snapshots, so it’s useful for things like:
- demand prediction
- fleet balancing / rebalancing research
- urban mobility analysis
- sustainability studies
- infrastructure planning
We originally compiled the dataset for a research paper:
“Data-Driven Insights into (E-)Bike-Sharing: Mining a Large-Scale Dataset on Usage and Urban Characteristics – Descriptive Analysis and Performance Modeling” (Waldner et al., 2025, Transportation).
License: CC BY-NC 4.0
Link to dataset: https://huggingface.co/datasets/PellelNitram/european-bike-sharing-dataset
Happy to answer questions! 🙂
submitted by /u/martin_lellep