Hi! I am starting my Master’s thesis in Business Intelligence and I am looking for large datasets to perform either annual budget forecasting or churn prevention. Thanks!
submitted by /u/Equivalent_Ad_1566
Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?
We are currently sourcing large-scale programming code datasets to support enterprise clients developing AI and large language models (LLMs).
We are looking for high-quality datasets containing raw source code or structured code repositories across multiple programming languages.
Examples of relevant datasets include:
• Raw source code collections
• Curated open-source repositories
• Code with documentation or comments
• Code paired with explanations or Q&A
• Version-controlled project snapshots
Preferred characteristics
• Multi-language coverage (e.g. Python, JavaScript, Java, Solidity, C++, Go, Rust)
• Large-scale datasets suitable for AI/LLM training
• Clear licensing and commercial usage rights
• Structured formats such as JSON, CSV, Parquet, or repository archives
If you are a data provider, research group, or organisation holding code datasets, we would be interested in discussing potential collaboration and licensing terms.
Please reach out
submitted by /u/Winter-Lake-589
I am looking for a dataset that shows Medicaid population growth by ZIP code in the State of Missouri.
submitted by /u/Vlosuriello
Hello!
I was wondering if there are any big Twitter datasets? I was thinking of something like the big dataset that exists for Reddit (I don’t remember the name, but it’s pretty well known, I think), but for tweets instead?
submitted by /u/AffectWizard0909
I compiled 200k+ human-written code reviews from top OSS projects including React, Tensorflow, VSCode, and more.
This dataset helped me finetune a version of Qwen2.5-Coder-32B-Instruct specialized in code reviews.
The finetuned model showed significant improvements in generating code fixes and review comments, achieving 4x higher BLEU-4, ROUGE-L, and SBERT scores compared to the base model.
Feel free to integrate this dataset into your LLM training and see improvements in coding skills!
submitted by /u/Ok_Employee_6418
Model architectures keep improving, but a lot of teams I talk to struggle more with training data than models.
Things like:
Do people here feel the same, or is data not the biggest bottleneck in your experience?
submitted by /u/JayPatel24_
Hi guys,
I’m building a real-time aviation monitoring dashboard in Python, and right now I’m using the OpenSky API to get live aircraft positions.
The issue is that OpenSky only provides aircraft state data (lat, lon, altitude, callsign, etc.); it doesn’t include the flight’s origin and destination airports.
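For reference, each entry in the OpenSky /states/all response is a positional array rather than named fields; a minimal sketch of reshaping those arrays into dicts, using only the standard library. The field order follows my reading of the OpenSky REST docs, so double-check it against the current API reference before relying on it.

```python
import json
from urllib.request import urlopen

# Field order for the positional state arrays, per the OpenSky
# /states/all documentation (first 12 fields only; trailing fields
# such as sensors and geo_altitude are ignored here).
STATE_FIELDS = [
    "icao24", "callsign", "origin_country", "time_position",
    "last_contact", "longitude", "latitude", "baro_altitude",
    "on_ground", "velocity", "true_track", "vertical_rate",
]

def parse_state(raw):
    # zip() stops at the shorter sequence, so any extra trailing
    # fields in the raw array are simply dropped
    return dict(zip(STATE_FIELDS, raw))

def fetch_states():
    # Anonymous access works but is heavily rate-limited
    with urlopen("https://opensky-network.org/api/states/all", timeout=30) as r:
        payload = json.load(r)
    return [parse_state(s) for s in payload.get("states") or []]
```

This still won’t give you origin/destination, of course, but it makes joining against a route-lookup source (keyed on callsign or icao24) much easier.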
I’m looking for a free api that provides:
• real-time flight positions
• origin airport
• destination airport
• preferably no strict monthly request limits (or at least generous ones)
I’ve looked at a few options like aviation and airlabs, but their free tiers are very limited in the number of requests.
Does anyone know of:
Thanks!
submitted by /u/Appropriate-Tip935
Hi!! I have an assignment on MLR and I need a dataset to work with, but I want something kind of unique, and I’m panicking because the deadline is approaching.
submitted by /u/Big-Pirate-1184
Hi everyone,
I’m a computer science student at EPFL (Switzerland), and I’m currently working on a side project: an automated database analyzer that detects toxic/expensive SQL queries and uses AI to actively rewrite them into optimized code.
I’ve built the local MVP in Python, but testing it against my own “fake” mock data isn’t enough anymore. I need real-world chaos.
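For context, here is a toy sketch of the kind of triage such an engine might start from (not the actual implementation): ranking a pg_stat_statements CSV export by total time consumed. The column names assume pg_stat_statements on PostgreSQL 13+, where the timing column is mean_exec_time; older versions call it mean_time.

```python
import csv
from io import StringIO

def top_offenders(csv_text, limit=5):
    """Rank queries by total time consumed (mean_exec_time * calls)."""
    rows = list(csv.DictReader(StringIO(csv_text)))
    for r in rows:
        r["total_ms"] = float(r["mean_exec_time"]) * float(r["calls"])
    return sorted(rows, key=lambda r: r["total_ms"], reverse=True)[:limit]

# Illustrative export: a slow query run rarely vs. a fast query run often
sample = """query,calls,mean_exec_time
SELECT * FROM orders,1000,12.5
SELECT id FROM users WHERE id = $1,50000,0.1
"""
```

Total time (1000 × 12.5 ms = 12.5 s vs. 50000 × 0.1 ms = 5 s) surfaces the unindexed scan first, which per-call latency alone would miss.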
Would anyone be willing to share an anonymized export of their
pg_stat_statements (CSV) and the basic DDL Schema of their database?
In exchange, I will run your data through my engine and send you the generated “Optimization & Cost-Saving Audit” report for free. It might actually help you spot a bottleneck!
Let me know if you are open to helping a student out, send me a DM! Thanks!
submitted by /u/Foreign-Bison-7826
I’m working on a system that processes large medical record packets and generates a chronological timeline with evidence citations (think: turning hundreds or thousands of pages of medical records into a structured chronology).
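For context, the chronology step itself is simple once events are extracted; a minimal sketch below (the tuple shape is my own illustration — extraction from raw PDFs is the hard part and is omitted):

```python
from datetime import date

def build_timeline(events):
    """events: iterable of (event_date, description, source_page) tuples.

    Returns chronologically ordered lines with page-level citations.
    """
    ordered = sorted(events, key=lambda e: e[0])
    return [f"{d.isoformat()}: {desc} [p. {page}]" for d, desc, page in ordered]
```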
Right now I’m trying to find datasets that resemble real world medical record packets so I can test robustness. Most of the datasets I’ve found so far are either:
• purely structured EHR tables (diagnoses, labs, etc.)
• small sets of individual clinical notes
• synthetic datasets
What I’m ideally looking for:
• Long clinical documents (discharge summaries, physician notes, operative reports)
• Multi-document patient records
• Collections of clinical PDFs or reports
• Narrative-heavy hospital documentation
• Anything resembling actual chart records rather than isolated notes
Datasets I already know about:
• MIMIC-IV / MIMIC-IV-Note (waiting for credentials, anyone have a mirror?)
• i2b2 / n2c2 clinical NLP datasets (registration to download it is closed?)
• MTSamples medical transcription dataset
submitted by /u/deputy1389
Hi everyone! We just released a large European (e-)bike-sharing dataset and thought people here might find it useful.
What’s inside:
The dataset combines trip-level data and high-frequency station snapshots, so it’s useful for things like:
We originally compiled the dataset for a research paper:
“Data-Driven Insights into (E-)Bike-Sharing: Mining a Large-Scale Dataset on Usage and Urban Characteristics – Descriptive Analysis and Performance Modeling” (Waldner et al., 2025, Transportation).
License: CC BY-NC 4.0
Link to dataset: https://huggingface.co/datasets/PellelNitram/european-bike-sharing-dataset
Happy to answer questions! 🙂
submitted by /u/martin_lellep
Hi everyone!
I know that this is a bit of an ask but I’m currently helping organize a school competition for undergraduate accounting students, and we’re currently looking for an Excel-based case study that we could use for the event.
Ideally, it would include:
• A dataset in Excel that participants can use as raw data
• Questions or tasks requiring analysis or computations in Excel
• Topics related to accounting, finance, or business analysis
If possible, it would also help if there’s a sample expected output or reference solution to guide the evaluation.
This is a student-led initiative, so unfortunately we’re unable to provide any compensation, but if anyone has existing Excel case studies, teaching materials, datasets with questions, or knows where we could find something like this, I’d really appreciate the help. We would be very grateful for any materials, resources, or guidance you could share.
Hoping for your kind consideration and thank you so much!
submitted by /u/Noctis-Aeternae
Hi everyone,
I’m sharing a large-scale metadata archive we’ve built at QTE Technologies. It contains over 1,000,000 records of industrial products (MRO) and scientific instruments.
We believe this is a valuable resource for training industrial LLMs and supply chain research.
Access the data here:
License: CC BY 4.0. Looking forward to seeing how the community uses this!
submitted by /u/Heavy_Guitar_7428
Hello everyone, I need a dataset of HTML code in instruction-response format. Can anyone give me some tips?
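For what it’s worth, instruction-response data is commonly stored as JSONL (one JSON object per line); a sketch of what HTML records might look like — the "instruction"/"response" keys are a common convention here, not any specific dataset’s schema:

```python
import json

# Hypothetical record shape for HTML-generation training data
records = [
    {
        "instruction": "Create an HTML button labelled 'Submit' with class 'btn'.",
        "response": '<button class="btn">Submit</button>',
    },
    {
        "instruction": "Write an HTML unordered list with items 'a' and 'b'.",
        "response": "<ul><li>a</li><li>b</li></ul>",
    },
]

# Serialize to JSONL: one record per line
jsonl = "\n".join(json.dumps(r) for r in records)
```

If nothing ready-made turns up, scraping permissively licensed HTML snippets and pairing them with generated instructions in this shape is a common bootstrap.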
submitted by /u/pedrodev2026
I am currently working on an object detection model that detects food ingredients in a refrigerator. However, I can’t seem to find a complete dataset that includes vegetables, meat, fruits, etc. The closest results I could find were the Recipe Ingredients Image Dataset and the Fruits-360 dataset, but neither of them includes meat. Any help is greatly appreciated.
submitted by /u/SortDull
I’m looking for a list of countries by estimated sexual assault rates. Not reported rates, since that’s pretty irrelevant, but estimated rates. Necessarily this will need to have been done by social scientists who impose a normative definition of “sexual assault”.
Thanks.
submitted by /u/___xXx__xXx__xXx__
Most PHI datasets evaluate masking on static single-modality documents. This one is different.
It captures per-event masking decisions across a simulated longitudinal stream: the same subject appears across clinical notes, ASR transcripts, imaging proxies, waveform data, and audio metadata over time. The idea is to evaluate how re-identification risk accumulates across events rather than within a single record.
Five policies are included for comparison: raw, weak, pseudo, redact, and adaptive. The adaptive controller is the interesting one: it escalates masking strength only when cumulative exposure actually justifies it.
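As a rough illustration (my own toy sketch, not the benchmark’s actual controller), an adaptive policy can be modeled as a per-subject exposure budget where the masking level escalates as cumulative risk crosses multiples of the budget:

```python
class AdaptiveMasker:
    """Toy adaptive masking policy: escalate per-subject as exposure grows."""

    LEVELS = ["weak", "pseudo", "redact"]  # ordered weakest to strongest

    def __init__(self, budget=3.0):
        self.budget = budget
        self.exposure = {}  # subject id -> accumulated risk score

    def policy_for(self, subject, event_risk):
        # Accumulate this event's risk into the subject's running total
        total = self.exposure.get(subject, 0.0) + event_risk
        self.exposure[subject] = total
        # Escalate one level per full budget consumed, capped at the top
        idx = min(int(total // self.budget), len(self.LEVELS) - 1)
        return self.LEVELS[idx]
```

A subject with low accumulated exposure keeps cheap, readable masking; only repeat appearances trigger redaction.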
Dataset is fully open, no DUA required. Everything runs on synthetic data, no real patient records anywhere.
Hugging Face: https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark
Code to regenerate: https://github.com/azithteja91/phi-exposure-guard
Happy to answer questions on the schema or the benchmark design.
submitted by /u/Visual_Music_4833
Hey everyone, I’ve been working on an ESG Data API and just launched it publicly.
It covers 500+ publicly traded companies across the US, Europe, and Asia-Pacific and includes:
Built it because ESG data is either locked behind expensive Bloomberg/Refinitiv terminals or scattered across inconsistent PDF reports. Wanted to make it accessible for developers, researchers, and fintech builders.
Free tier available. Would love feedback from anyone building in the sustainability or finance space.
Disclaimer: I built this and am the developer behind it. Sharing here because I think it’s useful for the community — happy to answer any questions.
submitted by /u/Choice_Classroom_703
I am doing a research project on the influence of digital financial resources on the financial understanding of young adults aged 18-24, but my data is too male-dominated. Please help me diversify the data with female and other respondents.
This is for academic purposes and will only take 1 or 2 minutes to fill out.
submitted by /u/Moonandtheearth8
Hi everyone,
I’m looking for a large e-commerce dataset (at least ~5GB) for a personal data engineering project. Ideally I’m hoping to find something with raw CSV files rather than already processed datasets.
The dataset could include things like:
I’m mainly trying to simulate a realistic transactional dataset for building a small data warehouse and running analytics queries.
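If no ready-made dump of that size turns up, one fallback is generating the raw CSVs yourself; a minimal sketch of a synthetic orders file below — the schema, value ranges, and row count are illustrative only:

```python
import csv
import random

def write_orders_csv(path, n_rows=1_000, seed=42):
    """Write a synthetic raw orders CSV; columns are illustrative only."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["order_id", "customer_id", "product_id",
                    "quantity", "unit_price"])
        for order_id in range(1, n_rows + 1):
            w.writerow([
                order_id,
                rng.randint(1, 5_000),            # customer key
                rng.randint(1, 800),              # product key
                rng.randint(1, 5),                # line quantity
                round(rng.uniform(0.5, 199.99), 2),
            ])
```

Scaling n_rows into the tens of millions (plus matching customers/products files for joins) gets you past 5GB of raw CSV with realistic key distributions to exercise warehouse loads.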
Requirements:
If you know any Kaggle datasets, public data dumps, GitHub repos, or open data sources that match this, please share.
Thanks!
submitted by /u/Historical-Web3638
I need a favour from this group.
I’m deep in research on how AI teams actually source and license training data (text, audio, video, synthetic). Not the theory, but the real, messy, day-to-day process.
I’m NOT pitching or selling anything. I’m having short 15-minute conversations with people who work on this daily, and the insights have been genuinely eye-opening.
Happy to share what I’m learning in return.
If you know someone who fits any of these, I’d massively appreciate an intro or a tag in the comments.
Possible targets:
ML engineers or data leads at companies training or fine-tuning LLMs.
Anyone responsible for sourcing or procuring training data.
Teams building domain-specific AI models (healthcare, legal, finance, speech).
People working on multilingual model training.
submitted by /u/Winter-Lake-589