Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, i’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Aggregated Historical Flight Price Dataset

I am working on a personal project that requires aggregated flight prices based on origin-destination pairs. I am specifically interested in data that includes both the price fetch date (booking date) and the travel date. The price fetch date is particularly important for my analysis.

For reference, I’ve found an example dataset on Kaggle https://www.kaggle.com/datasets/yashdharme36/airfare-ml-predicting-flight-fares/data, but it only covers a three-month period. To effectively capture seasonality, I need at least two years’ worth of data.

The ideal features for the dataset would include:

  1. Origin airport
  2. Destination airport
  3. Travel date
  4. Booking date or price fetch date (or the number of days left until the travel date)
  5. Time slot (optional), such as morning, evening, or night
  6. Price

I am looking specifically for a dataset of Indian domestic flights, but I am finding it challenging to locate one. I plan to combine this flight data with holiday datasets and other relevant information to create a flight price prediction app.

I would appreciate any suggestions you may have, including potential global datasets. Additionally, I would like to know the typical costs associated with acquiring such datasets from data providers. Thank you!

submitted by /u/athuljyothis
[link] [comments]

Spotify 100,000 Podcasts Dataset Availability

https://podcastsdataset.byspotify.com/ https://aclanthology.org/2020.coling-main.519.pdf

Does anybody have access to this dataset which contains 60,000 hours of English audio?

The dataset was removed by Spotify. However, it was originally released under a Creative Commons Attribution 4.0 International License (CC BY 4.0) as stated in the paper. Afaik the license allows for sharing and redistribution – and it’s irrevocable! So if anyone grabbed a copy while it was up, it should still be fair game to share!

If you happen to have it, I’d really appreciate if you could send it my way. Thanks! 🙏🏽

submitted by /u/OogaBoogha
[link] [comments]

Rf-stego-dataset: Python Based Tool That Generates Synthetic RF IQ Recordings + Optional Steganographic Payloads Embedded Via LSB (repo Includes Sample Dataset)

rf-stego-dataset [tegridydev]

Python based tool that generates synthetic RF IQ recordings (.sigmf-data + .sigmf-meta) with optional steganographic payloads embedded via LSB.

It also produces spectrogram PNGs and a manifest (metadata.csv + metadata.jsonl.gz).

Key Features

  • Modulations: BPSK, QPSK, GFSK, 16-QAM (Gray), 8-PSK
  • Channel Impairments: AWGN, phase noise, IQ imbalance, Rician / Nakagami fading, frequency & phase offsets
  • Steganography: LSB embedding into the I‑component
  • Outputs: SigMF files, spectrogram images, CSV & gzipped JSONL manifests
  • Configurable: via config.yaml or interactive menu

Dataset Contents

Each clip folder contains: 1. clip_<idx>_<uuid>.sigmf-data 2. clip_<idx>_<uuid>.sigmf-meta 3. clip_<idx>_<uuid>.png (spectrogram)

The manifest lists: – Dataset name, sample rate – Modulation, impairment parameters, SNR, frequency offset – Stego method used – File name, generation time, clip duration

Use Cases

  • Machine Learning: train modulation classification or stego detection models
  • Signal Processing: benchmark algorithms under controlled impairments
  • Security Research: study steganography in RF domains

Quick Start

  1. Clone repo: git clone https://github.com/tegridydev/rf-stego-dataset.git
  2. Install dependencies: pip install -r requirements.txt
  3. Edit config.yaml or run: python rf-gen.py and choose Show config / Change param
  4. Generate data: select Generate all clips

~~Enjoy <3

submitted by /u/tegridyblues
[link] [comments]

Seeking ESG Controversy Scores (2021–2024) For S&P 500 Financial Sector Companies

Hi,
I’m doing an academic research project and urgently need ESG controversy scores (not general ESG ratings) for financial sector companies in the S&P 500 from 2021 to 2024 from any reliable source (MSCI, Refinitiv, Sustainalytics, etc.).

Ideally, I need scores that reflect the timing and severity of ESG controversies so I can conduct an event study on their stock price impact. My university (Tunis Business School) doesn’t provide access to these databases, and I’m a student working on a tight (read: nonexistent) budget.

Would appreciate any help, pointers, or sample datasets. Thank you!

submitted by /u/B3ss1
[link] [comments]

Seeking Ninja-Level Scraper For Massive Data Collection Project

I’m looking for someone with serious scraping experience for a large-scale data collection project. This isn’t your average “let me grab some product info from a website” gig – we’re talking industrial-strength, performance-optimized scraping that can handle millions of data points.

What I need:

  • Someone who’s battle-tested with high-volume scraping challenges
  • Experience with parallel processing and distributed systems
  • Creative problem-solver who can think outside the box when standard approaches hit limitations
  • Knowledge of handling rate limits, proxies, and optimization techniques
  • Someone who enjoys technical challenges and finding elegant solutions

I have the infrastructure to handle the actual scraping once the solution is built – I’m looking for someone to develop the approach and architecture. I’ll be running the actual operation, but need expertise on the technical solution design.

Compensation: Fair and competitive – depends on experience and the final scope we agree on. I value expertise and am willing to pay for it.

If you’re the type who gets excited about solving tough scraping problems at scale, DM me with some background on your experience with high-volume scraping projects and we can discuss details.

Thanks!

submitted by /u/polawiaczperel
[link] [comments]

Tired Of Robotic Chatbots? Train Them To Sound Human – Try My Dataset

Hi !

I’ve just uploaded a new dataset designed for NLP and chatbot applications:

Tone Adjustment Dataset

This dataset contains English sentences rewritten in three different tones:

  • Polite
  • Professional
  • Casual

Use Cases:

  • Training tone-aware LLMs and chatbot models
  • Fine-tuning transformers for style transfer tasks
  • Improving user experience by making bots sound more natural

    I’d love to hear your thoughts—feedback, ideas, or collaborations are welcome!

Cheers,
Gopi Krishnan

submitted by /u/ZenQuery
[link] [comments]

Looking For Poultry Export Data By Country

I’ve been searching for about 2 hours for specific data regarding poultry exports from the US to either Europe in general or Germany specifically. I am looking for the years 1960-1970, more specifically 1962, 63, and 64 which seem to be unfindable. I’ve found this for 1961 on AgEcon but I can’t find past that. I also have found it for 1967 and onwards but again have the gap in the years I specifically need. I am able to find this for poultry broiler/young chicken exports in pounds, which is helpful, but not in the dollar amount that I need. Any ideas where to look further?

submitted by /u/attagirly
[link] [comments]

Help!! NYC Local News Headlines — 2021 – 2024

I am new to this. Extremely new to this. I’m working on a university capstone project that requires coding news headlines to compare trends in content with some other thing that’s unimportant right now.

I’ve been trying to figure out a way to scrape headlines from local news outlets (ABC 7, FOX 5, NY Post, etc— I’m not picky lol) from 2021 to 2024 (or any year within those, I’m more than happy to reduce the scope). I had some luck with scraping a month’s worth of daily headlines in 2024 of ABC 7 using Internet Archive, but it didn’t translate over well to NBC 4 or CBS 2. And IA can be finicky with taking lots of data.

Basically I’m trying to find major headlines from local news outlets daily, at about 9 AM EST, from 2021 – 2024. I’m okay with getting creative. Any suggestions or ideas??

eta: i do know the NYT API

submitted by /u/dearwikipedia
[link] [comments]

Looking For PRAMS Phase 8 Core Dataset

Hi everyone,
I’m a Ph.D. student currently working on a funded project with my advisor using PRAMS data.

I applied through the PRAMS website, and after getting approved, I was only able to download the Phase 8 dataset without the core file. Unfortunately, my account was later blocked for some reason.

Since then, I’ve been in contact with the PRAMS data manager, but it’s already been over three months without resolution. I completely understand that they may be dealing with internal issues and it’s not necessarily their fault.

That said, the deadline for our project’s progress report is fast approaching, and I can no longer afford to just wait for a response.

If anyone has previously downloaded the Phase 8 data with the core file, or knows of any way to access it, I’d deeply appreciate it if you could share or point me in the right direction.

Thank you so much in advance and I really hope everything gets back to normal soon.

submitted by /u/DoyouknowyouDO
[link] [comments]

A Dataset Of Annotated CC0 Images, What To Do With It?

years ago (before the current generative AI wave) I’d seen this person start a website for crowdsourced image annotations, I thought that was a great idea so I tried to support by becoming a user, when I had spare moments I’d go annotate. Killed a lot of time doing that during pandemic lockdowns etc. There around 300,000 polygonal outlines here accumulated over many years. to view them you must search for specific labels ; there’s a few hundred listed in the system and a backlog of new label requests hidden from public view. there is an export feature

https://imagemonkey.io

example .. roads/pavements in street scenes (“rework” mode will show you outlines, you can also go to “dataset->explore” to browse or export)

https://imagemonkey.io/annotate?mode=browse&view=unified&query=road%7Cpavement&search_option=rework

it’s also possible to get the annotations out in batches via a python API

https://github.com/ImageMonkey/imagemonkey-libs/blob/master/python/snippets/export.py

i’m worried the owner might get disheartened from a sense of futility (so few contributors, and now there are really powerful foundation models available including image to text)

but I figure “every little helps”, it would be useful to get this data out into a format or location where it can feed back into training, maybe even if it’s obscure and not yet in training sets it could be used for benchmarking or testing other models

When the site was started the author imagined a tool for automatically fine-tuning some vision nets for specific labels, I’d wanted to broaden it to become more general. the label list did grow and there’s probably a couple of hundred more that would make sense to make ‘live’

There’s also an aspect that these generative AI models get accused of theft, so the more deliberate voluntary data there is out there the better. I’d guess that you could mix image annotations somehow into the pretraining data for multimodal models, right? I’m also aware that you can reduce the number of images needed to train image-generators if you have polygonal annotations aswell as image/descriptions-text pairs.

Just before the diffusion craze kicked off I’d had some attempts at trying to train small vision nets myself from scratch (rtx3080) but could only get so far. When stable diffusion came out I figured my own attemtps to train things were futile.

Here’s a thread where I documented my training attempt for the site owner

https://github.com/ImageMonkey/imagemonkey-core/issues/300 – in here you’ll see some visualisations of the annotations (the usual color coded overlays)

I think these labels today could be generalised by using an NLP model to turn the labels into vector embeddings (cluster similar labels or train image to embedding, etc)

The annotations would probably want to be converted to some better known format that could be loaded into other tools. they are available in his json format.

can anyone advise on how to get this effort fed back into some kind of visible community benefit?

submitted by /u/dobkeratops
[link] [comments]

Finally Releasing The Bambu Timelapse Dataset – Open Video Data For Print‑failure ML (sorry For The Delay!)

Hey everyone!

I know it’s been a long minute since my original call‑for‑clips – life got hectic and the project had to sit on the back burner a bit longer than I’d hoped. 😅 Thanks for bearing with me!

What’s new?

  • The dataset is live on Hugging Face and ready for download or contribution.
  • First models are on the way (starting with build‑plate identification) – but I can’t promise an exact release timeline yet. Life still throws curveballs!

🔗 Dataset page: https://huggingface.co/datasets/v2thegreat/bambu-timelapse-dataset

What’s inside?

  • 627 timelapse videos from P1/X1 printers
  • 81 full‑length camera recordings straight off the printer cam
  • Thumbnails + CSV metadata for quick indexing
  • CC‑BY‑4.0 license – free for hobby, research, and even commercial use with proper attribution

Why bother?

  • It’s the first fully open corpus of Bambu timelapses; most prior failure‑detection work never shares raw data.
  • Bambu Lab printers are everywhere, so the footage mirrors real‑world conditions.
  • Great sandbox for manufacturing / QA projects—failure classification, anomaly detection, build‑plate detection, and more.

Contribute your clips

  1. Open a Pull Request on the repo (originals/timelapses/<your_id>/).
  2. If PRs aren’t your jam, DM me and we’ll arrange a transfer link.
  3. Please crop or blur anything private; aim for bed‑only views.

Skill level

If you know some Python and basic ML, this is a perfect intermediate project to dive into computer vision. Total beginners can still poke around with the sample code, but training solid models will take a bit of experience.

Thanks again for everyone’s patience and for the clips already shared—can’t wait to see what the community builds with this!

submitted by /u/v2thegreat
[link] [comments]

Any Public Datasets That Focus On Nutrition Content Of Eggs Based On Chicken Feed? Maybe More Specifically, Transfer Rate Of Certain Nutrients From Chicken Feed Into The Egg?

Was looking for datasets with nutrition content in mind and perhaps feed efficiency rate but now I realized I’m struggling to find any dataset related to egg size, shell hardness, and contents. I’m checking FSIS and USDA but most studies are focused around incidences of contamination and the like rather than product quality, perhaps due to only having “standards,” but that means they should have the data somewhere and I just can’t find it, right…? Please help 🙏

submitted by /u/Masuikai
[link] [comments]

Built 300M LinkedIn Leads Database Using Automation + AI

Been messing with automation + AI for over a year along side with my team and ended up building a system that scraped 300 million+ leads from LinkedIn. Used a mix of:

  • Multiple Sales Nav accounts
  • Rotating proxies & custom scripts
  • Headless browsers & queue-based servers
  • ChatGPT for data cleaning & enrichment

Honestly, the setup was painful at times (LinkedIn doesn’t play nice), but the results were wild. If you’re into large-scale scraping, lead gen, or just curious how this stuff works under the hood, happy to chat.

I packaged everything into a cleaned database way cheaper than ZoomInfo/Apollo if anyone ever needs it. It’s up at Leadady .com, one-time payment, no fluff.

submitted by /u/Dreamer_made
[link] [comments]

Dataset Release: Generated Empathetic Dialogues For Addiction Recovery Support (Synthetic, JSONL, MIT)

Hi r/datasets,

I’m excited to share a new dataset I’ve created and uploaded to the Hugging Face Hub: Generated-Recovery-Support-Dialogues.

https://huggingface.co/datasets/filippo19741974/Generated-Recovery-Support-Dialogues

About the Dataset:

This dataset contains ~1100 synthetic conversational examples in English between a user discussing addiction recovery and an AI assistant. The AI responses were generated following guidelines to be empathetic, supportive, non-judgmental, and aligned with principles from therapeutic approaches like Motivational Interviewing (MI), ACT, RPT, and the Transtheoretical Model (TTM).

The data is structured into 11 files, each focusing on a specific theme or stage of recovery (e.g., Ambivalence, Managing Negative Thoughts, Relapse Prevention, TTM Stages – Precontemplation to Maintenance).

Format:

JSONL (one JSON object per line)

Each line follows the structure: {“messages”: [{“role”: “system/user/assistant”, “content”: “…”}]}

Size: Approximately 1100 examples total.

License: MIT

Intended Use:

This dataset is intended for researchers and developers working on:

Fine-tuning conversational AI models for empathetic and supportive interactions.

NLP research in mental health support contexts (specifically addiction recovery).

Dialogue modeling for sensitive topics.

Important Disclaimer:

Please be aware that this dataset is entirely synthetic. It was generated based on prompts and guidelines, not real user interactions. It should NOT be used for actual diagnosis, treatment, or as a replacement for professional medical or psychological advice. Ethical considerations are paramount when working with data related to sensitive topics like addiction recovery.

I hope this dataset proves useful for the community. Feedback and questions are welcome!

submitted by /u/Same_Error_8868
[link] [comments]

Customer Service Audio Recordings Dataset

Hi everybody!

I am currently building a model that analyze the customer service calls and evaluate the agents for my college class. I wonder what is the most well-known, free, recommended datasets to use for this? I am currently looking for test data for model evaluations.

We are very new with the model training and testing so please drop your recommendations below..

Thank you so much.

submitted by /u/TeddyBearFet1sh
[link] [comments]

Looking For Sources To Find Raw And Unprocessed Datasets

Hi, for a course I am required to find and pick a raw and unprocessed dataset with a minimum of 1 million records, another constraint that I have is that this data needs to be tabular. Additionally, The data set should not be an already fully processed data product. Good examples of raw and unprocessed data are JSON/XML files from the web. These records can’t immediately be put into a structured table without processing.

The goal for me is to turn the unprocessed source into a data product, and example that was given: Preparing Wikipedia data dumps so that they can be used for graph query processing.

So far I have been browsing the following two resources:

I am looking for additional sources for potential datasets, and tips or hints are welcome!

submitted by /u/rubberysubby
[link] [comments]